How to find your perfect apartment using a web-scraper written in Python

Wessel Huising
8 min read · Feb 10, 2021
Examples of apartments in Amsterdam where I will never be able to live.

I have been living in Amsterdam for almost 10 years. I think Amsterdam is a great place and I am happy to be able to live here. What is less great about Amsterdam, however, is the current state of the housing market for starters (like me and my girlfriend).

It seems as though the number of reasonably priced, available apartments has shrunk harder than my bank balance after a Friday night in the ‘De Pijp’ area.

In short, my lovely girlfriend and I will be paying a huge amount of rent until the housing market in Amsterdam cools down and the majority of apartments are available for a fair price. And if I don’t do something to outsmart the competition, that means only one thing: I will never be able to save up enough money to buy my own apartment.

The solution I have developed, enabled by a so-called ‘web-scraper’, is a Python application deployed as a web service. It’s simple and elegant, and performs the following tasks: it scrapes multiple brokers’ websites every hour, searches for new apartments, and sends email notifications with the relevant apartments that were found.

My hypothesis: apartments offered by brokers who are active on platforms like funda are placed on the brokers’ own websites first and only added to the funda platform some time later.

Based on some research within my own bubble, mostly biased chats with friends and colleagues, I came to the hypothesis that newly available apartments are announced on the brokers’ own websites first, before being placed on the funda platform. With that in mind, I want my scraper service to notify me as soon as apartments become available, with a maximum delay of one hour. Using my own web-scraper, I will be one of the first to be informed about new apartments on the Amsterdam rental market.

I have used three important Python packages for building, messaging and deploying the scraper.

  • Scrapy (the famous Python scraping package)
  • Yagmail (lazy but dynamite solution to use your Gmail account as a “mailserver”)
  • Scrapyd (for deploying the scraper on an external server, unfortunately not part of this article)
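
If you want to follow along, the first two packages can be installed with pip in a recent Python 3 environment:

pip install scrapy yagmail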

This article will include a few ‘beginner’ steps into scraping the World Wide Web:

  1. Creating a Scrapy project
  2. Defining the ‘house’ Item
  3. Finding and inspecting the right webpage
  4. Coding the scraper logic
  5. Sending a mail using the Item Pipeline and Yagmail
  6. Crawling the page

This article will not cover deploying the scrapers using Scrapyd (if requested, I might write a follow-up article on the deploying and serving part).

Creating a Scrapy project

To start, let’s create a new project using the built-in Scrapy CLI (Command Line Interface) in your terminal. You can use the ‘startproject’ command, and you won’t need to do much to get something going. I named my project ‘housing’, which works better than ‘pleasefixtherentalmarketsituation’; ‘housing’ is also, coincidentally, the English translation of my last name. This command creates a new Scrapy project within your current working directory, including all the prefabricated files and directory structure; you couldn’t be more lazy.

scrapy startproject housing
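
The command generates a ready-made project skeleton, roughly like this (the exact files differ slightly between Scrapy versions):

housing/
    scrapy.cfg            # deploy configuration
    housing/              # the project’s Python module
        __init__.py
        items.py          # Item definitions (next step)
        middlewares.py
        pipelines.py      # Item Pipelines (used later for the mail notifications)
        settings.py
        spiders/          # the spiders will live here
            __init__.py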

Defining the ‘house’ Item

Next up is defining Items for the project. Creating an Item is comparable to defining the parts of the object I want to scrape. As I want to scrape houses from brokers’ websites, I will need a ‘house’ Item. Luckily, Scrapy already created a file called items.py with a minimal draft for a Scrapy Item called ‘HousingItem’. I am interested in a few characteristics of a new apartment: the address, the name of the broker, the price, the area, the URL (so I can have a look myself when I receive the mail notification) and the telephone number of the broker (so I can be the first caller as soon as I get the notification). Create the relevant fields for the Item so we can finally get to the sexy scraping part.

items.py
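
A minimal sketch of what items.py could look like, with one field per characteristic described above (the exact field names are my own choice):

import scrapy


class HousingItem(scrapy.Item):
    # One field per piece of information we want to scrape for each house.
    address = scrapy.Field()
    broker = scrapy.Field()
    price = scrapy.Field()
    area = scrapy.Field()
    url = scrapy.Field()
    telephone = scrapy.Field()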

Finding and inspecting the right webpage

First you need to know which page you want to scrape. This page should contain the relevant information, so in our case we need to use the page where the overpriced apartments are shown. Because I probably shouldn’t use a real broker as an example, I will create my own example page for the sake of this article.

Scrapy works as follows: it downloads the given page as HTML and searches for predefined parts of the page based on the search criteria we provide. For example, let’s say that the HTML of the page we want to scrape looks like this:

https://www.thehouseflippers.io/offers
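
The following is a simplified, made-up snippet for this fictional broker; the element structure and class names (‘house’, ‘address’, ‘price’, and so on) are my own choices, but they match the selectors used in the spider later on:

<html>
  <body>
    <h1>Our current offers</h1>
    <div class="house">
      <h2 class="address">Eerste Jacob van Campenstraat 10</h2>
      <span class="price">€ 1.750,00 per month</span>
      <span class="area">62 m²</span>
      <span class="telephone">020-1234567</span>
      <a href="/offers/eerste-jacob-van-campenstraat-10">View details</a>
    </div>
    <div class="house">
      <h2 class="address">Ceintuurbaan 250</h2>
      <span class="price">€ 2.100,00 per month</span>
      <span class="area">75 m²</span>
      <span class="telephone">020-1234567</span>
      <a href="/offers/ceintuurbaan-250">View details</a>
    </div>
  </body>
</html>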

When scraping the web using a scraper instance, the scraper will save the HTML code of the targeted webpage into a ‘response’ object (together with other metadata about the parsed webpage). This object is used to parse the relevant information.

Coding the scraper logic

In this example there are two houses for rent. When looking closely, both houses have the same DOM elements; only the content within the elements differs. This structure is going to help us scrape the page and extract the relevant parts of the two houses using so-called ‘selectors’. For this example I used CSS syntax to query the response (I advise you to get familiar with XPath as well). It’s time to code our scraper for this specific broker’s website. Scrapy uses the name ‘spider’ for its scraper objects, and I will use that term from now on. Of course, Scrapy has a built-in command to create new spiders from the command line.

scrapy genspider thehouseflippers thehouseflippers.nl
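
Running this command creates a new spider file in the spiders directory with a stub roughly like the following (the exact template differs a bit between Scrapy versions):

import scrapy


class ThehouseflippersSpider(scrapy.Spider):
    name = "thehouseflippers"
    allowed_domains = ["thehouseflippers.nl"]
    start_urls = ["http://thehouseflippers.nl/"]

    def parse(self, response):
        pass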

I am going to name the spider after the broker: ‘thehouseflippers’. The second argument, the domain, is set to ‘thehouseflippers.nl’. Eventually it should look like the code below; for the sake of instruction I will give the final code for the spider right away and explain its contents in the next few paragraphs.

thehouseflippers_spider.py
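
A sketch of what such a spider could look like, assuming the example page from earlier. The CSS classes, the example URL and the ‘last/thehouseflippers.txt’ bookkeeping file are assumptions made for this fictional broker, not a prescription:

import os

import scrapy

from housing.items import HousingItem


class ThehouseflippersSpider(scrapy.Spider):
    name = "thehouseflippers"

    def start_requests(self):
        # One listing page for now; more URLs can be added here later.
        urls = ["https://www.thehouseflippers.io/offers"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Read the address of the first house from the previous run,
        # so we are not mailed about houses we have already seen.
        try:
            with open("last/thehouseflippers.txt") as f:
                last_address = f.read().strip()
        except FileNotFoundError:
            last_address = ""

        first = True
        for house in response.css("div.house"):
            address = house.css(".address::text").get()

            if first:
                if address == last_address:
                    # The newest house is the same as last time:
                    # nothing new, so stop without yielding anything.
                    return
                # A new house appeared: remember its address for the next
                # run, and only write the file once per scrape session.
                os.makedirs("last", exist_ok=True)
                with open("last/thehouseflippers.txt", "w") as f:
                    f.write(address)
                first = False

            item = HousingItem()
            item["address"] = address
            item["broker"] = "The House Flippers"
            item["price"] = house.css(".price::text").get()
            item["area"] = house.css(".area::text").get()
            item["url"] = response.urljoin(house.css("a::attr(href)").get())
            item["telephone"] = house.css(".telephone::text").get()
            yield item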

To prevent errors, we need to import the HousingItem class from the items.py file (the Item we just made). We define the scraper as a subclass of the scrapy.Spider class (which makes life much easier). Within the subclass I defined the name of the scraper and set the desired URL in the ‘start_requests’ function. In this example we will be scraping one page, but feel free to add more URLs in the future. The real magic happens within the parse function.

Within the parse function the scraper looks for every ‘div’ element with the CSS class ‘house’ in the response object (the requested HTML code of the URL we put in the start_requests function). Within every such div, the parse function extracts more information for each house using CSS selectors. As I want to scrape relevant information such as the address, the price, et cetera, I need to specify the location of each piece of information within the div element of the selected house.

Because I don’t want to be spammed with messages about previously scraped apartments every time the scraper runs, I created a very simple (and probably embarrassing) solution: saving the address of the first house found in the last scrape session. The script opens a file named ‘thehouseflippers.txt’ within the ‘last’ directory and sets a Boolean variable to True. In the first iteration over the houses found on the page, it matches the contents of the file against the newly found address. If the address matches the contents of the file, no new apartment has been placed on the website. If the contents of the file do not match the address of the first house found during the scrape, a new house has been found and its address will replace the contents of the file. The Boolean is then set to False to prevent the scraper from overwriting the file with the addresses of the remaining houses.

The parse function ends by yielding the data in the form of a HousingItem for every scraped house. For the last part, we want to notify ourselves whenever a new apartment or house has been scraped from the URL. This is where Item Pipelines come in.

Sending a mail using the Item Pipeline and Yagmail

Item Pipelines are useful in combination with the Item class. For every scraped house, the scraper yields an Item of the HousingItem class. This Item can be processed automatically by an Item Pipeline. As I want to notify myself about every newly found house, I can use the processing step to clean up the fields of the scraped Item and send myself (and my girlfriend, and whoever else I pick) an email containing HTML, using the amazing package called Yagmail.

This particular Item Pipeline, called ‘HousingPipeline’, only processes the Item (there are more methods available, but for this project they have no practical use). The first processing step is performed on the price field of the ‘house’ Item: every character besides a digit is removed, and every digit after the fourth is treated as a decimal place. The same processing is applied to the area: every non-digit character (‘m²’, for instance) is removed. You can think of extra processing steps, like only sending a mail whenever the price is lower than your maximum budget.

The fun part starts with creating the contents of the notification mail, filled with the processed data from the Item. The content is built using Python; I used my mad HTML skills to create a beautiful layout for no particular reason. It uses all the fields of the HousingItem instance. Initiate the SMTP connection using your Gmail account (please create a keyring, otherwise no mail is going to be sent) and use the send method to send the mail. Don’t forget to add the HousingPipeline to your settings.py:

ITEM_PIPELINES = {
    'housing.pipelines.HousingPipeline': 300,
}
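
For completeness, a rough sketch of what such a pipeline could look like in pipelines.py. The HTML body is kept deliberately simple here, and the Gmail address and the yagmail keyring setup are placeholders you will need to fill in yourself:

import re

import yagmail


class HousingPipeline:
    def process_item(self, item, spider):
        # Keep only the digits of the price and treat everything after the
        # fourth digit as decimals (e.g. '€ 1.750,00 per month' -> 1750.00).
        digits = re.sub(r"\D", "", item["price"])
        if len(digits) > 4:
            item["price"] = float(digits[:4] + "." + digits[4:])
        else:
            item["price"] = float(digits)

        # Strip non-digit characters from the area ('62 m²' -> 62).
        item["area"] = int(re.sub(r"\D", "", item["area"]))

        # Build a simple HTML body from the Item fields.
        body = f"""
        <h2>New house found by {item['broker']}</h2>
        <p><b>Address:</b> {item['address']}</p>
        <p><b>Price:</b> € {item['price']}</p>
        <p><b>Area:</b> {item['area']} m²</p>
        <p><b>Telephone:</b> {item['telephone']}</p>
        <p><a href="{item['url']}">View the listing</a></p>
        """

        # Assumes the Gmail password was stored in the keyring beforehand,
        # e.g. via yagmail.register('your.address@gmail.com', 'password').
        yag = yagmail.SMTP("your.address@gmail.com")
        yag.send(
            to="your.address@gmail.com",
            subject=f"New apartment: {item['address']}",
            contents=body,
        )
        return item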

Crawling the page

Test your newly built spider using Scrapy’s command-line utility and see if it all works; it probably needs some debugging (of course, my horse). By executing the command ‘scrapy crawl’, followed by the name of the spider (you see the methodology), the spider will crawl the website and hopefully spit out some crawled houses. When the mails arrive with the content exactly the way you want it to be, you are ready for the next stage, called deploying the sexy bastard (unless you want to run the command manually every once in a while, which I can’t imagine you do).

scrapy crawl thehouseflippers

Deploying

Because it takes a lot of time to write such an article on Medium, I decided to postpone writing the ‘deploying the scraper’ part. I have some other great articles coming up which need a lot of work, so I will pause here. For questions or help getting started with web-scraping, feel free to contact me. Happy scraping, and good luck on your quest to find a place to live.

