Developing a Web Scraping Class in Ruby

15 Jan 2015

As part of my website www.propertytrackerni.co.uk I needed to put together a web scraper. Some of the information was available through the government web site, but not in an automated way, and it contained less detail than the information freely available on property sites on the web. When you write your own code you are also not restricted to the limits of a ready-made dataset. As I had chosen Ruby on Rails 4 as my platform, I was left with the following choices for my web scraping needs:

Nokogiri Gem
Mechanize Gem

Mechanize uses Nokogiri but works at a much higher level. For example, to tick a checkbox on a form you call the form's checkbox_with method with the checkbox's name. There are methods specifically for interacting with a form: filling in fields, pressing buttons or ticking checkboxes. If you have used Selenium for your automated tests, the syntax and the code you put together here will feel very familiar.

The next tool you need is

Chrome

If you are going to pull data off a web site you need to know how the HTML and CSS are structured on that site, which is where Chrome's element inspector comes in. There are at least two methods for pulling data off the web site: XPath expressions and standard CSS selectors. I found XPath best for pulling out the overall element you want, for example the search results, and CSS selectors best for pulling the individual values out of each of those elements. The xpath call gives you an array-like set of elements (a Nokogiri NodeSet) that you can iterate through.

For example, the XPath "//div[contains(@class,'#{parse_string}')]" matches every div whose class attribute contains the value of parse_string. In other words, if parse_string is search_results, you get the div elements of class search_results.

You can then iterate through these elements and parse each one. Rather than using XPath here, you can use CSS selectors directly. Say, for example, the h2 header contains the name you require; you can use the call below

.css("h2")[0].content

If the content you want is instead in an element with a particular class, use the .classname selector as shown below

.css(".beds")[0].content

This information can then be inserted into the database via ActiveRecord. The web scraping element of this project was a lot easier than I thought, with the code coming together in much the same way that an automated test suite would.

The code for this is available on

https://github.com/emomonkey/PropertyTrackerNI/blob/master/app/models/property_news_crawler.rb

and

https://github.com/emomonkey/PropertyTrackerNI/blob/master/app/models/concerns/crawler_module.rb
