A webscraper for everyone

For some of us, web scraping (or “webscraping”) means pulling up a terminal window and wrangling Python scripts for some heavy data extraction, possibly using Scrapy or BeautifulSoup. Yes, it’s true that web scraping is an advanced-user art form.

For the rest of us, there is still hope.

Cue the Space Odyssey intro… let the military drumming drills come alive, because we’re about to have more cowbell!! You can sort of tell I’m a bit excited, right?

I found a no-code solution that you are going to absolutely love. It lets you create webscrapers and web spiders for most modern websites and export the data in CSV format (i.e. a file you can open in Excel).
Not only is it a no-code, point-and-click solution, it also installs as a browser extension.
If you can’t wait to see what it is, then scroll down to the bottom of the page.
If you are an amazing person, keep reading.

So here are the main benefits:

  • It’s free
  • It’s available as a Chrome browser extension
  • As mentioned, it’s no-code: all functionality is accessed through its interface within the browser
  • It allows collaboration via import/export features

My first impressions

It was a no-brainer to install and get started with the interface. Adding it as an extension takes the click of a button. The fact that it uses the browser and everything is ready to go once it’s added was a pleasant surprise.

One-click install from the Chrome Web Store

It uses the web developer tools interface that is built into the Chrome browser, which you can open either with a shortcut (F12 on Windows) or by right-clicking and choosing ‘Inspect’.

The videos are available from the website under “Learn”

I was a little overconfident coming in and should have watched the tutorial videos before trying to use the interface, because I was flying blind. Once I’d watched the three introduction videos, I was able to hit the ground running and create a webscraper.

There are a few web development concepts and some conventions specific to the tool that you need to understand before you can accurately predict what the webscraper will do under certain conditions, but it’s fairly straightforward.

Building a spider/scraper

The webscraper tool uses a hierarchical model that lets you build chains of links for it to follow, so you could potentially spider a whole site. It uses that same hierarchical model to dictate which elements to scrape and which child elements to extract. For this reason, at the top level it uses the terminology of ‘sitemaps’: you create sitemaps that describe your data, and then you scrape these sitemaps to extract your data in that same structure.

So understanding the hierarchy concept, knowing which elements need to be parent elements, and knowing how to insert child elements became the most important part of building the webscraper.
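If it helps to see the parent/child idea written down, here is a minimal Python sketch of it. To be clear, this is my own illustration and not the tool's actual file format: the `sitemap` dictionary, its field names (`start_url`, `selectors`, `parent`), the `_root` marker and the example URL are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical sitemap (field names are assumptions, not the tool's real format).
# Every selector names its parent; "_root" stands for the start page.
sitemap = {
    "start_url": "https://example.com/products",
    "selectors": [
        {"id": "product", "type": "element", "parent": "_root"},
        {"id": "name", "type": "text", "parent": "product"},
        {"id": "price", "type": "text", "parent": "product"},
        {"id": "details_link", "type": "link", "parent": "product"},
        {"id": "description", "type": "text", "parent": "details_link"},
    ],
}


def build_tree(selectors):
    """Group selector ids under the id of their parent."""
    children = defaultdict(list)
    for selector in selectors:
        children[selector["parent"]].append(selector["id"])
    return children


def outline(children, node="_root", depth=0):
    """Return the hierarchy as indented lines, parents before children."""
    lines = ["  " * depth + node]
    for child in children.get(node, []):
        lines.extend(outline(children, child, depth + 1))
    return lines


tree = build_tree(sitemap["selectors"])
print("\n".join(outline(tree)))
```

The printed outline shows `name`, `price` and `details_link` nested under `product`, and `description` nested under the link, which is roughly how I think about a scraper that follows a link to an inner page.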

The other main concept to grasp is selecting the parts of the page that you want to extract. This tool makes it point-and-click easy, but it also gives you finer control by offering keyboard controls. You can choose an element and then move up and down through its parent/child elements using the keyboard.

It has a nifty little ‘multiple items’ feature that allows you to choose one element on the page and the webscraper will ‘loop’ through all these elements when it extracts your data.
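For the coders reading along, the ‘multiple items’ idea is roughly the difference between grabbing the first match and looping over every match. Here is a small sketch using only Python's standard library. The page fragment, the `title` class and the `select` helper are all made up for illustration, and this is not how the extension itself is implemented.

```python
from html.parser import HTMLParser

# Made-up page fragment: three elements share the class "title".
PAGE = """
<ul>
  <li class="title">First post</li>
  <li class="title">Second post</li>
  <li class="title">Third post</li>
</ul>
"""


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given class.

    Simplification: assumes matched elements contain plain text only,
    with no nested tags inside them.
    """

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.inside = False
        self.matches = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.inside = True
            self.matches.append("")

    def handle_endtag(self, tag):
        self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.matches[-1] += data


def select(html, class_name, multiple):
    """Return every matching text when multiple is True, else only the first."""
    collector = ClassTextCollector(class_name)
    collector.feed(html)
    collector.close()
    texts = [text.strip() for text in collector.matches]
    return texts if multiple else texts[:1]


print(select(PAGE, "title", multiple=False))
print(select(PAGE, "title", multiple=True))
```

With `multiple=False` you get a single title back; with `multiple=True` you get one entry per matching element, which is the ‘loop’ behaviour described above.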

My first webscraper targeted a single page that contained all the elements I wanted to extract, so I thought it would be a relatively straightforward example, without the need to follow links to inner pages.

It exports to CSV
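Excel is not the only way to open the export. If you do want to touch a little code afterwards, a CSV file is easy to read with Python's built-in `csv` module. This is only a sketch: the `EXPORTED` text stands in for a real exported file, and the `name` and `price` columns are invented for the example, since your columns will depend on the selectors you set up.

```python
import csv
import io

# Stand-in for an exported file; the column names are made up for illustration.
EXPORTED = "name,price\nBlue mug,12.50\nRed mug,9.00\n"


def load_rows(csv_text):
    """Parse CSV text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


rows = load_rows(EXPORTED)
total = sum(float(row["price"]) for row in rows)

for row in rows:
    print(row["name"], row["price"])
print("Total:", total)
```

For a real export you would pass an opened file to `csv.DictReader` instead, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever you saved the download.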

The webscraper tool

Here it is! Go to http://www.webscraper.io


The official documentation

The documentation is very straightforward and covers what you need to know, from installing the extension to creating sitemaps and selecting content for extraction.

You can find the official documentation here: https://www.webscraper.io/documentation