Scraper Project Template

The Scraper project template is used to crawl web pages containing information, in structured or unstructured form, that you want to extract. To locate and extract the data on the crawled HTML pages, the Scraper uses regular expression patterns provided by the user. The final result of the Scraper is the extracted data, saved in a record set with up to 10 data fields.

These records can either be exported from the Content Viewer in the Management Console or retrieved with a REST call to the Webservice API.
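
For illustration, each record in that set is essentially a flat mapping of named data fields to extracted values. A minimal sketch in Python (the field names here are invented; your Scraper defines its own through its extraction expressions):

    # One scraped record: a flat set of named data fields (at most 10).
    # These field names are invented for illustration only.
    record = {
        "url": "https://www.example.gov/members",
        "name": "Jane Doe",
        "state": "CA",
        "party": "Independent",
    }

    # The Scraper's result is a record set, i.e. a list of such records.
    record_set = [record]
    for r in record_set:
        print(r["name"], r["state"])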

Step 1 of 4

Create a New Project from Template Scraper

You can build several Scrapers that target different web pages, or mix a Scraper with other project types. The number of projects you can create, whatever project types you choose, is limited by the number of projects available in your account model.

Right-click the root node in the project tree and choose new project, then enter a name for your new project.

Project Dialog

In the project dialog that now opens, you can select the project type from the available templates. If templates change or new templates become available, you will be informed by email.

After you have entered a description and hit save, you will see your Scraper in the project tree. It has four nodes: URLs, Scraping expressions, Job (Run scraper), and Scraped data.

Step 2 of 4

Enter URLs for the Scraping Targets

To add the URL of a web page from which you want to scrape data, click the new record button and enter the URL. In our example, we chose the official list of representatives in the US Senate.

When you are done, hit save record.

Create Multiple URLs

You can extract data from more than one target with the same Scraper. Just enter multiple URLs in the URLs editor.

Specifying a link depth of 0 tells the crawler not to follow any links; in our example this is sufficient because the representatives are all listed on one page.
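
To make the link depth setting concrete: conceptually, the crawler starts at each URL and follows links only up to the given depth. A minimal sketch of that idea in Python (this is not the product's actual crawler; it assumes the third-party requests and beautifulsoup4 packages are installed):

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def crawl(start_url, max_depth):
        """Visit start_url and follow links up to max_depth hops.

        With max_depth=0 only the start page itself is fetched,
        which is all we need when every record sits on one page.
        """
        seen = set()
        frontier = [(start_url, 0)]
        pages = []
        while frontier:
            url, depth = frontier.pop()
            if url in seen:
                continue
            seen.add(url)
            html = requests.get(url, timeout=10).text
            pages.append((url, html))
            if depth < max_depth:
                soup = BeautifulSoup(html, "html.parser")
                for a in soup.find_all("a", href=True):
                    frontier.append((urljoin(url, a["href"]), depth + 1))
        return pages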

Now close the editor.

Step 3 of 4

Provide Expressions to Extract the Data

Providing the regular expressions for data extraction takes a bit of training and requires some familiarity with regular expressions.
If you need an introduction, plenty of tutorials and examples are available on the Internet; regular expressions are a technology that has been widely adopted for text analysis and manipulation.

Here we provided an expression that extracts six columns: URL, name, district, state, party, and committee.
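
As a concrete illustration of the technique, here is a Python sketch of such an expression: one capture group per output column, applied to the page's HTML. Both the HTML snippet and the pattern are invented for this example; a real expression has to match the markup of the actual target page.

    import re

    # Invented markup standing in for one table row of the target page.
    html = """
    <tr>
      <td><a href="https://example.gov/doe">Jane Doe</a></td>
      <td>4th District</td><td>CA</td>
      <td>Independent</td><td>Appropriations</td>
    </tr>
    """

    # One named capture group per column: url, name, district,
    # state, party, and committee.
    pattern = re.compile(
        r'<td><a href="(?P<url>[^"]+)">(?P<name>[^<]+)</a></td>\s*'
        r'<td>(?P<district>[^<]+)</td><td>(?P<state>[^<]+)</td>\s*'
        r'<td>(?P<party>[^<]+)</td><td>(?P<committee>[^<]+)</td>'
    )

    for match in pattern.finditer(html):
        print(match.groupdict())
        # -> {'url': 'https://example.gov/doe', 'name': 'Jane Doe', ...}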

When you are done, hit save record.

Step 4 of 4

Start the Scraper

That is all it takes to customize your first Scraper.

Now right-click the job node (Run scraper) in the project tree and choose run job immediately.
The Scraper starts in the background.

Get the Extracted Data

The extracted data is ready for export.

Open the Scraped data viewer by clicking the Scraped data node in the project tree.
Select the records you want to export and click the export records button. The records will be downloaded in the format you specified.
Alternatively, you can retrieve the crawled content by querying the Query API.
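
As a sketch of that second route, assuming a JSON-capable HTTP endpoint: the URL, query parameters, and API key header below are placeholders, so consult the Webservice API documentation for the actual values.

    import requests

    # Placeholder endpoint and credentials; replace them with the
    # values documented for your account's Query API.
    response = requests.get(
        "https://api.example.com/query",
        params={"project": "my-scraper", "format": "json"},
        headers={"X-API-Key": "YOUR_API_KEY"},
        timeout=30,
    )
    response.raise_for_status()

    for record in response.json():
        print(record)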

Job Automation

Modify Job Settings

You can modify the job settings by opening the job editor:
Right-click the job node in the project tree and click edit job.

You can specify a different log level for closer control while you are defining your Scraper. (Be careful: verbose logging can become wordy and eat up your disk quota pretty fast.)

Or check the job series check box: a scheduler then appears that lets you define job execution times at different intervals.
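
For comparison, an interval-based schedule amounts to the following loop. This is only a sketch of the idea in Python; the job-run endpoint is hypothetical, and inside the product the built-in scheduler does all of this for you.

    import time
    import requests

    # Hypothetical endpoint standing in for "run this job"; the
    # built-in scheduler triggers the job internally instead.
    JOB_URL = "https://api.example.com/jobs/run"

    def run_at_interval(interval_seconds):
        """Trigger the job, then wait until the next execution time."""
        while True:
            requests.post(
                JOB_URL,
                headers={"X-API-Key": "YOUR_API_KEY"},
                timeout=30,
            )
            time.sleep(interval_seconds)

    # Example: run the Scraper once a day.
    # run_at_interval(24 * 60 * 60)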

When you are done, hit the save button.