URL Crawler Project Template



The URL Crawler project template is used to crawl a set of seed URLs, following all links on the retrieved web pages down to a predefined link depth. The final result of the URL Crawler comprises the HTML content of all crawled web pages, or the extracted text of PDF or Office documents where a crawled URL points to a non-HTML document.
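The crawl-to-depth idea can be sketched as a breadth-first traversal. This is only an illustration of the concept, not the product's actual implementation; the link graph and the helper `links_of` are made-up stand-ins for fetching a page and extracting its links:

```python
from collections import deque

def crawl(seed_urls, links_of, max_depth):
    """Breadth-first crawl: visit each seed, then follow links
    down to max_depth levels. links_of(url) stands in for
    fetching a page and extracting its links."""
    visited = set()
    queue = deque((url, 0) for url in seed_urls)
    while queue:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        if depth < max_depth:
            for link in links_of(url):
                if link not in visited:
                    queue.append((link, depth + 1))
    return visited

# A tiny in-memory "web" instead of real HTTP fetches.
site = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": [],
    "d": [],
}
print(sorted(crawl(["a"], lambda u: site.get(u, []), 1)))
# With link depth 1, "d" (two hops from the seed) is not crawled.
```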

The content of all crawled URLs can either be exported from the Content Viewer in the Management Console or retrieved via a REST call to the Webservice API.

Step 1 of 3

Create a New Project from Template URL Crawler

You can build different URL Crawlers by targeting them at different websites, or mix URL Crawlers with other project types. The number of projects and the project types you can choose are limited by the number of projects available in your account model.

Right-click the root node in the project tree and choose new project, then enter a name for your new project.

Project Dialog

The project dialog opens, and you can select the type of the project from the available templates. New templates are added from time to time; you will be informed by email when they become available. Now enter a description and hit save.

Your URL Crawler Is Ready!

In the project tree you can now see your new project with its three nodes: Seed URLs, Job (URL Crawler) and Crawled content.
Now open the Seed URLs editor by clicking on the Seed URLs node.

Step 2 of 3

Enter Seed URLs for Your Targets

In the Seed URLs editor, click the new record button and enter a seed URL for one of your targets. In our example we chose a sub-section of the U.S. FDA website providing information on dietary supplements. By specifying the regular expression (?i)^https://www.fda.gov/Food/DietarySupplements we make sure that only links pointing to pages in this sub-section are accepted.

When you are done, hit save record.
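You can check which links such a pattern would accept with any standard regex engine. Here is a quick check in Python; the example URLs are made up for illustration:

```python
import re

# The link filter from the example above; (?i) makes it case-insensitive.
pattern = r"(?i)^https://www.fda.gov/Food/DietarySupplements"

links = [
    "https://www.fda.gov/Food/DietarySupplements/overview",  # in the sub-section
    "https://WWW.FDA.GOV/food/dietarysupplements/faq",       # accepted: (?i) ignores case
    "https://www.fda.gov/Drugs/some-page",                   # rejected: other sub-section
]
for url in links:
    print(url, "->", bool(re.match(pattern, url)))
```

Only links whose URL matches the pattern are followed; everything else on the page is ignored.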

Create Multiple Seed URLs

In the Seed URLs editor you can create multiple seed URLs, targeting different sub-sections on the same server or entirely different websites.

In our example we specify only one seed URL and a link depth of 1, because we do not want to follow links on pages beyond the first level.

Now close the editor.

Step 3 of 3

Run the Job

Yes, there is nothing more to do to configure a URL Crawler.

Now right-click the job node in the project tree and choose run job immediately.
The job is now started in the background.

See Progress in the Job Logfile

Right-click the job node in the project tree and click show log.
By opening the job logfile, you can monitor the progress of the job.
You may want to reload the logfile to see what is going on.

By default, you'll receive an email notification when the job is finished, so there is no need to constantly monitor the job logfile.

Finished - that was it...

...and your data is ready for export.

Open the Crawled content viewer by clicking on the Crawled content node in the project tree.
Select the records you want to export and click the export records button. The records will be downloaded in the format you specified.
Alternatively, you can retrieve the crawled content by querying the Query API.
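A Query API call is an ordinary authenticated HTTP GET. The endpoint, parameter names, and authentication scheme below are placeholders; consult your installation's Webservice API documentation for the real ones. The sketch only builds the request rather than sending it, since the URL is made up:

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and parameter names -- replace with the
# values from your Webservice API documentation.
BASE_URL = "https://example.com/api/query"
API_KEY = "YOUR_API_KEY"  # placeholder credential

params = urllib.parse.urlencode({"project": "my-url-crawler", "format": "json"})
req = urllib.request.Request(
    BASE_URL + "?" + params,
    headers={"Authorization": "Bearer " + API_KEY},
)
# urllib.request.urlopen(req) would send the request; we only
# construct it here because the endpoint above is a placeholder.
print(req.full_url)
```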

Job Automation

Modify Job Settings

You can modify the settings of the job by opening the job editor.
Right-click the job node in the project tree and click edit job.

You can specify a different log level to gain better control while defining your URL Crawler. (Be careful: verbose log levels can become wordy and eat up your disk quota pretty fast.)

Or check the job series check box; a scheduler then appears that allows you to define job execution times at different intervals.

When you are done, hit the save button.