The URL Crawler project template crawls a set of seed URLs, following all links on the retrieved web pages down to a predefined link depth.
The final result of the URL Crawler comprises the HTML content of all crawled web pages, or the extracted text of PDF and Office documents where a crawled URL points to a non-HTML document.
The content of all crawled URLs can either be exported from the Content Viewer in the Management Console or retrieved via a REST call to the Webservice API.
Step 1 of 3
Create a New Project from Template URL Crawler
You can build different URL Crawlers by targeting them at different websites, or you can mix a URL Crawler with other project types.
The number of projects and the project types you can choose are limited by the number of projects available in your account model.
Right click the root node in the project tree and choose new project,
then enter a name for your new project.
The project dialog opens and you can select the type of the project from the available templates.
From time to time new templates become available; you will be informed about them by email.
Now enter a description and hit save.
Your URL Crawler Is Ready!
In the project tree you can now see your new project with its three nodes: Seed URLs, Job (URL Crawler) and Crawled content.
Now open the Seed URLs editor by clicking on the Seed URLs node.
Step 2 of 3
Enter Seed URLs for Your Targets
In the Seed URLs editor click the new record button and enter a seed URL for one of your targets.
In our example we chose a sub-section of the American FDA providing information on dietary supplements. By specifying the regular expression (?i)^https://www.fda.gov/Food/DietarySupplements
we make sure that only links pointing to pages in this sub-section are accepted.
When you are done, hit save record.
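For illustration, here is a minimal Python sketch of how such an accept pattern filters extracted links. This is not the crawler's actual implementation, and the sample links are invented:

```python
import re

# The accept pattern from the example above; (?i) makes matching case-insensitive.
ACCEPT = re.compile(r"(?i)^https://www.fda.gov/Food/DietarySupplements")

links = [
    "https://www.fda.gov/Food/DietarySupplements/ProductsIngredients",
    "https://www.fda.gov/food/dietarysupplements/alerts",  # accepted thanks to (?i)
    "https://www.fda.gov/Drugs/",                          # outside the sub-section
]

# Keep only links whose start matches the accept pattern.
accepted = [url for url in links if ACCEPT.match(url)]
print(accepted)
```

Because the pattern is anchored with ^ and carries the (?i) flag, links that differ only in letter case are still accepted, while anything outside the sub-section is discarded.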
Create Multiple Seed URLs
In the Seed URLs editor you can create multiple seed URLs, targeting different sub-sections of the same server or entirely different websites.
In this example we specify only one seed URL and a link depth of 1, because we do not want to follow links on pages beyond the first level.
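To make the link depth setting concrete, here is a small breadth-first Python sketch of depth-limited crawling. The function names and toy link graph are made up for illustration; the real crawler also fetches actual pages and applies the accept pattern:

```python
from collections import deque

def crawl(seed, get_links, max_depth=1):
    """Breadth-first sketch of depth-limited crawling.

    `get_links(url)` stands in for fetching a page and extracting its links.
    Pages reached at `max_depth` are still retrieved, but their links
    are not followed any further.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    crawled = []
    while queue:
        url, depth = queue.popleft()
        crawled.append(url)
        if depth < max_depth:
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return crawled

# Toy link graph: the seed links to a and b; a links to c.
graph = {"seed": ["a", "b"], "a": ["c"], "b": [], "c": []}
print(crawl("seed", lambda u: graph.get(u, [])))  # → ['seed', 'a', 'b']
```

With a link depth of 1, page c is never crawled because it is two links away from the seed; raising max_depth to 2 would include it.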
Now close the editor.
Step 3 of 3
Run the Job
Yes, that is all there is to configuring a URL Crawler.
Now right click the job node in the project tree and choose run job immediately.
The job is now started in the background.
See Progress in the Job Logfile
Right click the job node in the project tree and click show log.
By opening the job logfile, you can monitor the progress of the job.
You may want to reload the logfile to see what is going on.
By default, you'll receive an email notification when the job is finished, so there is no need to constantly monitor the job logfile.
That was it...
...and your data is ready for export.
Open the Crawled content viewer by clicking on the Crawled content node in the project tree.
Select the records you want to export and click the export records button. The records will be downloaded in the format you specified.
Alternatively, you can retrieve the crawled content by querying the Query API.
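As a rough illustration, a retrieval call could look like the following Python sketch. The endpoint URL, query parameters, and authentication header are placeholders, not the product's actual API; consult the Webservice API documentation for the real details:

```python
import json
import urllib.request

# Placeholder values -- replace with the actual endpoint and credentials
# from your Webservice API documentation.
BASE_URL = "https://example.com/api/query"   # hypothetical endpoint
API_TOKEN = "YOUR_API_TOKEN"                 # hypothetical auth token

def build_query_url(project, limit=100):
    """Assemble a query URL for one project's crawled records (hypothetical schema)."""
    return f"{BASE_URL}?project={project}&limit={limit}"

def fetch_crawled_content(project, limit=100):
    """Retrieve crawled records as JSON via a REST GET request."""
    req = urllib.request.Request(
        build_query_url(project, limit),
        headers={"Authorization": f"Bearer {API_TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Such a call could be scheduled to pull freshly crawled content into a downstream pipeline without using the Content Viewer at all.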
Modify Job Settings
You can modify the settings of the job by opening the job editor.
Right click the job node in the project tree and click edit job.
You can specify a different log level for better control while you are defining your URL Crawler. (Be careful, as verbose logging can eat up your disk quota pretty fast.)
Or you can check the job series check box; a scheduler will then appear that lets you define job execution times at different intervals.
When you are done, hit the save button.