Browser Farm

The Browser Farm provides access to and retrieval of information from websites that rely on JavaScript to load their content. If an ordinary URL Crawler retrieves such a page, the only thing it will see is the basic layout of the page.

Such websites therefore cannot be crawled with an ordinary URL Crawler, because the actual content is requested from the server only after the page has been loaded. What the JavaScript loads may even depend on user interaction with the page, e.g. scrolling down may load additional content.

The Browser Farm: An Ecosystem for Browser Crawlers

The Browser Crawler makes it possible to load a web page, interact with it and take a snapshot of all content that has been loaded at a given point in time.
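In Selenese terms, the loading and interaction part of such a crawl might look like the minimal sketch below; the URL, the locators, the scroll step and the wait time are purely illustrative assumptions, and the snapshot itself is taken by the Browser Crawler after the script has finished:

    <table>
      <!-- load the page -->
      <tr><td>open</td><td>https://www.example.com/news</td><td></td></tr>
      <!-- wait until the JavaScript-rendered content has appeared -->
      <tr><td>waitForElementPresent</td><td>css=div.article</td><td></td></tr>
      <!-- interact with the page: scroll down to trigger lazy loading -->
      <tr><td>runScript</td><td>window.scrollTo(0, document.body.scrollHeight)</td><td></td></tr>
      <!-- give the additional content time to load before the snapshot is taken -->
      <tr><td>pause</td><td>2000</td><td></td></tr>
    </table>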

Browser Crawler Filter Type

The Browser Farm is controlled via the Browser Crawler filter type, which loads Selenese scripts and sends them to the Browser Farm for execution. Selenese scripts can be provided manually, but most of the time they will be parameterized and generated by ds9 filter chains before the Browser Crawler is started. Just as URLs for URL Crawlers are generated by replacing placeholders (e.g. parameters) in URL templates, Selenese scripts can be created with placeholders that are filled in with data produced by preceding filter steps.
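As a sketch of what such a parameterized script could look like, the fragment below uses a ${searchTerm} placeholder; the placeholder syntax, the URL and the element locators are assumptions made for illustration, with the actual value supplied by the preceding filter steps:

    <table>
      <!-- open the search page of the target site -->
      <tr><td>open</td><td>https://www.example.com/search</td><td></td></tr>
      <!-- enter the value produced by a preceding filter step (hypothetical placeholder) -->
      <tr><td>type</td><td>id=query</td><td>${searchTerm}</td></tr>
      <!-- submit the search and wait for the next page -->
      <tr><td>clickAndWait</td><td>id=submit</td><td></td></tr>
      <!-- wait until the JavaScript-rendered result list is present -->
      <tr><td>waitForElementPresent</td><td>css=div.results</td><td></td></tr>
    </table>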

To execute a Browser Crawler, the Browser Crawler filter type needs the URL of the Browser Farm, which is a web application that has to be installed on a separate server, as well as a username and a password for authentication.

Each executed Selenese script results in a document that is written to the output of the filter type for further processing.



Browser Farm Setup and Deployment

The Browser Farm provides the environment in which the Browser Crawlers are operated. Since Browser Crawlers interact with the web pages they retrieve, there must be a way to verify what a Browser Crawler is doing in case something does not work as intended. The Browser Crawlers are therefore actual Chrome browsers running on a minimal Linux GUI into which the ds9 user can log in and watch the "ghost-operated" browser while the Browser Crawler drives it.

What the browser does on a particular page is defined by the Selenese script that the Browser Crawler uses as input.