News Tracker Project Template

 

News Tracker

The News Tracker project template can be used to crawl news from a set of custom websites to collect independent news from targeted global sources. The final result of the News Tracker is a news ticker viewer showing the latest news from the targets you specified.

Step 1 of 5

Create a New Project from Template News Tracker

You can build different News Trackers by targeting them at different types of news, or you can mix a News Tracker with other project types. The number of projects and the project types you can choose are limited by the number of projects available in your account model.

Right-click the root node in the project tree and choose new project, then enter a name for your News Tracker.


Project Dialog

Select News Tracker as the project type from the list of available project templates in the project dialog. Over time we will publish new project templates and inform you about them by email. After you have entered a description, hit save to create the project.


Your News Tracker Is Ready!

In the project tree you can now see your new project with its seven nodes: News Tracker (the viewer), News Tracker (admin), Targets, Target news expressions, Notification recipient, News collection and the job Verify and Track to execute the News Tracker.
Now open the Targets editor by clicking on the Targets node to enter the URLs of your target web pages containing the news.


Step 2 of 5

Enter URLs of the News Pages

Our example extracts Recalls, Market Withdrawals and Safety Alerts from a page of the U.S. Food and Drug Administration (FDA).

In the Targets editor, click new record to enter Name, URL, Include pattern and Icon URL for each of your targets.
In our example, the URL field holds the address of the overview page: https://www.fda.gov/Safety/Recalls/default.htm. The news is not extracted from this overview page itself but from the pages that the individual news messages link to. We therefore specified a regular expression as the include pattern that matches the URLs of all pages containing actual news messages: https://www.fda.gov/Safety/Recalls/ucm\d*.htm.
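To see which URLs such an include pattern selects, it can be tried outside the tool first. The following is only a sketch in Python: the helper select_news_urls, the candidate URLs with their ucm numbers, and the stricter pattern variant are made up for illustration, and it assumes the pattern has to match the whole URL, which may differ from how the News Tracker applies it.

```python
import re

# Include pattern exactly as entered in the Targets editor.
INCLUDE_PATTERN = r"https://www.fda.gov/Safety/Recalls/ucm\d*.htm"

# Stricter variant (an assumption, not part of the template): literal dots
# are escaped and at least one digit is required after "ucm".
STRICT_PATTERN = r"https://www\.fda\.gov/Safety/Recalls/ucm\d+\.htm"


def select_news_urls(urls, pattern=INCLUDE_PATTERN):
    """Return the URLs that match the include pattern as a whole."""
    rx = re.compile(pattern)
    return [url for url in urls if rx.fullmatch(url)]


# Hypothetical links as they might be found on the overview page.
candidate_urls = [
    "https://www.fda.gov/Safety/Recalls/default.htm",
    "https://www.fda.gov/Safety/Recalls/ucm123456.htm",
    "https://www.fda.gov/Safety/Recalls/ArchiveRecalls/default.htm",
    "https://www.fda.gov/Safety/Recalls/ucm987654.htm",
]

news_urls = select_news_urls(candidate_urls)
for url in news_urls:
    print(url)
```

Only the two ucm pages are selected; the overview page and the archive page are skipped. Because an unescaped dot matches any character, the pattern as entered is slightly more permissive than it looks, which is why the stricter variant is shown as an alternative.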

When you are done, hit save record.


Create Multiple News Targets

In the Targets editor you can specify multiple news targets. The News Tracker supports news pages where the message is retrieved directly from the news page (link depth=0) as well as websites like the one in our example, where the headlines are listed on an entry page but the actual message is retrieved from the page each headline links to.

The Icon URL that you enter should reference a 16x16 pixel icon, in our example the favicon.ico of the FDA website.

Now close the editor.

Step 3 of 5



Target News Expressions are the Most Important Part of the Customization

Open the Target News Expressions dialog, and for each target, create a new record and enter Name, Regex name, Expression, Output and Name include pattern.
The regular expression must capture one complete news message and must output eight datafields in the following sequence:
Name, Date, Headline, Subtitle, Text, Source, Status and Content URL.


One Regular Expression per Target

In the example, the regular expression has six capture groups (\1 to \6), whose order is defined by the layout of the web page:
a part of the Content URL, Subtitle, Date, Company name (not used), Headline and Text. The eight output datafields are in a different order. The datafields Source and Status are hard-coded, and the Content URL is assembled from a hard-coded URL fragment and the captured URL part (\1 = capture group 1).
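The mapping from six capture groups to eight output datafields can be sketched in Python as follows. This is an illustration under assumptions, not the expression used in the template: the HTML snippet, the regular expression, the Name value "FDA Recalls" and the Status value "new" are invented, and the News Tracker's own Output syntax may differ from Python's backreference templates.

```python
import re

# Hypothetical fragment of a news page; the real FDA markup differs.
SAMPLE_HTML = (
    '<tr><td><a href="/Safety/Recalls/ucm123456.htm">Undeclared allergen</a></td>'
    "<td>01/02/2018</td><td>Example Foods Inc.</td>"
    "<td>Example Foods recalls cookies</td>"
    "<td>The product may contain undeclared milk.</td></tr>"
)

# Six capture groups in page order:
# \1 URL part, \2 Subtitle, \3 Date, \4 Company name (not used),
# \5 Headline, \6 Text.
NEWS_RX = re.compile(
    r'<a href="/Safety/Recalls/(ucm\d+\.htm)">([^<]*)</a></td>'
    r"<td>([^<]*)</td><td>([^<]*)</td><td>([^<]*)</td><td>([^<]*)</td>"
)

# Eight output datafields in the required sequence. Name, Source and Status
# are fixed strings here (the values are assumptions for this sketch).
OUTPUT_TEMPLATE = [
    ("Name", "FDA Recalls"),
    ("Date", r"\3"),
    ("Headline", r"\5"),
    ("Subtitle", r"\2"),
    ("Text", r"\6"),
    ("Source", "FDA"),
    ("Status", "new"),
    ("Content URL", r"https://www.fda.gov/Safety/Recalls/\1"),
]


def extract_news(html):
    """Return one dict of eight datafields per matched news message."""
    return [
        {field: match.expand(template) for field, template in OUTPUT_TEMPLATE}
        for match in NEWS_RX.finditer(html)
    ]


records = extract_news(SAMPLE_HTML)
for field, value in records[0].items():
    print(f"{field}: {value}")
```

Capture group 4 (the company name) is matched so that the groups after it line up with the page layout, but it does not appear in the output.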

Click save record and close the dialog.


Step 4 of 5



Specify the Notification Recipient

Each time the News Tracker runs, it verifies that the specified targets are accessible and that news can be retrieved from them.
Open the Notification Recipient dialog, create a new record and enter Name, First name, Email address and a Subject. This notification recipient then receives an email showing the targets that caused problems.
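The idea behind this verification can be illustrated with a short Python sketch. It does not show how the News Tracker works internally: the functions verify_targets and build_notification, the target records, the example.org addresses and the stand-in fetch function are all hypothetical.

```python
import re
from email.message import EmailMessage


def verify_targets(targets, fetch):
    """Return (target name, problem) pairs for targets that fail a check."""
    problems = []
    for target in targets:
        try:
            page = fetch(target["url"])
        except OSError as exc:
            # The target page could not be loaded at all.
            problems.append((target["name"], f"not accessible: {exc}"))
            continue
        if not re.search(target["expression"], page):
            # The page loaded, but the news expression found nothing.
            problems.append((target["name"], "no news message matched"))
    return problems


def build_notification(problems, recipient, subject):
    """Build an email that lists the targets that caused problems."""
    message = EmailMessage()
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(
        "\n".join(f"{name}: {reason}" for name, reason in problems)
    )
    return message


def fake_fetch(url):
    """Stand-in for a real HTTP request so the sketch runs offline."""
    pages = {
        "https://example.org/news": "<h2>Headline</h2>",
        "https://example.org/changed": "<div>no headlines here</div>",
    }
    if url not in pages:
        raise OSError("connection failed")
    return pages[url]


expression = r"<h2>[^<]+</h2>"
targets = [
    {"name": "Working target", "url": "https://example.org/news", "expression": expression},
    {"name": "Unreachable target", "url": "https://example.org/missing", "expression": expression},
    {"name": "Changed layout", "url": "https://example.org/changed", "expression": expression},
]

problems = verify_targets(targets, fake_fetch)
message = build_notification(problems, "editor@example.org", "News Tracker: target problems")
print(message.get_content())
```

The two failure cases correspond to the two checks named above: a target that cannot be reached, and a target whose page loads but no longer matches its news expression, for example after a layout change.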

Click save record and close the dialog.

Step 5 of 5



Now Run the News Tracker!

Right-click the job node in the project tree and click run job immediately.
While the job runs in the background, you can right-click the job node again and click show log to open the job logfile and monitor progress.
Reload the logfile from time to time to see the latest entries.

When the News Tracker is finished, you'll receive an email notification.


Open the News Tracker Viewer

Open the News Tracker viewer by clicking on the News Tracker node in the project tree.
Click once on a news message to display its text in a popup, provided a text was captured (this field is optional).
Double-click a news message to open the source from which the news was captured in a new browser tab.
Alternatively, you can retrieve the crawled content by opening the News collection viewer, from where you can export the news messages.


Job Automation

Modify Job Settings

You can modify the settings of the job by opening the job editor.
Right-click the job node in the project tree and click edit job.

You can specify a different log level to get better control while you are defining your News Tracker. (Be careful: a detailed log level produces verbose logfiles that can eat up your disk quota quickly.)

Alternatively, check the job series check box. A scheduler then appears that lets you define job execution times at different intervals.

When you are done, hit the save button.