The DS9 system offers more than 60 filter types to crawl information on the web, retrieve content, extract data, filter by information, aggregate and link information, semantically enrich and prepare data for semantic search.
Reaching Out to the Web
Enterprises need access to information that is crucial for their research and development. This information is bound to be available on the Internet but cannot be found with existing state-of-the-art technology, because it is either not available in the Surface Web at all, not visibly linked to other information in the same context, or not recognizable as such.
Custom solutions to manage this type of corporate intelligence are the only means to build a SEARCHCORPUS® that integrates Deep and Surface Web information by retrieving it, filtering and aggregating it and finally linking it together to make it searchable.
DS9 offers different types of crawlers, each with a particular strength that sets it apart from the others. URL Crawlers can be very fast but are suitable for plain HTML web pages only.
Concurrent Crawlers send out multiple crawlers in parallel, which makes sense when crawling multiple servers at once. The Dark Web Crawler understands Onion URLs and
knows how to reach the TOR network. Browser Crawlers are slower, as they retrieve not only the HTML source of a page but other resources as well. But for some websites using
Many websites offering data feeds provide archives from which the whole data set can be downloaded at once. Archive Readers supporting formats like bz2, gz, gzip, jar, tar, tgz, zip, ... download, open and extract such archives and publish the data on an internal web server, from which it can be used for further analysis.
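The "open and extract" step of an Archive Reader can be illustrated with a minimal Python sketch. It is not DS9 code: the function names `extract_archive` and `make_sample_zip` are hypothetical, only zip and tar (optionally gzip- or bz2-compressed) are handled, and the download and publication steps are left out.

```python
import io
import tarfile
import zipfile


def extract_archive(data: bytes) -> dict[str, bytes]:
    """Return {member name: content} for a zip or tar archive held in memory."""
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        buffer.seek(0)
        with zipfile.ZipFile(buffer) as archive:
            # Skip directory entries, keep regular files only.
            return {
                info.filename: archive.read(info)
                for info in archive.infolist()
                if not info.is_dir()
            }
    buffer.seek(0)
    # Mode "r:*" lets tarfile detect gzip or bz2 compression on its own.
    with tarfile.open(fileobj=buffer, mode="r:*") as archive:
        files = {}
        for member in archive.getmembers():
            if member.isfile():
                files[member.name] = archive.extractfile(member).read()
        return files


def make_sample_zip() -> bytes:
    """Build a tiny in-memory zip so the sketch can be tried without a download."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("feed/item1.json", '{"id": 1}')
    return buffer.getvalue()


extracted = extract_archive(make_sample_zip())
```

The extracted files could then be written to a directory served by an internal web server; how DS9 does this is not described in the text above.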
The same publication method can be used to publish web pages compiled by a filter chain on an internal web server, from which users can retrieve them with their browsers just like from an ordinary website.
DS9 APIs are provided on an optional basis and can be ordered by customers who need DS9 to communicate directly with third-party systems. DS9 currently supports APIs to several partners offering advanced text analysis, named entity recognition and classification methods via web service.
Please contact us for more details on API filter types.
Advanced Web Analytics
DS9 Advanced Web Analytics expands analytics on databases and data warehouses (also known as business intelligence) to the information available on the web. Advanced Web Analytics is used to Manage Intelligence by linking classical data sources to newly discovered intelligence using pattern matching, filtering, aggregation, text analysis, text mining, semantic analysis (see also Semantic Web Technologies) and deep learning.
Custom Managed Intelligence Solutions using Advanced Web Analytics are the only means to build a SEARCHCORPUS® that integrates Deep and Surface Web information by retrieving it, filtering and aggregating it and finally linking it together to make it searchable.
Like SQL, DS9 provides operators such as LEFT JOIN, INNER JOIN, DISTINCT JOIN and CROSS JOIN, as well as UNIONs on datafields of containers,
where the user can define arbitrary datafields as keys. Transposers, Deduplicators and Datafield Mergers, as well as Mergers and
Joins operating on a whole SEARCHCORPUS®, are special aggregators that can be used to support web data analytics.
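To make the join semantics concrete, here is a small Python sketch of an INNER JOIN and a LEFT JOIN over records keyed on an arbitrary datafield. The record layout (plain dictionaries) and the function names are assumptions for illustration, not the DS9 operators themselves.

```python
from typing import Any

Record = dict[str, Any]


def _index_by_key(records: list[Record], key: str) -> dict[Any, list[Record]]:
    """Group records by the value of the chosen key datafield."""
    index: dict[Any, list[Record]] = {}
    for record in records:
        index.setdefault(record[key], []).append(record)
    return index


def inner_join(left: list[Record], right: list[Record], key: str) -> list[Record]:
    """Keep only left records that have at least one partner on the right."""
    index = _index_by_key(right, key)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def left_join(left: list[Record], right: list[Record], key: str) -> list[Record]:
    """Keep every left record; merge in right-hand datafields where a partner exists."""
    index = _index_by_key(right, key)
    joined: list[Record] = []
    for l in left:
        matches = index.get(l[key])
        if matches:
            joined.extend({**l, **r} for r in matches)
        else:
            joined.append(dict(l))
    return joined


companies = [{"domain": "a.com", "name": "A"}, {"domain": "b.com", "name": "B"}]
pages = [{"domain": "a.com", "title": "Home"}]

inner = inner_join(companies, pages, key="domain")
outer = left_join(companies, pages, key="domain")
```

Here `inner` contains only the record for `a.com`, while `outer` additionally keeps the unmatched `b.com` record without a `title` datafield.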
Web Data Extraction
Web Data Extraction, or Scraping, is done using Regular Expression Matchers that work either on the recognized content of a web document, on the raw content of an HTML, XML, JSON, ...
document, or on the content of a datafield. By generating new content from matched expressions, existing content can be converted into content in an arbitrary other format, defined by the user who develops the DS9 solution.
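The idea of matching an expression and generating new content from the match can be sketched in Python as follows. The pattern, the sample markup and the output template are invented for illustration; the configuration syntax of a real DS9 Regular Expression Matcher is not shown in this text.

```python
import re

# Hypothetical pattern: a price inside a <span class="price"> element.
PRICE_PATTERN = re.compile(
    r'<span class="price">\s*(?P<amount>\d+(?:\.\d+)?)\s*(?P<currency>[A-Z]{3})\s*</span>'
)


def extract_prices(raw_html: str, template: str = "{currency} {amount}") -> list[str]:
    """Find every price in the raw HTML and re-emit it in the template's format."""
    return [
        template.format(**match.groupdict())
        for match in PRICE_PATTERN.finditer(raw_html)
    ]


sample = (
    '<li><span class="price">19.90 EUR</span></li>'
    '<li><span class="price"> 5 USD </span></li>'
)
prices = extract_prices(sample)
```

The named groups act like datafields: the template decides how matched values are rearranged into the new content.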
Using Term Filters, Regular Expression Matchers or Lucene Query Matchers, DS9 Advanced Web Analytics filters records or SEARCHCORPUS® content based on datafields of the record or on
raw content in HTML, XML, JSON, ... format.
Some filters also count Term Frequency or the number of matches of specific patterns and provide information on the occurrence of specific terms that can then again be applied to further analytics.
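A rough Python sketch of term counting and a term-based filter is given below. The tokenization (lowercased word characters), the `body` datafield and both function names are assumptions made for this example only.

```python
import re
from collections import Counter


def term_frequency(text: str, terms: list[str]) -> dict[str, int]:
    """Count how often each term occurs as a whole word, ignoring case."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return {term: counts[term.lower()] for term in terms}


def filter_records(records: list[dict], field: str, term: str, min_count: int = 1) -> list[dict]:
    """Keep records whose datafield mentions the term at least min_count times."""
    return [
        record
        for record in records
        if term_frequency(record.get(field, ""), [term])[term] >= min_count
    ]


frequencies = term_frequency("Solar cells and solar panels", ["solar", "wind"])
records = [{"body": "solar solar"}, {"body": "wind"}]
kept = filter_records(records, field="body", term="solar", min_count=2)
```

The counts returned by `term_frequency` could be stored in a new datafield and reused by later filters, which is the kind of chaining the paragraph above describes.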
sandboxed to prevent security leaks.
Semantic Technologies (of which Semantic Web Technologies are a subset), as used in the DS9 filter types, comprise natural language processing (NLP), category tagging, semantic search and deep learning.
We use Semantic Technologies in filter types that process unstructured information to detect content in raw HTML documents (removing clutter such as navigation or ads), recognize building blocks or structure content. DS9 filter types support
categorization of content based on pattern matching, as well as functionality such as Profile Matching based on document content.
Concept Tagging is an applied semantic technology that uses lists of synonyms to tag concepts in text content, provided either as raw text or as the content of datafields extracted by means of web data analysis.
Tagged concepts can then be used to classify documents or to prepare the documents for faceted search.
Often raw content needs to be preprocessed to separate layout information such as headers, footers, navigation elements and advertisements from the actual content. The Content Detector is one of the filter types using pattern matching and heuristics to detect meaningful content and separate it from clutter.
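Concept Tagging with synonym lists, as described earlier in this section, might look like the following Python sketch. The concepts, the synonym lists and the function name are made up for illustration, and only exact variants are matched (no stemming or plural handling).

```python
import re

# Hypothetical synonym lists: concept name -> surface forms that signal it.
CONCEPT_SYNONYMS = {
    "photovoltaics": ["solar cell", "solar panel", "photovoltaic"],
    "wind power": ["wind turbine", "wind farm"],
}


def tag_concepts(text: str, synonyms: dict[str, list[str]] = CONCEPT_SYNONYMS) -> dict[str, int]:
    """Return each concept found in the text with its number of synonym hits."""
    tags: dict[str, int] = {}
    for concept, variants in synonyms.items():
        # Whole-word, case-insensitive match on any listed variant.
        pattern = r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if hits:
            tags[concept] = hits
    return tags


tags = tag_concepts("A new solar panel beats every older Solar Cell.")
```

The resulting concept counts can serve as facet values or as input to a later classification step.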
A frequently occurring task in Advanced Web Analytics is the classification of whole documents or records of a data set based on unstructured text content.
Classification methods include simply matching occurring concepts with document classes or weighing concepts based on the number of occurrences in the documents.
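The weighting variant can be sketched in Python as a simple scoring function. The classes, concepts and weights are invented, and the rule "highest weighted sum wins" is one plausible reading of the description, not a documented DS9 algorithm.

```python
from typing import Optional

# Hypothetical mapping: document class -> weight of each concept for that class.
CLASS_WEIGHTS = {
    "renewable energy": {"photovoltaics": 2.0, "wind power": 1.5},
    "finance": {"funding": 1.0, "merger": 2.0},
}


def classify(
    concept_counts: dict[str, int],
    class_weights: dict[str, dict[str, float]] = CLASS_WEIGHTS,
) -> Optional[str]:
    """Pick the class with the highest weighted concept count, or None if nothing matches."""
    scores = {
        cls: sum(weight * concept_counts.get(concept, 0) for concept, weight in weights.items())
        for cls, weights in class_weights.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


label = classify({"photovoltaics": 2, "funding": 1})
```

With all weights set to the same value, this reduces to the simpler method of matching occurring concepts against document classes.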
Profile matching is used to assign records in a data set, e.g. news or whole documents, to profiles, mostly profiles of users who are interested in specific information.
Profiling is done based on Lucene Queries or by matching expressions against the structured or unstructured information in the records or documents provided.
Profile matching is often applied to filter news and assign it to users, format it into newsletters and send these out to recipients fully automatically.
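As a rough illustration of expression-based profile matching, the Python sketch below assigns news records to user profiles. Regular expressions stand in for Lucene Queries here, and the profile names, patterns and `text` datafield are all hypothetical.

```python
import re

# Hypothetical profiles: user -> expressions describing what the user is interested in.
PROFILES = {
    "alice": [r"\bsolar\b", r"\bbatter(?:y|ies)\b"],
    "bob": [r"\bmerger\b"],
}


def match_profiles(
    records: list[dict],
    profiles: dict[str, list[str]] = PROFILES,
    field: str = "text",
) -> dict[str, list[dict]]:
    """Assign each record to every profile with at least one matching expression."""
    assignments: dict[str, list[dict]] = {user: [] for user in profiles}
    for record in records:
        content = record.get(field, "")
        for user, patterns in profiles.items():
            if any(re.search(p, content, re.IGNORECASE) for p in patterns):
                assignments[user].append(record)
    return assignments


news = [
    {"id": 1, "text": "New battery plant opens"},
    {"id": 2, "text": "Merger talks stall"},
    {"id": 3, "text": "Weather update"},
]
assignments = match_profiles(news)
```

Each user's list in `assignments` could then be rendered into a newsletter; record 3 matches no profile and is not sent to anyone.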