Deep SEARCH 9® Custom Solutions


We Specialize in Building Managed Intelligence Solutions for Our Customers

Our team of data scientists and computational linguists has unique experience and expertise.
We have not only developed the Deep SEARCH 9 technology but have also built countless solutions on top of it for our customers and partners.
  • Web crawling across the Surface Net, Deep Net and Dark Net, with all their problems and pitfalls
  • Extraction of data from any type of unstructured source
  • Building a SEARCHCORPUS® with millions of documents
  • High frequency tracking of web resources up to every 2 seconds
  • Building Search Engines for faceted search using semantic web technology
  • Annotation of natural language text sources
  • Ontology management
  • Linking of data from multiple web-based and internal sources
  • Scaling of hosting environments with automated cloud-based upscaling (Amazon EC2)
  • ...
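To illustrate the high-frequency tracking item above: the core of such a tracker can be a tight polling loop that hashes each fetched payload and records only actual changes. This is a minimal sketch, not DS9's production tracker; the `fetch` callable stands in for a real HTTP GET.

```python
import hashlib
import time
from typing import Callable, List

def track(fetch: Callable[[], bytes], interval: float, ticks: int) -> List[str]:
    """Poll a resource `ticks` times and collect a digest for each change.

    Hashing the payload keeps the change check cheap even for large pages;
    `interval` would be ~2.0 seconds for the high-frequency case above.
    """
    last_digest = None
    changes = []
    for _ in range(ticks):
        digest = hashlib.sha256(fetch()).hexdigest()
        if digest != last_digest:  # content changed since last poll
            changes.append(digest)
            last_digest = digest
        time.sleep(interval)
    return changes
```

In a test you can drive it with a simulated fetch function and a zero interval; in production the fetch would hit the tracked URL.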
 
We are sure: few teams out there can match our combination of project experience and mastery of this technology stack.

Crawling Services

Setting up a crawler is not that difficult. Our best practices allow us to, for example, set up a crawler for a few thousand corporate websites in just hours.

As input we need either company names or URLs.
  • If we get the URLs, we know where to crawl, and the challenge is to find out whom we are crawling.
  • If we get a company name, we know whom we are crawling, and the challenge is to find the right website that belongs to that specific company.
For both challenges we have solutions such as reference crawling, which uses crawled data from other sources that are likely to link to these companies.
Depending on the targets, we either use our highly parallel, high-speed URL Crawlers or, if JavaScript and CSS must be interpreted by a browser to access the content, our browser farm. If the target is in the Dark Web, we use our TOR Crawlers, which know how to deal with Onion URLs.
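The company-name-to-website challenge can be pictured as a candidate-scoring problem: given a name and a handful of candidate URLs (e.g. from a search engine or from reference crawling), rank them by how well the domain and homepage title match the name. This is a toy heuristic sketch, not DS9's actual matching logic; the weights and similarity measure are illustrative assumptions.

```python
import difflib
import re

def score_candidate(company: str, url: str, page_title: str) -> float:
    """Heuristic score for how likely `url` is the company's own website."""
    name = re.sub(r"[^a-z0-9]", "", company.lower())
    host = url.split("//")[-1].split("/")[0].replace("www.", "")
    domain = host.split(".")[0]
    # Domain similarity: does the registered name resemble the company name?
    domain_sim = difflib.SequenceMatcher(None, name, domain).ratio()
    # Title similarity: companies usually put their name in the homepage title.
    title_sim = difflib.SequenceMatcher(
        None, company.lower(), page_title.lower()).ratio()
    # Weights are illustrative; a real system would be tuned on labeled data.
    return 0.6 * domain_sim + 0.4 * title_sim

def best_candidate(company, candidates):
    """Pick the (url, title) pair with the highest score."""
    return max(candidates, key=lambda c: score_candidate(company, c[0], c[1]))
```

A production matcher would add many more signals (imprint pages, registry data, inbound links from reference crawls), but the ranking structure stays the same.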

You can always contact us and we'll explain in more detail how this works; it's a great piece of technology. (We do not reveal everything we do here, because this is special knowledge we have acquired over time.)

Special Knowledge Is the Key

It is always easy to get an 80% solution quickly. When we get a list of URLs, about 90% of the targets crawl without problems. The remaining 10% are usually a bit more difficult:
  • Wrong URL
  • Company no longer exists
  • Server down
  • robots.txt does not allow us to crawl
  • Website requires a browser to crawl
  • Website is protected by CAPTCHAs
For all of the above problems there are solutions, but you need the technology and the experience to solve each specific issue.
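The first step toward solving these issues is recognizing which one you are facing. A crawler can classify each failed fetch into the categories listed above and route it to the right remedy. The sketch below is an illustrative triage function, assuming simplified inputs (DNS result, HTTP status, robots.txt text, response body); it is not DS9's production logic.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

def classify(url: str, dns_ok: bool, status: Optional[int],
             robots_txt: str, body: str) -> str:
    """Map a fetch outcome to one of the failure classes listed above."""
    if not dns_ok:
        return "wrong URL or company gone"   # domain does not resolve
    if status is None:
        return "server down"                 # connection failed entirely
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    if not rp.can_fetch("ds9-bot", url):     # "ds9-bot" is a placeholder agent name
        return "robots.txt disallows crawling"
    if "captcha" in body.lower():
        return "captcha-protected"
    # Tiny body that is mostly a script tag: content is likely rendered client-side.
    if len(body.strip()) < 200 and "<script" in body.lower():
        return "needs a browser (JavaScript-rendered)"
    return "ok"
```

Each non-"ok" class then maps to a remedy: re-resolving the URL, retry scheduling, the browser farm, and so on.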
 

Scraping Services and Web Data Extraction

Scraping structured data from unstructured websites requires some knowledge of HTML, the language in which most documents on the web are written. There are methods that let users simply select some text on a web page and have a scraper extract that text from all similar pages.

This really works, and we started with such an approach, but we got frustrated quickly: in 95% of the cases we have seen in our projects, two pages may look exactly alike while the underlying HTML differs slightly, which makes the auto-generated scrapers fail miserably.

Therefore we have abandoned fully automatic extraction. We stick with something that is a bit more abstract and requires some knowledge (again), but which has been around in text analysis since the 1950s.

Regular Expressions.

To make it easier and less error-prone to come up with the right regular expression for a given problem, we have developed RegExpert, a tool that greatly supports our Data Scientists. RegExpert is part of the ds9 Integrated Development Environment, which ships with the ds9 Developer's Edition.
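To give a flavor of the approach: a single regular expression can pull structured values out of messy HTML regardless of minor markup differences between pages. The pattern below, extracting prices with a currency marker, is an illustrative example of the kind of expression a tool like RegExpert helps refine; it is not taken from a DS9 project.

```python
import re

# Amount with optional thousands separators and decimal part,
# followed by a currency symbol or code (illustrative, not exhaustive).
PRICE = re.compile(
    r"(?P<amount>\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})?)\s*"
    r"(?P<currency>EUR|USD|€|\$)"
)

def extract_prices(html: str):
    """Return all (amount, currency) pairs found in an HTML fragment."""
    return [(m.group("amount"), m.group("currency"))
            for m in PRICE.finditer(html)]
```

Because the pattern matches the text itself rather than the surrounding tags, it survives the small HTML variations that break auto-generated scrapers.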

With this tool support we can efficiently build the customized ds9 solution for you.

Linking Structured Data from Different Sources

Often, business-critical internal information needs to be enriched with external knowledge that is not available in a form that can be integrated with internal data structures. We prepare the data extracted from the web in a structured form that can easily be merged into existing databases, either automatically, semi-automatically, or in a manually supervised process.
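The automatic/supervised split described above can be sketched as a merge step that fuzzily matches external records to internal rows and sends low-confidence matches to manual review. The field names and threshold here are illustrative assumptions, not DS9's actual linking pipeline.

```python
import difflib

def merge_records(internal, external, threshold=0.85):
    """Attach external attributes to internal rows by fuzzy company-name match.

    Rows without a confident match are returned separately for manual
    review, mirroring the automatic vs. supervised split described above.
    """
    merged, review = [], []
    for row in internal:
        best, best_score = None, 0.0
        for ext in external:
            score = difflib.SequenceMatcher(
                None, row["name"].lower(), ext["name"].lower()).ratio()
            if score > best_score:
                best, best_score = ext, score
        if best and best_score >= threshold:
            # Copy the external attributes, keeping the internal name as-is.
            merged.append({**row, **{k: v for k, v in best.items() if k != "name"}})
        else:
            review.append(row)
    return merged, review
```

In practice the match key would combine several attributes (name, address, registry IDs), but the structure, confident matches merged and doubtful ones escalated, stays the same.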

Scraping, or, more elegantly, Web Data Extraction, retrieves data such as competitive product prices from publications on a competitor's website: data that is simply not available anywhere else.

ds9 Scraping solutions give you access to such data that cannot be acquired elsewhere.