Company SEARCHCORPUS®

Synonym for “most advanced scientific company web search”

Search 100,000s of company websites with millions of web pages within seconds

Monitoring company websites around the globe

The DS9 Company SEARCHCORPUS® is the only web search engine that can be licensed and customized to be filled with industry-specific start-ups or other companies.

Collecting company information from multiple diverse sources, DS9 crawlers frequently retrieve entire company websites and store them in an inverted full-text index.
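An inverted full-text index maps each term to the set of documents that contain it, so a query only touches the postings of its terms instead of scanning every page. A minimal sketch (the function and toy corpus are illustrative, not DS9's actual implementation):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids containing it (an inverted index)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Toy corpus: page id -> page text
pages = {
    "acme/about": "antibody discovery platform",
    "biotec/home": "protein engineering and antibody screening",
}
index = build_index(pages)
print(sorted(index["antibody"]))  # ids of pages containing the term
```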

The SEARCHCORPUS® optionally integrates modules for entity recognition, geocoding, or classification to further optimize the indexed content and deliver better results faster.

 

See how it works, download the Company SEARCHCORPUS® A1 poster here:

 

1. + 2. Selecting targets and extracting company data

When retrieving company information from one of the targets (described below), two steps are performed:

1. Retrieve company raw data from source
2. Extract company data from retrieved raw data

These two process steps differ for each target.

Crunchbase Pro Companies


Company data from Crunchbase is retrieved via the Crunchbase Enterprise API. The company name, description, address information, and the URL of the company website are extracted.

As an option, company funding data from Crunchbase can also be retrieved, which means you can analyze investment trends for technologies and research based on scientific entities recognized on the company websites.
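The two-step retrieve/extract pattern can be sketched for this target as follows. The field names in the raw record are purely illustrative; the actual Crunchbase Enterprise API schema may differ:

```python
def extract_company(raw):
    """Step 2: pull the fields of interest out of a raw API record.

    Field names below are illustrative, not the real Crunchbase schema.
    """
    return {
        "name": raw.get("name"),
        "description": raw.get("short_description"),
        "address": raw.get("location"),
        "website": raw.get("homepage_url"),
        "funding_usd": raw.get("funding_total_usd"),  # optional funding data
    }

# Step 1 would fetch a record like this from the API; hard-coded here.
raw_record = {
    "name": "Acme Bio",
    "short_description": "Antibody discovery start-up",
    "location": "Cambridge, MA, US",
    "homepage_url": "https://acme.example",
    "funding_total_usd": 12_000_000,
}
print(extract_company(raw_record)["website"])
```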

Venture Capital Portals

Venture Capital Portals: another important source of company data

About one third of a large Company SEARCHCORPUS® is retrieved from Venture Capital Portals.

Venture Capital Companies

Venture Capital Companies focused on biotech or life science companies provide company data of excellent quality.

Because Venture Capital Companies typically list only the companies they have invested in on their websites (unlike Venture Capital Portals), they provide access to only a very limited number of companies.

News Portals

On larger biotech or pharma news portals, our crawlers find news published by companies from that industry. This news generally links directly back to the website of the publishing company.

In some cases the portals provide a list of the biotech or pharma companies they cover.

Clinical Trial Sponsors

If you license the DS9 Clinical Trials Registry Tracker, it injects sponsor data from clinical trials into the Company SEARCHCORPUS® whenever the company is not yet known from other sources.

It is alarming how many companies that fund clinical trial studies, and should therefore be well funded and established for some time, are not listed in any of the other sources.

Conference Exhibitors

The Conference Exhibitor SEARCHCORPUS® is a separate solution that lets you quickly find interesting exhibitors at large pharma conferences, which can have several thousand exhibitors, each described with a maximum of 100 words.

Many conference exhibitors are start-ups from the Far East that are not listed in any other source. The additional benefit of the Conference Exhibitor SEARCHCORPUS® is therefore that these exhibitors can be added to the bigger Company SEARCHCORPUS®.

3. Insight App or Data Lake feed?

Once all company data has been collected, you need to decide whether you want to find acquisition targets or licensing opportunities in an interactive DS9 Insight App, or whether you want to feed the collected company data into your Data Lake or other 3rd party systems.

4. Export

The data of any container in the process can be exported using one of the ways described below to get data from DS9 into a 3rd party system:

  • Send your data as email attachment either in PDF or TSV format
  • Interactively export search results from your Insight App
  • DS9 pushes the data into a database or into an Amazon S3 bucket
  • Query the data from one of the DS9 APIs
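Of the formats above, TSV is the simplest to illustrate. A minimal sketch of serializing company records as a TSV attachment, using Python's standard `csv` module (the sample records are illustrative):

```python
import csv
import io

def to_tsv(rows, fieldnames):
    """Serialize company records as TSV, one of the export formats listed above."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

companies = [
    {"name": "Acme Bio", "country": "US"},
    {"name": "BioTec GmbH", "country": "DE"},
]
print(to_tsv(companies, ["name", "country"]))
```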

5. Crawl SEARCHCORPUS®

DS9 crawlers process multiple company websites concurrently

If the extracted company data does not contain the website’s URL, DS9 can usually still retrieve the actual URL of the corporate website by executing a Bing search through the Microsoft Azure Cognitive Services API.

Finally, after validating the URLs, DS9 crawlers can start to crawl all pages found on the corporate website by recursively following links on these pages.
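The recursive link-following step can be sketched as a breadth-first crawl that stays on the company's own host. This is a simplified stand-in for DS9's crawlers (no politeness delays, robots.txt handling, or deduplication beyond a seen-set); `fetch(url)` is a placeholder that must return the page text and its outgoing links:

```python
from urllib.parse import urljoin, urlparse

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of one corporate site: recursively follow links
    but never leave the start URL's host. `fetch(url)` -> (text, links)."""
    host = urlparse(start_url).netloc
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        text, links = fetch(url)
        pages[url] = text
        for link in links:
            absolute = urljoin(url, link)  # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in seen:
                queue.append(absolute)
    return pages

# Simulated two-page site instead of real HTTP fetching.
site = {
    "https://acme.example/": ("home", ["/about"]),
    "https://acme.example/about": ("about us", ["/"]),
}
pages = crawl("https://acme.example/", lambda u: site.get(u, ("", [])))
print(sorted(pages))
```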

Distributed crawling in the Amazon Cloud

URLs are mapped to different AWS regions (US East Coast, London, Mumbai, Sydney, Tokyo, Seoul, and São Paulo) so that each website is crawled from a DS9 EC2 instance located geographically as close to the webserver as possible.

The chart shows the time a signal needs to travel from the DS9 server in Nuremberg to a targeted webserver: 270 ms to Tokyo, for example, compared to 26 ms to London.

With millions of pages and many terabytes of data, this geographic optimization saves several days of crawling time.
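A back-of-envelope calculation shows why. Assuming one round trip per page and purely sequential fetching (an upper bound; real crawlers run concurrently, so absolute numbers are only indicative), and using the 270 ms and 26 ms figures above with an assumed corpus of five million pages:

```python
pages = 5_000_000      # assumed corpus size, illustrative only
rtt_far = 0.270        # seconds, Nuremberg -> Tokyo (figure from the chart)
rtt_near = 0.026       # seconds, from a nearby EC2 instance (London figure as a stand-in)

saved = pages * (rtt_far - rtt_near)   # seconds saved, one round trip per page
print(f"{saved / 86400:.1f} days")     # sequential upper bound, roughly two weeks
```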

6. Sending alerts

Profile based alerting when interesting new companies pop up

Every time the SEARCHCORPUS® is updated, registered user profiles containing scientific queries (in the Lucene query language) can be run against the new SEARCHCORPUS®.

Snippets of matching web pages, together with the company name, description, and other metadata, are compiled into an Excel or PDF report and sent by email to the recipients.

The number of profiles is not limited, and a profile can be sent to an individual or to multiple recipients.
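The alerting loop can be sketched as follows. The real system evaluates Lucene queries against the index; this stand-in reduces a query to a set of required terms and matches them against newly indexed pages, grouping hits by recipient:

```python
def run_profiles(profiles, new_pages):
    """Match stored alert profiles against newly indexed pages.

    Real profiles use the Lucene query language; this sketch treats a
    query as a set of terms that must all occur in a page.
    """
    reports = {}
    for profile in profiles:
        terms = set(profile["query"].lower().split())
        hits = [url for url, text in new_pages.items()
                if terms <= set(text.lower().split())]
        if hits:
            for recipient in profile["recipients"]:
                reports.setdefault(recipient, []).extend(hits)
    return reports  # recipient -> matching page URLs, ready for report compilation

profiles = [{"query": "antibody discovery", "recipients": ["a@example.com"]}]
new_pages = {
    "acme.example/tech": "our antibody discovery platform",
    "biotec.example/news": "protein engineering update",
}
print(run_profiles(profiles, new_pages))
```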

7. Geolocation of companies

Revealing geographic hubs in response to scientific queries

Revealing geographic hubs in response to scientific queries requires enriching company information with geolocation data.

DS9 uses an API to convert geographic information into geolocations. The greater the precision of the available geographic information, the more precisely the company’s location will be shown on the map.

Most sources provide country information for each company; some sources, however, also provide full address details. It is a matter of configuration whether geographic locations should be generated with country or address resolution.
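The resolution switch can be sketched as a small configuration-driven function that decides which fields are sent to the geocoding API. The field names and the function itself are illustrative, not DS9's actual configuration interface:

```python
def build_geocode_query(company, resolution="country"):
    """Choose which fields go to the geocoding API, per configuration.

    `resolution` is either "country" or "address"; falls back to the
    country when no address is available. Field names are illustrative.
    """
    if resolution == "address" and company.get("address"):
        return f'{company["address"]}, {company["country"]}'
    return company["country"]

acme = {"name": "Acme Bio", "address": "1 Main St, Cambridge, MA", "country": "US"}
print(build_geocode_query(acme, "address"))  # full-address resolution
print(build_geocode_query(acme, "country"))  # country-level resolution
```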

8. Semantic search

Tagging web pages with concepts from hierarchical thesauri for entity recognition

DS9 taggers can use multiple hierarchical thesauri to tag retrieved web pages

The DS9 Thesaurus Editor can be used to assemble a problem-specific hierarchical thesaurus within minutes from existing preloaded reference thesauri stored in SKOS or another RDF format (e.g. MeSH or your internal thesaurus/taxonomy). Alternatively, a thesaurus can be built from scratch with the Thesaurus Editor, and more complex constraints can be defined for concepts to improve the matching of scientific terms in more marketing-oriented website content.

Recognized entities across all pages of a corporate website are extracted and mapped to the company, allowing semantic filtering of companies before starting an actual scientific search.
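The tagging and company-level rollup can be sketched as follows. The thesaurus here is a flat stand-in (concept id mapped to its labels) for a hierarchical SKOS thesaurus, and simple substring matching stands in for DS9's entity recognition:

```python
def tag_company(pages, thesaurus):
    """Tag each page with concepts whose labels occur in its text, then
    roll the tags up to the company level for semantic filtering.

    `thesaurus` maps concept id -> list of labels; a flattened stand-in
    for a hierarchical SKOS thesaurus.
    """
    company_tags = set()
    for text in pages:
        lowered = text.lower()
        for concept, labels in thesaurus.items():
            if any(label in lowered for label in labels):
                company_tags.add(concept)
    return company_tags

# Illustrative concept ids styled after MeSH; not real recognition rules.
thesaurus = {
    "MESH:D000906": ["antibody", "antibodies"],
    "MESH:D011506": ["protein"],
}
pages = ["Our antibody platform", "Careers at Acme"]
print(sorted(tag_company(pages, thesaurus)))  # ['MESH:D000906']
```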