Spotlight on GenomeQuest

Recently I was able to attend a demo of the GenomeQuest sequence searching tool, which is designed to support sequence searching for prior art investigations.   GenomeQuest provides access to proprietary patent database collections which have been indexed especially for sequence searching, as well as to public access databases of genetic and protein sequences.

One of GenomeQuest’s most notable databases, GQ-PAT, contains a proprietary collection of nucleotide and protein sequences extracted from patent collections, including the US, EPO, WO/PCT, and the DNA Databank of Japan (where the JPO deposits patents that contain sequences). Because some WO/PCT documents are only available as images and not as electronic text, GenomeQuest employs an in-house Optical Character Recognition (OCR) process that can involve human editing with the assistance of a related machine-readable document, such as a US family member. The patents in GQ-PAT are also supplemented by corresponding INPADOC records to ensure that their legal status and assignee information stay up to date with this source.

This database differs from the Thomson Reuters database of genetic material, GeneSeq, in a number of notable ways.  First, GQ-PAT draws mainly from the US, EPO, WO/PCT, and JPO databases, while GeneSeq is based on the Derwent World Patent Index, and will contain sequence records from Derwent “basic” patent documents published by over 40 patent offices (for more about Derwent “basics,” see the Intellogist page on DWPI patent families).  Next, while GeneSeq records are hand indexed as part of the Derwent World Patent Index abstracting process, GQ-PAT is produced largely by machine extraction techniques; this leads to differences in record content, quality and timeliness.  Because the GQ-PAT records are produced by a less-laborious machine extraction process, users can expect GQ-PAT to contain more timely content than GeneSeq.  However, it is also reasonable to assume that GQ-PAT records on average will likely be more error-prone than GeneSeq records.

It’s important to note that one genetic sequence searching tool should never be used exclusively over another (this is also true for chemical structure search tools, but that’s a post for another day). Building a searchable sequence database or chemical structure database always requires a human indexer (or a machine indexer that has been “taught” by humans), and it always requires indexing policy decisions and judgment calls. Therefore, these databases can yield different answers to the same query, and each should be considered unique. In fact, the GeneSeq database can be loaded onto GenomeQuest’s platform through a separate subscription agreement with Thomson Reuters, promoting the side-by-side use of both databases.
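To make the "different answers to the same query" point concrete, here is a minimal Python sketch of how a searcher might compare the hit lists exported from two sequence databases. The publication numbers and the `compare_hits` helper are made up for illustration; they do not come from GQ-PAT, GeneSeq, or any vendor's export format.

```python
def normalize_pub_number(pub_number):
    """Uppercase a publication number and drop spaces so variants match."""
    return pub_number.replace(" ", "").upper()


def compare_hits(hits_a, hits_b):
    """Split two hit lists into shared hits and hits unique to each source."""
    a = {normalize_pub_number(p) for p in hits_a}
    b = {normalize_pub_number(p) for p in hits_b}
    return {"both": a & b, "only_a": a - b, "only_b": b - a}


# Hypothetical hit lists for the same query sequence (invented numbers).
database_a_hits = ["US1234567", "WO2009000001", "EP1000001"]
database_b_hits = ["us 1234567", "JP2008000001"]

summary = compare_hits(database_a_hits, database_b_hits)
for label, pubs in summary.items():
    print(label, sorted(pubs))
```

Anything that lands in `only_a` or `only_b` is a document that a single-database search would have missed, which is the practical argument for running both.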

Many sequence searchers currently access GeneSeq through the STN platform (where it is called DGENE). This blog post would not be complete without a mention of the other sequence searching databases hosted on STN, including USGENE, PCTGEN and CAS REGISTRY files. There are hundreds of pages of documentation on these files and on the most effective way to search sequences in each of them. For a quick primer, it’s always useful to glance at the CAS or STN International websites, where workshop manuals and other documentation can be found.

Share your sequence searching insights in the comments!



This post was edited by Intellogist Team member Kristin Whitman.


9 Responses

  1. Thank you for your description, Kristin.

    I am particularly interested in a single sequence search tool that allows me to search for both patent and non-patent articles. (One can do that in SciFinder using a chemical structure.) However, a GeneSeq/GenomeQuest sequence search takes you only to the relevant patent documents. At most, it will also take you to related entries in public sequence databases, but it does not take you to non-patent references relating to a particular gene/polypeptide. Do you know of any database where non-patent references have been curated such that you can search them using a sequence?

    Many thanks,
    Jaimin

  2. Hi Jaimin,

    Thanks for your question! If you’re not already sequence searching the CAS REGISTRY file through STN, that would be my first recommendation. CAS indexes chemical structures and sequences in both the journal literature and patent documentation, although there are some exceptions to what they index when it comes to journal articles. A current explanation of their indexing policies is available here:

    http://www.cas.org/expertise/cascontent/registry/sequences.html

    REGISTRY is a file which contains only chemical structures and sequence data. However, once you have found sequences of interest in REGISTRY, you would be able to cross reference that search into CAplus, which is the database which contains the patent and journal records indexed by CAS.

    STN is great because they make a wide range of user documentation and tutorials available. You may want to check out this tutorial on how to do a BLAST search in REGISTRY – it was presented in March of 2010 (click on Part IV for the PDF):

    http://www.stn-international.de/sequence_searching.html

    Here is another directory of tutorials on searching sequences in REGISTRY, most of which appear fairly recent:

    http://www.stn-international.de/stn_biosequencesearching_cas_reg.html?&L=%27

    I don’t have much experience with SciFinder since it is a tool marketed exclusively to academic institutions, although I did use it a bit in college. I believe SciFinder is able to access REGISTRY and CAplus – my first thought would be for you to ask your chemistry librarian or to contact your CAS representative to find out more about how you can do that from within your institution.

    Was this the kind of information you were looking for?

  3. Hi, Thanks for nice article,
    I am looking for patent landscaping software. Please let me know which is the best one. (Aureka is very expensive, and so is Thomson Innovation.)
    I need to generate pictorial representation of the landscaping report. Please help.

    Thank you

  4. Hi Kristin!

    Thanks for the updates.

    Yes, I believe one can do a sequence search in REGISTRY and CAplus using SciFinder. However, is there a way to verify that this would be a no-stone-unturned search? If not, what other databases could be used to complement SciFinder’s? Would a GeneSeq/GenomeQuest search be complementary, or is a SciFinder search sufficient?

    Best regards,
    Jaimin

  5. Hi Jaimin,

    GenomeQuest (including GQ-PAT and public database sources) and GeneSeq searches would both be important in addition to a REGISTRY/CAplus search. While there is going to be some overlap between the three collections, each should be considered a unique sequence searching resource.

    There are two major reasons for this. First, the data for each resource is sourced independently – CAplus is produced by CAS, GeneSeq is produced by Thomson Reuters, and GQ-PAT is produced by GenomeQuest. So the data sources used by each of these three entities will vary.

    CAS REGISTRY sequence sources include over 3,000 life science journals and patents from 60 patent authorities, as well as NCBI GenBank. GeneSeq includes sequences from the (Derwent) basic family members of over 40 patenting authorities. While there may be some overlap between GeneSeq and CAplus/REGISTRY, the differences in the way these two companies index their sequence data mean that the same search conducted separately in each database could yield different results. GQ-PAT includes extracted sequences from the US, EP, WO/PCT and the DNA Databank of Japan. Although there may again be some overlap, the machine extraction algorithm used by GenomeQuest may be able to handle a higher volume of sequence extraction than the editorial indexing process used by the other two producers, and to perform extractions in a more timely manner.

    Next, the indexing policies for each collection will be different, and could cause considerable differences in retrieval. When information records are created by human editors, there is always a decision-making process that determines what data goes into a record and what is kept out. On the other hand, when records are created wholly by machine extraction (such as some of those in GQ-PAT), there is always the chance that the machine extraction process will introduce errors. All of these indexing methods are controlled to be as consistent and error-free as possible; however, errors and inconsistencies are inevitably introduced into the data.

    For a no-stone-unturned approach, I have to recommend searching all available sources of sequence data. If this isn’t possible, a careful consideration of indexing policies and practices should be made to select the best resource based on the needs of the study. Some indexing practice information is available for all three search providers in their respective documentation. Or feel free to submit your specific needs in the Intellogist forum (http://www.intellogist.com/wiki/Special:AWCforum) and hear what the community recommends!

    You may also be interested in this article from 2008 in World Patent Information, which includes a comparison of available databases. GQ-PAT is not considered in the study, but it gives an excellent overview of CAS REGISTRY and GeneSeq. There is a Venn diagram in the article comparing CAS REGISTRY results to GeneSeq results for the same sequence search. While there is some overlap, there are also significant differences between the two result sets.

    http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V5D-4SNGRDC-1&_user=10&_coverDate=12%2F31%2F2008&_rdoc=1&_fmt=high&_orig=search&_origin=search&_sort=d&_docanchor=&view=c&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=14195759003eabcccebd9b380a0f09ce&searchtype=a

  6. Hi Vinay,

    Thanks so much for your question. There are many visualization tools out there. I recommend picking one that offers lots of automated features but also will allow you to “clean” the underlying data, for example, to recognize that two entities are really the same even though they might contain spelling variations or other variations in form. For example, if you were graphing documents by assignee, you wouldn’t want one column for “Sony” and another column for “Sony, Inc.” A system that would help you recognize that sort of problem with the underlying data and fix it is very helpful for patent analysis.
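    As a rough illustration of that kind of data cleaning, here is a small Python sketch that folds assignee name variants together before counting documents. The suffix list and the `normalize_assignee` helper are assumptions made for this example only; this is not how VantagePoint or any other tool mentioned here works internally.

```python
import re
from collections import Counter

# Illustrative list of corporate suffixes to ignore; a real list would be longer.
SUFFIX_PATTERN = re.compile(r"\b(inc|corp|corporation|co|ltd|llc|gmbh)\b")


def normalize_assignee(name):
    """Lowercase a name, drop punctuation and common corporate suffixes."""
    cleaned = name.lower()
    cleaned = re.sub(r"[.,]", " ", cleaned)
    cleaned = SUFFIX_PATTERN.sub(" ", cleaned)
    return " ".join(cleaned.split())


def count_by_assignee(names):
    """Count documents per normalized assignee name."""
    return Counter(normalize_assignee(n) for n in names)


# Made-up assignee strings as they might appear in raw patent records.
raw_assignees = ["Sony", "Sony, Inc.", "SONY CORP", "Acme Ltd."]
counts = count_by_assignee(raw_assignees)
print(counts)
```

    With this cleanup, "Sony" and "Sony, Inc." end up in the same column of a chart. Simple suffix stripping will not catch misspellings or renamed companies, which is where fuzzy matching and manual review still come in.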

    For starters I would recommend taking a look at what VantagePoint can do. VantagePoint offers lots of automated data cleaning capability (for example, it can use filters and fuzzy logic to make matches between variants of company names). The Intellogist full report on VantagePoint is located at

    http://www.intellogist.com/wiki/Report:VantagePoint

    There are many patent search tools that offer some visualization capability. PatBase has an analysis module, and orbit.com offers an analysis module created by a company called Intellixir (http://www.intellogist.com/wiki/Intellixir). It sounds like you’re already aware of Thomson products. Innography is another tool that has garnered lots of attention recently for its incorporation of business and financial data and its automated assignee prediction capabilities. http://www.intellogist.com/wiki/Innography

    Intellogist does offer a consulting service for a fee – please feel free to contact us (http://www.intellogist.com/wiki/Special:Contact) if we can be of further service with this request.

    I hope this helps!

    Thanks
    Kristin

  7. [...] that CAS also applies corrections to their incarnation of their INPADOCDB database.  We know that GenomeQuest manually corrects sequence data for some low-quality OCR’ed documents, and I believe I have [...]

  8. [...] the Intellogist blog discussed GenomeQuest as a source for searching patent sequence data. This month GenomeQuest announced that they are [...]

  9. [...] subscription-based systems and databases that support sequence searching in patent collections: GenomeQuest, DGENE, USGENE, and PCTGENE.  The  article also mentions that the free NCBI network hosts a [...]
