Recently I was able to attend a demo of the GenomeQuest sequence searching tool, which is designed to support sequence searching for prior art investigations. GenomeQuest provides access to proprietary patent database collections which have been indexed especially for sequence searching, as well as to public access databases of genetic and protein sequences.
One of GenomeQuest’s most notable databases, GQ-PAT, contains a proprietary collection of nucleotide and protein sequences extracted from patent collections, including the US, EPO, WO/PCT, and the DNA Databank of Japan (where the JPO deposits patents that contain sequences). Because some WO/PCT documents are only available as images and not as electronic text, GenomeQuest employs an in-house Optical Character Recognition (OCR) process that can involve human editing with the assistance of a related machine-readable document, such as a US family member. The patents in GQ-PAT are also supplemented by corresponding INPADOC records so that their legal status and assignee information stays up to date with this source.
This database differs from the Thomson Reuters database of genetic material, GeneSeq, in a number of notable ways. First, GQ-PAT draws mainly from the US, EPO, WO/PCT, and JPO databases, while GeneSeq is based on the Derwent World Patent Index and contains sequence records from Derwent “basic” patent documents published by over 40 patent offices (for more about Derwent “basics,” see the Intellogist page on DWPI patent families). Next, while GeneSeq records are hand indexed as part of the Derwent World Patent Index abstracting process, GQ-PAT is produced largely by machine extraction techniques; this leads to differences in record content, quality and timeliness. Because the GQ-PAT records are produced by a less laborious machine extraction process, users can expect GQ-PAT to contain more timely content than GeneSeq. However, it is also reasonable to assume that GQ-PAT records will, on average, be more error-prone than GeneSeq records.
It’s important to note that no single genetic sequence searching tool should be used to the exclusion of the others (this is also true for chemical structure search tools, but that’s a post for another day). Building a searchable sequence database or chemical structure database always requires a human indexer (or a machine indexer that has been “taught” by humans), and requires indexing policy decisions and judgment calls. Therefore, these databases can yield different answers to the same query, and each should be considered unique. In fact, the GeneSeq database can be loaded onto GenomeQuest’s platform through a separate subscription agreement with Thomson Reuters, promoting the side-by-side use of both databases.
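To make the side-by-side idea concrete, here is a minimal Python sketch of how a searcher might compare the hit lists that two databases return for the same query. It is not a feature of GenomeQuest or STN. The function name `compare_hits` and the document identifiers are made up for illustration, and the sketch assumes both result sets have already been exported as lists of patent numbers in the same format.

```python
def compare_hits(hits_a, hits_b):
    """Compare two collections of patent numbers returned for the same query."""
    a, b = set(hits_a), set(hits_b)
    return {
        "both": sorted(a & b),      # found by both databases
        "only_a": sorted(a - b),    # found only by the first database
        "only_b": sorted(b - a),    # found only by the second database
        "combined": sorted(a | b),  # everything worth reviewing
    }


# Made-up placeholder identifiers, not real patent numbers or search results.
gq_pat_hits = ["DOC-001", "DOC-002", "DOC-003"]
geneseq_hits = ["DOC-002", "DOC-003", "DOC-004"]

report = compare_hits(gq_pat_hits, geneseq_hits)
for label, docs in report.items():
    print(label, docs)
```

The documents that appear in only one list are the ones a searcher would miss by relying on a single database, which is why they are reported separately from the overlap.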
Many sequence searchers currently access GeneSeq through the STN platform (where it is called DGENE). This blog post would not be complete without a mention of the other sequence searching databases hosted on STN, including the USGENE, PCTGEN and CAS REGISTRY files. There are hundreds of pages of documentation on these files and on the most effective way to perform sequence searching in each of them. For a quick primer, it’s always useful to take a glance at the CAS or STN International websites, where workshop manuals and other documentation materials can be found.
Share your sequence searching insights in the comments!
This post was edited by Intellogist Team member Kristin Whitman.