Semantic searching is nothing new: many patent (like TotalPatent) and non-patent literature (like ProQuest Dialog) search systems provide some form of semantic search tool which allows users to find relevant documents based on the contextual meaning of their query. So how does the new semantic search technology Sophia separate itself from the pack? A blog post by Chris Horn does an excellent job of describing the unique capabilities of Sophia:
Sophia builds its indexes over a given corpus of unstructured data and documents, and then automatically clusters together documents and data on related themes. You can then use the search interface to browse and explore. Sophia will bring to your attention related documents even if they don’t explicitly match your chosen search terms.
Sophia doesn’t just find thematically similar documents; it also suggests new terms and themes to expand the search. Sophia can be used to analyze and search any file of documents, including both patent and non-patent literature. Continue reading as we look at two Sophia products, the Sophia Digital Librarian and Sophia Search, and discuss how these semantic search tools can aid prior art searchers, institutions with in-house patent collections, and patent analysts!
Semantic Metadata with Sophia Digital Librarian
The Sophia Digital Librarian creates semantic metadata based on any corpus of documents. An example of a document corpus analyzed by the Digital Librarian is described in Chris Horn’s blog post:
As a test and online demonstration, the New York Times archives annotated corpus (with permission) containing 1.8M documents, from January 1, 1987 and June 19, 2007 are one online example available from Sophia.
In the screenshots below, the metadata is based on this example database of New York Times articles. The user enters the text of any document into the search form, and the Digital Librarian will automatically generate a set of metadata tags (using a proprietary algorithm) based on relevant documents and themes in the New York Times corpus.
Chris Horn explains how each type of metadata tag is related to the original document:
The Document Tags are tags which Sophia has found in the given document. The Semantic Tags are a list of other tags which Sophia believes are relevant, even though they do not explicitly appear in the text. The Neighbors list is a list of titles of specific (in this case) NYTimes articles which Sophia believes are relevant to the given text.
The Sophia Digital Librarian generates a list of thematically similar documents, as well as a list of thematically relevant terms for any given record. The Digital Librarian may not be particularly useful for prior art searchers, since the user will first need an entire file of data to be indexed and analyzed. The Digital Librarian may prove very useful to a patent analyst, however, who already has a large collection of patent documents compiled. The analyst can use the Digital Librarian to gain new insights on how a particular patent within a document collection relates to the other patents. The Digital Librarian may also identify useful thematic terms that indicate a relevant technology area in which the individual patent or document set falls. Finally, the Digital Librarian would be an effective tool for any institution with an in-house patent collection that needs to be regularly searched and analyzed. The metadata produced by the Digital Librarian would allow users of the patent collection to search for and identify unique relationships and themes that they may have otherwise overlooked.
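To make the three tag types more concrete, here is a minimal Python sketch of the distinction between them. It is not Sophia's method: the actual algorithm is proprietary, and this toy version swaps in plain word counts and cosine similarity. The function names, stopword list, and three-document corpus are all made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, just enough for the toy example below.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "with", "is", "on"}


def tokenize(text):
    """Lowercase the text, keep alphabetic words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def describe(text, corpus, k=2):
    """Return toy versions of the three metadata lists for `text`.

    ASSUMPTION: word overlap stands in for Sophia's proprietary analysis.
    """
    query = Counter(tokenize(text))
    vectors = {title: Counter(tokenize(body)) for title, body in corpus.items()}
    vocabulary = set().union(*vectors.values())

    # Document tags: corpus terms that literally appear in the given text.
    document_tags = sorted(t for t in query if t in vocabulary)

    # Neighbors: the k most similar corpus documents with non-zero similarity.
    ranked = sorted(vectors, key=lambda t: cosine(query, vectors[t]), reverse=True)
    neighbors = [t for t in ranked[:k] if cosine(query, vectors[t]) > 0]

    # Semantic tags: terms from the neighbors that do NOT appear in the text.
    semantic_tags = sorted({w for t in neighbors for w in vectors[t]} - set(query))

    return {
        "document_tags": document_tags,
        "semantic_tags": semantic_tags,
        "neighbors": neighbors,
    }


# Made-up corpus standing in for an indexed document collection.
corpus = {
    "Solar panel coatings": "solar panel coating improves photovoltaic efficiency",
    "Battery storage for solar farms": "battery storage smooths solar farm output",
    "Bread baking basics": "yeast flour and water make bread",
}

result = describe("a new solar panel with higher efficiency", corpus)
print(result)
```

In this toy run, "photovoltaic" ends up as a semantic tag because it appears in a neighboring document but not in the query text, which mirrors the idea of tags that are relevant even though they do not explicitly appear in the document.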
Find Themes and Similar Documents with Sophia Search
While the Sophia Digital Librarian is useful for the patent analyst, the Sophia Search interface will definitely be useful for the prior art searcher. Users can register for free on the Sophia website to search “core data repositories” (Medline and US Patents) on the Sophia Search platform on a trial basis. The product is currently at version 1.2, and the company will eventually release new data files on the Sophia platform that will be accessible on a subscription basis. Note that this search site and its searchable data files are for demonstration purposes only and therefore may not be up-to-date with the latest abstracts. According to a representative from Sophia, they “are looking to partner with companies who wish to host a full-text patent search service leveraging our products.”
After creating a free trial account, a user can choose to search either within the Medline or US Patents file. The screenshots below illustrate a “Search by Example” within the US Patents file. Users can conduct a simple or advanced keyword search through the Sophia interface, but the “Search by Example” feature is the highlight of the platform. Users can upload a Word or text version of a document onto the Sophia platform (PDF files weren’t accepted) through the “Browse” button beside the search form.
After uploading the document and selecting “Search,” the system will generate a list of thematic folders. Each folder is labeled with a list of relevant terms and phrases.
Within each folder, the user can view:
- A cloud of relevant terms within this folder of documents.
- “Key Documents,” which are the most closely related results in this particular thematic folder.
- “Related Documents,” which are less directly relevant but contain similar concepts.
If a searcher has one key document from which they would like to expand their search, then “Search by Example” is a valuable tool for both locating similar documents (based around multiple themes within the original document) and identifying additional terms with which to expand the search queries.
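The folder layout described above (a term cloud, “Key Documents,” and “Related Documents” per theme) can be pictured with a small Python sketch. This is only an illustration under loud assumptions: Sophia derives its themes automatically with a proprietary algorithm, whereas here the themes are hand-picked seed terms, relevance is simple Jaccard word overlap, and the `Folder` class, `build_folders` function, and `key_threshold` value are all hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


def terms(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap between two term sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class Folder:
    """Hypothetical stand-in for one thematic folder in the results."""
    label: str
    term_cloud: list = field(default_factory=list)
    key_documents: list = field(default_factory=list)
    related_documents: list = field(default_factory=list)


def build_folders(example, corpus, themes, key_threshold=0.2):
    """Group corpus documents into folders and split them by similarity.

    ASSUMPTION: `themes` maps a folder label to hand-picked seed terms;
    the real product discovers themes on its own.
    """
    example_terms = terms(example)
    folders = []
    for label, seed_terms in themes.items():
        folder = Folder(label=label)
        counts = Counter()
        for title, body in corpus.items():
            doc_terms = terms(body)
            if not doc_terms & seed_terms:
                continue  # document does not touch this theme
            counts.update(doc_terms)
            # Closely matching documents are "key"; the rest are "related".
            if jaccard(example_terms, doc_terms) >= key_threshold:
                folder.key_documents.append(title)
            else:
                folder.related_documents.append(title)
        # Term cloud: the most frequent terms across the folder's documents.
        folder.term_cloud = [t for t, _ in counts.most_common(5)]
        if folder.key_documents or folder.related_documents:
            folders.append(folder)
    return folders


# Made-up example document, corpus, and themes.
example = "rotor blade for wind turbine with ice sensor"
corpus = {
    "Wind turbine rotor blade": "wind turbine rotor blade with heating element",
    "Helicopter rotor hub": "helicopter rotor hub assembly",
    "Ice detection sensor": "optical sensor for ice detection on aircraft",
}
themes = {"rotor, blade": {"rotor", "blade"}, "ice, sensor": {"ice", "sensor"}}

for folder in build_folders(example, corpus, themes):
    print(folder.label, folder.key_documents, folder.related_documents)
```

The point of the sketch is the shape of the output rather than the scoring: a single uploaded document fans out into several folders, each of which offers both closely matching documents and new vocabulary for follow-up queries.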
The platform does have some glitches. I was unable to access the help file, for example (this bug has been fixed), and the “Search by Example” form doesn’t explicitly state which document file types are accepted by the platform. Through trial and error, I discovered that .TXT and .DOC files were accepted, but the system rejected PDF files. Another issue I noticed was that if the user utilizes the “Advanced Search” options, which appear below both the keyword and “Search by Example” forms, then the system will either slow down significantly while processing results or produce no results at all. It was unclear whether the lack of results during multiple tests with both date and keyword limitations was the result of a system glitch or simply because there were no possible results for that query.
The Sophia search technology may be useful to both patent analysts and prior art searchers through two different products. The Sophia Digital Librarian provides semantic metadata tags based on a specific document collection, and these tags can potentially offer new insights to patent analysts or institutions with in-house patent collections who are looking for unique relationships or technology areas within a patent document set. The Sophia Search platform offers a unique tool that allows users to upload a document and find relevant documents and search terms organized into thematically unique folders. The platform currently hosts only demonstration search files, so the data that users retrieve during their free trial may not be up-to-date. Overall, the Sophia search and metadata products are promising new semantic analysis tools that prior art searchers and analysts will want to test and integrate into their toolbox.
Do you know of any unique semantic analysis search tools that may be useful to prior art searchers? Let us know in the comments!
This post was contributed by Joelle Mornini. The Intellogist blog is provided for free by Intellogist’s parent company Landon IP, a major provider of patent searches, trademark searches, technical translations, and information retrieval services.