Xyggy and the Golden Egg

Many searchers wish that there was a simple patent searching tool that would spit out a series of neatly ranked, organized, and relevant results with the push of a single button. A veritable Golden Egg Laying Goose that could solve our invalidity searches with a few keystrokes so that we might have the rest of the morning to sip on Earl Grey and read patent blogs (no doubt, a blog that might inform us of Japanese Patent Data availability).

Unfortunately, no such tool exists. Despite the patent search industry striving to make searching simpler and more useful with less input and time investment, we’re not yet at the point where patent searchers can be replaced by robots—unlike that Kids in the Hall sketch wherein workers are replaced by robots in the satirical “arms-in-the-fish-vat” industry.

However, a number of patent search systems prioritize making searching as intuitive and simple as possible, including IPInquest from IPCentury (formerly DECOPA/DECOPAnet), TotalPatent’s Semantic Search, and PatentSurf. Meanwhile, Google Patents and Xyggy Patent provide new and simple interface designs. Xyggy further asserts an underlying “item search” technology that is briefly detailed on the Xyggy homepage and promises to go beyond traditional text-based search.

Xyggy, in particular, has interested me lately, and I have done some testing of its “item search” technology. Piqued by the discussion over at the PIUG Wiki, I attempted to integrate Xyggy into my daily patent search activities. The results, while not overwhelming, were encouraging. One of the touted features of Xyggy is that the search engine can take a granted US patent number, analyze the contents of the patent, and find similar patents. I fed Xyggy a granted US patent on which one of my colleagues had done a search, then checked the results for the references deemed “central” at the conclusion of that search. Xyggy found only one of the several patents cited in our report within the first 5 pages of results (first 55 hits), but it found many patents in the general subject area of the original patent. In the hands of an experienced searcher, such results could then be analyzed and transformed into a series of secondary searches, such as classification and citation searches, in a more thorough search system with those capabilities.

Xyggy is not without its issues, as Edlyn Simmons and Aleksandr Belinskiy point out in the previously referenced PIUG Wiki thread, including the inability to specify and demand exact (or branching) matches when necessary, such as in chemical searching. A further issue is the proprietary “black box” nature of Xyggy. “Item search” may be “a new search category that lies at the opposite end of the information retrieval spectrum from text-search,” as Xyggy representative Dinesh Vadhia explains, but exactly what that means is hard to grasp without any of the system’s blueprints (which are understandably kept under lock and key).

Xyggy may not be a Golden Egg Laying Goose, but since the search interface is quite nimble (drag-and-drop and search term toggling bring up new result sets in a couple of seconds), it seems worth the time of any patent searcher to at least give it a spin and see if it fits their individual searching methods and style.

Have any readers given Xyggy a try? Do you have any other experiences with “non-traditional” patent search engines? We would love to hear about them in the comments below.

Read an update here.

Patent Searches from Landon IP

This post was contributed by Intellogist team member Chris Jagalla.


22 Responses

  1. Chris, I also agree with your point about the mysterious nature of “item search.” My first reaction was, “does patent data have some kind of hidden non-textual metadata properties that I am not aware of”?

    I guess we have to accept that the technology behind Xyggy is a secret for now, but since they have not announced some kind of new breakthrough technology for search in general, I’m still puzzled as to what they’re doing.

    However, as you said, if it’s a fun and useful supplement to current search methods, people will stick with it!

  2. I believe MCAM Doors has a similar “whole of document” approach, where it searches concepts that are supposedly impossible for our puny minds to comprehend, or realises that the differently worded claim in specification A relates quite strongly to another differently worded claim in specification B.
    I have personally had the opportunity to go up against an MCAM Doors search and smashed it to pieces. It may have taken me a little longer but I found more relevant results.
    Everything we do has to go through the necktop computer at some stage and no search algorithm is going to match it.

    • Insomniac,

      I really like the way you put it; a “necktop computer” is exactly what it is!

      I am always dubious of methods or systems that cannot adequately be explained, but that doesn’t necessarily mean such systems are useless. The best way to look at it is that the “proof is in the pudding.”

      You tested yourself against MCAM Doors and found it was lacking. On the other hand, even though I don’t know the algorithm behind the Google web search engine, I can fairly say that it serves my needs quite well!

      • as good as the description is, it isn’t mine. i first heard it from the jeweller/entrepreneur Michael Hill.
        i hear what you say. i guess my experiences were based on the assumption by MCAM that patent examiners weren’t capable of doing their job regards searching, and that Doors was going to take over and save the world in that regard. they were overselling their product.

  3. If it sounds too good to be true, it probably is, I suppose!

  4. @ Chris & Kristin

    I really appreciate your warm enthusiasm, and I want to explain item-search so that it becomes clear what it is, but I cannot divulge the inner workings of the algorithms. The search technology behind Xyggy is a breakthrough and quite remarkable. That said, why don’t we engage in an interactive discussion driven forward by your questions?

    First, you may be surprised to learn that the current release of Xyggy Patent is based on US granted bibliographic data only. Once we include full patent data in the indexes you will see the relevance of the results improve further.

    As an example of item-search, consider a Xyggy image search service. The images (labelled or unlabelled) are represented by low-level features such as color and texture. The algorithm scores the similarity of the images by their features. Similarity between one item and another item is very hard to define, but if you have a set of items, then from the common features the algorithm automatically learns how to compare items. For example, pictures of a red car and blue car have car features in common, whereas pictures of a red car and tomato have red-ness in common.
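    As a toy illustration of that idea (my own sketch, not Xyggy’s actual algorithm), one simple way to “learn what a query set has in common” is to weight each feature by how consistently it appears across the example items, then score candidates by those weights:

    ```python
    from collections import Counter

    def common_feature_weights(example_items):
        """Weight each feature by the fraction of example items containing it.
        Features shared by all examples (e.g. 'car') get weight 1.0; features
        unique to one example (e.g. 'red') get a lower weight."""
        n = len(example_items)
        counts = Counter(f for item in example_items for f in set(item))
        return {f: c / n for f, c in counts.items()}

    def score(candidate, weights):
        """Sum the learned weights of the features the candidate exhibits."""
        return sum(weights.get(f, 0.0) for f in set(candidate))

    # A red car and a blue car share "car"-ness; a red car and a tomato
    # share only "red"-ness, mirroring the example above.
    red_car = {"red", "car", "wheels"}
    blue_car = {"blue", "car", "wheels"}
    tomato = {"red", "round", "food"}

    w = common_feature_weights([red_car, blue_car])
    green_car = {"green", "car", "wheels"}
    print(score(green_car, w), score(tomato, w))  # 2.0 0.5
    ```

    With a query set of two cars, a third car outscores the tomato because the weights concentrate on the features the query items share.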

    Xyggy is able to apply this methodology to any data type including web pages, simple to complex documents, images, social profiles, video, ads, audio and so on. Each search service for a specific data type is distinguished by the features encoded. The algorithm is based on sound statistical theory and machine learning methods.

    Look forward to your questions.


    • Dinesh,

      First of all, thank you for your contribution to the discussion surrounding Xyggy on PIUG, through e-mail, and now on the Intellogist blog. That you are so willing to engage in a discussion about your product is very welcome!

      Secondly, I look forward to the addition of patent data beyond US granted patents, as I think this will improve the results, as you mention.

      Thirdly, is there a beta test site where readers could check out Xyggy image search? In my research I noticed that there used to be other example Xyggy search engines besides patents, but it appears that patents are the only currently available engine through the Xyggy site.

      Does the image search bear any resemblance to TinEye (http://www.tineye.com)?

      Based on your description of Xyggy it sounds like there is some kind of neural net based learning mechanism (such as discussed here: http://spie.org/x39216.xml?highlight=x2412&ArticleID=x39216). Is this accurate to say?


      Chris Jagalla

      • Chris

        It is a pleasure to contribute to the discussions as I really want people to understand the difference between item-search and text-search. I feel that there is still a way to go in the patent search community and I’m looking towards members of the PIUG and Intellogist to keep asking questions until the penny drops.

        I also need to address Edlyn Simmons’s comments, and maybe you could help me understand her concerns first.

        In our everyday lives, we constantly search for and find things (items) and do it remarkably well. Xyggy is introducing item-search into our digital lives.

        To clarify, the current Xyggy Patent service uses US granted patent “bibliographic” data. Once full patent data is included in our indexes, the relevance of the results will improve significantly.

        Over the past year we built a number of demos using different data types including images, music (last.fm), movies (Netflix), news articles (New York Times articles) and legal cases. They all used a previous interface and we are re-working some of them with the new interactive drag and drop interface. Specifically, expect to see a new image search demo using flickr photos soon.

        Xyggy Image search is different to TinEye. Like with Xyggy Patent, you will be able to initiate a search with keywords to find an initial set of similar images and then drag one or more images into the search box to improve relevance.

        Xyggy is not based on neural networks but a novel machine learning method.

  5. I will be very interested to see the image search demo. Since you understandably can’t disclose the technology/methods behind Xyggy, I think all we can do is express our interest and try out the latest Xyggy product on face value. Ease of use and results trump all in the product world : )

  6. Chris

    The technology behind Xyggy is discussed here: http://thenoisychannel.com/2010/04/04/guest-post-information-retrieval-using-a-bayesian-model-of-learning-and-generalization/

    Look forward to answering questions based on the blog post.
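    The linked post describes Bayesian Sets (Ghahramani & Heller), which scores each candidate by how much more probable its features are under the query set’s model than under a background model. As a minimal sketch of the published scoring rule for binary features (my simplified reading, not Xyggy’s code): the score is linear in the candidate’s features, so ranking reduces to a sparse dot product with per-feature weights learned from the query set.

    ```python
    import math

    def bayesian_sets_weights(query_set, n_features, alpha=None, beta=None):
        """Per-feature log-weights from the Bayesian Sets score for binary
        feature vectors with Beta(alpha, beta) priors. Each item in the
        query set is a set of active feature indices."""
        n = len(query_set)
        alpha = alpha or [1.0] * n_features
        beta = beta or [1.0] * n_features
        s = [0] * n_features          # s[j] = query items having feature j
        for x in query_set:
            for j in x:
                s[j] += 1
        return [math.log((alpha[j] + s[j]) / alpha[j])
                - math.log((beta[j] + n - s[j]) / beta[j])
                for j in range(n_features)]

    def rank(candidates, weights):
        """Sort candidates (sets of active feature indices), best first."""
        return sorted(candidates, key=lambda x: -sum(weights[j] for j in x))

    # Tiny demo: features 0-1 are shared by the query items; feature 3 is not.
    query = [{0, 1}, {0, 1, 2}]
    w = bayesian_sets_weights(query, n_features=4)
    print(rank([{2, 3}, {0, 1}], w))  # {0, 1} ranks first
    ```

    Features present in every query item get large positive weights, features absent from the query set get negative weights, and ranking a candidate only requires summing weights over its (sparse) active features.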

  7. Dinesh,

    I found the blog post to be very interesting! Bayesian Sets sound very promising. I have a couple questions and/or observations about the article:

    1. How are “features” defined? It seems to me that they are automatically generated by comparing two examples (finding common keywords or phrases, in text documents). I’m not sure how this fits in with the example of searching using “The Terminator” and “Titanic” to find they are both directed by James Cameron. Are features always automatically “discovered” through Bayesian Set testing/learning or can they be defined prior to a search (in case, say, one wanted to focus on The Terminator and Titanic’s box office gross instead of the James Cameron link)?

    2. In certain kinds of patent searches, “examples” (in terms of previously found patents) are not available. An analog of this is the beginning of a search where no “examples” are yet available to the searcher. Do you think that Xyggy is a good “first step” for patent searchers to consider, or would you view it as a secondary tool once patents of interest have already been found? Do you think this example extends to other areas of search?

    Thanks for letting me know about this article, it was quite illuminating.

  8. Chris

    Pleased to hear that the article was illuminating.

    1. Feature engineering
    The feature vector (schema) of an item type is defined beforehand during the data processing phase and before creating the search indexes. There can definitely be poorly designed feature vectors, but it is mostly a matter of common sense. There doesn’t need to be a huge amount of feature engineering as the relevant information simply needs to be clearly present in the feature vectors. For example, if you want to search for movies, it is useful to have information about actors if you expect your system to find movies with the same actor. If you represent pictures only with texture features, you can’t expect it to find images with the same colors and vice-versa. The main advantage is that it doesn’t really hurt to have too many features, if they are at least plausibly relevant to search. Depending on the application some clever and sophisticated features can also be created.

    I have always thought that if developers are given control of the feature engineering, it will open the door to countless item-search services because of developers’ creativity and needs. Xyggy is a new search tool and looking through this tool is like seeing a new world of search possibilities across all data types from text to non-text.

    1a. Text Documents
    For a set of simple text documents we can define the features to be counts of word occurrences. For each document, create a feature vector consisting of the number of occurrences of each word in the vocabulary. We now have feature vectors defined for each document. Note that each item’s feature vector will be sparse.

    The sophistication of a web page or document feature vector can be increased by including phrases, concepts and relationships, numbers, urls, patterns, tags/annotations and so on.
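    The word-count representation in 1a might look like this in practice (a generic sketch, not Xyggy’s indexing code); storing only the non-zero counts is what keeps each vector sparse relative to the full vocabulary:

    ```python
    import re
    from collections import Counter

    def feature_vector(document, vocabulary):
        """Sparse bag-of-words vector: word -> occurrence count. Only words
        that actually occur are stored, so the vector stays sparse."""
        words = re.findall(r"[a-z]+", document.lower())
        return {w: c for w, c in Counter(words).items() if w in vocabulary}

    vocab = {"patent", "search", "image", "similar", "results"}
    doc = "Patent search results: the search found similar patent results."
    vec = feature_vector(doc, vocab)
    print(vec)  # {'patent': 2, 'search': 2, 'results': 2, 'similar': 1}
    ```

    Out of a five-word vocabulary, only the four words present in the document are stored; for a real vocabulary of hundreds of thousands of words, a typical document touches only a tiny fraction of them.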

    1b. Movies
    How are features defined for movies? Consider the IMDb (http://www.imdb.com/) movie database. Without knowing the details of the IMDb database schema, a suggestion is to create a movie feature vector that maps to that schema.

    1c. Patents
    At its simplest we can define a feature vector of a patent to be the counts of word occurrences. We can increase the sophistication of the patent feature vector by including genetic sequence data and uniformly encoded chemical data. The Xyggy Patent service uses a sophisticated feature vector to define a patent.

    2. Initiating a Search
    An item-search can be initiated with just one item or a simple text-search facility can be used to provide an initial set of example items. For example, a New York Times reader could drag an article or image of interest into the search box to find other items of relevance. Though only one item is being dragged into the search box the results will still be superior to text-search. The reader can optionally drag other items in or out of the search box to improve relevance.

    An end-to-end, one-stop patent search service encompassing words, sequence data, chemical formulas amongst others can be built using Xyggy. Xyggy Patent is not that far away from delivering such a service and we would love to hear from companies who require such a service and would support us in building it out.

    Xyggy can also be used to build a patent image service using very simple image features. Similarly for a trademark image search service.

  9. […] and interesting conversation about the intricacies of Xyggy Patent search (and Xyggy in general) in my previous post about Xyggy. The comments on this post deserve your attention because they go above and beyond the scope of the […]

  10. Dinesh,

    You said:
    “1c. Patents
    At its simplest we can define a feature vector of a patent to be the counts of word occurrences. We can increase the sophistication of the patent feature vector by including genetic sequence data and uniformly encoded chemical data. The Xyggy Patent service uses a sophisticated feature vector to define a patent.”

    Would feature vectors in Xyggy Patent such as genetic sequence data and chemical data be created and maintained by subject experts in a similar fashion to the way Controlled Vocabulary is generated by subject experts in Derwent World Patents Index, for example?

    Is this an area you’re considering developing or is Xyggy Patent just a way to show the concepts behind Xyggy?

    • A controlled or consistent vocabulary specific to genetic sequence data, chemical data and others makes it easier to include the data in the feature vector of each patent. The number of features in a feature vector can be as small or as large as required, and in general the number of features per feature vector relative to the total vocabulary is small, i.e. usually very sparse. In Xyggy Patent we currently use bibliographic data only and took the creative step of using sub-vocabularies based on the key sections of the bibliographic data, and it appears to work quite well.

      The flexibility available in how feature vectors are defined and created is really one of the wonderful aspects of Xyggy and Bayesian Sets. I know that there are many patent search vendors who provide semantic, NLP and/or text-analytics services particularly in the pharmaceutical market but we believe it would be so much easier and deliver superior results if based on Xyggy and Bayesian Sets.

      To step back a little, Xyggy wants to build an online (cloud) platform to allow developers to build and deploy item-search services on behalf of organizations. Xyggy Patent demonstrates the concepts behind Xyggy but is also a useful service today. We would love to work with organizations or patent search vendors to take Xyggy Patent to the next stage.

      Does this answer your questions?

  11. Yes, that answers my question. It would be very interesting to see someone license the technology from you and build a full-featured search engine using your method.

    • Chris

      You raise an interesting point as it doesn’t require organizations to license our technology. Xyggy’s mission is to provide an online (cloud) platform for developers and organizations to build and deploy item-search services (of all kinds not just patents). What does that mean in practice?

      Say that Big Patent Company (BPC) wants to offer a Xyggy-enabled item-based patent search service to its users. Working with Xyggy, BPC would pre-process its patent data (also called the source data) to produce the feature vector data. If BPC wants to retain data privacy and security, then BPC can create the feature vectors itself with guidance from Xyggy. For additional security, the feature vector data can also be anonymized by BPC.

      The feature vector data, which would be very sparse and hence require a fraction of the storage space of the source data, is uploaded to the Xyggy site. From there, Xyggy will perform the final data processing before creating the indexes. The web site pages would be created by BPC developers with the option of using the Xyggy interactive drag and drop search box. A query from the BPC site is sent to Xyggy (consisting of unique ids) which will return the results (also unique ids) to BPC that are matched to the source data and the final patent search results displayed. Data updates would be a separate process run on some regular basis. That’s about it.
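      A hypothetical sketch of that id-based round trip (the names, “BPC,” and the mock service are all illustrative, not a real API): the vendor keeps its source data private and exchanges only opaque ids with the hosted search service, mapping results back to its own records for display.

      ```python
      def anonymize(source_data):
          """Map real patent numbers to opaque ids before uploading
          feature vectors, so the hosted service never sees source data."""
          to_opaque = {pid: f"item-{i}" for i, pid in enumerate(sorted(source_data))}
          from_opaque = {v: k for k, v in to_opaque.items()}
          return to_opaque, from_opaque

      def mock_search_service(query_ids, index):
          """Stand-in for the hosted service: receives opaque ids and
          returns opaque result ids (here, simply the other indexed items)."""
          return [i for i in index if i not in query_ids][:2]

      source = {"US7654321": "...full text...", "US7000001": "...", "US7000002": "..."}
      to_opaque, from_opaque = anonymize(source)

      # Query with one patent; the service works purely in opaque ids.
      result_ids = mock_search_service({to_opaque["US7654321"]}, list(from_opaque))
      # BPC maps the returned ids back to its own records for display.
      results = [from_opaque[i] for i in result_ids]
      print(results)  # ['US7000001', 'US7000002']
      ```

      The point of the sketch is the separation of concerns: similarity scoring happens on the hosted side over anonymized feature data, while the mapping between ids and real records never leaves the vendor.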

      My suggestion to organizations and existing patent search vendors is to try out Xyggy by initiating a prototype using their own data as we are setup to turn around prototypes quite quickly. A prototype with a reasonable amount of data will quickly demonstrate the value of a Xyggy enabled item-search service.

      Hope that was useful.

  12. It sounds like it might work well within the framework of an existing search system, similar to how semantic searching is a “value added” search technique in TotalPatent.

  13. Yes, exactly. That was what I was thinking, because then it makes it easier for a vendor to measure how well it is working with respect to existing search offerings.

    I forgot to add that for organizations it is not just patents but technical documents and again Xyggy would unearth similar documents.

    Which leads nicely onto journal publishers. Wouldn’t it be useful to find similar academic papers instead of just the most cited?

  14. […] of text and create a proprietary search query based on “document vectors.” Similar to Xyggy and TotalPatent, CPA Global promises that white papers explaining the dynamics under the hood of […]

  15. […] Here at the Intellogist® Blog we’ve looked at new search technology this year including Xyggy (also see this follow-up) and IBM’s Watson (which is set to play Jeopardy next […]

  16. […] to manual prior art searching.  Of course, the “John Henry” of prior art searching claims that he (or she) “smashed it to pieces” in a head-to-head test using only a “necktop […]
