Warning: your electronic patent search databases have gaps!


UPDATE: For a further enlightening discussion of the gaps in the USPTO full text database, please see the comment section of this post (click the word “Comments” where it appears at the very end of this post).

Recently, a message came over Carl Oppedahl’s PAIR discussion list highlighting a mysterious gap in the USPTO’s online patent database: data seemed to be missing for patent numbers between 6,363,527 and 6,412,112.

Rick Neifeld, of Neifeld IP Law, responded that his 1999 survey of the USPTO’s data revealed many errors, as many of us have probably suspected for some time. Rick’s description of these errors is very interesting:

The dirt ranged from things as minor as numerous misspellings of assignee names, or HTML pages noncompliant with HTML standards, to HTML text that could not be deconstructed into component sections due to HTML formatting errors, and assignment records that were combined, corrupt, or unreadable.

I absolutely expect our current crop of electronic patent databases to contain massive numbers of errors. We have to expect this, if only because of the sheer amount of information involved. Another reason might be that the economic model of patent data production does not really encourage the national patent offices to maintain high quality electronic patent data. There are millions and millions of patent documents pouring out of government-run institutions (without a profit motive for perfection), and errors are bound to be rampant.


Now, what do I really mean by “massive numbers” of errors? I wish I could come up with a percentage, and there might be a study or two out there that give estimates. But the takeaway point is that there are enough errors to be significant: enough to potentially impact your search. The fact that multiple patent documents were completely missing from the database should be enough to drive that point home.

Rick continued with the following:

I understand from numerous presentations by patent data vendors and my own work in that field, that one aspect of the value added by patent data vendors compared to the USPTO public databases is cleaned up data.  The vendors run routines on each new batch of patent data looking for and programmatically correcting if feasible, all sorts of irregularities.  So, if you have a high value project (and what project is not high value?), keep these facts in mind.

Now, the vendors that immediately come to my mind as actively correcting data by hand are IFI, CAS, and Thomson Reuters (the manually-edited DWPI and DGENE, for example). I think I have heard that CAS also applies corrections to its incarnation of the INPADOCDB database. We know that GenomeQuest manually corrects sequence data for some low-quality OCR’ed documents, and I believe I have heard that the World Intellectual Property Organization (WIPO) manually corrects the lowest-quality 2% of the documents it OCRs each week. This is just for starters; please chime in via the comment section if you’d like to add others.

To drive the point home, Roy Zimmermann of Medtronic added an even more distressing anecdote about a possible source of gaps in the database, generated by difficulties converting paper to electronic data:

Also, recall that when the PTO announced its intentions to throw out its hardcopy classified patent files, the “shoes,” a group of patent searchers that worked primarily using those manual resources in the PTO, NIPR, did samplings of the shoes compared to the online database contents, found substantial discrepancies, and brought a lawsuit to force the PTO to cease discarding the paper copies until it could assure completeness. As I remember, the discarding was halted for a few weeks, then resumed without any sufficient effort to demonstrate the online contents matched the paper contents. So while we may be concerned about gaps or omissions in the PTO databases, we shouldn’t be surprised by them. This is a known problem of long standing!

Another point Roy made is that the USPTO introduced free searchable patent data available only back to 1976 via its website, but it has long since produced a searchable backfile going back even further.  Early in my career I went through a period of obsessively reading PIUG listserv archives, and I remember that there was a huge push at one time to get this database to become available to the public for free.  Although the USPTO website only offers the data back to 1976, we now have the independently-created US full text database offered by Google Patents, and Google also hosts the USPTO data for bulk download.  But even with the complete backfiles, there’s no reason to believe that this collection is really *fully* complete.

As a nice conclusion to the story, the discussion continued over on the PIUG wiki, with PIUG Chair Tony Trippe contacting the always-available Commissioner Stoll for more information.  It seems that Commissioner Stoll was able to get a spotlight on this issue and publicized contact information for Larry Larsson, the point person for USPTO database questions.   This conclusion highlights the growing relationship that PIUG has with PTO leadership, which I think will benefit the entire searching community.


Do you have your own anecdotes of missing US patent data?  Also, what data correcting vendors did I miss?  Let me know in the comments!


This post was edited by Intellogist Team member Kristin Whitman.

11 Responses

  1. Google Patents also has problems in that:
    1. It’s unclear what is not included and how up to date the database is
    2. There are numerous image errors throughout the database. Google must use some sort of filter which overzealously removes lines. See, for example, the front page of this patent: http://www.google.com/patents?id=gFh7AAAAEBAJ&printsec=abstract&zoom=4#v=onepage&q&f=false
    3. Google uses OCR for ALL its patents, even those after 1976 where a more accurate transcription is available from the USPTO.
    4. Sometimes, for whatever reason, some of the patents do not have an OCR version and thus are not searchable in any manner, but it is unclear which ones they are. You can only discover that there is no text search when you click “plain text” and nothing shows up.

  2. The “complete” backfile of US patent full texts has not been updated since it was published. There is still only one file for 2010, and no files for patents issued after 5 Jan 2010.
    http://www.google.com/googlebooks/uspto-patents-grants-text.html#2010

    This collection is of limited utility if it is only updated annually.

  3. A, you are so right. Google Patents is certainly chock full of errors as well. But the good thing is that since the databases were independently created, they may not be the *same* errors as those in the USPTO website search.

    There is some more good discussion of Google Patents’ eccentricities on this older post from Intellogist,

    Google Patent Search Hiccups?

    And also on this thread from the Intellogist discussion forum:

    http://www.intellogist.com/wiki/Special:AWCforum/?action=st%2Fid25%2FGoogle_Patent_Search

    The point you make about the fact that they used OCR for the whole collection, not just for the pre-1976 collection, is a very important point also.

    I just read somewhere that it is almost a truism that any query you run in two different incarnations of the same information will almost always turn up some disparities in the result sets. Duplication is always a good idea in searching, and it is even more important in databases which have been indexed by chemical structure, genetic sequence, or other human-devised indexing schemes.

  4. Missing data for 50,000 patents seems unbelievable. Is the entire record missing or just partial data? Was it limited to PAIR? I was able to retrieve patents within the given range from the USPTO web databases. All appeared normal.

    You’re correct that patent offices don’t spend a lot of time correcting errors… they simply don’t have the resources to tackle this massive job. However, in my experience the EPO does a very good job of fixing errors in espacenet reported by users. The errors I have reported were corrected within a day or two.

    By the way, the old printed Official Gazette, patent indexes and paper patents were chock full of errors, too. I think the problem seems worse in databases only because it’s much easier to spot the errors.

  5. Hi Michael,

    My understanding is that the missing data was not in PAIR at all, but rather in the PAT-FT patent database on the USPTO website.

    I am glad you were able to pull up the patents from the website, although I do know that numerous folks seemed to be having trouble with the gap during the time this was all unfolding. I believe that Commissioner Stoll told Tony Trippe that the gap had been fixed soon after he was notified of the problem. I didn’t try to test the gap myself so I can’t comment on what the problem might have been.

    One other useful link that was shared on the pair listserv by Roy Zimmermann was this list of patents that have been confirmed missing from the USPTO database:

    http://www.uspto.gov/patft/help/contents.htm

    Your note about the official gazettes is very interesting. Personally I’m inclined to think that missing/erroneous data in an electronic database is even a little bit worse than misprints in a paper source, because people use much less effort in searching the electronic database – everyone tends to assume that if you put a keyword in, and data comes out, that’s all there is to it. I suspect they don’t really learn as much about what they’re missing because they don’t see the data behind the search interface. They don’t get that “what’s going on” moment after they try to use a misprinted index that sends them off on a wild goose chase.

    Of course, I had very little experience using paper sources to search patents. Perhaps it’s the newest crop of patent searchers that are suffering the most, from lack of exposure to the errors in the OG. I wrote on the PIUG wiki, and I’ll say it again here – newcomers to the patent information field are missing a lot of important context about the way electronic patent information sources evolved. We don’t know about the gaps, the politics, the limitations that were imposed by technologies that are now obsolete. Unless you’re really motivated, you won’t come by a lot of this anecdotal stuff accidentally. Thanks for sharing your experiences.

  6. Hi Kristin,

    The list of patents confirmed missing in the USPTO databases is misleading because many of the numbers on the list appear to be withdrawn patents, which we wouldn’t expect to be in the database anyway. There’s also a link to a list of withdrawn patent numbers, but if you’re not familiar with WPNs, you might not make the connection.

  7. Hi Michael,

    I ran some numbers on the two lists in question (the “missing documents” list from 2001 and the withdrawn patents list in the same range, US Pat No. 3,931,263 to 6,101,209) and found that indeed, a majority of the “missing” patents are on the withdrawn list. The final tally was 11,471 entries in the “missing” list (there were more patents “missing” than that, which I’ll get to in a second) vs. 12,246 withdrawn.

    Some lines in the 2001 “missing” list are actually ranges of patents, each followed by the number of patents in that range; mostly two or three patent gaps, but occasionally very large gaps (just a few examples):

    5494189 – 5495619 1431
    5496076 – 5497509 1434
    5557633 – 5557800 168

    This shows that the number of “missing” patents in the same range of patents (3.93 million to 6.10 million) is indeed higher than the number of withdrawn patents; i.e. there are (or were in 2001) missing patents that are not accounted for as being withdrawn.

    How many? We can’t be sure. There appear to be no updated counts for “missing” patents on the USPTO site, nor any list of missing patents that are NOT withdrawn patents.

    Personally, I believe the USPTO should provide these data-sets to the public.
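
For readers who want to repeat this kind of tally, here is a minimal Python sketch of the arithmetic described in the comment above. It assumes each line of the "missing" list is either a single patent number or an inclusive range followed by a count, as in the three examples quoted. The `withdrawn` set holds made-up placeholder numbers, not real USPTO data, and the function names are hypothetical.

```python
import re

# A line is assumed to look like "5494189 – 5495619 1431" (inclusive range
# plus a trailing count) or just a single patent number. This layout is
# inferred from the examples quoted in the comment, not from the USPTO file.
RANGE_LINE = re.compile(r"^\s*(\d+)(?:\s*[-–]\s*(\d+))?")


def expand(line):
    """Return the inclusive range of patent numbers named on one list line."""
    match = RANGE_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    first = int(match.group(1))
    last = int(match.group(2) or first)  # single number: range of one
    return range(first, last + 1)


def unexplained_gaps(missing_lines, withdrawn):
    """Patent numbers listed as missing that are not on the withdrawn list."""
    missing = set()
    for line in missing_lines:
        missing.update(expand(line))
    return missing - set(withdrawn)


sample_lines = [
    "5494189 – 5495619 1431",
    "5496076 – 5497509 1434",
    "5557633 – 5557800 168",
]

# Placeholder withdrawn numbers, for illustration only.
withdrawn = {5494200, 5557700, 1234567}

counts = [len(expand(line)) for line in sample_lines]
remaining = unexplained_gaps(sample_lines, withdrawn)
print(counts)          # [1431, 1434, 168]
print(len(remaining))  # 3031
```

The counts for the three quoted ranges match the figures given in the list (last number minus first number, plus one). With the real withdrawn list substituted for the placeholder set, whatever remains after the subtraction would be the "missing but not withdrawn" patents discussed above.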

  8. Chris,

    Thanks for confirming my suspicions about withdrawn patent numbers in the 2001 list of missing documents.

    However, the more we scrutinize the 2001 list, the clearer it becomes that it is out-of-date and should not be trusted. For example, let’s look at one of the three gaps you mentioned above, 5557633 – 5557800.

    When I searched the PatFT database (query = PN/55576?? or PN/55577?? or PN/55578??) I retrieved 300 documents. So where are the missing documents? I checked the first and last numbers in the two other large gaps you cited and found documents for them, as well.

    I agree that the USPTO should do all it can to provide public access to *all* its data. More importantly, it should provide accurate and up-to-date information on the contents of its public search systems.
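
The reasoning behind that 300-document result can be checked with a short Python sketch. It assumes, as the query in the comment appears to, that `?` stands for exactly one character, and it uses Python's `fnmatch` module as a stand-in for that matching rule rather than anything PatFT itself provides. The variable and function names are hypothetical.

```python
from fnmatch import fnmatchcase

# The three wildcard patterns from the query quoted in the comment.
PATTERNS = ["55576??", "55577??", "55578??"]

# The "missing" range from the 2001 list that the query was meant to test.
GAP_FIRST, GAP_LAST = 5557633, 5557800


def numbers_matching(patterns, low, high):
    """All patent numbers in [low, high] matched by at least one pattern."""
    return [
        number
        for number in range(low, high + 1)
        if any(fnmatchcase(str(number), pattern) for pattern in patterns)
    ]


# Scan a window comfortably wider than the patterns can match.
covered = numbers_matching(PATTERNS, 5557000, 5558999)
gap = range(GAP_FIRST, GAP_LAST + 1)
gap_fully_covered = set(gap) <= set(covered)

print(len(covered))       # 300
print(len(gap))           # 168
print(gap_fully_covered)  # True
```

Under that assumption the three patterns can match at most 300 distinct patent numbers (5557600 through 5557899), and the whole 168-patent "gap" lies inside that span. Retrieving 300 documents therefore suggests that every number in the span, including the supposed gap, had a record at the time of the search.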

    • Michael,

      I agree the 2001 document should not be trusted, but it does illustrate the point that at one time there were missing documents that were not withdrawn patents.

      How bad is this problem today? We could only tell if the USPTO were to provide up to date information, or if someone were to conduct painstaking database checks and corroborations against the withdrawn list.

  9. Hi Kristin,

    Further to Rick Neifeld’s observations on the USPTO data, sequence data errors are particularly troubling.

    We, at SequenceBase Corporation, spend a significant portion of our time correcting errors in the USPTO data in order to make the quality of our data as outstanding as possible. While many of the errors are obvious, many are not. This is the challenge that we continue to address in our ongoing efforts to produce a high quality database.

    Marty

  10. Thanks for adding this comment about your error correction efforts at SequenceBase, Marty. I’d be interested to hear more specifics about your process if you’d care to share.

    Everyone following the discussion on gaps in PATFT should take a look at Michael White’s post over on his own blog, the Patent Librarian’s Notebook.

    http://patentlibrarian.blogspot.com/2010/10/how-complete-is-ustpo-patent-database.html

    Michael uses some basic math to point out that the numbers don’t support an argument for big gaps in the PATFT database, and the numbers are definitely convincing.

    I think that there are two major lingering questions. The first is what happened recently with the gap between 6,363,527 and 6,412,112, and whether it could happen again, with certain patents going invisible on us.

    The second, larger question is how frequently the errors inherent in electronic patent text affect search results. That is something we’ll all have to continue to ponder.
