Patent data is almost universally messy. To the uninitiated, it can be hard to tell which patents are related, let alone which are important. When dealing with large numbers of patent documents, it helps to know how to thin the herd and focus on what matters most to you or your client. In this post I’ll teach you how to clean the data by removing family duplicates, in preparation for determining the most relevant classifications in a patent analysis project.
Last time, in Learn Practical Patent Analysis: A Case Study, we set up our patent analysis by introducing our subject matter and our initial search strings. To quickly recap:
To get our sample data, I used the following queries in a major patent search engine (w2 is a generic representation of a proximity operator that means “within 2 words of,” while % is a stemming operator):
1) (solar w2 radiation) – around 17500 family records
2) 1 and panel% – around 5000 family records
The first thing to consider is the family status of the results. If the search results are automatically grouped into families, there is not much more that needs to be accomplished in this phase. Examples of search systems that automatically group patent documents into families include:
Some patent search systems have options to group documents into families after the results have been generated. A loose term for this action (one that may be used differently in other contexts) is “de-duplication.” Examples of patent search systems that can “de-duplicate” include:
Manual de-duplication and family grouping produces the best results and is always preferable for more important projects; automatic de-duplication or family grouping is a less preferred fallback. Manually cleaning the data avoids one major problem with some family definitions: they may be too inclusive (or not inclusive enough!). Make sure to check which family definition your patent search system uses, so you know whether it groups too broadly or too narrowly for your needs.
Grouping patents into families or de-duplicating results is important because, without this data cleaning step, large groups of related patent documents may be over-represented when we run our statistical analysis.
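For readers who export their results and clean them outside the search system, the grouping step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names `publication_number` and `family_id` are hypothetical, the sample records are invented, and keeping the first-listed document as the family representative is an arbitrary choice, not a rule from any particular search system.

```python
def group_by_family(records):
    """Group patent records by family ID, preserving first-seen order."""
    families = {}
    for record in records:
        families.setdefault(record["family_id"], []).append(record)
    return families


def representatives(families):
    """Keep the first-listed document of each family as its single entry."""
    return [members[0] for members in families.values()]


# Invented sample data: three documents, two of which share a family.
# The field names are assumptions, not a real export format.
records = [
    {"publication_number": "DOC-1", "family_id": "FAM-A"},
    {"publication_number": "DOC-2", "family_id": "FAM-A"},
    {"publication_number": "DOC-3", "family_id": "FAM-B"},
]

families = group_by_family(records)
deduplicated = representatives(families)
print(len(records), "documents ->", len(deduplicated), "family records")
```

Which family definition populates the family ID field matters a great deal here, for the reasons given above: a broad definition merges more documents into each group than a narrow one.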
The following image shows an example patent family that has been grouped into one result by a major patent search system:
This family relationship ensures that this group of related patent documents is treated as one object. As a result, most systems (including the one we’re using to run this case study) treat multiple instances of the same classification within a family as one occurrence. The first two patent documents in the above list were, in part, classified under US Classification 160/272; running statistical analysis on that family record generates only one hit for 160/272, so the importance of 160/272 is not overestimated just because a related document in the same family shares it.
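The difference between counting per document and counting per family can be made concrete with a small Python sketch. The data below is invented for illustration: only the classification 160/272 comes from the example above, while the field names, the family IDs, and the placeholder class `999/001` are assumptions.

```python
from collections import Counter


def count_per_document(docs):
    """Count each classification once for every document that carries it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc["classes"]))
    return counts


def count_per_family(docs):
    """Count each classification once per family, however many members carry it."""
    family_classes = {}
    for doc in docs:
        family_classes.setdefault(doc["family_id"], set()).update(doc["classes"])
    counts = Counter()
    for classes in family_classes.values():
        counts.update(classes)
    return counts


# Invented sample: two related documents share class 160/272,
# and a third, unrelated document has a placeholder class.
docs = [
    {"family_id": "FAM-A", "classes": ["160/272"]},
    {"family_id": "FAM-A", "classes": ["160/272"]},
    {"family_id": "FAM-B", "classes": ["999/001"]},
]

per_document = count_per_document(docs)
per_family = count_per_family(docs)
print("per document:", per_document["160/272"])
print("per family:  ", per_family["160/272"])
```

Counted per document, 160/272 appears twice and looks twice as important as the other class; counted per family, both classes appear once.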
Part of this case study is engaging with you, the readers, so we ask: What kind of family sorting do you prefer when doing patent analysis studies? If this is your first attempt at doing one along with us, which approach would you take? Do you prefer the simplification and clarity of using inclusive patent families or do you see the value in the redundancy of data provided by using individual documents? Let us know in the comments.
Next time we’ll be addressing classification analysis within a patent analysis study.