Learn Practical Patent Analysis: A Case Study (Installment 2)

Patent data is almost universally messy. To the uninitiated, it can be unclear which patents are related, let alone which are important. When dealing with large numbers of patent documents, it helps to know how to thin the herd and focus on what matters most to you or your client. In this post I’ll teach you how to clean the data by removing family duplicates, in preparation for determining the most relevant classifications in a patent analysis project.

Last time, in Learn Practical Patent Analysis: A Case Study, we set up our examination of patent analysis by introducing our subject matter and our initial search strings. To quickly recap:

To get our sample data, I used the following queries in a major patent search engine (w2 is a generic representation of a proximity operator that means “within 2 words of,” while % is a stemming operator):

1) (solar w2 radiation) – around 17500 family records

2) 1 and panel% – around 5000 family records

The first thing to consider is the family status of the results. If the search results are automatically grouped into families, there is not much more that needs to be accomplished in this phase. Examples of search systems that automatically group patent documents into families include:

Some patent search systems have options to group documents into families after the results have been generated. A loose term for this action (one that may mean something different in other contexts) is “de-duplication.” Examples of patent search systems that can “de-duplicate” include:

Manual de-duplication and family grouping produce the best results and are always preferable for more important projects. Automatic de-duplication or family grouping may serve as a less preferred option. Manually cleaning the data avoids one major problem of some family definitions: they may be too inclusive (or not inclusive enough!). Make sure to check which family definition your patent search system uses, to see whether it is too broad or too narrow for your needs.
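To make the "too inclusive or not inclusive enough" point concrete, here is a minimal sketch of two common ways a family can be defined: a narrow one (documents with exactly the same set of priorities) and a broad one (documents linked directly or indirectly by any shared priority). The document and priority identifiers are made up for illustration, and this is not how any particular search system implements its families.

```python
from collections import defaultdict

# Hypothetical records: publication -> set of priority applications it claims.
docs = {
    "US-A": {"P1"},
    "EP-A": {"P1"},
    "US-B": {"P1", "P2"},  # e.g. a later filing that adds a second priority
    "JP-C": {"P2"},
    "US-D": {"P9"},
}

def simple_families(docs):
    """Narrow definition: group documents whose priority sets are identical."""
    groups = defaultdict(set)
    for pub, priorities in docs.items():
        groups[frozenset(priorities)].add(pub)
    return sorted(sorted(g) for g in groups.values())

def extended_families(docs):
    """Broad definition: group documents linked, directly or indirectly,
    by at least one shared priority (connected components via union-find)."""
    parent = {pub: pub for pub in docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    first_seen = {}  # priority -> first publication seen claiming it
    for pub, priorities in docs.items():
        for p in priorities:
            if p in first_seen:
                parent[find(pub)] = find(first_seen[p])
            else:
                first_seen[p] = pub

    groups = defaultdict(set)
    for pub in docs:
        groups[find(pub)].add(pub)
    return sorted(sorted(g) for g in groups.values())

print(simple_families(docs))
print(extended_families(docs))
```

With this sample, the narrow definition yields four families while the broad one merges US-A, EP-A, US-B and JP-C into a single family, which is exactly the kind of difference worth checking before relying on automatic grouping.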

Grouping patents into families or de-duplicating results is important because, without this data cleaning step, large families may be over-represented in the statistical analysis steps that follow.
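If your export already carries a family identifier, the de-duplication step itself can be quite small. The sketch below assumes hypothetical field names (`pub`, `family_id`) and simply keeps the first document seen as the family's representative while remembering every member; a real project would likely pick the representative more carefully (for example, by jurisdiction or publication date).

```python
# Hypothetical export: one row per patent document, with a family identifier.
records = [
    {"pub": "US1", "family_id": "F1"},
    {"pub": "EP1", "family_id": "F1"},
    {"pub": "JP1", "family_id": "F1"},
    {"pub": "US2", "family_id": "F2"},
]

def dedupe_by_family(records):
    """Collapse documents to one entry per family.

    The first document encountered becomes the representative; all members
    are retained so family size can still be inspected afterwards.
    """
    families = {}
    for rec in records:
        fam = families.setdefault(
            rec["family_id"], {"representative": rec["pub"], "members": []}
        )
        fam["members"].append(rec["pub"])
    return families

families = dedupe_by_family(records)
for family_id, fam in families.items():
    print(family_id, fam["representative"], len(fam["members"]))
```

Here four documents collapse to two family records, so the three-member family F1 no longer counts three times in later statistics.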

The following image shows an example patent family that has been grouped into one result by a major patent search system:

[Figure: Patent documents automatically grouped into a patent family]

This family relationship ensures that this group of related patent documents is treated as one object. As a result, most systems (including the one we’re using to run this case study) treat multiple instances of the same classification within a family as one occurrence. The first two patent documents in the family shown above were, in part, classified under US Classification 160/272; running statistical analysis on that family record generates only one hit for 160/272, so the importance of 160/272 is not overestimated merely because a related document in the same family carries the same classification.
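The effect on classification statistics can be sketched in a few lines. The records below are invented for illustration (160/272 is the class mentioned above; the second class and all identifiers are placeholders), and the field names are assumptions rather than any system's export format.

```python
from collections import Counter

# Hypothetical documents: two in family F1 share class 160/272.
records = [
    {"pub": "DOC-1", "family_id": "F1", "classes": ["160/272", "136/244"]},
    {"pub": "DOC-2", "family_id": "F1", "classes": ["160/272"]},
    {"pub": "DOC-3", "family_id": "F2", "classes": ["136/244"]},
]

def class_counts_per_document(records):
    """Count every classification on every document (no family grouping)."""
    return Counter(c for rec in records for c in rec["classes"])

def class_counts_per_family(records):
    """Count each classification at most once per family."""
    classes_by_family = {}
    for rec in records:
        classes_by_family.setdefault(rec["family_id"], set()).update(rec["classes"])
    return Counter(c for classes in classes_by_family.values() for c in classes)

print(class_counts_per_document(records)["160/272"])  # 2
print(class_counts_per_family(records)["160/272"])    # 1
```

Counting per document gives 160/272 two hits from what is really one invention; counting per family gives it one.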

Part of this case study is engaging with you, the readers, so we ask: What kind of family sorting do you prefer when doing patent analysis studies? If this is your first attempt at doing one along with us, which approach would you take? Do you prefer the simplification and clarity of using inclusive patent families or do you see the value in the redundancy of data provided by using individual documents? Let us know in the comments.

Next time we’ll be addressing classification analysis within a patent analysis study.

Read installments 3 and 4.


This post was contributed by Intellogist Team members Dan Wolka and Chris Jagalla.


9 Responses

  1. Having conducted searches both pre and post PatBase, I find family grouping better. Pre PatBase I would search individual countries or regions and often look at potentially relevant equivalents a number of times. If they look relevant but aren’t, and this can only be discovered by extensive reading of the specification, I am wasting time by doing this several times. It is much better to be able to discard the whole family in one stroke. As such, I was able to expand my search and view more families even though the total number of records viewed stayed the same.

  2. Generally, while doing any search, mapping or landscaping, analysis becomes much easier when the patent documents are reduced to one member per family. This step saves a lot of time otherwise spent reading each patent document that belongs to the same family.
    But it all depends on what type of analysis we are conducting; for an FTO analysis, I don’t think it would be a viable step.

  3. @insomniac: I agree that one of the best features of PatBase (and other search systems that group by family by default) is that initial searching is made much easier and more succinct by the family reduction. In a patent analysis study, this reduction becomes a trade-off between time and thoroughness. By the same token…

    @Shailendra: You make a great point about the type of analysis and what we choose for our family definitions. You hit on the core of our solution: go as deep and thorough as you can (manual de-duplication) provided that the scope of the analysis warrants such a solution. For different kinds of analysis projects, it’s perfectly acceptable to use a simpler family style solution as a trade-off for the ability to analyze more “distinct” documents.

    Going forward, we’re going to utilize patent families for the sake of simplicity, but there’s a whole rabbit hole you can go down, as I’m sure you know, Shailendra and insomniac!

  4. [...] time, in Learn Practical Patent Analysis: A Case Study (Installment 2), we discussed how manual de-duplication can reduce the redundant patent documents returned to us in [...]

  5. As others have pointed out here, single family reduction (SFR) is really critical for an efficient data review and for removing inflation from analysis. My firm has traditionally not agreed with any of the commercially available approaches for SFR, which actually led us to develop our own software tool. The sticking point was that we wanted to keep divisionals and CIPs in the set (not just file them under the patent head) since they have unique claims. Has anyone else run into this issue or are we just way too picky?

    The other piece of advice I’d add to this well-written piece is to ensure that you include the family member data in your data download. So even if you do an SFR, you can also assess the size and geographical distribution of the patent families later if you think ahead to include that field. For certain types of analysis, the family member data can be more appropriate (e.g. assessing the filing strategy of a competitor).

  6. Thanks for the great tip Kate!

  7. Hi Kate, it seems like the sensitivity to the unique content of the claims in CIPs and divisionals would be a useful thing to have! I think it probably only applies to US patent documents, right? Did you have to program any other tweaks in to deal with quirks from the EP, PCT documents, or other patent issuing authorities?

    From an information science perspective, teaching a computer program to tease out the relationship between a CIP or divisional and the parent application must be difficult, since I thought it was the case that some claims (at least in CIPs) would get the benefit of the original filing date while others would not.

  8. Kristin, good to hear we’re not the only ones obsessed with this level of detail! It is primarily a US-specific issue. We have run into some issues with other patent jurisdictions for SFR, mainly related to patent number formatting to ensure that family members are matched up properly.

    Great point about the filing dates for CIPs. Since our work is typically at the large dataset level for the purpose of technical and business insights, we use the dates that come down with the dataset. Further research to confirm the precise dates and see if they vary by claim is typically a next step, but only for specific patents of interest.

    SFR is one of those issues where it is really easy to dive down the rabbit hole! It’s great to have this forum to discuss the intricacies.
