Meeting the Challenges of Data Collection and Analysis

As a leader in patent analysis, Landon IP has expert searchers who answer complex questions on a daily basis to produce the highest quality results. If you work with patents, you know the obstacles we face: data collections are huge and unwieldy; errors in the data are rampant; and in short, nothing involving patent data is ever easy. Today, I’m going to share some of our most basic strategies for producing high-quality datasets that lead to reliable results. Read on for a look at these best practices.

Clean data is essential

We’ve all seen blatant errors in patent metadata. The industry’s standard coverage now includes millions of documents from over a hundred patent authorities, and the number of errors only grows with this rapidly expanding volume.

Because of this, it’s essential to work with the raw data whenever possible, rather than relying on an end-user tool for reporting. We believe that without a human review of the format and quality of the raw data, it’s hard to have faith in the resulting analysis project.

Assignee data probably presents the biggest challenge for high-volume projects. Most patent analysis work at some point relies on assignee data, whether it involves gathering a portfolio of one company’s IP, or creating a landscape that will show major players in a technology space. Unfortunately, assignee data is seriously affected by irregularity in the source data. Examples include:

  • Inconsistent name abbreviations, e.g. Minnesota Mining & MFG
  • Misspellings, e.g. Minnesoda Mining & Manufacturing
  • Transliterations, e.g. Tsinghua University vs. Qinghua University, both phonetically valid romanizations of the Chinese university’s name.
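Much of this cleanup starts with simple normalization. As a sketch (the abbreviation map here is illustrative, not a standard list), a Python function might uppercase each name, strip punctuation, and expand common abbreviations so that trivial variants collapse to a single key:

```python
import re

# Illustrative abbreviation map -- a real project would maintain a much
# larger, curated list of equivalents.
ABBREVIATIONS = {
    "MFG": "MANUFACTURING",
    "CO": "COMPANY",
    "CORP": "CORPORATION",
    "UNIV": "UNIVERSITY",
}

def normalize_assignee(name: str) -> str:
    """Collapse trivial variants of an assignee name to one key."""
    name = name.upper()
    name = re.sub(r"[^\w\s&]", " ", name)  # drop punctuation
    words = [ABBREVIATIONS.get(w, w) for w in name.split()]
    return " ".join(words)

print(normalize_assignee("Minnesota Mining & Mfg."))
# MINNESOTA MINING & MANUFACTURING
```

Note that this only catches the abbreviation case; misspellings and transliterations need the fuzzy-matching techniques discussed below, with a human confirming each match.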

Patent searches themselves can be carefully calibrated to find these variants (see our blog post on the topic). But when it comes to cleaning and standardizing the data, computers need a little help identifying equivalents.

When working with small datasets, the task of cleaning assignee data can sometimes be solved by brute force. But the ultimate goal is always to work smarter, not harder! Usually, working with data means extracting it out of a commercial search product and into local files for further manipulation. Most of these search products support downloading records into spreadsheet formats such as .CSV (comma-separated values). This data can then be cleaned without the limitations imposed by a commercial search system.
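As a sketch of that first step — assuming an export with “Publication Number” and “Assignee” columns (headers and layout vary by search product) — Python’s standard csv module can pull the records into memory for cleaning:

```python
import csv
import io

# Stand-in for a file exported from a commercial search product;
# real exports will have more columns and provider-specific headers.
EXPORT = io.StringIO(
    "Publication Number,Assignee\n"
    "US6666666A,Minnesota Mining & MFG\n"
    "US6666667A,Minnesoda Mining & Manufacturing\n"  # hypothetical record
)

with EXPORT as f:
    records = list(csv.DictReader(f))

for row in records:
    print(row["Publication Number"], "->", row["Assignee"])
```

In practice you would pass a real file path to `open()` instead of the in-memory stand-in; the point is simply that once the data is local, no search-system limitation applies.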

Using a data cleaning product that can import this CSV data, such as VantagePoint, can make cleaning easier – VantagePoint uses fuzzy logic filters to suggest possible matches between names, while leaving ultimate control in the hands of the searcher.
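A rough stand-in for that kind of fuzzy suggestion — this is not VantagePoint’s algorithm, just an illustration using Python’s standard library — is difflib, which scores string similarity and can propose candidate matches for a searcher to confirm or reject:

```python
import difflib

names = [
    "MINNESOTA MINING & MANUFACTURING",
    "MINNESODA MINING & MANUFACTURING",  # misspelling
    "TSINGHUA UNIVERSITY",
    "QINGHUA UNIVERSITY",                # transliteration variant
]

# Suggest likely equivalents for one name; a human reviews each pair
# before any records are merged.
suggestions = difflib.get_close_matches(
    "MINNESOTA MINING & MANUFACTURING", names, n=3, cutoff=0.9
)
print(suggestions)
```

The `cutoff` threshold is a tuning knob: too low and the tool floods the reviewer with false pairs, too high and genuine variants slip through — which is exactly why ultimate control stays with the searcher.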

However, in tried-and-true fashion, Microsoft Excel is the product that usually offers the right tools for initial work on the dataset. If you have no specialized data cleaning software at your disposal, you can use some of the functions in Microsoft Excel to begin the process. Excel’s Text-to-Columns, Filters, and PivotTables features can all be useful in the right situation.

By using basic MS Excel functions to clean data and separate cells containing multiple data points, we can add further dimensions for analysis to the spreadsheet. For example, by using simple Excel functions such as Text-to-Columns, a patent number column such as US6666666A could easily become:





Country Code | Publication Number | Kind Code
US           | 6666666            | A
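Outside Excel, the same split can be sketched in Python with a regular expression — the pattern below assumes the common layout shown above (two-letter country code, numeric serial, optional kind code), which not every authority follows:

```python
import re

# Assumed layout: 2-letter country code, digits, optional kind code.
PATENT_RE = re.compile(r"^([A-Z]{2})(\d+)([A-Z]\d?)?$")

def split_publication_number(pub: str):
    """Split a publication number into (country, number, kind)."""
    m = PATENT_RE.match(pub)
    if m is None:
        raise ValueError(f"Unrecognized publication number: {pub}")
    return m.groups()

print(split_publication_number("US6666666A"))
# ('US', '6666666', 'A')
```

Numbers that do not match the assumed layout raise an error rather than being silently mis-split — a useful property when the goal is catching data irregularities, not hiding them.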


Likewise, this feature can be used to address the common problem of having multiple assignee names on a single patent. An example is shown in the record below:


Publication Number | Assignee
…                  | National Research Institute of Metals; Fuji Electric Company; Chubu Electric Power


Depending on the project goals, analyzing this dataset by assignee could require this record to be triple-counted: counted once for the National Research Institute of Metals, a second time for the Fuji Electric Company, and a third for Chubu Electric Power. By manipulating the dataset we can split these names into separate cells, the first step in counting them individually.


Publication Number | Assignee                              | Assignee 2            | Assignee 3
…                  | National Research Institute of Metals | Fuji Electric Company | Chubu Electric Power
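The same reshaping can be sketched in Python — assuming the assignees arrive as one semicolon-separated field (delimiters vary by data provider):

```python
# Split a multi-assignee field into separate columns, much as Excel's
# Text-to-Columns would. Delimiter and field names are illustrative.
record = {
    "Publication Number": "…",  # number not shown in the source record
    "Assignee": ("National Research Institute of Metals; "
                 "Fuji Electric Company; Chubu Electric Power"),
}

assignees = [a.strip() for a in record["Assignee"].split(";")]
row = {"Publication Number": record["Publication Number"]}
for i, name in enumerate(assignees, start=1):
    row[f"Assignee {i}"] = name

print(row)
```

With each name in its own cell, each assignee can then be counted individually when the dataset is tabulated.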



Idiosyncrasies in the source data can make it difficult to translate a raw patent dataset into accurate information. The best approach to analysis projects is to be aware of the pitfalls, and to avoid “black box” analysis tools that advertise graphs and charts at the click of a button – inconsistencies in the underlying data can subtract a significant number of documents from the end result, weakening the analysis.

Delivering reliable business intelligence means working with reliable data. Landon IP experts are experienced at handling all types of data collection, and frequently deal with highly complex projects. Our analysts are trained in the myriad problems that crop up in patent data grooming, such as synthesizing data from two different data formats, or filtering data to select only certain family members for analysis.
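For instance, selecting one family member per family might be sketched like this — the field names and the “keep the earliest publication” rule are both assumptions for illustration, since the right rule depends on the project:

```python
from datetime import date

# Hypothetical records; field names are illustrative, not a standard schema.
records = [
    {"pub": "EP1111111A1", "family": "F1", "date": date(2001, 3, 1)},
    {"pub": "US6666666A",  "family": "F1", "date": date(2000, 5, 9)},
    {"pub": "JP2222222A",  "family": "F2", "date": date(2002, 7, 2)},
]

# Keep only the earliest-published member of each family.
earliest = {}
for rec in records:
    best = earliest.get(rec["family"])
    if best is None or rec["date"] < best["date"]:
        earliest[rec["family"]] = rec

selected = sorted(r["pub"] for r in earliest.values())
print(selected)
# ['JP2222222A', 'US6666666A']
```

Without a filter like this, a landscape analysis can count the same invention several times — once per family member — and skew the apparent size of a portfolio.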

To learn more about Landon IP analysis services, please see our website.

Patent Information from Landon IP

This post was contributed by Landon IP Librarian Kristin Whitman. The Intellogist blog is provided for free by Intellogist’s parent company, Landon IP, a major provider of patent search, technical translation, and information services.

