Today, prompted by suggestions from readers, I want to introduce a case study that I plan to discuss over several blog posts. I also hope to get you, the reader, involved in the process by describing my method so that you can audit it independently. The goal is to see what conclusions we can draw from the case study using a variety of search tools. I also hope readers will post questions they would like answered using the described method.
To get our sample data, I used the following queries in a major patent search engine (w2 is a generic representation of a proximity operator that means “within 2 words of,” % is a stemming operator, and the 1 in the second query refers back to the results of the first query):
1) (solar w2 radiation) – around 17500 family records
2) 1 and panel% – around 5000 family records
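For readers who want to see what the two generic operators do, here is a minimal Python sketch of their logic. It is an illustration only, not how any particular search engine implements them. It assumes that “within 2 words of” means the two words sit at most two token positions apart in either order, and that % matches any word beginning with the given stem; real systems differ on both points. All function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def within(tokens, a, b, n):
    """True if word a occurs at most n token positions from word b (either order).

    Assumption: this is one reading of 'within n words of'; some systems
    count intervening words instead, or require a fixed word order.
    """
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)


def has_stem(tokens, stem):
    """True if any token starts with stem (assumed meaning of the % operator)."""
    return any(t.startswith(stem) for t in tokens)


def matches_query_1(text):
    """Query 1: (solar w2 radiation)."""
    return within(tokenize(text), "solar", "radiation", 2)


def matches_query_2(text):
    """Query 2: the results of query 1, narrowed with panel%."""
    return matches_query_1(text) and has_stem(tokenize(text), "panel")


if __name__ == "__main__":
    for sample in (
        "Solar radiation collector",
        "Panels that absorb solar thermal radiation",
        "Solar energy and thermal heat radiation",
    ):
        print(sample, matches_query_1(sample), matches_query_2(sample))
```

When adapting the queries to your own search system, check its documentation for how proximity is counted and whether word order matters, since those details change the hit counts.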
I then exported the first 5000 results of query 2 into a CSV file for further manipulation. I realize my initial search is not exactly scientific and does not follow the protocol of a comprehensive, professional patent search. However, because this case study is not being done for a real client, my goal was simply to obtain a large data set of around 5000 documents relating to a broad subject area, in order to see what information we can retrieve from the data.
You can build a similar data set in most search systems that support data export by replacing the generic operators above with the operators of your own search system. I apologize for not posting the data set on the site, but the files are over 100 MB each and would take up too much bandwidth. I intend to keep this project system-neutral for the time being, in order to discuss the steps taken in a general way. In addition, community members can share the results they get with their own tools so that we can compare and contrast them.
In my next posts, I intend to accomplish the following:
- Clean the data
- Determine the most relevant classifications
- Determine the most relevant companies
- Determine the key inventors and inventor teams
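As a rough preview of what those steps could look like in practice, here is a small Python sketch. It is not the method the upcoming posts will use, only one possible starting point. The column names (family_id, assignee, classifications, inventors), the semicolon separator, and the sample rows are all made up for illustration; your export will have its own layout.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample standing in for an exported CSV file.
# Every column name and value here is hypothetical.
SAMPLE_CSV = """family_id,assignee,classifications,inventors
F1,Acme Solar Inc.,H01L 31/04; F24J 2/00,"Smith, J.; Lee, K."
F2,ACME SOLAR INC,H01L 31/04,"Lee, K.; Smith, J."
F3,Helio Corp,F24J 2/00,"Garcia, M."
"""


def split_field(value, sep=";"):
    """Split a multi-valued cell on sep and drop empty pieces."""
    return [part.strip() for part in value.split(sep) if part.strip()]


def clean_name(name):
    """Crude cleanup: uppercase, drop punctuation, collapse whitespace."""
    kept = "".join(ch for ch in name.upper() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def summarize(rows):
    """Count classifications, companies, inventors, and inventor teams."""
    classes, companies = Counter(), Counter()
    inventors, teams = Counter(), Counter()
    for row in rows:
        companies[clean_name(row["assignee"])] += 1
        classes.update(split_field(row["classifications"]))
        names = sorted(clean_name(n) for n in split_field(row["inventors"]))
        inventors.update(names)
        if len(names) > 1:
            # A team is the sorted tuple of co-inventors on one record.
            teams[tuple(names)] += 1
    return classes, companies, inventors, teams


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
classes, companies, inventors, teams = summarize(rows)

if __name__ == "__main__":
    print(companies.most_common(3))
    print(classes.most_common(3))
    print(teams.most_common(3))
```

To try this on a real export, replace the io.StringIO(SAMPLE_CSV) source with an opened file and adjust the column names. The name cleanup is deliberately simple; it merges "Acme Solar Inc." with "ACME SOLAR INC" but will not catch misspellings or subsidiaries.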
Please feel free to post questions that you would like to see answered, or comments on strategies for manipulating the data set.
This post was contributed by Intellogist Team member Dan Wolka.