First logged date: Apr. 2nd, 2024

Project Proposal: Curating a Dataset for Patent Invalidity Analysis Using IPR Proceedings Data

Objective:

Curate a comprehensive, well-structured dataset from Inter Partes Review (IPR) proceedings, including petitions, Patent Owner Responses, and final written decisions, to support future development of a model that analyzes prior art and assesses patent invalidity.

Data Collection:

  1. Identify and gather publicly available IPR petitions, Patent Owner Responses, and final written decisions from the USPTO database.
  2. Assess the accessibility of additional information from expert reports to potentially supplement the petitions.
  3. Verify that the collected data does not contain confidential information to ensure compliance with legal and ethical standards.
  4. Prioritize collecting a large number of petitions to enable the development of an accurate model in the future.
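
As a rough illustration of steps 1 and 3, the sketch below filters a manifest of filings down to public documents of the three target types. The `FilingRecord` fields, the document-type strings, and the case numbers are hypothetical placeholders; real PTAB metadata may be named and structured differently.

```python
from dataclasses import dataclass

# Hypothetical document-type labels; actual metadata values may differ.
WANTED_TYPES = {"petition", "patent_owner_response", "final_written_decision"}


@dataclass(frozen=True)
class FilingRecord:
    case_number: str   # placeholder format, e.g. "IPR0000-00001"
    doc_type: str
    is_public: bool    # assumed flag: False for sealed or confidential filings


def select_public_filings(records):
    """Keep only public filings of the three target document types."""
    return [r for r in records if r.is_public and r.doc_type in WANTED_TYPES]


def count_by_type(records):
    """Count filings per document type, to track collection volume."""
    counts = {}
    for r in records:
        counts[r.doc_type] = counts.get(r.doc_type, 0) + 1
    return counts
```

Counting by type after filtering gives a quick view of how many petitions have been collected so far, which is the quantity step 4 asks to prioritize.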

Data Labeling and Annotation:

  1. Develop a consistent labeling scheme to annotate the collected data, clearly identifying the following elements:
     a. Patent claims
     b. Prior art references
     c. Arguments for invalidity (from petitions)
     d. Arguments for patentability (from Patent Owner Responses)
     e. PTAB’s decision and reasoning (from final written decisions)
  2. Create detailed annotation guidelines to ensure consistent labeling across the dataset, minimizing ambiguity and subjectivity.
  3. Implement quality control measures, such as double annotation with inter-annotator agreement checks and expert review, to maintain the accuracy and reliability of the labeled data.
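
A minimal sketch of what a labeled span and one quality check might look like. The label names mirror the five elements listed in item 1; the `Annotation` fields and the choice of Cohen's kappa as the agreement measure are assumptions for illustration, not part of the proposal.

```python
from collections import Counter
from dataclasses import dataclass

# One label per element type in the labeling scheme (names are illustrative).
LABELS = (
    "patent_claim",
    "prior_art_reference",
    "invalidity_argument",
    "patentability_argument",
    "ptab_decision",
)


@dataclass(frozen=True)
class Annotation:
    doc_id: str
    start: int   # character offset where the span begins
    end: int     # character offset where the span ends (exclusive)
    label: str

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label}")
        if not 0 <= self.start < self.end:
            raise ValueError("span must satisfy 0 <= start < end")


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used the same single label throughout
    return (observed - expected) / (1 - expected)
```

Rejecting unknown labels at construction time keeps the scheme closed, and a low kappa on a doubly annotated sample is one signal that the annotation guidelines need tightening.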

Data Organization and Storage:

  1. Design a structured database schema to efficiently store and retrieve the collected and labeled data.
  2. Implement appropriate data normalization techniques to minimize redundancy and improve data integrity.
  3. Develop a secure and scalable storage solution to accommodate the potentially large volume of data.
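
One possible normalized layout, sketched with Python's built-in `sqlite3` module. The table and column names are hypothetical, and SQLite is used only to keep the example self-contained; a production store would likely need something more scalable.

```python
import sqlite3

# Three tables: each proceeding has many documents, each document many annotations.
SCHEMA = """
CREATE TABLE proceeding (
    id INTEGER PRIMARY KEY,
    case_number TEXT NOT NULL UNIQUE,
    patent_number TEXT NOT NULL
);
CREATE TABLE document (
    id INTEGER PRIMARY KEY,
    proceeding_id INTEGER NOT NULL REFERENCES proceeding(id),
    doc_type TEXT NOT NULL CHECK (doc_type IN
        ('petition', 'patent_owner_response', 'final_written_decision')),
    body TEXT NOT NULL
);
CREATE TABLE annotation (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES document(id),
    label TEXT NOT NULL,
    start_offset INTEGER NOT NULL,
    end_offset INTEGER NOT NULL CHECK (end_offset > start_offset)
);
"""


def create_database(path=":memory:"):
    """Open a database, enable foreign-key enforcement, and create the tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def spans_with_label(conn, case_number, label):
    """Return the annotated text spans with a given label in one proceeding."""
    rows = conn.execute(
        """
        SELECT substr(d.body, a.start_offset + 1, a.end_offset - a.start_offset)
        FROM annotation a
        JOIN document d ON d.id = a.document_id
        JOIN proceeding p ON p.id = d.proceeding_id
        WHERE p.case_number = ? AND a.label = ?
        ORDER BY a.id
        """,
        (case_number, label),
    )
    return [row[0] for row in rows]
```

Storing annotations as offsets into the document body, rather than as copied text, avoids duplicating content and keeps each span tied to its source. Offsets are 0-based here, so the query adds 1 for SQLite's 1-based `substr`.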

Data Documentation and Accessibility:

  1. Create comprehensive documentation describing the data collection process, labeling scheme, and database structure.
  2. Develop user-friendly interfaces and query mechanisms to facilitate easy access to the curated dataset for researchers and developers.
  3. Establish clear guidelines for data usage, licensing, and attribution to ensure proper utilization of the dataset.

Potential Impact:

A high-quality, well-curated dataset from IPR proceedings would lay the foundation for developing models for patent invalidity analysis. It would let researchers and legal professionals explore new approaches to prior art analysis and decision-making, which could ultimately lead to more efficient and accurate patent validity assessments.

Next Steps:

  1. Identify and secure access to the necessary data sources for IPR proceedings.
  2. Develop a detailed data labeling scheme and annotation guidelines.
  3. Establish a data collection and labeling pipeline, including quality control measures.
  4. Design and implement a database schema and storage solution for the curated dataset.
  5. Create comprehensive documentation and user-friendly interfaces for data accessibility.
  6. Establish a timeline and milestones for data collection, labeling, and organization phases.

from @jamalibrated

I wonder if we can make a training set for invalidity contentions from IPR petitions, Patent Owner Responses, and final written decisions.

  • The petitions/Patent Owner Responses highlight adversarial interpretations of the prior art.
  • The final written decisions form a corpus of decisions that account for the competing arguments.
  • We might be able to train a model to analyze prior art more accurately this way.

The problem of patent invalidity is comparing patent claims to “prior art,” i.e., earlier public disclosures of the claimed elements of the invention. IPR proceedings are administrative proceedings at the PTO where patent challengers introduce prior art plus reasons why the claims are invalid, and the patent owners argue why their claims are patentably distinct. The PTAB (Patent Trial and Appeal Board) decides who is correct and gives reasons. So I wonder if it’s possible to use these filings to teach a model how to analyze prior art.

These are mostly public, so there shouldn’t be confidentiality issues. Oh, and the expert reports may contain additional information beyond the petitions themselves. The more petitions the better, right? Should I label them a certain way?
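
One way the three filing types could be assembled into training examples is sketched below: the competing arguments form the input and the decision forms the target. The tuple layout, dictionary keys, and document-type strings are assumptions made for illustration only.

```python
from collections import defaultdict

REQUIRED = ("petition", "patent_owner_response", "final_written_decision")


def build_examples(filings):
    """Group (case_number, doc_type, text) tuples into one example per proceeding.

    Proceedings missing any of the three document types are skipped.
    """
    by_case = defaultdict(dict)
    for case_number, doc_type, text in filings:
        if doc_type in REQUIRED:
            by_case[case_number][doc_type] = text

    examples = []
    for case_number in sorted(by_case):
        docs = by_case[case_number]
        if all(doc_type in docs for doc_type in REQUIRED):
            examples.append({
                "case_number": case_number,
                # Adversarial arguments from both sides form the input.
                "input": {
                    "petition": docs["petition"],
                    "patent_owner_response": docs["patent_owner_response"],
                },
                # The Board's decision and reasoning form the target.
                "target": docs["final_written_decision"],
            })
    return examples
```

Skipping incomplete proceedings matters in practice, since not every IPR reaches a final written decision.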