Alex Denne
Growth @ Genie AI | Introduction to Contracts @ UCL Faculty of Laws | Serial Founder

Available Legal AI Datasets

18th December 2024
3 min

Note: This article is just one of 60+ sections from our full report titled: The 2024 Legal AI Retrospective - Key Lessons from the Past Year. Please download the full report to check any citations.

Available datasets

The Contract Understanding Atticus Dataset (CUAD) is a corpus of 13,000+ labels across 510 commercial legal contracts. The contracts were manually labeled under the supervision of experienced lawyers to identify 41 types of legal clauses that are considered important in contract review.

The contracts are collected from the Electronic Data Gathering, Analysis, and Retrieval ("EDGAR") system, which is maintained by the U.S. Securities and Exchange Commission (SEC) (https://www.sec.gov/search-filings).
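
To make the labeling scheme concrete, here is a minimal sketch of how clause-level annotations of this kind could be represented and tallied. The ClauseLabel record, its field names, and the sample values are hypothetical and do not reflect CUAD's actual file format; only the headline counts (510 contracts, 13,000+ labels, 41 clause types) come from the description above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ClauseLabel:
    """One annotated span in one contract (hypothetical layout, not CUAD's schema)."""
    contract_id: str
    clause_type: str  # one of the 41 clause types
    start: int        # character offset where the span begins
    end: int          # character offset where the span ends (exclusive)


def labels_per_clause_type(labels):
    """Count how many annotated spans each clause type has."""
    return Counter(label.clause_type for label in labels)


# Illustrative, made-up annotations; not real CUAD data.
sample = [
    ClauseLabel("contract-001", "Governing Law", 1200, 1285),
    ClauseLabel("contract-001", "Expiration Date", 310, 352),
    ClauseLabel("contract-002", "Governing Law", 980, 1040),
]

counts = labels_per_clause_type(sample)

# Rough density implied by the headline figures:
# about 13,000 labels spread over 510 contracts.
avg_labels_per_contract = 13_000 / 510
```

Because the corpus is reported as "13,000+" labels, the average computed here is a lower bound of roughly 25 labels per contract rather than an exact figure.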

ContractNLI is a dataset for document-level natural language inference (NLI) on contracts, containing 607 non-disclosure agreements (NDAs). Although it contains more contracts than CUAD, the individual contracts are considerably shorter, and the corpus as a whole is smaller. It also covers no contract type other than NDAs. More extensive knowledge of the context behind this data would likely improve the performance of models fine-tuned on it.
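
As a rough illustration of what document-level NLI on contracts involves, the sketch below pairs a whole contract with a hypothesis and a label. The class names, field names, and sample text are hypothetical, and the three-way label set is an assumption based on the usual NLI framing; the dataset's own documentation is the authority on its exact schema.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class NLILabel(Enum):
    """Assumed three-way label set for contract NLI."""
    ENTAILMENT = "entailment"
    CONTRADICTION = "contradiction"
    NOT_MENTIONED = "not_mentioned"


@dataclass(frozen=True)
class NLIExample:
    """One (contract, hypothesis, label) triple (hypothetical layout)."""
    contract_text: str  # the full NDA, not a single sentence
    hypothesis: str     # a statement to test against the contract
    label: NLILabel


def label_distribution(examples):
    """Count how often each label occurs in a list of examples."""
    return Counter(example.label for example in examples)


# Made-up example text; not taken from the dataset.
nda = "The Receiving Party shall keep all Confidential Information secret for five years."
examples = [
    NLIExample(nda, "Confidentiality obligations last for a fixed term.", NLILabel.ENTAILMENT),
    NLIExample(nda, "The Receiving Party may share information freely.", NLILabel.CONTRADICTION),
    NLIExample(nda, "The agreement is governed by the law of a named state.", NLILabel.NOT_MENTIONED),
]

distribution = label_distribution(examples)
```

The point of the structure is that the premise is the entire document, which is why contract length matters when comparing this corpus with CUAD.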
