Anomaly detection approaches are used to monitor equipment and detect failures, or to detect fraud in the banking sector. We tried applying these algorithms to the analysis of legal documents.
1. Dealing with Structured Information
We’re lucky! The text data in contracts is quite well-structured and compiled according to templates. As part of the project, we decided to implement a prototype based on 200,000 documents from the official Russian procurement system, zakupki.gov.ru. We were able to identify the structure of 170,000 of them: preambles, chapters, paragraphs, and annexes.
2. A Variety of Contract Types
Contracts can belong to various types. They differ in content, subject, main chapters, etc. To optimize the analysis for each type, contracts need to be classified or clustered. Perhaps you already know what types of contracts are present in your database and how they can be determined. In our case, we had a corpus of contracts with no additional information and no assumptions about the classification of procurement contracts. Therefore, we had to cluster the database of contracts.
Clustering can be done with a tf-idf weighting scheme, but we decided to try the Doc2Vec algorithm. Each document was represented as a vector, and the resulting contract vectors were fed to the clustering algorithm. We used K-means for clustering. Since similarity between such vectors is usually measured with cosine distance, we used it instead of Euclidean distance.
After getting 20 clusters of documents, we had to check the quality of the clustering. Since we had no labels for the contracts, we could not compare the clusters with existing definitions. Instead, we looked at the words describing each cluster. We took the text of the “Subject of agreement” line from the contracts in each cluster and removed all stop words, numbers, and popular words. For quality evaluation, we selected the 5 most common remaining words in each cluster as keywords.
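As a rough illustration of K-means with cosine distance, here is a minimal sketch in plain Python. It is not the code used in the project: the toy two-dimensional vectors stand in for Doc2Vec output, and the function names (`normalize`, `spherical_kmeans`) are hypothetical. The idea is to scale every vector to unit length, assign points to the centroid with the smallest cosine distance, and re-normalize the centroids after each update.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors already have unit length."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def spherical_kmeans(vectors, k, iterations=20, seed=0):
    """Cluster vectors with K-means, using cosine instead of Euclidean distance."""
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to the nearest centroid.
        labels = [
            min(range(k), key=lambda c: cosine_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: mean of the members, projected back onto the unit sphere.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels

# Toy stand-ins for document vectors (assumption: real ones come from Doc2Vec).
doc_vectors = [[1.0, 0.1], [2.0, 0.3], [0.1, 1.0], [0.2, 3.0]]
labels = spherical_kmeans(doc_vectors, k=2)
print(labels)
```

Normalizing first means vector length (for example, document size) does not affect the grouping; only direction does.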
Examples of words describing clusters:
- tenant, landlord, apartment, rental, developer
- competence, teaching, educational, academic, full-time
- general contractor, subcontractor, general construction, designer, urban planning
- pharmacy, quarantine, expend, phytosanitary, airtight
- detective, security guard, suppression, anxiety, offence
- licensee, sublicensee, film, licensor, relay
- borrower, escrow, creditor, loan, pledger
- centralized, energy supply, intrazonal, plumbing, sewer
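The keyword check described above can be sketched roughly as follows. This is an assumption-laden illustration, not the project code: the stop-word set is a tiny English placeholder, and `cluster_keywords` and its parameters are hypothetical names.

```python
import re
from collections import Counter

# Illustrative stop-word list only; a real one would be much longer.
STOP_WORDS = {"the", "of", "and", "a", "to", "for", "in", "is", "by"}

def cluster_keywords(subject_lines, popular_words=frozenset(), top_n=5):
    """Return the top_n most frequent words in a cluster's subject lines."""
    counts = Counter()
    for line in subject_lines:
        # Keep alphabetic tokens only, which also drops numbers.
        for token in re.findall(r"[^\W\d_]+", line.lower()):
            if token not in STOP_WORDS and token not in popular_words:
                counts[token] += 1
    return [word for word, _ in counts.most_common(top_n)]

subjects = [
    "The landlord rents the apartment to the tenant",
    "Rental of apartment No. 12 by the tenant",
    "The tenant pays the landlord for the apartment",
]
keywords = cluster_keywords(subjects, popular_words={"no"}, top_n=3)
print(keywords)
```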
3. What Anomalies Can Be Found in Contracts
Let’s determine what cases we consider abnormal and what can be done with them. We identified the following scenarios:
- A contract has an extra clause that has never been seen in this context before. It is necessary to draw the lawyer’s attention to it.
- A clause that usually appears in documents of this type is missing from the contract. The lawyer should be advised to add it.
- A paragraph is similar to the template but has been rephrased, with some words added or deleted. The lawyer is warned and prompted to edit it.
4. Ways to Represent Contracts
A contract consists of chapters; chapters are divided into paragraphs, and each paragraph may contain subparagraphs, etc. To divide the contract into chapters and paragraphs, we used the numbering and keywords like “Chapter”, “Article”, etc. Each paragraph consists of one or more sentences. To divide a paragraph into sentences, we used sent_tokenize from the nltk.tokenize module.
A contract consists of several main chapters whose essence and content can be identified by their headings: subject, rights and obligations of the parties, price and payment procedure. We tried to combine the chapters with identical headings and work with each group of chapters independently. The headings are often rephrased and contain typos or extra punctuation marks. To make the groups of chapters large enough, we merged headings that are close to each other in Levenshtein distance into one group.
One problem we ran into is the large number of entity names (personal names, company names, dates, addresses, etc.), which are usually unique and can therefore be mistaken for anomalies. Such entities have to be found and removed from the contract, i.e. the contract has to be turned into a template. We were lucky because the dataset had a large proportion of template contracts, where entities are replaced by underscores. We identified the phrases where underscores usually occur, found them in real documents and removed the entity names. Then we used the Natasha library to find and remove the remaining entities in the texts.
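A minimal sketch of the chapter and paragraph splitting might look like this. The heading and numbering patterns are assumptions chosen for illustration (real contracts would need broader rules), and `split_contract` is a hypothetical name. Sentence splitting is left to `sent_tokenize`, as described above, and is not shown.

```python
import re

# Assumed heading forms: "Chapter 1", "Article 2. Price", or "3. Some title".
HEADING = re.compile(
    r"^(?:(?:Chapter|Article)\s+\d+\.?(?:\s+\S.*)?|\d+\.\s+\S.*)$",
    re.IGNORECASE,
)
# Assumed paragraph numbering such as "1.1." or "2.3.4".
PARAGRAPH = re.compile(r"^\d+(?:\.\d+)+\.?\s+")

def split_contract(text):
    """Split raw contract text into chapters, each with a list of paragraphs."""
    chapters = [{"heading": "PREAMBLE", "paragraphs": []}]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if HEADING.match(line):
            chapters.append({"heading": line, "paragraphs": []})
        elif PARAGRAPH.match(line) or not chapters[-1]["paragraphs"]:
            # A numbered line (or the first line of a chapter) starts a paragraph.
            chapters[-1]["paragraphs"].append(line)
        else:
            # An unnumbered line continues the previous paragraph.
            chapters[-1]["paragraphs"][-1] += " " + line
    # Drop the preamble placeholder if nothing preceded the first heading.
    return [c for c in chapters if c["heading"] != "PREAMBLE" or c["paragraphs"]]

sample = """CONTRACT
1. Subject of the Contract
1.1. The Supplier delivers the goods.
1.2. The Customer accepts the goods
and pays for them.
Article 2. Price
2.1. The price is fixed."""
chapters = split_contract(sample)
for chapter in chapters:
    print(chapter["heading"], len(chapter["paragraphs"]))
```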
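To illustrate grouping headings by Levenshtein distance, here is a small sketch. The normalization step, the greedy grouping strategy, and the `max_ratio` threshold are assumptions made for the example; the original grouping rule is not specified.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]

def canonical(heading):
    """Lowercase and drop numbering and punctuation before comparing."""
    letters = "".join(ch if ch.isalpha() else " " for ch in heading.lower())
    return " ".join(letters.split())

def group_headings(headings, max_ratio=0.2):
    """Greedily put each heading into the first group whose key is close enough."""
    groups = []  # list of (canonical key of the first member, members)
    for heading in headings:
        key = canonical(heading)
        for representative, members in groups:
            limit = max_ratio * max(len(key), len(representative))
            if levenshtein(key, representative) <= limit:
                members.append(heading)
                break
        else:
            groups.append((key, [heading]))
    return [members for _, members in groups]

groups = group_headings([
    "1. Subject of the Contract",
    "SUBJECT OF THE CONTRCT:",
    "3. Price and payment procedure",
    "Price and Payment Procedure.",
])
print(groups)
```

A relative threshold is used here so that short headings are not merged as easily as long ones; an absolute cutoff would work too.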
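The underscore-based step can be sketched as follows: each template line with blanks is turned into a regular expression, and any real line that fits it is replaced by the blanked version. This is only an illustration under simplifying assumptions (whole-line matching, hypothetical function names); the Natasha-based step for the remaining entities is not shown.

```python
import re

def template_to_regex(template_line):
    """Turn a template line with underscore blanks into a compiled regex."""
    literal_parts = re.split(r"_{2,}", template_line)
    # Each run of underscores becomes a lazy "anything" group.
    pattern = "(.+?)".join(re.escape(part) for part in literal_parts)
    return re.compile(pattern)

def mask_entities(line, template_lines, blank="____"):
    """Replace a filled-in line by its template form, if a template fits."""
    for template in template_lines:
        if template_to_regex(template).fullmatch(line):
            return re.sub(r"_{2,}", blank, template)
    return line  # no template fits: leave the line unchanged

templates = [
    "Contract No. ____ dated ____",
    "The Customer, ________, represented by ____",
]
masked = mask_entities("Contract No. 45-A dated 12 March 2019", templates)
print(masked)
```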
5. Defining Abnormal Clauses in Contracts
By this point we could divide contracts by type with clustering and identify groups of similar chapters. Using this knowledge about contract types, we could estimate how likely each sentence is to be abnormal.
Since we had collected many contract templates, we trained a Word2Vec model for each group of chapters. Each sentence was then mapped to a vector: the tf-idf-weighted sum of the vectors of its words. The sentence vectors were clustered in the same way as the document vectors.
Now, when a new sentence appears, we determine which cluster of contracts and which group of chapters it belongs to, and find its closest match in the nearest sentence cluster. The distance to the nearest sentence can be regarded as a measure of how abnormal the sentence is. For instance, if the distance is zero, the sentence is not abnormal. Conversely, the larger the distance, the more likely it is that the sentence contains an anomaly.
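A sentence vector of this kind can be sketched in a few lines. The word vectors below are toy values standing in for a trained Word2Vec model, and the exact tf-idf formula (relative term frequency times a smoothed idf) is an assumption; the original weighting variant is not specified.

```python
import math
from collections import Counter

def idf_weights(tokenized_sentences):
    """Smoothed inverse document frequency, treating each sentence as a document."""
    n = len(tokenized_sentences)
    document_frequency = Counter(
        word for sentence in tokenized_sentences for word in set(sentence)
    )
    return {w: math.log((1 + n) / (1 + df)) + 1 for w, df in document_frequency.items()}

def sentence_vector(tokens, word_vectors, idf):
    """tf-idf-weighted sum of the word vectors of one sentence."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word, count in Counter(tokens).items():
        if word in word_vectors and word in idf:
            weight = (count / len(tokens)) * idf[word]
            for i, x in enumerate(word_vectors[word]):
                total[i] += weight * x
    return total

# Toy 2-dimensional word vectors (assumption: real ones come from Word2Vec).
word_vectors = {
    "the": [0.5, 0.5], "supplier": [1.0, 0.0],
    "delivers": [0.0, 1.0], "goods": [1.0, 1.0],
}
corpus = [
    ["the", "supplier", "delivers", "the", "goods"],
    ["the", "customer", "accepts", "the", "goods"],
]
idf = idf_weights(corpus)
vector = sentence_vector(corpus[0], word_vectors, idf)
print(vector)
```

Words that occur in every sentence, such as "the", get the lowest weight, so they contribute less to the sentence vector than rarer, more specific words.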
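The scoring step can be illustrated with a short sketch: pick the nearest sentence cluster by its centroid, then take the cosine distance to the nearest stored sentence in that cluster as the anomaly score. The data layout (a list of dictionaries with `centroid` and `sentences`) and the function names are assumptions made for this example.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def anomaly_score(vector, clusters):
    """Return (distance to the nearest known sentence, that sentence's text)."""
    nearest_cluster = min(
        clusters, key=lambda c: cosine_distance(vector, c["centroid"])
    )
    text, nearest_vector = min(
        nearest_cluster["sentences"],
        key=lambda item: cosine_distance(vector, item[1]),
    )
    return cosine_distance(vector, nearest_vector), text

# Toy sentence clusters with 2-dimensional vectors.
clusters = [
    {"centroid": [1.0, 0.0], "sentences": [
        ("The Supplier delivers the goods.", [1.0, 0.1]),
        ("The Supplier ships the goods.", [0.9, 0.0]),
    ]},
    {"centroid": [0.0, 1.0], "sentences": [
        ("The Customer pays the price.", [0.1, 1.0]),
    ]},
]
known_score, _ = anomaly_score([1.0, 0.1], clusters)
odd_score, nearest_text = anomaly_score([0.8, 0.6], clusters)
print(known_score, odd_score, nearest_text)
```

Searching only inside the nearest cluster is a speed trade-off: the true nearest sentence could occasionally sit in a neighbouring cluster.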
6. How to Deal with Missing Paragraphs
We figured out how to find abnormal paragraphs in contracts, but not yet how to find anomalies such as missing sentences or paragraphs. Such anomalies are easy to find if there is a template for this type of contract. However, sometimes different templates are in use. To detect such anomalies, a template has to be built that includes the required set of sentences and clauses, drawn from those indexed in our contract database.
We tested an algorithm for constructing templates. The algorithm assumes that our contract database contains a similar chapter with a correct set of paragraphs, which we want to identify and use as a template.
- Build a MinHashLSH index for every group of chapters, which makes it possible to find similar texts quickly.
- For each chapter of the uploaded contract, find a list of similar paragraphs in the database.
- Build a language model based on similar paragraphs and choose the paragraph predicted by the language model as a template.
Once there is a template for each chapter, the missing parts of a contract can be easily detected and suggested for addition.
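To give an idea of what MinHash does, here is a self-contained sketch of MinHash signatures in plain Python. It covers only the signature and the Jaccard estimate; the LSH banding index that makes lookups fast is omitted, and the shingle size, number of hash functions, and function names are assumptions for illustration.

```python
import hashlib

def shingles(text, size=3):
    """Set of overlapping word n-grams of a text."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode("utf-8"), digest_size=8).digest(),
                "big",
            )
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Share of positions where two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "The Supplier shall deliver the goods within ten days"
b = "The Supplier shall deliver the goods within thirty days"
c = "Payment is made by bank transfer"
sig_a, sig_b, sig_c = (minhash_signature(shingles(t)) for t in (a, b, c))
similar = estimated_jaccard(sig_a, sig_b)
different = estimated_jaccard(sig_a, sig_c)
print(similar, different)
```

The signatures have a fixed length regardless of text size, which is what allows an index over them to retrieve similar paragraphs without comparing full texts.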
7. Full Pipeline
Collecting, processing and storing the contracts corpus:
- Collect the corpus of template contracts.
- Classify/cluster contracts by type.
- Divide the contract into chapters, paragraphs and sentences.
- Remove entity names from contracts.
- Group chapters by their headings.
- For each chapter group, train a Word2Vec model.
- Map each sentence to the sum of the vectors of its words.
- Cluster the sentence vectors and store each cluster separately, so that the closest vector in the closest cluster can be found quickly.
- For each chapter group, build a MinHashLSH index.
Finding anomalies in a new document:
Highlighting abnormal paragraphs:
- Define the type of contract (class or cluster).
- Divide the document into chapters, paragraphs and sentences.
- For each chapter, find an appropriate group of chapters in the database.
- Match each sentence with a vector.
- Find the closest sentence cluster and the closest sentence for each original sentence.
- Calculate the distances between sentence vectors and colour the sentences based on the distances.
- Highlight parts of the sentence if it differs from the nearest one by only a few words.
- Advise editing.
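The word-level highlighting step above can be sketched with the standard `difflib` module. The markers (`[[...]]` for changed or added words, `[[-...-]]` for words missing from the sentence) and the `max_changed_words` cutoff are assumptions for illustration; a real interface would use colours instead.

```python
import difflib

def highlight_differences(sentence, nearest, max_changed_words=3):
    """Mark the words in which `sentence` differs from `nearest`.

    Returns None if the sentences are identical or differ by too many words.
    """
    reference, candidate = nearest.split(), sentence.split()
    matcher = difflib.SequenceMatcher(a=reference, b=candidate, autojunk=False)
    output, changed = [], 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            output.extend(candidate[j1:j2])
            continue
        changed += max(i2 - i1, j2 - j1)
        if tag == "delete":
            # Words present in the reference but missing from the sentence.
            output.append("[[-" + " ".join(reference[i1:i2]) + "-]]")
        else:
            # Words replaced or inserted in the sentence.
            output.append("[[" + " ".join(candidate[j1:j2]) + "]]")
    if changed == 0 or changed > max_changed_words:
        return None
    return " ".join(output)

marked = highlight_differences(
    "The Supplier shall deliver the goods within thirty days",
    "The Supplier shall deliver the goods within ten days",
)
print(marked)
```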
Searching for missing items:
- For each chapter, build a template.
- Advise adding the missing parts from the template.
- Highlight abnormal paragraphs
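Once a template exists for a chapter, the check for missing parts can be sketched as below. Word-level similarity from `difflib` is swapped in here as a simple stand-in for the vector distances used in the rest of the pipeline, and the `threshold` value is an arbitrary assumption.

```python
import difflib

def word_similarity(a, b):
    """Similarity ratio between two texts, compared word by word."""
    return difflib.SequenceMatcher(
        None, a.lower().split(), b.lower().split(), autojunk=False
    ).ratio()

def missing_paragraphs(chapter_paragraphs, template_paragraphs, threshold=0.6):
    """Template paragraphs that have no sufficiently similar paragraph in the chapter."""
    missing = []
    for expected in template_paragraphs:
        best = max(
            (word_similarity(expected, actual) for actual in chapter_paragraphs),
            default=0.0,
        )
        if best < threshold:
            missing.append(expected)
    return missing

template = [
    "The Supplier shall deliver the goods within ____ days.",
    "The Customer shall pay within ____ days of delivery.",
    "Disputes are settled by arbitration.",
]
chapter = [
    "The Supplier shall deliver the goods within 10 days.",
    "The Customer shall pay within 30 days of delivery.",
]
missing = missing_paragraphs(chapter, template)
print(missing)
```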
8. Quality Assessment
We created a test set of contracts to assess the quality of the system. We added some anomalies ourselves by deleting some of the words, inserting words or phrases into sentences, inserting sentences from other chapters, and deleting sentences. We evaluated the detection quality for each type of anomaly.
The algorithm detects incorrect insertions in 4 out of 5 cases. The more template samples and contract clusters are collected, the better the quality of assessment will be.
To make the tool more convenient to use, we developed a web interface. New contracts can be uploaded there and analysed by the algorithm. It highlights abnormal words and sentences in specific colours. The darker the colour, the higher the likelihood of an anomaly. The user is prompted to edit the highlighted part and is shown a recommended sentence.
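The four kinds of synthetic anomalies used for testing can be generated along these lines. This is a hypothetical sketch: the function names are invented, the random choices are seeded for repeatability, and the detection flags at the end simply restate the 4-out-of-5 rate as a worked calculation rather than measured data.

```python
import random

def delete_words(sentence, n, rng):
    """Remove n randomly chosen words (always leaving at least one)."""
    words = sentence.split()
    for _ in range(min(n, len(words) - 1)):
        del words[rng.randrange(len(words))]
    return " ".join(words)

def insert_words(sentence, extra_words, rng):
    """Insert each extra word at a random position."""
    words = sentence.split()
    for word in extra_words:
        words.insert(rng.randrange(len(words) + 1), word)
    return " ".join(words)

def insert_foreign_sentence(sentences, foreign, rng):
    """Insert a sentence from another chapter at a random position."""
    result = list(sentences)
    result.insert(rng.randrange(len(result) + 1), foreign)
    return result

def delete_sentence(sentences, rng):
    """Remove one randomly chosen sentence."""
    result = list(sentences)
    del result[rng.randrange(len(result))]
    return result

def detection_rate(flags):
    """Share of injected anomalies that were flagged."""
    return sum(flags) / len(flags)

rng = random.Random(42)
original = "The Supplier shall deliver the goods within ten days"
shortened = delete_words(original, 2, rng)
extended = insert_words(original, ["immediately"], rng)
print(shortened)
print(extended)
print(detection_rate([True, True, True, True, False]))  # 4 out of 5
```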
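As a small illustration of the colouring rule, the distance of a sentence can be mapped linearly to a shade, so that larger distances give darker colours. The two end colours and the `max_distance` scale are arbitrary assumptions, not the values used in the actual interface.

```python
LIGHT = (255, 235, 235)  # assumed colour for distance 0
DARK = (139, 0, 0)       # assumed colour for the maximum distance

def highlight_colour(distance, max_distance=1.0):
    """Map a distance to a hex colour; larger distances give darker shades."""
    level = min(max(distance / max_distance, 0.0), 1.0)
    channels = [round(lo + (hi - lo) * level) for lo, hi in zip(LIGHT, DARK)]
    return "#{:02x}{:02x}{:02x}".format(*channels)

print(highlight_colour(0.0), highlight_colour(0.5), highlight_colour(1.0))
```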
10. Where Can It Be Used?
This solution will help companies that need massive sets of documents checked and assessed. Hundreds of pages can be analysed almost instantly, and abnormal passages will be highlighted for the lawyer’s review. Of course, the algorithm will not replace real lawyers, but it will definitely make their lives easier.