
The results are drawn from experiments on the classification of legal documents using LLMs in a real-world institutional setting.


Evaluation of Large Language Models for Classifying Legal Documents in Portuguese

The growing procedural demand in legal institutions has led to work overload, impacting the efficiency of the judicial system. This scenario, worsened by limited human resources, highlights the need for technological solutions that speed up the processing and analysis of documents. In view of this reality, this work proposes a pipeline for automating the classification of legal documents, evaluating four methods of representing legal texts at the pipeline input: original text, summaries, centroids, and document descriptions. The pipeline was developed and tested at the Public Defender’s Office of the State of Goiás (DPE-GO). Each approach implements a specific strategy for structuring the input texts, aiming to improve the models' ability to interpret and classify legal documents. A new Portuguese-language dataset, developed for this application, was introduced, and the performance of Large Language Models (LLMs) was evaluated on classification tasks. The results show that using summaries improves classification accuracy and maximizes the F1-score, optimizing the use of LLMs by reducing the number of processed tokens without compromising precision. These results highlight the impact of the textual representation of documents and the potential of LLMs for the automatic classification of legal documents, as in the case of DPE-GO. The contributions of this work indicate that LLMs, combined with optimized textual representations, can increase the productivity and quality of services provided by legal institutions, promoting advances in the overall efficiency of the judicial system.

Dissertation Link

Prerequisites

  • Python 3.8+
  • Google Colab (or local environment with Jupyter Notebook)
  • API key for the model provider (the examples below use Together.AI)

Installation

Install the necessary dependencies:

pip install tiktoken langchain-community langchainhub langchain_openai langchain pandas matplotlib scikit-learn seaborn
pip install imbalanced-learn
  1. Mount Google Drive:
    • The project requires access to Google Drive to load the corpus and save the results.
    • Run the following command in Colab:
from google.colab import drive
drive.mount('/content/drive')
  2. Model Configuration:
    • Configure ChatOpenAI with your API key and the desired model. The LLMs used here are available from Together.AI, whose API is OpenAI-compatible.

Llama-3.1-70B-Instruct-Turbo

from langchain_openai import ChatOpenAI

api_key = 'YOUR_API_KEY'  # Together.AI API key
llm = ChatOpenAI(
    api_key=api_key,
    base_url='https://api.together.xyz/v1',  # Together.AI's OpenAI-compatible endpoint
    model='meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo',
)
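As a quick, optional sanity check that the key and endpoint work (the prompt below is arbitrary and only illustrative):

# Minimal sanity check: send a trivial prompt and print the model's reply.
response = llm.invoke('Responda em uma palavra: qual é a capital de Goiás?')
print(response.content)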
  3. Preparation of Corpus:

    • Load the corpus of legal documents from Google Drive (a minimal sketch follows below).

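A minimal sketch of this step, assuming the corpus is stored as a CSV file on Drive; the path and the text/label column names are hypothetical placeholders, not the actual corpus layout:

import pandas as pd

# Hypothetical path and column names; adjust to the actual corpus layout on Drive.
CORPUS_PATH = '/content/drive/MyDrive/dpe_go_corpus.csv'

df = pd.read_csv(CORPUS_PATH)              # one row per legal document
df = df.dropna(subset=['text', 'label'])   # keep rows with text and a class label
print(df['label'].value_counts())          # inspect the class distribution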
  4. Classification and Generation of Summaries:

    • Use the custom methods (Direct-Approach, Centroids, Descriptions, and Summary) to classify and summarize the legal texts (a hedged classification sketch follows below).
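As an illustration of the summary-based variant, the sketch below reuses the llm object and the df corpus from the previous snippets; the label set, prompt wording, and helper functions are hypothetical placeholders, not the prompts used in the study:

# Hypothetical label set; the real classes come from the DPE-GO corpus.
CANDIDATE_LABELS = ['petição inicial', 'sentença', 'contestação']

def summarize(text: str) -> str:
    """Summary representation: condense the document before classification."""
    prompt = f'Resuma o documento jurídico a seguir em poucas frases:\n\n{text}'
    return llm.invoke(prompt).content

def classify(text: str) -> str:
    """Classify a document from its summary into one of the candidate labels."""
    summary = summarize(text)
    prompt = (
        'Classifique o documento jurídico abaixo em exatamente uma das classes: '
        f"{', '.join(CANDIDATE_LABELS)}.\n\nResumo do documento:\n{summary}\n\nClasse:"
    )
    return llm.invoke(prompt).content.strip()

df['predicted_label'] = df['text'].apply(classify)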
  5. Metrics and Visualizations:

    • Calculate metrics such as precision, recall, F1-score, AUC-ROC, AUC-PR, and MCC (a minimal sketch follows below).
    • Visualize results using bar charts, histograms, and confusion matrices.
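A minimal sketch of the metric computation with scikit-learn, reusing df from the previous snippets; AUC-ROC and AUC-PR require per-class scores rather than hard labels and are omitted here:

from sklearn.metrics import classification_report, confusion_matrix, matthews_corrcoef
import matplotlib.pyplot as plt
import seaborn as sns

y_true = df['label']
y_pred = df['predicted_label']

# Per-class precision, recall, and F1-score, plus the Matthews correlation coefficient.
report = classification_report(y_true, y_pred)
print(report)
print('MCC:', matthews_corrcoef(y_true, y_pred))

# Confusion matrix rendered as a heatmap.
cm = confusion_matrix(y_true, y_pred, labels=CANDIDATE_LABELS)
sns.heatmap(cm, annot=True, fmt='d', xticklabels=CANDIDATE_LABELS, yticklabels=CANDIDATE_LABELS)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()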
  6. Saving the Results:

    • Save results, including classifications, metrics reports, and confusion matrices, directly to Google Drive (a minimal sketch follows below).
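A minimal sketch of this step; the output directory is a hypothetical example, and report comes from the metrics snippet above:

import os

# Hypothetical output directory on Drive.
RESULTS_DIR = '/content/drive/MyDrive/dpe_results'
os.makedirs(RESULTS_DIR, exist_ok=True)

# Persist the predictions and the metrics report alongside the corpus.
df.to_csv(os.path.join(RESULTS_DIR, 'classifications.csv'), index=False)
with open(os.path.join(RESULTS_DIR, 'classification_report.txt'), 'w') as f:
    f.write(report)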
Credits

This project was developed by MSc. Eng. Willgnner Ferreira Santos (Lattes), Prof. Dr. Arlindo Rodrigues Galvão Filho (Lattes), and Prof. Dr. Sávio Salvarino Teles de Oliveira (Lattes), as part of a study on the evaluation of large language models for classifying legal documents in Portuguese.
