Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 

README.md

Extraction with LLMs

The code for chemical information extraction from PDF and images of PDF pages using GPT-4o as a baseline model.

🔧 Installation

pip install -r requirements.txt

Poppler has to be installed and added to PATH, follow the instructions here.

🚀 Usage

  1. Put open access article PDFs into data/pdfs/pdf_<dataset> folders.

  2. Merge article and supporting infromation files. Dataset keys: oxazolidinone, benzimidazole, cocrystals, complexes, nanozymes, magnetic, cytotoxicity, seltox, synergy.

python src/merge_suppl.py --dataset <dataset>

  1. Convert PDF into JPEG images

python src/pdf_to_images.py --dataset <dataset> --poppler_path <poppler_path>

  1. Extraction from PDF

python src/pdf_extraction.py --dataset <dataset> --openai_api_key <YOUR_API_KEY>

Results will appear in the result/from_pdf folder.

  1. Extraction from images

python src/images_extraction.py --dataset <dataset> --openai_api_key <YOUR_API_KEY>

Results will appear in the result/from_image folder.

  1. Calculate metrics

python src/metric_calc.py --dataset <dataset> --source <pdf_or_image>

Metrics will appear in the result/metrics folder.