The code for chemical information extraction from PDF and images of PDF pages using GPT-4o as a baseline model.
pip install -r requirements.txt
Poppler has to be installed and added to PATH, follow the instructions here.
-
Put open access article PDFs into
data/pdfs/pdf_<dataset>folders. -
Merge article and supporting infromation files. Dataset keys:
oxazolidinone, benzimidazole, cocrystals, complexes, nanozymes, magnetic, cytotoxicity, seltox, synergy.
python src/merge_suppl.py --dataset <dataset>
- Convert PDF into JPEG images
python src/pdf_to_images.py --dataset <dataset> --poppler_path <poppler_path>
- Extraction from PDF
python src/pdf_extraction.py --dataset <dataset> --openai_api_key <YOUR_API_KEY>
Results will appear in the result/from_pdf folder.
- Extraction from images
python src/images_extraction.py --dataset <dataset> --openai_api_key <YOUR_API_KEY>
Results will appear in the result/from_image folder.
- Calculate metrics
python src/metric_calc.py --dataset <dataset> --source <pdf_or_image>
Metrics will appear in the result/metrics folder.