Commit 4ec7b0b

Document and prepare for merge
1 parent d6fd09b

File tree

4 files changed: +614 additions, -215 deletions


library/pyproject.toml

Lines changed: 5 additions & 1 deletion

@@ -34,6 +34,7 @@ packages = ["src/library"]
 dev = [
     "ipykernel>=7.1.0",
     "ipywidgets>=8.1.8",
+    "matplotlib>=3.10.8",
 ]

 gpu = [
@@ -42,10 +43,13 @@ gpu = [
     "ollama>=0.4.7",
 ]

-scraping = [
+pdfscraping = [
     "pymupdf4llm>=0.0.17",
     "pymupdf-layout>=1.26.6",
     "opencv-python-headless>=4.12.0.88",
+]
+
+webscraping = [
     "selenium>=4.29.0",
     "webdriver-manager>=4.0.2",
 ]

library/scraping/README.md

Lines changed: 88 additions & 0 deletions
# Scraping

This part of the project aims to transform the metadata obtained from the prescreening phase into a usable text dataset, ready for ingestion into a vector DB and for policy analysis.

## Obtaining usable PDFs

The prescreening phase gave us ~2.5M publications from OpenAlex, ~1.7M of which are open access.
Among those, ~1.3M have a direct PDF URL, while the rest only have web URLs, which sometimes contain the full text but most of the time link to a landing page with a button to download the PDF.

For simplicity, we chose to focus on publications with a direct PDF URL.
Some of those link to actual PDF files, while others link to web pages that load the PDF with JavaScript, or to web pages with anti-scraping measures.

Trying to obtain the full text from all those formats would take an unreasonable amount of effort for this project.
Thus, again for simplicity, we only processed publications with a real PDF URL, yielding ~690k PDF files that we downloaded (which wasn't mandatory and takes about 2TB).

We divide those into 6 batches of 100k files and 1 batch (batch_7) of 90k files.
This batch structure is preserved for the following steps.

The script used to do so is `download_all_pdfs.py`, with helper functions in `src/scraping/download_pdf`.

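The repository script itself isn't shown in this diff, so here is only a rough sketch of what the download step might look like; the function names `download_pdf` and `batch_name` are hypothetical, not the project's API:

```python
from pathlib import Path
from urllib.request import Request, urlopen

def download_pdf(url: str, dest: Path, timeout: int = 30) -> bool:
    """Download one PDF; return True on success, False otherwise.
    A browser-like User-Agent avoids some trivial 403 rejections."""
    try:
        req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(req, timeout=timeout) as resp:
            data = resp.read()
        # A real PDF starts with the magic bytes "%PDF" -- skip HTML error pages.
        if not data.startswith(b"%PDF"):
            return False
        dest.write_bytes(data)
        return True
    except Exception:
        return False

def batch_name(index: int, batch_size: int = 100_000) -> str:
    """Map a document index to its batch folder (batch_1 ... batch_7)."""
    return f"batch_{index // batch_size + 1}"
```

Failed downloads are simply reported as `False` here; at ~690k files, a real run would also want retries and parallelism.
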
## Extracting text from PDFs

PDF text extraction is more challenging than it sounds, especially at scale.
The goal is to convert each PDF into an equivalent text file, usually in markdown format.
Popular libraries like docling handle it very well, but they are too expensive to run at our scale and with our means: quick tests showed, after extrapolation, that converting the 690k PDF files into markdown would require ~8500 H100-hours, almost a full year on a single GPU.

A much faster library with good results is `pymupdf4llm`.
However, we encountered major hurdles in the form of memory issues, with OOM errors quickly killing our jobs.
We could not overcome them despite several days of debugging:

- Disabling OCR didn't work.
- Reducing the memory overhead of the main process didn't work.
- Processing pages one by one or in larger chunks didn't work.
- Limiting the maximum memory used by a single worker didn't work.

Thus, we switched to a lower-quality but simple and very fast method: extracting raw text from PDFs without layout analysis.

The script to do so (with md or raw text option) is `extract_text_from_pdfs.py`, with helper functions in `src/scraping/extract_pdf_content.py`.

This creates a txt file for each PDF file in a given folder.
We then gather these texts into a parquet file using `save_txt_as_parquet`.

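`save_txt_as_parquet` itself isn't reproduced here; a minimal sketch of the gathering step, assuming pandas and one `doc_id`/`text` row per file (the column names are assumptions, not the project's schema):

```python
from pathlib import Path

import pandas as pd

def gather_txts(folder: Path) -> pd.DataFrame:
    """Collect every .txt file in `folder` into one dataframe,
    keyed by the file stem (used here as the document id)."""
    records = [
        {"doc_id": p.stem, "text": p.read_text(encoding="utf-8", errors="replace")}
        for p in sorted(folder.glob("*.txt"))
    ]
    return pd.DataFrame(records, columns=["doc_id", "text"])

# Hypothetical usage:
# gather_txts(Path("batch_1")).to_parquet("batch_1.parquet")
```

Reading with `errors="replace"` keeps a malformed file from aborting a 100k-file batch.
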
Raw text is low-quality: sentences are interrupted by random line breaks that are virtually indistinguishable from the real line breaks we want to keep; headers, footers and page numbers can appear in the middle of a paragraph that spans two pages; and tables come out completely unstructured, often with one cell value per line, in a way that makes confident reconstruction impossible.

Raw text must therefore be cleaned: headers, footers, page numbers, tables and other garbage lines should be removed. Cleaning code is available in `src/scraping/clean` and gathered in `src/scraping/clean/cleaning_pipeline.py`.

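The actual pipeline lives in `src/scraping/clean`; the sketch below only illustrates the kind of heuristics involved (the regex and the repeat threshold are illustrative assumptions):

```python
import re
from collections import Counter

# Lines that are just a (possibly labelled) page number.
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+\s*$", re.IGNORECASE)

def clean_raw_text(text: str, repeat_threshold: int = 3) -> str:
    """Drop page-number lines and short lines repeated across pages,
    which are typically running headers or footers."""
    lines = text.splitlines()
    counts = Counter(line.strip() for line in lines if line.strip())
    kept = []
    for line in lines:
        s = line.strip()
        if not s:
            continue
        if PAGE_NUMBER.match(s):
            continue
        if counts[s] >= repeat_threshold and len(s) < 80:
            continue  # likely a running header/footer
        kept.append(s)
    return "\n".join(kept)
```

The length cap avoids discarding a genuine sentence that happens to repeat, at the cost of missing unusually long headers.
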
## Extracting sections

We decided to keep only the results and discussion/conclusion sections of each article, both to reduce scale (and thus cost and latency) and to get a higher signal-to-noise ratio for policy analysis.

The markdown format would make this easy using heading markers (`#`), although PDF-to-md tools don't preserve title hierarchy well; since we ended up with raw text, we used a regex-based method instead. The code is in `src/scraping/extract_sections.py`.

The script that takes the raw text, cleans it and extracts sections is `extract_sections_from_raw_text.py`.
For each batch, it saves the results as a dataframe in a `processed_text.parquet` file.

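The real patterns are in `src/scraping/extract_sections.py`; this is only a toy illustration of the regex-based idea (the pattern and names are assumptions, not the project's code):

```python
import re

# Heading-like lines such as "3. Results", "Discussion", "Conclusions".
SECTION_START = re.compile(
    r"^\s*\d{0,2}\.?\s*(results?|discussion|conclusions?)\b.{0,40}$",
    re.IGNORECASE | re.MULTILINE,
)

def extract_sections(text: str) -> dict[str, str]:
    """Split raw text at lines that look like target section headings
    and return the text under each of them."""
    matches = list(SECTION_START.finditer(text))
    sections: dict[str, str] = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        name = m.group(1).lower().rstrip("s")  # "Results" -> "result"
        sections.setdefault(name, text[m.end():end].strip())
    return sections
```

A pattern this loose will occasionally fire on a sentence that starts with "Results", which is part of why regex extraction on raw text only recovers sections for most, not all, documents.
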
We then created a "final" parquet file with only the results and conclusions from all batches: `results_conclusions_585k_2025-01-02.parquet`.
It contains 585k lines out of the 690k initial documents, as not all of them contain one of those sections.

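A sketch of how such a final file could be assembled from the per-batch frames, assuming hypothetical column names `results` and `conclusion`:

```python
import pandas as pd

def combine_batches(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Concatenate per-batch dataframes and keep only rows where at
    least one target section was found (non-empty)."""
    df = pd.concat(frames, ignore_index=True)
    has_section = (
        df["results"].fillna("").str.len().gt(0)
        | df["conclusion"].fillna("").str.len().gt(0)
    )
    return df[has_section].reset_index(drop=True)

# Hypothetical usage:
# frames = [pd.read_parquet(f"batch_{i}/processed_text.parquet") for i in range(1, 8)]
# combine_batches(frames).to_parquet("results_conclusions.parquet")
```

Dropping section-less rows at this stage is what takes the count from 690k documents down to the 585k lines mentioned above.
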
## Quick start

Install uv ([instructions](https://docs.astral.sh/uv/getting-started/installation/)):

```
curl -LsSf https://astral.sh/uv/install.sh | sh
```

If you plan to use pymupdf4llm OCR (not recommended), [install tesseract](https://tesseract-ocr.github.io/tessdoc/Installation.html):

```
sudo apt install tesseract-ocr
```

Create the venv using the `pdfscraping` group:

```
uv sync --group pdfscraping
```

The `webscraping` group is used by old code that relied on selenium to obtain more PDFs.

Run the script of your choice using the venv:

```
uv run python myscript.py [cli args]
```

Read the sections above to understand which script to run.

library/scraping/serial_extract_text_from_pdfs.py

Lines changed: 0 additions & 116 deletions
This file was deleted.
