# Scraping

This part of the project transforms the metadata obtained during the prescreening phase into a usable text dataset, ready for ingestion into a vector DB and for policy analysis.

## Obtaining usable PDFs

The prescreening phase gave us ~2.5M publications from OpenAlex, ~1.7M of which are open access.
Among those, ~1.3M have a direct PDF URL, while the rest only have web URLs, which sometimes contain the full text but most of the time lead to a landing page with a button to download the PDF.

For simplicity, we chose to focus on publications with a direct PDF URL.
Some of those link to actual PDF files, while others link to web pages that load the PDF with JavaScript, or to web pages with anti-scraping measures.

Trying to obtain the full text from all those formats would take an unreasonable amount of effort for this project.
Thus, again for simplicity, we only processed publications whose URL points to a real PDF file, yielding ~690k PDFs that we downloaded (which wasn't mandatory and takes about 2 TB of storage).

We divide those into 6 batches of 100k files and 1 batch (batch_7) of 90k files.
This batch structure is preserved for the following steps.

The script used to do so is `download_all_pdfs.py`, with helper functions in `src/scraping/download_pdf`.
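
The core of the download step can be sketched as below; the function names are illustrative, not the project's actual API. A cheap way to catch "fake" PDF URLs (landing pages, JavaScript viewers) is to check for the `%PDF` magic bytes before saving:

```
import urllib.request


def looks_like_pdf(data: bytes) -> bool:
    # Real PDF files always start with the magic bytes "%PDF".
    return data[:4] == b"%PDF"


def download_pdf(url: str, dest_path: str, timeout: int = 30) -> bool:
    """Download one PDF; return False if the URL does not serve an actual PDF."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            data = resp.read()
    except OSError:
        return False
    if not looks_like_pdf(data):
        # Landing page or JavaScript viewer, not a real PDF file.
        return False
    with open(dest_path, "wb") as f:
        f.write(data)
    return True
```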

## Extracting text from PDFs

PDF text extraction is more challenging than it sounds, especially at scale.
The goal is to convert each PDF into an equivalent text file, usually in markdown format.
Popular libraries like docling handle this very well, but they are too expensive for our scale and means: quick tests, after extrapolation, showed that converting the 690k PDF files into markdown would require ~8500 H100-hours, almost a full year of compute.

A much faster library with good results is `pymupdf4llm`.
However, we encountered major memory issues, with OOM errors quickly killing our jobs, and we could not overcome them despite several days of debugging:

- Disabling OCR didn't work.
- Reducing the memory overhead of the main process didn't work.
- Processing pages one by one or in larger chunks didn't work.
- Limiting the maximum memory used by a single worker didn't work.

Thus, we switched to a lower-quality but simple and very fast method: extracting raw text from PDFs without layout analysis.

The script to do so (with a markdown or raw-text option) is `extract_text_from_pdfs.py`, with helper functions in `src/scraping/extract_pdf_content.py`.

This creates a txt file for each PDF file in a given folder.
We then gather these texts into a parquet file using `save_txt_as_parquet`.
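
The raw-text path can be sketched as follows, assuming PyMuPDF (`fitz`); `join_pages` and `extract_raw_text` are illustrative names, not the project's actual helpers:

```
def join_pages(pages) -> str:
    """Concatenate per-page text, separating pages with a form feed
    so the cleaning step can still locate page boundaries."""
    return "\f".join(p.strip() for p in pages)


def extract_raw_text(pdf_path: str) -> str:
    """Extract raw text (no layout analysis) from every page of a PDF."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return join_pages(page.get_text() for page in doc)
```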

Raw text is low quality: sentences are interrupted by random line breaks that are virtually indistinguishable from the real line breaks we want to keep; headers, footers and page numbers can appear in the middle of a paragraph that spans two pages; and tables come out completely destructured, often with one cell value per line, in a way that makes confident reconstruction impossible.

Raw text must therefore be cleaned: headers, footers, page numbers, tables and other garbage lines should be removed. Cleaning code is available in `src/scraping/clean` and gathered in `src/scraping/clean/cleaning_pipeline.py`.
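
Two cheap heuristics cover much of the removable garbage: bare page-number lines, and short lines that repeat across pages (running headers and footers). A minimal sketch, with an illustrative repetition threshold (the project's actual pipeline is more involved):

```
import re
from collections import Counter

# A line that is only a number (optionally prefixed with "page").
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)


def clean_raw_text(text: str, repeat_threshold: int = 3) -> str:
    """Drop bare page numbers and short lines repeated across pages
    (typical running headers/footers)."""
    lines = text.splitlines()
    counts = Counter(line.strip() for line in lines if line.strip())
    kept = []
    for line in lines:
        s = line.strip()
        if not s:
            continue
        if PAGE_NUMBER.match(s):
            continue
        if counts[s] >= repeat_threshold and len(s) < 80:
            continue  # likely a running header or footer
        kept.append(s)
    return "\n".join(kept)
```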

## Extracting sections

We decided to keep only the results and discussion/conclusion sections of each article, both to reduce scale (and thus cost and latency) and to obtain a higher signal-to-noise ratio for policy analysis.

The markdown format would make this easy using heading markers (although PDF-to-markdown tools don't preserve title hierarchy well), but since we ended up with raw text, we used a regex-based method. The code is in `src/scraping/extract_sections.py`.

The script that takes the raw text, cleans it and extracts sections is `extract_sections_from_raw_text.py`.
For each batch, it saves the results as a dataframe in a `processed_text.parquet` file.
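
A regex-based splitter along these lines would do the job; the heading pattern below is a simplified illustration, not the project's actual regex:

```
import re

# Heading-like line: optional section number, then a target keyword,
# with at most a short tail before the end of the line.
SECTION_HEADING = re.compile(
    r"^\s*(?:\d+\.?\s+)?(results|discussion|conclusions?)\b.{0,40}$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_sections(text: str) -> dict:
    """Split cleaned raw text at heading-like lines and return the
    target sections keyed by the matched keyword."""
    matches = list(SECTION_HEADING.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.setdefault(m.group(1).lower(), text[start:end].strip())
    return sections
```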

We then created a "final" parquet file with only the results and conclusions from all batches: `results_conclusions_585k_2025-01-02.parquet`.
It contains 585k rows out of the 690k initial documents, as not all of them contain one of those sections.
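
The gathering step amounts to concatenating the per-batch dataframes and dropping rows where neither target section was found; a sketch assuming illustrative column names `results` and `conclusion`:

```
import pandas as pd


def gather_final(frames: list) -> pd.DataFrame:
    """Concatenate per-batch dataframes and keep only rows where at
    least one target section was extracted."""
    df = pd.concat(frames, ignore_index=True)
    has_section = (
        df["results"].fillna("").str.len().gt(0)
        | df["conclusion"].fillna("").str.len().gt(0)
    )
    return df[has_section].reset_index(drop=True)
```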

## Quick start

Install uv ([installation instructions](https://docs.astral.sh/uv/getting-started/installation/)):

```
curl -LsSf https://astral.sh/uv/install.sh | sh
```

If you plan to use pymupdf4llm OCR (not recommended), [install tesseract](https://tesseract-ocr.github.io/tessdoc/Installation.html):

```
sudo apt install tesseract-ocr
```

Create the venv using the `pdfscraping` dependency group:

```
uv sync --group pdfscraping
```

The `webscraping` group is only used by old code that relied on Selenium to obtain more PDFs.

Run the script of your choice using the venv:

```
uv run python myscript.py [cli args]
```

Read the sections above to understand which script to run.