AgroAnnotator is a Compute-to-Data (CtD) algorithm for AgrospAI that:
- extracts text from PDF / HTML / TXT / DOCX
- calls the AgroPortal Annotator API
- handles large documents via chunking
- merges/deduplicates annotations across chunks
- generates JSON + CSV summaries with concept scores
This project was designed to work both:
- Locally (terminal, file path passed as argument)
- In Pontus-X / AgrospAI CtD runtime (where input files are staged under /data/inputs and outputs are collected from /data/outputs)
In AgrospAI/Pontus-X, algorithms are often run as:
python $ALGO
This means the platform runs your code without any CLI arguments.
AgroAnnotator supports this by:
- making the input argument optional
- auto-discovering the dataset input file under:
/data/inputs/<something>/0
(typical staging pattern)
- writing outputs by default to:
/data/outputs
(typical Pontus-X outputs collection path)
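The no-argument behaviour described above can be sketched as follows. This is a minimal illustration, not the code in algo.py: the helper names `discover_input` and `resolve_paths` are hypothetical, and it assumes the staged file is literally named `0` exactly one directory below /data/inputs.

```python
import glob
import os
from typing import Optional, Tuple


def discover_input(inputs_root: str = "/data/inputs") -> Optional[str]:
    """Return the first file matching <inputs_root>/<something>/0, or None."""
    pattern = os.path.join(inputs_root, "*", "0")
    for path in sorted(glob.glob(pattern)):
        if os.path.isfile(path):
            return path
    return None


def resolve_paths(
    cli_input: Optional[str],
    cli_out: Optional[str],
    inputs_root: str = "/data/inputs",
    default_out: str = "/data/outputs",
) -> Tuple[Optional[str], str]:
    """Prefer explicit CLI values; otherwise fall back to the staging paths."""
    input_path = cli_input or discover_input(inputs_root)
    out_dir = cli_out or default_out
    return input_path, out_dir
```

With this shape, a local run passes both values explicitly, while a CtD run passes neither and relies on the fallbacks.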
The script accepts one input file (or stdin).
- PDF (.pdf): extracted via PyPDF2
- HTML (.html, .htm): visible text is extracted (scripts and styles removed) using the BeautifulSoup library, or a fallback parser if the library isn't available at run time
- Plain text (.txt, .md)
- DOCX (.docx): extracted via python-docx
- No-extension files (Pontus-X staging)
Pontus-X may stage the input as a file named 0 with no extension.
AgroAnnotator detects this by sniffing content (e.g., %PDF- header for PDFs).
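Content sniffing for an extension-less file can be sketched as below. Only the `%PDF-` check comes from the description above. The ZIP signature check for DOCX, the HTML heuristics, and the function name `sniff_format` are assumptions made for illustration.

```python
def sniff_format(head: bytes) -> str:
    """Guess a format label from the first bytes of a file."""
    stripped = head.lstrip()
    # PDF files start with the "%PDF-" magic header.
    if stripped.startswith(b"%PDF-"):
        return "pdf"
    # DOCX is a ZIP container; this check alone cannot tell DOCX from other ZIPs.
    if head.startswith(b"PK\x03\x04"):
        return "docx"
    # Crude HTML heuristic: a doctype or an <html tag near the start.
    lowered = stripped[:1024].lower()
    if lowered.startswith(b"<!doctype html") or b"<html" in lowered:
        return "html"
    # Everything else is treated as plain text.
    return "txt"
```

A caller would typically read the first kilobyte or so of the staged file in binary mode and pass it to the function.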
The algorithm uses a dedicated API key embedded in code by default.
You can override it via the environment variable:
AGROPORTAL_API_KEY
If no ontology is provided, the default is:
AGROPORTAL_DEFAULT_ONTOLOGY
Default value:
AGROVOC
If label resolution is enabled, you can request a preferred label language:
AGROPORTAL_LABEL_LANG
Default:
en
Not all ontologies provide labels for all languages. Resolution is best-effort.
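Reading the three environment variables with their documented defaults might look like the sketch below. The function name `load_config`, the returned dictionary keys, and the placeholder standing in for the embedded key are hypothetical.

```python
import os
from typing import Mapping, Optional

# Placeholder only; the real embedded key lives in algo.py.
EMBEDDED_API_KEY = "<embedded-key>"


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve settings from the environment, falling back to the defaults."""
    env = os.environ if env is None else env
    return {
        "api_key": env.get("AGROPORTAL_API_KEY") or EMBEDDED_API_KEY,
        "default_ontology": env.get("AGROPORTAL_DEFAULT_ONTOLOGY") or "AGROVOC",
        "label_lang": env.get("AGROPORTAL_LABEL_LANG") or "en",
    }
```

Using `or` rather than a `get` default means an empty variable is treated the same as an unset one, which is a design choice of this sketch.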
All outputs are written to:
--out (local), or /data/outputs (Pontus-X default)
chunk_0001.json
chunk_0002.json
...
Raw AgroPortal Annotator API responses for each chunk.
combined.json
Links each chunk’s metadata (start/end offsets) to the chunk response.
merged_annotations.json
Contains global offsets:
- _global_from
- _global_to
- _chunk_index
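Merging per-chunk matches into document-level offsets can be sketched as follows. The input shape is a simplified assumption (each chunk carries a `start` offset and a list of matches with `concept`, `from` and `to`), not the raw Annotator response, and `merge_annotations` is a hypothetical name. The output fields are read here as `_global_from`, `_global_to` and `_chunk_index`.

```python
def merge_annotations(chunks):
    """Shift chunk-local offsets to global ones and drop duplicate matches."""
    seen = set()
    merged = []
    for index, chunk in enumerate(chunks):
        for match in chunk["matches"]:
            g_from = chunk["start"] + match["from"]
            g_to = chunk["start"] + match["to"]
            # The same span found in two overlapping chunks yields the same key.
            key = (match["concept"], g_from, g_to)
            if key in seen:
                continue
            seen.add(key)
            merged.append(
                {
                    **match,
                    "_global_from": g_from,
                    "_global_to": g_to,
                    "_chunk_index": index,
                }
            )
    return merged
```

Because the key uses global offsets, a match that falls inside the overlap region is kept once, attributed to the first chunk that reported it.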
Includes:
- ontology URL
- concept URI
- label (best-effort)
- count
- example matches
Exports:
concepts_summary.csv
concepts_summary.json
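A per-concept summary with the fields listed above could be built roughly like this. The input keys (`concept`, `ontology`, `label`, `text`), the column order, and the function names are assumptions for illustration. The actual CSV layout produced by algo.py may differ.

```python
import csv
import io

FIELDS = ["ontology", "concept", "label", "count", "examples"]


def summarize(merged, max_examples=3):
    """Aggregate merged annotations into one row per concept URI."""
    rows = {}
    for ann in merged:
        row = rows.setdefault(
            ann["concept"],
            {
                "ontology": ann.get("ontology", ""),
                "concept": ann["concept"],
                "label": ann.get("label", ""),
                "count": 0,
                "examples": [],
            },
        )
        row["count"] += 1
        text = ann.get("text")
        if text and text not in row["examples"] and len(row["examples"]) < max_examples:
            row["examples"].append(text)
    # Most frequent concepts first; ties broken by URI for stable output.
    return sorted(rows.values(), key=lambda r: (-r["count"], r["concept"]))


def to_csv(rows):
    """Render summary rows as CSV text, joining example matches with '; '."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "examples": "; ".join(row["examples"])})
    return buf.getvalue()
```

The same rows can be passed to `json.dump` to produce the JSON variant of the summary.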
run_metadata.json
Contains:
- input info
- chunking settings
- ontology selection
- counts
- run details
python algo.py "/path/to/file.pdf" --out out_pdf
python algo.py "/path/to/file.pdf" --out out_pdf --ontologies AGROVOC APTO
python algo.py "/path/to/file.html" --out out_html
python algo.py "/path/to/file.txt" --out out_txt
Typical Pontus-X flow:
- pull a Docker image
- download algorithm from Algorithm URL
- run with:
python $ALGO
- Docker image:
setecores96/humble-annotator:0.1.1
- Entrypoint:
python $ALGO
- Algorithm URL:
https://raw.githubusercontent.com/agroportal/agroannotator/refs/heads/main/algo.py
The dataset is staged under:
/data/inputs/...
The script auto-detects it.
Dataset policy note: datasets must allow/trust the algorithm to run.
Large documents are split into chunks:
- --chunk-size (default ≈ 8k chars)
- --overlap (default ≈ 200 chars)
Benefits:
- smaller API requests
- higher reliability
- preserves matches near chunk boundaries
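Character-based chunking with overlap can be sketched as below. The function name `chunk_text` and the returned dictionary shape are hypothetical, and the defaults of 8000 and 200 are taken as round stand-ins for the approximate values quoted above.

```python
def chunk_text(text, chunk_size=8000, overlap=200):
    """Split text into windows of chunk_size chars that share `overlap` chars."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Record offsets so matches can later be mapped back to the full text.
        chunks.append({"start": start, "end": end, "text": text[start:end]})
        if end == len(text):
            break
        start += step
    return chunks
```

Each chunk after the first begins `overlap` characters before the previous one ended, so a term that straddles a boundary appears whole in at least one chunk as long as it is no longer than the overlap.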
- requests
- PyPDF2 (PDF)
- beautifulsoup4 (HTML extraction)
- python-docx (DOCX)
python -m py_compile algo.py
python -c "import requests; print('requests OK')"
python -c "import PyPDF2; print('PyPDF2 OK')"
python -c "import bs4; print('beautifulsoup4 OK')"
python -c "import docx; print('python-docx OK')"
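The individual import checks above can also be run in one pass. The sketch below uses only the standard library, and the helper name `missing_modules` is hypothetical. It reports which top-level modules cannot be found, without importing them.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that are not installed."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names (not PyPI names) of the dependencies listed above.
    missing = missing_modules(["requests", "PyPDF2", "bs4", "docx"])
    print("missing:", ", ".join(missing) if missing else "none")
```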