This repository contains the complete, end-to-end pipeline we submitted to the CLEF LongEval 2025 lab. It spans the Java search engine itself, reproducible Docker tooling, configuration / data folders, statistical-analysis scripts, and the two accompanying homework papers.
.
├── code/ # Java search engine + Docker wrapper
│ ├── src/ # Lucene-based retrieval pipeline
│ ├── run-search-engine.sh
│ ├── docker-compose.yml
│ └── run.sh # one-command Docker launcher
│
├── config/ # JSON configs for indexer & searcher
├── stopwords/ # Multisource FR stop-word lists
├── optuna/ # Hyper-parameter tuning utilities
├── anova/ # Post-hoc significance tests & plots
│
├── homework-1/ # Mid-project report (LaTeX)
├── homework-2/ # Final paper (LaTeX)
└── README.md # you are here
TL;DR: clone, fill in a `.env` file, run the project.
cd code
cp .env.example .env # adjust DOCUMENTS_PATH, QUERY_FOLDER … if needed
./run.sh # builds the fat-JAR in a Maven container and runs it

The launcher:
- Builds the project (`mvn clean package`) inside the `maven:3.8.5-openjdk-17` image.
- Executes the generated `longeval-search-engine-jar-with-dependencies.jar`, mounting the documents, queries and configuration folders specified in `.env`.
- Emits a TREC run file in `run/` (one per search profile).
Every container parameter is exposed in `.env`, so you can reproduce our exact setup or point the engine at a different collection.
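
For readers unfamiliar with the run files mentioned above, the sketch below shows the standard six-column TREC run layout (`query_id Q0 doc_id rank score tag`) that `trec_eval` accepts. It is an illustration only: the `RunEntry` type, the helper names, and the four-decimal score formatting are assumptions, not taken from the Java engine in `code/src/`.

```python
from typing import NamedTuple


class RunEntry(NamedTuple):
    """One retrieved document for one query (hypothetical helper type)."""
    query_id: str
    doc_id: str
    rank: int
    score: float
    tag: str


def format_run_line(entry: RunEntry) -> str:
    """Render one result as a six-column TREC run line."""
    # "Q0" is a fixed literal column required by the format.
    # Four decimals for the score is an arbitrary choice for this sketch.
    return (
        f"{entry.query_id} Q0 {entry.doc_id} "
        f"{entry.rank} {entry.score:.4f} {entry.tag}"
    )


def parse_run_line(line: str) -> RunEntry:
    """Parse one whitespace-separated TREC run line back into a RunEntry."""
    query_id, _q0, doc_id, rank, score, tag = line.split()
    return RunEntry(query_id, doc_id, int(rank), float(score), tag)
```

A line such as `q1 Q0 doc42 1 12.5000 demo` round-trips through both helpers, which makes it easy to sanity-check a file in `run/` before evaluating it.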
| Folder | Purpose | Key entry point |
|---|---|---|
| `anova/` | One-way and two-way ANOVA + Tukey HSD plots on nDCG | `anova/anova1.py`, `anova/anova2.py` |
| `anova/csv_generator.py` | Converts `trec_eval` output into tidy CSV | |
| `optuna/` | Auto-tunes hyper-parameters (k1, b, n-gram sizes, filters, …) | `optuna/params_selector.py` |
| `run_deduplicator.py` | Cleans duplicate doc-ids in a run file | |
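
To make the de-duplication step concrete, here is a minimal sketch of one way to remove duplicate doc-ids from a TREC run file. It is not the code in `run_deduplicator.py`: it assumes six-column run lines already sorted by rank, keeps the first occurrence of each document per query, and renumbers ranks starting at 1. The function name `deduplicate_run` is hypothetical.

```python
from typing import Iterable, List


def deduplicate_run(lines: Iterable[str]) -> List[str]:
    """Drop repeated (query, document) pairs and renumber ranks per query.

    Assumption: input lines are ordered best-first within each query,
    so the first occurrence of a document is the one worth keeping.
    """
    seen = set()      # (query_id, doc_id) pairs already emitted
    next_rank = {}    # next rank to assign, per query
    cleaned = []
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        query_id, q0, doc_id, _old_rank, score, tag = parts
        if (query_id, doc_id) in seen:
            continue  # duplicate doc-id for this query
        seen.add((query_id, doc_id))
        rank = next_rank.get(query_id, 1)  # ranks start at 1 (assumption)
        next_rank[query_id] = rank + 1
        cleaned.append(f"{query_id} {q0} {doc_id} {rank} {score} {tag}")
    return cleaned
```

Renumbering ranks keeps them contiguous after removals; if the original ranks must be preserved, the `rank` bookkeeping can simply be dropped.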
All scripts are vanilla Python 3.11; statistical plots use matplotlib + seaborn.
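
As an illustration of the `trec_eval`-to-CSV conversion listed in the table, the sketch below turns per-query `trec_eval` output into tidy rows using only the standard library. It assumes `trec_eval` was run with `-q`, so each line holds three whitespace-separated fields (measure, query id, value). The column names and function names are hypothetical and may differ from what `anova/csv_generator.py` actually produces.

```python
import csv
import io
from typing import Dict, Iterable, List


def trec_eval_to_rows(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Convert per-query trec_eval lines into one dict per (measure, query)."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        measure, query_id, value = parts
        if query_id == "all":
            continue  # drop the summary rows aggregated over all queries
        try:
            rows.append(
                {"measure": measure, "query_id": query_id, "value": float(value)}
            )
        except ValueError:
            continue  # non-numeric values are ignored
    return rows


def rows_to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize tidy rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["measure", "query_id", "value"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

One row per (measure, query) pair is the long format that ANOVA tooling typically expects, with the system or run name added as a further column when several runs are compared.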
- Alberto Bottari
- Lorenzo Croce
- Fatemeh Mahvari