LongEval 2025 Search-Engine Project

This repository contains the complete, end-to-end pipeline we submitted to the CLEF LongEval 2025 lab. It covers the Java search engine, reproducible Docker tooling, configuration and stop-word folders, hyper-parameter tuning and statistical-analysis scripts, and the two accompanying homework papers.

Repository layout

.
├── code/                  # Java search engine + Docker wrapper
│   ├── src/               # Lucene-based retrieval pipeline
│   ├── run-search-engine.sh
│   ├── docker-compose.yml
│   └── run.sh             # one-command Docker launcher
│
├── config/                # JSON configs for indexer & searcher
├── stopwords/             # Multi-source French stop-word lists
├── optuna/                # Hyper-parameter tuning utilities
├── anova/                 # Post-hoc significance tests & plots
│
├── homework-1/            # Mid-project report (LaTeX)
├── homework-2/            # Final paper      (LaTeX)
└── README.md              # you are here

Quick start

TL;DR: clone, fill a .env file, run the project.

cd code
cp .env.example .env        # adjust DOCUMENTS_PATH, QUERY_FOLDER … if needed
./run.sh                    # builds the fat-JAR in a Maven container and runs it

The launcher:

Builds the project (mvn clean package) inside the maven:3.8.5-openjdk-17 image.
Executes the generated longeval-search-engine-jar-with-dependencies.jar, mounting the documents, queries and configuration folders specified in .env.
Writes TREC run files to run/, one per search profile.

All parameters of the container are surfaced in .env so you can reproduce our exact setup or point the engine at a different collection.
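For readers unfamiliar with the run files the launcher writes to run/, the sketch below parses and re-serializes one line of the standard six-column TREC run layout (query id, the literal `Q0`, document id, rank, score, run tag). The names `RunLine`, `parse_run_line` and `format_run_line` are hypothetical and not part of this repository, and the four-decimal score formatting is an assumption.

```python
from typing import NamedTuple


class RunLine(NamedTuple):
    """One row of a TREC run file (hypothetical helper type)."""
    query_id: str
    doc_id: str
    rank: int
    score: float
    run_tag: str


def parse_run_line(line: str) -> RunLine:
    """Parse 'qid Q0 docid rank score tag' into a RunLine."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
    query_id, _q0, doc_id, rank, score, run_tag = fields
    return RunLine(query_id, doc_id, int(rank), float(score), run_tag)


def format_run_line(row: RunLine) -> str:
    """Serialize a RunLine back to the six-column layout.

    Four decimals for the score is an assumption made for this sketch.
    """
    return (
        f"{row.query_id} Q0 {row.doc_id} "
        f"{row.rank} {row.score:.4f} {row.run_tag}"
    )
```

A quick sanity check on a generated file is to call `parse_run_line` on every line and let the `ValueError` surface any malformed row before handing the run to trec_eval.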

Evaluation & analysis

| Folder / script | Purpose | Key entry point |
|---|---|---|
| `anova/` | One-way and two-way ANOVA + Tukey HSD plots on nDCG | `anova/anova1.py`, `anova/anova2.py` |
| `anova/csv_generator.py` | Converts trec_eval output into tidy CSV | |
| `optuna/` | Auto-tunes hyper-parameters (k1, b, n-gram sizes, filters, …) | `optuna/params_selector.py` |
| `run_deduplicator.py` | Cleans duplicate doc-ids in a run file | |
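To illustrate the trec_eval-to-CSV step, here is a minimal sketch that is not the code in `anova/csv_generator.py`. It assumes trec_eval was run with per-query output (`-q`), which prints three whitespace-separated columns per line: measure, query id, value. The function names and the CSV column names (`run`, `query_id`, `measure`, `score`) are hypothetical.

```python
import csv
import io

# Column names are an assumption made for this sketch.
FIELDS = ["run", "query_id", "measure", "score"]


def trec_eval_to_rows(text: str, run_name: str) -> list[dict]:
    """Turn per-query trec_eval output into one dict per (query, measure)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or unexpected lines
        measure, query_id, value = fields
        if query_id == "all":
            continue  # skip the summary rows, keep per-query values only
        try:
            score = float(value)
        except ValueError:
            continue  # non-numeric entries such as the run id
        rows.append(
            {"run": run_name, "query_id": query_id,
             "measure": measure, "score": score}
        )
    return rows


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

One row per run, query and measure is the long ("tidy") layout that ANOVA tooling usually expects, since the run then acts as the factor and the query as the subject.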

All scripts target Python 3.11; the statistical plots use matplotlib and seaborn.
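As an illustration of what cleaning duplicate doc-ids in a run file can involve, the sketch below keeps the best-scoring entry for each (query, document) pair and renumbers the ranks. This is one plausible policy under stated assumptions, not the logic of `run_deduplicator.py`: the function name `deduplicate_run`, the keep-highest-score rule, the 1-based ranks and the four-decimal score formatting are all assumptions.

```python
def deduplicate_run(lines: list[str]) -> list[str]:
    """Drop repeated (query, doc) pairs from TREC run lines.

    Assumed policy: keep the highest-scoring occurrence, then re-rank
    each query by descending score (ties broken by document id).
    """
    best: dict[tuple[str, str], tuple[float, str]] = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip malformed lines
        query_id, _q0, doc_id, _rank, score, run_tag = fields
        key = (query_id, doc_id)
        value = float(score)
        if key not in best or value > best[key][0]:
            best[key] = (value, run_tag)

    # Group surviving entries by query, preserving first-seen query order.
    per_query: dict[str, list[tuple[float, str, str]]] = {}
    for (query_id, doc_id), (value, run_tag) in best.items():
        per_query.setdefault(query_id, []).append((value, doc_id, run_tag))

    output = []
    for query_id, entries in per_query.items():
        entries.sort(key=lambda e: (-e[0], e[1]))
        for rank, (value, doc_id, run_tag) in enumerate(entries, start=1):
            output.append(
                f"{query_id} Q0 {doc_id} {rank} {value:.4f} {run_tag}"
            )
    return output
```

Re-ranking after the removal keeps the rank column contiguous, which avoids gaps that some downstream validators reject.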

Team Members

Alberto Bottari
Lorenzo Croce
Fatemeh Mahvari
