Fireentity/seupd2425-basette


LongEval 2025 Search-Engine Project

This repository contains the complete, end-to-end pipeline we submitted to the CLEF LongEval 2025 lab. It covers the Java search engine, reproducible Docker tooling, configuration and data folders, statistical-analysis scripts, and the two accompanying homework papers.


Repository layout

.
├── code/                  # Java search engine + Docker wrapper
│   ├── src/               # Lucene-based retrieval pipeline
│   ├── run-search-engine.sh
│   ├── docker-compose.yml
│   └── run.sh             # one-command Docker launcher
│
├── config/                # JSON configs for indexer & searcher
├── stopwords/             # Multisource FR stop-word lists
├── optuna/                # Hyper-parameter tuning utilities
├── anova/                 # Post-hoc significance tests & plots
│
├── homework-1/            # Mid-project report (LaTeX)
├── homework-2/            # Final paper      (LaTeX)
└── README.md              # you are here

Quick start

TL;DR: clone, fill a .env file, run the project.

cd code
cp .env.example .env        # adjust DOCUMENTS_PATH, QUERY_FOLDER … if needed
./run.sh                    # builds the fat-JAR in a Maven container and runs it

The launcher:

  1. Builds the project (mvn clean package) inside the maven:3.8.5-openjdk-17 image.
  2. Executes the generated longeval-search-engine-jar-with-dependencies.jar, mounting the documents, queries and configuration folders specified in .env.
  3. Emits a TREC run file in run/ (one per search-profile).
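
For reference, a TREC run file conventionally has six whitespace-separated columns per line: topic id, the literal `Q0`, document id, rank, score, and run tag. The sketch below (in Python, to match the analysis scripts) formats and parses such lines. The names `RunEntry`, `format_entry` and `parse_entry`, the ids, and the tag `demo-run` are hypothetical and are not taken from the engine's code.

```python
from typing import NamedTuple


class RunEntry(NamedTuple):
    """One line of a TREC run file (hypothetical helper type)."""
    topic_id: str
    doc_id: str
    rank: int
    score: float
    tag: str


def format_entry(e: RunEntry) -> str:
    # Standard column order: topic, Q0, document, rank, score, run tag.
    return f"{e.topic_id} Q0 {e.doc_id} {e.rank} {e.score:.4f} {e.tag}"


def parse_entry(line: str) -> RunEntry:
    # Raises ValueError if the line does not have exactly six columns.
    topic_id, _q0, doc_id, rank, score, tag = line.split()
    return RunEntry(topic_id, doc_id, int(rank), float(score), tag)


# Illustrative values only.
entry = RunEntry("q42", "doc123", 1, 12.5, "demo-run")
line = format_entry(entry)
```

A quick round trip through `parse_entry(line)` is a cheap sanity check before handing a run file to trec_eval.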

All parameters of the container are surfaced in .env so you can reproduce our exact setup or point the engine at a different collection.

Evaluation & analysis

| Folder | Purpose | Key entry point |
|---|---|---|
| anova/ | One-way and two-way ANOVA + Tukey HSD plots on nDCG | anova/anova1.py, anova/anova2.py |
| anova/csv_generator.py | Converts trec_eval output into tidy CSV | |
| optuna/ | Auto-tunes hyper-parameters (k1, b, n-gram sizes, filters, …) | optuna/params_selector.py |
| run_deduplicator.py | Cleans duplicate doc-ids in a run file | |
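
To illustrate what cleaning duplicate doc-ids in a run file can involve, here is a minimal sketch. It is not the code of run_deduplicator.py: it assumes the input lines are already ordered by rank within each topic, keeps the first occurrence of every (topic, document) pair, and renumbers ranks from 1. The function name `deduplicate_run` and the sample lines are hypothetical.

```python
def deduplicate_run(lines):
    """Drop repeated (topic, doc) pairs and renumber ranks per topic.

    Assumption: lines are already sorted by rank within each topic,
    so the first occurrence of a document is the one worth keeping.
    """
    seen = set()        # (topic, doc) pairs already emitted
    last_rank = {}      # last rank assigned for each topic
    cleaned = []
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue    # skip blank or malformed lines
        topic, q0, doc, _old_rank, score, tag = parts
        if (topic, doc) in seen:
            continue    # duplicate document for this topic
        seen.add((topic, doc))
        rank = last_rank.get(topic, 0) + 1
        last_rank[topic] = rank
        cleaned.append(f"{topic} {q0} {doc} {rank} {score} {tag}")
    return cleaned


# Illustrative run with one duplicate (d1 appears twice for q1).
run = [
    "q1 Q0 d1 1 9.0 demo",
    "q1 Q0 d2 2 8.0 demo",
    "q1 Q0 d1 3 7.0 demo",
    "q1 Q0 d3 4 6.0 demo",
    "q2 Q0 d1 1 5.0 demo",
]
cleaned = deduplicate_run(run)
```

Renumbering the ranks is a design choice in this sketch: it keeps the rank column contiguous after lines are dropped. Whether the actual script does the same is not stated here.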

All scripts are vanilla Python 3.11; statistical plots use matplotlib + seaborn.
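
As a rough illustration of the trec_eval-to-CSV step, the sketch below turns per-topic trec_eval output (three whitespace-separated columns: measure, topic id or `all`, value) into one CSV row per system, topic and measure. It uses only the standard library and is not the code of anova/csv_generator.py. The function names, the column names and the system label `demo-run` are assumptions made for this example.

```python
import csv
import io


def trec_eval_to_rows(text, system):
    """Parse per-topic trec_eval output into tidy dict rows."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue                 # skip blank or unexpected lines
        measure, topic, value = parts
        if topic == "all":
            continue                 # drop the summary over all topics
        rows.append({"system": system, "topic": topic,
                     "measure": measure, "value": float(value)})
    return rows


def rows_to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["system", "topic", "measure", "value"],
        lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Illustrative input, not real evaluation results.
sample = "ndcg\tq1\t0.5000\nndcg\tq2\t0.2500\nndcg\tall\t0.3750\n"
csv_text = rows_to_csv(trec_eval_to_rows(sample, "demo-run"))
```

One row per (system, topic, measure) is the long format that ANOVA tooling typically expects, which is why the summary `all` line is dropped rather than kept as a pseudo-topic.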

Team Members

  • Alberto Bottari
  • Lorenzo Croce
  • Fatemeh Mahvari
