- Project website: https://sites.google.com/view/larcq
- Paper: https://www.isca-archive.org/interspeech_2025/yang25n_interspeech.html
```bibtex
@inproceedings{yang25n_interspeech,
  title     = {On Retrieval of Long Audios with Complex Text Queries},
  author    = {Ruochu Yang and Milind Rao and Harshavardhan Sundar and Anirudh Raju and Aparna Khare and Srinath Tankasala and Di He and Venkatesh Ravichandran},
  year      = {2025},
  booktitle = {Interspeech 2025},
  pages     = {2660--2664},
  doi       = {10.21437/Interspeech.2025-2085},
  issn      = {2958-1796},
}
```
```bash
conda create -n larcq python=3.10
conda activate larcq
pip install -r requirements.txt
pip install -e hf-dev-train/transformers-main
pip install -e peft-main
```
Due to license restrictions, we cannot open-source our Clotho_LARCQ and SoundDescs_LARCQ benchmarks. However, we provide the code to generate them, so you can build any LARCQ-style benchmark you want. Save the generated benchmarks in the `datasets` folder.
- Download the `clap-htsat-fused` model from the Hugging Face model link. Save the model in the `models` folder.
- Download the `gpt2` model from the Hugging Face model link. Save the model in the `models` folder.
- Download the `Llama-2-7b-chat-hf-qformer` folder from the Google Drive website link. Save the folder in the `models` folder.
- Download the `stage5_epoch2` folder from the Google Drive website link. Unzip and save the folder in the `models` folder.
- Download the `clapcap_weights_2023.pth` checkpoint from the Hugging Face website link. Save the checkpoint in the `models` folder.
- Download the `opt-iml-max-1.3b` folder from the Hugging Face website link. Unzip and save the folder in the `models` folder.
- Download the `foundation.pt` checkpoint from the Hugging Face website link. Save the checkpoint in the `models` folder.
- Download the `ms-marco-MiniLM-L-6-v2` folder from the Hugging Face website link. Unzip and save the folder in the `models` folder.
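To confirm everything is in place before running the pipeline, here is a minimal sanity check; it only verifies that the items listed above exist under `models/`:

```python
# Sanity check: verify that the downloaded model assets from the list above
# are present under the models/ folder.
from pathlib import Path

MODELS = Path("models")
expected = [
    "clap-htsat-fused",
    "gpt2",
    "Llama-2-7b-chat-hf-qformer",
    "stage5_epoch2",
    "clapcap_weights_2023.pth",
    "opt-iml-max-1.3b",
    "foundation.pt",
    "ms-marco-MiniLM-L-6-v2",
]
missing = [name for name in expected if not (MODELS / name).exists()]
print("All model assets found." if not missing else f"Missing: {missing}")
```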
The results in the paper were generated on a machine with NVIDIA GPUs. Ideally use four GPUs, and make sure `nvidia-smi` works before running the pipeline.
We provide the code for generating our Clotho_LARCQ benchmark from the Clotho v2.1 dataset, so you can follow it to create any LARCQ benchmark you want.
(1) Download the `clotho_audio_evaluation.7z` archive and the `clotho_captions_evaluation.csv` file from the Zenodo website link. Save them in the `datasets/Clotho` folder.
(2) Synthesize long-audio-long-query pairs as LARCQ benchmarks.
Run terminal command `python -m benchmark_generation.synthesize`
The raw LARCQ captions are saved as `datasets/Clotho_LARCQ/raw_LARCQ_captions.csv`
The LARCQ audios are saved in `datasets/Clotho_LARCQ/audios/`
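For intuition, the sketch below shows one way such a pair can be assembled, assuming a long audio is a concatenation of several short Clotho clips and its raw caption is a concatenation of their captions. File and column names follow the Clotho v2.1 release; the actual `benchmark_generation.synthesize` script defines its own pairing logic.

```python
# Illustrative sketch only: assumes a long LARCQ audio is built by concatenating several
# short Clotho clips and its raw caption by concatenating their captions. File/column
# names follow the Clotho v2.1 release; benchmark_generation.synthesize defines the
# actual pairing logic.
from pathlib import Path
import numpy as np
import pandas as pd
import soundfile as sf

Path("datasets/Clotho_LARCQ/audios").mkdir(parents=True, exist_ok=True)
captions = pd.read_csv("datasets/Clotho/clotho_captions_evaluation.csv")
group = captions.sample(4)  # pick a few short clips to merge

audio_chunks, caption_parts = [], []
for _, row in group.iterrows():
    # Clotho evaluation clips share a 44.1 kHz sampling rate, so plain concatenation is safe.
    wav, sr = sf.read(f"datasets/Clotho/evaluation/{row['file_name']}")
    audio_chunks.append(wav)
    caption_parts.append(row["caption_1"])

sf.write("datasets/Clotho_LARCQ/audios/example_long_audio.wav",
         np.concatenate(audio_chunks), sr)
raw_caption = " Then ".join(caption_parts)  # raw long query before LLM refinement
print(raw_caption)
```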
(3) Run LLMs to refine the raw LARCQ captions
We use two options to refine the raw LARCQ captions into natural long queries (a minimal prompting sketch follows this list).
- Condense the raw captions: run terminal command `python -m benchmark_generation.llm_condense`. The condensed LARCQ captions are saved as `datasets/Clotho_LARCQ/condensed_caption.csv`
- Rephrase the raw captions: run terminal command `python -m benchmark_generation.llm_rephrase`. The rephrased LARCQ captions are saved as `datasets/Clotho_LARCQ/rephrased_caption.csv`
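The sketch below illustrates the two kinds of refinement prompts with a generic Hugging Face text-generation pipeline. The prompts and the local model path are placeholders, not the exact prompts or model used by `llm_condense` and `llm_rephrase`.

```python
# Hedged sketch of the two refinement options; the actual llm_condense / llm_rephrase
# scripts define their own prompts and model. The local model path below is only a
# placeholder for whichever instruction-following LLM you use.
from transformers import pipeline

generator = pipeline("text-generation", model="models/opt-iml-max-1.3b")

raw_caption = "A dog barks loudly Then rain falls on a tin roof Then a car engine starts"

condense_prompt = (
    "Condense the following audio description into one concise query:\n"
    f"{raw_caption}\nQuery:"
)
rephrase_prompt = (
    "Rephrase the following audio description as a natural, fluent query:\n"
    f"{raw_caption}\nQuery:"
)

print(generator(condense_prompt, max_new_tokens=64)[0]["generated_text"])
print(generator(rephrase_prompt, max_new_tokens=64)[0]["generated_text"])
```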
(1) Download the original SoundDescs dataset from the official GitHub website link. Save it in the `datasets/SoundDescs` folder.
(2) We filter for audios between 75 and 150 seconds whose captions exceed 150 characters, treating those captions as complex queries. This yields 1639 audio-query pairs, forming our SoundDescs_LARCQ benchmark.
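The filtering rule itself is simple; here is a sketch of it. The metadata file and column names are assumptions, so adapt them to how the downloaded SoundDescs data is actually laid out.

```python
# Sketch of the filtering rule above (75-150 s audios with captions longer than
# 150 characters). The metadata file and column names are assumptions.
import pandas as pd
import soundfile as sf

meta = pd.read_csv("datasets/SoundDescs/descriptions.csv")  # assumed metadata file

def duration_seconds(path: str) -> float:
    info = sf.info(path)
    return info.frames / info.samplerate

meta["duration"] = meta["file_name"].map(
    lambda name: duration_seconds(f"datasets/SoundDescs/audios/{name}")
)
larcq = meta[meta["duration"].between(75, 150) & (meta["caption"].str.len() > 150)]
larcq.to_csv("datasets/SoundDescs_LARCQ/benchmark.csv", index=False)
print(len(larcq), "audio-query pairs")  # the paper reports 1639
```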
Our pipeline consists of two main parts: multi-modal retrieval and ALM/LLM refining.
The retrieval scripts are in the folder `pipeline/multi_modal_retrieval`. Each script is independent and can be executed directly, so you can evaluate any method on any dataset for a comprehensive comparison.
(1) `retrieval_no_chunking.py` retrieves the relevant audios for the queries without any audio chunking or query chunking.
Run terminal command `python -m pipeline.multi_modal_retrieval.retrieval_no_chunking`
Retrieved short-list audios are saved as `results/retrieved_results/{benchmark}/retrieved_audios_no_chunking.csv`
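Without chunking, retrieval amounts to ranking whole audios by CLAP similarity with the whole query. A minimal sketch using `models/clap-htsat-fused` from the setup above; the actual script additionally handles benchmark loading, batching, and CSV output:

```python
# Minimal sketch of no-chunking retrieval with CLAP (assumes models/clap-htsat-fused
# from the setup above); the actual script adds benchmark loading, batching, and CSV output.
import librosa
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("models/clap-htsat-fused")
processor = ClapProcessor.from_pretrained("models/clap-htsat-fused")

query = "A dog barks, then rain falls on a tin roof, then a car engine starts."
audio_paths = ["datasets/Clotho_LARCQ/audios/example_long_audio.wav"]  # candidate audios

with torch.no_grad():
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)

    audios = [librosa.load(p, sr=48000)[0] for p in audio_paths]  # CLAP expects 48 kHz
    audio_inputs = processor(audios=audios, sampling_rate=48000, return_tensors="pt")
    audio_emb = model.get_audio_features(**audio_inputs)

scores = torch.nn.functional.cosine_similarity(text_emb, audio_emb)
short_list = sorted(zip(audio_paths, scores.tolist()), key=lambda x: -x[1])
print(short_list[:5])
```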
(2) `retrieval_audio_chunking.py` retrieves the relevant audios for the queries with audio chunking max/sum vote only, without any query chunking.
Run terminal command `python -m pipeline.multi_modal_retrieval.retrieval_audio_chunking`
Retrieved short-list audios are saved as `results/retrieved_results/{benchmark}/retrieved_audios_audio_chunking.csv`
(3) `retrieval_query_chunking.py` retrieves the relevant audios for the queries with query chunking max/sum vote only, without any audio chunking.
Run terminal command `python -m pipeline.multi_modal_retrieval.retrieval_query_chunking`
Retrieved short-list audios are saved as `results/retrieved_results/{benchmark}/retrieved_audios_query_chunking.csv`
(4) `retrieval_audio_chunking_query_chunking.py` applies the four combinations of audio chunking max vote × query chunking sum vote, audio chunking sum vote × query chunking sum vote, audio chunking sum vote × query chunking max vote, and audio chunking max vote × query chunking max vote to retrieve the audios (the voting logic is sketched after this step).
Run terminal command `python -m pipeline.multi_modal_retrieval.retrieval_audio_chunking_query_chunking`
Retrieved short-list audios are saved as `results/retrieved_results/{benchmark}/retrieved_audios_best.csv`
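The max/sum votes collapse a chunk-level similarity matrix into a single score per candidate audio. A minimal sketch of that aggregation; chunk sizes and the exact aggregation order in the actual scripts may differ:

```python
# Sketch of the max/sum voting used to collapse chunk-level similarities into one score
# per candidate audio; chunk sizes and aggregation order in the scripts may differ.
import torch

def aggregate(sim: torch.Tensor, audio_vote: str, query_vote: str) -> torch.Tensor:
    """sim[i, j]: similarity between query chunk i and audio chunk j of one candidate audio."""
    per_query_chunk = sim.max(dim=1).values if audio_vote == "max" else sim.sum(dim=1)
    return per_query_chunk.max() if query_vote == "max" else per_query_chunk.sum()

sim = torch.rand(3, 5)  # e.g. 3 query chunks x 5 audio chunks
for audio_vote in ("max", "sum"):
    for query_vote in ("max", "sum"):
        print(f"audio {audio_vote} x query {query_vote}:",
              aggregate(sim, audio_vote, query_vote).item())
```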
In our paper, we use two ALMs, GAMA and Audio-Flamingo, to generate captions for the retrieved audios.
(1) Run terminal command `python -m pipeline.alm_llm_refining.run_gama`
GAMA captioning results on the retrieved audios are saved as `results/alm_results/{benchmark}/retrieved_audios_gama.csv`
(2) Run terminal command `python -m pipeline.alm_llm_refining.run_flamingo`
Audio-Flamingo captioning results on the retrieved audios are saved as `results/alm_results/{benchmark}/retrieved_audios_flamingo.csv`
In our paper, we use an LLM or miniLM to compare the ALM-generated response with the text query. You can use any LLM or miniLM model you want.
(1) Use LLM
- In our paper, we use Mixtral as the LLM for re-ranking. Follow the tutorial on the Mistral AI website link to set up Mixtral. First, install the `vllm` package (version `>=0.6.1.post1` to ensure maximum compatibility with all Mistral models). Second, authenticate on the Hugging Face Hub with your access token `$HF_TOKEN` via the command `huggingface-cli login --token $HF_TOKEN`.
- Choose an ALM captioning file `results/alm_results/{benchmark}/retrieved_audios_{ALM}.csv`, like `results/alm_results/Clotho_LARCQ/retrieved_audios_gama.csv`.
- Run terminal command `python -m pipeline.alm_llm_refining.llm_ranking`. LLM re-ranking results are saved as `results/llm_results/{benchmark}/{ALM}_llm_ranking.csv`. A minimal vLLM re-ranking sketch follows this list.
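The sketch below shows LLM-based re-ranking with vLLM. The Mixtral model name, the prompt, and the score parsing are illustrative assumptions, not the exact logic of `pipeline.alm_llm_refining.llm_ranking`.

```python
# Hedged sketch of LLM re-ranking with vLLM; the model name, prompt, and score parsing
# are illustrative, not the exact logic of pipeline.alm_llm_refining.llm_ranking.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1")
params = SamplingParams(temperature=0.0, max_tokens=8)

query = "A dog barks, then rain falls on a tin roof, then a car engine starts."
alm_captions = {  # ALM-generated captions for the retrieved short-list audios
    "audio_1.wav": "A dog barking followed by heavy rain and an engine turning over.",
    "audio_2.wav": "People talking in a busy restaurant.",
}

prompts = [
    f"Query: {query}\nAudio description: {caption}\n"
    "On a scale of 0 to 10, how well does the description match the query? Answer with a number."
    for caption in alm_captions.values()
]
outputs = llm.generate(prompts, params)

def to_score(text: str) -> float:
    try:
        return float(text.split()[0])
    except (ValueError, IndexError):
        return 0.0

scores = {name: to_score(out.outputs[0].text) for name, out in zip(alm_captions, outputs)}
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # re-ranked short list
```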
(2) Use miniLM
- Choose an ALM captioning file `results/alm_results/{benchmark}/retrieved_audios_{ALM}.csv`, like `results/alm_results/Clotho_LARCQ/retrieved_audios_gama.csv`.
- Run terminal command `python -m pipeline.alm_llm_refining.cross_encoder_ranking`. miniLM re-ranking results are saved as `results/llm_results/{benchmark}/{ALM}_cross_encoder_ranking.csv`. A minimal cross-encoder sketch follows this list.
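A minimal sketch of cross-encoder re-ranking with the `ms-marco-MiniLM-L-6-v2` model downloaded into `models/` above; the actual script reads the ALM captioning CSV instead of a hard-coded dictionary.

```python
# Sketch of cross-encoder re-ranking with the ms-marco-MiniLM-L-6-v2 model downloaded
# into models/ above; the actual script reads the ALM captioning CSV instead of a dict.
from sentence_transformers import CrossEncoder

ranker = CrossEncoder("models/ms-marco-MiniLM-L-6-v2")

query = "A dog barks, then rain falls on a tin roof, then a car engine starts."
alm_captions = {
    "audio_1.wav": "A dog barking followed by heavy rain and an engine turning over.",
    "audio_2.wav": "People talking in a busy restaurant.",
}

scores = ranker.predict([(query, caption) for caption in alm_captions.values()])
ranked = sorted(zip(alm_captions, scores), key=lambda kv: -kv[1])
print(ranked)  # higher score = better match
```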
Finally, we evaluate the following final results to obtain the R@1 and R@5 metrics reported in our paper (a minimal R@k sketch follows).
- benchmark = Clotho_LARCQ, SoundDescs_LARCQ
- ALM = gama, flamingo
- LLM results: `results/llm_results/{benchmark}/{ALM}_llm_ranking.csv`
- miniLM results: `results/llm_results/{benchmark}/{ALM}_cross_encoder_ranking.csv`
Run terminal command `python -m evaluate_final_result`
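For reference, R@k measures the fraction of queries whose ground-truth audio appears in the top-k of the re-ranked list. A minimal sketch under assumed column names, not necessarily those read by `evaluate_final_result`:

```python
# Minimal sketch of recall@k over a re-ranked result table; the column names here are
# assumptions, not necessarily those produced by the pipeline or read by evaluate_final_result.
import pandas as pd

def recall_at_k(df: pd.DataFrame, k: int) -> float:
    """Fraction of queries whose ground-truth audio appears among the top-k ranked audios."""
    hits = 0
    for _, group in df.groupby("query_id"):
        top_k = group.sort_values("score", ascending=False).head(k)
        hits += int((top_k["audio_id"] == top_k["ground_truth_audio"]).any())
    return hits / df["query_id"].nunique()

results = pd.read_csv("results/llm_results/Clotho_LARCQ/gama_llm_ranking.csv")
print("R@1:", recall_at_k(results, 1), "R@5:", recall_at_k(results, 5))
```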