- Project website: https://sites.google.com/view/larcq
- Paper: https://www.isca-archive.org/interspeech_2025/yang25n_interspeech.html
```bibtex
@inproceedings{yang25n_interspeech,
  title     = {On Retrieval of Long Audios with Complex Text Queries},
  author    = {Ruochu Yang and Milind Rao and Harshavardhan Sundar and Anirudh Raju and Aparna Khare and Srinath Tankasala and Di He and Venkatesh Ravichandran},
  year      = {2025},
  booktitle = {Interspeech 2025},
  pages     = {2660--2664},
  doi       = {10.21437/Interspeech.2025-2085},
  issn      = {2958-1796},
}
```
```bash
conda create -n larcq python=3.10
conda activate larcq
pip install -r requirements.txt
pip install -e hf-dev-train/transformers-main
pip install -e peft-main
```
Due to license restrictions, we cannot open-source our Clotho_LARCQ and SoundDescs_LARCQ benchmarks. However, we provide the code to generate them, so you can build any LARCQ-style benchmark you want. Save the generated benchmarks in the `datasets` folder.
- Download the `clap-htsat-fused` model from the Hugging Face model link. Save the model in the `models` folder.
- Download the `gpt2` model from the Hugging Face model link. Save the model in the `models` folder.
- Download the `Llama-2-7b-chat-hf-qformer` folder from the Google Drive website link. Save the folder in the `models` folder.
- Download the `stage5_epoch2` folder from the Google Drive website link. Unzip and save the folder in the `models` folder.
- Download the `clapcap_weights_2023.pth` checkpoint from the Hugging Face website link. Save the checkpoint in the `models` folder.
- Download the `opt-iml-max-1.3b` folder from the Hugging Face website link. Unzip and save the folder in the `models` folder.
- Download the `foundation.pt` checkpoint from the Hugging Face website link. Save the checkpoint in the `models` folder.
- Download the `ms-marco-MiniLM-L-6-v2` folder from the Hugging Face website link. Unzip and save the folder in the `models` folder.
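To confirm everything is in place before running the pipeline, here is a minimal sanity check; it only verifies that the items listed above exist under `models/`:

```python
# Sanity check: verify that the downloaded model assets from the list above
# are present under the models/ folder.
from pathlib import Path

MODELS = Path("models")
expected = [
    "clap-htsat-fused",
    "gpt2",
    "Llama-2-7b-chat-hf-qformer",
    "stage5_epoch2",
    "clapcap_weights_2023.pth",
    "opt-iml-max-1.3b",
    "foundation.pt",
    "ms-marco-MiniLM-L-6-v2",
]
missing = [name for name in expected if not (MODELS / name).exists()]
print("All model assets found." if not missing else f"Missing: {missing}")
```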
The results in the paper were generated on a machine with NVIDIA GPUs. Ideally use four GPUs, and make sure `nvidia-smi` works before running the pipeline.
We provide the code for generating our Clotho_LARCQ benchmark from the Clotho v2.1 dataset, so you can follow it to create any LARCQ benchmark you want.
(1) Download the `clotho_audio_evaluation.7z` archive and the `clotho_captions_evaluation.csv` file from the Zenodo website link. Save them in the `datasets/Clotho` folder.
(2) Synthesize long-audio-long-query pairs as LARCQ benchmarks.
Run terminal command `python -m benchmark_generation.synthesize`
The raw LARCQ captions are saved as `datasets/Clotho_LARCQ/raw_LARCQ_captions.csv`
The LARCQ audios are saved in `datasets/Clotho_LARCQ/audios/`
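For intuition, the sketch below shows one way such a pair can be assembled, assuming a long audio is a concatenation of several short Clotho clips and its raw caption is a concatenation of their captions. File and column names follow the Clotho v2.1 release; the actual `benchmark_generation.synthesize` script defines its own pairing logic.

```python
# Illustrative sketch only: assumes a long LARCQ audio is built by concatenating several
# short Clotho clips and its raw caption by concatenating their captions. File/column
# names follow the Clotho v2.1 release; benchmark_generation.synthesize defines the
# actual pairing logic.
from pathlib import Path
import numpy as np
import pandas as pd
import soundfile as sf

Path("datasets/Clotho_LARCQ/audios").mkdir(parents=True, exist_ok=True)
captions = pd.read_csv("datasets/Clotho/clotho_captions_evaluation.csv")
group = captions.sample(4)  # pick a few short clips to merge

audio_chunks, caption_parts = [], []
for _, row in group.iterrows():
    # Clotho evaluation clips share a 44.1 kHz sampling rate, so plain concatenation is safe.
    wav, sr = sf.read(f"datasets/Clotho/evaluation/{row['file_name']}")
    audio_chunks.append(wav)
    caption_parts.append(row["caption_1"])

sf.write("datasets/Clotho_LARCQ/audios/example_long_audio.wav",
         np.concatenate(audio_chunks), sr)
raw_caption = " Then ".join(caption_parts)  # raw long query before LLM refinement
print(raw_caption)
```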
(3) Run LLMs to refine the raw LARCQ captions
We use two options to refine the raw LARCQ captions into natural long queries (a minimal prompting sketch follows this list).
- Condense the raw captions: run terminal command `python -m benchmark_generation.llm_condense`. The condensed LARCQ captions are saved as `datasets/Clotho_LARCQ/condensed_caption.csv`
- Rephrase the raw captions: run terminal command `python -m benchmark_generation.llm_rephrase`. The rephrased LARCQ captions are saved as `datasets/Clotho_LARCQ/rephrased_caption.csv`
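The sketch below illustrates the two kinds of refinement prompts with a generic Hugging Face text-generation pipeline. The prompts and the local model path are placeholders, not the exact prompts or model used by `llm_condense` and `llm_rephrase`.

```python
# Hedged sketch of the two refinement options; the actual llm_condense / llm_rephrase
# scripts define their own prompts and model. The local model path below is only a
# placeholder for whichever instruction-following LLM you use.
from transformers import pipeline

generator = pipeline("text-generation", model="models/opt-iml-max-1.3b")

raw_caption = "A dog barks loudly Then rain falls on a tin roof Then a car engine starts"

condense_prompt = (
    "Condense the following audio description into one concise query:\n"
    f"{raw_caption}\nQuery:"
)
rephrase_prompt = (
    "Rephrase the following audio description as a natural, fluent query:\n"
    f"{raw_caption}\nQuery:"
)

print(generator(condense_prompt, max_new_tokens=64)[0]["generated_text"])
print(generator(rephrase_prompt, max_new_tokens=64)[0]["generated_text"])
```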
(1) Download the original SoundDescs dataset from the official GitHub website link. Save it in the `datasets/SoundDescs` folder.
(2) We filter for audios between 75 and 150 seconds whose captions exceed 150 characters, treating those captions as complex queries. This yields 1639 audio-query pairs, forming our SoundDescs_LARCQ benchmark.
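The filtering rule itself is simple; here is a sketch of it. The metadata file and column names are assumptions, so adapt them to how the downloaded SoundDescs data is actually laid out.

```python
# Sketch of the filtering rule above (75-150 s audios with captions longer than
# 150 characters). The metadata file and column names are assumptions.
import pandas as pd
import soundfile as sf

meta = pd.read_csv("datasets/SoundDescs/descriptions.csv")  # assumed metadata file

def duration_seconds(path: str) -> float:
    info = sf.info(path)
    return info.frames / info.samplerate

meta["duration"] = meta["file_name"].map(
    lambda name: duration_seconds(f"datasets/SoundDescs/audios/{name}")
)
larcq = meta[meta["duration"].between(75, 150) & (meta["caption"].str.len() > 150)]
larcq.to_csv("datasets/SoundDescs_LARCQ/benchmark.csv", index=False)
print(len(larcq), "audio-query pairs")  # the paper reports 1639
```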
Our pipeline consists of two main parts: multi-modal retrieval and ALM/LLM refining.
The retrieval scripts are in the folder `pipeline/multi_modal_retrieval`. Each script is independent and can be executed directly, so you can evaluate any method on any dataset for a comprehensive comparison.
(1) `retrieval_no_chunking.py` retrieves the relevant audios for the queries without any audio chunking or query chunking.
Run terminal command `python -m pipeline.multi_modal_retrieval.retrieval_no_chunking`
Retrieved short-list audios are saved as `results/retrieved_results/{benchmark}/retrieved_audios_no_chunking.csv`
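Without chunking, retrieval amounts to ranking whole audios by CLAP similarity with the whole query. A minimal sketch using `models/clap-htsat-fused` from the setup above; the actual script additionally handles benchmark loading, batching, and CSV output:

```python
# Minimal sketch of no-chunking retrieval with CLAP (assumes models/clap-htsat-fused
# from the setup above); the actual script adds benchmark loading, batching, and CSV output.
import librosa
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("models/clap-htsat-fused")
processor = ClapProcessor.from_pretrained("models/clap-htsat-fused")

query = "A dog barks, then rain falls on a tin roof, then a car engine starts."
audio_paths = ["datasets/Clotho_LARCQ/audios/example_long_audio.wav"]  # candidate audios

with torch.no_grad():
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)

    audios = [librosa.load(p, sr=48000)[0] for p in audio_paths]  # CLAP expects 48 kHz
    audio_inputs = processor(audios=audios, sampling_rate=48000, return_tensors="pt")
    audio_emb = model.get_audio_features(**audio_inputs)

scores = torch.nn.functional.cosine_similarity(text_emb, audio_emb)
short_list = sorted(zip(audio_paths, scores.tolist()), key=lambda x: -x[1])
print(short_list[:5])
```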
(2) `retrieval_audio_chunking.py` retrieves the relevant audios for the queries with audio chunking max/sum vote only, without any query chunking.
Run terminal command `python -m pipeline.multi_modal_retrieval.retrieval_audio_chunking`
Retrieved short-list audios are saved as `results/retrieved_results/{benchmark}/retrieved_audios_audio_chunking.csv`
(3) `retrieval_query_chunking.py` retrieves the relevant audios for the queries with query chunking max/sum vote only, without any audio chunking.
Run terminal command `python -m pipeline.multi_modal_retrieval.retrieval_query_chunking`
Retrieved short-list audios are saved as `results/retrieved_results/{benchmark}/retrieved_audios_query_chunking.csv`
(4) `retrieval_audio_chunking_query_chunking.py` applies the four combinations of audio chunking max vote × query chunking sum vote, audio chunking sum vote × query chunking sum vote, audio chunking sum vote × query chunking max vote, and audio chunking max vote × query chunking max vote to retrieve the audios (the voting logic is sketched after this step).
Run terminal command `python -m pipeline.multi_modal_retrieval.retrieval_audio_chunking_query_chunking`
Retrieved short-list audios are saved as `results/retrieved_results/{benchmark}/retrieved_audios_best.csv`
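The max/sum votes collapse a chunk-level similarity matrix into a single score per candidate audio. A minimal sketch of that aggregation; chunk sizes and the exact aggregation order in the actual scripts may differ:

```python
# Sketch of the max/sum voting used to collapse chunk-level similarities into one score
# per candidate audio; chunk sizes and aggregation order in the scripts may differ.
import torch

def aggregate(sim: torch.Tensor, audio_vote: str, query_vote: str) -> torch.Tensor:
    """sim[i, j]: similarity between query chunk i and audio chunk j of one candidate audio."""
    per_query_chunk = sim.max(dim=1).values if audio_vote == "max" else sim.sum(dim=1)
    return per_query_chunk.max() if query_vote == "max" else per_query_chunk.sum()

sim = torch.rand(3, 5)  # e.g. 3 query chunks x 5 audio chunks
for audio_vote in ("max", "sum"):
    for query_vote in ("max", "sum"):
        print(f"audio {audio_vote} x query {query_vote}:",
              aggregate(sim, audio_vote, query_vote).item())
```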
In our paper, we use two ALMs, GAMA and Audio-Flamingo, to generate captions for the retrieved audios.
(1) Run terminal command `python -m pipeline.alm_llm_refining.run_gama`
GAMA captioning results on the retrieved audios are saved as `results/alm_results/{benchmark}/retrieved_audios_gama.csv`
(2) Run terminal command `python -m pipeline.alm_llm_refining.run_flamingo`
Audio-Flamingo captioning results on the retrieved audios are saved as `results/alm_results/{benchmark}/retrieved_audios_flamingo.csv`
In our paper, we use an LLM or miniLM to compare the ALM-generated response with the text query. You can use any LLM or miniLM model you want.
(1) Use LLM
- In our paper, we use Mixtral as the LLM for re-ranking. Follow the tutorial on the Mistral AI website link to set up Mixtral. First, install the `vllm` package (version `>=0.6.1.post1` to ensure maximum compatibility with all Mistral models). Second, authenticate on the Hugging Face Hub with your access token `$HF_TOKEN` via the command `huggingface-cli login --token $HF_TOKEN`.
- Choose an ALM captioning file `results/alm_results/{benchmark}/retrieved_audios_{ALM}.csv`, like `results/alm_results/Clotho_LARCQ/retrieved_audios_gama.csv`.
- Run terminal command `python -m pipeline.alm_llm_refining.llm_ranking`. LLM re-ranking results are saved as `results/llm_results/{benchmark}/{ALM}_llm_ranking.csv`. A minimal vLLM re-ranking sketch follows this list.
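The sketch below shows LLM-based re-ranking with vLLM. The Mixtral model name, the prompt, and the score parsing are illustrative assumptions, not the exact logic of `pipeline.alm_llm_refining.llm_ranking`.

```python
# Hedged sketch of LLM re-ranking with vLLM; the model name, prompt, and score parsing
# are illustrative, not the exact logic of pipeline.alm_llm_refining.llm_ranking.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1")
params = SamplingParams(temperature=0.0, max_tokens=8)

query = "A dog barks, then rain falls on a tin roof, then a car engine starts."
alm_captions = {  # ALM-generated captions for the retrieved short-list audios
    "audio_1.wav": "A dog barking followed by heavy rain and an engine turning over.",
    "audio_2.wav": "People talking in a busy restaurant.",
}

prompts = [
    f"Query: {query}\nAudio description: {caption}\n"
    "On a scale of 0 to 10, how well does the description match the query? Answer with a number."
    for caption in alm_captions.values()
]
outputs = llm.generate(prompts, params)

def to_score(text: str) -> float:
    try:
        return float(text.split()[0])
    except (ValueError, IndexError):
        return 0.0

scores = {name: to_score(out.outputs[0].text) for name, out in zip(alm_captions, outputs)}
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # re-ranked short list
```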
(2) Use miniLM
- Choose an ALM captioning file `results/alm_results/{benchmark}/retrieved_audios_{ALM}.csv`, like `results/alm_results/Clotho_LARCQ/retrieved_audios_gama.csv`.
- Run terminal command `python -m pipeline.alm_llm_refining.cross_encoder_ranking`. miniLM re-ranking results are saved as `results/llm_results/{benchmark}/{ALM}_cross_encoder_ranking.csv`. A minimal cross-encoder sketch follows this list.
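A minimal sketch of cross-encoder re-ranking with the `ms-marco-MiniLM-L-6-v2` model downloaded into `models/` above; the actual script reads the ALM captioning CSV instead of a hard-coded dictionary.

```python
# Sketch of cross-encoder re-ranking with the ms-marco-MiniLM-L-6-v2 model downloaded
# into models/ above; the actual script reads the ALM captioning CSV instead of a dict.
from sentence_transformers import CrossEncoder

ranker = CrossEncoder("models/ms-marco-MiniLM-L-6-v2")

query = "A dog barks, then rain falls on a tin roof, then a car engine starts."
alm_captions = {
    "audio_1.wav": "A dog barking followed by heavy rain and an engine turning over.",
    "audio_2.wav": "People talking in a busy restaurant.",
}

scores = ranker.predict([(query, caption) for caption in alm_captions.values()])
ranked = sorted(zip(alm_captions, scores), key=lambda kv: -kv[1])
print(ranked)  # higher score = better match
```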
Finally, we evaluate the following final results to obtain the R@1 and R@5 metrics reported in our paper (a minimal R@k sketch follows).
- benchmark = Clotho_LARCQ, SoundDescs_LARCQ
- ALM = gama, flamingo
- LLM results: `results/llm_results/{benchmark}/{ALM}_llm_ranking.csv`
- miniLM results: `results/llm_results/{benchmark}/{ALM}_cross_encoder_ranking.csv`
Run terminal command `python -m evaluate_final_result`
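For reference, R@k measures the fraction of queries whose ground-truth audio appears in the top-k of the re-ranked list. A minimal sketch under assumed column names, not necessarily those read by `evaluate_final_result`:

```python
# Minimal sketch of recall@k over a re-ranked result table; the column names here are
# assumptions, not necessarily those produced by the pipeline or read by evaluate_final_result.
import pandas as pd

def recall_at_k(df: pd.DataFrame, k: int) -> float:
    """Fraction of queries whose ground-truth audio appears among the top-k ranked audios."""
    hits = 0
    for _, group in df.groupby("query_id"):
        top_k = group.sort_values("score", ascending=False).head(k)
        hits += int((top_k["audio_id"] == top_k["ground_truth_audio"]).any())
    return hits / df["query_id"].nunique()

results = pd.read_csv("results/llm_results/Clotho_LARCQ/gama_llm_ranking.csv")
print("R@1:", recall_at_k(results, 1), "R@5:", recall_at_k(results, 5))
```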