- Release ROS2 simulation code for UAV/UGV.
- [Sept 25]: We release LISA-AVS, a LISA 7B VLM finetuned on AVS-Bench to output score maps and text explanations given input satellite images and text queries. Check out our LISA-AVS demo here.
- [Sept 25]: We release our AVS RL policy pre-trained on AVS-Bench score maps.
- [Sept 25]: Initial release of Search-TTA and AVS-Bench. Check out our Search-TTA demo here.
- [Aug 25]: Our paper is accepted at the Conference on Robot Learning (CoRL 2025).
Our work addresses the challenges of autonomous outdoor visual navigation and search, where targets cannot be directly seen in satellite images. We introduce Search-TTA, a multimodal test-time adaptation framework that significantly corrects poor VLM predictions caused by domain mismatch or a lack of training data, across various input modalities (e.g., image, text, sound) and planning methods (e.g., RL).
To train and evaluate Search-TTA, we curate AVS-Bench, a visual search dataset built on internet-scale ecological data comprising satellite images, each paired with targets and their corresponding ground-level images, taxonomic labels, and sound data. It contains 380k training and 8k validation images (in-domain and out-of-domain).
## Download Instructions
We release the following training & evaluation datasets with target location annotations on Huggingface. These datasets are downloaded automatically for training and inference, although you will still need to download the raw satellite images and iNaturalist files as described in the following sections.
- Quad-modal: Sat-Text-Image-Sound pairing modalities
- Tri-modal: Sat-Text-Image pairing modalities
For convenience, you may directly download the satellite images and sound zip files from the links below.
Alternatively, you may run the scripts in taxabind_avs/scripts/ to download the TaxaBind datasets.
Note that you should download the partial dataset if you only want to run evals.
- Satellite Images: Partial Eval, Full Train
- Sound + Ground Images: Partial Eval, Full Train
Download the following datasets from the iNaturalist 2021 Challenge using the links below. Note that you should download the partial dataset if you only want to run evals.
- Partial iNat Dataset: Partial Images+Json
- Full iNat Dataset: Train Images, Train Json, Val Images, Val Json
You must download the datasets from the links above and organize them as follows.
Note: the partial dataset is sufficient if you only want to run evals; if you would like to train the satellite image or sound encoder, please download the full dataset.
The total partial dataset size for evals is ~2 GB, while the full dataset size for training is ~350 GB.
```
avs_bench_ds
├── inat21
│   ├── train
│   │   ├── 00000_Animalia_Annelida_Clitellata_Haplotaxida_...jpg
│   │   └── ...
│   ├── val
│   │   ├── 00000_Animalia_Annelida_Clitellata_Haplotaxida_...jpg
│   │   └── ...
│   ├── train.json
│   └── val.json
├── sat_jpg
│   ├── train_512px
│   │   ├── 0_43.83486_-71.22231.jpg
│   │   └── ...
│   └── test_512px
│       ├── 2686843_-21.93073_114.12239.jpg
│       └── ...
└── sound_mp3
    ├── train
    │   ├── sounds
    │   │   ├── 100002768.mp3
    │   │   └── ...
    │   └── images
    │       ├── 100002768.jpg
    │       └── ...
    └── test
        ├── sounds
        │   ├── 100010745.mp3
        │   └── ...
        └── images
            ├── 100010745.jpg
            └── ...
```
This repository was tested with the following dependencies on Ubuntu 20.04. You may set up the conda environment as follows:

```shell
conda create -n search-tta python=3.10
conda activate search-tta
pip install -r requirements.txt
```

## Code Structure
The structure of our codebase is as follows:
- `eval/`: evaluation shell scripts to evaluate Search-TTA.
- `planner/`: planner framework scripts of Search-TTA.
- `train/`: training models, logs, and gifs.
- `inference/`: trained models, inference logs, and gifs.
- `maps/`: training/eval envs and score maps.
- `taxabind_avs/satbind/`: training & TTA scripts for the satellite image encoder.
- `taxabind_avs/soundbind/`: training scripts for the sound encoder.
- `taxabind_avs/scripts/`: scripts to download the raw iNat datasets.
If you would like, you may follow the instructions below to train the satellite image encoder, sound encoder, and RL planner policy. Otherwise, you can skip to the Inference section below to run the pre-trained models from huggingface. Note that you will need to download the full dataset to train the satellite image / sound encoders.
## Training Details
To train the satellite image encoder, follow the steps below.
The training script automatically downloads the tri_modal dataset from Huggingface and trains the satellite image encoder to align with the representation space of BioCLIP's ground image encoder.
Note that you should adjust the avs_ds_dir parameter in config_sat.py to match your downloaded dataset directories.
We offer the finetuned sat encoder checkpoint here.
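To illustrate the alignment objective described above, here is a minimal numpy sketch of a symmetric CLIP-style (InfoNCE) contrastive loss between satellite and ground-image embeddings. This is an illustration of the general technique only, not the actual code in model_sat.py; the function name and temperature value are our own choices:

```python
import numpy as np

def clip_alignment_loss(sat_emb, ground_emb, temperature=0.07):
    """Symmetric InfoNCE loss between satellite and ground-image embeddings.

    sat_emb, ground_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products are cosine similarities
    sat = sat_emb / np.linalg.norm(sat_emb, axis=1, keepdims=True)
    gnd = ground_emb / np.linalg.norm(ground_emb, axis=1, keepdims=True)
    logits = sat @ gnd.T / temperature  # (batch, batch) similarity matrix

    # Cross-entropy with the diagonal (matched pairs) as targets, both directions
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each satellite embedding toward its paired ground-image embedding while pushing it away from the other samples in the batch.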
```shell
cd taxabind_avs/satbind  # Adjust config_sat.py
python model_sat.py
tensorboard --logdir=lightning_logs
```

To train the sound encoder, follow the steps below.
The training script automatically downloads the quad_modal dataset from Huggingface and trains the sound encoder to align with the representation space of BioCLIP's ground image encoder.
Note that you should adjust the avs_ds_dir parameter in config_sound.py to match your downloaded dataset directories.
We offer the finetuned sound encoder checkpoint here.
```shell
cd taxabind_avs/soundbind  # Adjust config_sound.py
python model_sound.py
tensorboard --logdir=lightning_logs
```

To train the RL planner, follow the steps below. The planner is trained using the score maps and envs from maps/GT_GPT4o, which are generated by converting point locations into segmentation score masks using GPT-4o.
We offer the trained model checkpoint here.
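As a simple illustration of the point-to-mask idea (the actual maps/GT_GPT4o masks are generated with GPT-4o, not by this code), target point locations can be converted into a smooth score mask by placing a Gaussian blob at each point; the function below is our own sketch:

```python
import numpy as np

def points_to_score_mask(points, shape=(64, 64), sigma=4.0):
    """Convert (row, col) target points into a [0, 1] score mask.

    Each point contributes a Gaussian blob; overlapping blobs are merged
    via max, and the mask peaks at 1.0 on each target cell.
    """
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    mask = np.zeros(shape, dtype=np.float64)
    for r, c in points:
        blob = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma**2))
        mask = np.maximum(mask, blob)
    return mask
```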
```shell
# Adjust planner/parameter.py
python -m planner.driver
tensorboard --logdir=train/logs
```

Note that you will need to download at least the partial dataset to evaluate Search-TTA.
To run Search-TTA with RL or Information Surfing (IS) planner, follow the steps below.
It automatically downloads the eval splits of AVS-Bench and the trained encoder checkpoints from Huggingface.
You may test our approach on image, text, or sound input modalities.
Note that you should adjust AVS_DS_DIR, NUM_GPU, and NUM_META_AGENTS parameters in test_parameter.py to match your hardware specifications.
```shell
# Adjust base parameters in planner/test_parameter.py
cd eval/
./eval_<MODE>.sh
```

If you would like to evaluate TTA on CLIP (without embodied search), follow the steps below. You should see regions with positive samples getting brighter and regions with negative samples getting darker.
```shell
cd taxabind_avs/satbind
python clip_seg_tta.py
```

## Additional Customizations
Instead of running the planners on CLIP-generated score maps, you may run them on your own custom score maps.
Please refer to eval/eval_lisa.sh for an example on how to load your custom data. You can download the score maps for LISA here.
Instead of retrieving targets from AVS-Bench, you can override the target positions by setting the TARGETS_SET_DIR parameter in test_parameter.py. This loads maps whose target positions are marked with grey squares (value of 208). See an example in maps/example/gt_masks_val_with_tgts (only the targets, not the mask, are loaded). Note that you must set LOAD_AVS_BENCH to False to use this feature.
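Assuming the map is loaded as a grayscale array, extracting the grey-square target positions amounts to matching the pixel value 208 mentioned above. A minimal sketch (the function name is ours, not the repo's):

```python
import numpy as np

TARGET_VALUE = 208  # grey-square pixel value marking targets

def extract_target_positions(map_img):
    """Return (row, col) coordinates of all target-marked pixels."""
    rows, cols = np.where(map_img == TARGET_VALUE)
    return list(zip(rows.tolist(), cols.tolist()))
```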
Instead of using our Reinforcement Learning (RL) or Information Surfing (IS) based planner, you may use your own custom planners. To do so, initialize another script similar to test_worker.py or test_info_surfing.py, and interface it with the Ray framework in test_driver.py and the search environment in env.py.
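To give a sense of the shape of a custom planner, here is a toy greedy baseline that consumes a score map and repeatedly moves to the highest-scoring unvisited neighbor. This is purely illustrative: the class and its interface are our own invention and do not model the Ray interface in test_driver.py or the environment API in env.py:

```python
import numpy as np

class GreedyPlanner:
    """Toy planner: at each step, move to the 4-connected neighbor with
    the highest score, marking visited cells to avoid revisiting them."""

    def __init__(self, score_map, start):
        self.score = np.array(score_map, dtype=float)
        self.pos = start
        self.visited = {start}

    def step(self):
        r, c = self.pos
        h, w = self.score.shape
        # Enumerate in-bounds, unvisited 4-connected neighbors
        candidates = [
            (r + dr, c + dc)
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if 0 <= r + dr < h and 0 <= c + dc < w
            and (r + dr, c + dc) not in self.visited
        ]
        if not candidates:
            return None  # nowhere left to go
        self.pos = max(candidates, key=lambda p: self.score[p])
        self.visited.add(self.pos)
        return self.pos
```

A real planner would additionally update its belief from observations and budget its travel cost, which is what the RL and IS planners handle.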
Please stay tuned for our Search-TTA ROS2 integration with Gazebo simulation (for UAV / UGV).
Our project is based on the following works:
We would like to thank the authors for their great work. Please refer to their papers for more details.
If you intend to use our work in your research, please cite the following publication:
@inproceedings{tan2025searchtta,
title={Search-TTA: A Multi-Modal Test-Time Adaptation Framework for Visual Search in the Wild},
author={Tan, Derek Ming Siang and Shailesh, Shailesh and Liu, Boyang and Raj, Alok and Ang, Qi Xuan and Dai, Weiheng and Duhan, Tanishq and Chiun, Jimmy and Cao, Yuhong and Shkurti, Florian and Sartoretti, Guillaume Adrien},
booktitle={Proceedings of The 9th Conference on Robot Learning},
pages={2093--2120},
year={2025},
volume={305},
publisher={PMLR}
}



