- Release ROS2 simulation code for UAV/UGV.
- [Sept 25]: We release LISA-AVS, a LISA 7B VLM finetuned on AVS-Bench to output score maps and text explanations given input satellite images and text queries. Check out our LISA-AVS demo here.
- [Sept 25]: We release our AVS RL policy pre-trained on AVS-Bench score maps.
- [Sept 25]: Initial release of Search-TTA and AVS-Bench. Check out our Search-TTA demo here.
- [Aug 25]: Our paper is accepted at the Conference on Robot Learning (CoRL 2025).
Our work addresses the challenges of autonomous outdoor visual navigation and search, where targets cannot be directly seen in satellite images. We introduce Search-TTA, a multimodal test-time adaptation framework that significantly corrects poor VLM predictions caused by domain mismatch or a lack of training data, across various input modalities (e.g., image, text, sound) and planning methods (e.g., RL).
To train and evaluate Search-TTA, we curate AVS-Bench, a visual search dataset built on internet-scale ecological data comprising satellite images, each paired with targets and their corresponding ground-level images, taxonomic labels, and sound data. It contains 380k training and 8k validation images (in-domain and out-of-domain).
## Download Instructions
We release the following training & evaluation datasets with target location annotations on Huggingface. These datasets are downloaded automatically for training and inference, although you will still need to download the raw satellite images and iNaturalist files as described in the following sections.
- Quad-modal: Sat-Text-Image-Sound pairing modalities
- Tri-modal: Sat-Text-Image pairing modalities
For convenience, you may directly download the satellite images and sound zip files from the links below.
Alternatively, you may run the scripts in taxabind_avs/scripts/ to download the TaxaBind datasets.
Note that you should download the partial dataset if you only want to run evals.
- Satellite Images: Partial Eval, Full Train
- Sound + Ground Images: Partial Eval, Full Train
Download the following datasets from the iNaturalist 2021 Challenge using the links below. Note that you should download the partial dataset if you only want to run evals.
- Partial iNat Dataset: Partial Images+Json
- Full iNat Dataset: Train Images, Train Json, Val Images, Val Json
You must download the datasets from the links above and organize them as follows.
Note: the partial dataset is sufficient if you only want to run evals; if you would like to train the satellite image or sound encoder, please download the full dataset.
The total partial dataset size for evals is ~2 GB, while the full dataset size for training is ~350 GB.
```
avs_bench_ds
├── inat21
│   ├── train
│   │   ├── 00000_Animalia_Annelida_Clitellata_Haplotaxida_...jpg
│   │   └── ...
│   ├── val
│   │   ├── 00000_Animalia_Annelida_Clitellata_Haplotaxida_...jpg
│   │   └── ...
│   ├── train.json
│   └── val.json
├── sat_jpg
│   ├── train_512px
│   │   ├── 0_43.83486_-71.22231.jpg
│   │   └── ...
│   └── test_512px
│       ├── 2686843_-21.93073_114.12239.jpg
│       └── ...
└── sound_mp3
    ├── train
    │   ├── sounds
    │   │   ├── 100002768.mp3
    │   │   └── ...
    │   └── images
    │       ├── 100002768.jpg
    │       └── ...
    └── test
        ├── sounds
        │   ├── 100010745.mp3
        │   └── ...
        └── images
            ├── 100010745.jpg
            └── ...
```
This repository was tested with the following dependencies on Ubuntu 20.04. You may set up the conda environment as follows:

```shell
conda create -n search-tta python=3.10
conda activate search-tta
pip install -r requirements.txt
```

## Code Structure
The structure of our codebase is as follows:
- `eval/`: evaluation shell scripts to evaluate Search-TTA.
- `planner/`: planner framework scripts of Search-TTA.
- `train/`: training models, logs, and gifs.
- `inference/`: trained models, inference logs, and gifs.
- `maps/`: training/eval envs and score maps.
- `taxabind_avs/satbind/`: training & TTA scripts for the satellite image encoder.
- `taxabind_avs/soundbind/`: training scripts for the sound encoder.
- `taxabind_avs/scripts/`: scripts to download the raw iNat datasets.
If you would like, you may follow the instructions below to train the satellite image encoder, sound encoder, and RL planner policy. Otherwise, you can skip to the Inference section below to run the pre-trained models from huggingface. Note that you will need to download the full dataset to train the satellite image / sound encoders.
## Training Details
To train the satellite image encoder, follow the steps below.
The training script automatically downloads the tri_modal dataset from Huggingface and trains the satellite image encoder to align with the representation space of BioCLIP's ground image encoder.
Note that you should adjust the avs_ds_dir parameter in config_sat.py to match your downloaded dataset directories.
We offer the finetuned sat encoder checkpoint here.
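To illustrate the alignment objective described above, here is a minimal numpy sketch of a symmetric CLIP-style (InfoNCE) contrastive loss between satellite and ground-image embeddings. This is an illustration of the general technique only, not the actual code in model_sat.py; the function name and temperature value are our own choices:

```python
import numpy as np

def clip_alignment_loss(sat_emb, ground_emb, temperature=0.07):
    """Symmetric InfoNCE loss between satellite and ground-image embeddings.

    sat_emb, ground_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products are cosine similarities
    sat = sat_emb / np.linalg.norm(sat_emb, axis=1, keepdims=True)
    gnd = ground_emb / np.linalg.norm(ground_emb, axis=1, keepdims=True)
    logits = sat @ gnd.T / temperature  # (batch, batch) similarity matrix

    # Cross-entropy with the diagonal (matched pairs) as targets, both directions
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each satellite embedding toward its paired ground-image embedding while pushing it away from the other samples in the batch.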
```shell
cd taxabind_avs/satbind  # Adjust config_sat.py
python model_sat.py
tensorboard --logdir=lightning_logs
```

To train the sound encoder, follow the steps below.
The training script automatically downloads the quad_modal dataset from Huggingface and trains the sound encoder to align with the representation space of BioCLIP's ground image encoder.
Note that you should adjust the avs_ds_dir parameter in config_sound.py to match your downloaded dataset directories.
We offer the finetuned sound encoder checkpoint here.
```shell
cd taxabind_avs/soundbind  # Adjust config_sound.py
python model_sound.py
tensorboard --logdir=lightning_logs
```

To train the RL planner, follow the steps below. The planner is trained using the score maps and envs from maps/GT_GPT4o, which are generated by converting point locations into segmentation score masks using GPT-4o.
We offer the trained model checkpoint here.
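As a simple illustration of the point-to-mask idea (the actual maps/GT_GPT4o masks are generated with GPT-4o, not by this code), target point locations can be converted into a smooth score mask by placing a Gaussian blob at each point; the function below is our own sketch:

```python
import numpy as np

def points_to_score_mask(points, shape=(64, 64), sigma=4.0):
    """Convert (row, col) target points into a [0, 1] score mask.

    Each point contributes a Gaussian blob; overlapping blobs are merged
    via max, and the mask peaks at 1.0 on each target cell.
    """
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    mask = np.zeros(shape, dtype=np.float64)
    for r, c in points:
        blob = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma**2))
        mask = np.maximum(mask, blob)
    return mask
```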
```shell
# Adjust planner/parameter.py
python -m planner.driver
tensorboard --logdir=train/logs
```

Note that you will need to download at least the partial dataset to evaluate Search-TTA.
To run Search-TTA with RL or Information Surfing (IS) planner, follow the steps below.
It automatically downloads the eval splits of AVS-Bench and the trained encoder checkpoints from Huggingface.
You may test our approach on image, text, or sound input modalities.
Note that you should adjust AVS_DS_DIR, NUM_GPU, and NUM_META_AGENTS parameters in test_parameter.py to match your hardware specifications.
```shell
# Adjust base parameters in planner/test_parameter.py
cd eval/
./eval_<MODE>.sh
```

If you would like to evaluate TTA on CLIP (without embodied search), follow the steps below. You should see regions with positive samples getting brighter and regions with negative samples getting darker.
```shell
cd taxabind_avs/satbind
python clip_seg_tta.py
```

## Additional Customizations
Instead of running the planners on CLIP-generated score maps, you may run them on your own custom score maps.
Please refer to eval/eval_lisa.sh for an example on how to load your custom data. You can download the score maps for LISA here.
Instead of retrieving targets from AVS-Bench, you can override the target positions by setting the TARGETS_SET_DIR parameter in test_parameter.py. This loads maps whose target positions are marked with grey squares (value of 208). See an example in maps/example/gt_masks_val_with_tgts (only the targets, not the mask, are loaded). Note that you must set LOAD_AVS_BENCH to False to use this feature.
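Assuming the map is loaded as a grayscale array, extracting the grey-square target positions amounts to matching the pixel value 208 mentioned above. A minimal sketch (the function name is ours, not the repo's):

```python
import numpy as np

TARGET_VALUE = 208  # grey-square pixel value marking targets

def extract_target_positions(map_img):
    """Return (row, col) coordinates of all target-marked pixels."""
    rows, cols = np.where(map_img == TARGET_VALUE)
    return list(zip(rows.tolist(), cols.tolist()))
```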
Instead of using our Reinforcement Learning (RL) or Information Surfing (IS) based planner, you may use your own custom planners. To do so, initialize another script similar to test_worker.py or test_info_surfing.py, and interface it with the Ray framework in test_driver.py and the search environment in env.py.
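To give a sense of the shape of a custom planner, here is a toy greedy baseline that consumes a score map and repeatedly moves to the highest-scoring unvisited neighbor. This is purely illustrative: the class and its interface are our own invention and do not model the Ray interface in test_driver.py or the environment API in env.py:

```python
import numpy as np

class GreedyPlanner:
    """Toy planner: at each step, move to the 4-connected neighbor with
    the highest score, marking visited cells to avoid revisiting them."""

    def __init__(self, score_map, start):
        self.score = np.array(score_map, dtype=float)
        self.pos = start
        self.visited = {start}

    def step(self):
        r, c = self.pos
        h, w = self.score.shape
        # Enumerate in-bounds, unvisited 4-connected neighbors
        candidates = [
            (r + dr, c + dc)
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if 0 <= r + dr < h and 0 <= c + dc < w
            and (r + dr, c + dc) not in self.visited
        ]
        if not candidates:
            return None  # nowhere left to go
        self.pos = max(candidates, key=lambda p: self.score[p])
        self.visited.add(self.pos)
        return self.pos
```

A real planner would additionally update its belief from observations and budget its travel cost, which is what the RL and IS planners handle.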
Please stay tuned for our Search-TTA ROS2 integration with Gazebo simulation (for UAV / UGV).
Our project is based on the following works:
We would like to thank the authors for their great work. Please refer to their papers for more details.
If you intend to use our work in your research, please cite the following publication:
@inproceedings{tan2025searchtta,
title={Search-TTA: A Multi-Modal Test-Time Adaptation Framework for Visual Search in the Wild},
author={Tan, Derek Ming Siang and Shailesh, Shailesh and Liu, Boyang and Raj, Alok and Ang, Qi Xuan and Dai, Weiheng and Duhan, Tanishq and Chiun, Jimmy and Cao, Yuhong and Shkurti, Florian and Sartoretti, Guillaume Adrien},
booktitle={Proceedings of The 9th Conference on Robot Learning},
pages={2093--2120},
year={2025},
volume={305},
publisher={PMLR}
}



