
Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild

Project Website arXiv HF Models HF Demo

Search-TTA is compatible with different robots (e.g. UAV / UGV / AUV) and planners (e.g. RL / IS).

📢 News / To-Dos

🦁 Introduction

Our work addresses the challenges of autonomous outdoor visual navigation and search, where targets cannot be directly seen from satellite images. We introduce Search-TTA, a multimodal test-time adaptation framework that significantly corrects poor VLM predictions caused by domain mismatch or a lack of training data, across various input modalities (e.g. image, text, sound) and planning methods (e.g. RL).


🔥 AVS-Bench Dataset

To train and evaluate Search-TTA, we curate AVS-Bench, a visual search dataset built from internet-scale ecological data. It comprises satellite images, each annotated with target locations and paired with corresponding ground-level images, taxonomic labels, and sound data. It contains 380k training and 8k validation images (in- and out-of-domain).


πŸ“ Download Instructions

Target Location Datasets

We release the following training & evaluation datasets with target location annotations on Huggingface. These datasets are downloaded automatically for training and inference, although you will still need to download the raw satellite images and iNaturalist files described in the following sections.

Satellite Images and Sound Data

For convenience, you may directly download the satellite image and sound zip files from the links below. Alternatively, you may run the scripts in taxabind_avs/scripts/ to download the TaxaBind datasets. Note that you should download the partial dataset if you only want to run evals.

iNaturalist Ground Images

Download the following datasets from the iNaturalist 2021 Challenge using the links below. Note that you should download the partial dataset if you only want to run evals.

Dataset Organization

You must download the datasets from the above links, and organize them as follows.

Note: You only need to download the partial dataset from above if you want to perform evals. If you would like to train the satellite image or sound encoder, please download the full dataset. The total partial dataset size for evals is ~2GB, while the full dataset size for training is ~350GB.

├── avs_bench_ds
│   ├── inat21
│   │   ├── train
│   │   │   ├── 00000_Animalia_Annelida_Clitellata_Haplotaxida_...jpg
│   │   │   └── ...
│   │   ├── val
│   │   │   ├── 00000_Animalia_Annelida_Clitellata_Haplotaxida_...jpg
│   │   │   └── ...
│   │   ├── train.json
│   │   └── val.json
│   ├── sat_jpg
│   │   ├── train_512px
│   │   │   ├── 0_43.83486_-71.22231.jpg
│   │   │   └── ...
│   │   └── test_512px
│   │       ├── 2686843_-21.93073_114.12239.jpg
│   │       └── ...
│   └── sound_mp3
│       ├── train
│       │   ├── sounds
│       │   │   ├── 100002768.mp3
│       │   │   └── ...
│       │   └── images
│       │       ├── 100002768.jpg
│       │       └── ...
│       └── test
│           ├── sounds
│           │   ├── 100010745.mp3
│           │   └── ...
│           └── images
│               ├── 100010745.jpg
│               └── ...
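
As shown in the tree above, satellite image filenames encode an index, latitude, and longitude separated by underscores (e.g. `0_43.83486_-71.22231.jpg`). A minimal helper to parse them might look like this; the function name is illustrative and not part of the repo:

```python
from pathlib import Path

def parse_sat_filename(filename: str) -> tuple[int, float, float]:
    """Split a satellite filename of the form <id>_<lat>_<lon>.jpg
    into its numeric components. Illustrative helper, not repo code."""
    stem = Path(filename).stem       # drop the .jpg extension
    idx, lat, lon = stem.split("_")  # fields are underscore-separated
    return int(idx), float(lat), float(lon)
```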

📚 Code Overview

Requirements

This repository was tested on Ubuntu 20.04 with the following dependencies. You can set up the conda environment as follows:

conda create -n search-tta python=3.10
conda activate search-tta
pip install -r requirements.txt
πŸ“ Code Structure

Code Structure

The structure of our codebase is as follows:

  • eval/ shell scripts to evaluate Search-TTA.
  • planner/ planner framework scripts for Search-TTA.
  • train/ training models, logs, and gifs.
  • inference/ trained models, inference logs, and gifs.
  • maps/ training/eval envs and score maps.
  • taxabind_avs/
    • satbind/ training & TTA scripts for the satellite image encoder.
    • soundbind/ training scripts for the sound encoder.
    • scripts/ scripts to download raw iNaturalist datasets.

📊 Training

If you would like, you may follow the instructions below to train the satellite image encoder, sound encoder, and RL planner policy. Otherwise, you can skip to the Inference section below to run the pre-trained models from Huggingface. Note that you will need to download the full dataset to train the satellite image / sound encoders.

πŸ“ Training Details

SatBind

To train the satellite image encoder, follow the steps below. This automatically downloads the tri_modal dataset from Huggingface and trains the satellite image encoder to align with the same representation space as BioCLIP's ground image encoder. Note that you should adjust the avs_ds_dir parameter in config_sat.py to match your downloaded dataset directories. We provide the finetuned satellite encoder checkpoint here.

cd taxabind_avs/satbind   # Adjust config_sat.py
python model_sat.py
tensorboard --logdir=lightning_logs

SoundBind

To train the sound encoder, follow the steps below. It automatically downloads the quad_modal dataset from Huggingface, and trains the sound encoder to align to the same representation space as BioCLIP's ground image encoder. Note that you should adjust the avs_ds_dir parameter in config_sound.py to match your downloaded dataset directories. We offer the finetuned sound encoder checkpoint here.

cd taxabind_avs/soundbind   # Adjust config_sound.py
python model_sound.py
tensorboard --logdir=lightning_logs

RL Planner

To train the RL planner, follow the steps below. The planner is trained using the score maps and envs from maps/GT_GPT4o, which are generated by converting point locations to segmentation score masks using GPT4o. We offer the trained model checkpoint here.

# Adjust planner/parameter.py
python -m planner.driver    
tensorboard --logdir=train/logs
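
The repo's score masks in maps/GT_GPT4o are generated with GPT-4o, but the underlying idea (turning point target locations into a soft segmentation score mask) can be sketched with plain numpy. The function below is an illustrative stand-in, not the repo's actual mask generator; the grid shape and sigma are assumed values:

```python
import numpy as np

def points_to_score_mask(points, shape=(64, 64), sigma=3.0):
    """Place an isotropic Gaussian at each (row, col) target point and
    take the per-cell maximum, yielding a soft score mask in [0, 1].
    Illustrative sketch; the repo's masks are produced with GPT-4o."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    mask = np.zeros(shape, dtype=np.float64)
    for r, c in points:
        gauss = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma ** 2))
        mask = np.maximum(mask, gauss)
    return mask
```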

🚀 Inference

Note that you will need to download at least the partial dataset to evaluate Search-TTA.


Evaluate Search-TTA

To run Search-TTA with RL or Information Surfing (IS) planner, follow the steps below. It automatically downloads the eval splits of AVS-Bench and the trained encoder checkpoints from Huggingface. You may test our approach on image, text, or sound input modalities. Note that you should adjust AVS_DS_DIR, NUM_GPU, and NUM_META_AGENTS parameters in test_parameter.py to match your hardware specifications.

# Adjust base parameters in planner/test_parameter.py
cd eval/
./eval_<MODE>.sh    

Evaluate TTA on CLIP

If you would like to evaluate TTA on CLIP (without embodied search), follow the steps below. You should see regions with positive samples getting brighter, and regions with negative samples getting darker.

cd taxabind_avs/satbind
python clip_seg_tta.py
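
The qualitative effect described above (positive samples brightening a region, negative samples darkening it) can be mimicked with a simple multiplicative score-map update. This is NOT the repo's TTA method, which adapts the CLIP satellite encoder itself; the gains below are arbitrary illustrative values:

```python
import numpy as np

def update_score_map(score_map, cell, found_target, pos_gain=1.5, neg_gain=0.5):
    """Scale the score at an observed cell up (positive sample) or down
    (negative sample), then clip back to [0, 1]. Illustration only."""
    out = score_map.copy()
    r, c = cell
    out[r, c] *= pos_gain if found_target else neg_gain
    return np.clip(out, 0.0, 1.0)
```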
πŸ“ Additional Customizations

Custom Score Maps

Instead of running the planners on CLIP-generated score maps, you may run them on your own custom score maps. Please refer to eval/eval_lisa.sh for an example of how to load custom data. You can download the score maps for LISA here.

Custom Target Positions

Instead of retrieving targets from AVS-Bench, you can override the target positions by setting the TARGETS_SET_DIR parameter in test_parameter.py. This loads maps whose target positions are marked with grey squares (value of 208). See an example in maps/example/gt_masks_val_with_tgts (only the targets, not the mask, are loaded). Note that you must set LOAD_AVS_BENCH to False to use this feature.
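
Extracting the grey-square target cells (value 208) from such a map is a one-line numpy lookup. The function name is illustrative, not from the repo:

```python
import numpy as np

def extract_target_positions(map_array, target_value=208):
    """Return (row, col) coordinates of all cells marked with the grey
    target value (208, per the maps described above). Illustration only."""
    rows, cols = np.where(map_array == target_value)
    return list(zip(rows.tolist(), cols.tolist()))
```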

Custom Planners

Instead of using our Reinforcement Learning (RL) or Information Surfing (IS) planners, you may use your own custom planner. To do so, create a script similar to test_worker.py or test_info_surfing.py, and interface it with the Ray framework in test_driver.py and the search environment in env.py.
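
A custom planner script would wrap a decision rule behind a small class, in the spirit of test_worker.py / test_info_surfing.py. The skeleton below is a hedged sketch: the method names and the env interface are assumptions, not the actual API of env.py:

```python
import numpy as np

class CustomPlanner:
    """Minimal custom-planner skeleton. The constructor argument and
    method signature are illustrative assumptions, not repo API."""

    def __init__(self, env=None):
        self.env = env  # would be the search environment from env.py

    def select_next_waypoint(self, belief_map, robot_position):
        # Greedy sketch: head toward the highest-scoring cell.
        r, c = np.unravel_index(np.argmax(belief_map), belief_map.shape)
        return (int(r), int(c))
```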

πŸ›©οΈ ROS2 Integration

Please stay tuned for our Search-TTA ROS2 integration with Gazebo simulation (for UAV / UGV).

✅ Acknowledgement

Our project is based on the following works:

We would like to thank the authors for their great work. Please refer to their papers for more details.

πŸ” References

If you intend to use our work in your research, please cite the following publication:

@inproceedings{tan2025searchtta,
    title={Search-TTA: A Multi-Modal Test-Time Adaptation Framework for Visual Search in the Wild},
    author={Tan, Derek Ming Siang and Shailesh, Shailesh and Liu, Boyang and Raj, Alok and Ang, Qi Xuan and Dai, Weiheng and Duhan, Tanishq and Chiun, Jimmy and Cao, Yuhong and Shkurti, Florian and Sartoretti, Guillaume Adrien},
    booktitle={Proceedings of The 9th Conference on Robot Learning},
    pages={2093--2120},
    year={2025},
    volume={305},
    publisher={PMLR}
}
