TFVT-HRI

TransFormer with Visual Tokens for Human-Robot Interaction (TFVT-HRI).

@misc{xue2020proactive,
      title={Proactive Interaction Framework for Intelligent Social Receptionist Robots},
      author={Yang Xue and Fan Wang and Hao Tian and Min Zhao and Jiangyong Li and Haiqing Pan and Yueqiang Dong},
      year={2020},
      eprint={2012.04832},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2012.04832}
}

Preparation

sh scripts/download_pretrain_models.sh
sh tools/darknet_to_paddle.sh

Collecting Videos

Place the collected video clips in the folder data/clips as shown below, then preprocess them with multi-object tracking by running the commands that follow:

# in data/clips

video_1.mp4
video_2.mp4
...

# Assuming that we run 2 workers
python scripts/collect_v2_data.py -w 2 -c 1 -d data/clips &
python scripts/collect_v2_data.py -w 2 -c 2 -d data/clips &

# For more help information
python scripts/collect_v2_data.py --help
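
The two invocations above differ only in the value passed to -c. As a minimal sketch, the same pattern could be generated for any worker count from Python. It assumes that -w is the total number of workers and -c is the 1-based index of the current worker, which is only inferred from the example commands (check --help to confirm). The helper names build_worker_commands and launch_workers are hypothetical and not part of the repository.

```python
import subprocess

# Path taken from the commands above.
SCRIPT = "scripts/collect_v2_data.py"


def build_worker_commands(num_workers, clips_dir, python="python"):
    """Return one argv list per worker, mirroring the commands above.

    Assumption: -w is the total worker count and -c is the 1-based
    index of this worker (inferred from the example, not verified).
    """
    return [
        [python, SCRIPT, "-w", str(num_workers), "-c", str(index), "-d", clips_dir]
        for index in range(1, num_workers + 1)
    ]


def launch_workers(num_workers, clips_dir):
    """Start all workers in parallel, wait for them, and return their exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in build_worker_commands(num_workers, clips_dir)]
    return [proc.wait() for proc in procs]
```

Calling launch_workers(2, "data/clips") from the repository root would start both workers. Unlike the shell commands with a trailing &, it blocks until every worker has exited.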

Note that scripts/collect_v2_data.py spawns several workers to speed up preprocessing. Once it finishes, your clips folder will look like this:

# in data/clips

video_1.mp4
video_1_track.mp4
video_1_states.pkl
video_2.mp4
video_2_track.mp4
video_2_states.pkl
...

Note: to limit the errors that accumulate in multi-object tracking, keep each video clip short, around several minutes at most.
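
Before moving on to annotation, it can help to confirm that every clip was actually preprocessed. The following is a minimal sketch, assuming the naming shown in the listing above (NAME_track.mp4 and NAME_states.pkl next to NAME.mp4). The function find_unprocessed is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def find_unprocessed(clips_dir):
    """Map each raw clip to the expected output files that are missing.

    Assumption: outputs are named <stem>_track.mp4 and <stem>_states.pkl,
    as in the folder listing above. Files ending in _track.mp4 are treated
    as outputs, so a raw clip whose name ends in _track would be skipped.
    """
    clips_dir = Path(clips_dir)
    missing = {}
    for video in sorted(clips_dir.glob("*.mp4")):
        if video.stem.endswith("_track"):
            continue  # tracking output, not a raw clip
        expected = [f"{video.stem}_track.mp4", f"{video.stem}_states.pkl"]
        absent = [name for name in expected if not (clips_dir / name).exists()]
        if absent:
            missing[video.name] = absent
    return missing
```

An empty result from find_unprocessed("data/clips") means every clip has both output files.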

Annotation

We developed a web-based annotation platform. Start its server by running:

sh scripts/run_anno_platform.sh

Then open index.html, load the video, mark the relevant timestamps by clicking "add annotation", and fill in the corresponding multi-modal actions.

Next, click the "save" button to download a txt file whose name is prefixed with the video filename. Finally, move the downloaded files to the folder data/annos.

Note: for video clips that serve entirely as negative examples, save an empty txt file; otherwise the video will be ignored.
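
Because a clip without a txt file is ignored, it may be worth listing the clips that still lack one. The sketch below assumes only what is stated above: annotation files live in data/annos, end in .txt, and start with the video filename (matched here by its stem, without the extension). The function clips_without_annotation is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def clips_without_annotation(clips_dir, annos_dir):
    """List raw clips for which no annotation txt file exists yet.

    Assumption: an annotation file name starts with the video's stem.
    Prefix matching is loose: a file for video_10 also matches video_1,
    so treat the result as a hint rather than a guarantee.
    """
    anno_names = [p.name for p in Path(annos_dir).glob("*.txt")]
    pending = []
    for video in sorted(Path(clips_dir).glob("*.mp4")):
        if video.stem.endswith("_track"):
            continue  # tracking output, not a raw clip
        if not any(name.startswith(video.stem) for name in anno_names):
            pending.append(video.name)
    return pending
```

Each clip returned by clips_without_annotation("data/clips", "data/annos") still needs either annotations or an empty txt file.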

Generating Datasets

After collecting and annotating the raw data, split it and generate datasets that the dataloader can use.

Step I: create the initial representation of the multi-modal actions.

python scripts/collect_act_emb.py -ad data/annos

Step II: split positive examples and sample negative examples.

python scripts/prepare_dataset.py -dv ds -ad data/annos -vd data/clips
python scripts/prepare_dataset.py -dv ds_decord

Training the Model

sh scripts/attn_model.sh

Deploying the Model

First, use scripts/save_infer_model_params.py to export a Paddle inference model.

# Assuming the trained model is saved at 'saved_models/attn/epoch_10'
python scripts/save_infer_model_params.py saved_models/attn/epoch_10 \
    jetson/attn data/raw_wae/wae_lst.pkl visual_token

Second, set up the Jetson environment following jetson/Jetson_INSTALL.md.

Third, configure the variables in jetson/run.sh, then run sh run.sh to compile and run jetson/infer_v3.cpp. This starts a gRPC server that accepts requests as defined in jetson/proactive_greeting.proto.
