This repository has been tested on Python 3.10 and CUDA 12.8. We recommend using conda to create an isolated environment.
```bash
conda create -y -n internvla_a1 python=3.10
conda activate internvla_a1
pip install --upgrade pip
```

We use FFmpeg for video encoding/decoding and SVT-AV1 for efficient storage.
```bash
conda install -c conda-forge ffmpeg=7.1.1 svt-av1 -y
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 \
    --index-url https://download.pytorch.org/whl/cu128
pip install torchcodec numpy scipy transformers==4.57.1 mediapy loguru pytest omegaconf
pip install -e .
```

We replace the default implementations of several model modules (e.g., π0, InternVLA_A1_3B, InternVLA_A1_2B) to support custom architectures for robot learning:
```bash
TRANSFORMERS_DIR=${CONDA_PREFIX}/lib/python3.10/site-packages/transformers/
cp -r src/lerobot/policies/pi0/transformers_replace/models ${TRANSFORMERS_DIR}
cp -r src/lerobot/policies/InternVLA_A1_3B/transformers_replace/models ${TRANSFORMERS_DIR}
cp -r src/lerobot/policies/InternVLA_A1_2B/transformers_replace/models ${TRANSFORMERS_DIR}
```

Make sure the target directory exists; create it manually if it does not.
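If your environment differs (for example, a different Python version changes the `python3.10` segment of the path), the hard-coded `site-packages` path above may not exist. As a minimal sketch, you can resolve the install location programmatically instead:

```bash
# Resolve the transformers package directory from the active environment,
# avoiding the hard-coded python3.10 path
TRANSFORMERS_DIR=$(python -c "import transformers, os; print(os.path.dirname(transformers.__file__))")
echo "Copying patched modules into: ${TRANSFORMERS_DIR}"
```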
```bash
export HF_TOKEN=your_token          # for downloading hf models, tokenizers, or processors
export HF_HOME=path_to_huggingface  # default: ~/.cache/huggingface
ln -s ${HF_HOME}/lerobot data
```

This allows the repo to access datasets via `./data/`.
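To verify that the link points where you expect:

```bash
# The symlink should resolve to the lerobot cache under HF_HOME
ls -ld data
```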
This section provides a minimal end-to-end example for running InternVLA-A1: download a dataset → convert it to v3.0 format → fine-tune InternVLA-A1-3B on the A2D Pick-Pen task.
In this example, we use the A2D Pick-Pen task from the Genie-1 real-robot dataset.
```bash
hf download \
    InternRobotics/InternData-A1 \
    real/genie1/Put_the_pen_from_the_table_into_the_pen_holder.tar.gz \
    --repo-type dataset \
    --local-dir data
```

Extract the downloaded archive, clean up intermediate files, and rename the dataset to follow the A2D naming convention:
```bash
tar -xzf data/real/genie1/Put_the_pen_from_the_table_into_the_pen_holder.tar.gz -C data
rm -rf data/real
mkdir -p data/v21
mv data/set_0 data/v21/a2d_pick_pen
```

After this step, the dataset directory structure should be:
```
data/
└── v21/
    └── a2d_pick_pen/
        ├── data/
        ├── meta/
        └── videos/
```
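A quick listing can confirm the layout matches:

```bash
ls data/v21/a2d_pick_pen
# expected: data  meta  videos
```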
The original dataset is stored in LeRobot v2.1 format, while this project requires LeRobot v3.0. Run the following command to convert the dataset:
```bash
python src/lerobot/datasets/v30/convert_my_dataset_v21_to_v30.py \
    --old-repo-id v21/a2d_pick_pen \
    --new-repo-id v30/a2d_pick_pen
```

After conversion, the dataset will be available at:

```
data/v30/a2d_pick_pen/
```
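As an optional sanity check (assuming the converted dataset keeps LeRobot's standard `meta/info.json`), you can confirm the recorded codebase version:

```bash
# A v3.0 dataset should record the new codebase version in its metadata
grep codebase_version data/v30/a2d_pick_pen/meta/info.json
```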
This project fine-tunes policies using relative (delta) actions. Therefore, you must compute per-dataset normalization statistics (e.g., mean/std) for the action stream before training.
Run the following command to compute statistics for v30/a2d_pick_pen:
```bash
python util_scripts/compute_norm_stats_single.py \
    --action_mode delta \
    --chunk_size 50 \
    --repo_id v30/a2d_pick_pen
```

This script writes a `stats.json` file to `${HF_HOME}/lerobot/stats/delta/v30/a2d_pick_pen/stats.json`.
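To confirm the statistics were written (falling back to the default cache location if `HF_HOME` is unset):

```bash
cat "${HF_HOME:-$HOME/.cache/huggingface}/lerobot/stats/delta/v30/a2d_pick_pen/stats.json"
```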
```bash
bash launch/internvla_a1_3b_finetune.sh v30/a2d_pick_pen
```

Before running `launch/internvla_a1_3b_finetune.sh`, make sure to replace the environment variables inside the script with your own settings (see the sketch after this list), including but not limited to:

- `HF_HOME`
- `WANDB_API_KEY`
- `CONDA_ROOT`
- CUDA / GPU-related environment variables
- Paths to your local dataset and output directories
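For illustration only, the kinds of values involved (all placeholders; the authoritative variable names live in the script itself):

```bash
# Placeholder values: adapt to your machine before launching
export HF_HOME=/path/to/huggingface
export WANDB_API_KEY=your_wandb_api_key
export CONDA_ROOT=/path/to/miniconda3
export CUDA_VISIBLE_DEVICES=0,1,2,3   # GPUs used for fine-tuning
```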
- Release InternVLA-A1-3B
- Release InternVLA-A1-2B
- Release guidelines for large-scale dataset pretraining
All the code within this repo is licensed under CC BY-NC-SA 4.0. Please consider citing our project if it helps your research.
```bibtex
@article{contributors2026internvla_a1,
  title={InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation},
  author={InternVLA-A1 contributors},
  journal={arXiv preprint arXiv:2601.02456},
  year={2026}
}
```