SVLR: Scalable, Training-Free Visual Language Robotics: a modular multi-model framework for consumer-grade GPUs
Installation | Getting Started | How to add a new robot and new actions | How to add new AI models | Project Website | Citation
- [2024-09-01] Initial release
A modular framework for controlling robots using visual and language inputs, based on a multi-model approach.
It utilizes a Visual Language Model (VLM), zero-shot image segmentation, a Large Language Model (LLM), and a sentence similarity model to process images and instructions.
# Create and activate conda environment
conda create -n svlr python=3.10 -y
conda activate svlr
# Install PyTorch. Below is a sample command to do this, but you should check the following link
# to find installation instructions that are specific to your compute platform:
# https://pytorch.org/get-started/locally/
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
# Clone and install the svlr repo
git clone https://github.com/bastien-muraccioli/svlr.git
cd svlr
pip install -r requirements.txt
This mode was initially made for debugging purposes, but it can be used to test the framework without a robot.
At the end, you will see the image with the detected objects and the predicted actions.
# It will run the SVLR framework with test.png in the pictures folder
python main.py --show_image --simulation
# You can also specify a custom image: put your image in the pictures folder and run:
python main.py --show_image --simulation --simulation_image_file your_image.png
This mode will allow you to control the UR10 robot with the SVLR framework.
- ROS Noetic with the UR10 controller. A custom controller can be used, as long as it can receive data from the SVLR framework.
- UR10 robot + camera + gripper (Robotiq 2F-140). SVLR is adaptable to any robot, gripper, and camera, but you will need to create a custom controller.
- A calibrated camera: if you use a USB camera, you can use the logicool ROS package and follow its instructions to calibrate the camera. At the end, save the calibration matrix usb_cam.yaml into the svlr root folder and rename it to calibration.yaml.
- In svlr/actions/UR10_action.json, you need to specify:
  - the init_pose, in end-effector coordinates;
  - the eye_to_hand dx and dy, which are the offsets between the camera and the end effector;
  - the eye_to_hand depth, which is the distance between the camera and your setup in the init pose (if you are using a table, it is the distance between the camera and the table). An illustrative sketch of how these values are used is given after this list.
- In svlr/actions/UR10_pick_place.py, you need to specify the zmin at which your robot can reach the objects on the table.
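For illustration only, the sketch below shows how a detected pixel could be mapped to end-effector coordinates using the calibration matrix and the eye_to_hand values (dx, dy, depth). The YAML layout, function name, and sign/axis conventions are assumptions; the conversion actually used by SVLR is implemented in the UR10 action programs.
# Illustrative sketch only: the exact conversion used by SVLR lives in the UR10
# action programs; the YAML layout and axis conventions below are assumptions.
import yaml

def pixel_to_end_effector(u, v, dx, dy, depth, calib_path="calibration.yaml"):
    """Map an image pixel (u, v) to (x, y) in end-effector coordinates."""
    with open(calib_path) as f:
        calib = yaml.safe_load(f)
    # Assumes the standard ROS calibration layout:
    # camera_matrix.data = [fx, 0, cx, 0, fy, cy, 0, 0, 1]
    k = calib["camera_matrix"]["data"]
    fx, cx, fy, cy = k[0], k[2], k[4], k[5]
    # Back-project the pixel at the known working depth (camera frame).
    x_cam = (u - cx) * depth / fx
    y_cam = (v - cy) * depth / fy
    # Shift by the fixed camera-to-end-effector offsets measured at the init pose.
    return x_cam + dx, y_cam + dy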
# --show_image displays the camera image; this argument is optional
python main.py --show_image
Below is a list of arguments you can use when running main.py to control the robot:
- Robot and Server Information:
  - --robot_name (str, default: "UR10"): Specifies the name of the robot.
  - --server (str, default: '127.0.0.1'): Sets the robot server's IP address.
  - --port (int, default: 65500): Defines the port number for the robot server.
  - --buffer (int, default: 1024): Determines the buffer size for the server.
- Camera Settings:
  - --camera_topic (str, default: ""): Sets the camera image topic.
  - --camera_device (str, default: '/dev/video2'): Specifies the camera device path.
  - --camera_width (int, default: 640): Sets the width of the camera feed.
  - --camera_height (int, default: 480): Sets the height of the camera feed.
- Large Language Model (LLM) Configuration:
  - --llm_name (str, default: 'microsoft/Phi-3-mini-4k-instruct'): Name of the LLM to be used.
  - --llm_provider (str, default: 'HuggingFace'): LLM provider (HuggingFace or OpenAI).
  - --llm_temperature (float, default: 0.1): Sets the LLM temperature (value between 0.1 and 1.0).
  - --llm_is_chat (flag): Indicates if the LLM is a chat model.
- Simulation Mode:
  - --simulation (flag): Runs the robot in simulation mode.
  - --simulation_image_file (str, default: 'test.png'): Specifies the image file to use in simulation mode.
- Image Handling:
  - --show_image (flag): Displays the captured image.
  - --save_image (flag): Saves the captured image.
High-level overview of the repository file tree:
- actions/ - JSON files describing the robots and their associated action program files.
- pictures/ - camera images, generated plots, and segmentation predictions.
- similarity_model/ - all-MiniLM-L6-v2 files for sentence similarity.
- src/ - main source code for the SVLR framework.
- tools/ - tools for the SVLR framework.
- calibration.yaml - camera calibration matrix.
- control_loop.py - main control loop for the SVLR framework.
- llm_prompt.json - JSON file with the LLM system prompt templates.
- main.py - main file to run the SVLR framework.
- requirements.txt - Python dependencies.
- LICENSE - all code is made available under the MIT License.
- README.md - you are here!
In this section, we will explain how to add a new robot and new actions to the SVLR framework.
- Create a new JSON file in the actions/ folder, named {robot_name}_action.json.
- Add the robot description in the JSON file, following the existing format:
{
"robot_name": "robot's name",
"description": "robot's description",
"init_pose": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
"eye_to_hand": {
"dx": 0.0,
"dy": 0.0,
"depth": 0.0
}
}
- You can also add more components that can be used in the actions. For example, we added the open and close gripper values to the UR10_action.json file like this:
{
"gripper": {
"open": 30,
"close": 220
}
}
- Your robotic controller will need to receive the actions generated by the SVLR framework. The SVLR framework sends a list of whatever your actions return. However, we recommend returning a list of dicts with the robot control information. In the case of the UR10, we return the end-effector position and the gripper value as follows:
[
{
"end_effector": [x, y, z, rx, ry, rz],
"gripper": gripper_value
},
]
This information is sent to the robot controller over a socket; you will need to specify the address and port of your controller with the --server and --port arguments when running main.py.
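For illustration, here is a minimal sketch of a controller-side receiver. It assumes, purely for the example, that the payload is a JSON-encoded list received in a single recv call, and it uses placeholder move_robot and set_gripper helpers; check the SVLR source for the exact wire format.
# Minimal controller-side receiver sketch; the JSON wire format, single-recv
# framing, and the two placeholder helpers are assumptions for illustration.
import json
import socket

HOST, PORT, BUFFER = "0.0.0.0", 65500, 1024  # match --server, --port and --buffer

def move_robot(pose):
    print("moving end effector to", pose)  # replace with your robot driver call

def set_gripper(value):
    print("setting gripper to", value)     # replace with your gripper driver call

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
    srv.bind((HOST, PORT))
    srv.listen(1)
    conn, _addr = srv.accept()
    with conn:
        data = conn.recv(BUFFER)
        for step in json.loads(data.decode("utf-8")):
            # e.g. {"end_effector": [x, y, z, rx, ry, rz], "gripper": value}
            move_robot(step["end_effector"])
            set_gripper(step["gripper"])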
To add a new action to the SVLR framework, you need to modify the {robot_name}_action.json file and create a new Python file in the actions/ folder.
It is important to note that the action descriptions in the JSON file are used by the LLM to fulfill the user's request: the clearer the description, the better the LLM will understand what your action does.
Also, in the current implementation of SVLR, the parameters can only be the objects detected in the image. The LLM generates the action using the names of the detected objects, but your action program receives the positions of these objects, in end-effector coordinates, to execute the action. We recommend exploring the UR10 files to understand this process.
- In your {robot_name}_action.json file, add the new actions with the following format:
{
"actions":
[
{
"name": "action_name",
"program": "{robot_name}_{program_name}",
"description": "action description",
"parameters": [{
"type": "type",
"description": "[parameter description]",
"required": true
}]
}
]
}
- Create a new Python file in the actions/ folder, named {robot_name}_{program_name}.py. This file contains the program for the action. It only needs to return a list with the content that your controller requires (e.g., for the UR10, we return a list of dicts with the end-effector positions and gripper values). Returning a list lets the framework execute multiple actions in a row.
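As a purely hypothetical example (the file name, function name, and signature below are placeholders; mirror the UR10 programs for the interface the framework actually expects), a push action for a robot named myrobot could look like this:
# actions/myrobot_push.py -- hypothetical example; the entry-point name and
# signature must follow the UR10 programs, which define the real interface.

def myrobot_push(robot_config, object_position):
    """Push the detected object 5 cm along +x.

    object_position is the object's position in end-effector coordinates,
    as provided by the framework after segmentation.
    """
    x, y, z = object_position
    gripper_closed = robot_config["gripper"]["close"]
    # Return a list of steps so the framework can execute them in a row.
    return [
        {"end_effector": [x, y, z + 0.10, 0.0, 0.0, 0.0], "gripper": gripper_closed},
        {"end_effector": [x, y, z, 0.0, 0.0, 0.0], "gripper": gripper_closed},
        {"end_effector": [x + 0.05, y, z, 0.0, 0.0, 0.0], "gripper": gripper_closed},
    ]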
By default, the SVLR framework uses the following models from HuggingFace:
- VLM: OpenGVLab/Mini-InternVL-Chat-2B-V1-5
- LLM: microsoft/Phi-3-mini-4k-instruct
- Sentence Similarity: all-MiniLM-L6-v2
- Zero-Shot Image Segmentation: CIDAS/clipseg-rd64-refined
As lightweight open-source VLMs are recent, adding a new one can be a bit tricky. However, as far as the SVLR framework is concerned, you only need to update the src/vlm.py file to use the new model.
To add a new LLM, you need to specify its system prompt in the llm_prompt.json file; otherwise it will use the default prompt, which is not recommended.
Then specify its name and provider (HuggingFace or OpenAI) with the --llm_name and --llm_provider arguments when running main.py. If you want to use a chat model, also pass the --llm_is_chat flag.
For OpenAI models, create a .env file in the root folder with the following content:
OPENAI_API_KEY=your_openai_api_key
If you want to add other LLM providers, such as Ollama, you will need to modify the src/llm.py file and install the necessary dependencies.
By default, LLMs from HuggingFace are run with 4-bit quantization; if you want to use full precision, you will need to modify the src/llm.py file.
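For reference, 4-bit loading with the transformers and bitsandbytes libraries typically looks like the sketch below; this is a generic example, not a copy of src/llm.py.
# Generic 4-bit loading example with transformers + bitsandbytes;
# src/llm.py may differ in its exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "microsoft/Phi-3-mini-4k-instruct"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,  # drop this argument to load in full precision
    device_map="auto",
)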
In src/action.py (sentence similarity), you will need to replace the model_path variable with the name of the new model.
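The default model, all-MiniLM-L6-v2, is a sentence-transformers model; similarity scoring with such a model typically looks like the generic example below (not a copy of src/action.py).
# Generic sentence-similarity example with sentence-transformers;
# how src/action.py loads and scores the model may differ.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # or your new model's name/path
request = "pick up the red cube"
actions = ["pick and place an object", "open the gripper", "go to the init pose"]
scores = util.cos_sim(model.encode(request), model.encode(actions))[0]
print(actions[int(scores.argmax())], float(scores.max()))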
In src/perception.py (zero-shot image segmentation), you will need to replace the seg_model_name variable with the name of the new model.
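For context, zero-shot segmentation with the default CLIPSeg model looks roughly like the generic transformers example below; src/perception.py may wire it differently.
# Generic CLIPSeg zero-shot segmentation example via transformers;
# src/perception.py may differ in pre- and post-processing.
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

seg_model_name = "CIDAS/clipseg-rd64-refined"
processor = CLIPSegProcessor.from_pretrained(seg_model_name)
model = CLIPSegForImageSegmentation.from_pretrained(seg_model_name)

image = Image.open("pictures/test.png").convert("RGB")
prompts = ["a red cube", "a blue bowl"]  # object names proposed by the VLM
inputs = processor(text=prompts, images=[image] * len(prompts),
                   padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # one low-resolution mask per prompt
masks = torch.sigmoid(logits)        # per-pixel probabilities in [0, 1]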
If you find our work useful, please consider citing us!
@misc{samson2025scalabletrainingfreevisuallanguage,
title={Scalable, Training-Free Visual Language Robotics: A Modular Multi-Model Framework for Consumer-Grade GPUs},
author={Marie Samson and Bastien Muraccioli and Fumio Kanehiro},
year={2025},
eprint={2502.01071},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2502.01071},
}