A ROS 2 package for voice-controlled robot navigation and interaction, including voice feedback and environment awareness.
## Table of Contents

- Project Goal
- Main Components
- Architecture Notes
- Installation and Setup
- Usage
- Technologies Used
- Package Structure
- Example Workflow
- License
## Project Goal

The robot (Jetson Nano) should:
- Use LiDAR (and optionally a camera later) for SLAM and world model building.
- Be controllable via voice commands.
- Provide voice feedback using Text-to-Speech (TTS).
- Check whether a requested action is possible based on the world model and provide appropriate feedback if not.
- Execute navigation and driving commands accordingly.
## Main Components

| Node | Description |
|---|---|
| `asr_node.py` | Transcribes speech using Whisper |
| `llm_node.py` | Interprets commands using a finetuned Qwen3 0.6B |
| `control_node.py` | Checks feasibility of commands, plans motion |
| `tts_node.py` | Converts feedback text into speech using `tts_models/en/ljspeech/tacotron2-DDC_ph` |
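The four nodes form a simple pipeline (audio → text → command → motion/feedback). The sketch below is a pure-Python illustration of that data flow only; the `transcribe`, `interpret`, and `control` functions are hypothetical stand-ins for Whisper, the finetuned Qwen3 model, and the feasibility checker, not the real node code.

```python
# Illustrative sketch of the asr -> llm -> control pipeline.
# All three functions are hypothetical stand-ins, NOT the real models.

def transcribe(audio_chunk: bytes) -> str:
    # Stand-in for Whisper ASR.
    return "move forward"

def interpret(text: str) -> str:
    # Stand-in for the finetuned Qwen3 0.6B command interpreter.
    known = {"move forward": "MOVE FORWARD 1.0;"}
    return known.get(text.strip().lower(), "COMMAND NOT RECOGNIZED;")

def control(command: str) -> str:
    # Stand-in for the feasibility check in control_node.py.
    if command == "COMMAND NOT RECOGNIZED;":
        return "Sorry, I did not understand that."
    return f"Executing: {command}"

def pipeline(audio_chunk: bytes) -> str:
    return control(interpret(transcribe(audio_chunk)))

print(pipeline(b"\x00\x01"))  # Executing: MOVE FORWARD 1.0;
```

In the actual package, each arrow in this chain is a ROS 2 topic rather than a function call.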
## Architecture Notes

Due to software constraints on the Jetson Nano, inference can be delegated to a locally running FastAPI-based LLM service.
The Nano supports only CUDA 10.2. However, the transformers library requires Python ≥ 3.10, and the only PyTorch build with CUDA support for the Nano targets Python 3.6. Therefore, the local service must run with Python 3.10 and uses CPU-only inference. CUDA cannot be used in this configuration.
Refer to the robo_voice_control_llm_service repository for setup instructions.
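Delegating to the local service amounts to an HTTP POST with the transcribed text. The following is a minimal stdlib-only sketch; the endpoint path `/interpret` and the `{"input": ...}` / `{"output": ...}` JSON shapes are assumptions for illustration — check the robo_voice_control_llm_service repository for the real API.

```python
import json
import urllib.request

SERVICE_URL = "http://127.0.0.1:8000/interpret"  # assumed endpoint

def build_request(text: str, url: str = SERVICE_URL) -> urllib.request.Request:
    """Build the POST request for the local LLM service (payload shape assumed)."""
    payload = json.dumps({"input": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_response(body: bytes) -> str:
    """Extract the interpreted command from a JSON response body (shape assumed)."""
    return json.loads(body.decode("utf-8"))["output"]

# Example call (requires the service to be running):
# with urllib.request.urlopen(build_request("Go forward for 2.8 meters")) as r:
#     print(parse_response(r.read()))
```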
The asr_node was initially designed to run the ASR model (Moonshine) locally within the ROS node. However, on the Jetson Nano this is not viable because:

- The `transformers` version compatible with Python 3.8 is too old to support `MoonshineForConditionalGeneration`
- The required ASR model is not available under the current setup due to Python and CUDA constraints
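A runtime guard of the following shape can decide between local and remote ASR inference. `MoonshineForConditionalGeneration` is the real `transformers` class name mentioned above, but the backend labels and fallback logic are an illustrative assumption, not the actual asr_node code.

```python
import importlib.util

def asr_backend() -> str:
    """Pick an ASR backend at runtime.

    Tries to import MoonshineForConditionalGeneration from the installed
    transformers; falls back to a remote/Whisper backend when the local
    install is too old to provide it. Labels are illustrative.
    """
    if importlib.util.find_spec("transformers") is not None:
        try:
            from transformers import MoonshineForConditionalGeneration  # noqa: F401
            return "local-moonshine"
        except (ImportError, AttributeError):
            pass
    return "remote-whisper"

print(asr_backend())
```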
## Package Structure

```
robo_voice_control/
├── asr_node.py
├── control_node.py
├── llm_node.py
└── tts_node.py
```
## Example Workflow

1. User gives a voice command: "Move forward"
2. Microphone captures audio → Whisper transcribes it
3. LLM processes the text → interprets it as `"move_forward"`
4. Control node checks the SLAM-based world model:
   - If the action is possible → initiates driving
   - If not → responds via TTS: "That action is not possible."
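The feasibility check in step 4 can be pictured as a cell walk over the occupancy grid that SLAM produces. Everything in this sketch (grid layout, resolution, function name, robot heading) is an illustrative assumption, not the actual control_node implementation:

```python
FREE, OCCUPIED = 0, 1

def forward_path_clear(grid, row, col, distance_m, resolution_m=0.05):
    """Check whether the robot can move `distance_m` straight ahead.

    `grid` is a hypothetical occupancy grid (0 = free, 1 = occupied);
    the robot is assumed to face increasing column index. This is a
    sketch, not the real SLAM world-model query.
    """
    cells = round(distance_m / resolution_m)
    for step in range(1, cells + 1):
        c = col + step
        if c >= len(grid[row]) or grid[row][c] == OCCUPIED:
            return False
    return True

grid = [
    [0, 0, 0, 0, 1],  # obstacle 0.2 m ahead in this row
    [0, 0, 0, 0, 0],
]
print(forward_path_clear(grid, row=0, col=0, distance_m=0.15))  # True
print(forward_path_clear(grid, row=0, col=0, distance_m=0.25))  # False
```

When the check fails, the control node would publish the refusal text for the TTS node instead of a motion command.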
## Technologies Used

- ROS 2 (rclpy)
- Whisper for ASR (Automatic Speech Recognition)
- Finetuned Qwen3 0.6B for command interpretation
- TTS: `tts_models/en/ljspeech/tacotron2-DDC_ph`
- SLAM using LiDAR (SLAM Toolbox)
- SLAMTEC RPLIDAR ROS 2 package for LiDAR sensor integration
- Jetson platform for local inference
## Installation and Setup

Prerequisites:

- ROS 2 Humble Hawksbill
- Python 3.8+
- Ubuntu 22.04 LTS (recommended)
- Audio capture device (microphone)
- Audio output device (speakers/headphones)
```bash
# Install system dependencies
sudo apt update
sudo apt install -y python3-pip python3-dev portaudio19-dev
sudo apt install -y ros-humble-slam-toolbox
sudo apt install -y ros-humble-sound-play
```

Install the required Python packages:

```bash
pip install -r requirements.txt
```

Install rosdep (if not already installed):
```bash
sudo apt-get install python3-rosdep
sudo rosdep init
rosdep update
```

Install ROS dependencies:
```bash
rosdep install --from-paths src --ignore-src -r -y
```

Build the package:

```bash
colcon build --symlink-install
```

Alternatively, build with CUDA disabled:

```bash
colcon build --symlink-install --cmake-args -DGGML_CUDA=Off
```
Source the workspace:

```bash
source install/setup.bash
```

## Usage

Start the SLAM system:

```bash
ros2 launch robo_voice_control slam_launch.py
```

Start all voice control nodes:

```bash
ros2 launch robo_voice_control all_nodes_launch.py
```

### Audio Capturer

Topic: `/audio`
GitHub: audio_common
Start the audio capturer node:
```bash
ros2 run audio_common audio_capturer_node
```

### ASR Node

Publishes the last 10 seconds of transcribed audio.

Topic: `/asr/text`

Start the ASR node:

```bash
ros2 run robo_voice_control asr_node
```

### LLM Node

Interprets the ASR text from the topic `/asr/text` and converts it into commands.
Finetuned Qwen3 0.6B: Finetuned using this repository.
Example:
```json
{"input": "Go forward for 2.8 meters", "output": "MOVE FORWARD 2.8;"}
{"input": "Drive in a circle", "output": "COMMAND NOT RECOGNIZED;"}
```

Publishes: `/llm/command_interpretations`
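Before the control node can act on a flat command string like `MOVE FORWARD 2.8;`, it has to be parsed. A small sketch of such a parser follows; the grammar (`VERB [number] ';'`) is inferred from the two training examples above and may not match the model's full output format.

```python
def parse_command(raw: str):
    """Parse a command string like 'MOVE FORWARD 2.8;' into (verb, value).

    Returns None for unrecognized commands. The grammar is an
    assumption inferred from the example training pairs.
    """
    raw = raw.strip().rstrip(";").strip()
    if not raw or raw == "COMMAND NOT RECOGNIZED":
        return None
    parts = raw.split()
    # A trailing numeric token is treated as the command's argument.
    try:
        value = float(parts[-1])
        verb = " ".join(parts[:-1])
    except ValueError:
        value = None
        verb = " ".join(parts)
    return verb, value

print(parse_command("MOVE FORWARD 2.8;"))       # ('MOVE FORWARD', 2.8)
print(parse_command("COMMAND NOT RECOGNIZED;"))  # None
```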
Parameters:

- `model_path` → path to the LLM model, e.g., `"/media/psf/DATA SSD/LLMs/finetunes/qwen3_0.6B/checkpoint-310"`

Default:

```bash
ros2 run robo_voice_control llm_node
```

With the model parameter:

```bash
ros2 run robo_voice_control llm_node \
  --ros-args -p model_path:="/media/psf/DATA SSD/LLMs/finetunes/qwen3_0.6B/checkpoint-310"
```

### TTS Node

Install dependencies:
```bash
pip install TTS sounddevice numpy
sudo apt install ros-humble-sound-play
```

Start the TTS node:

```bash
ros2 run robo_voice_control tts_node
```

### LiDAR

Important: Transforms must be started before the SLAM Toolbox.
```bash
sudo chmod 777 /dev/ttyUSB1
ros2 launch sllidar_ros2 view_sllidar_a1_launch.py
```

Verify the TF tree is properly configured:

```bash
ros2 run tf2_tools view_frames
```

## License

This project is licensed under the MIT License; see the LICENSE file for details.
## Contributing

Contributions are welcome! Please feel free to submit a pull request.
## Acknowledgments

- OpenAI Whisper for ASR
- Qwen3 Model for command interpretation finetuned by Christopher Witzl
- TTS for text-to-speech synthesis
- SLAM Toolbox for mapping and localization