VLA (2) SmolVLA, ACT#

This document explains how to use XLeRobot for:

  1. Training and running SmolVLA with a bimanual SO-101 setup and three-camera data collection

  2. Training and running ACT (Action Chunking with Transformers) policy

  3. Using VR Control for XLeRobot VR Demo GIF VR Demo GIF


1) Overview#

XLeRobot is a LeRobot-based setup that adds:

  • BiSO101Follower (bimanual follower arm)

  • BiSO101Leader (bimanual teleoperation/leader arm)

  • Independent control of left/right arms + synchronized bimanual operation

  • Three-camera recording configuration:

    • front_cam

    • hand_cam

    • side_cam image

References#


2) Demo Tasks (What SmolVLA Can Learn with ~20 Episodes)#

Demo 1 - Drawer + Pick + Place + Grasp (Bimanual)#

After training on ~20 episodes, XLeRobot can:

  1. Pull open a drawer

  2. Pick an object

  3. Place the object into the drawer

  4. Push the drawer in

VR Demo GIF

Key aspects:

  • One-shot grasp of the drawer handle (avoid jitter during data collection)

  • While the left arm pulls the drawer, the right arm must precisely grasp the object’s center

  • Accurate one-shot placement into the drawer and smooth “push-in” close

Demo 2 - Pencil Case Zipper (Fine Manipulation)#

After training on ~20 episodes, XLeRobot can:

  1. Grasp the zipper pull tab

  2. Grasp the pencil case handle and stabilize the case

  3. Pull the zipper tab to open the zipper smoothly

VR Demo GIF

Key difficulties:

  • The zipper pull is often in a top-down camera blind spot, requiring one-shot grasp (avoid re-grasp)

  • Maintain consistent pulling height to avoid lifting up/down (no “upward yank” or “downward drag”)


3) Hardware / Configuration Notes#

Camera Placement (Three Cameras)#

Recommended configuration:

  • front_cam Ă— 1

  • hand_cam Ă— 1

  • side_cam Ă— 1

Note: In practice, consistent camera placement and stable lighting are critical for learning stable manipulation.

Action Dimension Handling (Important)#

A bimanual SO-101 robot has 12 action dimensions:

  • 6 joints Ă— 2 arms = 12

SmolVLA automatically detects and handles action dimensions without manual configuration:

  • During training: 12-D → padded to 32-D (max_action_dim)

  • During inference: 32-D → cropped back to original 12-D

Conceptual code path:

# Training: 12D -> pad to 32D
actions = pad_vector(batch[ACTION], self.config.max_action_dim)

# Inference: 32D -> crop back to 12D
original_action_dim = self.config.action_feature.shape[0]  # auto-detected: 12
actions = actions[:, :, :original_action_dim]

Unlike other VLA models (e.g., xVLA) that may require manual action_mode configuration, SmolVLA’s dynamic padding supports any action space ≤ 32D.


4) Installation & Environment Setup (Linux)#

A. Install Miniconda (Example)#

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# Restart terminal, then verify
conda --version

B. Create & Activate Environment#

conda create -n lerobot python=3.10
conda activate lerobot

Note: Activate this environment every time you use LeRobot/XLeRobot:

conda activate lerobot

C. Install System Dependencies (FFmpeg)#

conda install -c conda-forge ffmpeg

D. Clone Repository & Install Dependencies#

git clone https://github.com/kahowang/lerobot.git
cd lerobot

# Install LeRobot with Feetech motor support (required for SO-101)
pip install -e ".[feetech]"

# Install SmolVLA dependencies
pip install -e ".[smolvla]"

5) Data Collection (Three-Camera, Bimanual Teleop)#

Use lerobot-record to record bimanual demonstrations with three cameras.

Replace ${HF_USER} and your_dataset_name with your Hugging Face username and dataset name.

lerobot-record \
  --robot.type=bi_so101_follower \
  --robot.left_arm_port=/dev/ttyACM0 \
  --robot.right_arm_port=/dev/ttyACM1 \
  --robot.id=bimanual_follower \
  --robot.cameras='{
    "front_cam": {"type": "opencv", "index_or_path": 0, "width": 640, "height": 480, "fps": 30},
    "hand_cam": {"type": "opencv", "index_or_path": 1, "width": 640, "height": 480, "fps": 30},
    "side_cam": {"type": "opencv", "index_or_path": 2, "width": 640, "height": 480, "fps": 30}
  }' \
  --teleop.type=bi_so101_leader \
  --teleop.left_arm_port=/dev/ttyACM2 \
  --teleop.right_arm_port=/dev/ttyACM3 \
  --teleop.id=bimanual_leader \
  --dataset.repo_id=${HF_USER}/your_dataset_name \
  --dataset.single_task="Your task description here" \
  --dataset.num_episodes=50

Parameter Tips#

  1. Ports (/dev/ttyACM*) must match your actual USB device mapping

  2. dataset.single_task should be concise but specific (improves reproducibility)

  3. For quick iteration, start with 20 episodes and scale up


6) Training Policies#

6.1) Train SmolVLA#

lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=${HF_USER}/your_dataset_name \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/smolvla_three_cameras \
  --job_name=smolvla_training_three_cameras \
  --policy.device=cuda \
  --wandb.enable=true

Notes:

  1. --policy.path=lerobot/smolvla_base points to the SmolVLA base policy

  2. Adjust --steps based on dataset size and task complexity

  3. If you do not have a GPU, set --policy.device=cpu (training will be slow)

6.2) Train ACT (Action Chunking with Transformers)#

ACT is an imitation-learning method that predicts short action chunks instead of single steps. It often achieves high success rates with teleoperated data.

Basic Training Command:

python -m lerobot.scripts.train \
  --dataset.repo_id=${HF_USER}/your_dataset_name \
  --policy.type=act \
  --output_dir=outputs/train/act_bimanual_demo \
  --job_name=act_training_bimanual \
  --policy.device=cuda \
  --policy.repo_id=${HF_USER}/act_bimanual_demo \
  --wandb.enable=true

Alternative using lerobot-train:

lerobot-train \
  --policy.type=act \
  --dataset.repo_id=${HF_USER}/your_dataset_name \
  --output_dir=outputs/train/act_bimanual_demo \
  --job_name=act_training_bimanual \
  --policy.device=cuda \
  --wandb.enable=true

Training Notes:

  • Checkpoints are written to outputs/train/<job_name>/checkpoints/

  • ACT typically trains in a few hours on a single GPU (~80M parameters)

  • A checkpoint at 80k steps takes about 1h45 on an Nvidia A100

  • For Apple Silicon: use --policy.device=mps

Resume Training from Checkpoint:

python -m lerobot.scripts.train \
  --config_path=outputs/train/act_bimanual_demo/checkpoints/last/pretrained_model/train_config.json \
  --resume=true

7) Inference/Evaluation#

7.1) SmolVLA Inference#

Typical pattern to run policy inference and log evaluation episodes:

lerobot-record \
  --robot.type=bi_so101_follower \
  --robot.left_arm_port=/dev/ttyACM0 \
  --robot.right_arm_port=/dev/ttyACM1 \
  --robot.id=bimanual_follower \
  --robot.cameras='{
    "front_cam": {"type": "opencv", "index_or_path": 0, "width": 640, "height": 480, "fps": 30},
    "hand_cam": {"type": "opencv", "index_or_path": 1, "width": 640, "height": 480, "fps": 30},
    "side_cam": {"type": "opencv", "index_or_path": 2, "width": 640, "height": 480, "fps": 30}
  }' \
  --dataset.single_task="Your task description here" \
  --dataset.repo_id=${HF_USER}/eval_results \
  --dataset.num_episodes=10 \
  --policy.path=${HF_USER}/smolvla_three_cameras

Notes:

  1. --policy.path should point to your trained policy checkpoint / uploaded policy

  2. eval_results is a separate dataset repo for evaluation logs (recommended)

7.2) ACT Inference#

Using lerobot-record:

lerobot-record \
  --robot.type=bi_so101_follower \
  --robot.left_arm_port=/dev/ttyACM0 \
  --robot.right_arm_port=/dev/ttyACM1 \
  --robot.id=bimanual_follower \
  --robot.cameras='{
    "front_cam": {"type": "opencv", "index_or_path": 0, "width": 640, "height": 480, "fps": 30},
    "hand_cam": {"type": "opencv", "index_or_path": 1, "width": 640, "height": 480, "fps": 30},
    "side_cam": {"type": "opencv", "index_or_path": 2, "width": 640, "height": 480, "fps": 30}
  }' \
  --dataset.single_task="Your task description here" \
  --dataset.repo_id=${HF_USER}/eval_act_results \
  --dataset.num_episodes=10 \
  --policy.path=${HF_USER}/act_bimanual_demo

Using python -m lerobot.record:

python -m lerobot.record \
  --robot.type=bi_so101_follower \
  --robot.left_arm_port=/dev/ttyACM0 \
  --robot.right_arm_port=/dev/ttyACM1 \
  --dataset.repo_id=${HF_USER}/eval_act_results \
  --policy.path=${HF_USER}/act_bimanual_demo \
  --episodes=10

Notes:

  • The policy will execute autonomously on the robot

  • Evaluation results are saved to the specified dataset repo

  • Compare evaluation episodes with training demonstrations to assess performance


8) VR Control for XLeRobot#

VR Demo GIF

Robot: Rumi#

Rumi is a new-generation bimanual robot with a liftable chassis: Rumi

RUMI_HEAD_VR

Repositories#

  1. VR controller repo

  2. LeRobot repo (fork)

Features#

  1. VR → robot arm mapping

  2. Supports:

    • Inverse kinematics (IK) solving

    • Joint-space → motor command conversion

XLeRobot integrates ROS 2#

For VR control, follow the repository’s README to configure VR devices, ROS 2 nodes, and robot drivers.


VR Demo GIF

9) Practical Tips / Common Pitfalls#

Data Quality#

  1. Aim for one-shot grasps during demonstrations (avoid micro-corrections)

  2. Keep camera viewpoints consistent between recording and inference

  3. Maintain stable lighting and avoid motion blur

Bimanual Coordination#

For tasks like drawers:

  • Left arm: stable pulling trajectory

  • Right arm: precise pick/place with minimal hesitation

VR Demo GIF

Device Ports#

If ports change after reboot, consider using persistent udev rules to stabilize device naming.#

VR Demo GIF

10) Quick Checklist#

  • [ ] conda activate lerobot

  • [ ] Three cameras connected and indices correct (0/1/2)

  • [ ] Follower ports correct (/dev/ttyACM0, /dev/ttyACM1)

  • [ ] Leader ports correct (/dev/ttyACM2, /dev/ttyACM3)

  • [ ] Dataset repo ID and task description set

  • [ ] Training runs on correct device (cuda vs cpu)

  • [ ] Inference uses the correct trained policy path