"Human egocentric video and real humanoid data play different roles",
body: "Training a humanoid foundation model faces a significant data scarcity bottleneck. Human egocentric videos are much cheaper to scale than real-world robotics data and provide large-scale, high-quality, and diverse supervision, while real humanoid data is needed to learn embodiment-specific whole-body control. In Ψ₀, we therefore combine large-scale human video with high-quality humanoid teleoperation data, but use them for different learning stages rather than forcing a single monolithic policy to model both distributions at once.",
},
architecture: {
+ kicker: "MODEL ARCHITECTURE",
+ title: "A triple-system foundation model for whole-body control",
src: withBase("/figures/architecture.svg"),
alt: "Ψ₀ architecture diagram",
caption:
"The high-level policy consists of a vision-language backbone and a multi-modal diffusion transformer action expert, while an RL-based tracking controller executes the lower-body commands for whole-body control.",
},
+ training: {
+ kicker: "STAGED TRAINING",
+ title: "Different learning goals for different stages",
+ body: "We present an efficient training recipe for learning humanoid loco-manipulation skills from both human videos and real robot data. The training procedure consists of three stages: first, pre-training the VLM backbone on large-scale, high-quality, and diverse human egocentric videos; second, post-training the flow-based action expert on cross-task real humanoid data; and third, fine-tuning the action expert on a small amount of in-domain task data, which enables rapid adaptation to new tasks.",
body: "Learning a long-horizon loco-manipulation task efficiently depends critically on the quality of the in-domain data used for fine-tuning. To address the limitations of prior systems, we propose a tailored teleoperation framework that explicitly separates upper-body pose tracking, dexterous manipulation, and locomotion commands. By using a small set of wearable trackers and separating locomotion from in-place whole-body actions, the framework enables single-operator whole-body teleoperation with improved locomotion stability across diverse task scenarios.",
},
{
src: withBase("/figures/rtc.png"),
alt: "Real-time chunking diagram",
+ kicker: "DEPLOYMENT AND RTC",
title: "Real-Time Chunking for Deployment",
body: "Humanoid robots require smooth and reactive control, particularly when executing long-horizon, dexterous manipulation tasks. However, our model comprises over 2.5 billion parameters, and a single forward pass takes approximately 160 ms. To enable smooth policy rollout despite this latency, we adopt training-time real-time chunking (RTC). With RTC, each action prediction is conditioned on the previously committed action chunk and outputs a consistent chunk of future actions, while inference runs asynchronously with execution to avoid interruptions between chunks.",
},
{
src: withBase("/figures/sim-data.png"),
alt: "Simulation and data generation figure",
+ kicker: "FAST EVALUATION IN SIMULATION",
title: "Fast Evaluation in Simulation",
body: "Although our primary goal is to deploy Ψ₀ in the real world, simulation is valuable for accelerating experimental iteration and enabling unified, standardized evaluation. We introduce a large-scale humanoid loco-manipulation benchmark in simulation with automated task generation across 50 indoor scenes, imported rigid objects, and randomized episode conditions, giving Ψ₀ a fast evaluation loop before the most expensive hardware experiments.",
},
{
src: withBase("/figures/psi-tasks.png"),
alt: "Eight real-world Ψ₀ benchmark tasks",
+ kicker: "REAL-WORLD TASK SETUP",
title: "Real-World Deployment",
body: "We evaluate Ψ₀ on eight diverse long-horizon dexterous loco-manipulation tasks involving manipulation, whole-body motion, and locomotion. The tasks range from simple interactions, such as pick-and-place, pushing, and wiping, to more challenging dexterous manipulations requiring precise finger-object coordination, including turning a faucet and pulling out a chip tray.",