@crheckman First draft of navigation paper audit ready for review.
| 1. **Action space normalization**: To make the model truly embodiment agnostic, ViNT used relative waypoints. | ||
| To account for the different speeds and sizes of the robots, the waypoints were normalized by scaling them by their top speed. | ||
| This way the model wouldn't have to be trained on specific robot types, and the outputs can then just be scaled via robot-specific controllers for deployment. | ||
| To be clear, the outputs are not low level controls - they are just positional offsets (e.g. dx, dy, dtheta) that are normalized, ultimately allowing for a single policy head to work with the varying dynamics of any robot type. |
How do they actually provide the low-level controls to move to the specific positional offsets?
They use a low-level velocity controller to track the commands. To clarify: the model outputs don't directly control the robot (they're just high-level waypoints) - an adapter is required to convert them to robot-specific controls, I believe.
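A rough sketch of the kind of adapter I mean (the function names, gains, and robot top speeds below are made up for illustration, not taken from the paper):

```python
import numpy as np

def denormalize_waypoint(waypoint_norm, top_speed, dt):
    """Rescale a normalized (dx, dy, dtheta) offset back into metric units.
    ViNT-style outputs are normalized by each robot's top speed, so the
    per-robot adapter has to undo that scaling before tracking."""
    dx, dy, dtheta = waypoint_norm
    scale = top_speed * dt            # distance travelable in one control step
    return np.array([dx * scale, dy * scale, dtheta])

def waypoint_to_velocity(waypoint_m, k_lin=1.0, k_ang=2.0, v_max=0.5, w_max=1.0):
    """Toy proportional controller: relative waypoint -> (v, w) command."""
    dx, dy, dtheta = waypoint_m
    heading_error = np.arctan2(dy, dx)                  # turn toward the waypoint
    v = np.clip(k_lin * np.hypot(dx, dy), 0.0, v_max)   # drive toward it
    w = np.clip(k_ang * heading_error + 0.5 * dtheta, -w_max, w_max)
    return v, w

# Same normalized policy output, two hypothetical robots with different top speeds
policy_output = np.array([0.6, 0.1, 0.0])
for name, top_speed in [("locobot", 0.5), ("jackal", 2.0)]:
    v, w = waypoint_to_velocity(denormalize_waypoint(policy_output, top_speed, dt=0.25))
    print(name, round(float(v), 3), round(float(w), 3))
```

The policy output is shared; only the scaling and the tracking controller change per robot.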
| Training data for every type of individual subcomponent of robot vision tasks is costly, so the authors wanted to define a unified model that can handle the general case, and can be minimally adapted to a diverse range of tasks. | ||
| Two features of their approach that stood out were: | ||
|
|
||
| 1. **Action space normalization**: To make the model truly embodiment agnostic, ViNT used relative waypoints. |
I'm not sure we're at the place where foundation models can be considered "truly embodiment agnostic". It seems like the waypoints are scaled for different speeds and sizes of robots, but are they scaled such that a drone and a ground robot are both able to use them? Maybe so, but I suspect not. That's why I feel like "truly embodiment agnostic" is too strong of a claim.
Yes - not a great wording choice. ViNT aims to be embodiment agnostic, but it's clearly trained on a subset of robots, not all robots. There are also other factors in how a robot might move compared to the traditional ground robots we usually think of: a drone could be the size of a (really) small quadruped, but the learned behaviors for each need to be different.
|
|
||
| With ViNT's strong generalization capabilities as a foundation model, the aspect of building a semantic topological map stood out the most. | ||
| As is the goal with most modern learning-based end-to-end approaches, it sort of mirrors how humans navigate. | ||
| When a person wants to navigate to some location, we don't analyze the locations of the rocks on the sidewalk in relation to the street and count the steps to reach a goal. |
ViNT focuses on geometric progress toward goals, but how might the framework incorporate higher-level semantic or social reasoning (e.g., avoiding human interaction zones) when multiple trajectories are physically feasible?
It currently does not do so explicitly - there are emergent behaviors that they mention, but these are largely learned from the training data itself. For example, if the datasets include robots navigating around humans, the policy might pick up on the idea that when it sees a human leg it should output waypoints a little farther away from the person. Also, ViNT doesn't tackle this the way NoMaD does, with a distribution over actions rather than just one action, so traversability isn't really encoded either.
| While ViNT does not explicitly set a hard line between when to explore and when to seek the goal (which will be explored more in NoMaD), its scoring system encoded into the topological map (diffusion-based subgoal generation in latent space; | ||
| some subgoals might not yet exist in the map but are scored highly for progress toward the goal) creates some form of emergent handling. | ||
|
|
||
| ### Architecture |
An architecture diagram would be a great addition here
crheckman left a comment
intro comments - 10min into reading period
| ## Introduction | ||
|
|
||
| Robotic navigation has traditionally been structured around a modular sense–think–act pipeline, where perception, mapping, planning, and control are explicitly defined and engineered. | ||
| While this paradigm has proven reliable, it relies heavily on discrete reasoning and hand-designed representations, requiring prior knowledge about the environment, the robot embodiment, and the task structure. |
I think this is in the right direction, but too broad. Not all representations are discrete (there exist continuous SLAM representations). They don't necessarily require knowledge about the environment because the representations themselves are geometric.
A stronger disagreement with these techniques would be (imo) that they are simply too metric. They are conditioned too strongly on a numerical/quantitative representation that isn't part of "natural" decision making, or certainly not contextual -- they boil everything down to a set of numbers that measure meters, not even a vector of numbers that represents a summary.
| This transition reduces the need for explicit intermediate representations, but introduces new challenges in generalization, temporal consistency, and physical executability. | ||
|
|
||
| This audit examines three representative approaches—ViNT, NoMaD, and Uni-NaVid—as incremental attempts to relax prior assumptions while preserving navigational competence. | ||
| Rather than proposing a single unified solution, these works decompose the problem of generalized robot navigation into progressively harder sub-problems. |
What is a "progressively harder subproblem"? We would need an example here to make this statement mean anything.
| NoMaD addresses the subsequent challenge of goal-directed behavior under complete environmental uncertainty, explicitly modeling the trade-off between exploration and exploitation during navigation. | ||
| Uni-NaVid further extends this paradigm by incorporating language grounding, allowing navigation policies to condition their search and execution strategies on semantic instructions across diverse datasets. | ||
|
|
||
| Viewed together, these approaches form a progressive trajectory: from semantic spatial representation (ViNT), to uncertainty-aware navigation dynamics (NoMaD), to language-conditioned decision making (Uni-NaVid). |
This introduction is not projecting the context we've established in the class within the first several lectures of mine. For instance, we have talked about how language-conditioned decision-making can be as simple as zero-shotting robot APIs to a frontier model and asking for codegen. What I want to see in this intro is how the field has rapidly moved beyond that, and what the key ingredients to that boundary expansion have been.
| The model itself serves as a strong baseline as a foundation model for robot-vision navigation tasks, but still requires more components to push it outside of the realm of just adding more training data. | ||
| On that note also, ViNT seems to be pretty computationally expensive (training, diffusion generation). | ||
| As mentioned in the paper, ViNT requires a "degree of structural similarity" - it is mostly geared towards 2D or planar movement, and focuses mostly on RGB images. | ||
| Regardless, the ViNT framework is an important step towards generalization of robot navigation tasks. |
One additional question: the normalized waypoint abstraction might rely heavily on stable low-level controllers - could differences in controller dynamics limit cross-embodiment robustness in real-world deployments?
The specific controller is written for the specific robot; they don't use a single controller for everything. The cross-embodiment part is in the waypoint generation - understanding, at a higher level, where a robot should go. Beyond that, there still needs to be an 'adapter' (like a one-size-fits-all shoe where you still need insoles... not a great example, but I think it helps explain what I mean).
|
|
||
| 2. **Zero-shot transfer**: Basically just deployed ViNT on 4 different robots, and recorded success rates. | ||
|
|
||
| 3. **Broader Generalization, Fine-Tuning, and Adaptations**: They're able to do adaptations to different task specifications via prompt tuning. |
What kind of task specifications? I'm assuming these are all navigation-related, and that's why prompt tuning works well?
I believe it was things like changing the goal images (satellite images vs. ground images), or training in different environments, like the CARLA urban driving simulations - they changed it so the decisions were just turn left / right / forward at the intersection, etc. To be honest I'm a little hazy on the prompt tuning part; I didn't fully understand how they did it, but I will go back and read.
| 2. **Semantic topological map**: No explicit map building was done either, but instead a topological graph was built, where waypoints were recorded on the graph as nodes (with corresponding image or feature - learned visual embeddings of visited locations), and the nodes were connected by temporal sequences of actions. | ||
| This allows for long horizon planning and exploration versus exploitation handling without any explicit SLAM. | ||
| While ViNT does not explicitly set a hard line between when to explore and when to seek the goal (which will be explored more in NoMaD), its scoring system encoded into the topological map (diffusion-based subgoal generation in latent space; | ||
| some subgoals might not yet exist in the map but are scored highly for progress toward the goal) creates some form of emergent handling. |
How does the model ensure these subgoals are physically reachable or collision-free?
Not explicitly done, but learned reasoning - physical reachability just comes from learning how to navigate certain environmental structures, and the same goes for staying collision-free. There isn't a clear explicit reasoning module per se that tackles this task, I think. While it works in the very general case, that was one of the weak points of ViNT.
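To make the graph idea concrete, here's a minimal sketch of what such a topological memory might look like (class and method names are mine, not the paper's; `networkx` is only used for illustration):

```python
import networkx as nx

class TopologicalMemory:
    """Sketch: nodes are visited observations stored as learned embeddings,
    edges are temporal adjacency from actually driving between them.
    There is no metric map and no SLAM anywhere."""

    def __init__(self):
        self.graph = nx.DiGraph()
        self.last_node = None

    def add_observation(self, node_id, embedding):
        self.graph.add_node(node_id, embedding=embedding)
        if self.last_node is not None:
            # consecutive observations get an edge because the robot executed
            # actions between them, so the transition is known to be feasible
            self.graph.add_edge(self.last_node, node_id)
        self.last_node = node_id

    def plan(self, start, goal_node):
        # long-horizon planning reduces to graph search over visited nodes
        return nx.shortest_path(self.graph, start, goal_node)
```

Reachability is only ever implied by edges the robot has already traversed (or by learned distance predictions), which is exactly why there's no explicit collision or feasibility check.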
| Uni-NaVid further extends this paradigm by incorporating language grounding, allowing navigation policies to condition their search and execution strategies on semantic instructions across diverse datasets. | ||
|
|
||
| Viewed together, these approaches form a progressive trajectory: from semantic spatial representation (ViNT), to uncertainty-aware navigation dynamics (NoMaD), to language-conditioned decision making (Uni-NaVid). | ||
| This audit argues that while each step improves generalization, the primary bottleneck shifts toward the interface between high-level semantic goals and temporally consistent, physically grounded control. |
This seems like a driving sentence for the audit. Is it possible to provide examples of what the bottleneck is? This would help ground my reading moving forward and help paint a picture of what's to come.
I think the bottleneck is that there is still a gap between learning general navigation and learning robot/scene context - the robot and the scene need to be learned jointly. What I mean is that, based on scene understanding and knowledge of how a robot moves, humans can explicitly reason about the actions it should take to get from A to B, such as going down a ramp instead of a staircase. While we can train a robot to learn preferable behavior, it does not yet have explicit reasoning for how to choose between the two semantically. This feeds into the control problem as well: knowing what controls are available to make it possible for a quadruped to go down a staircase. That level of fine-tuning is currently very difficult.
| Training data for every type of individual subcomponent of robot vision tasks is costly, so the authors wanted to define a unified model that can handle the general case, and can be minimally adapted to a diverse range of tasks. | ||
| Two features of their approach that stood out were: | ||
|
|
||
| 1. **Action space normalization**: To make the model truly embodiment agnostic, ViNT used relative waypoints. |
| 1. **Action space normalization**: To make the model truly embodiment agnostic, ViNT used relative waypoints. | ||
| To account for the different speeds and sizes of the robots, the waypoints were normalized by scaling them by their top speed. | ||
| This way the model wouldn't have to be trained on specific robot types, and the outputs can then just be scaled via robot-specific controllers for deployment. | ||
| To be clear, the outputs are not low level controls - they are just positional offsets (e.g. dx, dy, dtheta) that are normalized, ultimately allowing for a single policy head to work with the varying dynamics of any robot type. |
Discussing policy head before understanding system architecture
| This extends to other contextual reasoning points, such as traversability (we can choose shortest set of nodes or actions in the graph, but what about the specific quality of the trajectories themselves?). | ||
| The model itself serves as a strong baseline as a foundation model for robot-vision navigation tasks, but still requires more components to push it outside of the realm of just adding more training data. | ||
| On that note also, ViNT seems to be pretty computationally expensive (training, diffusion generation). | ||
| As mentioned in the paper, ViNT requires a "degree of structural similarity" - it is mostly geared towards 2D or planar movement, and focuses mostly on RGB images. |
Geared toward 2D or planar movement. This is likely due to the training process. Any chance this statement can be better motivated?
|
|
||
| **Data Assumptions** | ||
|
|
||
| NoMaD is trained entirely via supervised imitation learning using large-scale real-world datasets (GNM and SACSoN), totaling over 100 hours of robot navigation data. |
What is the ground truth in these datasets? This would help understand the type of data that is being used.
Also, how diverse are they? Do the authors even try to quantify/explain this dataset, or are there other papers that have analyzed it? Was it publicly released?
| Previously in ViNT, exploration versus exploitation was a behavior encoded within the graph generation and subgoal ranking. | ||
| Now with NoMaD, the low level collision avoidance and high level planning (exploration versus subgoal seeking) is defined in one model architecture. |
If you include visual depictions of the architectures, these statements would be more clearly motivated.
| 2. Like most other models, how well would this transfer if trained in office spaces, and deployed in forests? | ||
| There's still some element of scene understanding that needs to be included for the model to be truly generalizable. | ||
| Need priors for understanding generalized knowledge from internet? |
This is an interesting statement. I don't think you ever mentioned which environments NoMaD excels in. I'm sure this is related to the training data as well.
| Will end-to-end ever be? | ||
| Also, this limits human readability of internal logic. | ||
|
|
||
| 4. Interesting to note that Masked ViNT performed so poorly, suggesting that success of paper was not just goal masking, but other subtle architectural changes in tandem. |
I like this thought. It would be interesting to identify these subtle architectural changes, if possible, and list them as bullet points. This would help the reader wrap up NoMaD before moving on.
| Joint training consistently outperforms single-task training, supporting the claim that shared representations across navigation tasks are beneficial rather than harmful. | ||
|
|
||
| 3. **Real-World Deployment** | ||
| The model is deployed zero-shot on a Unitree Go2 quadruped robot and demonstrates stable, non-blocking navigation, including multi-stage "chain-of-navigation" commands. |
"Chain of navigation" commands is a new term in the audit. What does this mean? Chain of thought, reasoning, or causality imply increased human readability. What does this term mean?
Good summary of contents. I think the introduction should contain more of a fusion of the information, with a statement you can now make having read these papers. The final sentence of the intro points to this statement, but that's the only time it is referenced before diving into each of the individual papers.
| From a machine learning standpoint it is impressive, but is it architecturally sound? | ||
| Will end-to-end ever be? | ||
| Also, this limits human readability of internal logic. | ||
|
|
In real deployments, could differences in robot dynamics or control latency undermine the robustness of the learned action distribution?
| These candidates are generated via an image diffusion model (trained on same dataset as ViNT), and are scored by a goal-directed heuristic. | ||
| The transformer helps with the prediction for how well the actions progress to goal since it predicts current distance to goal. | ||
| In the full pipeline, the robot uses the diffusion model to generate subgoals from current observation, and then spatially grounds them via ViNT (goal encoder), and scores them using the heuristic. | ||
| This is notably computationally expensive. |
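A rough sketch of that loop (every component name below is a stand-in I made up; the scoring rule is my paraphrase of the goal-directed heuristic):

```python
def exploration_step(obs, goal_img, diffusion_model, vint, n_candidates=8):
    """One planning step of the pipeline described above, as I understand it.
    `diffusion_model.sample`, `vint.distance`, and `vint.waypoints` stand in
    for the subgoal image generator, the learned temporal-distance head, and
    the action head respectively."""
    candidates = diffusion_model.sample(obs, n=n_candidates)  # candidate subgoal images
    # goal-directed heuristic: prefer subgoals that are close to the current
    # observation AND predicted to be close to the final goal
    scores = [vint.distance(obs, c) + vint.distance(c, goal_img) for c in candidates]
    best = candidates[min(range(len(scores)), key=scores.__getitem__)]
    return vint.waypoints(obs, best)   # waypoints toward the chosen subgoal
```

Written out this way, the cost is also visible: every planning step pays for one image-diffusion sampling pass plus roughly 2 * n_candidates distance predictions.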
It might be useful to add more metrics around this. Maybe mention number of diffusion steps, latency per subgoal, etc.
| Uni-NaVid further enriches decision-making with language, but amplifies the semantic–motor gap, where linguistically valid goals may be dynamically infeasible. | ||
|
|
||
| Taken together, these failures suggest a complementary path forward: ViNT provides structural spatial priors, NoMaD contributes temporal persistence under uncertainty, and Uni-NaVid offers semantic intent. | ||
| A unified navigation system would need to integrate all three, while explicitly managing information decay at the interface between semantic reasoning and low-level control. |
This makes it seem like these are the only components necessary for robotic navigation with VLAs. Is this what you mean?
| 4. **Subgoal diffusion**: Basically just ViNT (diffusion generation of subgoals with navigation policy). | ||
| 5. **Random subgoals**: A variation of ViNT that instead of using diffusion, just randomly samples training data for candidate subgoal. | ||
|
|
||
| From experiments, it was shown that NoMaD performed as well as if not better than ViNT (subgoal diffusion) in both exploration and navigation, whilst using markedly fewer parameters (19M vs 335M). |
Did NoMaD happen to claim any robot-agnostic features like ViNT? Just curious
|
|
||
| Some things which could be expanded upon however, include: | ||
|
|
||
| 1. The probability distribution of the goal masking itself. |
Agreed that this should have been expanded on in the paper. It's not very clear why this was chosen or what impact it would have.
| Will end-to-end ever be? | ||
| Also, this limits human readability of internal logic. | ||
|
|
||
| 4. Interesting to note that Masked ViNT performed so poorly, suggesting that success of paper was not just goal masking, but other subtle architectural changes in tandem. |
I wonder if it was the way in which they implemented the goal masking that caused its poor performance, or was it due to the data they trained with?
|
|
||
| $$ a_t^{k-1} = \alpha\left(a_t^k - \gamma_k \epsilon_\theta\left(c_t, a_t^k, k\right)\right) + \mathcal{N}\left(0, \sigma_k^2I\right)$$ | ||
|
|
||
| After $K$ steps, the policy samples a collision-free, multimodal action sequence $a_t^0$. |
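A minimal sketch of that denoising loop in code (the schedules `alpha`, `gamma`, `sigma` are placeholders supplied by the caller, not the paper's actual values):

```python
import torch

def sample_action_sequence(eps_theta, c_t, K, alpha, gamma, sigma,
                           horizon=8, action_dim=2):
    """Sketch of the quoted update rule. `eps_theta(c_t, a_k, k)` is the
    trained noise-prediction network conditioned on the observation context
    c_t; alpha/gamma/sigma are noise-schedule sequences indexed by k."""
    a_k = torch.randn(horizon, action_dim)         # start from pure Gaussian noise
    for k in range(K, 0, -1):
        noise = torch.randn_like(a_k) if k > 1 else torch.zeros_like(a_k)
        a_k = alpha[k] * (a_k - gamma[k] * eps_theta(c_t, a_k, k)) + sigma[k] * noise
    return a_k                                     # a_t^0: the sampled action sequence
```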
It's not totally clear to me why NoMaD is good (maybe?) at generating collision-free behaviors. Why would we expect this merging of low- and high-level behaviors to be beneficial in collision avoidance?
|
|
||
| b. **Goal encoder** (denoted $\phi$ in the paper): Processes goal images into the sequence as spatial pairing between observations and goal. | ||
| Findings in paper mentioned that this step shouldn't just extract features from goal image, as this would lead to stuff in image sometimes being ignored (temporal inconsistencies). | ||
| Instead, this encoder encodes the difference between current observation and the goal - just stack observations and goal together, pass through EfficientNet, then flatten to get goal tokens, similar to observation encoder. |
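As described, the stacking happens before any features are extracted; a sketch of that idea (with assumed layer sizes, not the exact implementation):

```python
import torch
import torch.nn as nn
import torchvision

class GoalFusionEncoder(nn.Module):
    """Sketch: the current observation and the goal image are concatenated
    along the channel dimension and passed through one EfficientNet, so the
    resulting token describes obs->goal relative information rather than the
    goal image alone."""

    def __init__(self, token_dim=512):
        super().__init__()
        backbone = torchvision.models.efficientnet_b0(weights=None)
        # the first conv must accept 6 channels: 3 for the observation + 3 for the goal
        backbone.features[0][0] = nn.Conv2d(6, 32, kernel_size=3, stride=2,
                                            padding=1, bias=False)
        backbone.classifier = nn.Identity()
        self.backbone = backbone
        self.proj = nn.Linear(1280, token_dim)   # 1280 = EfficientNet-B0 feature size

    def forward(self, obs_img, goal_img):
        x = torch.cat([obs_img, goal_img], dim=1)   # (B, 6, H, W)
        return self.proj(self.backbone(x))          # one fused goal token per pair
```

So the difference is not computed explicitly; the stacked input just gives the network the chance to learn relative features.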
Just to clarify, are they taking the outputs that correspond to the goal tokens? Or how do they know that this is actually encoding the difference between the current observation and the goal - or is it more that they put them in together and hope the transformer learns the difference rather than something else?
|
|
||
| 2. An **online visual token merging mechanism** that directly addresses the scalability bottleneck of long-horizon video input to LLMs. | ||
|
|
||
| 3. A demonstration that **multi-task joint training produces positive synergy**, instead of degrading performance, when the observation and action spaces are fully unified. |
What is positive synergy specifically? Is that something that the authors measure quantitatively or is it just a "vibe"?
| 2. **Semantic topological map**: No explicit map building was done either, but instead a topological graph was built, where waypoints were recorded on the graph as nodes (with corresponding image or feature - learned visual embeddings of visited locations), and the nodes were connected by temporal sequences of actions. | ||
| This allows for long horizon planning and exploration versus exploitation handling without any explicit SLAM. | ||
| While ViNT does not explicitly set a hard line between when to explore and when to seek the goal (which will be explored more in NoMaD), its scoring system encoded into the topological map (diffusion-based subgoal generation in latent space; | ||
| some subgoals might not yet exist in the map but are scored highly for progress toward the goal) creates some form of emergent handling. |
What is "emergent handling"? Was this a design goal?
|
|
||
| ### Architecture | ||
|
|
||
| 1. **Inputs**: Current observation, past observations, and goal images are all encoded via a pair of EfficientNet-B0 encoders. |
Could use an architecture diagram here - this section provides the scaffolding for one of the most prescient criticisms of this paper (it flattens everything down and assumes attention will take care of feature resolution across observations).
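In lieu of a diagram, a rough code scaffold of the forward pass being described (dimensions, depths, and head sizes here are assumptions, not the paper's values):

```python
import torch
import torch.nn as nn

class ViNTStyleSketch(nn.Module):
    """Scaffold: per-frame features and a fused goal token are flattened into
    one token sequence, a transformer mixes them, and two small heads read
    out temporal distance and future waypoints."""

    def __init__(self, token_dim=512, horizon=5):
        super().__init__()
        self.obs_encoder = nn.Identity()    # stand-in for EfficientNet-B0 per frame
        self.goal_encoder = nn.Identity()   # stand-in for the fused obs+goal encoder
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.dist_head = nn.Linear(token_dim, 1)              # temporal distance to goal
        self.action_head = nn.Linear(token_dim, horizon * 2)  # future (dx, dy) waypoints

    def forward(self, obs_tokens, goal_token):
        # obs_tokens: (B, context+1, D) flattened per-frame features
        # goal_token: (B, 1, D)
        tokens = torch.cat([self.obs_encoder(obs_tokens),
                            self.goal_encoder(goal_token)], dim=1)
        summary = self.transformer(tokens).mean(dim=1)   # everything collapses into one summary
        return self.dist_head(summary), self.action_head(summary)
```

The single pooled summary is the sort of thing the criticism above is pointing at: all spatial and temporal structure gets flattened into tokens, and the attention layers are trusted to sort out what matters.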
|
|
||
| 3. **Diffusion**: Also previously mentioned, ViNT also produces subgoal candidates to break down the planning problem. | ||
| These candidates are generated via an image diffusion model (trained on same dataset as ViNT), and are scored by a goal-directed heuristic. | ||
| The transformer helps with the prediction for how well the actions progress to goal since it predicts current distance to goal. |
| ### Training | ||
|
|
||
| ViNT was trained on 100 hours of real-world trajectories, spanning 8 different robot platforms. | ||
| Training procedure is as follows: |
The following seven steps don't seem to compose a functioning "training procedure." Are these all inputs to the model? Which objectives are token reconstruction? Which are L2 costs?
| As is the goal with most modern learning-based end-to-end approaches, it sort of mirrors how humans navigate. | ||
| When a person wants to navigate to some location, we don't analyze the locations of the rocks on the sidewalk in relation to the street and count the steps to reach a goal. | ||
| We do to some extent from a low-level perspective, but this is more a subconscious process than a high-level process. | ||
| We more often are just given semantic cues to tell us whether we have reached locations along the path to the goal, as well as the goal (e.g. walk in some direction for 5-10 minutes until you see a grocery store - requires knowledge of grocery stores, what following roads looks like, etc.). |
I appreciate the "naive introspection" (NI) being performed here. I think it's generally a good first pass to ask ourselves "how do I think I do this" when it comes to observing gaps between robot capabilities and human ones.
The failure of NI emerges when we come up with an explanation that, under experiment, turns out to be incomplete, generally due to optimism about our own decision-making process. See e.g. the Meehl Paradox (Kahneman, Thinking Fast and Slow).
This deep dive would be stronger if you could dive deeper on NI and functionally what is truly happening as humans try to navigate. This could be part of the introduction, maybe a paragraph with a citation or two.
| The model relies purely on monocular RGB input, without depth, LiDAR, or explicit mapping. | ||
|
|
||
| - **Online Visual Token Merging (core design choice)**: | ||
| Instead of letting visual tokens grow unbounded over time, the model dynamically merges tokens based on temporal distance: |
How is this token merging accomplished?
| The model relies purely on monocular RGB input, without depth, LiDAR, or explicit mapping. | ||
|
|
||
| - **Online Visual Token Merging (core design choice)**: | ||
| Instead of letting visual tokens grow unbounded over time, the model dynamically merges tokens based on temporal distance: |
How exactly are these merged/compressed?
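The mechanism isn't spelled out in this section, but one plausible version of temporal-distance-based merging looks like this (all of the window sizes and pooling factors below are illustrative assumptions, not the paper's numbers):

```python
import torch

def merge_visual_tokens(frame_tokens, current_idx, recent_window=4,
                        recent_pool=4, old_pool=16):
    """Sketch: each frame arrives as an (N, D) grid of patch tokens. The
    current frame keeps every token, recently seen frames are averaged in
    small groups, and older frames are averaged in larger groups, so the
    total token count stays bounded for the LLM."""
    merged = []
    for idx, tokens in enumerate(frame_tokens):
        age = current_idx - idx
        if age == 0:
            merged.append(tokens)            # current frame: full resolution
            continue
        group = recent_pool if age <= recent_window else old_pool
        group = min(group, tokens.shape[0])
        usable = (tokens.shape[0] // group) * group
        pooled = tokens[:usable].view(-1, group, tokens.shape[1]).mean(dim=1)
        merged.append(pooled)                # older frame: fewer, coarser tokens
    return torch.cat(merged, dim=0)
```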
| ### Evaluation | ||
|
|
||
| NoMaD was evaluated in 6 different real-world and outdoor environments using a LoCoBot mobile platform. | ||
| The model was compared with 6 different baselines: |
All of these are essentially self-ablations or self-variations. Were there no external benchmarks?
| ### Summary - The Interesting and The Concerning | ||
|
|
||
| In comparison to ViNT, NoMaD presents a new approach to learning exploration and goal-seeking behaviors. | ||
| Instead of relying on a hierarchical graph-based approach, it's simply done by masking the goal during training and inference time to push the robot's adaptability to both scenarios. |
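A tiny sketch of that goal-masking idea (the masking probability and the zero-out mechanism are my assumptions, not the paper's exact implementation):

```python
import torch

def apply_goal_masking(goal_tokens, p_mask=0.5, explore=None):
    """Sketch: during training the goal tokens are dropped with probability
    p_mask, so one policy learns both goal-conditioned and undirected
    (exploratory) behavior. At deployment the mask is set explicitly:
    explore=True hides the goal, explore=False shows it."""
    if explore is None:                              # training: random masking
        masked = torch.rand(()).item() < p_mask
    else:                                            # inference: caller picks the mode
        masked = explore
    return torch.zeros_like(goal_tokens) if masked else goal_tokens
```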
Pull request for draft of paper audit on navigation unit. Our audit focuses on how VLAs and VLMs can be applied to real-world robot vision tasks (specifically navigation), as well as a deep dive into the impact of model design choices and system architecture on overall performance, robustness, and scalability to the real world. In this audit, we highlight 3 main papers: ViNT, NoMaD, and Uni-NaVid.