@crheckman First draft of navigation paper audit ready for review.
| 1. **Action space normalization**: To make the model truly embodiment agnostic, ViNT used relative waypoints. | ||
| To account for the different speeds and sizes of the robots, the waypoints were normalized by scaling them by their top speed. | ||
| This way the model wouldn't have to be trained on specific robot types, and the outputs can then just be scaled via robot-specific controllers for deployment. | ||
| To be clear, the outputs are not low level controls - they are just positional offsets (e.g. dx, dy, dtheta) that are normalized, ultimately allowing for a single policy head to work with the varying dynamics of any robot type. |
How do they actually provide the low-level controls to move to the specific positional offsets?
They use a low-level velocity controller to track the commands. To clarify: the model outputs don't directly control the robot (they're just high-level waypoints) - an adapter is required to convert them to robot-specific controls, I believe.
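A rough sketch of the kind of adapter I mean (the function names, gains, and robot top speeds below are made up for illustration, not taken from the paper):

```python
import numpy as np

def denormalize_waypoint(waypoint_norm, top_speed, dt):
    """Rescale a normalized (dx, dy, dtheta) offset back into metric units.
    ViNT-style outputs are normalized by each robot's top speed, so the
    per-robot adapter has to undo that scaling before tracking."""
    dx, dy, dtheta = waypoint_norm
    scale = top_speed * dt            # distance travelable in one control step
    return np.array([dx * scale, dy * scale, dtheta])

def waypoint_to_velocity(waypoint_m, k_lin=1.0, k_ang=2.0, v_max=0.5, w_max=1.0):
    """Toy proportional controller: relative waypoint -> (v, w) command."""
    dx, dy, dtheta = waypoint_m
    heading_error = np.arctan2(dy, dx)                  # turn toward the waypoint
    v = np.clip(k_lin * np.hypot(dx, dy), 0.0, v_max)   # drive toward it
    w = np.clip(k_ang * heading_error + 0.5 * dtheta, -w_max, w_max)
    return v, w

# Same normalized policy output, two hypothetical robots with different top speeds
policy_output = np.array([0.6, 0.1, 0.0])
for name, top_speed in [("locobot", 0.5), ("jackal", 2.0)]:
    v, w = waypoint_to_velocity(denormalize_waypoint(policy_output, top_speed, dt=0.25))
    print(name, round(float(v), 3), round(float(w), 3))
```

The policy output is shared; only the scaling and the tracking controller change per robot.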
| Training data for every type of individual subcomponent of robot vision tasks is costly, so the authors wanted to define a unified model that can handle the general case, and can be minimally adapted to a diverse range of tasks. | ||
| Two features of their approach that stood out were: | ||
|
|
||
| 1. **Action space normalization**: To make the model truly embodiment agnostic, ViNT used relative waypoints. |
I'm not sure we're at the place where foundation models can be considered "truly embodiment agnostic". It seems like the waypoints are scaled for different speeds and sizes of robots, but are they scaled such that a drone and a ground robot are both able to use them? Maybe so, but I suspect not. That's why I feel like "truly embodiment agnostic" is too strong of a claim.
Yes - not a great wording choice. ViNT aims to be embodiment agnostic, but it's clearly trained on a subset of robots, not all robots. There are also other factors in how a robot might move compared to the traditional ground robots we usually think of: a drone could be the size of a (really) small quadruped, but the learned behaviors for each need to be different.
|
|
||
| With ViNT's strong generalization capabilities as a foundation model, the aspect of building a semantic topological map stood out the most. | ||
| As is the goal with most modern learning-based end-to-end approaches, it sort of mirrors how humans navigate. | ||
| When a person wants to navigate to some location, we don't analyze the locations of the rocks on the sidewalk in relation to the street and count the steps to reach a goal. |
ViNT focuses on geometric progress toward goals, but how might the framework incorporate higher-level semantic or social reasoning (e.g., avoiding human interaction zones) when multiple trajectories are physically feasible?
It currently does not do so explicitly - there are emergent behaviors that they mention, but these are largely learned from the training data itself. For example, if the datasets include robots navigating around humans, the policy might pick up on the idea that when it sees a human leg it should output waypoints a little farther away from the person. Also, ViNT doesn't tackle this the way NoMaD does, with a distribution over actions rather than just one action, so traversability isn't really encoded either.
| While ViNT does not explicitly set a hard line between when to explore and when to seek the goal (which will be explored more in NoMaD), its scoring system encoded into the topological map (diffusion-based subgoal generation in latent space; | ||
| some subgoals might not yet exist in the map but are scored highly for progress toward the goal) creates some form of emergent handling. | ||
|
|
||
| ### Architecture |
An architecture diagram would be a great addition here
crheckman left a comment
intro comments - 10min into reading period
| ## Introduction | ||
|
|
||
| Robotic navigation has traditionally been structured around a modular sense–think–act pipeline, where perception, mapping, planning, and control are explicitly defined and engineered. | ||
| While this paradigm has proven reliable, it relies heavily on discrete reasoning and hand-designed representations, requiring prior knowledge about the environment, the robot embodiment, and the task structure. |
I think this is in the right direction, but too broad. Not all representations are discrete (there exist continuous SLAM representations). They don't necessarily require knowledge about the environment because the representations themselves are geometric.
A stronger disagreement with these techniques would be (imo) that they are simply too metric. They are conditioned too strongly on a numerical/quantitative representation that isn't part of "natural" decision making, or certainly not contextual -- they boil everything down to a set of numbers that measure meters, not even a vector of numbers that represents a summary.
| This transition reduces the need for explicit intermediate representations, but introduces new challenges in generalization, temporal consistency, and physical executability. | ||
|
|
||
| This audit examines three representative approaches—ViNT, NoMaD, and Uni-NaVid—as incremental attempts to relax prior assumptions while preserving navigational competence. | ||
| Rather than proposing a single unified solution, these works decompose the problem of generalized robot navigation into progressively harder sub-problems. |
What is a "progressively harder subproblem"? We would need an example here to make this statement mean anything.
| NoMaD addresses the subsequent challenge of goal-directed behavior under complete environmental uncertainty, explicitly modeling the trade-off between exploration and exploitation during navigation. | ||
| Uni-NaVid further extends this paradigm by incorporating language grounding, allowing navigation policies to condition their search and execution strategies on semantic instructions across diverse datasets. | ||
|
|
||
| Viewed together, these approaches form a progressive trajectory: from semantic spatial representation (ViNT), to uncertainty-aware navigation dynamics (NoMaD), to language-conditioned decision making (Uni-NaVid). |
This introduction is not projecting the context we've established in the class within the first several lectures of mine. For instance, we have talked about how language-conditioned decision-making can be as simple as zero-shotting robot APIs to a frontier model and asking for codegen. What I want to see in this intro is how the field has rapidly moved beyond that, and what the key ingredients to that boundary expansion have been.
| The model itself serves as a strong baseline as a foundation model for robot-vision navigation tasks, but still requires more components to push it outside of the realm of just adding more training data. | ||
| On that note also, ViNT seems to be pretty computationally expensive (training, diffusion generation). | ||
| As mentioned in the paper, ViNT requires a "degree of structural similarity" - it is mostly geared towards 2D or planar movement, and focuses mostly on RGB images. | ||
| Regardless, the ViNT framework is an important step towards generalization of robot navigation tasks. |
One additional question: the normalized waypoint abstraction might rely heavily on stable low-level controllers - could differences in controller dynamics limit cross-embodiment robustness in real-world deployments?
The specific controller is written for the specific robot; they don't use a single controller for everything. The cross-embodiment part is in the waypoint generation - understanding, at a higher level, where a robot should go. Beyond that, there still needs to be an 'adapter' (like a one-size-fits-all shoe where you still need insoles... not a great example, but I think it helps explain what I mean).
|
|
||
| 2. **Zero-shot transfer**: Basically just deployed ViNT on 4 different robots, and recorded success rates. | ||
|
|
||
| 3. **Broader Generalization, Fine-Tuning, and Adaptations**: They're able to do adaptations to different task specifications via prompt tuning. |
What kind of task specifications? I'm assuming these are all navigation-related, and that's why prompt tuning works well?
I believe it was things like changing the goal images (satellite images vs. ground images), or training in different environments, like the CARLA urban driving simulations - they changed it so the decisions were just turn left / right / forward at the intersection, etc. To be honest I'm a little hazy on the prompt tuning part; I didn't fully understand how they did it, but I will go back and read.
| 2. **Semantic topological map**: No explicit map building was done either, but instead a topological graph was built, where waypoints were recorded on the graph as nodes (with corresponding image or feature - learned visual embeddings of visited locations), and the nodes were connected by temporal sequences of actions. | ||
| This allows for long horizon planning and exploration versus exploitation handling without any explicit SLAM. | ||
| While ViNT does not explicitly set a hard line between when to explore and when to seek the goal (which will be explored more in NoMaD), its scoring system encoded into the topological map (diffusion-based subgoal generation in latent space; | ||
| some subgoals might not yet exist in the map but are scored highly for progress toward the goal) creates some form of emergent handling. |
How does the model ensure these subgoals are physically reachable or collision-free?
Not explicitly done, but learned reasoning - physical reachability just comes from learning how to navigate certain environmental structures, and the same goes for staying collision-free. There isn't a clear explicit reasoning module per se that tackles this task, I think. While it works in the very general case, that was one of the weak points of ViNT.
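To make the graph idea concrete, here's a minimal sketch of what such a topological memory might look like (class and method names are mine, not the paper's; `networkx` is only used for illustration):

```python
import networkx as nx

class TopologicalMemory:
    """Sketch: nodes are visited observations stored as learned embeddings,
    edges are temporal adjacency from actually driving between them.
    There is no metric map and no SLAM anywhere."""

    def __init__(self):
        self.graph = nx.DiGraph()
        self.last_node = None

    def add_observation(self, node_id, embedding):
        self.graph.add_node(node_id, embedding=embedding)
        if self.last_node is not None:
            # consecutive observations get an edge because the robot executed
            # actions between them, so the transition is known to be feasible
            self.graph.add_edge(self.last_node, node_id)
        self.last_node = node_id

    def plan(self, start, goal_node):
        # long-horizon planning reduces to graph search over visited nodes
        return nx.shortest_path(self.graph, start, goal_node)
```

Reachability is only ever implied by edges the robot has already traversed (or by learned distance predictions), which is exactly why there's no explicit collision or feasibility check.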
| Uni-NaVid further extends this paradigm by incorporating language grounding, allowing navigation policies to condition their search and execution strategies on semantic instructions across diverse datasets. | ||
|
|
||
| Viewed together, these approaches form a progressive trajectory: from semantic spatial representation (ViNT), to uncertainty-aware navigation dynamics (NoMaD), to language-conditioned decision making (Uni-NaVid). | ||
| This audit argues that while each step improves generalization, the primary bottleneck shifts toward the interface between high-level semantic goals and temporally consistent, physically grounded control. |
This seems like a driving sentence for the audit. Is it possible to provide examples of what the bottleneck is? This would help ground my reading moving forward and help paint a picture of what's to come.
I think the bottleneck is that there is still a gap between learning general navigation and learning robot/scene context - the robot and the scene need to be learned jointly. What I mean is that, based on scene understanding and knowledge of how a robot moves, humans can explicitly reason about the actions it should take to get from A to B, such as going down a ramp instead of a staircase. While we can train a robot to learn preferable behavior, it does not yet have explicit reasoning for how to choose between the two semantically. This feeds into the control problem as well: knowing what controls are available to make it possible for a quadruped to go down a staircase. That level of fine-tuning is currently very difficult.
| Training data for every type of individual subcomponent of robot vision tasks is costly, so the authors wanted to define a unified model that can handle the general case, and can be minimally adapted to a diverse range of tasks. | ||
| Two features of their approach that stood out were: | ||
|
|
||
| 1. **Action space normalization**: To make the model truly embodiment agnostic, ViNT used relative waypoints. |
| 1. **Action space normalization**: To make the model truly embodiment agnostic, ViNT used relative waypoints. | ||
| To account for the different speeds and sizes of the robots, the waypoints were normalized by scaling them by their top speed. | ||
| This way the model wouldn't have to be trained on specific robot types, and the outputs can then just be scaled via robot-specific controllers for deployment. | ||
| To be clear, the outputs are not low level controls - they are just positional offsets (e.g. dx, dy, dtheta) that are normalized, ultimately allowing for a single policy head to work with the varying dynamics of any robot type. |
Discussing policy head before understanding system architecture
| This extends to other contextual reasoning points, such as traversability (we can choose shortest set of nodes or actions in the graph, but what about the specific quality of the trajectories themselves?). | ||
| The model itself serves as a strong baseline as a foundation model for robot-vision navigation tasks, but still requires more components to push it outside of the realm of just adding more training data. | ||
| On that note also, ViNT seems to be pretty computationally expensive (training, diffusion generation). | ||
| As mentioned in the paper, ViNT requires a "degree of structural similarity" - it is mostly geared towards 2D or planar movement, and focuses mostly on RGB images. |
Geared toward 2D or planar movement. This is likely due to the training process. Any chance this statement can be better motivated?
|
|
||
| **Data Assumptions** | ||
|
|
||
| NoMaD is trained entirely via supervised imitation learning using large-scale real-world datasets (GNM and SACSoN), totaling over 100 hours of robot navigation data. |
What is the ground truth in these datasets? This would help understand the type of data that is being used.
Also, how diverse are they? Do the authors even try to quantify/explain this dataset, or are there other papers that have analyzed it? Was it publicly released?
| Previously in ViNT, exploration versus exploitation was a behavior encoded within the graph generation and subgoal ranking. | ||
| Now with NoMaD, the low level collision avoidance and high level planning (exploration versus subgoal seeking) is defined in one model architecture. |
If you include visual depictions of the architectures, these statements would be more clearly motivated.
| 2. Like most other models, how well would this transfer if trained in office spaces, and deployed in forests? | ||
| There's still some element of scene understanding that needs to be included for the model to be truly generalizable. | ||
| Need priors for understanding generalized knowledge from internet? |
This is an interesting statement. I don't think you ever mentioned which environments NoMaD excels in. I'm sure this is related to the training data as well.
| Will end-to-end ever be? | ||
| Also, this limits human readability of internal logic. | ||
|
|
||
| 4. Interesting to note that Masked ViNT performed so poorly, suggesting that success of paper was not just goal masking, but other subtle architectural changes in tandem. |
I like this thought. It would be interesting to identify these subtle architectural changes, if possible, and list them as bullet points. This would help the reader wrap up NoMaD before moving on.
| Joint training consistently outperforms single-task training, supporting the claim that shared representations across navigation tasks are beneficial rather than harmful. | ||
|
|
||
| 3. **Real-World Deployment** | ||
| The model is deployed zero-shot on a Unitree Go2 quadruped robot and demonstrates stable, non-blocking navigation, including multi-stage "chain-of-navigation" commands. |
"Chain of navigation" commands is a new term in the audit. What does this mean? Chain of thought, reasoning, or causality imply increased human readability. What does this term mean?
Good summary of contents. I think the introduction should contain more of a fusion of the information, with a statement you can now make having read these papers. The final sentence of the intro points to this statement, but that's the only time it is referenced before diving into each of the individual papers.
| From a machine learning standpoint it is impressive, but is it architecturally sound? | ||
| Will end-to-end ever be? | ||
| Also, this limits human readability of internal logic. | ||
|
|
In real deployments, could differences in robot dynamics or control latency undermine the robustness of the learned action distribution?
| These candidates are generated via an image diffusion model (trained on same dataset as ViNT), and are scored by a goal-directed heuristic. | ||
| The transformer helps with the prediction for how well the actions progress to goal since it predicts current distance to goal. | ||
| In the full pipeline, the robot uses the diffusion model to generate subgoals from current observation, and then spatially grounds them via ViNT (goal encoder), and scores them using the heuristic. | ||
| This is notably computationally expensive. |
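A rough sketch of that loop (every component name below is a stand-in I made up; the scoring rule is my paraphrase of the goal-directed heuristic):

```python
def exploration_step(obs, goal_img, diffusion_model, vint, n_candidates=8):
    """One planning step of the pipeline described above, as I understand it.
    `diffusion_model.sample`, `vint.distance`, and `vint.waypoints` stand in
    for the subgoal image generator, the learned temporal-distance head, and
    the action head respectively."""
    candidates = diffusion_model.sample(obs, n=n_candidates)  # candidate subgoal images
    # goal-directed heuristic: prefer subgoals that are close to the current
    # observation AND predicted to be close to the final goal
    scores = [vint.distance(obs, c) + vint.distance(c, goal_img) for c in candidates]
    best = candidates[min(range(len(scores)), key=scores.__getitem__)]
    return vint.waypoints(obs, best)   # waypoints toward the chosen subgoal
```

Written out this way, the cost is also visible: every planning step pays for one image-diffusion sampling pass plus roughly 2 * n_candidates distance predictions.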
It might be useful to add more metrics around this. Maybe mention number of diffusion steps, latency per subgoal, etc.
| Uni-NaVid further enriches decision-making with language, but amplifies the semantic–motor gap, where linguistically valid goals may be dynamically infeasible. | ||
|
|
||
| Taken together, these failures suggest a complementary path forward: ViNT provides structural spatial priors, NoMaD contributes temporal persistence under uncertainty, and Uni-NaVid offers semantic intent. | ||
| A unified navigation system would need to integrate all three, while explicitly managing information decay at the interface between semantic reasoning and low-level control. |
This makes it seem like these are the only components necessary for robotic navigation with VLAs. Is this what you mean?
| 4. **Subgoal diffusion**: Basically just ViNT (diffusion generation of subgoals with navigation policy). | ||
| 5. **Random subgoals**: A variation of ViNT that instead of using diffusion, just randomly samples training data for candidate subgoal. | ||
|
|
||
| From experiments, it was shown that NoMaD performed as well as if not better than ViNT (subgoal diffusion) in both exploration and navigation, whilst using markedly fewer parameters (19M vs 335M). |
Did NoMaD happen to claim any robot-agnostic features like ViNT? Just curious
|
|
||
| Some things which could be expanded upon however, include: | ||
|
|
||
| 1. The probability distribution of the goal masking itself. |
Agreed that this should have been expanded on in the paper. It's not very clear why this was chosen or what impact it would have.
| Will end-to-end ever be? | ||
| Also, this limits human readability of internal logic. | ||
|
|
||
| 4. Interesting to note that Masked ViNT performed so poorly, suggesting that success of paper was not just goal masking, but other subtle architectural changes in tandem. |
I wonder if it was the way in which they implemented the goal masking that caused its poor performance, or was it due to the data they trained with?
|
|
||
| $$ a_t^{k-1} = \alpha\left(a_t^k - \gamma_k \epsilon_\theta\left(c_t, a_t^k, k\right)\right) + \mathcal{N}\left(0, \sigma_k^2I\right)$$ | ||
|
|
||
| After $K$ steps, the policy samples a collision-free, multimodal action sequence $a_t^0$. |
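A minimal sketch of that denoising loop in code (the schedules `alpha`, `gamma`, `sigma` are placeholders supplied by the caller, not the paper's actual values):

```python
import torch

def sample_action_sequence(eps_theta, c_t, K, alpha, gamma, sigma,
                           horizon=8, action_dim=2):
    """Sketch of the quoted update rule. `eps_theta(c_t, a_k, k)` is the
    trained noise-prediction network conditioned on the observation context
    c_t; alpha/gamma/sigma are noise-schedule sequences indexed by k."""
    a_k = torch.randn(horizon, action_dim)         # start from pure Gaussian noise
    for k in range(K, 0, -1):
        noise = torch.randn_like(a_k) if k > 1 else torch.zeros_like(a_k)
        a_k = alpha[k] * (a_k - gamma[k] * eps_theta(c_t, a_k, k)) + sigma[k] * noise
    return a_k                                     # a_t^0: the sampled action sequence
```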
It's not totally clear to me why NoMaD is good (maybe?) at generating collision-free behaviors. Why would we expect this merging of low- and high-level behaviors to be beneficial in collision avoidance?
|
|
||
| b. **Goal encoder** (denoted $\phi$ in the paper): Processes goal images into the sequence as spatial pairing between observations and goal. | ||
| Findings in paper mentioned that this step shouldn't just extract features from goal image, as this would lead to stuff in image sometimes being ignored (temporal inconsistencies). | ||
| Instead, this encoder encodes the difference between current observation and the goal - just stack observations and goal together, pass through EfficientNet, then flatten to get goal tokens, similar to observation encoder. |
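As described, the stacking happens before any features are extracted; a sketch of that idea (with assumed layer sizes, not the exact implementation):

```python
import torch
import torch.nn as nn
import torchvision

class GoalFusionEncoder(nn.Module):
    """Sketch: the current observation and the goal image are concatenated
    along the channel dimension and passed through one EfficientNet, so the
    resulting token describes obs->goal relative information rather than the
    goal image alone."""

    def __init__(self, token_dim=512):
        super().__init__()
        backbone = torchvision.models.efficientnet_b0(weights=None)
        # the first conv must accept 6 channels: 3 for the observation + 3 for the goal
        backbone.features[0][0] = nn.Conv2d(6, 32, kernel_size=3, stride=2,
                                            padding=1, bias=False)
        backbone.classifier = nn.Identity()
        self.backbone = backbone
        self.proj = nn.Linear(1280, token_dim)   # 1280 = EfficientNet-B0 feature size

    def forward(self, obs_img, goal_img):
        x = torch.cat([obs_img, goal_img], dim=1)   # (B, 6, H, W)
        return self.proj(self.backbone(x))          # one fused goal token per pair
```

So the difference is not computed explicitly; the stacked input just gives the network the chance to learn relative features.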
Just to clarify, are they taking the outputs that correspond to the goal tokens? Or how do they know that this is actually encoding the difference between the current observation and the goal - or is it more that they put them in together and hope the transformer learns the difference rather than something else?
|
|
||
| 2. An **online visual token merging mechanism** that directly addresses the scalability bottleneck of long-horizon video input to LLMs. | ||
|
|
||
| 3. A demonstration that **multi-task joint training produces positive synergy**, instead of degrading performance, when the observation and action spaces are fully unified. |
What is positive synergy specifically? Is that something that the authors measure quantitatively or is it just a "vibe"?
| 2. **Semantic topological map**: No explicit map building was done either, but instead a topological graph was built, where waypoints were recorded on the graph as nodes (with corresponding image or feature - learned visual embeddings of visited locations), and the nodes were connected by temporal sequences of actions. | ||
| This allows for long horizon planning and exploration versus exploitation handling without any explicit SLAM. | ||
| While ViNT does not explicitly set a hard line between when to explore and when to seek the goal (which will be explored more in NoMaD), its scoring system encoded into the topological map (diffusion-based subgoal generation in latent space; | ||
| some subgoals might not yet exist in the map but are scored highly for progress toward the goal) creates some form of emergent handling. |
What is "emergent handling"? Was this a design goal?
|
|
||
| ### Architecture | ||
|
|
||
| 1. **Inputs**: Current observation, past observations, and goal images are all encoded via a pair of EfficientNet-B0 encoders. |
Could use an architecture diagram here - this section provides the scaffolding for one of the most prescient criticisms of this paper (it flattens everything down and assumes attention will take care of feature resolution across observations).
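In lieu of a diagram, a rough code scaffold of the forward pass being described (dimensions, depths, and head sizes here are assumptions, not the paper's values):

```python
import torch
import torch.nn as nn

class ViNTStyleSketch(nn.Module):
    """Scaffold: per-frame features and a fused goal token are flattened into
    one token sequence, a transformer mixes them, and two small heads read
    out temporal distance and future waypoints."""

    def __init__(self, token_dim=512, horizon=5):
        super().__init__()
        self.obs_encoder = nn.Identity()    # stand-in for EfficientNet-B0 per frame
        self.goal_encoder = nn.Identity()   # stand-in for the fused obs+goal encoder
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.dist_head = nn.Linear(token_dim, 1)              # temporal distance to goal
        self.action_head = nn.Linear(token_dim, horizon * 2)  # future (dx, dy) waypoints

    def forward(self, obs_tokens, goal_token):
        # obs_tokens: (B, context+1, D) flattened per-frame features
        # goal_token: (B, 1, D)
        tokens = torch.cat([self.obs_encoder(obs_tokens),
                            self.goal_encoder(goal_token)], dim=1)
        summary = self.transformer(tokens).mean(dim=1)   # everything collapses into one summary
        return self.dist_head(summary), self.action_head(summary)
```

The single pooled summary is the sort of thing the criticism above is pointing at: all spatial and temporal structure gets flattened into tokens, and the attention layers are trusted to sort out what matters.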
|
|
||
| 3. **Diffusion**: Also previously mentioned, ViNT also produces subgoal candidates to break down the planning problem. | ||
| These candidates are generated via an image diffusion model (trained on same dataset as ViNT), and are scored by a goal-directed heuristic. | ||
| The transformer helps with the prediction for how well the actions progress to goal since it predicts current distance to goal. |
| ### Training | ||
|
|
||
| ViNT was trained on 100 hours of real-world trajectories, spanning 8 different robot platforms. | ||
| Training procedure is as follows: |
The following seven steps don't seem to compose a functioning "training procedure." Are these all inputs to the model? Which objectives are token reconstruction? Which are L2 costs?
| As is the goal with most modern learning-based end-to-end approaches, it sort of mirrors how humans navigate. | ||
| When a person wants to navigate to some location, we don't analyze the locations of the rocks on the sidewalk in relation to the street and count the steps to reach a goal. | ||
| We do to some extent from a low-level perspective, but this is more a subconscious process than a high-level process. | ||
| We more often are just given semantic cues to tell us whether we have reached locations along the path to the goal, as well as the goal (e.g. walk in some direction for 5-10 minutes until you see a grocery store - requires knowledge of grocery stores, what following roads looks like, etc.). |
I appreciate the "naive introspection" (NI) being performed here. I think it's generally a good first pass to ask ourselves "how do I think I do this" when it comes to observing gaps between robot capabilities and human ones.
The failure of NI emerges when we come up with an explanation that, under experiment, turns out to be incomplete, generally due to optimism about our own decision-making process. See e.g. the Meehl Paradox (Kahneman, Thinking Fast and Slow).
This deep dive would be stronger if you could dive deeper on NI and functionally what is truly happening as humans try to navigate. This could be part of the introduction, maybe a paragraph with a citation or two.
| The model relies purely on monocular RGB input, without depth, LiDAR, or explicit mapping. | ||
|
|
||
| - **Online Visual Token Merging (core design choice)**: | ||
| Instead of letting visual tokens grow unbounded over time, the model dynamically merges tokens based on temporal distance: |
How is this token merging accomplished?
| The model relies purely on monocular RGB input, without depth, LiDAR, or explicit mapping. | ||
|
|
||
| - **Online Visual Token Merging (core design choice)**: | ||
| Instead of letting visual tokens grow unbounded over time, the model dynamically merges tokens based on temporal distance: |
How exactly are these merged/compressed?
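The mechanism isn't spelled out in this section, but one plausible version of temporal-distance-based merging looks like this (all of the window sizes and pooling factors below are illustrative assumptions, not the paper's numbers):

```python
import torch

def merge_visual_tokens(frame_tokens, current_idx, recent_window=4,
                        recent_pool=4, old_pool=16):
    """Sketch: each frame arrives as an (N, D) grid of patch tokens. The
    current frame keeps every token, recently seen frames are averaged in
    small groups, and older frames are averaged in larger groups, so the
    total token count stays bounded for the LLM."""
    merged = []
    for idx, tokens in enumerate(frame_tokens):
        age = current_idx - idx
        if age == 0:
            merged.append(tokens)            # current frame: full resolution
            continue
        group = recent_pool if age <= recent_window else old_pool
        group = min(group, tokens.shape[0])
        usable = (tokens.shape[0] // group) * group
        pooled = tokens[:usable].view(-1, group, tokens.shape[1]).mean(dim=1)
        merged.append(pooled)                # older frame: fewer, coarser tokens
    return torch.cat(merged, dim=0)
```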
| ### Evaluation | ||
|
|
||
| NoMaD was evaluated in 6 different real-world and outdoor environments using a LoCoBot mobile platform. | ||
| The model was compared with 6 different baselines: |
All of these are essentially self-ablations or self-variations. Were there no external benchmarks?
| ### Summary - The Interesting and The Concerning | ||
|
|
||
| In comparison to ViNT, NoMaD presents a new approach to learning exploration and goal-seeking behaviors. | ||
| Instead of relying on a hierarchical graph-based approach, it's simply done by masking the goal during training and inference time to push the robot's adaptability to both scenarios. |
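A tiny sketch of that goal-masking idea (the masking probability and the zero-out mechanism are my assumptions, not the paper's exact implementation):

```python
import torch

def apply_goal_masking(goal_tokens, p_mask=0.5, explore=None):
    """Sketch: during training the goal tokens are dropped with probability
    p_mask, so one policy learns both goal-conditioned and undirected
    (exploratory) behavior. At deployment the mask is set explicitly:
    explore=True hides the goal, explore=False shows it."""
    if explore is None:                              # training: random masking
        masked = torch.rand(()).item() < p_mask
    else:                                            # inference: caller picks the mode
        masked = explore
    return torch.zeros_like(goal_tokens) if masked else goal_tokens
```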
Pull request for draft of paper audit on navigation unit. Our audit focuses on how VLAs and VLMs can be applied to real-world robot vision tasks (specifically navigation), as well as a deep dive into the impact of model design choices and system architecture on overall performance, robustness, and scalability to the real world. In this audit, we highlight 3 main papers: ViNT, NoMaD, and Uni-NaVid.