
Audit: Navigation - Yuni Wu and Jimmy Tran#57

Open
jt7347 wants to merge 5 commits into main from audit/yuni-wyx-jt7347-navigation

Conversation


@jt7347 jt7347 commented Feb 10, 2026

Pull request for the draft of our paper audit on the navigation unit. Our audit focuses on how VLAs and VLMs can be applied to real-world robot vision tasks (specifically navigation), as well as a deep dive into the impact of model design choices and system architecture on overall performance, robustness, and scalability to the real world. In this audit, we highlight three main papers: ViNT, NoMaD, and Uni-NaVid.


github-actions bot commented Feb 10, 2026

🚀 Preview Deployed

Your preview is ready for review!

🔗 Preview URL: https://arpg.github.io/vla-foundations/staging/pulls/57/textbook/audits/staging/yuni-wyx-jt7347/

Review Checklist

  • LaTeX equations render correctly
  • All sections are complete per the template
  • References are formatted properly
  • Figures/diagrams display correctly

Next Steps

  1. Review your rendered content using the preview link above
  2. Tag @crheckman when ready for instructor review
  3. Push updates to auto-refresh the preview

This preview will be removed when the PR is closed.

@jt7347 jt7347 force-pushed the audit/yuni-wyx-jt7347-navigation branch from 6204c94 to 9aa48c1 on February 10, 2026 07:49

jt7347 commented Feb 10, 2026

@crheckman First draft of navigation paper audit ready for review.

1. **Action space normalization**: To make the model truly embodiment agnostic, ViNT used relative waypoints.
To account for the different speeds and sizes of the robots, the waypoints were normalized by scaling them by each robot's top speed.
This way the model wouldn't have to be trained for specific robot types, and the outputs can simply be scaled back by robot-specific controllers for deployment.
To be clear, the outputs are not low-level controls - they are just normalized positional offsets (e.g. dx, dy, dtheta), ultimately allowing a single policy head to work with the varying dynamics of any robot type.
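
A minimal sketch of what this normalization might look like in practice (the scaling convention, constants, and function names are illustrative assumptions, not taken from the ViNT codebase):

```python
import numpy as np

def normalize_waypoints(waypoints_xy, top_speed, dt):
    """Scale relative (dx, dy) waypoints into a speed-normalized action space.

    waypoints_xy: (H, 2) array of relative offsets in meters for H future steps.
    top_speed:    robot's maximum linear speed in m/s (embodiment-specific).
    dt:           time between waypoints in seconds.
    """
    # One "unit" corresponds to the distance the robot covers at top speed in dt.
    return waypoints_xy / (top_speed * dt)

def denormalize_waypoints(normalized_xy, top_speed, dt):
    """Map the policy's normalized outputs back to metric offsets for this robot."""
    return normalized_xy * (top_speed * dt)

# Example: the same normalized action expands to different metric offsets per embodiment.
action = np.array([[0.5, 0.0], [1.0, 0.1]])                     # policy output (unitless)
print(denormalize_waypoints(action, top_speed=0.5, dt=0.25))    # slow ground robot
print(denormalize_waypoints(action, top_speed=2.0, dt=0.25))    # faster platform
```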
Contributor:

How do they actually provide the low-level controls to move to the specific positional offsets?

Contributor Author:

They use a low-level velocity controller to track the commands. The clarification here is that the model outputs don't directly control the robot (just high-level waypoints) - an adapter piece of code is required to convert these to specific robot controls I believe.
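
As a rough illustration of that "adapter" idea, a hypothetical proportional controller tracking a single relative waypoint on a differential-drive base might look like this (gains and limits are made up; ViNT's actual deployment controllers are robot-specific):

```python
import numpy as np

def waypoint_to_velocity(dx, dy, v_max, w_max, k_v=1.0, k_w=2.0):
    """Convert a relative waypoint (robot frame) into (linear, angular) velocity commands.

    dx, dy:       waypoint offset ahead/left of the robot, in meters.
    v_max, w_max: robot-specific velocity limits (the embodiment-dependent part).
    """
    heading_error = np.arctan2(dy, dx)          # angle toward the waypoint
    distance = np.hypot(dx, dy)                 # how far away the waypoint is
    v = np.clip(k_v * distance * np.cos(heading_error), 0.0, v_max)
    w = np.clip(k_w * heading_error, -w_max, w_max)
    return v, w

# The same policy waypoint yields different commands on different platforms.
print(waypoint_to_velocity(0.8, 0.2, v_max=0.5, w_max=1.0))   # LoCoBot-like base
print(waypoint_to_velocity(0.8, 0.2, v_max=1.5, w_max=2.0))   # faster robot
```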

Training data for every type of individual subcomponent of robot vision tasks is costly, so the authors wanted to define a unified model that can handle the general case, and can be minimally adapted to a diverse range of tasks.
Two features of their approach that stood out were:

1. **Action space normalization**: To make the model truly embodiment agnostic, ViNT used relative waypoints.
Contributor:

I'm not sure we're at the place where foundation models can be considered "truly embodiment agnostic". It seems like the waypoints are scaled for different speeds and sizes of robots, but are they scaled such that a drone and a ground robot are both able to use them? Maybe so, but I suspect not. That's why I feel like "truly embodiment agnostic" is too strong of a claim.

Contributor Author @jt7347 commented Feb 17, 2026:

Yes - not a great wording choice. ViNT pursues embodiment agnosticism, but clearly it's trained on a subset of robots, not all robots. There are other factors to consider in how a robot might move compared to the traditional robots we think of: a drone could be the size of a (really) small quadruped, but the learned behaviors across the two need to be different.


With ViNT's strong generalization capabilities as a foundation model, the aspect of building a semantic topological map stood out the most.
As is the goal with most modern learning-based end-to-end approaches, it sort of mirrors how humans navigate.
When a person wants to navigate to some location, we don't analyze the locations of the rocks on the sidewalk in relation to the street and count the steps to reach a goal.
Contributor:

ViNT focuses on geometric progress toward goals, but how might the framework incorporate higher-level semantic or social reasoning (e.g., avoiding human interaction zones) when multiple trajectories are physically feasible?

Contributor Author:

It currently does not do so explicitly - there are emergent behaviors that they talk about, but these are largely learned from the training datasets themselves - for example, modelling trajectory behavior on robot navigation datasets where the robots move around humans, so the policy might pick up on the idea that if it sees a human leg, it should output waypoints a little farther away from the human. Also, ViNT doesn't tackle this the way NoMaD does, with a distribution over actions rather than just one action, so traversability is not heavily encoded either.

While ViNT does not explicitly set a hard line between when to explore and when to seek the goal (which will be explored more in NoMaD), the scoring system encoded into its topological map (diffusion-based subgoal generation in latent space; some subgoals may not yet exist in the map but are scored highly for progress toward the goal) creates some form of emergent handling.

### Architecture
Contributor:

An architecture diagram would be a great addition here

Collaborator @crheckman left a comment:

intro comments - 10min into reading period

## Introduction

Robotic navigation has traditionally been structured around a modular sense–think–act pipeline, where perception, mapping, planning, and control are explicitly defined and engineered.
While this paradigm has proven reliable, it relies heavily on discrete reasoning and hand-designed representations, requiring prior knowledge about the environment, the robot embodiment, and the task structure.
Collaborator:

I think this is in the right direction, but too broad. Not all representations are discrete (there exist continuous SLAM representations). They don't necessarily require knowledge about the environment because the representations themselves are geometric.

A stronger disagreement with these techniques would be (imo) that they are simply too metric. They are conditioned too strongly on a numerical/quantitative representation that isn't part of "natural" decision making, or certainly not contextual -- they boil everything down to a set of numbers that measure meters, not even a vector of numbers that represents a summary.

This transition reduces the need for explicit intermediate representations, but introduces new challenges in generalization, temporal consistency, and physical executability.

This audit examines three representative approaches—ViNT, NoMaD, and Uni-NaVid—as incremental attempts to relax prior assumptions while preserving navigational competence.
Rather than proposing a single unified solution, these works decompose the problem of generalized robot navigation into progressively harder sub-problems.
Collaborator:

What is a "progressively harder subproblem"? We would need an example here to make this statement mean anything.

NoMaD addresses the subsequent challenge of goal-directed behavior under complete environmental uncertainty, explicitly modeling the trade-off between exploration and exploitation during navigation.
Uni-NaVid further extends this paradigm by incorporating language grounding, allowing navigation policies to condition their search and execution strategies on semantic instructions across diverse datasets.

Viewed together, these approaches form a progressive trajectory: from semantic spatial representation (ViNT), to uncertainty-aware navigation dynamics (NoMaD), to language-conditioned decision making (Uni-NaVid).
Collaborator:

This introduction is not projecting the context we've established in the class within the first several lectures of mine. For instance, we have talked about how language-conditioned decision-making can be as simple as zero-shotting robot APIs to a frontier model and asking for codegen. What I want to see in this intro is how the field has rapidly moved beyond that, and what the key ingredients to that boundary expansion have been.

The model itself serves as a strong baseline as a foundation model for robot-vision navigation tasks, but still requires more components to push it outside of the realm of just adding more training data.
On that note also, ViNT seems to be pretty computationally expensive (training, diffusion generation).
As mentioned in the paper, ViNT requires a "degree of structural similarity" - it is mostly geared towards 2D or planar movement, and focuses mostly on RGB images.
Regardless, the ViNT framework is an important step towards generalization of robot navigation tasks.
Contributor:

One additional question: the normalized waypoint abstraction might rely heavily on stable low-level controllers — could differences in controller dynamics limit cross-embodiment robustness in real-world deployments?

Contributor Author:

The specific controller is written for each specific robot; they don't use a single controller for everything. The cross-embodiment part is in waypoint generation - understanding, at a higher level, where a robot should go. Beyond that, there still needs to be an 'adapter' (like a one-size-fits-all shoe where you still need insoles... not a great example, but I think it helps explain what I mean).


2. **Zero-shot transfer**: ViNT was simply deployed zero-shot on 4 different robots, and success rates were recorded.

3. **Broader Generalization, Fine-Tuning, and Adaptations**: The model can be adapted to different task specifications via prompt tuning.
Contributor:

What kind of task specifications? I'm assuming these are all navigation-related, and that's why prompt tuning works well?

Contributor Author:

I believe it was things like changing the goal images (satellite images vs. ground images), or training in different environments, like the CARLA urban driving simulations (they changed it so the decisions were just turn left / right / forward at the intersection, etc.). To be honest I'm a little hazy on the prompt tuning part; I didn't fully understand how they did it, but I will go back and read.

2. **Semantic topological map**: No explicit map building was done either, but instead a topological graph was built, where waypoints were recorded on the graph as nodes (with corresponding image or feature - learned visual embeddings of visited locations), and the nodes were connected by temporal sequences of actions.
This allows for long horizon planning and exploration versus exploitation handling without any explicit SLAM.
While ViNT does not explicitly set a hard line between when to explore and when to seek the goal (which will be explored more in NoMaD), the scoring system encoded into its topological map (diffusion-based subgoal generation in latent space; some subgoals may not yet exist in the map but are scored highly for progress toward the goal) creates some form of emergent handling.
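
A hypothetical sketch of the kind of bookkeeping such a topological memory implies (node contents, the novelty threshold, and the learned distance predictor interface are assumptions for illustration, not the authors' code):

```python
import networkx as nx

class TopologicalMemory:
    """Nodes hold visual embeddings of visited places; edges record temporal reachability."""

    def __init__(self, distance_fn, add_threshold=3.0):
        self.graph = nx.DiGraph()
        self.distance_fn = distance_fn      # learned predictor: (emb_a, emb_b) -> estimated steps
        self.add_threshold = add_threshold  # only add a node if it is "far" from existing ones
        self._last_node = None

    def update(self, embedding):
        """Add the current observation as a node if it is novel, and link it temporally."""
        nearest = min(
            self.graph.nodes,
            key=lambda n: self.distance_fn(self.graph.nodes[n]["emb"], embedding),
            default=None,
        )
        if nearest is None or self.distance_fn(self.graph.nodes[nearest]["emb"], embedding) > self.add_threshold:
            node_id = self.graph.number_of_nodes()
            self.graph.add_node(node_id, emb=embedding)
        else:
            node_id = nearest
        if self._last_node is not None and self._last_node != node_id:
            # Edge weight = predicted temporal distance (number of action steps).
            w = self.distance_fn(self.graph.nodes[self._last_node]["emb"], embedding)
            self.graph.add_edge(self._last_node, node_id, weight=w)
        self._last_node = node_id
        return node_id

    def plan(self, start, goal):
        """Long-horizon planning reduces to a shortest path over predicted temporal distances."""
        return nx.shortest_path(self.graph, start, goal, weight="weight")
```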
Contributor:

How does the model ensure these subgoals are physically reachable or collision-free?

Contributor Author:

Not explicitly done, but rather learned reasoning - physical reachability just has to do with learning how to navigate certain environmental structures, and the same goes for staying collision-free. There isn't a clear explicit reasoning module per se that tackles this task, I think. While it works in the very general case, that was one of the weak points of ViNT.

Uni-NaVid further extends this paradigm by incorporating language grounding, allowing navigation policies to condition their search and execution strategies on semantic instructions across diverse datasets.

Viewed together, these approaches form a progressive trajectory: from semantic spatial representation (ViNT), to uncertainty-aware navigation dynamics (NoMaD), to language-conditioned decision making (Uni-NaVid).
This audit argues that while each step improves generalization, the primary bottleneck shifts toward the interface between high-level semantic goals and temporally consistent, physically grounded control.
Contributor:

This seems like a driving sentence for the audit. Is it possible to provide examples of what the bottleneck is? This would help ground my reading moving forward and help paint a picture of what's to come.

Contributor Author:

I think the bottleneck is that there is still a gap between learning general navigation and learning the robot/scene context - how robot and scene need to be learned together. By this I mean that, based on scene understanding and knowledge of how a robot moves, humans can explicitly reason about the actions it should take to get from A to B, for example going down a ramp instead of a staircase. While we can train a robot to learn preferable behavior, it does not yet have explicit reasoning about how to choose between the two semantically. This feeds into the control problem as well - knowing what controls are available to make it possible for a quadruped to go down a staircase. That level of fine-tuning is currently very difficult.

Training data for every type of individual subcomponent of robot vision tasks is costly, so the authors wanted to define a unified model that can handle the general case, and can be minimally adapted to a diverse range of tasks.
Two features of their approach that stood out were:

1. **Action space normalization**: To make the model truly embodiment agnostic, ViNT used relative waypoints.
Contributor @Zaaler commented Feb 17, 2026:

"truly embody"

1. **Action space normalization**: To make the model truly embodiment agnostic, ViNT used relative waypoints.
To account for the different speeds and sizes of the robots, the waypoints were normalized by scaling them by each robot's top speed.
This way the model wouldn't have to be trained for specific robot types, and the outputs can simply be scaled back by robot-specific controllers for deployment.
To be clear, the outputs are not low-level controls - they are just normalized positional offsets (e.g. dx, dy, dtheta), ultimately allowing a single policy head to work with the varying dynamics of any robot type.
Contributor:

Discussing policy head before understanding system architecture

This extends to other contextual reasoning points, such as traversability (we can choose the shortest set of nodes or actions in the graph, but what about the specific quality of the trajectories themselves?).
The model itself serves as a strong baseline as a foundation model for robot-vision navigation tasks, but still requires more components to push it outside of the realm of just adding more training data.
On that note also, ViNT seems to be pretty computationally expensive (training, diffusion generation).
As mentioned in the paper, ViNT requires a "degree of structural similarity" - it is mostly geared towards 2D or planar movement, and focuses mostly on RGB images.
Contributor:

Geared toward 2D or planar movement. This is likely due to the training process. Any chance this statement can be better motivated?


**Data Assumptions**

NoMaD is trained entirely via supervised imitation learning using large-scale real-world datasets (GNM and SACSoN), totaling over 100 hours of robot navigation data.
Contributor:

What is the ground truth in these datasets? This would help understand the type of data that is being used.

Collaborator:

Also, how diverse are they? Do the authors even try to quantify/explain this dataset, or are there other papers that have analyzed it? Was it publicly released?

Comment on lines +242 to +243
Previously in ViNT, exploration versus exploitation was a behavior encoded within the graph generation and subgoal ranking.
Now with NoMaD, the low-level collision avoidance and high-level planning (exploration versus subgoal seeking) are defined in one model architecture.
Contributor:

If you include visual depictions of the architectures, these statements would be more clearly motivated.

Comment on lines +253 to +255
2. Like most other models, how well would this transfer if trained in office spaces, and deployed in forests?
There's still some element of scene understanding that needs to be included for the model to be truly generalizable.
Do we need priors that capture generalized knowledge from the internet?
Contributor:

This is an interesting statement. I don't think you ever mentioned which environments NoMaD excels in. I'm sure this is related to the training data as well.

Will end-to-end ever be?
Also, this limits human readability of internal logic.

4. It is interesting to note that Masked ViNT performed so poorly, suggesting that the paper's success came not just from goal masking, but from other subtle architectural changes in tandem.
Contributor:

I like this thought. It would be interesting to identify these subtle architectural changes, if possible, and list them as bullet points. This would help the reader wrap up NoMaD before moving on.

Joint training consistently outperforms single-task training, supporting the claim that shared representations across navigation tasks are beneficial rather than harmful.

3. **Real-World Deployment**
The model is deployed zero-shot on a Unitree Go2 quadruped robot and demonstrates stable, non-blocking navigation, including multi-stage "chain-of-navigation" commands.
Contributor:

"Chain of navigation" commands is a new term in the audit. What does this mean? Chain of thought, reasoning, or causality imply increased human readability. What does this term mean?

Contributor:

Good summary of contents. I think the introduction should contain more of a fusion of the information with a statement that you can make now that you have read these papers. The final sentence of the Intro points to this statement, but that's the only time it is referenced before diving into each of the individual papers.

From a machine learning standpoint it is impressive, but is it architecturally sound?
Will end-to-end ever be?
Also, this limits human readability of internal logic.

Contributor:

In real deployments, could differences in robot dynamics or control latency undermine the robustness of the learned action distribution?

These candidates are generated via an image diffusion model (trained on the same dataset as ViNT) and are scored by a goal-directed heuristic.
The transformer helps predict how well the actions progress toward the goal, since it predicts the current distance to the goal.
In the full pipeline, the robot uses the diffusion model to generate subgoals from the current observation, spatially grounds them via the ViNT goal encoder, and scores them using the heuristic.
This is notably computationally expensive.
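
A hedged pseudocode sketch of the pipeline described above. The object interfaces (`diffusion_model.sample`, `vint.predict_distance`) and the additive cost combining distance-to-subgoal and distance-to-goal are illustrative assumptions, not the authors' exact scoring rule:

```python
def propose_next_subgoal(obs_image, goal_image, diffusion_model, vint, n_candidates=16):
    """Generate candidate subgoal images and rank them by predicted progress to the goal.

    diffusion_model: image diffusion model producing plausible subgoal observations.
    vint:            model exposing a learned temporal-distance head d(obs, goal).
    """
    candidates = [diffusion_model.sample(obs_image) for _ in range(n_candidates)]
    scored = []
    for subgoal in candidates:
        d_to_subgoal = vint.predict_distance(obs_image, subgoal)   # reachability from here
        d_to_goal = vint.predict_distance(subgoal, goal_image)     # remaining progress
        scored.append((d_to_subgoal + d_to_goal, subgoal))
    # Lower total predicted distance = better candidate; each call to the diffusion
    # sampler is expensive, which is where the computational cost comes from.
    return min(scored, key=lambda s: s[0])[1]
```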
Contributor:

It might be useful to add more metrics around this. Maybe mention number of diffusion steps, latency per subgoal, etc.

Uni-NaVid further enriches decision-making with language, but amplifies the semantic–motor gap, where linguistically valid goals may be dynamically infeasible.

Taken together, these failures suggest a complementary path forward: ViNT provides structural spatial priors, NoMaD contributes temporal persistence under uncertainty, and Uni-NaVid offers semantic intent.
A unified navigation system would need to integrate all three, while explicitly managing information decay at the interface between semantic reasoning and low-level control.
Collaborator:

This makes it seem like these are the only components necessary for robotic navigation with VLAs. Is this what you mean?

4. **Subgoal diffusion**: Essentially ViNT itself (diffusion generation of subgoals with the navigation policy).
5. **Random subgoals**: A variation of ViNT that, instead of using diffusion, randomly samples candidate subgoals from the training data.

From experiments, it was shown that NoMaD performed as well as, if not better than, ViNT (subgoal diffusion) in both exploration and navigation, while using markedly fewer parameters (19M vs 335M).
Contributor:

Did NoMaD happen to claim any robot-agnostic features like ViNT? Just curious


Some things which could be expanded upon, however, include:

1. The probability distribution of the goal masking itself.
Contributor:

Agreed that this should have been expanded on in the paper. It's not very clear why this was chosen or what impact it would have.

Will end-to-end ever be?
Also, this limits human readability of internal logic.

4. It is interesting to note that Masked ViNT performed so poorly, suggesting that the paper's success came not just from goal masking, but from other subtle architectural changes in tandem.
Contributor:

I wonder if it was the way in which they implemented the goal masking that caused its poor performance, or was it due to the data they trained with?


$$ a_t^{k-1} = \alpha\left(a_t^k - \gamma_k \epsilon_\theta\left(c_t, a_t^k, k\right)\right) + \mathcal{N}\left(0, \sigma_k^2I\right)$$

After $K$ steps, the policy samples a collision-free, multimodal action sequence $a_t^0$.
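
For concreteness, the denoising loop implied by the update above can be written out directly, where `eps_theta` is the learned noise-prediction network conditioned on the context `c_t`. Tensor shapes and the noise-schedule container are assumptions for illustration:

```python
import torch

def sample_actions(eps_theta, c_t, schedule, horizon, action_dim, K):
    """Run K reverse-diffusion steps to draw an action sequence a_t^0.

    eps_theta: network predicting the noise component, eps_theta(c_t, a_k, k).
    schedule:  per-step coefficients {"alpha": [...], "gamma": [...], "sigma": [...]},
               indexed 1..K (lengths K+1 assumed here for simplicity).
    """
    a_k = torch.randn(horizon, action_dim)           # a_t^K ~ N(0, I)
    for k in reversed(range(1, K + 1)):
        eps = eps_theta(c_t, a_k, k)
        mean = schedule["alpha"][k] * (a_k - schedule["gamma"][k] * eps)
        noise = schedule["sigma"][k] * torch.randn_like(a_k) if k > 1 else 0.0
        a_k = mean + noise                            # the update quoted above
    return a_k                                        # a_t^0: the executed action sequence
```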
Contributor:

It's not totally clear to me why NoMaD is good (maybe?) at generating collision-free behaviors. Why would we expect this merging of low- and high-level behaviors to be beneficial in collision avoidance?


b. **Goal encoder** (denoted $\phi$ in the paper): Processes goal images into the sequence as a spatial pairing between observations and goal.
Findings in the paper mention that this step shouldn't just extract features from the goal image, as this would lead to parts of the image sometimes being ignored (temporal inconsistencies).
Instead, this encoder encodes the difference between the current observation and the goal - the observation and goal are stacked together, passed through an EfficientNet, and then flattened to get goal tokens, similar to the observation encoder.
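
A minimal sketch of that goal-fusion idea, assuming 3-channel RGB inputs and torchvision's EfficientNet-B0 as the backbone; the projection size and token count are illustrative, not the paper's exact values:

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0

class GoalFusionEncoder(nn.Module):
    """Encode the *difference* between the current observation and the goal by
    stacking them channel-wise before the CNN, rather than encoding the goal alone."""

    def __init__(self, token_dim=256):
        super().__init__()
        backbone = efficientnet_b0(weights=None)
        # The first conv must accept 6 channels (obs RGB + goal RGB stacked).
        backbone.features[0][0] = nn.Conv2d(6, 32, kernel_size=3, stride=2, padding=1, bias=False)
        self.features = backbone.features
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(1280, token_dim)   # 1280 = EfficientNet-B0 feature width

    def forward(self, obs_rgb, goal_rgb):
        x = torch.cat([obs_rgb, goal_rgb], dim=1)        # (B, 6, H, W)
        feat = self.pool(self.features(x)).flatten(1)    # (B, 1280)
        return self.proj(feat)                           # one goal token per obs/goal pair

# goal_token = GoalFusionEncoder()(obs, goal)  # obs, goal: (B, 3, 96, 96) tensors
```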
Contributor:

Just to clarify, are they taking the outputs that correspond to the goal input? Or how do they know that this is actually encoding the difference between the current observation and the goal - or is it more that they put them in together and hope the transformer learns the difference rather than something else?


2. An **online visual token merging mechanism** that directly addresses the scalability bottleneck of long-horizon video input to LLMs.

3. A demonstration that **multi-task joint training produces positive synergy**, instead of degrading performance, when the observation and action spaces are fully unified.
Contributor:

What is positive synergy specifically? Is that something that the authors measure quantitatively or is it just a "vibe"?

Collaborator @crheckman left a comment:

through ViNT

2. **Semantic topological map**: No explicit map building was done either, but instead a topological graph was built, where waypoints were recorded on the graph as nodes (with corresponding image or feature - learned visual embeddings of visited locations), and the nodes were connected by temporal sequences of actions.
This allows for long horizon planning and exploration versus exploitation handling without any explicit SLAM.
While ViNT does not explicitly set a hard line between when to explore and when to seek the goal (which will be explored more in NoMaD), the scoring system encoded into its topological map (diffusion-based subgoal generation in latent space; some subgoals may not yet exist in the map but are scored highly for progress toward the goal) creates some form of emergent handling.
Collaborator:

What is "emergent handling"? Was this a design goal?


### Architecture

1. **Inputs**: Current observation, past observations, and goal images are all encoded via a pair of EfficientNet-B0 encoders.
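
To make the "flattening" concern concrete, a rough scaffold of this tokenize-then-attend design might look as follows. Layer sizes, context length, and head structure are assumptions for illustration, not ViNT's exact configuration:

```python
import torch
import torch.nn as nn

class NavTransformer(nn.Module):
    """Each image is flattened to a single token; a transformer fuses observation and
    goal tokens, then predicts temporal distance-to-goal and normalized waypoints."""

    def __init__(self, token_dim=512, n_obs=6, horizon=5, n_layers=4, n_heads=8):
        super().__init__()
        self.horizon = horizon
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.pos_emb = nn.Parameter(torch.zeros(1, n_obs + 1, token_dim))  # obs tokens + goal token
        self.dist_head = nn.Linear(token_dim, 1)                # predicted distance to goal
        self.action_head = nn.Linear(token_dim, horizon * 3)    # (dx, dy, dtheta) per future step

    def forward(self, obs_tokens, goal_token):
        # obs_tokens: (B, n_obs, D) from the observation encoder; goal_token: (B, D).
        seq = torch.cat([obs_tokens, goal_token.unsqueeze(1)], dim=1) + self.pos_emb
        fused = self.encoder(seq).mean(dim=1)    # everything collapses into one context vector
        dist = self.dist_head(fused)
        actions = self.action_head(fused).view(-1, self.horizon, 3)
        return dist, actions
```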
Collaborator:

Could use an architecture diagram here - this section provides the scaffolding for one of the most prescient criticisms of this paper (it flattens everything down and assumes attention will take care of feature resolution across observations).


3. **Diffusion**: As previously mentioned, ViNT also produces subgoal candidates to break down the planning problem.
These candidates are generated via an image diffusion model (trained on the same dataset as ViNT) and are scored by a goal-directed heuristic.
The transformer helps predict how well the actions progress toward the goal, since it predicts the current distance to the goal.
Collaborator:

???

### Training

ViNT was trained on 100 hours of real-world trajectories, spanning 8 different robot platforms.
The training procedure is as follows:
Collaborator:

The following seven steps don't seem to compose a functioning "training procedure." Are these all inputs to the model? Which objectives are token reconstruction? Which are L2 costs?

As is the goal with most modern learning-based end-to-end approaches, it sort of mirrors how humans navigate.
When a person wants to navigate to some location, we don't analyze the locations of the rocks on the sidewalk in relation to the street and count the steps to reach a goal.
We do to some extent from a low-level perspective, but this is more a subconscious process than a high-level process.
More often, we are just given semantic cues that tell us whether we have reached locations along the path to the goal, as well as the goal itself (e.g. walk in some direction for 5-10 minutes until you see a grocery store - this requires knowledge of grocery stores, what following a road looks like, etc.).
Collaborator:

I appreciate the "naive introspection" (NI) being performed here. I think it's generally a good first pass to ask ourselves "how do I think I do this" when it comes to observing gaps between robot capabilities and human ones.

The failure of NI emerges when we come up with an explanation that, under experiment, turns out to be incomplete, generally due to optimism about our own decision-making process. See e.g. the Meehl Paradox (Kahneman, Thinking Fast and Slow).

This deep dive would be stronger if you could dive deeper on NI and functionally what is truly happening as humans try to navigate. This could be part of the introduction, maybe a paragraph with a citation or two.

The model relies purely on monocular RGB input, without depth, LiDAR, or explicit mapping.

- **Online Visual Token Merging (core design choice)**:
Instead of letting visual tokens grow unbounded over time, the model dynamically merges tokens based on temporal distance:
Contributor:

How is this token merging accomplished?

The model relies purely on monocular RGB input, without depth, LiDAR, or explicit mapping.

- **Online Visual Token Merging (core design choice)**:
Instead of letting visual tokens grow unbounded over time, the model dynamically merges tokens based on temporal distance:
Contributor:

How exactly are these merged/compressed?
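
The paper's exact mechanism isn't reproduced in this excerpt, but one hypothetical way to realize temporal-distance-based merging is to keep recent frames at full token resolution and average-pool older frames more aggressively. The window boundaries and pooling factors below are made-up values for illustration only:

```python
import torch

def merge_visual_tokens(frame_tokens, recent=8, mid_pool=4, old_pool=16):
    """Compress a growing video-token history by temporal distance.

    frame_tokens: list of (N, D) tensors, oldest first, one entry per frame.
    Recent frames keep all N tokens; middle-aged frames are mean-pooled by
    `mid_pool`; older frames are pooled by `old_pool`, bounding sequence length.
    """
    def pool(tokens, factor):
        n, d = tokens.shape
        keep = (n // factor) * factor
        return tokens[:keep].view(-1, factor, d).mean(dim=1) if keep else tokens.mean(0, keepdim=True)

    merged = []
    T = len(frame_tokens)
    for i, tok in enumerate(frame_tokens):
        age = T - 1 - i
        if age < recent:
            merged.append(tok)                  # keep recent frames intact
        elif age < 4 * recent:
            merged.append(pool(tok, mid_pool))  # moderate compression
        else:
            merged.append(pool(tok, old_pool))  # aggressive compression for old frames
    return torch.cat(merged, dim=0)             # single token sequence for the LLM
```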


**Data Assumptions**

NoMaD is trained entirely via supervised imitation learning using large-scale real-world datasets (GNM and SACSoN), totaling over 100 hours of robot navigation data.
Collaborator:

Also, how diverse are they? Do the authors even try to quantify/explain this dataset, or are there other papers that have analyzed it? Was it publicly released?

### Evaluation

NoMaD was evaluated in 6 different real-world and outdoor environments using a LoCoBot mobile platform.
The model was compared with 6 different baselines:
Collaborator:

All of these are essentially self-ablations or self-variations. Were there no external benchmarks?

### Summary - The Interesting and The Concerning

In comparison to ViNT, NoMaD presents a new approach to learning exploration and goal-seeking behaviors.
Instead of relying on a hierarchical graph-based approach, this is done simply by masking the goal during training and inference to push the robot's adaptability to both scenarios.
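
A minimal sketch of that goal-masking idea during training; the masking probability and the zeroing-out strategy are illustrative assumptions (the paper's exact masking distribution is flagged above as something that could have been expanded on):

```python
import torch

def goal_masked_batch(obs_ctx, goal_token, p_mask=0.5):
    """With probability p_mask, hide the goal so the same policy learns both
    undirected exploration (masked) and goal-seeking (unmasked) behavior."""
    B = goal_token.shape[0]
    mask = (torch.rand(B, 1) < p_mask).float()   # 1 = goal hidden for this sample
    masked_goal = goal_token * (1.0 - mask)      # zero out the goal token when masked
    return obs_ctx, masked_goal, mask

# At inference, the mask is simply set: masked (explore) or unmasked (navigate to goal).
```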
Collaborator:

define adaptability?
