content/textbook/audits/staging/yuni-wyx-jt7347.mdx
6 additions & 3 deletions
@@ -54,7 +54,8 @@ While ViNT does not explicitly set a hard line between when to explore or when t
 Instead, this encoder encodes the difference between the current observation and the goal: simply stack the observations and the goal together, pass them through EfficientNet, then flatten to get the goal tokens, similar to the observation encoder.
 Attention forces the goal to attend to / compare with the windowed observation sequence.
-2. **Transformer**: $P_{\text{past}}$, $P_{\text{obs}}$ (i.e. current), and $P_{\text{goal}}$ tokens are combined with positional encoding, and passed into decoder-only Transformer backbone (denoted as 'f' in the paper) with 4 multi-headed attention blocks (4 heads, 4 layers), and 2048 hidden units.
+2. **Transformer**: $P_{\text{past}}$, $P_{\text{obs}}$ (i.e. current), and $P_{\text{goal}}$ tokens are combined with positional encoding.
+These are passed into a decoder-only Transformer backbone (denoted as $f$ in the paper) with 4 multi-headed attention blocks (4 heads, 4 layers) and 2048 hidden units.
 a. 6 tokens, model dimension of 512, 4 layers, 4 heads, 2048 feed-forward hidden dim, 128 dims per attention head (512 / 4).
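The backbone configuration above (6 tokens, model dimension 512, 4 layers, 4 heads, 2048 feed-forward hidden units, 128 dims per head) can be sketched in PyTorch. This is a hedged illustration, not the authors' code: the EfficientNet token extraction is stubbed with random tensors, and the 5-observation / 1-goal token split is an assumption for the example.

```python
import torch
import torch.nn as nn

# Hyperparameters as stated in the text; the token split is assumed.
d_model, n_heads, n_layers, ffn_dim, n_tokens = 512, 4, 4, 2048, 6

layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=ffn_dim, batch_first=True
)
backbone = nn.TransformerEncoder(layer, num_layers=n_layers)

# Stand-ins for EfficientNet-derived tokens: 5 observation tokens + 1 goal token.
obs_tokens = torch.randn(1, 5, d_model)
goal_token = torch.randn(1, 1, d_model)
pos = torch.randn(1, n_tokens, d_model)  # stand-in for the positional encoding

tokens = torch.cat([obs_tokens, goal_token], dim=1) + pos
out = backbone(tokens)
print(out.shape)  # torch.Size([1, 6, 512])
```

Each of the 4 heads then attends over 512 / 4 = 128 dimensions, matching the per-head size given above.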
@@ -216,7 +217,8 @@ NoMaD was evaluated in 6 different real-world and outdoor environments using a L
 The model was compared with 6 different baselines:
 1. **VIB**: Variational Information Bottleneck, which models the distribution of actions conditioned on observations.
-2. **Masked ViNT**: Essentially ViNT but with goal masking policy. Predicts point estimates of future actions instead of modeling the entire distribution.
+2. **Masked ViNT**: Essentially ViNT, but with the goal-masking policy.
+It predicts point estimates of future actions instead of modeling the entire distribution.
 3. **Autoregressive**: Uses autoregressive prediction over a discrete distribution of actions.
 4. **Subgoal diffusion**: Essentially ViNT (diffusion generation of subgoals combined with the navigation policy).
 5. **Random subgoals**: A variation of ViNT that, instead of using diffusion, randomly samples candidate subgoals from the training data.
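The goal masking that the Masked ViNT baseline inherits can be illustrated with a minimal sketch. The mechanics here (zeroing the goal token with probability `p_mask`, the helper name `apply_goal_mask`) are assumptions for illustration; in practice the masking is typically realized as an attention mask rather than literal zeroing.

```python
import random

def apply_goal_mask(obs_tokens, goal_token, p_mask=0.5, rng=random.random):
    """Return (tokens fed to the Transformer, whether the goal was masked).

    With probability p_mask the goal token is blanked out, so the policy
    runs goal-agnostic (explore); otherwise it conditions on the goal
    (exploit). Names and mechanics are illustrative assumptions.
    """
    masked = rng() < p_mask
    goal = [0.0] * len(goal_token) if masked else goal_token
    return obs_tokens + [goal], masked

# p_mask=1.0 forces the goal-agnostic (exploration) branch:
tokens, masked = apply_goal_mask([[1.0, 2.0]], [3.0, 4.0], p_mask=1.0)
print(tokens, masked)  # [[1.0, 2.0], [0.0, 0.0]] True
```

Sampling the mask during training is what lets a single network cover both the goal-reaching and undirected-exploration behaviors.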
@@ -231,7 +233,8 @@ In comparison to ViNT, NoMaD presents a new approach to learning exploration and
 The main contribution is architecturally quite simple, but effective.
 The goal-masking effectively turns navigation from a single-task problem into a conditional behavior spectrum, allowing for a more unified end-to-end approach.
 Additionally, using diffusion only for action generation, rather than image generation, greatly reduces the computational cost of running NoMaD.
-Previously in ViNT, exploration vs. exploitation was a behavior encoded within the graph generation and subgoal ranking, but now with NoMaD, the low level collision avoidance and high level planning (exploration vs. subgoal seeking) is defined in one model architecture.
+Previously in ViNT, exploration vs. exploitation was a behavior encoded within the graph generation and subgoal ranking.
+Now with NoMaD, the low-level collision avoidance and high-level planning (exploration vs. subgoal seeking) are defined in one model architecture.
 Additionally, the probability distribution allows for a more fine-grained assignment of which actions are good and bad across the whole action space (e.g. high probability on turning left or right at a T-junction, low probability of going straight and hitting the wall).
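The idea of diffusing in action space rather than image space can be sketched with a simplified DDPM-style denoising loop. Everything here (10 steps, the linear beta schedule, 8 two-dimensional waypoints, the stub noise predictor `eps_theta`) is an illustrative assumption, not NoMaD's actual configuration; the point is that the object being denoised is a tiny waypoint array rather than a full image, which is where the cost saving comes from.

```python
import numpy as np

T = 10                              # diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)  # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(a_t, t, obs_ctx):
    # Stub noise-prediction network; in the real model this would be a
    # network conditioned on the (possibly goal-masked) observation context.
    return 0.1 * a_t

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 2))     # 8 future (x, y) waypoints, pure noise

# Reverse diffusion: iteratively denoise the action sequence.
for t in reversed(range(T)):
    eps = eps_theta(a, t, obs_ctx=None)
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    a = (a - coef * eps) / np.sqrt(alphas[t])
    if t > 0:  # no noise is added at the final step
        a += np.sqrt(betas[t]) * rng.standard_normal(a.shape)

print(a.shape)  # (8, 2)
```

Denoising an 8 x 2 array for a handful of steps is orders of magnitude cheaper than generating subgoal images, which is the contrast drawn with ViNT's subgoal diffusion.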
Some things which could be expanded upon, however, include: