<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="description"
content="STaR: Scalable Task-Conditioned Retrieval for Long-Horizon Multimodal Robot Memory">
<meta name="keywords" content="Vision-Language-Navigation, Open Vocabulary Object Detector">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>STaR: Scalable Task-Conditioned Retrieval for Long-Horizon Multimodal Robot Memory</title>
<!-- Google Tag Manager -->
<script>(function (w, d, s, l, i) {
w[l] = w[l] || []; w[l].push({
'gtm.start':
new Date().getTime(), event: 'gtm.js'
}); var f = d.getElementsByTagName(s)[0],
j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : ''; j.async = true; j.src =
'https://www.googletagmanager.com/gtm.js?id=' + i + dl; f.parentNode.insertBefore(j, f);
})(window, document, 'script', 'dataLayer', 'GTM-TSPQB2LZ');</script>
<!-- End Google Tag Manager -->
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<link rel="icon" href="./static/images/TRAIL_BLACK_ICON.png">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<script src="./static/js/index.js"></script>
</head>
<body>
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-TSPQB2LZ" height="0" width="0"
style="display:none;visibility:hidden"></iframe></noscript>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">STaR: Scalable Task-Conditioned Retrieval for Long-Horizon Multimodal Robot Memory
</h1>
<div class="is-size-5 publication-authors">
<span class="author-block">
<a href="https://www.trailab.utias.utoronto.ca/mingfeng-yuan">
Mingfeng Yuan<sup>1</sup>
</a>,
</span>
<span class="author-block">
Hao Zhang<sup>2</sup>,
</span>
<span class="author-block">
Mahan Mohammadi<sup>1</sup>,
</span>
<span class="author-block">
Runhao Li<sup>1</sup>,
</span>
<span class="author-block">
<a href="https://lassonde.yorku.ca/users/jjshan">
Jinjun Shan<sup>2</sup>
</a>,
</span>
<span class="author-block">
<a href="https://www.trailab.utias.utoronto.ca/steven-waslander">
Steven L. Waslander<sup>1</sup>
</a>
</span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block">University of Toronto<sup>1</sup>, York University<sup>2</sup></span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<!-- PDF Link. -->
<span class="link-block">
<a href="https://trailab.github.io/STaR-website/"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
</span>
<span>Paper</span>
</a>
</span>
<span class="link-block">
<a href="https://arxiv.org/abs/2602.09255" class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="ai ai-arxiv"></i>
</span>
<span>arXiv</span>
</a>
</span>
<!-- Code Link. -->
<span class="link-block">
<a href="https://github.com/TRAILab/STaR" class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Key Features</h2>
<div class="content has-text-justified">
<ul>
<li>
<strong>Long-Horizon Multimodal Robot Memory (OmniMem).</strong>
We introduce a unified, task-agnostic memory that integrates
3D primitives, temporally aligned video
captions (dynamic scene descriptions), and keyframe visual memory,
enabling joint spatial, temporal, and semantic reasoning over
long-duration robot memory.
</li>
<li>
<strong>Scalable Task-Conditioned Retrieval via Information Bottleneck (STaR).</strong>
STaR applies the Information Bottleneck principle to distill a compact,
non-redundant, and information-rich subset of memories tailored to a
given task, avoiding the inefficiency and hallucination risks of
naïve Retrieval-Augmented Generation (RAG); an illustrative sketch of this
selection idea follows the list.
</li>
<li>
<strong>Agentic RAG for Planning, Retrieval, and Reasoning.</strong>
We propose an agentic workflow in which an MLLM autonomously plans
search strategies, issues memory retrieval calls, and reasons over
STaR-distilled evidence, enabling precise answers and object-goal navigation.
</li>
<li>
<strong>Extensive Evaluation and Real-Robot Deployment.</strong>
STaR is evaluated on long-horizon navigation VQA benchmarks,
including NaVQA (campus-scale indoor/outdoor scenes) and WH-VQA,
a warehouse benchmark built in Isaac Sim that contains many
visually similar objects, and is further validated through
end-to-end deployment on a real Husky mobile robot.
</ul>
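<p>
As an illustration of the Information Bottleneck selection above, the minimal
Python sketch below greedily picks memories that score high on task relevance
while penalizing similarity to memories already selected (in the spirit of
maximal marginal relevance). It is not the STaR implementation: the function
and parameter names are hypothetical, and embeddings are assumed to be
L2-normalized vectors.
</p>
<pre><code># Hypothetical sketch, not the STaR codebase: greedy relevance-vs-redundancy
# selection as a stand-in for Information Bottleneck distillation.
import numpy as np

def select_evidence(query_emb, memory_embs, k=5, beta=0.5):
    """Pick k memories that are task-relevant but mutually non-redundant."""
    k = min(k, len(memory_embs))
    relevance = memory_embs @ query_emb  # cosine scores for unit vectors
    selected = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(len(memory_embs)):
            if i in selected:
                continue
            # redundancy: highest similarity to anything already kept
            redundancy = max((float(memory_embs[i] @ memory_embs[j])
                              for j in selected), default=0.0)
            score = relevance[i] - beta * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
</code></pre>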
</div>
</div>
</div>
</div>
</section>
<section class="hero teaser">
<div class="container is-max-desktop">
<div class="hero-body">
<div class="has-text-centered">
<h2 class="title is-3">🎥 STaR Demo Videos</h2>
<div class="columns is-multiline is-centered">
<!-- Video 1 -->
<div class="column is-half has-text-centered">
<iframe
width="100%"
height="315"
src="https://www.youtube.com/embed/uFEB3nVWhBg"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
allowfullscreen>
</iframe>
<p class="has-text-weight-semibold" style="margin-top: 0.5em;">
Isaac Sim (Warehouse)
</p>
</div>
<!-- Video 2 -->
<div class="column is-half has-text-centered">
<iframe
width="100%"
height="315"
src="https://www.youtube.com/embed/fkfjyogRCpk"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
allowfullscreen>
</iframe>
<p class="has-text-weight-semibold" style="margin-top: 0.5em;">
Real Robot Deployment
</p>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="hero teaser">
<div class="container is-max-desktop">
<div class="hero-body">
<div class="has-text-centered">
<h2 class="title is-3" style="margin-bottom: 1em;">🧠 Method Overview</h2>
</div>
<div class="has-text-centered" style="margin-top: 1em;">
<img src="./static/images/F1.jpg" style="width: 95%;" alt="OpenNav Architecture">
</div>
<h2 class="subtitle" style="text-align: justify;">
<strong>STaR System Overview</strong>. Our framework consists of three stages. (Left) Memory construction: the robot records RGB and posed depth data to build a multimodal memory composed of three complementary databases (DBs), namely video captions, 3D primitives, and visual keyframes, which jointly form OmniMem. (Middle) User query and reasoning: given text or multimodal queries, an agentic planner (MLLM) retrieves task-relevant memories through an Information Bottleneck, performs contextual reasoning, and outputs structured answers (location, time, or description). (Right) Evaluation: we evaluate STaR on both the NaVQA dataset (campus) and the WH-VQA dataset (warehouse), which cover spatial, temporal, and descriptive question types across short-, medium-, and long-term memory settings. The evaluation examines three key capabilities: long-horizon cross-modal memory construction, task-conditioned memory retrieval, and contextual reasoning. We also validate multimodal query and navigation tasks in a warehouse simulated in Isaac Sim.
</h2>
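<p style="text-align: justify;">
For concreteness, a hypothetical Python layout of the three OmniMem databases
described above might look as follows: time-stamped video captions, 3D
primitives with metric poses, and keyframe records pointing at stored images.
Class and field names here are illustrative assumptions, not the released
schema.
</p>
<pre><code># Illustrative data layout only; class and field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class CaptionEntry:   # video-caption DB: dynamic scene descriptions
    t_start: float
    t_end: float
    text: str

@dataclass
class Primitive3D:    # 3D-primitive DB: metric object evidence
    label: str
    center: tuple     # (x, y, z) in the map frame
    extent: tuple     # bounding-box size (dx, dy, dz)
    t_seen: float

@dataclass
class Keyframe:       # keyframe DB: raw images for fine-grained detail
    t: float
    image_path: str

@dataclass
class OmniMem:        # unified, task-agnostic memory
    captions: list = field(default_factory=list)
    primitives: list = field(default_factory=list)
    keyframes: list = field(default_factory=list)

mem = OmniMem()
mem.captions.append(CaptionEntry(12.0, 18.5, "a forklift passes the pallet racks"))
mem.primitives.append(Primitive3D("pallet", (4.2, -1.0, 0.4), (1.2, 0.8, 0.9), 15.1))
mem.keyframes.append(Keyframe(15.1, "keyframes/000314.jpg"))
</code></pre>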
</div>
</div>
</section>
<section class="hero teaser">
<div class="container is-max-desktop">
<div class="hero-body">
<div class="has-text-centered">
<img src="./static/images/F2.jpg" style="width: 95%;" alt="OVPS Architecture">
</div>
<h2 class="subtitle" style="text-align: justify;">
<strong>Task-conditioned retrieval and contextual reasoning.</strong>
Given an open-ended user query, STaR embeds task cues and queries the memory
database to retrieve relevant video captions with timestamps and associated
detected objects (caption-induced primitives). These retrieved cues define a
task-specific working set of 3D primitives, over which STaR applies an
Information Bottleneck–based clustering to merge neighboring primitives into
compact, task-relevant groups. Captions are then grouped by cluster, and a
single representative caption is selected from each group to form a
non-redundant evidence set. When necessary, the robot further loads keyframe
images to resolve fine-grained visual details, enabling contextual reasoning
and the generation of actionable outputs, such as object locations, shelf
indices, and navigation targets for downstream tasks.
</h2>
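<p style="text-align: justify;">
The flow above can be sketched end to end in Python. This is a reading aid
under stated assumptions, not the STaR API: embed(), caption_db, and
prims_for() are hypothetical helpers, and a simple distance-threshold merge
stands in for the Information Bottleneck–based clustering.
</p>
<pre><code># Hypothetical pipeline sketch; names and the clustering rule are assumptions.
import numpy as np

def retrieve_evidence(query, embed, caption_db, prims_for,
                      top_k=20, merge_dist=1.5):
    """Task cues -> captions -> primitive clusters -> non-redundant evidence."""
    q = embed(query)  # embed the task cues from the user query
    ranked = sorted(caption_db, key=lambda c: -float(embed(c.text) @ q))
    captions = ranked[:top_k]  # relevant captions with timestamps

    # caption-induced primitives form the task-specific working set
    pairs = [(p, c) for c in captions for p in prims_for(c)]

    # greedily merge neighboring primitives into compact groups
    clusters = []
    for p, c in pairs:
        for cl in clusters:
            anchor = np.asarray(cl[0][0].center)
            if merge_dist > np.linalg.norm(np.asarray(p.center) - anchor):
                cl.append((p, c))
                break
        else:
            clusters.append([(p, c)])

    # one representative caption per cluster, ready for MLLM reasoning
    return [max(cl, key=lambda pc: float(embed(pc[1].text) @ q))[1]
            for cl in clusters]
</code></pre>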
</div>
</div>
</section>
<section class="hero teaser">
<div class="container is-max-desktop">
<div class="hero-body">
<div class="has-text-centered">
<h2 class="title is-3" style="margin-bottom: 1em;">🚘 On-Device Deployment</h2>
</div>
<div class="has-text-centered" style="margin-top: 1em;">
<img src="./static/images/F3.png" style="width: 95%;" alt="OpenNav Demo">
</div>
<h2 class="subtitle" style="text-align: justify;">
STaR deployed on a Husky robot for indoor and outdoor experiments, supporting both text-based and multimodal queries.
</h2>
</div>
</div>
</section>
<section class="section" id="Citation">
<div class="container is-max-desktop content">
<h2 class="title">Citation</h2>
<p>If you find this work helpful, please consider citing:</p>
<pre><code>@article{Yuan2026STaR,
  title={STaR: Scalable Task-Conditioned Retrieval for Long-Horizon Multimodal Robot Memory},
  author={Mingfeng Yuan and Hao Zhang and Mahan Mohammadi and Runhao Li and Jinjun Shan and Steven L. Waslander},
  year={2026},
  eprint={2602.09255},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2602.09255},
}
</code></pre>
</div>
</section>
<footer class="footer">
<div class="container">
<div class="content has-text-centered">
<a class="icon-link" href="https://github.com/TRAILab" class="external-link" disabled>
<i class="fab fa-github"></i>
</a>
</div>
<div class="columns is-centered">
<div class="column is-8">
<div class="content">
<p>
We would like to express our gratitude to the open-source projects and their contributors, especially <a href="https://github.com/NVIDIA-AI-IOT/remembr">ReMEmbR</a> and <a href="https://github.com/BIT-DYN/OpenGraph">OpenGraph</a>. Their valuable work has greatly contributed to the development of our codebase.
</p>
<p>
Thank you to the authors of <a href="https://github.com/nerfies/nerfies.github.io/tree/main">Nerfies</a> for the website template.
</p>
<p>
This website is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</p>
</div>
</div>
</div>
</div>
</footer>
</body>
</html>