Commit b72bb6f ("added metric")
1 parent 34e549d commit b72bb6f

5 files changed: +128 additions, -16 deletions

README.md (25 additions, 9 deletions)

@@ -1,18 +1,34 @@
 <p align="center">
-<img src="assets/iccv2025_logo.svg" alt="ICCV 2025" height="70">
+<img src="assets/iccv2025_logo.svg" alt="ICCV 2025" height="55">
 </p>

-# Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction [ICCV 2025 🌺]
+<div align="center">
+<img align="left" height="90" style="margin-left: 20px" src="assets/logo.png" alt="">

+# Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction
+# Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction

-Official implementation of **Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction**, ICCV 2025 🌺
-
-[**Giuseppe Cartella**](https://giuseppecartella.github.io/),
-[**Vittorio Cuculo**](https://www.vcuculo.com),
-[**Alessandro D'Amelio**](https://sites.google.com/view/alessandro-damelio/home),
-[**Marcella Cornia**](https://aimagelab.ing.unimore.it/imagelab/person.asp?idpersona=90),
+[**Giuseppe Cartella**](https://scholar.google.com/citations?hl=en&user=0sJ4VCcAAAAJ),
+[**Vittorio Cuculo**](https://scholar.google.com/citations?hl=en&user=usEfqxoAAAAJ&hl=it&oi=ao),
+[**Alessandro D'Amelio**](https://scholar.google.com/citations?user=chkawtoAAAAJ&hl=en&oi=ao),<br>
+[**Marcella Cornia**](https://scholar.google.com/citations?hl=en&user=DzgmSJEAAAAJ),
 [**Giuseppe Boccignone**](https://scholar.google.com/citations?user=LqM0uJwAAAAJ&hl),
-[**Rita Cucchiara**](https://www.ritacucchiara.it/)
+[**Rita Cucchiara**](https://scholar.google.com/citations?hl=en&user=OM3sZEoAAAAJ)
+
+Official implementation of "Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction", ICCV 2025 🌺
+
+</div>
+
+## Overview
+
+<p align="center">
+<img src="assets/figure.jpg">
+</p>
+
+>**Abstract**: <br>
+> Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, pushing forward gaze prediction research.
+
+## Code coming soon

 ## Citation

assets/metric.png (38.1 KB, binary)

assets/model.png (122 KB, binary)

index.html (66 additions, 6 deletions)

@@ -143,7 +143,7 @@ <h2 class="subtitle has-text-centered">
 explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, enabling
 the generation of diverse yet plausible gaze trajectories.
 </h2>
-<img src="assets/teaser.png" alt="ScanDiff Teaser" width="70%">
+<img src="assets/teaser.png" alt="ScanDiff Teaser" width="100%">
 </div>
 </div>
 </section>
@@ -188,16 +188,49 @@ <h2 class="title is-3">Abstract</h2>
 <div class="column is-full-width">
 <h2 class="title is-3 has-text-centered">Method Overview</h2>
 <div class="content has-text-centered">
-<img src="assets/model.png" alt="ScanDiff Method Overview" width="80%">
+<img src="assets/model.png" alt="ScanDiff Method Overview" width="100%"><br><br>
+<p class="has-text-justified">
+<span class="scandiff">ScanDiff</span> is based on a unified architecture combining Diffusion Models with Transformers.
+It represents the first diffusion-based approach for scanpath prediction on natural images. Textual conditioning allows <span class="scandiff">ScanDiff</span> to work both in free-viewing and task-driven scenarios,
+and a dedicated length prediction module is introduced to handle variable-length scanpaths.
+
+<h3>How does the model work?</h3>
+
+<!-- begin numbered list -->
+<ol class="custom-steps has-text-left mt-4">
+<li><strong>Scanpath embedding:</strong> The scanpath is embedded into the initial uncorrupted latent variable
+<span class="math-symbol">z<sub>0</sub></span>.
+</li>
+<li><strong>Forward diffusion:</strong> Gaussian noise is added to the embedded sequence
+<span class="math-symbol">z<sub>0</sub></span> over <span class="math-symbol">T</span> timesteps.
+</li>
+<li><strong>Visual encoding:</strong> The stimulus <span class="math-symbol">I</span> is encoded with a Transformer-based backbone (DINOv2).
+</li>
+<li><strong>Task encoding:</strong> A textual encoder (CLIP) processes the viewing task <span class="math-symbol">c</span>.
+</li>
+<li><strong>Multimodal fusion:</strong> Visual and textual features are projected into a joint multimodal embedding space.
+</li>
+<li><strong>Denoising:</strong> A Transformer encoder refines the noised scanpath embedding
+<span class="math-symbol">z<sub>t</sub></span>, conditioned on the multimodal features.
+</li>
+<li><strong>Reconstruction:</strong> A three-layer MLP
+<span class="math-symbol">γ<sub>θ</sub></span> reconstructs the scanpath, and a length prediction module
+<span class="math-symbol"><sub>θ</sub></span> estimates its length.
+</li>
+</ol>
+
+</p>
 </div>
 </div>
 </div>

+
+
 <div class="columns is-centered">
 <div class="column is-four-fifths">
 <h2 class="title is-3 has-text-centered">Qualitative Results</h2>
 <div class="content has-text-centered">
-<div class="carousel-container" style="width: 90%; margin: auto;">
+<div class="carousel-container" style="width: 100%; margin: auto;">
 <div id="difference-detection-carousel" class="carousel results-carousel">
 <div class="item">
 <img src="./assets/qualitatives/example1.png"
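The numbered pipeline added in the Method Overview hunk above centers on corrupting the scanpath embedding with Gaussian noise (steps 1 and 2) and then learning to invert that corruption. As an editorial aside (not part of the committed diff), here is a minimal sketch of the closed-form forward process; the linear beta schedule and all dimensions are assumptions, since the commit specifies neither:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the embedding width and number of timesteps are not given here.
SEQ_LEN, EMB_DIM, T = 16, 128, 1000

# A linear beta schedule, a common DDPM default (ScanDiff's actual schedule is unspecified).
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)  # cumulative signal-retention coefficients

def forward_diffuse(z0, t, rng):
    """Step 2: closed-form corruption z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps

z0 = rng.standard_normal((SEQ_LEN, EMB_DIM))  # step 1: embedded scanpath
z_mid = forward_diffuse(z0, T // 2, rng)      # partially corrupted embedding
z_T = forward_diffuse(z0, T - 1, rng)         # near-pure noise at the final step
```

By the last timestep the cumulative signal coefficient is tiny, so `z_T` is statistically indistinguishable from Gaussian noise; the denoising Transformer of step 6 learns to reverse this trajectory conditioned on the fused DINOv2/CLIP features.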
@@ -250,10 +283,31 @@ <h2 class="title is-3 has-text-centered">Qualitative Results</h2>
 </div>

 <div class="columns is-centered">
-<div class="column is-four-fifths">
+<div class="column is-full-width">
 <h2 class="title is-3 has-text-centered">Scanpath Variability Analysis</h2>
-<div class="content has-text-centered">
-<div class="carousel-container" style="width: 90%; margin: auto;">
+<div class="content has-text-justified">
+<p>
+Human visual exploration is inherently variable. Individuals perceive the same stimulus in different manners depending on factors such as attention,
+context, and cognitive processes. Capturing such variability is essential for developing models that accurately reflect the diverse range of
+human traits. However, existing scanpath prediction models tend to align closely with the statistical mean of human gaze behavior. While this approach may improve performance
+on traditional evaluation metrics, it fails to reflect the natural variability in human visual attention. Commonly used
+metrics such as MM, SM, and SS tend to reward predictions that closely match an aggregated ground truth, thus
+favoring models that generate a single representative scanpath. Indeed, the
+average similarity between ground-truth scanpaths can be smaller than the average similarity between generated scanpaths when the latter merely reproduce an average behavior.
+<br><br>
+We propose the <strong>Diversity-aware Sequence Score (DSS)</strong>, a new metric that extends standard sequence similarity measures by incorporating
+a term that penalizes excessive similarity among the generated scanpaths when human scanpaths do not exhibit such uniformity. Given a set of generated scanpaths <span class="math-symbol">s<sub>g</sub></span> and
+corresponding human scanpaths <span class="math-symbol">s<sub>h</sub></span> for a specific visual stimulus, DSS is computed as:
+</p>
+
+<div class="has-text-centered my-5">
+<figure class="image is-inline-block">
+<img src="assets/metric.png" alt="metric" style="width: 40%;">
+</figure>
+</div>
+
+
+<div class="carousel-container" style="width: 100%; margin: auto;">
 <div id="difference-detection-carousel" class="carousel results-carousel">
 <div class="item">
 <img src="./assets/variability_imgs/example1.png"
@@ -273,6 +327,12 @@ <h2 class="title is-3 has-text-centered">Scanpath Variability Analysis</h2>
 </div>
 </div>
 </div>
+<br>
+This is a first attempt to quantitatively assess the ability of a model to generate diverse, yet human-like, gaze trajectories.
+<span class="scandiff">ScanDiff</span> achieves the best overall performance across all settings and datasets, highlighting its
+effectiveness in predicting accurate eye movement trajectories that align well with human scanpath variability.
+Goal-oriented scanpaths tend to be more deterministic, particularly in the target-present setting, and are generally shorter
+than those in free-viewing scenarios. Nevertheless, our model effectively captures even this subtler variability in human gaze behavior.
 </div>
 </div>
 </div>
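The DSS formula itself is only present as the assets/metric.png figure, which this page does not transcribe. Purely to illustrate the idea the added paragraph describes (reward similarity to human scanpaths, penalize a mismatch between generated and human within-set diversity), here is a hedged sketch; the similarity function, the penalty form, and the `lam` weight are all assumptions, not the paper's published metric:

```python
import itertools
import numpy as np

def sim(a, b):
    """Hypothetical scanpath similarity in (0, 1]; a stand-in for the
    sequence-score (SS) style measures named in the text."""
    n = min(len(a), len(b))
    d = float(np.linalg.norm(a[:n] - b[:n], axis=1).mean())
    return 1.0 / (1.0 + d)

def mean_pairwise(paths):
    """Average similarity over all unordered pairs within one set of scanpaths."""
    return float(np.mean([sim(p, q) for p, q in itertools.combinations(paths, 2)]))

def dss_sketch(generated, human, lam=1.0):
    """Diversity-aware score sketch: cross-set similarity minus a penalty on
    the gap between generated and human intra-set similarity.
    NOT the published DSS formula (see assets/metric.png)."""
    cross = float(np.mean([sim(g, h) for g in generated for h in human]))
    diversity_gap = abs(mean_pairwise(generated) - mean_pairwise(human))
    return cross - lam * diversity_gap
```

Under this sketch, a model that collapses onto a single representative scanpath earns a high cross-set term but pays the full diversity penalty, which is exactly the failure mode of mean-seeking predictors that the paragraph criticizes.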

static/css/index.css (37 additions, 1 deletion)

@@ -223,4 +223,40 @@ body {
 max-height: 100%;
 max-width: 100%;
 object-fit: contain; /* Keep aspect ratio */
-}
+}
+
+.custom-steps {
+counter-reset: step-counter;
+list-style: none;
+padding-left: 1em;
+}
+
+.custom-steps li {
+counter-increment: step-counter;
+margin-bottom: 1em;
+position: relative;
+padding-left: 2em;
+list-style: none;
+}
+
+.custom-steps li::before {
+content: counter(step-counter);
+position: absolute;
+left: 0;
+top: 0.1em;
+background: #9e1e63; /* accent color */
+color: white;
+border-radius: 50%;
+width: 1.5em;
+height: 1.5em;
+text-align: center;
+line-height: 1.5em;
+font-weight: bold;
+}
+
+.math-symbol {
+font-family: 'Georgia', serif;
+font-size: 1.05em;
+font-weight: bold; /* bold to set math symbols off from prose */
+}
