Official implementation of "HiERO: understanding the hierarchy of human behavior enhances reasoning on egocentric videos", accepted at ICCV 2025.

sapeirone/HiERO

HiERO: understanding the hierarchy of human behavior enhances reasoning on egocentric videos

Simone Alberto Peirone, Francesca Pistilli, Giuseppe Averta


Welcome to the official repository of our paper "HiERO: understanding the hierarchy of human behavior enhances reasoning on egocentric videos", accepted at ICCV 2025.

📝 Abstract

Human activities are particularly complex and variable, and this makes it challenging for deep learning models to reason about them. However, we note that such variability does have an underlying structure, composed of a hierarchy of patterns of related actions. We argue that such structure can emerge naturally from unscripted videos of human activities, and can be leveraged to better reason about their content. We present HiERO, a weakly-supervised method to enrich video segment features with the corresponding hierarchical activity threads. By aligning video clips with their narrated descriptions, HiERO infers contextual, semantic and temporal reasoning with a hierarchical architecture. We prove the potential of our enriched features with multiple video-text alignment benchmarks (EgoMCQ, EgoNLQ) with minimal additional training, and in zero-shot for procedure learning tasks (EgoProceL and Ego4D Goal-Step). Notably, HiERO achieves state-of-the-art performance in all the benchmarks, and for procedure learning tasks it outperforms fully-supervised methods by a large margin (+12.5% F1 on EgoProceL) in zero-shot. Our results prove the relevance of using knowledge of the hierarchy of human activities for multiple reasoning tasks in egocentric vision.

🔎 Getting started

🏗️ Environment and data setup

First, clone the current repository and create the required directories:

git clone --recursive git@github.com:sapeirone/HiERO.git && cd HiERO
mkdir -p checkpoints data/ego4d/raw/annotations data/ego4d/raw/features

Then, create a Python virtual environment and install all the required dependencies:

python -m venv .env
source .env/bin/activate
pip install -r requirements.txt -f https://data.pyg.org/whl/torch-2.4.0+cu124.html --extra-index-url https://download.pytorch.org/whl/

Finally, download the Ego4d annotations and the pre-extracted features, and copy (or link) them into the data/ego4d/raw directory. Download the EgoClip and EgoMCQ annotations from EgoVLP and copy them into the data/ego4d/raw/annotations directory.

The resulting data/ego4d/raw should be structured as follows:

data/ego4d/raw
          └─── annotations
          │    ├─── v1
          │    │    │ ego4d.json
          │    │    │ ...
          │    ├──── egoclip.csv
          │    └──── egomcq.json
          │
          └─── features
               ├─── omnivore_video_swinl
               │    │ 64b355f3-ef49-4990-8622-9e9eef68b495.pth
               │    │ ...
               │     
               └─── egovlp
                    │ 64b355f3-ef49-4990-8622-9e9eef68b495.pth
                    │ ...
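The layout above can be checked before training with a short script. This is a minimal sketch and not part of the repository: the helper name `missing_entries` is hypothetical, and the list of paths is copied from the tree above. The `features/egovlp` directory is presumably only needed when using the EgoVLP backbone.

```python
from pathlib import Path

# Relative paths copied from the directory tree above (root: data/ego4d/raw).
# Assumption: features/egovlp is only required for the EgoVLP configuration.
EXPECTED = [
    "annotations/v1/ego4d.json",
    "annotations/egoclip.csv",
    "annotations/egomcq.json",
    "features/omnivore_video_swinl",
    "features/egovlp",
]


def missing_entries(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries("data/ego4d/raw")
    if missing:
        print("Missing entries under data/ego4d/raw:")
        for rel in missing:
            print("  " + rel)
    else:
        print("data/ego4d/raw matches the expected layout.")
```

Because the check relies on `Path.exists`, symbolic links to the annotations and features are accepted as long as their targets exist.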

Pretrained components

For HiERO (EgoVLP), we use the frozen text encoder from EgoVLP. Download the weights (egovlp_text.pth, egovlp_txt_proj.pth) and copy them into the pretrained directory:

pretrained
     ├─── egovlp_text.pth
     └─── egovlp_txt_proj.pth

🚀 Training and validation

EgoClip/EgoMCQ

Training and validation on EgoClip/EgoMCQ are implemented in train.py and validate.py. Pre-trained checkpoints are provided in the Model Zoo section.

Training on EgoClip:

python train.py --config-name=omnivore save_to=path/to/ckpt/dir

Validation on EgoMCQ:

python validate.py --config-name=omnivore resume_from=path/to/ckpt/dir/model.pth
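The two commands above can be assembled programmatically, for example to run the same steps for several configurations. The following is a minimal sketch: the helper names `train_command` and `validate_command` are hypothetical, and it assumes, as the validation command above suggests, that the checkpoint is stored as model.pth inside the save_to directory. The script only prints the commands and does not run them.

```python
import shlex
from pathlib import PurePosixPath


def train_command(config_name, ckpt_dir):
    """Build the EgoClip training command shown above as an argument list."""
    return ["python", "train.py", f"--config-name={config_name}", f"save_to={ckpt_dir}"]


def validate_command(config_name, ckpt_dir):
    """Build the EgoMCQ validation command shown above as an argument list."""
    # Assumption: the checkpoint file is named model.pth, as in the command above.
    ckpt_path = PurePosixPath(ckpt_dir) / "model.pth"
    return ["python", "validate.py", f"--config-name={config_name}", f"resume_from={ckpt_path}"]


if __name__ == "__main__":
    for cmd in (
        train_command("omnivore", "path/to/ckpt/dir"),
        validate_command("omnivore", "path/to/ckpt/dir"),
    ):
        # Print a shell-ready line; pass cmd to subprocess.run to execute it.
        print(shlex.join(cmd))
```

Keeping the commands as argument lists avoids shell quoting issues if the checkpoint directory contains spaces.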

EgoProceL

Please refer to egoprocel/README.md.

Ego4d Goal-Step

Please refer to ego4d_goalstep/README.md.

🐘 Model Zoo

Tip

HiERO is backbone-agnostic and can be easily extended to new feature extractors. See here for more!

We provide pretrained checkpoints of HiERO for a set of feature extractors. Omnivore Video Swin-L features are part of the official Ego4d release and can be downloaded following the official docs. We provide pre-extracted features for the remaining backbones according to the following table.

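| Backbone | Config file | Features | Checkpoint | EgoMCQ Inter (%) | EgoMCQ Intra (%) |
|----------|-------------|----------|------------|------------------|------------------|
| Omnivore | omnivore | ego4d docs | hiero_omnivore.pth | 90.1 | 53.4 |
| EgoVLP | egovlp | ego4d_egovlp.zip | hiero_egovlp.pth | 91.6 | 59.6 |
| LaViLa | lavila | ego4d_LaViLa-L.zip | hiero_lavila-l.pth | 94.6 | 64.4 |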
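When working with several downloaded checkpoints, it can help to keep the table above as data, so that the matching config name is looked up from the checkpoint file name. The sketch below is illustrative only: `MODEL_ZOO` and `config_for_checkpoint` are hypothetical names, the values are copied from the table, and storing the files under the checkpoints directory is an assumption.

```python
from pathlib import PurePosixPath

# Values copied from the Model Zoo table above, keyed by config name.
MODEL_ZOO = {
    "omnivore": {"checkpoint": "hiero_omnivore.pth", "egomcq_inter": 90.1, "egomcq_intra": 53.4},
    "egovlp": {"checkpoint": "hiero_egovlp.pth", "egomcq_inter": 91.6, "egomcq_intra": 59.6},
    "lavila": {"checkpoint": "hiero_lavila-l.pth", "egomcq_inter": 94.6, "egomcq_intra": 64.4},
}


def config_for_checkpoint(path):
    """Return the config name whose checkpoint file name matches path."""
    filename = PurePosixPath(path).name
    for config_name, entry in MODEL_ZOO.items():
        if entry["checkpoint"] == filename:
            return config_name
    raise KeyError(f"No Model Zoo entry for checkpoint {filename!r}")


if __name__ == "__main__":
    # Assumption: the checkpoint was downloaded into the checkpoints directory.
    ckpt = "checkpoints/hiero_egovlp.pth"
    config = config_for_checkpoint(ckpt)
    print(f"python validate.py --config-name={config} resume_from={ckpt}")
```

The reported EgoMCQ accuracies are kept next to each entry so that a validation run can be compared against the expected numbers.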

Acknowledgements

This study was carried out within the FAIR - Future Artificial Intelligence Research and received funding from the European Union Next-GenerationEU (PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) – MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 – D.D. 1555 11/10/2022, PE00000013). This manuscript reflects only the authors’ views and opinions, neither the European Union nor the European Commission can be considered responsible for them. We acknowledge the CINECA award under the ISCRA initiative, for the availability of high performance computing resources and support.

This codebase is based on our previous works EgoPack and Hier-EgoPack.

Cite Us

@InProceedings{Peirone_2025_ICCV,
    author    = {Peirone, Simone Alberto and Pistilli, Francesca and Averta, Giuseppe},
    title     = {HiERO: Understanding the Hierarchy of Human Behavior Enhances Reasoning on Egocentric Videos},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {19862-19871}
}
