MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation

Carnegie Mellon University, National University of Singapore

CVPR 2025

Comparison of different token scanning methods. (a) Cross Attention acts on all image tokens. (b) Projective Attention obtains anchors with perspective projection and selectively attends to sample tokens surrounding the anchor points. (c) The proposed Grid Token-guided Bidirectional Scanning (GTBS) encodes the local context and the joint spatial sequence at the visual feature and person-keypoint levels.
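As a rough illustration of the perspective projection mentioned in (b), the sketch below projects 3D anchor points into one camera view with a standard pinhole model. It is not taken from the MV-SSM code: the function name `project_points`, the intrinsics `K`, and the extrinsics `R`, `t` are hypothetical values chosen for the example.

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project (N, 3) world points to (N, 2) pixel coordinates (pinhole model)."""
    cam = points_3d @ R.T + t        # world frame -> camera frame
    uvw = cam @ K.T                  # camera frame -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide by depth

# Hypothetical camera: focal length 1000 px, principal point (320, 240),
# located at the world origin and looking down +z.
K = np.array([[1000.0, 0.0, 320.0],
              [0.0, 1000.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)

anchors = np.array([[0.0, 0.0, 2.0],
                    [0.2, -0.1, 4.0]])
pixels = project_points(anchors, K, R, t)
print(pixels)
```

In a multi-view setting the same anchors would be projected once per camera, each with its own `K`, `R`, and `t`, and image tokens would then be sampled around the resulting pixel locations.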
📖 For visual results, check out our project page.

🚀 Updates

🔲 Dec. 15, 2025: MV-SSM code and model weights. Coming soon!
✅ Aug. 31, 2025: We released MV-SSM on arXiv. Check the preprint!
✅ Aug. 12, 2025: MV-SSM is featured on TechXplore! Check out our Blog.
✅ Feb. 26, 2025: MV-SSM accepted at CVPR. Check out our Poster here. See everyone in Nashville!

📖 Abstract

While significant progress has been made in single-view 3D human pose estimation, multi-view 3D human pose estimation remains challenging, particularly in terms of generalizing to new camera configurations. Existing attention-based transformers often struggle to accurately model the spatial arrangement of keypoints, especially in occluded scenarios. Additionally, they tend to overfit specific camera arrangements and visual scenes from training data, resulting in substantial performance drops in new settings. In this study, we introduce a novel Multi-View State Space Modeling framework, named MV-SSM, for robustly estimating 3D human keypoints. We explicitly model the joint spatial sequence at two distinct levels: the feature level from multi-view images and the person keypoint level. We propose a Projective State Space (PSS) block to learn a generalized representation of joint spatial arrangements using state space modeling. Moreover, we modify Mamba's traditional scanning into an effective Grid Token-guided Bidirectional Scanning (GTBS), which is integral to the PSS block. Multiple experiments demonstrate that MV-SSM achieves strong generalization, outperforming state-of-the-art methods: +10.8 on AP25 (+24%) on the challenging three-camera setting in CMU Panoptic, +7.0 on AP25 (+13%) on varying camera arrangements, and +15.3 PCP (+38%) on Campus A1 in cross-dataset evaluations.
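To make the idea of bidirectional scanning concrete, here is a minimal sketch of a linear state space recurrence run over a token sequence in both directions. It is an illustration under simplifying assumptions, not the GTBS used in MV-SSM: the parameters `a`, `b`, `c` are fixed per-channel values (Mamba makes its parameters input-dependent), no grid tokens are involved, and the function names are hypothetical.

```python
import numpy as np

def linear_scan(x, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * h_t over x of shape (T, D)."""
    h = np.zeros(x.shape[1])
    ys = np.empty_like(x)
    for step in range(x.shape[0]):
        h = a * h + b * x[step]  # state update carries context from earlier tokens
        ys[step] = c * h         # per-token readout
    return ys

def bidirectional_scan(x, a, b, c):
    """Sum a forward scan and a backward scan so each token sees both sides."""
    forward = linear_scan(x, a, b, c)
    backward = linear_scan(x[::-1], a, b, c)[::-1]  # scan reversed, then restore order
    return forward + backward

# Toy sequence: 3 tokens with 1 channel each.
tokens = np.ones((3, 1))
out = bidirectional_scan(tokens, a=0.5, b=1.0, c=1.0)
print(out.ravel())
```

A single forward scan lets a token depend only on the tokens before it. Adding the backward pass gives every position context from the whole sequence, which is the motivation for scanning a joint sequence in both directions.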

Star History

License


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Permission is granted for non-commercial research. For commercial use, please reach out to us.

Acknowledgements

Parts of the code have been taken and adapted from the repositories listed below. Please acknowledge and adhere to the licenses of each repository that MV-SSM builds upon.

📑 Citation

If you find our work useful for your project, please consider adding a star to this repo and citing our paper:

    @inproceedings{chharia2025mv,
      title={MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation},
      author={Chharia, Aviral and Gou, Wenbo and Dong, Haoye},
      booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
      pages={11590--11599},
      year={2025}
    }
