A comprehensive survey on Multimodal Models in 3D
- Classification
- Detection
- Segmentation
- Tracking
- Localization
- Retrieval
- Scene Understanding
- Editing and Manipulation
- Generation
- Grounding
- Captioning
- Pose Estimation
- Question Answering
- Pretraining
- Matching
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| ClipFace: Text-guided Editing of Textured 3D Morphable Models | nan | nan | 2023 | |
| CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout | nan | nan | 2023 | |
| Volumetric Disentanglement for 3D Scene Manipulation | nan | nan | 2022 | |
| Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions | nan | nan | 2023 | |
| LADIS: Language Disentanglement for 3D Shape Editing | nan | nan | 2022 | |
| Local 3D Editing via 3D Distillation of CLIP Knowledge | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| 3D Multi-Object Tracking Using Graph Neural Networks with Cross-Edge Modality Attention | nan | nan | 2022 | |
| LATTE: LAnguage Trajectory TransformEr | nan | nan | 2022 | |
| 3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking | nan | nan | 2023 | |
| EagerMOT: 3D Multi-Object Tracking via Sensor Fusion | nan | nan | 2021 | |
| MMF-Track: Multi-modal Multi-level Fusion for 3D Single Object Tracking | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| Self-Supervised Learning for Multimodal Non-Rigid 3D Shape Matching | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| Complementary Pseudo Multimodal Feature for Point Cloud Anomaly Detection | nan | nan | 2023 | |
| EasyNet: An Easy Network for 3D Industrial Anomaly Detection | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding | nan | nan | 2022 | |
| Learning Point-Language Hierarchical Alignment for 3D Visual Grounding | nan | nan | 2022 | |
| ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance | nan | nan | 2023 | |
| NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations | nan | nan | 2023 | |
| Multi-View Transformer for 3D Visual Grounding | nan | nan | 2022 | |
| 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection | nan | nan | 2022 | |
| 3D VR Sketch Guided 3D Shape Prototyping and Exploration | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| AGG-Net: Attention Guided Gated-convolutional Network for Depth Image Completion | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| TeSTNeRF: Text-Driven 3D Style Transfer via Cross-Modal Learning | nan | nan | 2023 | |
| TANGO: Text-driven Photorealistic and Robust 3D Stylization via Lighting Decomposition | nan | nan | 2022 | |
| HyperStyle3D: Text-Guided 3D Portrait Stylization via Hypernetworks | nan | nan | 2023 | |
| CLIP3Dstyler: Language Guided 3D Arbitrary Neural Style Transfer | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| Towards Label-free Scene Understanding by Vision Foundation Models | nan | nan | 2023 | |
| CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP | nan | nan | 2023 | |
| Semantics-guided Transformer-based Sensor Fusion for Improved Waypoint Prediction | nan | nan | 2023 | |
| Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding | nan | nan | 2023 | |
| PLA: Language-Driven Open-Vocabulary 3D Scene Understanding | nan | nan | 2023 | |
| Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models | nan | nan | 2022 | |
| OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation | nan | nan | 2023 | |
| TextDeformer: Geometry Manipulation using Text Guidance | nan | nan | 2023 | |
| Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| Democratising 2D Sketch to 3D Shape Retrieval Through Pivoting | nan | nan | 2023 | |
| RONO: Robust Discriminative Learning With Noisy Labels for 2D-3D Cross-Modal Retrieval | nan | nan | 2023 | |
| TextANIMAR: Text-based 3D Animal Fine-Grained Retrieval | nan | nan | 2023 | |
| SCA-PVNet: Self-and-Cross Attention Based Aggregation of Point Cloud and Multi-View for 3D Object Retrieval | nan | nan | 2023 | |
| OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data | nan | nan | 2023 | |
| Towards 3D VR-Sketch to 3D Shape Retrieval | nan | nan | 2022 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| Multimodal Brain Disease Classification with Functional Interaction Learning from Single fMRI Volume | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| 3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions | nan | nan | 2023 | |
| UnLoc: A Universal Localization Method for Autonomous Vehicles using LiDAR, Radar and/or Camera Input | nan | nan | 2023 | |
| WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| 3D Spatial Multimodal Knowledge Accumulation for Scene Graph Prediction in Point Cloud | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| Towards Zero-Shot Scale-Aware Monocular Depth Estimation | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| ImageBind-LLM: Multi-modality Instruction Tuning | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| LiCamGait: Gait Recognition in the Wild by Using LiDAR and Camera Multi-modal Visual Sensors | nan | nan | 2022 | |
| LATFormer: Locality-Aware Point-View Fusion Transformer for 3D Shape Recognition | nan | nan | 2023 | |
| Cross-Modal Learning with 3D Deformable Attention for Action Recognition | nan | nan | 2023 | |
| FER-former: Multi-modal Transformer for Facial Expression Recognition | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| Speech4Mesh: Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial Animation | nan | nan | 2023 | |
| Zero-1-to-3: Zero-shot One Image to 3D Object | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| Style-aware Augmented Virtuality Embeddings (SAVE) | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| M$^{3}$3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding | nan | nan | 2023 |
| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| Scalable 3D Captioning with Pretrained Models | nan | nan | 2023 |