A mixed-reality indoor navigation system generating human-like, landmark-grounded navigation instructions from a single RGB image and natural-language query
Installation · Demo · Pipeline · Evaluation · Contributors
Indoor navigation differs fundamentally from outdoor navigation: GPS is unreliable, maps are incomplete, and humans rely heavily on local visual landmarks rather than metric distances. This project, developed as part of the Mixed Reality course (ETH Zurich / University of Zurich), presents a practical, deployable indoor navigation system that generates instructions the way people naturally do:
"Walk past the sofa, then turn right at the stairs."
Comparison between baseline navigation instructions and our landmark-based, human-oriented guidance. The baseline approach (left) relies on metric and abstract descriptions, resulting in verbose and less intuitive instructions. In contrast, our method (right) explicitly references visible landmarks, producing concise and human-interpretable guidance.
- Single-image user localization in reconstructed indoor environments
- Geometric path planning with collision-free trajectories
- Semantic landmark extraction from perceptual observations
- Concise, human-oriented navigation instructions
- Lightweight on-device language model (LoRA fine-tuned)
- Web-based mobile interface for real-world deployment
- Quantitative evaluation + user study
Demo video: mr_final_demo.mp4

End-to-end demonstration of the landmark-based navigation system in action.
Overview of the proposed mixed-reality navigation system. Given a user image and a natural-language query through our web app, the system localizes the user, plans a path in Habitat-Sim, extracts semantic landmarks along the route, and generates grounded navigation instructions.
- User Localization: Estimate 6-DoF pose from a single RGB image using image-based localization against a pre-built 3D reconstruction
- Path Planning: Compute collision-free trajectories using Habitat-Sim's planning module (see the sketch after this list)
- Landmark Extraction: Densify the path, capture RGB/depth/semantic observations, and cluster visible objects into persistent semantic landmarks
- Instruction Generation: Generate concise, fluent instructions grounded in visible landmarks using a fine-tuned lightweight language model
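The path-planning step builds on Habitat-Sim's navmesh-based pathfinding. Below is a minimal sketch of that step; the scene path and the randomly sampled start/goal points are illustrative placeholders, not the project's actual configuration:

```python
# Minimal sketch of collision-free path planning with Habitat-Sim's
# PathFinder. Scene path and start/goal points are placeholders.
import habitat_sim

# Load a scene; a navmesh is typically baked alongside the scene asset.
backend_cfg = habitat_sim.SimulatorConfiguration()
backend_cfg.scene_id = "data/scene_datasets/example/example.glb"  # placeholder
agent_cfg = habitat_sim.agent.AgentConfiguration()
sim = habitat_sim.Simulator(habitat_sim.Configuration(backend_cfg, [agent_cfg]))

# Query a collision-free shortest path between two navigable points.
path = habitat_sim.ShortestPath()
path.requested_start = sim.pathfinder.get_random_navigable_point()
path.requested_end = sim.pathfinder.get_random_navigable_point()
if sim.pathfinder.find_path(path):
    # path.points is a list of 3D waypoints; the pipeline densifies such a
    # path before capturing observations for landmark extraction.
    print(f"Geodesic distance: {path.geodesic_distance:.2f} m, "
          f"{len(path.points)} waypoints")
```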
Experimental setup in the simulated house environment. The Habitat-Sim rendering is shown on the left, while the graphical interface for issuing navigation queries and visualizing instructions is shown on the right.
This controlled environment allows us to:
- Evaluate instruction generation quality with known start and goal positions
- Compare different language models and landmark grounding strategies
- Isolate instruction quality from localization errors
This real-world building evaluation includes:
- Full pipeline evaluation with real image-based localization
- Web-based user interaction for practical deployment testing
- User studies with actual navigation queries
Detailed installation instructions are available in SETUP.md. We provide two installation options:
Quick setup for exploring HM3D house environments with our fine-tuned model:
```bash
git clone https://github.com/rzninvo/CNSG.git
cd CNSG
bash scripts/install_hm3d.sh
```

Run the system:

```bash
conda activate habitat-default
cd habitat-sim
python examples/mr_viewer.py --backend=local --finetuned-model=True
```

Full installation with semantic segmentation for the ETH HG academic building:
```bash
git clone https://github.com/rzninvo/CNSG.git
cd CNSG
bash scripts/install.sh
```

Run the system:

```bash
conda activate habitat-source
cd habitat-sim
python examples/mr_viewer.py --scene ./data/scene_datasets/HGE/HGE.basis.glb --dataset data/scene_datasets/HGE.scene_dataset_config.json
```

See SETUP.md for complete installation instructions, environment setup, and troubleshooting.
The system includes a mobile-friendly web interface that enables real-world deployment and user interaction. Users can capture images of their surroundings and submit natural-language navigation queries through an intuitive interface.
The system can be run in two modes: GUI mode (default) or server mode (for web app integration).
Start the navigation server to handle web app requests:
```bash
# For HM3D environment
conda activate habitat-default
cd habitat-sim
python examples/mr_viewer.py --server-mode --backend=local --finetuned-model=True
```

The server will start on http://localhost:5000 and provide REST API endpoints for:
- Image-based localization
- Navigation instruction generation
- Path planning and visualization
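As an illustration, a client request to the backend might look like the sketch below. The route name (`/api/navigate`) and payload fields are assumptions for illustration only; the actual API surface is defined by the server in mr_viewer.py:

```python
# Hypothetical client call; the endpoint name and JSON fields are
# placeholders, not the server's documented API.
import base64
import requests

with open("surroundings.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:5000/api/navigate",  # placeholder route
    json={
        "image": image_b64,  # single RGB image used for localization
        "query": "How do I get to the kitchen?",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # e.g. generated instruction text and planned path
```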
In a separate terminal, start the web application:
```bash
cd webapp
npm install
npm run dev
```

The web app will be available at http://localhost:8080.
To access the web app from a mobile device:

1. Install and authenticate ngrok:

   ```bash
   ngrok config add-authtoken <YOUR_NGROK_TOKEN>
   ```

2. Expose the frontend:

   ```bash
   ngrok http 8080
   ```

3. Expose the backend (in another terminal):

   ```bash
   ngrok http 5000
   ```

4. Update the frontend configuration to use the ngrok backend URL (see the sketch after this list)

5. Access the ngrok frontend URL from your mobile device
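How the backend URL is wired into the frontend depends on the web app's build setup. A common pattern for a dev-server-based frontend is an environment variable, sketched here; the variable name `VITE_API_BASE_URL` is an assumption, not the project's confirmed setting, so check the webapp source for the actual configuration:

```bash
# Hypothetical: if the frontend reads the backend URL from an env var
# (name assumed), point it at the ngrok tunnel before starting the dev server.
VITE_API_BASE_URL="https://<your-backend-subdomain>.ngrok-free.app" npm run dev
```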
This setup enables real-world testing and user studies with actual mobile navigation queries.
The system is evaluated across two complementary settings:
Simulated House Environment (HM3D)
- Controlled evaluation with known start and goal positions
- Isolates instruction generation quality
- Enables systematic comparison of language models
Real Building Environment (HG Academic Building)
- End-to-end pipeline evaluation
- Real image-based localization
- User studies with actual navigation queries
All instructions are evaluated on a 5-point Likert scale across three dimensions:
- Reference Object Quality: Accuracy and usefulness of landmark references
- Spatial & Directional Correctness: Accuracy of spatial relations and directions
- Naturalness of Language: Fluency and human-likeness of instructions
Our evaluation demonstrates that landmark-grounded instruction generation significantly outperforms baseline approaches across both evaluation environments.
Evaluation in the simulated house environment. Average scores for the three instruction quality metrics, comparing the different configurations. Higher scores indicate better performance.
Evaluation in the HG building. Average scores for the three instruction quality metrics, comparing the different configurations. Higher scores indicate better performance.
Key Findings:
- Landmark grounding provides substantial improvements over baseline approaches across all metrics
- GPT-4 and our fine-tuned local model achieve comparable performance, with GPT-4 showing slight advantages in spatial correctness
- The local baseline model (without fine-tuning) already outperforms the non-landmark baseline, demonstrating the value of landmark-based reasoning
- Our fine-tuned lightweight model achieves near-GPT-4 performance while enabling fully on-device inference
- All landmark-grounded approaches show consistent improvements in reference object quality and language naturalness
End-to-end latency profile:
- Visual localization: Primary bottleneck (~8-11s)
- Path planning: Minimal overhead (<0.5s)
- Landmark extraction: Real-time (~0.3s)
- Instruction generation: Negligible with local model (<1s)
The system maintains interactive latency suitable for real-world deployment while ensuring user privacy through local inference.
Our instruction generation leverages a Phi-3 model fine-tuned with LoRA, which provides:
- ✅ No cloud dependency: fully on-device inference
- ✅ Low GPU memory footprint: suitable for resource-constrained devices
- ✅ Privacy preservation: no data leaves the device
- ✅ Comparable quality: matches large proprietary models on navigation tasks
The fine-tuned model is specifically optimized for generating concise, landmark-grounded navigation instructions and achieves superior performance compared to general-purpose language models.
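As an illustration of serving such a model locally, the sketch below loads a Phi-3 base model and applies LoRA adapter weights with the peft library. The adapter path, prompt, and generation settings are placeholders; the project's actual loading code and checkpoint live in the repository:

```python
# Sketch of loading a LoRA-adapted Phi-3 model for on-device inference.
# Adapter path and prompt are placeholders, not the project's real setup.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,  # needed on older transformers versions
)
model = PeftModel.from_pretrained(model, "path/to/lora-adapter")  # placeholder

prompt = "Landmarks along the route: sofa, stairs. Give one concise instruction."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```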
- Single-image localization requires manual image capture
- Batch-style interaction rather than continuous guidance
- Real-time egocentric video localization for continuous tracking
- Continuous instruction refinement based on user progress
- Latency optimization for sub-second response times
- AR glasses deployment for hands-free navigation
Team Members
- Riccardo Bianco (ETH Zurich)
- Francesco Bondi (ETH Zurich)
- Roham Zendehdel Nobari (ETH Zurich)
- Shaurya Kishore Panwar (University of Zurich)
- Fatemeh Sadat Daneshmand (ZHAW Winterthur)
Supervisors
Mahdi Rad · Gabriele Goletto · Kate Jaroslavceva
MIT License Β© 2025 Landmark-Based Conversational Indoor Navigation Team
This project builds upon several excellent open-source projects:
- Habitat-Sim and Habitat-Lab for simulation infrastructure
- Matterport3D and HM3D for indoor scene datasets
- Microsoft Phi-3 for the base language model
- LaMAR for localization benchmarking tools



