A mixed-reality indoor navigation system generating human-like, landmark-grounded navigation instructions from a single RGB image and natural-language query
Installation · Demo · Pipeline · Evaluation · Contributors
Indoor navigation differs fundamentally from outdoor navigation: GPS is unreliable, maps are incomplete, and humans rely heavily on local visual landmarks rather than metric distances. This project, developed as part of the Mixed Reality course (ETH Zurich / University of Zurich), presents a practical, deployable indoor navigation system that generates instructions the way people naturally do:
"Walk past the sofa, then turn right at the stairs."
Comparison between baseline navigation instructions and our landmark-based, human-oriented guidance. The baseline approach (left) relies on metric and abstract descriptions, resulting in verbose and less intuitive instructions. In contrast, our method (right) explicitly references visible landmarks, producing concise and human-interpretable guidance.
- Single-image user localization in reconstructed indoor environments
- Geometric path planning with collision-free trajectories
- Semantic landmark extraction from perceptual observations
- Concise, human-oriented navigation instructions
- Lightweight on-device language model (LoRA fine-tuned)
- Web-based mobile interface for real-world deployment
- Quantitative evaluation + user study
Demo video: mr_final_demo.mp4

End-to-end demonstration of the landmark-based navigation system in action.
Overview of the proposed mixed-reality navigation system. Given a user image and a natural-language query through our web app, the system localizes the user, plans a path in Habitat-Sim, extracts semantic landmarks along the route, and generates grounded navigation instructions.
- User Localization: Estimate 6-DoF pose from a single RGB image using image-based localization against a pre-built 3D reconstruction
- Path Planning: Compute collision-free trajectories using Habitat-Sim's planning module (see the sketch after this list)
- Landmark Extraction: Densify the path, capture RGB/depth/semantic observations, and cluster visible objects into persistent semantic landmarks
- Instruction Generation: Generate concise, fluent instructions grounded in visible landmarks using a fine-tuned lightweight language model
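The path-planning step builds on Habitat-Sim's navmesh-based pathfinding. Below is a minimal sketch of that step; the scene path and the randomly sampled start/goal points are illustrative placeholders, not the project's actual configuration:

```python
# Minimal sketch of collision-free path planning with Habitat-Sim's
# PathFinder. Scene path and start/goal points are placeholders.
import habitat_sim

# Load a scene; a navmesh is typically baked alongside the scene asset.
backend_cfg = habitat_sim.SimulatorConfiguration()
backend_cfg.scene_id = "data/scene_datasets/example/example.glb"  # placeholder
agent_cfg = habitat_sim.agent.AgentConfiguration()
sim = habitat_sim.Simulator(habitat_sim.Configuration(backend_cfg, [agent_cfg]))

# Query a collision-free shortest path between two navigable points.
path = habitat_sim.ShortestPath()
path.requested_start = sim.pathfinder.get_random_navigable_point()
path.requested_end = sim.pathfinder.get_random_navigable_point()
if sim.pathfinder.find_path(path):
    # path.points is a list of 3D waypoints; the pipeline densifies such a
    # path before capturing observations for landmark extraction.
    print(f"Geodesic distance: {path.geodesic_distance:.2f} m, "
          f"{len(path.points)} waypoints")
```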
Experimental setup in the simulated house environment. The Habitat-Sim rendering is shown on the left, while the graphical interface for issuing navigation queries and visualizing instructions is shown on the right.
This controlled environment allows us to:
- Evaluate instruction generation quality with known start and goal positions
- Compare different language models and landmark grounding strategies
- Isolate instruction quality from localization errors
This real-world building evaluation includes:
- Full pipeline evaluation with real image-based localization
- Web-based user interaction for practical deployment testing
- User studies with actual navigation queries
Detailed installation instructions are available in SETUP.md. We provide two installation options:
Quick setup for exploring HM3D house environments with our fine-tuned model:
```bash
git clone https://github.com/rzninvo/CNSG.git
cd CNSG
bash scripts/install_hm3d.sh
```

Run the system:

```bash
conda activate habitat-default
cd habitat-sim
python examples/mr_viewer.py --backend=local --finetuned-model=True
```

Full installation with semantic segmentation for the ETH HG academic building:
```bash
git clone https://github.com/rzninvo/CNSG.git
cd CNSG
bash scripts/install.sh
```

Run the system:

```bash
conda activate habitat-source
cd habitat-sim
python examples/mr_viewer.py --scene ./data/scene_datasets/HGE/HGE.basis.glb --dataset data/scene_datasets/HGE.scene_dataset_config.json
```

See SETUP.md for complete installation instructions, environment setup, and troubleshooting.
The system includes a mobile-friendly web interface that enables real-world deployment and user interaction. Users can capture images of their surroundings and submit natural-language navigation queries through an intuitive interface.
The system can be run in two modes: GUI mode (default) or server mode (for web app integration).
Start the navigation server to handle web app requests:
```bash
# For HM3D environment
conda activate habitat-default
cd habitat-sim
python examples/mr_viewer.py --server-mode --backend=local --finetuned-model=True
```

The server will start on http://localhost:5000 and provide REST API endpoints for:
- Image-based localization
- Navigation instruction generation
- Path planning and visualization
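As an illustration, a client request to the backend might look like the sketch below. The route name (`/api/navigate`) and payload fields are assumptions for illustration only; the actual API surface is defined by the server in mr_viewer.py:

```python
# Hypothetical client call; the endpoint name and JSON fields are
# placeholders, not the server's documented API.
import base64
import requests

with open("surroundings.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:5000/api/navigate",  # placeholder route
    json={
        "image": image_b64,  # single RGB image used for localization
        "query": "How do I get to the kitchen?",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # e.g. generated instruction text and planned path
```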
In a separate terminal, start the web application:
```bash
cd webapp
npm install
npm run dev
```

The web app will be available at http://localhost:8080.
To access the web app from a mobile device:

1. Install and authenticate ngrok:

   ```bash
   ngrok config add-authtoken <YOUR_NGROK_TOKEN>
   ```

2. Expose the frontend:

   ```bash
   ngrok http 8080
   ```

3. Expose the backend (in another terminal):

   ```bash
   ngrok http 5000
   ```

4. Update the frontend configuration to use the ngrok backend URL (see the sketch after this list)

5. Access the ngrok frontend URL from your mobile device
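How the backend URL is wired into the frontend depends on the web app's build setup. A common pattern for a dev-server-based frontend is an environment variable, sketched here; the variable name `VITE_API_BASE_URL` is an assumption, not the project's confirmed setting, so check the webapp source for the actual configuration:

```bash
# Hypothetical: if the frontend reads the backend URL from an env var
# (name assumed), point it at the ngrok tunnel before starting the dev server.
VITE_API_BASE_URL="https://<your-backend-subdomain>.ngrok-free.app" npm run dev
```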
This setup enables real-world testing and user studies with actual mobile navigation queries.
The system is evaluated across two complementary settings:
Simulated House Environment (HM3D)
- Controlled evaluation with known start and goal positions
- Isolates instruction generation quality
- Enables systematic comparison of language models
Real Building Environment (HG Academic Building)
- End-to-end pipeline evaluation
- Real image-based localization
- User studies with actual navigation queries
All instructions are evaluated on a 5-point Likert scale across three dimensions:
- Reference Object Quality: Accuracy and usefulness of landmark references
- Spatial & Directional Correctness: Accuracy of spatial relations and directions
- Naturalness of Language: Fluency and human-likeness of instructions
Our evaluation demonstrates that landmark-grounded instruction generation significantly outperforms baseline approaches across both evaluation environments.
Evaluation in the simulated house environment. Average scores for the three instruction quality metrics, comparing the different configurations. Higher scores indicate better performance.
Evaluation in the HG building. Average scores for the three instruction quality metrics, comparing the different configurations. Higher scores indicate better performance.
Key Findings:
- Landmark grounding provides substantial improvements over baseline approaches across all metrics
- GPT-4 and our fine-tuned local model achieve comparable performance, with GPT-4 showing slight advantages in spatial correctness
- The local baseline model (without fine-tuning) already outperforms the non-landmark baseline, demonstrating the value of landmark-based reasoning
- Our fine-tuned lightweight model achieves near-GPT-4 performance while enabling fully on-device inference
- All landmark-grounded approaches show consistent improvements in reference object quality and language naturalness
End-to-end latency profile:
- Visual localization: Primary bottleneck (~8-11s)
- Path planning: Minimal overhead (<0.5s)
- Landmark extraction: Real-time (~0.3s)
- Instruction generation: Negligible with local model (<1s)
The system maintains interactive latency suitable for real-world deployment while ensuring user privacy through local inference.
Our instruction generation leverages a Phi-3 model fine-tuned with LoRA, which provides:
- ✅ No cloud dependency: fully on-device inference
- ✅ Low GPU memory footprint: suitable for resource-constrained devices
- ✅ Privacy preservation: no data leaves the device
- ✅ Comparable quality: matches large proprietary models on navigation tasks
The fine-tuned model is specifically optimized for generating concise, landmark-grounded navigation instructions and achieves superior performance compared to general-purpose language models.
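As an illustration of serving such a model locally, the sketch below loads a Phi-3 base model and applies LoRA adapter weights with the peft library. The adapter path, prompt, and generation settings are placeholders; the project's actual loading code and checkpoint live in the repository:

```python
# Sketch of loading a LoRA-adapted Phi-3 model for on-device inference.
# Adapter path and prompt are placeholders, not the project's real setup.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,  # needed on older transformers versions
)
model = PeftModel.from_pretrained(model, "path/to/lora-adapter")  # placeholder

prompt = "Landmarks along the route: sofa, stairs. Give one concise instruction."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```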
- Single-image localization requires manual image capture
- Batch-style interaction rather than continuous guidance
- Real-time egocentric video localization for continuous tracking
- Continuous instruction refinement based on user progress
- Latency optimization for sub-second response times
- AR glasses deployment for hands-free navigation
Team Members
- Riccardo Bianco (ETH Zurich)
- Francesco Bondi (ETH Zurich)
- Roham Zendehdel Nobari (ETH Zurich)
- Shaurya Kishore Panwar (University of Zurich)
- Fatemeh Sadat Daneshmand (ZHAW Winterthur)
Supervisors
Mahdi Rad · Gabriele Goletto · Kate Jaroslavceva
MIT License Β© 2025 Landmark-Based Conversational Indoor Navigation Team
This project builds upon several excellent open-source projects:
- Habitat-Sim and Habitat-Lab for simulation infrastructure
- Matterport3D and HM3D for indoor scene datasets
- Microsoft Phi-3 for the base language model
- LaMAR for localization benchmarking tools



