VGGT-Qwen3 RoomPlan is a Stage-1 multi-view vision–language model for 3D indoor scene understanding. It combines a frozen VGGT visual backbone with a Perceiver projector and Qwen3-4B to answer questions about 3D scenes (ScanQA/SQA3D) using a token-level visual injection mechanism.
Stage 2 (RoomPlan action JSON prediction) is future work and intentionally out of scope for this README.
High-level Stage-1 pipeline:

```
Multi-view Images
  └─▶ VGGT Aggregator (frozen)
        └─▶ Perceiver Projector
              └─▶ Visual tokens
                    └─▶ Injected at <image> in Qwen3-4B
                          └─▶ Text answer
```
- VGGT: multi-view visual aggregator, kept frozen in Stage 1.
- Perceiver projector: maps VGGT features to a fixed-length sequence of visual tokens in Qwen3’s hidden dimension.
- Qwen3-4B: causal LM loaded via `AutoModelForCausalLM.from_pretrained(...)` and fine-tuned end-to-end in Stage 1 (no LoRA/PEFT injection in current code).
- Token-level injection: visual tokens overwrite embeddings at the `<image>` placeholder in the prompt; loss is computed only on answer tokens.
For a deeper architectural description (including label masking and injection span rules), see:
`docs/model/architecture.md`
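The injection idea can be sketched as follows. This is an illustrative, simplified version only: the function name `inject_visual_tokens` is hypothetical, and the real model handles span sizing and masking as described in `docs/model/architecture.md`.

```python
import torch

def inject_visual_tokens(inputs_embeds, input_ids, visual_tokens, image_token_id):
    """Overwrite text embeddings with projected visual tokens at the <image> span.

    inputs_embeds: (B, T, H) embeddings from the LM embedding layer.
    visual_tokens: (B, N, H) projector output; the prompt reserves N
    placeholder positions starting at the <image> token.
    """
    B, N, H = visual_tokens.shape
    out = inputs_embeds.clone()
    for b in range(B):
        # first occurrence of the <image> placeholder in this sample
        start = (input_ids[b] == image_token_id).nonzero()[0].item()
        out[b, start:start + N] = visual_tokens[b]
    return out
```

Because the overwritten span is part of the prompt (not the answer), the language-modeling loss is unaffected by the injected positions.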
- Training on ScanQA + SQA3D JSONL shards using:
  - Multi-view images (currently often single “bird-view” crops).
  - Textual questions and short answers.
  - Visual token injection at a dedicated `<image>` placeholder in the prompt.
- Stage-1 QA inference with a short-answer constraint (for better exact-match behavior).
- Configurable training via YAML:
  - Model name, Perceiver config, number of visual/geom tokens.
  - Dataset mix, views, sequence length.
  - Optimizer and scheduler parameters.
Note: `configs/stage1_3d.yaml` contains `lora:` and `model.freeze_text_layers`, but these fields are currently not applied by `src/vggt_qwen3/train/stage1.py`.
- Canonical entry points:
  - `train_stage1.sh` / `python -m vggt_qwen3.train.stage1`
  - `infer_stage1.sh` / `python -m vggt_qwen3.inference.qa_inference`
Stage 2 / RoomPlan JSON actions are not documented here beyond brief “future work” mentions.
Stage-1 currently fine-tunes Qwen end-to-end (full-parameter fine-tuning) while freezing VGGT by default.
| Component | Trainable? | Notes |
|---|---|---|
| VGGT vision backbone | No (default) | Controlled by `model.freeze_vision` in `src/vggt_qwen3/models/vggt_qwen3_vlm.py`; default `true` in `configs/stage1_3d.yaml`. |
| Perceiver projector | Yes | Included in optimizer with `train.proj_lr` in `src/vggt_qwen3/train/stage1.py`. |
| `geom_head` | Yes | Included in optimizer with `train.proj_lr` in `src/vggt_qwen3/train/stage1.py`. |
| Qwen3 text model | Yes | Loaded with `AutoModelForCausalLM.from_pretrained(...)`; parameters remain trainable and are optimized with base `train.lr` (no LoRA/PEFT currently). |
`configs/stage1_3d.yaml` includes `lora:` and `model.freeze_text_layers` fields as placeholders, but the current Stage-1 training harness does not read/apply them.
- Recommended Python: 3.9+.
- Recommended: conda environment.
Using conda (general GPUs like V100/A100/H100):

```bash
conda env create -f env/environment.yml
conda activate roomplan
```

Using venv + pip:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .
```

If you are running on NVIDIA B200 (compute capability sm_100), the default `env/environment.yml` (PyTorch 2.4.0) may not support this architecture. In that case, create a dedicated environment using a newer PyTorch build that includes B200 support:
```bash
# From the repo root, on a node with CUDA 12.9+ available
# (adjust the module name to your cluster)
module load cuda/12.9.1  # or similar

conda create -n roomplan-b200 python=3.10 -y
conda activate roomplan-b200

# Install a PyTorch build that supports NVIDIA B200 (sm_100)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# Install remaining dependencies without overriding the CUDA-enabled PyTorch
pip install -r requirements.txt --no-deps

# Install this repo and VGGT in editable mode
pip install -e .
pip install -e third_party/vggt

# (Optional) logging / training extras
pip install --no-build-isolation deepspeed tensorboard
```

You can verify that your PyTorch build sees the B200 architecture via:

```bash
python -c "import torch; print(torch.__version__, torch.cuda.get_arch_list())"
```

Look for `"sm_100"` in the printed list. Once this environment is active, use the standard training commands (for example, single-GPU DeepSpeed):
```bash
ACCELERATE_CONFIG=configs/accelerate_1gpu.yaml ./train_stage1.sh
```

- VGGT:
  - Place the VGGT checkpoint at `third_party/vggt/vggt_1B_commercial.pt`.
  - (Optional) install VGGT in editable mode: `pip install -e third_party/vggt`
- Hugging Face cache (recommended on shared filesystems):

  ```bash
  export HF_HOME="$PWD/.cache/huggingface"
  export TRANSFORMERS_CACHE="$HF_HOME"
  export HF_DATASETS_CACHE="$HF_HOME"
  ```
Verify CUDA/GPU:

```bash
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```

Expected on-disk layout for Stage 1:
```
data/
  raw/
    scannet/            # optional; only needed to regenerate from raw
    scanqa/
    sqa3d/
  processed/
    scanqa/
      train_split.jsonl
      test_split.jsonl
    sqa3d/
      train_split.jsonl
      test_split.jsonl
```
Each JSONL line in train_split.jsonl / test_split.jsonl is a record with:
- `images`: list of image paths (relative or absolute).
- `geom_token`: `null` for current Stage-1 shards (geometry bypassed).
- `task`: `"scanqa"` or `"sqa3d"`.
- `question`: question string.
- `answer`: short textual answer.
- `scene_id`: scene identifier.
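For concreteness, here is a hypothetical record matching that schema (paths and values are illustrative, not taken from a real shard):

```python
import json

record = {
    "images": ["data/processed/scanqa/frames/scene0000_00/bird_view.jpg"],
    "geom_token": None,        # geometry bypassed in current Stage-1 shards
    "task": "scanqa",          # or "sqa3d"
    "question": "What color is the sofa?",
    "answer": "brown",
    "scene_id": "scene0000_00",
}
line = json.dumps(record)      # one line of train_split.jsonl
```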
The Stage-1 config `configs/stage1_3d.yaml` mixes ScanQA and SQA3D (default 0.7 / 0.3) via `MultiSourceDataset`.
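The 0.7/0.3 mixing boils down to drawing each sample from one source with fixed probabilities. A tiny sketch of the idea (this is not the real `MultiSourceDataset`, whose implementation may differ):

```python
import random

class WeightedMixSampler:
    """Illustrative sampler: draws each item from one of several sources
    with fixed probabilities (e.g. 0.7 ScanQA / 0.3 SQA3D)."""

    def __init__(self, sources, weights, seed=0):
        self.names = list(sources)      # e.g. ["scanqa", "sqa3d"]
        self.sources = sources
        self.weights = weights          # e.g. [0.7, 0.3]
        self.rng = random.Random(seed)

    def __iter__(self):
        while True:
            name = self.rng.choices(self.names, weights=self.weights, k=1)[0]
            ds = self.sources[name]
            yield name, ds[self.rng.randrange(len(ds))]
```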
Data loading and collation:
- `vggt_qwen3.dataio.dataset_builder.MultiViewJsonDataset`:
  - Loads JSON/JSONL, normalizes fields, caps to `num_views`, and resolves image paths.
- `vggt_qwen3.dataio.collate_multiview.MultiViewCollator`:
  - Resizes/crops images.
  - Builds prompts containing the question followed by the `<image>` placeholder, which guarantees a safe injection span for the visual tokens (no overlap with answer tokens).
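The answer-only loss mentioned earlier amounts to masking all non-answer positions with the standard PyTorch cross-entropy ignore index. A minimal sketch (the helper name `mask_prompt_labels` is hypothetical):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy

def mask_prompt_labels(input_ids, answer_start):
    """Return labels where everything before the answer (prompt text plus
    the injected visual span) is ignored, so loss covers answer tokens only."""
    return [IGNORE_INDEX] * answer_start + list(input_ids[answer_start:])
```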
The Stage-1 trainer lives at:

- Trainer: `src/vggt_qwen3/train/stage1.py`
- Config: `configs/stage1_3d.yaml`
Canonical command:
```bash
accelerate launch --config_file configs/accelerate_single_gpu.yaml \
  -m vggt_qwen3.train.stage1 \
  --config configs/stage1_3d.yaml \
  --output_dir ckpts/stage1_3d
```

Or, via the wrapper:

```bash
./train_stage1.sh
```

You can override key parameters via environment variables:

- `CONFIG` – path to a Stage-1 config (default `configs/stage1_3d.yaml`).
- `OUTPUT_DIR` – checkpoint directory (default `ckpts/stage1_3d`).
- `ACCELERATE_CONFIG` – Accelerate config (default `configs/accelerate_single_gpu.yaml`).
Use one of the provided Accelerate configs (e.g., `configs/accelerate_4gpu.yaml`):

```bash
ACCELERATE_CONFIG=configs/accelerate_4gpu.yaml ./train_stage1.sh
```

On Slurm/HiPerGator, wrap this command in an sbatch script that requests the appropriate number of GPUs and nodes. See the existing `scripts/slurm/` files and `docs/SLURM_TRAINING_GUIDE.md` for patterns (note: Stage 2 scripts are future work).
If you prefer launching the Stage-1 harness directly with torchrun (using Accelerate only as a library), you can run:
```bash
torchrun --standalone --nproc_per_node=4 -m vggt_qwen3.train.stage1 \
  --config configs/stage1_3d.yaml \
  --output_dir ckpts/stage1_3d \
  --max_steps 30000
```

This uses standard DDP instead of the `accelerate launch` CLI, but the training loop and logging behavior are identical.
If you want DeepSpeed ZeRO-3, pass a DeepSpeed config via `DEEPSPEED_CONFIG`:

```bash
DEEPSPEED_CONFIG=configs/deepspeed_zero3.json ACCELERATE_CONFIG=configs/accelerate_4gpu.yaml ./train_stage1.sh
```

Internally, `vggt_qwen3.train.stage1`:
- Builds the tokenizer and ensures a dedicated `<image>` token.
- Creates datasets and `MultiViewCollator`.
- Builds `VGGTQwen3VLM` with:
  - Frozen VGGT.
  - Perceiver projector.
  - Qwen3-4B full-parameter fine-tuning (no LoRA/PEFT injection in current code).
- Splits parameters into base vs projector/geom parameter groups with separate LRs.
- Uses a cosine LR schedule with warmup.
- Writes reproducibility metadata under the chosen `--output_dir`:
  - `logs/events.jsonl` – structured training metrics (loss, LR, speed, ETA).
  - `meta/args.yaml` – CLI arguments for the run.
  - `meta/env.txt` – Python/Torch/CUDA summary and `pip freeze`.
  - `meta/git.txt` – git commit and status (when available).
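The cosine-with-warmup schedule follows the usual shape; a self-contained sketch (the exact schedule in `stage1.py` may differ in details such as a minimum LR floor):

```python
import math

def lr_at(step, max_steps, warmup_steps, base_lr):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```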
Stage-1 QA inference uses:

- `vggt_qwen3.inference.qa_inference`
- The shell wrapper `infer_stage1.sh`
Wrapper (recommended):
```bash
CONFIG=configs/stage1_3d.yaml \
CHECKPOINT_DIR=ckpts/stage1_3d \
DATASET=scanqa \
./infer_stage1.sh
```

Equivalent Python command:
```bash
python -m vggt_qwen3.inference.qa_inference \
  --config configs/stage1_3d.yaml \
  --checkpoint_dir ckpts/stage1_3d \
  --dataset scanqa \
  --max_new_tokens 32
```

Defaults:

- Glob: `data/processed/scanqa/test_split.jsonl`
- Predictions: `outputs/qa/scanqa/scanqa_predictions_test.jsonl`
- Events log: `outputs/qa/scanqa/scanqa_predictions_test.jsonl.events.jsonl`
Wrapper:
```bash
CONFIG=configs/stage1_3d.yaml \
CHECKPOINT_DIR=ckpts/stage1_3d \
DATASET=sqa3d \
./infer_stage1.sh
```

Equivalent Python:
```bash
python -m vggt_qwen3.inference.qa_inference \
  --config configs/stage1_3d.yaml \
  --checkpoint_dir ckpts/stage1_3d \
  --dataset sqa3d \
  --max_new_tokens 32
```

Defaults:

- Glob: `data/processed/sqa3d/test_split.jsonl`
- Predictions: `outputs/qa/sqa3d/sqa3d_predictions_test.jsonl`
- Events log: `outputs/qa/sqa3d/sqa3d_predictions_test.jsonl.events.jsonl`
Run both datasets in a single call:
```bash
python -m vggt_qwen3.inference.qa_inference \
  --config configs/stage1_3d.yaml \
  --checkpoint_dir ckpts/stage1_3d \
  --dataset scanqa+sqa3d \
  --max_new_tokens 32
```

This produces both prediction files listed above (one per dataset) plus per-dataset `.events.jsonl` logs.
The QA inference script builds prompts using the Qwen3 chat template with a short-answer hint:
ScanQA (no situation text):

```
{question}
<image>
Answer with a short phrase only.
```
SQA3D (includes “situation” when present):

```
Situation: {situation}
Question: {question}
<image>
Answer with a short phrase only.
```
Generation settings:
- `temperature = 0.0`
- `top_p = 1.0`
- `num_beams = 1`
- Small `max_new_tokens` (default 32)
This combination encourages deterministic, concise answers suitable for exact-match evaluation.
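In Hugging Face `generate` terms, these settings amount to plain greedy decoding (with `do_sample=False`, `temperature` and `top_p` have no effect). A sketch of the equivalent kwargs; the actual call lives in `vggt_qwen3/inference/qa_inference.py`:

```python
# Illustrative generation kwargs mirroring the settings above.
gen_kwargs = dict(
    do_sample=False,     # greedy: equivalent to temperature=0.0 / top_p=1.0
    num_beams=1,         # no beam search
    max_new_tokens=32,   # short answers for exact-match evaluation
)
# outputs = model.generate(**inputs, **gen_kwargs)
```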
- Per-sample predictions are written as JSONL records with `question`, `prediction`, `reference`, `scene_id`, `task`, etc.
- A companion `*.events.jsonl` file logs:
  - Dataset name, number of samples, number of generations, and how many predictions were non-empty.
  - One event per sample plus a final summary event.
For more detailed evaluation beyond exact match, see docs/evaluation.md.
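Exact match itself is easy to compute from the prediction JSONL. A minimal, hypothetical scorer for quick checks (the project's own metrics live under `src/vggt_qwen3/eval/` and `docs/evaluation.md`, and may normalize differently):

```python
import string

def _normalize(s: str) -> str:
    """Lowercase, strip whitespace, and drop ASCII punctuation."""
    s = s.lower().strip()
    return s.translate(str.maketrans("", "", string.punctuation))

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if normalized strings match exactly, else 0.0."""
    return float(_normalize(prediction) == _normalize(reference))
```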
Key paths:
- `src/vggt_qwen3/`
  - `dataio/` – dataset and collator.
  - `models/` – VGGTQwen3 wrapper and Perceiver projector.
  - `train/stage1.py` – Stage-1 training harness.
  - `inference/qa_inference.py` – Stage-1 QA inference.
  - `eval/` – lightweight evaluation helpers.
- `configs/`
  - `stage1_3d.yaml` – Stage-1 config.
  - `perceiver_small.yaml` – projector config.
  - `accelerate_*.yaml`, `deepspeed_zero3.json` – launcher configs.
- `scripts/`
  - `eval_baseline.sh`, `eval_baseline_quick.py` – evaluation helpers.
  - `prep/` – data preparation scripts (ScanQA/SQA3D/ARKit).
  - `slurm/` – Slurm examples (Stage 2/3 future work).
- `train_stage1.sh`, `infer_stage1.sh` – canonical CLI wrappers.
- `docs/`
  - `index.md` – docs landing page.
  - `stage1_quickstart.md` – extended Stage-1 guide.
  - `model/architecture.md` – model internals.
  - `dev/debug_history.md` – maintainer debug notes.
  - `dev/repo_structure.md` – repo organization rationale.
Common issues and tips:
- **Stage-1 inference returns empty predictions**
  - Symptom: every line in the JSONL file has an empty `prediction` string, or the script raises a `RuntimeError` stating that no non-empty predictions were produced.
  - Checklist:
    - Verify that test splits exist and are non-empty:
      - `data/processed/scanqa/test_split.jsonl`
      - `data/processed/sqa3d/test_split.jsonl`
    - Re-run with debug mode to inspect prompts and token IDs:

      ```bash
      python -m vggt_qwen3.inference.qa_inference --dataset scanqa --debug --debug_max_samples 4 ...
      ```

    - Inspect the companion `*.events.jsonl` file to confirm `total_non_empty_predictions > 0`.
    - If `total_generations_attempted > 0` but all decoded strings are empty, suspect a tokenizer decode issue (special-tokens-only outputs); the debug logs print both `skip_special_tokens=True` and `False` decodes and raw token IDs to help diagnose.
- **Missing `<image>` token**
  - Symptom: `MultiViewCollator` or the model raises an error about a missing `<image>` token or uses the `unk` ID.
  - Fix: use the provided tokenizer builders in `train.stage1` and `qa_inference`; avoid custom tokenization that bypasses them.
- **Injection span overlapping answer tokens**
  - The Stage-1 collator inserts a reserved padding span after `<image>` sized to `num_vis_tokens + geom_tokens` and labels it as `-100`. If you change `num_vis_tokens` or `geom_tokens`, keep `data.max_length` sufficiently large; `MultiViewCollator` enforces a sanity check and will raise if `max_length` is too small.
- **dtype or device mismatches**
  - On GPU, Stage 1 uses bf16 by default; ensure your hardware supports it.
  - On CPU, you may want to run with float32 (adjust `model.dtype` in the config or via the wrapper if needed).
- **Cleaning artifacts (checkpoints, logs, outputs)**
  - To safely remove training and inference artifacts without touching code or data:

    ```bash
    # Dry run – prints what would be removed
    python scripts/clean_artifacts.py

    # Actual deletion (requires confirmation flag or env var)
    python scripts/clean_artifacts.py --yes
    # or
    CONFIRM=1 python scripts/clean_artifacts.py
    ```

  - This script only targets `ckpts/`, `outputs/`, `logs/`, `runs/`, `results/`, and `pytorchdist_*.out` files under the repo root.
For a deeper engineering/debugging history (including injection fixes), see:
`docs/dev/debug_history.md`
If you use this codebase or model in your research, please cite it using the metadata in CITATION.cff.
- This repository is licensed under the Apache License 2.0 (see `LICENSE`).
- VGGT, Qwen3, and any other third-party components in `third_party/` are subject to their respective licenses. Please review those files before redistribution or commercial use.