```bash
cd <your-project-path>/multimodal-ambiguity-resolution
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Create `.env` in the project root:

```bash
DEEPSEEK_API_KEY=your_key_here
QWEN_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
ANTHROPIC_API_KEY=your_key_here
```

Run a full Spider experiment:

```bash
python3 experiments/full_experiment.py \
  --dataset spider \
  --model deepseek \
  --samples 100 \
  --mode full
```
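The experiment scripts are expected to pick these keys up from the environment. As a minimal sketch of how a `.env` file in `KEY=value` format can be loaded (illustrative only; the `load_env` helper is hypothetical and the project may rely on python-dotenv or similar instead):

```python
import os
from pathlib import Path

def load_env(path=".env"):
    """Minimal .env loader (illustrative; not the project's actual mechanism).

    Reads KEY=value lines, skipping blanks and # comments, and exports each
    key into os.environ without overwriting values already set.
    """
    env_file = Path(path)
    if not env_file.exists():
        return {}
    loaded = {}
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded
```

This keeps real environment variables authoritative: a key exported in the shell wins over the `.env` value.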
Optional: include the official Spider evaluation (if `spider-master/evaluation.py` is available locally):

```bash
python3 experiments/full_experiment.py \
  --dataset spider \
  --model deepseek \
  --samples 100 \
  --mode full \
  --official-eval
```

Run the full Spider dev set:

```bash
python3 experiments/full_experiment.py --dataset spider --model deepseek --mode full
```

Run a smoke test (fast end-to-end sanity check):

```bash
python3 experiments/full_experiment.py --dataset spider --model deepseek --mode smoke
```

Run a full BIRD experiment:

```bash
python3 experiments/full_experiment.py \
  --dataset bird \
  --model deepseek \
  --samples 100 \
  --mode full
```

`--benchmark` is required for unified multimodal evaluation.
```bash
python3 experiments/unified_multimodal_experiment.py \
  --benchmark chartqa \
  --data-root data/multimodal \
  --split test \
  --samples 100 \
  --mode full \
  --max-questions 3 \
  --eig-threshold 0.0 \
  --ambiguity-threshold 0.5 \
  --model deepseek \
  --output experiments/results/unified_chartqa_eval_100.json
```

```bash
python3 experiments/unified_multimodal_experiment.py \
  --benchmark infographicsvqa \
  --data-root data/multimodal \
  --split val \
  --samples 100 \
  --mode full \
  --max-questions 3 \
  --eig-threshold 0.0 \
  --ambiguity-threshold 0.5 \
  --model deepseek \
  --output experiments/results/unified_infovqa_eval_100.json
```

```bash
# Artwork domain
python3 experiments/unified_multimodal_experiment.py \
  --benchmark caesura_artwork \
  --data-root data/multimodal \
  --split test \
  --samples 100 \
  --mode full \
  --max-questions 3 \
  --auto-resolve-without-user \
  --model deepseek \
  --output experiments/results/unified_caesura_artwork_eval_100.json

# Rotowire domain
python3 experiments/unified_multimodal_experiment.py \
  --benchmark caesura_rotowire \
  --data-root data/multimodal \
  --split test \
  --samples 100 \
  --mode full \
  --max-questions 3 \
  --auto-resolve-without-user \
  --model deepseek \
  --output experiments/results/unified_caesura_rotowire_eval_100.json
```

Optional clarification behavior knob: `--auto-resolve-without-user`.

Run all benchmarks at once:

```bash
bash run_all_benchmarks.sh smoke deepseek
bash run_all_benchmarks.sh full deepseek
```
## G) Clarification Parameters (Unified QA)
The unified benchmark script supports explicit clarification-control knobs:
```bash
python3 experiments/unified_multimodal_experiment.py --help
```

| Parameter | Type | Default | Meaning | Typical Usage |
|---|---|---|---|---|
| `--ambiguity-threshold` | float | 0.5 | Minimum ambiguity confidence required for an ambiguity candidate to be kept. Higher values mean stricter ambiguity detection. | Increase to reduce over-triggered ambiguities (e.g., 0.6–0.75). |
| `--eig-threshold` | float | 0.0 | Minimum Expected Information Gain (EIG) for a clarification question to be kept. | Raise to keep only high-value clarification questions (e.g., 0.1–0.3). |
| `--max-questions` | int | 3 | Upper bound on the clarification budget (max retained/iterated clarification questions). | Use 1–2 for conservative settings; larger values for interactive settings. |
| `--auto-resolve-without-user` | flag | off | If enabled, the system does not ask a human; it auto-selects the first clarification option and continues. | Use for uninterrupted offline batch runs; keep off for user-facing interaction. |
| `--disable-llm-entity-extraction` | flag | off | Disable LLM-based semantic entity extraction during index build/query-time enrichment. | Enable for stricter rule-only extraction or lower runtime cost. |
| `--min-grounding-score` | float | 0.0 | Minimum grounding evidence score required for an entity to enter clarification options. | Increase to reduce weak/unreliable option candidates. |
| `--min-criticality` | float | 0.0 | Minimum criticality score for a clarification question to be asked. | Increase to suppress low-value clarification turns. |
| `--stop-after-no-progress-rounds` | int | 2 | Stop iterative clarification after this many rounds without ambiguity reduction. | Keep small (1–2) to avoid over-clarification loops. |
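The thresholds above compose into a simple gating-and-budgeting step. The sketch below is illustrative Python, not the repository's actual implementation; the `ClarificationQuestion` fields and `select_questions` helper are assumptions that mirror the table's semantics:

```python
from dataclasses import dataclass

@dataclass
class ClarificationQuestion:
    text: str
    ambiguity_confidence: float  # detector confidence that the query is ambiguous
    eig: float                   # expected information gain of asking this question
    criticality: float           # how much the final answer depends on resolution

def select_questions(candidates, ambiguity_threshold=0.5, eig_threshold=0.0,
                     min_criticality=0.0, max_questions=3):
    """Keep candidates that pass every threshold, highest EIG first, up to the budget."""
    kept = [q for q in candidates
            if q.ambiguity_confidence >= ambiguity_threshold
            and q.eig >= eig_threshold
            and q.criticality >= min_criticality]
    kept.sort(key=lambda q: q.eig, reverse=True)
    return kept[:max_questions]
```

Tightening any one threshold only ever shrinks the kept set, which is why the table recommends raising them to suppress over-clarification.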
Example (strict ambiguity + no auto clarification):

```bash
python3 experiments/unified_multimodal_experiment.py \
  --benchmark chartqa \
  --data-root data/multimodal \
  --split test \
  --mode smoke \
  --model deepseek \
  --ambiguity-threshold 0.65 \
  --eig-threshold 0.15 \
  --min-grounding-score 0.5 \
  --min-criticality 0.2 \
  --stop-after-no-progress-rounds 2 \
  --max-questions 2 \
  --output experiments/results/unified_chartqa_smoke_tuned.json
```

Build the entity index:

```bash
python3 experiments/build_entity_index.py --target all --limit 100
# or build CAESURA only:
python3 experiments/build_entity_index.py --target caesura_artwork --limit 200
```

Use `--entity-index-dir` to change the storage location (default: `cache/entities`).
Download and place the dataset under `data/spider/`:

Spider dataset files are large (approx. 2 GB) and hosted on Google Drive, which can be tricky for automated downloads.

Option 1: Attempt Automated Download (Recommended first)

```bash
python3 -c "from src.data.dataset_manager import DatasetManager; DatasetManager().download_dataset('spider')"
```

Note: This command attempts to download and extract the full Spider dataset (including databases) into `./data/spider/`. If it fails or downloads too slowly due to Google Drive limitations, proceed with Option 2.

Option 2: Manual Download

- Visit the official Spider website: https://yale-lily.github.io/spider
- Download `spider_dataset.zip` (or a similar full dataset archive).
- Extract the contents. Ensure the `database/` folder, `dev.json`, `train_spider.json`, and `tables.json` are placed directly under `./data/spider/`. The final structure should look like:

```
data/spider/
├── dev.json
├── train_spider.json
├── tables.json
└── database/
    ├── <db_id_1>/
    │   └── <db_id_1>.sqlite
    ├── <db_id_2>/
    │   └── <db_id_2>.sqlite
    └── ...
```
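Before launching experiments, it can save time to verify the layout programmatically. A minimal sketch (the `check_spider_layout` helper is hypothetical, not part of the repo):

```python
from pathlib import Path

def check_spider_layout(root="data/spider"):
    """Return the names of required Spider entries missing under `root`.

    An empty list means the layout looks complete; a non-empty list names
    what still needs to be placed (illustrative check only).
    """
    root = Path(root)
    required = ["dev.json", "tables.json", "database"]
    return [name for name in required if not (root / name).exists()]
```

Running it right after extraction catches the common mistake of nesting the archive's top-level folder one level too deep.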
Place dataset under:

```
data/bird/dev_20240627/dev.json
data/bird/dev_20240627/dev_databases/<db_id>/<db_id>.sqlite
```
Place dataset under:

```
data/multimodal/chartqa/ChartQA Dataset/test/test_augmented.json
data/multimodal/chartqa/ChartQA Dataset/test/png/...
```
Supported formats:

- JSON format under `data/multimodal/infographicsvqa/` (`*.json`)
- MP-DocVQA imdb format:

```
data/multimodal/infographicsvqa/imdb_train.npy
data/multimodal/infographicsvqa/imdb_val.npy
data/multimodal/infographicsvqa/imdb_test.npy
```
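MP-DocVQA-style imdb files are NumPy arrays of Python objects, so loading them typically requires `allow_pickle=True`. A minimal sketch (the `load_imdb` helper is illustrative, not part of the repo):

```python
import numpy as np

def load_imdb(path):
    """Load an MP-DocVQA-style imdb .npy file.

    These files store Python objects (dicts per sample), so np.load needs
    allow_pickle=True; only use this on files from a trusted source.
    """
    return np.load(path, allow_pickle=True)
```

Inspect `data[0]` (and `data[1:]` for the samples) to see the exact record fields your copy of the dataset uses.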
Place dataset under:

```
data/multimodal/caesura/artwork/...
data/multimodal/caesura/rotowire/...
```

Supported query files include `*.json` and `*.jsonl` (e.g. `test.json`, `queries_test.jsonl`).
All results are saved to `experiments/results/`.
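To get a quick overview of completed runs without assuming any particular result schema, a small helper (hypothetical, not part of the repo) can enumerate the JSON files in that directory:

```python
import json
from pathlib import Path

def list_results(results_dir="experiments/results"):
    """Return (filename, top-level JSON type) for each result file.

    Schema-agnostic on purpose: inspect individual files for the
    experiment-specific fields.
    """
    out = []
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open() as f:
            data = json.load(f)
        out.append((path.name, type(data).__name__))
    return out
```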