```bash
cd <your-project-path>/multimodal-ambiguity-resolution
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Create `.env` in the project root:

```bash
DEEPSEEK_API_KEY=your_key_here
QWEN_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
ANTHROPIC_API_KEY=your_key_here
```

Run a full Spider experiment:

```bash
python3 experiments/full_experiment.py \
  --dataset spider \
  --model deepseek \
  --samples 100 \
  --mode full
```
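The experiment scripts are expected to pick these keys up from the environment. As a minimal sketch of how a `.env` file in `KEY=value` format can be loaded (illustrative only; the `load_env` helper is hypothetical and the project may rely on python-dotenv or similar instead):

```python
import os
from pathlib import Path

def load_env(path=".env"):
    """Minimal .env loader (illustrative; not the project's actual mechanism).

    Reads KEY=value lines, skipping blanks and # comments, and exports each
    key into os.environ without overwriting values already set.
    """
    env_file = Path(path)
    if not env_file.exists():
        return {}
    loaded = {}
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded
```

This keeps real environment variables authoritative: a key exported in the shell wins over the `.env` value.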
Optional: include the official Spider evaluation (if `spider-master/evaluation.py` is available locally):

```bash
python3 experiments/full_experiment.py \
  --dataset spider \
  --model deepseek \
  --samples 100 \
  --mode full \
  --official-eval
```

Run the full Spider dev set:

```bash
python3 experiments/full_experiment.py --dataset spider --model deepseek --mode full
```

Run a smoke test (fast end-to-end sanity check):

```bash
python3 experiments/full_experiment.py --dataset spider --model deepseek --mode smoke
```

Run a full BIRD experiment:

```bash
python3 experiments/full_experiment.py \
  --dataset bird \
  --model deepseek \
  --samples 100 \
  --mode full
```

`--benchmark` is required for unified multimodal evaluation.
```bash
python3 experiments/unified_multimodal_experiment.py \
  --benchmark chartqa \
  --data-root data/multimodal \
  --split test \
  --samples 100 \
  --mode full \
  --max-questions 3 \
  --eig-threshold 0.0 \
  --ambiguity-threshold 0.5 \
  --model deepseek \
  --output experiments/results/unified_chartqa_eval_100.json
```

```bash
python3 experiments/unified_multimodal_experiment.py \
  --benchmark infographicsvqa \
  --data-root data/multimodal \
  --split val \
  --samples 100 \
  --mode full \
  --max-questions 3 \
  --eig-threshold 0.0 \
  --ambiguity-threshold 0.5 \
  --model deepseek \
  --output experiments/results/unified_infovqa_eval_100.json
```

```bash
# Artwork domain
python3 experiments/unified_multimodal_experiment.py \
  --benchmark caesura_artwork \
  --data-root data/multimodal \
  --split test \
  --samples 100 \
  --mode full \
  --max-questions 3 \
  --auto-resolve-without-user \
  --model deepseek \
  --output experiments/results/unified_caesura_artwork_eval_100.json

# Rotowire domain
python3 experiments/unified_multimodal_experiment.py \
  --benchmark caesura_rotowire \
  --data-root data/multimodal \
  --split test \
  --samples 100 \
  --mode full \
  --max-questions 3 \
  --auto-resolve-without-user \
  --model deepseek \
  --output experiments/results/unified_caesura_rotowire_eval_100.json
```

Optional clarification behavior knob: `--auto-resolve-without-user`.

Run all benchmarks at once:

```bash
bash run_all_benchmarks.sh smoke deepseek
bash run_all_benchmarks.sh full deepseek
```
## G) Clarification Parameters (Unified QA)
The unified benchmark script supports explicit clarification-control knobs:
```bash
python3 experiments/unified_multimodal_experiment.py --help
```

| Parameter | Type | Default | Meaning | Typical Usage |
|---|---|---|---|---|
| `--ambiguity-threshold` | float | 0.5 | Minimum ambiguity confidence required for an ambiguity candidate to be kept. Higher values mean stricter ambiguity detection. | Increase to reduce over-triggered ambiguities (e.g., 0.6–0.75). |
| `--eig-threshold` | float | 0.0 | Minimum Expected Information Gain (EIG) for a clarification question to be kept. | Raise to keep only high-value clarification questions (e.g., 0.1–0.3). |
| `--max-questions` | int | 3 | Upper bound on the clarification budget (max retained/iterated clarification questions). | Use 1–2 for conservative settings; larger values for interactive settings. |
| `--auto-resolve-without-user` | flag | off | If enabled, the system does not ask a human; it auto-selects the first clarification option and continues. | Use for uninterrupted offline batch runs; keep off for user-facing interaction. |
| `--disable-llm-entity-extraction` | flag | off | Disable LLM-based semantic entity extraction during index build/query-time enrichment. | Enable for stricter rule-only extraction or lower runtime cost. |
| `--min-grounding-score` | float | 0.0 | Minimum grounding evidence score required for an entity to enter clarification options. | Increase to reduce weak/unreliable option candidates. |
| `--min-criticality` | float | 0.0 | Minimum criticality score for a clarification question to be asked. | Increase to suppress low-value clarification turns. |
| `--stop-after-no-progress-rounds` | int | 2 | Stop iterative clarification after this many rounds without ambiguity reduction. | Keep small (1–2) to avoid over-clarification loops. |
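The thresholds above compose into a simple gating-and-budgeting step. The sketch below is illustrative Python, not the repository's actual implementation; the `ClarificationQuestion` fields and `select_questions` helper are assumptions that mirror the table's semantics:

```python
from dataclasses import dataclass

@dataclass
class ClarificationQuestion:
    text: str
    ambiguity_confidence: float  # detector confidence that the query is ambiguous
    eig: float                   # expected information gain of asking this question
    criticality: float           # how much the final answer depends on resolution

def select_questions(candidates, ambiguity_threshold=0.5, eig_threshold=0.0,
                     min_criticality=0.0, max_questions=3):
    """Keep candidates that pass every threshold, highest EIG first, up to the budget."""
    kept = [q for q in candidates
            if q.ambiguity_confidence >= ambiguity_threshold
            and q.eig >= eig_threshold
            and q.criticality >= min_criticality]
    kept.sort(key=lambda q: q.eig, reverse=True)
    return kept[:max_questions]
```

Tightening any one threshold only ever shrinks the kept set, which is why the table recommends raising them to suppress over-clarification.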
Example (strict ambiguity + no auto clarification):

```bash
python3 experiments/unified_multimodal_experiment.py \
  --benchmark chartqa \
  --data-root data/multimodal \
  --split test \
  --mode smoke \
  --model deepseek \
  --ambiguity-threshold 0.65 \
  --eig-threshold 0.15 \
  --min-grounding-score 0.5 \
  --min-criticality 0.2 \
  --stop-after-no-progress-rounds 2 \
  --max-questions 2 \
  --output experiments/results/unified_chartqa_smoke_tuned.json
```

Build the entity index:

```bash
python3 experiments/build_entity_index.py --target all --limit 100
# or build CAESURA only:
python3 experiments/build_entity_index.py --target caesura_artwork --limit 200
```

Use `--entity-index-dir` to change the storage location (default: `cache/entities`).
Download and place the dataset under `data/spider/`:

Spider dataset files are large (approx. 2 GB) and hosted on Google Drive, which can be tricky for automated downloads.

Option 1: Attempt Automated Download (Recommended first)

```bash
python3 -c "from src.data.dataset_manager import DatasetManager; DatasetManager().download_dataset('spider')"
```

Note: This command attempts to download and extract the full Spider dataset (including databases) into `./data/spider/`. If it fails or downloads too slowly due to Google Drive limitations, proceed with Option 2.

Option 2: Manual Download

- Visit the official Spider website: https://yale-lily.github.io/spider
- Download `spider_dataset.zip` (or a similar full dataset archive).
- Extract the contents. Ensure the `database/` folder, `dev.json`, `train_spider.json`, and `tables.json` are placed directly under `./data/spider/`. The final structure should look like:

```
data/spider/
├── dev.json
├── train_spider.json
├── tables.json
└── database/
    ├── <db_id_1>/
    │   └── <db_id_1>.sqlite
    ├── <db_id_2>/
    │   └── <db_id_2>.sqlite
    └── ...
```
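Before launching experiments, it can save time to verify the layout programmatically. A minimal sketch (the `check_spider_layout` helper is hypothetical, not part of the repo):

```python
from pathlib import Path

def check_spider_layout(root="data/spider"):
    """Return the names of required Spider entries missing under `root`.

    An empty list means the layout looks complete; a non-empty list names
    what still needs to be placed (illustrative check only).
    """
    root = Path(root)
    required = ["dev.json", "tables.json", "database"]
    return [name for name in required if not (root / name).exists()]
```

Running it right after extraction catches the common mistake of nesting the archive's top-level folder one level too deep.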
Place dataset under:

```
data/bird/dev_20240627/dev.json
data/bird/dev_20240627/dev_databases/<db_id>/<db_id>.sqlite
```
Place dataset under:

```
data/multimodal/chartqa/ChartQA Dataset/test/test_augmented.json
data/multimodal/chartqa/ChartQA Dataset/test/png/...
```
Supported formats:

- JSON format under `data/multimodal/infographicsvqa/` (`*.json`)
- MP-DocVQA imdb format:

```
data/multimodal/infographicsvqa/imdb_train.npy
data/multimodal/infographicsvqa/imdb_val.npy
data/multimodal/infographicsvqa/imdb_test.npy
```
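MP-DocVQA-style imdb files are NumPy arrays of Python objects, so loading them typically requires `allow_pickle=True`. A minimal sketch (the `load_imdb` helper is illustrative, not part of the repo):

```python
import numpy as np

def load_imdb(path):
    """Load an MP-DocVQA-style imdb .npy file.

    These files store Python objects (dicts per sample), so np.load needs
    allow_pickle=True; only use this on files from a trusted source.
    """
    return np.load(path, allow_pickle=True)
```

Inspect `data[0]` (and `data[1:]` for the samples) to see the exact record fields your copy of the dataset uses.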
Place dataset under:

```
data/multimodal/caesura/artwork/...
data/multimodal/caesura/rotowire/...
```

Supported query files include `*.json` and `*.jsonl` (e.g. `test.json`, `queries_test.jsonl`).
All results are saved to `experiments/results/`.
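To get a quick overview of completed runs without assuming any particular result schema, a small helper (hypothetical, not part of the repo) can enumerate the JSON files in that directory:

```python
import json
from pathlib import Path

def list_results(results_dir="experiments/results"):
    """Return (filename, top-level JSON type) for each result file.

    Schema-agnostic on purpose: inspect individual files for the
    experiment-specific fields.
    """
    out = []
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open() as f:
            data = json.load(f)
        out.append((path.name, type(data).__name__))
    return out
```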