
Quick Start

1) Enter Project

cd <your-project-path>/multimodal-ambiguity-resolution

2) Create and Activate a venv

python3 -m venv venv
source venv/bin/activate

3) Install Dependencies

pip install -r requirements.txt

4) Configure API Keys

Create a .env file in the project root:

DEEPSEEK_API_KEY=your_key_here
QWEN_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
ANTHROPIC_API_KEY=your_key_here
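The experiment scripts presumably load these keys themselves; if you need them in your own session (e.g. a notebook or an ad-hoc script), a minimal stdlib-only sketch of parsing the .env file might look like this (load_env is an illustrative helper, not part of the project):

```python
import os

def load_env(path=".env"):
    """Parse simple KEY=value lines from a .env file into os.environ.

    Blank lines and # comments are skipped; variables already present in
    the environment are not overwritten.
    """
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```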

Run Experiments

A) Spider (Text-to-SQL)

python3 experiments/full_experiment.py \
  --dataset spider \
  --model deepseek \
  --samples 100 \
  --mode full

# Optional: include official Spider evaluation (if spider-master/evaluation.py is available locally)
python3 experiments/full_experiment.py \
  --dataset spider \
  --model deepseek \
  --samples 100 \
  --mode full \
  --official-eval

Run full Spider dev set:

python3 experiments/full_experiment.py --dataset spider --model deepseek --mode full

Run smoke test (fast end-to-end sanity check):

python3 experiments/full_experiment.py --dataset spider --model deepseek --mode smoke

B) BIRD (Text-to-SQL)

python3 experiments/full_experiment.py \
  --dataset bird \
  --model deepseek \
  --samples 100 \
  --mode full

C) ChartQA (Unified Multimodal)

--benchmark is required for unified multimodal evaluation.

python3 experiments/unified_multimodal_experiment.py \
  --benchmark chartqa \
  --data-root data/multimodal \
  --split test \
  --samples 100 \
  --mode full \
  --max-questions 3 \
  --eig-threshold 0.0 \
  --ambiguity-threshold 0.5 \
  --model deepseek \
  --output experiments/results/unified_chartqa_eval_100.json

D) InfographicsVQA / MP-DocVQA (Unified Multimodal)

python3 experiments/unified_multimodal_experiment.py \
  --benchmark infographicsvqa \
  --data-root data/multimodal \
  --split val \
  --samples 100 \
  --mode full \
  --max-questions 3 \
  --eig-threshold 0.0 \
  --ambiguity-threshold 0.5 \
  --model deepseek \
  --output experiments/results/unified_infovqa_eval_100.json

E) CAESURA (Unified Multimodal)

# Artwork domain
python3 experiments/unified_multimodal_experiment.py \
  --benchmark caesura_artwork \
  --data-root data/multimodal \
  --split test \
  --samples 100 \
  --mode full \
  --max-questions 3 \
  --auto-resolve-without-user \
  --model deepseek \
  --output experiments/results/unified_caesura_artwork_eval_100.json

# Rotowire domain
python3 experiments/unified_multimodal_experiment.py \
  --benchmark caesura_rotowire \
  --data-root data/multimodal \
  --split test \
  --samples 100 \
  --mode full \
  --max-questions 3 \
  --auto-resolve-without-user \
  --model deepseek \
  --output experiments/results/unified_caesura_rotowire_eval_100.json

Optional clarification-behavior knob:

--auto-resolve-without-user

F) One-Command Smoke/Full Run (Spider + BIRD + QA)

bash run_all_benchmarks.sh smoke deepseek
bash run_all_benchmarks.sh full deepseek

G) Clarification Parameters (Unified QA)

The unified benchmark script supports explicit clarification-control knobs:

python3 experiments/unified_multimodal_experiment.py --help
  • --ambiguity-threshold (float, default 0.5): Minimum ambiguity confidence required for an ambiguity candidate to be kept; a higher value means stricter detection. Increase (e.g. 0.6–0.75) to reduce over-triggered ambiguities.
  • --eig-threshold (float, default 0.0): Minimum Expected Information Gain (EIG) for a clarification question to be kept. Raise (e.g. 0.1–0.3) to keep only high-value clarification questions.
  • --max-questions (int, default 3): Upper bound on the clarification budget (maximum retained/iterated clarification questions). Use 1–2 for conservative settings; larger values for interactive settings.
  • --auto-resolve-without-user (flag, default off): If enabled, the system does not ask a human; it auto-selects the first clarification option and continues. Use for uninterrupted offline batch runs; keep off for user-facing interaction.
  • --disable-llm-entity-extraction (flag, default off): Disable LLM-based semantic entity extraction during index build and query-time enrichment. Enable for stricter rule-only extraction or lower runtime cost.
  • --min-grounding-score (float, default 0.0): Minimum grounding-evidence score required for an entity to enter the clarification options. Increase to filter weak or unreliable option candidates.
  • --min-criticality (float, default 0.0): Minimum criticality score for a clarification question to be asked. Increase to suppress low-value clarification turns.
  • --stop-after-no-progress-rounds (int, default 2): Stop iterative clarification after this many rounds without ambiguity reduction. Keep small (1–2) to avoid over-clarification loops.
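Roughly, the thresholds compose as a filter-then-budget pipeline. The sketch below uses hypothetical names and dict shapes; the script's real internals may differ:

```python
def filter_candidates(candidates, ambiguity_threshold=0.5):
    """Keep only ambiguity candidates whose confidence clears the threshold."""
    return [c for c in candidates if c["confidence"] >= ambiguity_threshold]

def select_questions(questions, eig_threshold=0.0, max_questions=3):
    """Keep high-EIG clarification questions, best first, within the budget."""
    kept = [q for q in questions if q["eig"] >= eig_threshold]
    kept.sort(key=lambda q: q["eig"], reverse=True)
    return kept[:max_questions]
```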

Example (strict ambiguity + no auto clarification):

python3 experiments/unified_multimodal_experiment.py \
  --benchmark chartqa \
  --data-root data/multimodal \
  --split test \
  --mode smoke \
  --model deepseek \
  --ambiguity-threshold 0.65 \
  --eig-threshold 0.15 \
  --min-grounding-score 0.5 \
  --min-criticality 0.2 \
  --stop-after-no-progress-rounds 2 \
  --max-questions 2 \
  --output experiments/results/unified_chartqa_smoke_tuned.json

H) Build Entity Index Offline

python3 experiments/build_entity_index.py --target all --limit 100
# or build CAESURA only:
python3 experiments/build_entity_index.py --target caesura_artwork --limit 200

Use --entity-index-dir to change the storage location (default cache/entities).
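To confirm an offline build actually produced something, you can list what landed in the cache directory. The file layout inside cache/entities is an assumption about this project; adjust to what you see on disk:

```python
from pathlib import Path

def list_index_files(index_dir="cache/entities"):
    """Return cached entity-index file names (empty list if nothing is built)."""
    root = Path(index_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_file())
```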

Data Locations

Spider

Download the dataset and place it under data/spider/:

Spider dataset files are large (approx. 2GB) and hosted on Google Drive, which can be tricky for automated downloads.

Option 1: Attempt Automated Download (Recommended first)

python3 -c "from src.data.dataset_manager import DatasetManager; DatasetManager().download_dataset('spider')"

Note: This command attempts to download and extract the full Spider dataset (including databases) into ./data/spider/. If it fails or downloads too slowly due to Google Drive limitations, proceed with Option 2.

Option 2: Manual Download

  1. Visit the official Spider website: https://yale-lily.github.io/spider
  2. Download the spider_dataset.zip (or similar full dataset archive).
  3. Extract the contents. Ensure the database/ folder, dev.json, train_spider.json, and tables.json are placed directly under ./data/spider/. The final structure should look like:
    data/spider/
    ├── dev.json
    ├── train_spider.json
    ├── tables.json
    └── database/
        ├── <db_id_1>/
        │   └── <db_id_1>.sqlite
        ├── <db_id_2>/
        │   └── <db_id_2>.sqlite
        └── ...
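A quick way to verify the layout above before launching a long run is a small existence check (check_spider_layout is an illustrative helper, not part of the project; paths are taken from the structure shown):

```python
from pathlib import Path

def check_spider_layout(root="data/spider"):
    """Return the required Spider paths that are missing (empty list = OK)."""
    root = Path(root)
    required = [root / "dev.json", root / "tables.json", root / "database"]
    return [str(p) for p in required if not p.exists()]
```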
    

BIRD

Place dataset under:

  • data/bird/dev_20240627/dev.json
  • data/bird/dev_20240627/dev_databases/<db_id>/<db_id>.sqlite

ChartQA

Place dataset under:

  • data/multimodal/chartqa/ChartQA Dataset/test/test_augmented.json
  • data/multimodal/chartqa/ChartQA Dataset/test/png/...

InfographicsVQA / MP-DocVQA

Supported formats:

  • JSON format under data/multimodal/infographicsvqa/...*.json
  • MP-DocVQA imdb format:
    • data/multimodal/infographicsvqa/imdb_train.npy
    • data/multimodal/infographicsvqa/imdb_val.npy
    • data/multimodal/infographicsvqa/imdb_test.npy
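If you want to inspect one of the imdb files directly, note that they typically store pickled Python objects rather than a plain numeric array, so numpy needs allow_pickle=True (an assumption about this dataset's packaging; verify against your copy):

```python
import numpy as np

def load_imdb(path):
    """Load an MP-DocVQA imdb .npy file into a plain Python list of records."""
    # allow_pickle=True because the archive holds Python objects, not numbers.
    return list(np.load(path, allow_pickle=True))
```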

CAESURA

Place dataset under:

  • data/multimodal/caesura/artwork/...
  • data/multimodal/caesura/rotowire/...

Supported query files include *.json and *.jsonl (e.g. test.json, queries_test.jsonl).
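Handling both formats is straightforward; a minimal reader, assuming one JSON object per line in the .jsonl case, might look like (load_queries is an illustrative helper, not part of the project):

```python
import json
from pathlib import Path

def load_queries(path):
    """Load query records from either a *.json file or a *.jsonl file."""
    path = Path(path)
    if path.suffix == ".jsonl":
        # One JSON object per non-empty line.
        return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    return json.loads(path.read_text())
```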

Outputs

All results are saved to:

  • experiments/results/
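To get a quick overview of what a batch of runs produced, you can map each result file to its top-level JSON keys. This assumes the result files are plain JSON objects; the exact schema is project-specific:

```python
import json
from pathlib import Path

def summarize_results(results_dir="experiments/results"):
    """Map each saved result file name to its sorted top-level JSON keys."""
    return {
        f.name: sorted(json.loads(f.read_text()).keys())
        for f in sorted(Path(results_dir).glob("*.json"))
    }
```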