
Repository for HypoSpace, the homework project of Yang Xianing, Wang Ziqi, and Song Yaobohan for the AY2025-2026 Machine Learning course (EEC4300).


🔬 HypoSpace

Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination



🎯 Three Domains • Three Metrics • Infinite Insights

(Figures: Overview • Model Comparison • Task Illustration)

🧬 Causal Graphs • 📦 3D Reconstruction • 🔀 Boolean Logic


📖 About

TL;DR: HypoSpace evaluates how well LLMs generate diverse sets of valid hypotheses in underdetermined scientific problems, not just single correct answers.

The Challenge

As language models are increasingly used in scientific workflows, evaluating their ability to propose sets of explanations, not just a single correct answer, becomes critical. Many scientific problems are underdetermined: multiple, mechanistically distinct hypotheses are consistent with the same observations.

Our Solution

We introduce HypoSpace, a diagnostic suite that treats LLMs as samplers of finite hypothesis sets and measures three complementary indicators:

| Metric | Symbol | What It Measures |
|---|---|---|
| 🎯 Validity | V | Precision of proposals consistent with observations |
| ✨ Uniqueness | U | Non-redundancy among proposals |
| 📈 Recovery | R | Coverage of the enumerated admissible set |
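Concretely, under one plausible reading of these definitions (an assumption on our part; the paper's exact formulas may differ), the three metrics can be sketched in a few lines of Python:

```python
def hypospace_metrics(proposals, admissible):
    """Sketch of Validity, Uniqueness, Recovery for a set of proposals.

    proposals:  list of hypotheses emitted by the model (may repeat)
    admissible: set of all hypotheses consistent with the observations
    Assumed definitions (not verified against the paper):
      V = fraction of proposals that are valid
      U = fraction of proposals that are distinct
      R = fraction of the admissible set that was recovered
    """
    if not proposals:
        return 0.0, 0.0, 0.0
    valid = [p for p in proposals if p in admissible]
    V = len(valid) / len(proposals)
    U = len(set(proposals)) / len(proposals)
    R = len(set(valid)) / len(admissible)
    return V, U, R

# Toy example: 4 proposals against an admissible space of 3 hypotheses.
V, U, R = hypospace_metrics(["h1", "h1", "h2", "bad"], {"h1", "h2", "h3"})
# V = 0.75 (one invalid), U = 0.75 (one duplicate), R = 2/3 (h3 never found)
```

A model that repeats one correct answer scores high V but low U and R, which is exactly the mode-collapse pattern described under Key Findings.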

Three Structured Domains

We instantiate HypoSpace in three domains with deterministic validators and exactly enumerated hypothesis spaces:

  1. 🧬 Causal Graphs: inferred from perturbations
  2. 📦 3D Voxel Reconstruction: gravity-constrained, from top-down projections
  3. 🔀 Boolean Genetic Interactions: logical function discovery

Key Findings

Across instruction-tuned and reasoning-focused models, Validity often remains high while Uniqueness and Recovery degrade as the admissible space grows, revealing mode collapse that is invisible to correctness-only metrics.

💡 HypoSpace offers a controlled probe, rather than a leaderboard, for methods that explicitly explore and cover admissible explanation spaces.


๐Ÿ“ Repository Structure

📂 HypoSpace/
│
├── 📦 3d/                                  ← 3D Voxel Reconstruction Domain
│   ├── 🔧 generate_3d_dataset_complete.py  • Dataset generator
│   ├── 🚀 run_3d_benchmark.py              • Benchmark runner
│   ├── 📚 modules/                         • LLM interface & models
│   │   ├── llm_interface.py
│   │   └── models.py
│   └── ⚙️  config/
│       └── config_gpt4o.yaml               • Configuration file
│
├── 🔀 boolean/                             ← Boolean Genetic Interactions
│   ├── 🔧 boolean_dataset.py               • Dataset generator
│   ├── 🚀 boolean_benchmark.py             • Benchmark runner
│   ├── 📚 modules/
│   │   ├── llm_interface.py
│   │   └── models.py
│   └── ⚙️  config/
│       └── config_gpt4o.yaml
│
└── 🧬 causal/                              ← Causal Graph Discovery
    ├── 🔧 generate_causal_dataset.py       • Dataset generator (small)
    ├── 🔧 generate_causal_dataset_for_large.py  • Dataset generator (large)
    ├── 🚀 run_causal_benchmark.py          • Benchmark runner
    ├── 📚 modules/
    │   ├── llm_interface.py
    │   └── models.py
    └── ⚙️  config/
        └── config_gpt4o.yaml

🚀 Quick Start

Step 1๏ธโƒฃ: Configure Your LLM

Edit the YAML config files in each domain's config/ folder:

What you can customize:

  • 🤖 LLM provider and model
  • 🌡️ Temperature settings
  • 📂 Output paths
  • 💾 Checkpoint directories

Example: config/config_gpt4o.yaml

llm:
  type: openrouter              # Options: openai, anthropic, openrouter
  models:
    openrouter: "openai/gpt-4o"
  api_keys:
    openrouter: "your-api-key"  # ⚠️ Replace with your actual API key
  temperature: 0.7              # 0.0 = deterministic, 1.0 = creative

benchmark:
  checkpoint: "checkpoints"     # Resume interrupted runs
  verbose: true                 # Print detailed logs
  output_pattern: "results/{dataset_name}_{model}.json"
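Note that output_pattern appears to use Python str.format-style placeholders. Assuming the runner substitutes dataset_name and model (field names inferred from the pattern above, not verified against the benchmark code), the expansion would look like:

```python
# Hypothetical illustration of how the runner might expand output_pattern.
# The placeholder names come from the config shown above; the substitution
# logic inside run_*_benchmark.py is an assumption on our part.
pattern = "results/{dataset_name}_{model}.json"
path = pattern.format(dataset_name="causal_n3", model="gpt-4o")
print(path)  # results/causal_n3_gpt-4o.json
```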

Step 2๏ธโƒฃ: Generate Datasets

Each domain has its own dataset generator. Here are examples for all three:

🧬 Causal Graphs
cd causal
python generate_causal_dataset.py \
  --nodes 3 \
  --seed 33550336 \
  --output "datasets/node03/n3_all_observations.json"

Parameters:

  • --nodes: Number of nodes in graphs (3, 4, 5, etc.)
  • --seed: Random seed for reproducibility
  • --output: Path to save dataset JSON
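For intuition about why this domain is underdetermined, the brute-force sketch below (our own illustration, not the repository's generator) counts all labeled DAGs on n nodes, showing how quickly the hypothesis space grows with --nodes:

```python
from itertools import combinations, product

def is_acyclic(nodes, edges):
    # Kahn-style check: repeatedly remove nodes with no incoming edges.
    remaining = set(nodes)
    while remaining:
        sources = {x for x in remaining if not any(v == x for (_, v) in edges)}
        if not sources:
            return False  # every remaining node is in a cycle
        remaining -= sources
        edges = {(u, v) for (u, v) in edges if u in remaining and v in remaining}
    return True

def count_dags(n):
    """Count labeled DAGs on n nodes by brute force: each unordered pair
    of nodes is absent, oriented one way, or oriented the other way."""
    nodes = range(n)
    pairs = list(combinations(nodes, 2))
    total = 0
    for states in product((0, 1, 2), repeat=len(pairs)):
        edges = set()
        for (u, v), s in zip(pairs, states):
            if s == 1:
                edges.add((u, v))
            elif s == 2:
                edges.add((v, u))
        if is_acyclic(nodes, edges):
            total += 1
    return total

print(count_dags(3))  # 25
```

There are 3 labeled DAGs on 2 nodes, 25 on 3, and 543 on 4, so even small --nodes settings leave many mechanistically distinct graphs consistent with limited perturbation data.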
📦 3D Voxel Reconstruction
cd 3d
python generate_3d_dataset_complete.py \
  --grid-size 3 \
  --max-height 3 \
  --max-blocks 1 \
  --fixed \
  --seed 33550336 \
  --output "datasets/3d_grid3_h3.json"

Parameters:

  • --grid-size: Grid dimensions (e.g., 3 for a 3×3 grid)
  • --max-height: Maximum structure height
  • --max-blocks: Maximum number of blocks in top view
  • --fixed: If set, generate only structures with exactly max-blocks blocks; otherwise allow 1 to max-blocks blocks
  • --output: Output file path
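As a rough illustration of the admissible-space size in this domain, assume each occupied top-view cell is a gravity-packed column whose only free parameter is its height (our simplifying reading, not necessarily the benchmark's exact rule). Then every consistent structure assigns each of the k filled cells a height in 1..max_height, giving max_height**k hypotheses per observation:

```python
from itertools import product

def consistent_structures(top_view, max_height):
    """Enumerate height maps consistent with a top-down projection,
    assuming gravity packs every column solidly from the ground up
    (a simplifying assumption, not necessarily the benchmark's rule)."""
    cells = [(r, c) for r, row in enumerate(top_view)
                    for c, filled in enumerate(row) if filled]
    for heights in product(range(1, max_height + 1), repeat=len(cells)):
        yield dict(zip(cells, heights))

# A 3x3 top view with two filled cells and max height 3:
top = [[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0]]
print(len(list(consistent_structures(top, 3))))  # 9  (= 3**2)
```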
🔀 Boolean Logic
cd boolean
python boolean_dataset.py \
  --operators basic \
  --max-depth 2 \
  --seed 33550336 \
  --output 'datasets/boolean_2var.json'

Parameters:

  • --operators: Allowed Boolean operator set (basic, extended, or full)
  • --max-depth: Maximum expression depth
  • --seed: Random seed for reproducibility
  • --output: Output JSON file
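To see the underdetermination concretely here too, the small sketch below (independent of boolean_dataset.py) enumerates all 16 two-variable Boolean functions and keeps those consistent with a partial set of observations:

```python
from itertools import product

INPUTS = list(product((0, 1), repeat=2))  # the four (a, b) truth-table rows

def consistent_functions(observations):
    """Return every 2-variable Boolean function, encoded as a 4-bit
    truth table, that agrees with the observed (input, output) pairs."""
    survivors = []
    for table in product((0, 1), repeat=4):  # all 16 functions
        f = dict(zip(INPUTS, table))
        if all(f[x] == y for x, y in observations):
            survivors.append(table)
    return survivors

# Observing only two of the four rows leaves 4 admissible functions:
obs = [((0, 0), 0), ((1, 1), 1)]
print(len(consistent_functions(obs)))  # 4  (AND, OR, a, and b all fit)
```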

Step 3๏ธโƒฃ: Run Benchmarks

Run the benchmark for your chosen domain:

🧬 Causal Benchmark
cd causal
python run_causal_benchmark.py \
  --dataset "datasets/node03/n3_all_observations.json" \
  --config "config/config_gpt4o.yaml" \
  --n-samples 30 \
  --query-multiplier 1.0 \
  --seed 33550336

Run in background with logging:

nohup python -u run_causal_benchmark.py \
  --dataset "datasets/node03/n3_all_observations.json" \
  --config "config/config_gpt4o.yaml" \
  --n-samples 30 \
  --query-multiplier 1.0 \
  --seed 33550336 > logs/causal_gpt4o.log 2>&1 &
📦 3D Benchmark
cd 3d
python run_3d_benchmark.py \
  --dataset "datasets/3d_grid3_h3.json" \
  --config "config/config_gpt4o.yaml" \
  --n-samples 30 \
  --query-multiplier 1.0 \
  --seed 33550336
🔀 Boolean Benchmark
cd boolean
python boolean_benchmark.py \
  --dataset "datasets/boolean_2var.json" \
  --config "config/config_gpt4o.yaml" \
  --n-samples 30 \
  --query-multiplier 1.0 \
  --seed 33550336

Common Parameters:

  • --dataset: Path to generated dataset
  • --config: Configuration YAML file
  • --n-samples: Number of observation sets to evaluate
  • --query-multiplier: Multiplier for queries per task
  • --seed: Random seed for reproducibility

Step 4๏ธโƒฃ: Analyze Results

Results are automatically saved as JSON files in the results/ directory.

What's included:

{
  "metadata": {
    "model": "openai/gpt-4o",
    "dataset": "causal_n3",
    "n_samples": 30,
    "timestamp": "2025-10-17T12:00:00"
  },
  "aggregate_metrics": {
    "mean_validity": 0.92,      // 🎯 How many proposals are valid
    "mean_uniqueness": 0.78,    // ✨ How diverse the proposals are
    "mean_recovery": 0.65,      // 📈 Coverage of the solution space
    "std_validity": 0.08,
    "std_uniqueness": 0.12,
    "std_recovery": 0.15
  },
  "results": [/* detailed per-sample results */]
}
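Assuming the JSON layout shown above (field names taken from that example, not verified against the runner's output), a few lines of stdlib Python summarize a results file:

```python
import json

def summarize(path):
    """Return a one-line summary of a HypoSpace results file,
    assuming the JSON layout shown in the example above."""
    with open(path) as fh:
        data = json.load(fh)
    meta, m = data["metadata"], data["aggregate_metrics"]
    return (f"{meta['model']} on {meta['dataset']} (n={meta['n_samples']}): "
            f"V={m['mean_validity']:.2f}±{m['std_validity']:.2f}  "
            f"U={m['mean_uniqueness']:.2f}±{m['std_uniqueness']:.2f}  "
            f"R={m['mean_recovery']:.2f}±{m['std_recovery']:.2f}")

# Example (path is hypothetical):
# print(summarize("results/causal_n3_gpt-4o.json"))
```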

Understanding the Metrics:

| Metric | Range | Good Score | Interpretation |
|---|---|---|---|
| 🎯 Validity | 0-1 | > 0.90 | Model proposes correct hypotheses |
| ✨ Uniqueness | 0-1 | > 0.80 | Model avoids redundant proposals |
| 📈 Recovery | 0-1 | > 0.80 | Model explores the solution space well |

📊 Supported Models

| Provider | Example Models | Config Type |
|---|---|---|
| OpenAI | GPT-4o, GPT-4-turbo, GPT-3.5 | `openai` |
| OpenRouter | Any model via OpenRouter | `openrouter` |

๐Ÿ“ Citation

If you use HypoSpace in your research, please cite:

@article{chen2025hypospace,
  title={HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination},
  author={Chen, Tingting and Lin, Beibei and Yuan, Zifeng and Zou, Qiran and He, Hongyu and Ong, Yew-Soon and Goyal, Anirudh and Liu, Dianbo},
  journal={arXiv preprint arXiv:2510.15614},
  year={2025}
}

📄 License

This project is released under the MIT License.


Built with โค๏ธ for scientific discovery

โญ Star us on GitHub โ€ข ๐Ÿ› Report issues โ€ข ๐Ÿ’ก Suggest features
