Allow truth data to have smaller domain than prediction models in BMC #22

Merged
pabggpnMSU merged 6 commits into main from copilot/fix-common-domain-issues
Oct 14, 2025

Contributor

Copilot AI commented Oct 10, 2025

Plan to Fix Domain Intersection Issue ✅ COMPLETE

Problem: Currently, load_data() uses an inner join across all models, including the truth data, so only domain points that also have truth data survive the join. This unnecessarily restricts the model domains when the truth data covers a smaller domain.

Solution: Modify load_data() to:

  • Add optional parameter truth_column_name to identify which model is the truth data
  • Separate models into regular models and truth model
  • Calculate common domain intersection for regular models only
  • Left join truth data onto the common model domain (keeping all model domain points)
  • Ensure backward compatibility by making the parameter optional
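The join logic described in this list can be sketched as follows. This is only an illustration under stated assumptions: plain dictionaries stand in for the loaded tables, and `combine` is a hypothetical helper, not the project's `load_data()` implementation.

```python
def combine(models, truth=None):
    """Hypothetical helper: each mapping goes domain point -> value."""
    # Inner join: keep only the points present in every regular model.
    common = set.intersection(*(set(values) for values in models.values()))
    rows = {}
    for point in sorted(common):
        row = {name: values[point] for name, values in models.items()}
        if truth is not None:
            # Left join: keep the point even where truth is missing (None).
            row["truth"] = truth.get(point)
        rows[point] = row
    return rows


# Made-up data: the models overlap on points 2, 3, 4; truth covers 2 and 3.
models = {
    "model_a": {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0},
    "model_b": {2: 2.1, 3: 3.1, 4: 4.1, 5: 5.1},
}
truth = {2: 2.05, 3: 3.05}

# New behavior: truth is handled separately and does not shrink the domain.
rows = combine(models, truth)

# Old behavior: truth is treated as just another model in the inner join.
legacy_rows = combine({**models, "truth": truth})
```

In this sketch, `rows` keeps point 4 with a missing truth value, while `legacy_rows` drops it because the truth data does not cover it.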

Changes completed:

  • Modified Dataset.load_data() method signature to accept truth_column_name parameter
  • Updated loading logic to handle truth data separately using left join
  • Added comprehensive tests to verify truth data can have smaller domain (HDF5 & CSV)
  • Added integration tests demonstrating full BMC workflow with smaller truth domains
  • Verified backward compatibility (all existing tests pass)
  • Created example script demonstrating the new feature
  • Updated documentation (docs/usage.md and docs/index.md) with examples and usage guide

Documentation Updates:

  • Updated usage guide with truth_column_name parameter examples
  • Added notes explaining the behavior when truth has smaller domain
  • Added complete example workflow demonstrating the feature
  • Updated index.md to highlight flexible truth data capability
  • Added tips for filtering data when training with smaller truth domain
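As a rough illustration of the filtering tip, the sketch below separates the rows usable for training from the full prediction domain. The row layout and the use of `None` for missing truth values are assumptions made for this example; the actual loader may represent missing values differently (for instance as NaN).

```python
# Made-up merged rows: truth is missing at x = 4.
rows = [
    {"x": 2, "model_a": 2.0, "model_b": 2.1, "truth": 2.05},
    {"x": 3, "model_a": 3.0, "model_b": 3.1, "truth": 3.05},
    {"x": 4, "model_a": 4.0, "model_b": 4.1, "truth": None},
]

# Train only where truth is available.
train_rows = [row for row in rows if row["truth"] is not None]

# Predict on the full common model domain, with or without truth.
predict_rows = rows
```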

Testing: All 29 tests pass, including:

  • 3 new unit tests for smaller truth domain functionality
  • 2 new integration tests showing complete BMC workflow
  • All 24 existing tests (verifying backward compatibility)

Key Benefits:
✓ Truth data can have fewer domain points than models
✓ Train BMC on subset where truth is available
✓ Make predictions on full model domain
✓ Backward compatible - existing code works unchanged
✓ Works with both HDF5 and CSV formats
✓ Fully documented with examples

Original prompt

When loading data and using in the BMC, the models being mixed need to have a common domain. The truth data, called truth_column_name in the BMC module, can be smaller than the model domains. Propose a fix for this without introducing breaking changes.



Copilot AI and others added 2 commits October 10, 2025 20:55
Co-authored-by: kylegodbey <5806808+kylegodbey@users.noreply.github.com>
…omains

Co-authored-by: kylegodbey <5806808+kylegodbey@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix common domain issues in BMC module" to "Allow truth data to have smaller domain than prediction models in BMC" Oct 10, 2025
Copilot AI requested a review from kylegodbey October 10, 2025 21:03
@kylegodbey
Member

@copilot update the examples and the usage guide in the docs to reflect this behavior

Co-authored-by: kylegodbey <5806808+kylegodbey@users.noreply.github.com>
Contributor Author

Copilot AI commented Oct 10, 2025

@copilot update the examples and the usage guide in the docs to reflect this behavior

Updated the documentation in commit COMMIT_HASH:

Changes made:

  1. docs/usage.md - Comprehensive updates:

    • Added truth_column_name parameter to the main example
    • Added note boxes explaining the smaller domain behavior
    • Added complete example workflow section demonstrating the feature
    • Updated all code examples to show models_list excluding truth column
    • Added tips for filtering data when training with smaller truth domain
  2. docs/index.md - Added feature highlight:

    • Listed "Flexible Truth Data" as a key feature

The usage guide now clearly demonstrates:

  • How to use truth_column_name parameter
  • What happens when truth has smaller domain
  • How to filter training data appropriately
  • Complete end-to-end workflow example

Updated wording for clarity in usage documentation.
Removed the feature about flexible truth data from the documentation.
pabggpnMSU marked this pull request as ready for review October 14, 2025 21:07
pabggpnMSU merged commit cd6d1d8 into main Oct 14, 2025
4 checks passed

3 participants