@sebhaan sebhaan commented May 25, 2025

Summary

This PR adds a new extension integrating TabPFGen for synthetic tabular data generation.

Impact

This extension enables TabPFN users to:

  • Automatically balance imbalanced datasets using TabPFGen's latest features
  • Generate high-quality synthetic tabular data with zero additional training
  • Assess synthetic data quality with a suite of metrics and visualisations
  • Integrate seamlessly with existing TabPFN workflows

Key Features:

  • Energy-based synthetic data generation
  • Support for both classification and regression tasks
  • Automatic dataset balancing for imbalanced classes
  • Class-balanced sampling option
  • Comprehensive visualization tools
  • Built on TabPFN transformer architecture
  • No additional training required

Installation

# Install specific extension
pip install "tabpfn-extensions[tabpfgen_datasynthesizer]"

# Or install all extensions  
pip install "tabpfn-extensions[all]"

Usage Examples

The extension supports several usage patterns; see:

  • /tabpfn-extensions/src/tabpfn_extensions/tabpfgen_datasynthesizer/README.md
  • examples/tabpfgen_datasynthesizer/

Basic classification dataset example:

from tabpfn_extensions.tabpfgen_datasynthesizer import TabPFNDataSynthesizer
from sklearn.datasets import load_breast_cancer

# Load example data
X, y = load_breast_cancer(return_X_y=True)

# Initialize synthesizer
synthesizer = TabPFNDataSynthesizer(n_sgld_steps=500)

# Generate synthetic classification data
X_synth, y_synth = synthesizer.generate_classification(
    X, y, 
    n_samples=100,
    balance_classes=True,  # Only balances synthetic samples
    visualize=True         # TabPFGen's built-in visualization
)

Project Structure

tabpfn-extensions/
├── src/tabpfn_extensions/tabpfgen_datasynthesizer/
│   ├── __init__.py
│   ├── tabpfgen_wrapper.py   # TabPFNDataSynthesizer: wrapper around TabPFGen functionality
│   ├── utils.py              # Data validation, preparation, and quality assessment
│   └── README.md
├── examples/tabpfgen_datasynthesizer/
│   ├── basic_classification_example.py
│   ├── dataset_balancing_demo.py
│   ├── basic_regression_example.py
│   └── README.md
└── tests/
    └── test_tabpfgen_datasynthesizer.py

Checklist

  • Follows project structure conventions
  • All tests passing (pytest tests/test_tabpfgen_datasynthesizer.py -v)
  • Examples working and documented
  • Dependencies configured in pyproject.toml
  • No breaking changes to existing code
  • Detailed documentation

Dependencies

Added to pyproject.toml: tabpfgen>=0.1.3 in both extension and [all] groups

Documentation

  • Detailed README with installation and usage examples
  • Three detailed examples covering different use cases
  • API documentation for all classes and methods
  • Troubleshooting guide and performance tips

Copilot AI review requested due to automatic review settings May 25, 2025 04:52
CLAassistant commented May 25, 2025

CLA assistant check
All committers have signed the CLA.

Contributor

Copilot AI left a comment

Pull Request Overview

Adds a new TabPFGen data synthesizer extension that integrates TabPFGen into TabPFN workflows, providing data validation, synthesis, balancing, quality metrics, and documentation/examples.

  • Introduces utility functions for data validation, dataset combination, class analysis, and synthetic quality metrics.
  • Adds TabPFNDataSynthesizer wrapper with methods for classification, regression, and dataset balancing.
  • Updates packaging (pyproject.toml), includes detailed README, and provides multiple example scripts.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| src/tabpfn_extensions/tabpfgen_datasynthesizer/utils.py | Utility functions for validation, combining, analysis, and metrics |
| src/tabpfn_extensions/tabpfgen_datasynthesizer/tabpfgen_wrapper.py | TabPFNDataSynthesizer wrapper class around TabPFGen |
| src/tabpfn_extensions/tabpfgen_datasynthesizer/__init__.py | Package exports, version, and author metadata |
| src/tabpfn_extensions/tabpfgen_datasynthesizer/README.md | Extension README with installation and usage examples |
| pyproject.toml | Added tabpfgen_datasynthesizer extra and updated all extras |
| examples/tabpfgen_datasynthesizer/* | Example scripts and README demonstrating usage |

Comment on lines 86 to 87
balanced_analysis_custom = analyze_class_distribution(y_balanced_custom, "Custom-Balanced Dataset (target=1000)")

Copilot AI May 25, 2025

This call to analyze_class_distribution is duplicated (also on line 84). You can remove one of them to avoid redundant computation and output.

Suggested change
- balanced_analysis_custom = analyze_class_distribution(y_balanced_custom, "Custom-Balanced Dataset (target=1000)")

except ImportError as e:
    TABPFGEN_AVAILABLE = False
    TabPFGen = None
    print(f"TabPFGen import failed: {e}")  # Debug info

Copilot AI May 25, 2025

Consider using warnings.warn(...) or a logging framework instead of print for import failure notifications to avoid writing directly to stdout.

Suggested change
-     print(f"TabPFGen import failed: {e}")  # Debug info
+     warnings.warn(f"TabPFGen import failed: {e}", ImportWarning)
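
For reference, the optional-import pattern discussed in this comment can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper name `optional_import` is hypothetical, and only the standard library's `importlib` and `warnings` modules are used.

```python
import importlib
import warnings


def optional_import(module_name):
    """Try to import a module; return (module, available) without printing.

    Hypothetical helper for illustration only, not part of tabpfn-extensions.
    """
    try:
        return importlib.import_module(module_name), True
    except ImportError as exc:
        # Report through the warnings machinery instead of writing to stdout.
        warnings.warn(
            f"{module_name} import failed: {exc}", ImportWarning, stacklevel=2
        )
        return None, False


# Mirrors the TABPFGEN_AVAILABLE flag from the reviewed snippet.
tabpfgen_module, TABPFGEN_AVAILABLE = optional_import("tabpfgen")
```

Note that Python's default warning filters ignore `ImportWarning`, so this message only appears when a filter such as `-W default` is active; a different category or a logger may be preferable if the failure should be visible by default.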

Comment on lines 202 to 208
print(f"\n=== {title} ===")
for cls, count, pct in zip(unique_classes, counts, percentages):
    print(f"Class {cls}: {count} samples ({pct:.1f}%)")

print(f"Total: {total_samples} samples, {len(unique_classes)} classes")
print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")

Copilot AI May 25, 2025

[nitpick] Utility functions like analyze_class_distribution should avoid printing directly. Consider returning analysis data only and let callers handle display or logging.

Suggested change
- print(f"\n=== {title} ===")
- for cls, count, pct in zip(unique_classes, counts, percentages):
-     print(f"Class {cls}: {count} samples ({pct:.1f}%)")
- print(f"Total: {total_samples} samples, {len(unique_classes)} classes")
- print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")
+ # Removed direct printing. Callers should handle display or logging.
+ # Example: Use the returned `analysis` dictionary for custom output.
+ pass
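
One way to separate analysis from display, as this comment suggests, is sketched below. The function names `summarize_class_distribution` and `format_class_distribution` and the dictionary keys are assumptions made for illustration; they are not the PR's `analyze_class_distribution` implementation.

```python
from collections import Counter


def summarize_class_distribution(y):
    """Return class counts, percentages, and imbalance ratio without printing.

    Hypothetical stand-in for the PR's analyze_class_distribution.
    """
    counts = Counter(y)
    if not counts:
        raise ValueError("y must contain at least one label")
    total = sum(counts.values())
    percentages = {cls: 100.0 * n / total for cls, n in counts.items()}
    # Ratio of the largest class to the smallest class.
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return {
        "counts": dict(counts),
        "percentages": percentages,
        "total_samples": total,
        "n_classes": len(counts),
        "imbalance_ratio": imbalance_ratio,
    }


def format_class_distribution(analysis, title):
    """Render an analysis dict as text; the caller decides whether to print or log it."""
    lines = [f"=== {title} ==="]
    for cls in sorted(analysis["counts"]):
        count = analysis["counts"][cls]
        pct = analysis["percentages"][cls]
        lines.append(f"Class {cls}: {count} samples ({pct:.1f}%)")
    lines.append(
        f"Total: {analysis['total_samples']} samples, {analysis['n_classes']} classes"
    )
    lines.append(f"Imbalance ratio: {analysis['imbalance_ratio']:.1f}:1")
    return "\n".join(lines)
```

An example script could then call `print(format_class_distribution(summarize_class_distribution(y), "Original Dataset"))`, while library code uses only the returned dictionary.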

Contributor

noahho commented May 25, 2025

Hi @sebhaan, thanks again for this excellent contribution integrating TabPFGen!

To get this merged, please address the following:

1. Critical CI Failures (Must Address First):

  • Ruff Linting: Please fix the linting errors by running ruff format . and ruff check . --fix locally and committing the changes.
  • Python 3.9 Dependency Conflict: tabpfgen 0.1.3 requires Python >=3.11, which conflicts with the Python 3.9 test run of tabpfn-extensions' [all] extra. The error is:
    No solution found when resolving dependencies:
      ╰─▶ Because the current Python version (3.9.22) does not satisfy
          Python>=3.10 and tabpfgen==0.1.3 depends on Python>=3.11, we can
          conclude that tabpfgen==0.1.3 cannot be used.
          ...
          And because only tabpfn-extensions[all]==0.1.0 is available and you
          require tabpfn-extensions[all], we can conclude that your requirements
          are unsatisfiable.
    
    Please decide on a strategy:
    • Option A (if feasible): Update tabpfgen for Python 3.9 Compatibility.
      • Since you built and maintain tabpfgen, could you investigate if tabpfgen can be modified to support Python 3.9? This might involve checking for Python 3.10/3.11-specific syntax or dependencies.
      • If feasible, releasing a new tabpfgen version (e.g., 0.1.4) that supports Python 3.9+ would allow this PR to update its dependency and keep tabpfgen in the [all] extra.
    • Option B (If Option A is not practical): Isolate This Extension's Python Requirement.
      • In tabpfn-extensions' pyproject.toml, remove tabpfgen from the project.optional-dependencies.all list (it should remain in project.optional-dependencies.tabpfgen_datasynthesizer).
      • Ensure tests for this extension are correctly skipped on Python <3.11 (the existing TABPFGEN_AVAILABLE flag should facilitate this if tabpfgen isn't installed in those environments).
      • Document in this extension's README that it requires Python >=3.11.
  • License/CLA: Please resolve the pending CLA status by signing the CLA. This is standard for open-source projects and we just want to double-check; the contribution is Apache 2.0, so it is a formality, and it only affects the code in this repo.
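
The test-skipping step in Option B could look roughly like the sketch below. It uses the standard library's `unittest` so that it runs anywhere; with pytest the equivalent tool is `pytest.mark.skipif`. The class name, the `MIN_PYTHON` constant, and the way `TABPFGEN_AVAILABLE` is computed here are assumptions, not the PR's actual test code.

```python
import importlib.util
import sys
import unittest

# Assumption: availability is detected by looking up the package, without importing it.
TABPFGEN_AVAILABLE = importlib.util.find_spec("tabpfgen") is not None
# Minimum Python version named in Option B above.
MIN_PYTHON = (3, 11)


@unittest.skipUnless(
    sys.version_info >= MIN_PYTHON and TABPFGEN_AVAILABLE,
    "tabpfgen is not installed or the Python version is too old",
)
class TestTabPFGenSynthesizer(unittest.TestCase):
    """Hypothetical test class; it is skipped as a whole when the guard fails."""

    def test_dependency_present(self):
        # Only runs when the guard above passed.
        self.assertTrue(TABPFGEN_AVAILABLE)
```

On an environment without tabpfgen the whole class is reported as skipped rather than failed, which keeps the Python 3.9 CI job green.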

2. Code:

  • READMEs:
    • Consolidate primary documentation into src/tabpfn_extensions/tabpfgen_datasynthesizer/README.md (as suggested by @noahho).
    • Significantly trim examples/tabpfgen_datasynthesizer/README.md to a brief purpose statement and link to the main extension README.
    • Relocate general installation instructions for the extension (e.g., pip install ...[tabpfgen_datasynthesizer]) to the top-level PriorLabs/tabpfn-extensions README. The extension's own README should then detail its specific Python version requirements (per point 1B above, if chosen).

@@ -0,0 +1,189 @@
# TabPFGen Data Synthesizer Examples
Contributor

The README is duplicated (with slight differences). Let's keep it only in the src/ folder and remove it from here; there should be one README.

Author

sebhaan commented May 27, 2025

Thank you for the detailed feedback. I've addressed the critical issues as follows:

1. Critical CI Failures - RESOLVED

Ruff Linting: - FIXED

  • Ran ruff format . and ruff check . --fix locally
  • All linting issues have been resolved and changes committed

Python 3.9 Dependency Conflict: - FIXED via Option B

  • Updated tabpfgen to version 0.1.4 with relaxed Python requirement from >=3.11 to >=3.10
  • Implemented Option B as suggested:
    • Removed tabpfgen from the [all] extra in pyproject.toml
    • tabpfgen remains available via [tabpfgen_datasynthesizer] extra only
    • Added Python version compatibility check in the extension code
    • Tests are skipped on Python <3.10 and when the EXTENSION_AVAILABLE flag is false

This approach ensures:

  • Main tabpfn-extensions package maintains Python 3.9+ compatibility
  • Users who need this specific extension can install it with the appropriate Python version
  • CI passes on all supported Python versions
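
A version gate of the kind described above might be sketched like this. The function name `python_is_supported` and the `MIN_PYTHON` constant are hypothetical; only the `EXTENSION_AVAILABLE` flag name comes from this comment, and how the extension actually sets it is not shown in the thread.

```python
import sys

# Minimum version stated for tabpfgen 0.1.4 in this comment.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True when the interpreter's (major, minor) meets the minimum.

    Hypothetical helper; accepts an explicit version tuple so it can be tested.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= minimum


# Assumption: the flag is derived from the version check alone in this sketch.
EXTENSION_AVAILABLE = python_is_supported()
```

Returning a boolean rather than raising at import time lets the rest of tabpfn-extensions import cleanly on Python 3.9 while tests and examples consult the flag.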

License/CLA:

  • Signed the CLA as requested

2. Code Structure - RESOLVED

README Consolidation: - COMPLETED

  • Consolidated the example README into src/tabpfn_extensions/tabpfgen_datasynthesizer/README.md
  • Removed the redundant examples/tabpfgen_datasynthesizer/README.md
  • Updated top-level README with:
    • tabpfgen_datasynthesizer extension listing
    • Installation instructions: pip install tabpfn-extensions[tabpfgen_datasynthesizer]
    • Clear Python >=3.10 requirement documentation

Summary of Changes Made:

  1. ✅ Fixed all ruff linting issues
  2. ✅ Updated tabpfgen dependency to version 0.1.4 (Python >=3.10)
  3. ✅ Implemented Option B dependency isolation strategy
  4. ✅ Consolidated and restructured all README files
  5. ✅ Added proper Python version compatibility checks
  6. ✅ CLA signed

@noahho noahho requested a review from Copilot May 27, 2025 06:27
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds a new TabPFGen Data Synthesizer extension to the TabPFN ecosystem for synthetic tabular data generation, dataset balancing, and quality assessment. Key changes include the implementation of utility functions for data validation and quality metrics, a streamlined wrapper for TabPFGen integration with methods for classification, regression, and dataset balancing, and updated documentation and examples to support the new extension.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| src/tabpfn_extensions/tabpfgen_datasynthesizer/utils.py | New utility functions for validating data, combining datasets, analyzing class distribution, and calculating quality metrics. |
| src/tabpfn_extensions/tabpfgen_datasynthesizer/tabpfgen_wrapper.py | Implementation of the TabPFNDataSynthesizer class with methods for synthetic data generation and dataset balancing. |
| src/tabpfn_extensions/tabpfgen_datasynthesizer/__init__.py | Initialization module that enforces the required Python version and imports key components. |
| src/tabpfn_extensions/tabpfgen_datasynthesizer/README.md | Detailed documentation with installation, usage examples, and API reference for the new extension. |
| pyproject.toml | Updates dependency list to include the new TabPFGen Data Synthesizer requirements. |
| Examples (class_balancing_demo.py, basic_classification_example.py, basic_regression_example.py) | New and updated usage examples demonstrating classification, regression, and dataset balancing with the new extension. |
| README.md | Updated repository documentation to list the new tabpfgen_datasynthesizer extension. |

Comment on lines +206 to +207
for cls, count, pct in zip(unique_classes, counts, percentages):
    pass

Copilot AI May 27, 2025

This loop currently does nothing (contains only a 'pass' statement). Consider either removing it if not needed or adding a comment/implementation to clarify its intended purpose.

Suggested change
- for cls, count, pct in zip(unique_classes, counts, percentages):
-     pass
+ # Removed redundant loop that performed no operations.

Contributor

noahho commented May 27, 2025

Thanks a lot for the quick and detailed improvements! The tests for TabPFGen currently fail on py3.13 because tabpfgen is never installed there. We could check the Python version in our GitHub CI workflow and install it explicitly when we are on >= 3.10, similar to here: https://github.com/PriorLabs/tabpfn-extensions/blob/main/.github/workflows/pull_request.yml#L82
