Add TabPFGen Data Synthesizer Extension #83

sebhaan · 2025-05-25T04:52:44Z

Summary

This PR adds a new extension integrating TabPFGen for synthetic tabular data generation.

Impact

This extension enables TabPFN users to:

Automatically balance imbalanced datasets using TabPFGen's latest features
Generate high-quality synthetic tabular data with zero additional training
Assess synthetic data quality with a suite of metrics and visualisations
Integrate seamlessly with existing TabPFN workflows

Key Features:

Energy-based synthetic data generation
Support for both classification and regression tasks
Automatic dataset balancing for imbalanced classes
Class-balanced sampling option
Comprehensive visualization tools
Built on TabPFN transformer architecture
No additional training required

Installation

# Install specific extension
pip install "tabpfn-extensions[tabpfgen_datasynthesizer]"

# Or install all extensions  
pip install "tabpfn-extensions[all]"

Usage Examples

The extension provides multiple usage patterns, see :

/tabpfn-extensions/src/tabpfn_extensions/tabpfgen_datasynthesizer/README.md
examples/tabpfn_datasynthesizer/

Basic classification dataset example:

from tabpfn_extensions.tabpfgen_datasynthesizer import TabPFNDataSynthesizer
from sklearn.datasets import load_breast_cancer

# Load example data
X, y = load_breast_cancer(return_X_y=True)

# Initialize synthesizer
synthesizer = TabPFNDataSynthesizer(n_sgld_steps=500)

# Generate synthetic classification data
X_synth, y_synth = synthesizer.generate_classification(
    X, y, 
    n_samples=100,
    balance_classes=True,  # Only balances synthetic samples
    visualize=True         # TabPFGen's built-in visualization
)

Project Structure

tabpfn-extensions/
├── src/tabpfn_extensions/tabpfgen_datasynthesizer/
│   ├── __init__.py             
│   ├── tabpfgen_wrapper.py   # TabPFNDataSynthesizer: Wrapper around TabPFGen functionality
│   └── utils.py                          #  Data validation, preparation, and quality assessment   
│   ├── README.md             
├── examples/tabpfn_datasynthesizer/
│   ├── basic_classification_example.py
│   ├── dataset_balancing_demo.py       
│   ├── basic_regression_example.py  
│   ├── README.md         
└── tests/
    └── test_tabpfgen_datasynthesizer.py

Checklist

Follows project structure conventions
All tests passing (pytest tests/test_tabpfgen_datasynthesizer.py -v)
Examples working and documented
Dependencies configured in pyproject.toml
No breaking changes to existing code
Detailed documentation

Dependencies

Added to pyproject.toml: tabpfgen>=0.1.3 in both extension and [all] groups

Documentation

Detailed README with installation and usage examples
Three detailed examples covering different use cases
API documentation for all classes and methods
Troubleshooting guide and performance tips

CLAassistant · 2025-05-25T04:52:50Z

All committers have signed the CLA.

Copilot

Pull Request Overview

Adds a new TabPFGen data synthesizer extension that integrates TabPFGen into TabPFN workflows, providing data validation, synthesis, balancing, quality metrics, and documentation/examples.

Introduces utility functions for data validation, dataset combination, class analysis, and synthetic quality metrics.
Adds TabPFNDataSynthesizer wrapper with methods for classification, regression, and dataset balancing.
Updates packaging (pyproject.toml), includes detailed README, and provides multiple example scripts.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
src/tabpfn_extensions/tabpfgen_datasynthesizer/utils.py	Utility functions for validation, combining, analysis, and metrics
src/tabpfn_extensions/tabpfgen_datasynthesizer/tabpfgen_wrapper.py	`TabPFNDataSynthesizer` wrapper class around TabPFGen
src/tabpfn_extensions/tabpfgen_datasynthesizer/init.py	Package exports, version, and author metadata
src/tabpfn_extensions/tabpfgen_datasynthesizer/README.md	Extension README with installation and usage examples
pyproject.toml	Added `tabpfgen_datasynthesizer` extra and updated `all` extras
examples/tabpfgen_datasynthesizer/*	Example scripts and README demonstrating usage

Copilot · 2025-05-25T04:55:01Z

examples/tabpfgen_datasynthesizer/class_balancing_demo.py

+    balanced_analysis_custom = analyze_class_distribution(y_balanced_custom, "Custom-Balanced Dataset (target=1000)")
+


This call to analyze_class_distribution is duplicated (also on line 84). You can remove one of them to avoid redundant computation and output.

Suggested change

balanced_analysis_custom = analyze_class_distribution(y_balanced_custom, "Custom-Balanced Dataset (target=1000)")

Copilot · 2025-05-25T04:55:02Z

src/tabpfn_extensions/tabpfgen_datasynthesizer/tabpfgen_wrapper.py

+except ImportError as e:
+    TABPFGEN_AVAILABLE = False
+    TabPFGen = None
+    print(f"TabPFGen import failed: {e}")  # Debug info


Consider using warnings.warn(...) or a logging framework instead of print for import failure notifications to avoid writing directly to stdout.

Suggested change

print(f"TabPFGen import failed: {e}") # Debug info

warnings.warn(f"TabPFGen import failed: {e}", ImportWarning)

Copilot · 2025-05-25T04:55:02Z

src/tabpfn_extensions/tabpfgen_datasynthesizer/utils.py

+    print(f"\n=== {title} ===")
+    for cls, count, pct in zip(unique_classes, counts, percentages):
+        print(f"Class {cls}: {count} samples ({pct:.1f}%)")
+
+    print(f"Total: {total_samples} samples, {len(unique_classes)} classes")
+    print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")
+


[nitpick] Utility functions like analyze_class_distribution should avoid printing directly. Consider returning analysis data only and let callers handle display or logging.

Suggested change

print(f"\n=== {title} ===")

for cls, count, pct in zip(unique_classes, counts, percentages):

print(f"Class {cls}: {count} samples ({pct:.1f}%)")

print(f"Total: {total_samples} samples, {len(unique_classes)} classes")

print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")

# Removed direct printing. Callers should handle display or logging.

# Example: Use the returned `analysis` dictionary for custom output.

pass

noahho · 2025-05-25T08:16:35Z

Hi @sebhaan, thanks again for this excellent contribution integrating TabPFGen!

To get this merged, please address the following:

1. Critical CI Failures (Must Address First):

Ruff Linting: Please fix the linting errors by running ruff format . and ruff check . --fix locally and committing the changes.
Python 3.9 Dependency Conflict: tabpfgen 0.1.3 requires Python >=3.11, which conflicts with tabpfn-extensions's Python 3.9 testing for the [all] extra. The error is:
```
No solution found when resolving dependencies:
  ╰─▶ Because the current Python version (3.9.22) does not satisfy
      Python>=3.10 and tabpfgen==0.1.3 depends on Python>=3.11, we can
      conclude that tabpfgen==0.1.3 cannot be used.
      ...
      And because only tabpfn-extensions[all]==0.1.0 is available and you
      require tabpfn-extensions[all], we can conclude that your requirements
      are unsatisfiable.
```
Please decide on a strategy:
- Option A (if feasible): Update tabpfgen for Python 3.9 Compatibility.
  - Since you built and maintain tabpfgen, could you investigate if tabpfgen can be modified to support Python 3.9? This might involve checking for Python 3.10/3.11-specific syntax or dependencies.
  - If feasible, releasing a new tabpfgen version (e.g., 0.1.4) that supports Python 3.9+ would allow this PR to update its dependency and keep tabpfgen in the [all] extra.
- Option B (If Option A is not practical): Isolate This Extension's Python Requirement.
  - In tabpfn-extensions' pyproject.toml, remove tabpfgen from the project.optional-dependencies.all list (it should remain in project.optional-dependencies.tabpfgen_datasynthesizer).
  - Ensure tests for this extension are correctly skipped on Python <3.11 (the existing TABPFGEN_AVAILABLE flag should facilitate this if tabpfgen isn't installed in those environments).
  - Document in this extension's README that it requires Python >=3.11.
License/CLA: Ensure the pending CLA status is resolved by signing CLA (this is standard for OS projects and we just want to double check; contribution is Apache 2.0 so this is a formality; also could only affect the code in this repo of course)

2. Code:

READMEs:
- Consolidate primary documentation into src/tabpfn_extensions/tabpfgen_datasynthesizer/README.md (as suggested by @noahho).
- Significantly trim examples/tabpfgen_datasynthesizer/README.md to a brief purpose statement and link to the main extension README.
- Relocate general installation instructions for the extension (e.g., pip install ...[tabpfgen_datasynthesizer]) to the top-level PriorLabs/tabpfn-extensions README. The extension's own README should then detail its specific Python version requirements (per point 1B above, if chosen).

noahho · 2025-05-25T07:45:13Z

examples/tabpfgen_datasynthesizer/README.md

@@ -0,0 +1,189 @@
+# TabPFGen Data Synthesizer Examples


The README is duplicated (with slight differences), let's only add it to the src/ folder and remove from here, there should be one readme

src/tabpfn_extensions/tabpfgen_datasynthesizer/README.md

…+ requirement isolation

sebhaan · 2025-05-27T05:26:39Z

Thank you for the detailed feedback. I've addressed the critical issues as follows:

1. Critical CI Failures - RESOLVED

Ruff Linting: - FIXED

Ran ruff format . and ruff check . --fix locally
All linting issues have been resolved and changes committed

Python 3.9 Dependency Conflict: - FIXED via Option B

Updated tabpfgen to version 0.1.4 with relaxed Python requirement from >=3.11 to >=3.10
Implemented Option B as suggested:
- Removed tabpfgen from the [all] extra in pyproject.toml
- tabpfgen remains available via [tabpfgen_datasynthesizer] extra only
- Added Python version compatibility check in the extension code
- Tests are properly skipped on Python <3.10 and EXTENSION_AVAILABLE flag

This approach ensures:

Main tabpfn-extensions package maintains Python 3.9+ compatibility
Users who need this specific extension can install it with the appropriate Python version
CI passes on all supported Python versions

License/CLA:

Signed the CLA as requested

2. Code Structure - RESOLVED

README Consolidation: - COMPLETED

Consolidated the example README into src/tabpfn_extensions/tabpfgen_datasynthesizer/README.md
Removed the redundant examples/tabpfgen_datasynthesizer/README.md
Updated top-level README with:
- tabpfgen_datasynthesizer extension listing
- Installation instructions: pip install tabpfn-extensions[tabpfgen_datasynthesizer]
- Clear Python >=3.10 requirement documentation

Summary of Changes Made:

✅ Fixed all ruff linting issues
✅ Updated tabpfgen dependency to version 0.1.4 (Python >=3.10)
✅ Implemented Option B dependency isolation strategy
✅ Consolidated and restructured all README files
✅ Added proper Python version compatibility checks
✅ CLA signed

Copilot

Pull Request Overview

This PR adds a new TabPFGen Data Synthesizer extension to the TabPFN ecosystem for synthetic tabular data generation, dataset balancing, and quality assessment. Key changes include the implementation of utility functions for data validation and quality metrics, a streamlined wrapper for TabPFGen integration with methods for classification, regression, and dataset balancing, and updated documentation and examples to support the new extension.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
src/tabpfn_extensions/tabpfgen_datasynthesizer/utils.py	New utility functions for validating data, combining datasets, analyzing class distribution, and calculating quality metrics.
src/tabpfn_extensions/tabpfgen_datasynthesizer/tabpfgen_wrapper.py	Implementation of the TabPFNDataSynthesizer class with methods for synthetic data generation and dataset balancing.
src/tabpfn_extensions/tabpfgen_datasynthesizer/init.py	Initialization module that enforces the required Python version and imports key components.
src/tabpfn_extensions/tabpfgen_datasynthesizer/README.md	Detailed documentation with installation, usage examples, and API reference for the new extension.
pyproject.toml	Updates dependency list to include the new TabPFGen Data Synthesizer requirements.
Examples (class_balancing_demo.py, basic_classification_example.py, basic_regression_example.py)	New and updated usage examples demonstrating classification, regression, and dataset balancing with the new extension.
README.md	Updated repository documentation to list the new tabpfgen_datasynthesizer extension.

Copilot · 2025-05-27T06:27:36Z

src/tabpfn_extensions/tabpfgen_datasynthesizer/utils.py

+    for cls, count, pct in zip(unique_classes, counts, percentages):
+        pass


This loop currently does nothing (contains only a 'pass' statement). Consider either removing it if not needed or adding a comment/implementation to clarify its intended purpose.

Suggested change

for cls, count, pct in zip(unique_classes, counts, percentages):

pass

# Removed redundant loop that performed no operations.

noahho · 2025-05-27T06:35:44Z

Thanks a lot for the quick and detailed improvements! The tests for TabPFNGen fail on py3.13 at this point because it is never installed. We might check the python version and install explicitly if we are on >= 3.10 in our github CI workflow. Similar to here: https://github.com/PriorLabs/tabpfn-extensions/blob/main/.github/workflows/pull_request.yml#L82

sebhaan added 5 commits May 25, 2025 13:57

Add main extension script for tabpfgen

dd5bb72

Add utils and readme to tabpfgen extension package

e8e8307

Add examples for tabpfgen extension

65f8612

add test functions for tabpfgen extension

59c4aaa

Update pyproject file to include tabpfgen extension

6e17db5

Copilot AI review requested due to automatic review settings May 25, 2025 04:52

Copilot AI reviewed May 25, 2025

View reviewed changes

sebhaan mentioned this pull request May 25, 2025

Showcase TabPFNGen? sebhaan/TabPFGen#2

Closed

noahho reviewed May 25, 2025

View reviewed changes

sebhaan added 6 commits May 27, 2025 11:00

Fix linting issues with ruff format and ruff check --fix

2acba05

Consolidated example README into src README and rm of example README

dea578c

Add tabpfgen_datasynthesizer extension and installation instruction

127b049

Remove tabpfgen dependency from all

28b1748

Add python version compatibility check

b6ee841

Update test using tabpfgen_datasynthesizer extension with Python 3.10…

5c062d5

…+ requirement isolation

noahho requested a review from Copilot May 27, 2025 06:27

Copilot AI reviewed May 27, 2025

View reviewed changes

sebhaan mentioned this pull request Jul 15, 2025

Clarifications sebhaan/TabPFGen#3

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add TabPFGen Data Synthesizer Extension #83

Add TabPFGen Data Synthesizer Extension #83

Uh oh!

sebhaan commented May 25, 2025

Uh oh!

CLAassistant commented May 25, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI May 25, 2025

Uh oh!

Copilot AI May 25, 2025

Uh oh!

Copilot AI May 25, 2025

Uh oh!

noahho commented May 25, 2025

Uh oh!

noahho May 25, 2025

Uh oh!

Uh oh!

sebhaan commented May 27, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI May 27, 2025

Uh oh!

noahho commented May 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		balanced_analysis_custom = analyze_class_distribution(y_balanced_custom, "Custom-Balanced Dataset (target=1000)")

	print(f"TabPFGen import failed: {e}") # Debug info
	warnings.warn(f"TabPFGen import failed: {e}", ImportWarning)

		for cls, count, pct in zip(unique_classes, counts, percentages):
		pass

	for cls, count, pct in zip(unique_classes, counts, percentages):
	pass
	# Removed redundant loop that performed no operations.

Add TabPFGen Data Synthesizer Extension #83

Are you sure you want to change the base?

Add TabPFGen Data Synthesizer Extension #83

Uh oh!

Conversation

sebhaan commented May 25, 2025

Summary

Impact

Installation

Usage Examples

Project Structure

Checklist

Dependencies

Documentation

Uh oh!

CLAassistant commented May 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI May 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 25, 2025

Choose a reason for hiding this comment

Uh oh!

noahho commented May 25, 2025

Uh oh!

noahho May 25, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sebhaan commented May 27, 2025

1. Critical CI Failures - RESOLVED

2. Code Structure - RESOLVED

Summary of Changes Made:

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI May 27, 2025

Choose a reason for hiding this comment

Uh oh!

noahho commented May 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

CLAassistant commented May 25, 2025 •

edited

Loading