Add TabPFGen Data Synthesizer Extension #83
Pull Request Overview
Adds a new TabPFGen data synthesizer extension that integrates TabPFGen into TabPFN workflows, providing data validation, synthesis, balancing, quality metrics, and documentation/examples.
- Introduces utility functions for data validation, dataset combination, class analysis, and synthetic quality metrics.
- Adds `TabPFNDataSynthesizer` wrapper with methods for classification, regression, and dataset balancing.
- Updates packaging (`pyproject.toml`), includes detailed README, and provides multiple example scripts.
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/tabpfn_extensions/tabpfgen_datasynthesizer/utils.py | Utility functions for validation, combining, analysis, and metrics |
| src/tabpfn_extensions/tabpfgen_datasynthesizer/tabpfgen_wrapper.py | TabPFNDataSynthesizer wrapper class around TabPFGen |
| src/tabpfn_extensions/tabpfgen_datasynthesizer/__init__.py | Package exports, version, and author metadata |
| src/tabpfn_extensions/tabpfgen_datasynthesizer/README.md | Extension README with installation and usage examples |
| pyproject.toml | Added tabpfgen_datasynthesizer extra and updated all extras |
| examples/tabpfgen_datasynthesizer/* | Example scripts and README demonstrating usage |
    balanced_analysis_custom = analyze_class_distribution(y_balanced_custom, "Custom-Balanced Dataset (target=1000)")
Copilot AI, May 25, 2025:
This call to analyze_class_distribution is duplicated (also on line 84). You can remove one of them to avoid redundant computation and output.
Suggested change:

    - balanced_analysis_custom = analyze_class_distribution(y_balanced_custom, "Custom-Balanced Dataset (target=1000)")
    except ImportError as e:
        TABPFGEN_AVAILABLE = False
        TabPFGen = None
        print(f"TabPFGen import failed: {e}")  # Debug info
Copilot AI, May 25, 2025:
Consider using warnings.warn(...) or a logging framework instead of print for import failure notifications to avoid writing directly to stdout.
Suggested change:

    - print(f"TabPFGen import failed: {e}")  # Debug info
    + warnings.warn(f"TabPFGen import failed: {e}", ImportWarning)
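For illustration, the suggestion above could be generalized roughly as follows. This is a minimal sketch rather than the code merged in this PR: the helper `optional_import` is hypothetical, while `tabpfgen` and `TABPFGEN_AVAILABLE` are taken from the diff.

```python
import importlib
import warnings


def optional_import(module_name: str):
    """Return the imported module, or None (after an ImportWarning) if it is unavailable.

    Hypothetical helper for illustration; it is not part of the PR.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # stacklevel=2 attributes the warning to the caller, not to this helper.
        warnings.warn(f"{module_name} import failed: {exc}", ImportWarning, stacklevel=2)
        return None


# "tabpfgen" is the optional dependency discussed in this PR; the flag name
# mirrors TABPFGEN_AVAILABLE from the diff above.
tabpfgen = optional_import("tabpfgen")
TABPFGEN_AVAILABLE = tabpfgen is not None
```

Note that Python's default warning filters ignore `ImportWarning`, so the message only shows up when warnings are enabled explicitly (for example with `-W default`). A `logging` call or a different warning category may be preferable if the failure should be visible by default.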
    print(f"\n=== {title} ===")
    for cls, count, pct in zip(unique_classes, counts, percentages):
        print(f"Class {cls}: {count} samples ({pct:.1f}%)")

    print(f"Total: {total_samples} samples, {len(unique_classes)} classes")
    print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")
Copilot AI, May 25, 2025:
[nitpick] Utility functions like analyze_class_distribution should avoid printing directly. Consider returning analysis data only and let callers handle display or logging.
Suggested change:

    - print(f"\n=== {title} ===")
    - for cls, count, pct in zip(unique_classes, counts, percentages):
    -     print(f"Class {cls}: {count} samples ({pct:.1f}%)")
    - print(f"Total: {total_samples} samples, {len(unique_classes)} classes")
    - print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")
    + # Removed direct printing. Callers should handle display or logging.
    + # Example: Use the returned `analysis` dictionary for custom output.
    + pass
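One way to follow this advice is to split the computation from the display. The sketch below is an assumption-laden illustration, not the PR's `analyze_class_distribution`: the function names `summarize_class_distribution` and `format_class_distribution` and the dictionary keys are hypothetical, and it uses only the standard library where the PR's utility may use NumPy.

```python
from collections import Counter


def summarize_class_distribution(y):
    """Compute class statistics and return them without printing anything.

    Hypothetical stand-in for a print-free analysis utility.
    """
    counts = Counter(y)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("y must contain at least one label")
    classes = sorted(counts)
    return {
        "classes": classes,
        "counts": [counts[c] for c in classes],
        "percentages": [100.0 * counts[c] / total for c in classes],
        "total_samples": total,
        # Ratio of the largest class to the smallest class.
        "imbalance_ratio": max(counts.values()) / min(counts.values()),
    }


def format_class_distribution(analysis, title):
    """Render the analysis as text; the caller decides whether to print or log it."""
    lines = [f"=== {title} ==="]
    for cls, count, pct in zip(
        analysis["classes"], analysis["counts"], analysis["percentages"]
    ):
        lines.append(f"Class {cls}: {count} samples ({pct:.1f}%)")
    lines.append(
        f"Total: {analysis['total_samples']} samples, {len(analysis['classes'])} classes"
    )
    lines.append(f"Imbalance ratio: {analysis['imbalance_ratio']:.1f}:1")
    return "\n".join(lines)
```

An example script could then call `print(format_class_distribution(summarize_class_distribution(y), "Original Dataset"))`, while library code uses only the returned dictionary.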
Hi @sebhaan, thanks again for this excellent contribution. To get this merged, please address the following:
1. Critical CI Failures (Must Address First):
2. Code:
    @@ -0,0 +1,189 @@
    # TabPFGen Data Synthesizer Examples
The README is duplicated (with slight differences). Let's keep it only in the src/ folder and remove it from here; there should be a single README.
…+ requirement isolation
Thank you for the detailed feedback. I've addressed the critical issues as follows:
1. Critical CI Failures - RESOLVED
Ruff Linting: FIXED
Python 3.9 Dependency Conflict: FIXED via Option B
This approach ensures:
License/CLA:
2. Code Structure - RESOLVED
README Consolidation: COMPLETED
Summary of Changes Made:
Pull Request Overview
This PR adds a new TabPFGen Data Synthesizer extension to the TabPFN ecosystem for synthetic tabular data generation, dataset balancing, and quality assessment. Key changes include the implementation of utility functions for data validation and quality metrics, a streamlined wrapper for TabPFGen integration with methods for classification, regression, and dataset balancing, and updated documentation and examples to support the new extension.
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/tabpfn_extensions/tabpfgen_datasynthesizer/utils.py | New utility functions for validating data, combining datasets, analyzing class distribution, and calculating quality metrics. |
| src/tabpfn_extensions/tabpfgen_datasynthesizer/tabpfgen_wrapper.py | Implementation of the TabPFNDataSynthesizer class with methods for synthetic data generation and dataset balancing. |
| src/tabpfn_extensions/tabpfgen_datasynthesizer/__init__.py | Initialization module that enforces the required Python version and imports key components. |
| src/tabpfn_extensions/tabpfgen_datasynthesizer/README.md | Detailed documentation with installation, usage examples, and API reference for the new extension. |
| pyproject.toml | Updates dependency list to include the new TabPFGen Data Synthesizer requirements. |
| Examples (class_balancing_demo.py, basic_classification_example.py, basic_regression_example.py) | New and updated usage examples demonstrating classification, regression, and dataset balancing with the new extension. |
| README.md | Updated repository documentation to list the new tabpfgen_datasynthesizer extension. |
    for cls, count, pct in zip(unique_classes, counts, percentages):
        pass
Copilot AI, May 27, 2025:
This loop currently does nothing (contains only a 'pass' statement). Consider either removing it if not needed or adding a comment/implementation to clarify its intended purpose.
Suggested change:

    - for cls, count, pct in zip(unique_classes, counts, percentages):
    -     pass
    + # Removed redundant loop that performed no operations.
Thanks a lot for the quick and detailed improvements! The TabPFGen tests currently fail on Python 3.13 because the package is never installed in CI. We could check the Python version in our GitHub CI workflow and install it explicitly when running on >= 3.10, similar to here: https://github.com/PriorLabs/tabpfn-extensions/blob/main/.github/workflows/pull_request.yml#L82
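As a complement to the workflow change proposed above (which lives in the CI YAML and is not shown here), a test-side guard could express the same condition. This is only a sketch: the helper name `tabpfgen_tests_should_run` is hypothetical, the `>= 3.10` threshold comes from the comment above, and the module name `tabpfgen` is assumed to match the installed package.

```python
import importlib.util
import sys

# Minimum Python version mentioned in the review comment above.
MIN_PYTHON = (3, 10)


def tabpfgen_tests_should_run(version_info=None, module_name="tabpfgen"):
    """Return True only if Python is new enough and the module can be found.

    Hypothetical helper; `module_name` is assumed to be the import name
    of the optional dependency.
    """
    version_info = sys.version_info if version_info is None else version_info
    if tuple(version_info[:2]) < MIN_PYTHON:
        return False
    # find_spec returns None for a top-level module that is not installed.
    return importlib.util.find_spec(module_name) is not None
```

A test module could use the result in a skip condition so that a missing optional dependency produces a skip rather than a failure. That would not replace installing the package in CI, since skipped tests verify nothing.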
Summary
This PR adds a new extension integrating TabPFGen for synthetic tabular data generation.
Impact
This extension enables TabPFN users to:
Key Features:
Installation
Usage Examples
The extension provides multiple usage patterns, see :
Basic classification dataset example:
Project Structure
Checklist
Dependencies
Added to pyproject.toml: tabpfgen>=0.1.3 in both extension and [all] groups
Documentation