189 changes: 189 additions & 0 deletions examples/tabpfgen_datasynthesizer/README.md
@@ -0,0 +1,189 @@
# TabPFGen Data Synthesizer Examples
> **Contributor review comment:** The README is duplicated (with slight differences); let's only add it to the src/ folder and remove it from here, so there is a single README.

This directory contains examples demonstrating how to use the TabPFGen Data Synthesizer extension for TabPFN.

The TabPFGen Data Synthesizer extension integrates [TabPFGen](https://github.com/sebhaan/TabPFGen) with the TabPFN ecosystem, enabling synthetic tabular data generation with automatic dataset balancing capabilities.

Author: Sebastian Haan

## Key Features

- **Synthetic Data Generation**: Support for both classification and regression tasks
- **Automatic Dataset Balancing**: Built-in handling of imbalanced datasets
- **Built-in Visualizations**: Uses TabPFGen's comprehensive visualization suite
- **Quality Assessment**: Comprehensive synthetic data quality metrics

## Examples

### 1. Basic Classification Example
```bash
python basic_classification_example.py
```

**Demonstrates:**
- Loading and analyzing datasets
- Generating synthetic classification data
- Using TabPFGen's built-in visualizations
- Quality assessment metrics

### 2. Dataset Balancing Demo
```bash
python class_balancing_demo.py
```

**Demonstrates:**
- Creating imbalanced datasets
- Using TabPFGen's new `balance_dataset()` method
- Automatic vs. custom target balancing
- Effectiveness analysis

### 3. Basic Regression Example
```bash
python basic_regression_example.py
```

**Demonstrates:**
- Synthetic regression data generation
- Quantile-based sampling
- Target correlation preservation
- Statistical quality comparisons


## Installation Requirements

```bash
# Install TabPFN (choose one)
pip install tabpfn # For local inference
pip install tabpfn-client # For cloud-based inference

# Install TabPFGen (v0.1.3+)
pip install tabpfgen>=0.1.3

# Install TabPFN Extensions
pip install "tabpfn-extensions[all] @ git+https://github.com/PriorLabs/tabpfn-extensions.git"
```
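
After installing, a quick sanity check can confirm everything imports correctly. This is a minimal sketch; the `tabpfgen` module name and its `__version__` attribute are assumptions based on the package name above:

```python
# Sanity check: both TabPFGen and the extension should import without errors
import tabpfgen  # assumed module name for the `tabpfgen` package
from tabpfn_extensions.tabpfgen_datasynthesizer import TabPFNDataSynthesizer

print("tabpfgen version:", getattr(tabpfgen, "__version__", "unknown"))
```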

## Quick Start

### Basic Generation
```python
from tabpfn_extensions.tabpfgen_datasynthesizer import TabPFNDataSynthesizer
from sklearn.datasets import load_breast_cancer

# Load data
X, y = load_breast_cancer(return_X_y=True)

# Initialize synthesizer
synthesizer = TabPFNDataSynthesizer(n_sgld_steps=300)

# Generate synthetic data with TabPFGen's visualizations
X_synth, y_synth = synthesizer.generate_classification(
X, y, n_samples=100, visualize=True
)
```

### Dataset Balancing
```python
from tabpfn_extensions.tabpfgen_datasynthesizer import TabPFNDataSynthesizer
from sklearn.datasets import make_classification

# Create imbalanced dataset
X, y = make_classification(n_samples=1000, n_classes=3,
n_informative=3, n_redundant=1,
weights=[0.7, 0.2, 0.1], random_state=42)

# Initialize synthesizer
synthesizer = TabPFNDataSynthesizer(n_sgld_steps=300)

# Balance automatically
X_synth, y_synth, X_balanced, y_balanced = synthesizer.balance_dataset(
X, y, visualize=True
)

print(f"Original: {len(X)} samples")
print(f"Balanced: {len(X_balanced)} samples")
print(f"Added: {len(X_synth)} synthetic samples")
```

### Quality Assessment
```python
from tabpfn_extensions.tabpfgen_datasynthesizer.utils import (
validate_tabpfn_data,
analyze_class_distribution,
calculate_synthetic_quality_metrics
)

# Validate data for TabPFN compatibility
is_valid, message = validate_tabpfn_data(X, y)
print(f"Validation: {message}")

# Analyze class distribution
analysis = analyze_class_distribution(y, "Dataset Name")

# Calculate quality metrics
quality = calculate_synthetic_quality_metrics(X, X_synth, y, y_synth)
```

## Parameters

### TabPFNDataSynthesizer Parameters

- `n_sgld_steps` (int, default=500): Number of SGLD iterations for generation
- `sgld_step_size` (float, default=0.01): Step size for SGLD updates
- `sgld_noise_scale` (float, default=0.01): Scale of noise in SGLD
- `device` (str, default='auto'): Computing device ('cpu', 'cuda', or 'auto')
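
For reference, a minimal sketch constructing the synthesizer with every parameter written out explicitly (the values are illustrative, not tuned recommendations):

```python
from tabpfn_extensions.tabpfgen_datasynthesizer import TabPFNDataSynthesizer

# Explicit construction; omitted arguments fall back to the defaults listed above
synthesizer = TabPFNDataSynthesizer(
    n_sgld_steps=500,       # SGLD iterations for generation
    sgld_step_size=0.01,    # step size for SGLD updates
    sgld_noise_scale=0.01,  # scale of injected SGLD noise
    device="auto",          # 'cpu', 'cuda', or 'auto'
)
```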

### balance_dataset() Parameters

- `target_per_class` (int, optional): Custom target samples per class
- `visualize` (bool, default=False): Enable TabPFGen's built-in visualizations
- `feature_names` (list, optional): Feature names for visualization
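
As a sketch of custom balancing targets, reusing the `synthesizer` and the imbalanced `X, y` from the Quick Start above (the target value is arbitrary; exact final counts may differ, as noted under "Balancing Results"):

```python
# Ask for roughly 400 samples per class instead of the automatic target
X_synth, y_synth, X_balanced, y_balanced = synthesizer.balance_dataset(
    X, y,
    target_per_class=400,
    visualize=False,
)
```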

### Generation Parameters

- `n_samples` (int): Number of synthetic samples to generate
- `balance_classes` (bool, default=True): Balance only synthetic samples
- `use_quantiles` (bool, default=True): Quantile-based sampling for regression
- `visualize` (bool, default=False): Enable visualization plots
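
These parameters apply to the `generate_classification()` and `generate_regression()` calls used in the examples. A short regression sketch with quantile-based sampling, reusing the `synthesizer` from above (the dataset choice is illustrative):

```python
from sklearn.datasets import load_diabetes

X_reg, y_reg = load_diabetes(return_X_y=True)

# Quantile-based sampling helps preserve the shape of the target distribution
X_reg_synth, y_reg_synth = synthesizer.generate_regression(
    X_reg, y_reg,
    n_samples=150,
    use_quantiles=True,
    visualize=False,
)
```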

## Important Notes

### About balance_classes vs balance_dataset()

- **`balance_classes=True`**: Only balances the synthetic samples generated
- **`balance_dataset()`**: Balances the entire dataset by generating synthetic samples for minority classes

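In code, the difference looks like this (a minimal sketch reusing the `synthesizer` and data from the Quick Start):

```python
# Option 1: balance only the generated synthetic samples;
# the combined original + synthetic set can still be skewed
X_synth, y_synth = synthesizer.generate_classification(
    X, y, n_samples=300, balance_classes=True
)

# Option 2: balance the whole dataset by generating synthetic
# samples for the minority classes
X_synth, y_synth, X_balanced, y_balanced = synthesizer.balance_dataset(X, y)
```
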
### Balancing Results

The final class distribution may be **approximately balanced** rather than perfectly balanced. This is due to TabPFN's label refinement process, which prioritizes data quality and realism over exact class counts.

## Tips for Best Results

1. **SGLD Steps**: Use 300-500 steps for good quality; 500+ for production
2. **Device**: Use 'cuda' for significant speedup on GPU systems
3. **Validation**: Always validate data compatibility with `validate_tabpfn_data()`
4. **Balancing**: Use `balance_dataset()` for imbalanced datasets
5. **Quality Check**: Monitor synthetic data quality with built-in metrics

## Troubleshooting

### Common Issues

1. **TabPFGen Import Error**:
```bash
pip install tabpfgen>=0.1.3
```

2. **Memory Issues**: Reduce `n_samples` or `n_sgld_steps`

3. **Generation Quality**: Increase `n_sgld_steps` or adjust step size

4. **Imbalanced Results**: Use `balance_dataset()` instead of `generate_classification()`

### Performance Optimization

- **Development**: Use 100-300 SGLD steps for faster iteration
- **Production**: Use 500+ SGLD steps for best quality
- **GPU**: Enable with `device='cuda'` for 5-10x speedup
- **Batch Processing**: Generate larger batches rather than multiple small ones
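
For GPU use, a small sketch for selecting the device explicitly (this assumes PyTorch is available, which TabPFN itself requires):

```python
import torch

from tabpfn_extensions.tabpfgen_datasynthesizer import TabPFNDataSynthesizer

# Prefer the GPU when one is present; otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
synthesizer = TabPFNDataSynthesizer(n_sgld_steps=500, device=device)
```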

83 changes: 83 additions & 0 deletions examples/tabpfgen_datasynthesizer/basic_classification_example.py
@@ -0,0 +1,83 @@
"""
Basic Classification Example with TabPFGen Data Synthesizer

This example demonstrates how to use TabPFGen for synthetic data generation
in classification tasks, leveraging the actual TabPFGen package features.
"""

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Import TabPFN Extensions
from tabpfn_extensions.tabpfgen_datasynthesizer import TabPFNDataSynthesizer
from tabpfn_extensions.tabpfgen_datasynthesizer.utils import analyze_class_distribution

def main():
"""Run basic classification example."""
print("=== TabPFGen Classification Example ===\n")

# Load breast cancer dataset
print("Loading breast cancer dataset...")
X, y = load_breast_cancer(return_X_y=True)
feature_names = load_breast_cancer().feature_names

# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training data: {X_train.shape[0]} samples, {X_train.shape[1]} features")
print(f"Test data: {X_test.shape[0]} samples")

# Analyze original distribution
analyze_class_distribution(y_train, "Original Training Data")

# Initialize TabPFGen synthesizer
print("\nInitializing TabPFGen synthesizer...")
synthesizer = TabPFNDataSynthesizer(
n_sgld_steps=300, # Reduced for faster demo
device='auto'
)

# Generate synthetic data using TabPFGen's built-in methods
print("\nGenerating synthetic classification data...")
n_synthetic = 200
X_synth, y_synth = synthesizer.generate_classification(
X_train, y_train,
n_samples=n_synthetic,
balance_classes=True, # This balances only the synthetic samples
visualize=True, # Use TabPFGen's built-in visualization
feature_names=list(feature_names)
)

print(f"\nGenerated {len(X_synth)} synthetic samples")
analyze_class_distribution(y_synth, "Synthetic Data")

# Combine original and synthetic data
from tabpfn_extensions.tabpfgen_datasynthesizer.utils import combine_datasets
X_augmented, y_augmented = combine_datasets(
X_train, y_train, X_synth, y_synth, strategy='append'
)

analyze_class_distribution(y_augmented, "Augmented Training Data")

print("\n✅ Synthetic data generation completed successfully!")

# Calculate quality metrics
from tabpfn_extensions.tabpfgen_datasynthesizer.utils import calculate_synthetic_quality_metrics

print("\n" + "="*60)
print("SYNTHETIC DATA QUALITY METRICS")
print("="*60)

quality_metrics = calculate_synthetic_quality_metrics(
X_train, X_synth, y_train, y_synth
)

for metric, value in quality_metrics.items():
print(f"{metric}: {value:.4f}")

if __name__ == "__main__":
main()
103 changes: 103 additions & 0 deletions examples/tabpfgen_datasynthesizer/basic_regression_example.py
@@ -0,0 +1,103 @@
"""
Basic Regression Example with TabPFGen Data Synthesizer

This example demonstrates how to use TabPFGen for synthetic data generation
in regression tasks, using TabPFGen's built-in features.
"""

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Import TabPFN Extensions
from tabpfn_extensions.tabpfgen_datasynthesizer import TabPFNDataSynthesizer
from tabpfn_extensions.tabpfgen_datasynthesizer.utils import calculate_synthetic_quality_metrics

def main():
"""Run basic regression example."""
print("=== TabPFGen Regression Example ===\n")

# Load diabetes dataset
print("Loading diabetes dataset...")
X, y = load_diabetes(return_X_y=True)
feature_names = load_diabetes().feature_names

# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)

print(f"Training data: {X_train.shape[0]} samples, {X_train.shape[1]} features")
print(f"Test data: {X_test.shape[0]} samples")
print(f"Target range: [{y_train.min():.1f}, {y_train.max():.1f}]")

# Initialize TabPFGen synthesizer
print("\nInitializing TabPFGen synthesizer...")
synthesizer = TabPFNDataSynthesizer(
n_sgld_steps=300, # Good balance for regression
device='auto'
)

# Generate synthetic regression data
print("\nGenerating synthetic regression data...")
n_synthetic = 150
X_synth, y_synth = synthesizer.generate_regression(
X_train, y_train,
n_samples=n_synthetic,
use_quantiles=True, # Important for regression quality
visualize=True, # Use TabPFGen's built-in visualization
feature_names=list(feature_names)
)

print(f"\nGenerated {len(X_synth)} synthetic samples")
print(f"Synthetic target range: [{y_synth.min():.1f}, {y_synth.max():.1f}]")

# Combine original and synthetic data
from tabpfn_extensions.tabpfgen_datasynthesizer.utils import combine_datasets
X_augmented, y_augmented = combine_datasets(
X_train, y_train, X_synth, y_synth, strategy='append'
)

print(f"Combined dataset: {len(X_augmented)} samples")
print(f"Combined target range: [{y_augmented.min():.1f}, {y_augmented.max():.1f}]")

# Calculate quality metrics
print("\n" + "="*60)
print("SYNTHETIC DATA QUALITY METRICS")
print("="*60)

quality_metrics = calculate_synthetic_quality_metrics(
X_train, X_synth, y_train, y_synth
)

print("\nFeature quality metrics:")
for metric, value in quality_metrics.items():
print(f"{metric}: {value:.4f}")

# Statistical comparison
print(f"\nStatistical comparison:")
print(f"Original data - Mean: {np.mean(X_train):.3f}, Std: {np.std(X_train):.3f}")
print(f"Synthetic data - Mean: {np.mean(X_synth):.3f}, Std: {np.std(X_synth):.3f}")
print(f"Target correlation preservation:")

# Check target correlations
orig_target_corr = []
synth_target_corr = []

for i in range(X_train.shape[1]):
orig_corr = np.corrcoef(X_train[:, i], y_train)[0, 1]
synth_corr = np.corrcoef(X_synth[:, i], y_synth)[0, 1]
orig_target_corr.append(orig_corr)
synth_target_corr.append(synth_corr)

print(f"Average target correlation - Original: {np.mean(np.abs(orig_target_corr)):.3f}")
print(f"Average target correlation - Synthetic: {np.mean(np.abs(synth_target_corr)):.3f}")

correlation_preservation = 1 - np.mean(np.abs(np.array(orig_target_corr) - np.array(synth_target_corr)))
print(f"Correlation preservation score: {correlation_preservation:.3f}")

print("\n✅ Synthetic regression data generation completed successfully!")


if __name__ == "__main__":
main()