2 changes: 2 additions & 0 deletions pyproject.toml
@@ -46,6 +46,8 @@ dependencies = [
     "datasets (>=3.2.0,<4.0.0)",
     "xxhash (>=3.5.0,<4.0.0)",
     "python-dotenv (>=1.0.1,<2.0.0)",
+    "ipykernel (>=6.29.5,<7.0.0)",
+    "jupyter (>=1.1.1,<2.0.0)",
 ]

[project.urls]
217 changes: 217 additions & 0 deletions user_guides/advanced/06_balancer.py
@@ -0,0 +1,217 @@
# %% [markdown]
"""
# Balancing Datasets with DatasetBalancer

This guide demonstrates how to use the DatasetBalancer class to balance class distribution in your datasets through LLM-based data augmentation.

## Why Balance Datasets?

Imbalanced datasets can lead to biased models that perform well on majority classes but poorly on minority classes. DatasetBalancer helps address this issue by generating additional examples for underrepresented classes using large language models.
"""

# %%
from autointent import Dataset
from autointent.generation.utterances.balancer import DatasetBalancer
from autointent.generation.utterances.generator import Generator
from autointent.generation.chat_templates import EnglishSynthesizerTemplate

# %% [markdown]
"""
## Creating a Sample Imbalanced Dataset

Let's create a small imbalanced dataset to demonstrate the balancing process:
"""

# %%
# Create a simple imbalanced dataset
sample_data = {
    "intents": [
        {"id": 0, "name": "restaurant_booking", "description": "Booking a table at a restaurant"},
        {"id": 1, "name": "weather_query", "description": "Checking weather conditions"},
        {"id": 2, "name": "navigation", "description": "Getting directions to a location"},
    ],
    "train": [
        # Restaurant booking examples (5)
        {"utterance": "Book a table for two tonight", "label": 0},
        {"utterance": "I need a reservation at Le Bistro", "label": 0},
        {"utterance": "Can you reserve a table for me?", "label": 0},
        {"utterance": "I want to book a restaurant for my anniversary", "label": 0},
        {"utterance": "Make a dinner reservation for 8pm", "label": 0},
        # Weather query examples (3)
        {"utterance": "What's the weather like today?", "label": 1},
        {"utterance": "Will it rain tomorrow?", "label": 1},
        {"utterance": "Weather forecast for New York", "label": 1},
        # Navigation example (1)
        {"utterance": "How do I get to the museum?", "label": 2},
    ],
}

# Create the dataset
dataset = Dataset.from_dict(sample_data)

# %% [markdown]
"""
## Setting up the Generator and Template

DatasetBalancer requires two main components:
1. A Generator - responsible for creating new utterances using an LLM
2. A Template - defines the prompt format sent to the LLM

Let's set up these components:
"""

# %%
# Initialize a generator (uses OpenAI API by default)
generator = Generator()

# Create a template for generating utterances
template = EnglishSynthesizerTemplate(dataset=dataset, split="train")

# %% [markdown]
"""
## Creating the DatasetBalancer

Now we can create our DatasetBalancer instance:
"""

# %%
balancer = DatasetBalancer(
    generator=generator,
    prompt_maker=template,
    async_mode=False,  # Set to True for faster generation with async processing
    max_samples_per_class=5,  # Each class will have exactly 5 samples after balancing
)

# %% [markdown]
"""
## Checking Initial Class Distribution

Let's examine the class distribution before balancing:
"""

# %%
# Check the initial distribution of classes in the training set
initial_distribution = {}
for sample in dataset["train"]:
    label = sample[Dataset.label_feature]
    initial_distribution[label] = initial_distribution.get(label, 0) + 1

print("Initial class distribution:")
for class_id, count in sorted(initial_distribution.items()):
    intent = next(i for i in dataset.intents if i.id == class_id)
    print(f"Class {class_id} ({intent.name}): {count} samples")

print(f"\nMost represented class: {max(initial_distribution.values())} samples")
print(f"Least represented class: {min(initial_distribution.values())} samples")
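
# %% [markdown]
"""
The same count can be written more compactly with `collections.Counter`. The sketch below is standalone: it uses a plain list of labels that mirrors the sample training split (5, 3, and 1 utterances) instead of a `Dataset` object, and the `imbalance_ratio` name is introduced here only for illustration.

```python
from collections import Counter

# Labels mirroring the sample training split above: 5, 3, and 1 utterances.
labels = [0] * 5 + [1] * 3 + [2] * 1

distribution = Counter(labels)

# Ratio between the largest and smallest class; 1.0 means perfectly balanced.
imbalance_ratio = max(distribution.values()) / min(distribution.values())

for class_id, count in sorted(distribution.items()):
    print(f"Class {class_id}: {count} samples")
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio well above 1 is a quick signal that balancing may be worth trying.
"""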

# %% [markdown]
"""
## Balancing the Dataset

Now we'll use the DatasetBalancer to augment our dataset:
"""

# %%
# Create a copy of the dataset
dataset_copy = Dataset.from_dict(dataset.to_dict())

# Balance the training split
balanced_dataset = balancer.balance(
    dataset=dataset_copy,
    split="train",
    batch_size=2,  # Process generations in batches of 2
)
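
# %% [markdown]
"""
To make the `batch_size` argument more concrete, here is a minimal sketch of batching in general. It is not DatasetBalancer's internal code, which this guide does not show; the `chunked` helper and the prompt strings are hypothetical.

```python
from collections.abc import Iterator


def chunked(items: list[str], batch_size: int) -> Iterator[list[str]]:
    # Yield consecutive slices of at most batch_size items.
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# Hypothetical prompts: the navigation class needs 4 more utterances to reach 5.
pending_prompts = [f"navigation prompt {i}" for i in range(1, 5)]
batches = list(chunked(pending_prompts, batch_size=2))
print(batches)
```

With four pending prompts and a batch size of 2, the work is split into two batches of two prompts each.
"""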

# %% [markdown]
"""
## Checking the Results

Let's examine the class distribution after balancing:
"""

# %%
# Check the balanced distribution
balanced_distribution = {}
for sample in balanced_dataset["train"]:
    label = sample[Dataset.label_feature]
    balanced_distribution[label] = balanced_distribution.get(label, 0) + 1

print("Balanced class distribution:")
for class_id, count in sorted(balanced_distribution.items()):
    intent = next(i for i in dataset.intents if i.id == class_id)
    print(f"Class {class_id} ({intent.name}): {count} samples")

print(f"\nMost represented class: {max(balanced_distribution.values())} samples")
print(f"Least represented class: {min(balanced_distribution.values())} samples")

# %% [markdown]
"""
## Examining Generated Examples

Let's look at some examples of original and generated utterances for the navigation class,
which was the most underrepresented:
"""

# %%
# Navigation class (Class 2)
navigation_class_id = 2
intent = next(i for i in dataset.intents if i.id == navigation_class_id)

print(f"Examples for class {navigation_class_id} ({intent.name}):")

# Original examples
original_examples = [
    s[Dataset.utterance_feature] for s in dataset["train"] if s[Dataset.label_feature] == navigation_class_id
]
print("\nOriginal examples:")
for i, example in enumerate(original_examples, 1):
    print(f"{i}. {example}")

# Generated examples
all_examples = [
    s[Dataset.utterance_feature] for s in balanced_dataset["train"] if s[Dataset.label_feature] == navigation_class_id
]
generated_examples = [ex for ex in all_examples if ex not in original_examples]
print("\nGenerated examples:")
for i, example in enumerate(generated_examples, 1):
    print(f"{i}. {example}")
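
# %% [markdown]
"""
The exact-match filter above treats an utterance as new even if it differs from an original only in case or spacing. A slightly stricter check, sketched here under the assumption that such trivial variants should count as duplicates, normalizes text before comparing. The `normalize` and `find_new_utterances` helpers are hypothetical, and the candidate utterances are written by hand rather than produced by a model.

```python
def normalize(utterance: str) -> str:
    # Lowercase, trim, and collapse whitespace so trivial variants compare equal.
    return " ".join(utterance.lower().split())


def find_new_utterances(original: list[str], candidates: list[str]) -> list[str]:
    # Keep candidates that differ from every original and from each other after normalization.
    seen = {normalize(u) for u in original}
    unique = []
    for candidate in candidates:
        key = normalize(candidate)
        if key and key not in seen:
            seen.add(key)
            unique.append(candidate)
    return unique


original = ["How do I get to the museum?"]
# Hypothetical model outputs, written by hand for illustration.
candidates = [
    "how do I get to the  museum?",
    "Give me directions to the airport",
    "Give me directions to the airport",
]
print(find_new_utterances(original, candidates))
```

Only the airport utterance survives: the first candidate matches the original after normalization, and the third repeats the second.
"""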

# %% [markdown]
"""
## Configuring the Number of Samples per Class

You can configure how many samples each class should have:
"""

# %%
# To bring all classes to exactly 10 samples
original_dataset = Dataset.from_dict(sample_data)
exact_template = EnglishSynthesizerTemplate(dataset=original_dataset, split="train")

exact_balancer = DatasetBalancer(
    generator=generator,
    prompt_maker=exact_template,
    max_samples_per_class=10,
)

# Balance to the level of the most represented class
max_template = EnglishSynthesizerTemplate(dataset=original_dataset, split="train")

max_balancer = DatasetBalancer(
    generator=generator,
    prompt_maker=max_template,
    max_samples_per_class=None,  # Will use the count of the most represented class
)
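
# %% [markdown]
"""
The effect of the two settings on the sample dataset can be worked out by hand. The sketch below assumes the rule described in the comments above (the target is `max_samples_per_class`, or the size of the largest class when it is `None`). The `samples_needed` helper is hypothetical and not part of DatasetBalancer.

```python
from typing import Optional


def samples_needed(class_counts: dict[int, int], max_samples_per_class: Optional[int] = None) -> dict[int, int]:
    # Assumed rule: the target is max_samples_per_class, or the largest class size when None.
    target = max_samples_per_class if max_samples_per_class is not None else max(class_counts.values())
    # Classes already at or above the target need no new utterances.
    return {class_id: max(target - count, 0) for class_id, count in class_counts.items()}


counts = {0: 5, 1: 3, 2: 1}
print(samples_needed(counts, max_samples_per_class=10))
print(samples_needed(counts))
```

Under this assumption, a target of 10 calls for 5, 7, and 9 new utterances for classes 0, 1, and 2, while `None` calls for 0, 2, and 4.
"""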

# %% [markdown]
"""
## Tips for Effective Dataset Balancing

1. **Quality Control**: Always review a sample of generated utterances to ensure quality.
2. **Template Selection**: Different templates may work better for different domains.
3. **Model Selection**: Larger models generally produce higher quality utterances.
4. **Batch Size**: Increase `batch_size` for faster generation if your LLM provider's rate limits (or your hardware, for a locally hosted model) allow.
5. **Validation**: Test your model on both original and augmented data to ensure it generalizes well.
"""