balancer Jupyter notebook (#160)

SeBorgey · voorhs · web-flow · commit 73a87ebd56b1 · 2025-03-07T11:09:22.000+03:00
* balancer Jupyter notebook

* convert to rst

* update pyproject

* add references to tutorials

---------

Co-authored-by: voorhs <ilya_alekseev_2016@list.ru>
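
Reviewer note: the new tutorial balances a toy dataset with 5, 3, and 1 training samples per class. The sketch below shows how many utterances each class would need under the two `max_samples_per_class` settings the tutorial describes (an explicit target, or `None` meaning "match the most represented class"). The helper `samples_needed` is hypothetical and written only for illustration; it is not part of `autointent` and does not claim to mirror how `DatasetBalancer` is implemented.

```python
from collections import Counter


def samples_needed(labels, max_samples_per_class=None):
    """Return {label: number of samples still to generate} (illustrative only)."""
    counts = Counter(labels)
    # Assumption taken from the tutorial text: with no explicit target,
    # every class is brought up to the size of the largest class.
    if max_samples_per_class is None:
        target = max(counts.values())
    else:
        target = max_samples_per_class
    # Classes already at or above the target need nothing generated.
    return {label: max(target - n, 0) for label, n in sorted(counts.items())}


# Label counts from the tutorial's toy dataset: 5, 3 and 1 samples.
labels = [0] * 5 + [1] * 3 + [2] * 1

print(samples_needed(labels))      # {0: 0, 1: 2, 2: 4}
print(samples_needed(labels, 10))  # {0: 5, 1: 7, 2: 9}
```

Under these assumptions, the tutorial's first run (`max_samples_per_class=5`) would request six generated utterances in total, four of them for the navigation class.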
diff --git a/autointent/generation/utterances/balancer.py b/autointent/generation/utterances/balancer.py
@@ -20,6 +20,8 @@ class DatasetBalancer:
     If your dataset is unbalanced, you can add LLM-generated samples.
     This method uses :py:class:`autointent.generation.utterances.UtteranceGenerator` under the hood.
 
+    See tutorial :ref:`balancer_aug` for usage examples.
+
     Args:
         generator (Generator): The generator object used to create utterances.
         prompt_maker (Callable[[Intent, int], list[Message]]): A callable that creates prompts for the generator.
diff --git a/autointent/generation/utterances/evolution/__init__.py b/autointent/generation/utterances/evolution/__init__.py
@@ -1,7 +1,5 @@
+from .dspy_evolver import DSPYIncrementalUtteranceEvolver
 from .evolver import UtteranceEvolver
 from .incremental_evolver import IncrementalUtteranceEvolver
 
-__all__ = [
-    "IncrementalUtteranceEvolver",
-    "UtteranceEvolver",
-]
+__all__ = ["DSPYIncrementalUtteranceEvolver", "IncrementalUtteranceEvolver", "UtteranceEvolver"]
diff --git a/autointent/generation/utterances/evolution/dspy_evolver.py b/autointent/generation/utterances/evolution/dspy_evolver.py
@@ -152,6 +152,8 @@ class DSPYIncrementalUtteranceEvolver:
 
     For scoring generations it would use modified SemanticF1 as the base metric with a ROUGE-1 as repetition penalty.
 
+    See tutorial :ref:`evolutionary_strategy_augmentation` for usage examples.
+
     Args:
         model: Model name. This should follow naming schema from `litellm providers <https://docs.litellm.ai/docs/providers>`_.
         api_base: API base URL. Some models require this.
diff --git a/docs/source/augmentation_tutorials/balancer.rst b/docs/source/augmentation_tutorials/balancer.rst
@@ -0,0 +1,210 @@
+.. _balancer_aug:
+
+Balancing Datasets with DatasetBalancer
+=======================================
+
+This guide demonstrates how to use the DatasetBalancer class to balance class distribution in your datasets through LLM-based data augmentation.
+
+.. contents:: Table of Contents
+    :depth: 2
+
+Why Balance Datasets?
+---------------------
+
+Imbalanced datasets can lead to biased models that perform well on majority classes but poorly on minority classes. DatasetBalancer helps address this issue by generating additional examples for underrepresented classes using large language models.
+
+Creating a Sample Imbalanced Dataset
+------------------------------------
+
+Let's create a small imbalanced dataset to demonstrate the balancing process:
+
+.. code-block:: python
+
+    from autointent import Dataset
+    from autointent.generation.utterances.balancer import DatasetBalancer
+    from autointent.generation.utterances.generator import Generator
+    from autointent.generation.chat_templates import EnglishSynthesizerTemplate
+
+    # Create a simple imbalanced dataset
+    sample_data = {
+        "intents": [
+            {"id": 0, "name": "restaurant_booking", "description": "Booking a table at a restaurant"},
+            {"id": 1, "name": "weather_query", "description": "Checking weather conditions"},
+            {"id": 2, "name": "navigation", "description": "Getting directions to a location"},
+        ],
+        "train": [
+            # Restaurant booking examples (5)
+            {"utterance": "Book a table for two tonight", "label": 0},
+            {"utterance": "I need a reservation at Le Bistro", "label": 0},
+            {"utterance": "Can you reserve a table for me?", "label": 0},
+            {"utterance": "I want to book a restaurant for my anniversary", "label": 0},
+            {"utterance": "Make a dinner reservation for 8pm", "label": 0},
+
+            # Weather query examples (3)
+            {"utterance": "What's the weather like today?", "label": 1},
+            {"utterance": "Will it rain tomorrow?", "label": 1},
+            {"utterance": "Weather forecast for New York", "label": 1},
+
+            # Navigation example (1)
+            {"utterance": "How do I get to the museum?", "label": 2},
+        ]
+    }
+
+    # Create the dataset
+    dataset = Dataset.from_dict(sample_data)
+
+Setting up the Generator and Template
+-------------------------------------
+
+DatasetBalancer requires two main components:
+a Generator, which creates new utterances using an LLM,
+and a Template, which defines the prompt format sent to the LLM.
+
+Let's set up these components:
+
+.. code-block:: python
+
+    # Initialize a generator (uses OpenAI API by default)
+    generator = Generator()
+
+    # Create a template for generating utterances
+    template = EnglishSynthesizerTemplate(dataset=dataset, split="train")
+
+Creating the DatasetBalancer
+----------------------------
+
+Now we can create our DatasetBalancer instance:
+
+.. code-block:: python
+
+    balancer = DatasetBalancer(
+        generator=generator,
+        prompt_maker=template,
+        async_mode=False,  # Set to True for faster generation with async processing
+        max_samples_per_class=5,  # Each class will have exactly 5 samples after balancing
+    )
+
+Checking Initial Class Distribution
+-----------------------------------
+
+Let's examine the class distribution before balancing:
+
+.. code-block:: python
+
+    # Check the initial distribution of classes in the training set
+    initial_distribution = {}
+    for sample in dataset["train"]:
+        label = sample[Dataset.label_feature]
+        initial_distribution[label] = initial_distribution.get(label, 0) + 1
+
+    print("Initial class distribution:")
+    for class_id, count in sorted(initial_distribution.items()):
+        intent = next(i for i in dataset.intents if i.id == class_id)
+        print(f"Class {class_id} ({intent.name}): {count} samples")
+
+    print(f"\nMost represented class: {max(initial_distribution.values())} samples")
+    print(f"Least represented class: {min(initial_distribution.values())} samples")
+
+Balancing the Dataset
+---------------------
+
+Now we'll use the DatasetBalancer to augment our dataset:
+
+.. code-block:: python
+
+    # Create a copy of the dataset
+    dataset_copy = Dataset.from_dict(dataset.to_dict())
+
+    # Balance the training split
+    balanced_dataset = balancer.balance(
+        dataset=dataset_copy,
+        split="train",
+        batch_size=2,  # Process generations in batches of 2
+    )
+
+Checking the Results
+--------------------
+
+Let's examine the class distribution after balancing:
+
+.. code-block:: python
+
+    # Check the balanced distribution
+    balanced_distribution = {}
+    for sample in balanced_dataset["train"]:
+        label = sample[Dataset.label_feature]
+        balanced_distribution[label] = balanced_distribution.get(label, 0) + 1
+
+    print("Balanced class distribution:")
+    for class_id, count in sorted(balanced_distribution.items()):
+        intent = next(i for i in dataset.intents if i.id == class_id)
+        print(f"Class {class_id} ({intent.name}): {count} samples")
+
+    print(f"\nMost represented class: {max(balanced_distribution.values())} samples")
+    print(f"Least represented class: {min(balanced_distribution.values())} samples")
+
+Examining Generated Examples
+----------------------------
+
+Let's look at some examples of original and generated utterances for the navigation class,
+which was the most underrepresented:
+
+.. code-block:: python
+
+    # Navigation class (Class 2)
+    navigation_class_id = 2
+    intent = next(i for i in dataset.intents if i.id == navigation_class_id)
+
+    print(f"Examples for class {navigation_class_id} ({intent.name}):")
+
+    # Original examples
+    original_examples = [
+        s[Dataset.utterance_feature] for s in dataset["train"] if s[Dataset.label_feature] == navigation_class_id
+    ]
+    print("\nOriginal examples:")
+    for i, example in enumerate(original_examples, 1):
+        print(f"{i}. {example}")
+
+    # Generated examples
+    all_examples = [
+        s[Dataset.utterance_feature] for s in balanced_dataset["train"] if s[Dataset.label_feature] == navigation_class_id
+    ]
+    generated_examples = [ex for ex in all_examples if ex not in original_examples]
+    print("\nGenerated examples:")
+    for i, example in enumerate(generated_examples, 1):
+        print(f"{i}. {example}")
+
+Configuring the Number of Samples per Class
+-------------------------------------------
+
+You can configure how many samples each class should have:
+
+.. code-block:: python
+
+    # To bring all classes to exactly 10 samples
+    original_dataset = Dataset.from_dict(sample_data)
+    exact_template = EnglishSynthesizerTemplate(dataset=original_dataset, split="train")
+
+    exact_balancer = DatasetBalancer(
+        generator=generator,
+        prompt_maker=exact_template,
+        max_samples_per_class=10
+    )
+
+    # Balance to the level of the most represented class
+    max_template = EnglishSynthesizerTemplate(dataset=original_dataset, split="train")
+
+    max_balancer = DatasetBalancer(
+        generator=generator,
+        prompt_maker=max_template,
+        max_samples_per_class=None  # Will use the count of the most represented class
+    )
+
+Tips for Effective Dataset Balancing
+------------------------------------
+
+1. **Quality Control**: Always review a sample of generated utterances to ensure quality.
+2. **Template Selection**: Different templates may work better for different domains.
+3. **Model Selection**: Larger models generally produce higher quality utterances.
+4. **Batch Size**: Increase batch size for faster generation if your LLM provider's rate limits allow.
+5. **Validation**: Test your model on both original and augmented data to ensure it generalizes well.
diff --git a/docs/source/augmentation_tutorials/dspy_augmentation.rst b/docs/source/augmentation_tutorials/dspy_augmentation.rst
@@ -8,13 +8,11 @@ This tutorial covers the implementation and usage of an evolutionary strategy to
 .. contents:: Table of Contents
     :depth: 2
 
--------------
 What is DSPy?
 -------------
 
 DSPy is a framework for optimizing and evaluating language models. It provides tools for defining signatures, optimizing modules, and measuring evaluation metrics. This module leverages DSPy to generate augmented utterances using an evolutionary approach.
 
----------------------
 How This Module Works
 ---------------------
 
@@ -26,7 +24,6 @@ This module applies an incremental evolutionary strategy for augmenting utteranc
 
 The augmentation process runs for a specified number of evolutions, saving intermediate models and optimizing the results.
 
-------------
 Installation
 ------------
 
@@ -36,7 +33,6 @@ Ensure you have the required dependencies installed:
 
     pip install "autointent[dspy]"
 
---------------
 Scoring Metric
 --------------
 
@@ -54,7 +50,6 @@ The scoring metric consists of:
    - `Final Score = SemanticF1 * Repetition Factor`
    - A higher score means better augmentation.
 
--------------
 Usage Example
 -------------
 
diff --git a/docs/source/user_guides.rst b/docs/source/user_guides.rst
@@ -16,4 +16,5 @@ Data augmentation tutorials
    :maxdepth: 1
 
    augmentation_tutorials/dspy_augmentation
+   augmentation_tutorials/balancer
 
diff --git a/pyproject.toml b/pyproject.toml
@@ -36,9 +36,7 @@ dependencies = [
     "scikit-learn (>=1.5,<2.0)",
     "scikit-multilearn (==0.2.0)",
     "appdirs (>=1.4,<2.0)",
-    "sre-yield (>=1.2,<2.0)",
     "optuna (>=4.0.0,<5.0.0)",
-    "xeger (>=0.4.0,<0.5.0)",
     "pathlib (>=1.0.1,<2.0.0)",
     "pydantic (>=2.10.5,<3.0.0)",
     "faiss-cpu (>=1.9.0,<2.0.0)",
