Refactor generation pipeline pipeline #48

kohankhaki · 2025-12-12T10:40:03Z

PR Type

Fix

Short Description

This pull request refactors the base generation pipeline to use the schema in src/schemas and use experimental "diverse task generator" for task generation and validation.
It also removed legacy unused files.

Tests Added

None

This change is

afkanpour

@afkanpour partially reviewed 43 files and all commit messages, and made 12 comments.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @kohankhaki).

src/schemas/task_schemas.py line 22 at r2 (raw file):

    task: str
    capability: Capability
    generation_metadata: Optional[Dict] = field(default_factory=dict)

Let's use this sparingly. Please add at least the difficulty, bloom's level, solution type (mcq, open-ended, ...), and the solution attributes to this class.
In addition, since we're converging to multiple choice questions, add another field, e.g., a list of (label, solution) that holds the choices. "label" refers to "A", "B", ..., and "solution" is the actual solution.

Code quote:

generation_metadata: Optional[Dict] = field(default_factory=dict)

src/utils/embedding_utils.py line 170 at r1 (raw file):

def generate_schema_capabilities_embeddings(

We don't keep the old capability class anymore, right? Then, let's remove schema from the name here and all comments that follow.

Code quote:

generate_schema_capabilities_embeddings

src/utils/capability_utils.py line 17 at r1 (raw file):

from inspect_ai.scorer import CORRECT

# from langsmith import traceable, tracing_context  # COMMENTED OUT FOR DEBUGGING

Do we use langsmith at all? If not, let's remove all references to it.

Code quote:

# from langsmith import traceable, tracing_context  # COMMENTED OUT FOR DEBUGGING

src/utils/capability_utils.py line 252 at r1 (raw file):

    # @traceable(  # COMMENTED OUT FOR DEBUGGING
    #     run_type="llm",
    # )

Are these for langsmith? Please remove these and all similar instances.

Code quote:

    # @traceable(  # COMMENTED OUT FOR DEBUGGING
    #     run_type="llm",
    # )

src/base_stages/generate_tasks.py line 0 at r2 (raw file):
What's the difference between generate_tasks.py and generate_diverse_tasks.py? It seems the previous one implements the combination pipeline whereas this one only generates the tasks. Let's use more descriptive names for modules and functions.

src/base_stages/generate_tasks.py line 116 at r2 (raw file):

                    "blueprint": blueprint.blueprint,
                    "subtopic": blueprint.subtopic,
                    "difficulty": blueprint.difficulty,

Please move difficulty and bloom's level to the main task class. We'll use them regularly.

More generally, I think generation_metadata should be used sparingly and only for experimental purposes. It has some serious shortcomings. For example, it significantly reduces the readability of the code (the reader has no idea what attributes are in the class). It causes issues with type safety, and makes refactoring more difficult.

Code quote:

                    "blueprint": blueprint.blueprint,
                    "subtopic": blueprint.subtopic,
                    "difficulty": blueprint.difficulty,

src/base_stages/generate_tasks.py line 117 at r2 (raw file):

                    "subtopic": blueprint.subtopic,
                    "difficulty": blueprint.difficulty,
                    "reasoning": blueprint.reasoning,

What is reasoning? Does it refer to the bloom's level? If so, let's use a self-explanatory name.

Code quote:

"reasoning": blueprint.reasoning

src/utils/prompts.py line 0 at r1 (raw file):
There is prompts.py under base_stages/ and there's this module. What's the difference between them? And why don't we merge them?

src/utils/capability_management_utils.py line 195 at r1 (raw file):

def filter_schema_capabilities_by_embeddings(
    capabilities: List[Any],  # List of schema Capability objects
    embeddings: List[torch.Tensor],

Should we add embedding as an optional field to the capability class?

Code quote:

embeddings: List[torch.Tensor]

src/base_stages/task_dataclasses.py line 0 at r2 (raw file):
Does this define classes for the 5-stage generation pipeline? If so, let's merge it with the main schemas. We don't want to keep two sets of classes around. If there are experimental work, they should go in an experimental directory.

README.md line 273 at r1 (raw file):

1. **Follow Schema Guidelines**: All data objects must use the schema classes defined in `src/schemas/`:
   - Use `Domain`, `Area`, `Capability`, `Task`, `TaskSolution`, `ValidationResult` objects

TaskSolution should be part of Task. Let's add new fields for each of these: solution, multiple choices, task_type (with values like OpenEnded, MultipleChoice, ...)

In addition, when a subject model solves a task, how do we store its answer?

Code quote:

`Task`, `TaskSolution`

src/model.py line 13 at r1 (raw file):

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_openai import ChatOpenAI
# from langsmith import traceable  # COMMENTED OUT FOR DEBUGGING

Let's remove all of the commented-out lines.

Code quote:

# from langsmith import traceable  # COMMENTED OUT FOR DEBUGGING

…ame/capability_name) Update Domain, Area, Capability dataclasses and all base_stages to use explicit field names. Update documentation with new JSON format.

…sage

kohankhaki

@kohankhaki made 9 comments.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @afkanpour and @Negiiiin).

README.md line 273 at r1 (raw file):