
Conversation

@kohankhaki (Collaborator) commented Dec 12, 2025

PR Type

Fix

Short Description

This pull request refactors the base generation pipeline to use the schemas in src/schemas and the experimental "diverse task generator" for task generation and validation.
It also removes legacy, unused files.

Tests Added

None



@afkanpour afkanpour self-requested a review December 15, 2025 14:56
@afkanpour afkanpour requested a review from Negiiiin January 8, 2026 15:15
@afkanpour (Collaborator) left a comment

@afkanpour partially reviewed 43 files and all commit messages, and made 12 comments.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @kohankhaki).


src/schemas/task_schemas.py line 22 at r2 (raw file):

    task: str
    capability: Capability
    generation_metadata: Optional[Dict] = field(default_factory=dict)

Let's use this sparingly. Please add at least the difficulty, Bloom's level, solution type (MCQ, open-ended, ...), and the solution attributes to this class.
In addition, since we're converging on multiple-choice questions, add another field, e.g., a list of (label, solution) pairs that holds the choices. "label" refers to "A", "B", ..., and "solution" is the actual solution.

Code quote:

generation_metadata: Optional[Dict] = field(default_factory=dict)
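
A minimal sketch of how those fields could look (names like bloom_level, TaskType, and choices are illustrative suggestions, not the final schema):

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import Dict, List, Optional, Tuple


    class TaskType(str, Enum):
        OPEN_ENDED = "open_ended"
        MULTIPLE_CHOICE = "multiple_choice"


    @dataclass
    class Task:
        task: str
        capability: "Capability"  # existing Capability class from src/schemas
        difficulty: str = "medium"             # e.g. "easy", "medium", "hard"
        bloom_level: str = "understand"        # Bloom's taxonomy level
        task_type: TaskType = TaskType.MULTIPLE_CHOICE
        solution: Optional[str] = None         # correct label for MCQ, full answer for open-ended
        choices: List[Tuple[str, str]] = field(default_factory=list)  # (label, solution) pairs, e.g. ("A", "...")
        generation_metadata: Optional[Dict] = field(default_factory=dict)  # experimental extras only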

src/utils/embedding_utils.py line 170 at r1 (raw file):

def generate_schema_capabilities_embeddings(

We don't keep the old capability class anymore, right? If so, let's remove "schema" from this name and from all the comments that follow.

Code quote:

generate_schema_capabilities_embeddings

src/utils/capability_utils.py line 17 at r1 (raw file):

from inspect_ai.scorer import CORRECT

# from langsmith import traceable, tracing_context  # COMMENTED OUT FOR DEBUGGING

Do we use langsmith at all? If not, let's remove all references to it.

Code quote:

# from langsmith import traceable, tracing_context  # COMMENTED OUT FOR DEBUGGING

src/utils/capability_utils.py line 252 at r1 (raw file):

    # @traceable(  # COMMENTED OUT FOR DEBUGGING
    #     run_type="llm",
    # )

Are these for langsmith? Please remove these and all similar instances.

Code quote:

    # @traceable(  # COMMENTED OUT FOR DEBUGGING
    #     run_type="llm",
    # )

src/base_stages/generate_tasks.py line 0 at r2 (raw file):
What's the difference between generate_tasks.py and generate_diverse_tasks.py? It seems the previous one implements the combination pipeline whereas this one only generates the tasks. Let's use more descriptive names for modules and functions.


src/base_stages/generate_tasks.py line 116 at r2 (raw file):

                    "blueprint": blueprint.blueprint,
                    "subtopic": blueprint.subtopic,
                    "difficulty": blueprint.difficulty,

Please move difficulty and Bloom's level to the main task class. We'll use them regularly.

More generally, I think generation_metadata should be used sparingly and only for experimental purposes. It has some serious shortcomings: it significantly reduces the readability of the code (the reader has no idea what attributes are in the class), it causes issues with type safety, and it makes refactoring more difficult.

Code quote:

                    "blueprint": blueprint.blueprint,
                    "subtopic": blueprint.subtopic,
                    "difficulty": blueprint.difficulty,

src/base_stages/generate_tasks.py line 117 at r2 (raw file):

                    "subtopic": blueprint.subtopic,
                    "difficulty": blueprint.difficulty,
                    "reasoning": blueprint.reasoning,

What is reasoning? Does it refer to the Bloom's level? If so, let's use a self-explanatory name.

Code quote:

"reasoning": blueprint.reasoning

src/utils/prompts.py line 0 at r1 (raw file):
There is prompts.py under base_stages/ and there's this module. What's the difference between them? And why don't we merge them?


src/utils/capability_management_utils.py line 195 at r1 (raw file):

def filter_schema_capabilities_by_embeddings(
    capabilities: List[Any],  # List of schema Capability objects
    embeddings: List[torch.Tensor],

Should we add embedding as an optional field to the capability class?

Code quote:

embeddings: List[torch.Tensor]

src/base_stages/task_dataclasses.py line 0 at r2 (raw file):
Does this define classes for the 5-stage generation pipeline? If so, let's merge it with the main schemas. We don't want to keep two sets of classes around. If there is experimental work, it should go in an experimental directory.


README.md line 273 at r1 (raw file):

1. **Follow Schema Guidelines**: All data objects must use the schema classes defined in `src/schemas/`:
   - Use `Domain`, `Area`, `Capability`, `Task`, `TaskSolution`, `ValidationResult` objects

TaskSolution should be part of Task. Let's add new fields for each of these: solution, multiple choices, task_type (with values like OpenEnded, MultipleChoice, ...)

In addition, when a subject model solves a task, how do we store its answer?

Code quote:

`Task`, `TaskSolution`
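
On the second question, one hypothetical shape for recording a subject model's answer (the TaskAttempt name and fields are illustrative, not part of the current schemas):

    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class TaskAttempt:
        task_id: str
        model_name: str
        raw_response: str                     # full text returned by the subject model
        parsed_answer: Optional[str] = None   # e.g. the extracted choice label "B"
        is_correct: Optional[bool] = None     # filled in by the scorer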

src/model.py line 13 at r1 (raw file):

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_openai import ChatOpenAI
# from langsmith import traceable  # COMMENTED OUT FOR DEBUGGING

Let's remove all of the commented-out lines.

Code quote:

# from langsmith import traceable  # COMMENTED OUT FOR DEBUGGING

Commit: …ame/capability_name)

Update Domain, Area, Capability dataclasses and all base_stages to use explicit field names. Update documentation with new JSON format.
@kohankhaki (Collaborator, Author) left a comment

@kohankhaki made 9 comments.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @afkanpour and @Negiiiin).


README.md line 273 at r1 (raw file):

Previously, afkanpour (Arash) wrote…

TaskSolution should be part of Task. Let's add new fields for each of these: solution, multiple choices, task_type (with values like OpenEnded, MultipleChoice, ...)

In addition, when a subject model solves a task, how do we store its answer?

Done.


src/base_stages/generate_tasks.py line at r2 (raw file):

Previously, afkanpour (Arash) wrote…

What's the difference between generate_tasks.py and generate_diverse_tasks.py? It seems the previous one implements the combination pipeline whereas this one only generates the tasks. Let's use more descriptive names for modules and functions.

Renamed the files for clarity: generate_tasks_from_blueprints now builds MCQ tasks from blueprints, and generate_diverse_tasks_pipeline runs the full subtopic→combination→blueprint→task pipeline. All imports updated.
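
Roughly, the split could be wired like this (a sketch; the stage helper names are hypothetical and only the two public function names come from the rename):

    # Hypothetical stage helpers, shown as stubs for illustration.
    def extract_subtopics(capability): ...
    def find_combinations(subtopics): ...
    def generate_blueprints(combinations): ...


    def generate_tasks_from_blueprints(blueprints):
        """Build the final MCQ tasks from blueprints."""
        ...


    def generate_diverse_tasks_pipeline(capability):
        """Run the full pipeline: subtopics -> combinations -> blueprints -> MCQ tasks."""
        subtopics = extract_subtopics(capability)
        combinations = find_combinations(subtopics)
        blueprints = generate_blueprints(combinations)
        return generate_tasks_from_blueprints(blueprints)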


src/utils/capability_management_utils.py line 195 at r1 (raw file):

Previously, afkanpour (Arash) wrote…

Should we add embedding as an optional field to the capability class?

No. Let’s keep Capability small (id/name/description) and store embeddings elsewhere keyed by capability_id. That avoids bloating JSON and locking us to a specific embedding model.
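
For example, a separate store keyed by capability_id could be as simple as (a sketch; the actual storage layer is out of scope for this PR):

    from typing import Dict

    import torch

    # Embeddings live outside the Capability dataclass, keyed by capability_id.
    capability_embeddings: Dict[str, torch.Tensor] = {}


    def get_capability_embedding(capability_id: str) -> torch.Tensor:
        """Look up a precomputed embedding without touching the Capability object."""
        return capability_embeddings[capability_id]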


src/utils/capability_utils.py line 17 at r1 (raw file):

Previously, afkanpour (Arash) wrote…

Do we use langsmith at all? If not, let's remove all references to it.

Done.


src/utils/capability_utils.py line 252 at r1 (raw file):

Previously, afkanpour (Arash) wrote…

Are these for langsmith? Please remove these and all similar instances.

Done. Removed the function, as it will not be used in the future and there will be a new eval pipeline.


src/utils/embedding_utils.py line 170 at r1 (raw file):

Previously, afkanpour (Arash) wrote…

We don't keep the old capability class anymore, right? Then, let's remove schema from the name here and all comments that follow.

Done.


src/utils/prompts.py line at r1 (raw file):

Previously, afkanpour (Arash) wrote…

There is prompts.py under base_stages/ and there's this module. What's the difference between them? And why don't we merge them?

I kept src/utils/prompts.py as the legacy/agentic prompt set used by capability_discovery_utils, and moved the new base-generation pipeline prompts to src/utils/base_generation_prompts.py (all base stages now import that). They coexist without breaking legacy callers; when we fully merge later, we'll remove the file.


src/model.py line 13 at r1 (raw file):

Previously, afkanpour (Arash) wrote…

Let's remove all of the commented-out lines.

Done.


src/base_stages/task_dataclasses.py line at r2 (raw file):

Previously, afkanpour (Arash) wrote…

Does this define classes for the 5-stage generation pipeline? If so, let's merge it with the main schemas. We don't want to keep two sets of classes around. If there are experimental work, they should go in an experimental directory.

These are intermediate dataclasses used internally within Stage 3's multi-step task generation process (subtopic extraction → combination finding → blueprint generation → task generation). They're not pipeline stage outputs; the final output is converted to the main Task schema.
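
For context, the handoff at the end of Stage 3 could look roughly like this (a hypothetical sketch; the attribute names are taken from the blueprint usage quoted earlier in this review):

    from src.schemas.task_schemas import Task  # main Task schema


    def blueprint_to_task(task_text: str, blueprint, capability) -> Task:
        """Convert a Stage-3 intermediate blueprint into the main Task schema."""
        return Task(
            task=task_text,
            capability=capability,
            generation_metadata={
                "blueprint": blueprint.blueprint,
                "subtopic": blueprint.subtopic,
                "difficulty": blueprint.difficulty,
            },
        )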

Code context:

        pipeline_type,
    )

    domain_id = "domain_000"
Collaborator comment:
Is it ok that this is defined as a constant?

Collaborator comment:
To me, the criteria mentioned in VERIFICATION_USER_PROMPT_TEMPLATE and VERIFICATION_SYSTEM_PROMPT are not very detailed (which might work fine). I would like to see, on a set of generated tasks, how well this verification works.
