
Persistence moved to python #979

Merged
sfierro merged 11 commits into sfierro/specs from sfierro/spec-persistence
Jan 27, 2026

Conversation


@sfierro sfierro commented Jan 23, 2026

Summary by CodeRabbit

  • New Features
    • Copilot-assisted spec creation: option to generate, review, and save specs using AI-generated examples and prompts directly from the spec builder UI.
  • User Experience
    • Improved reviewed-examples workflow and review row identifiers for more reliable pass/fail and feedback interactions.
  • Refactor
    • Under-the-hood API and model updates to support the Copilot workflow (no change to core spec UI behavior).



coderabbitai bot commented Jan 23, 2026

Walkthrough

Adds Kiln Copilot end-to-end support: new POST /api/projects/{project_id}/tasks/{task_id}/spec_with_copilot endpoint and request model, frontend wiring to choose Copilot vs standard spec creation, new backend Copilot utilities and models, migration of spec persistence logic into Copilot utilities, and updated types/schema across client and server.

Changes

Cohort / File(s) Summary
API Schema & Types
app/web_ui/src/lib/api_schema.d.ts, app/web_ui/src/lib/types.ts
New API path /api/projects/{project_id}/tasks/{task_id}/spec_with_copilot and CreateSpecWithCopilotRequest/ReviewedExample schemas; split PromptGenerationResultApi into -Input/-Output; added TypeScript aliases (PromptGenerationResultApi, TaskMetadataApi, ReviewedExample, SampleApi, SubsampleBatchOutputItemApi).
Frontend Spec Builder UI
app/web_ui/src/routes/.../spec_builder/+page.svelte, .../review_examples.svelte
Switched judge_info type to PromptGenerationResultApi; migrated ReviewRow to use row_id and import from spec_utils.ts; saveSpec now builds payload client-side and conditionally POSTs to /spec_with_copilot (Copilot) or /spec (legacy).
Frontend Utilities
app/web_ui/src/routes/.../spec_builder/spec_utils.ts
Added exported ReviewRow type (includes row_id) and removed local ReviewedExample/JudgeInfo usages.
Removed Frontend Persistence
app/web_ui/src/routes/.../spec_builder/spec_persistence.ts
Deleted large orchestration module (createSpec and helpers) — persistence/eval orchestration moved server-side and into util modules; multiple exported types/functions removed.
Backend Copilot API & Models
app/desktop/studio_server/copilot_api.py, app/desktop/studio_server/copilot_models.py
New CreateSpecWithCopilotRequest Pydantic model and /spec_with_copilot endpoint; added many Copilot DTOs (TaskInfoApi, SampleApi, ReviewedExample, PromptGenerationResultApi, Clarify/Refine/GenerateBatch inputs/outputs). Endpoint orchestrates eval/config/task-run/spec creation and rollback on failure.
Backend Copilot Utilities
app/desktop/util/spec_utils.py
New utilities: get_copilot_api_key(), generate_copilot_examples(), SampleApi/ReviewedExample models, sampling and TaskRun creation helpers, dataset construction functions used by copilot flow.
API Client Models
app/desktop/studio_server/api_client/.../models/*
Added TaskInfo model and updated GenerateBatchInput / ClarifySpecInput to use TaskInfo fields (nested target_task_info, topic_generation_task_info, input_generation_task_info, target_specification); exported TaskInfo in package init.
Server Tests & Minor docs
app/desktop/studio_server/test_copilot_api.py, libs/server/kiln_server/spec_api.py, app/desktop/studio_server/test_questions_models.py
Adjusted tests to use new config path and updated payload shapes; added TODO comments in spec_api about eval generation behavior; minor import/doc updates for question models.

Sequence Diagram(s)

sequenceDiagram
    participant UI as Spec Builder UI
    participant FE as Frontend Payload Builder
    participant API as Backend Copilot API
    participant Copilot as Kiln Copilot Service
    participant DB as Database

    UI->>FE: saveSpec(use_kiln_copilot=true)
    FE->>FE: build definition & properties
    FE->>API: POST /spec_with_copilot (CreateSpecWithCopilotRequest)
    API->>API: validate request & create Eval/EvalConfig
    API->>Copilot: generate_batch (target/task infos, spec)
    Copilot-->>API: list[SampleApi]
    API->>DB: create TaskRuns (eval/train/golden) and persist TaskRuns
    API->>DB: create Spec and link Eval
    API-->>FE: Spec response (Spec)
    FE-->>UI: spec_id or error
    Note over API,DB: on any failure => rollback created Eval/TaskRuns
sequenceDiagram
    participant UI as Spec Builder UI
    participant FE as Frontend Payload Builder
    participant API as Backend Spec API
    participant DB as Database

    UI->>FE: saveSpec(use_kiln_copilot=false)
    FE->>FE: build definition & properties
    FE->>API: POST /spec (legacy payload)
    API->>DB: create Spec (default priority/status)
    API-->>FE: Spec response
    FE-->>UI: spec_id

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

  • scosman
  • chiang-daniel
  • tawnymanticore

Poem

🐰 Hop, click, and bind — a Copilot tale unfolds,
From UI fields to server dreams and models bright as gold.
I nibble payloads, stitch the flow, then tidy up the trace,
With rows and samples, tags and run — a spec that's full of grace.
Hooray! — the rabbit cheers for code that finds its place.

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
  • Description check ⚠️ Warning: The PR description is entirely empty, missing the required template sections, including 'What does this PR do?' and other mandatory checklist items. Resolution: add a comprehensive description explaining the persistence migration, including the 'What does this PR do?' section, related issues, CLA confirmation, and completed checklists.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 41.30%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
  • Title check ✅ Passed: The title 'Persistence moved to python' accurately reflects the main change: spec persistence logic was migrated from TypeScript (spec_persistence.ts deleted) to Python (spec_utils.py added).






@gemini-code-assist
Contributor

Summary of Changes

Hello @sfierro, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the architecture for creating new specifications within the application. By moving the intricate persistence and example generation logic for Kiln Copilot-assisted spec creation from the frontend to a new backend Python endpoint, the change centralizes critical business logic. This improves maintainability, enhances data integrity by ensuring server-side validation and atomic operations, and streamlines the client-side code by offloading complex orchestration.

Highlights

  • Backend Centralization of Spec Creation: The complex logic for creating specifications with Kiln Copilot, including eval generation, example batching, and judge configuration, has been migrated from the frontend (TypeScript) to the backend (Python).
  • New API Endpoint: A dedicated FastAPI endpoint, /api/projects/{project_id}/tasks/{task_id}/spec_with_copilot, has been introduced to handle the comprehensive spec creation process on the server side.
  • Frontend Simplification: The spec_persistence.ts file, which previously contained the client-side persistence logic, has been removed, simplifying the frontend codebase.
  • API Schema Updates: The API schema (api_schema.d.ts) and frontend types (types.ts) have been updated to reflect the new spec_with_copilot endpoint and its associated request/response models, including new ReviewedExample and refined PromptGenerationResultApi types.


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review: /gemini review performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help: /gemini help displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.


Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


github-actions bot commented Jan 23, 2026

📊 Coverage Report

Overall Coverage: 91%

Diff: origin/sfierro/specs...HEAD

  • app/desktop/studio_server/copilot_api.py (41.5%): Missing lines 211,244,293,296-299,302,305-309,314,317,328,331,342,345,348-349,352,355,376,384-386,389,399,403-406,408-409,411-413,415-417,419-422,424,427,429
  • app/desktop/studio_server/copilot_models.py (100%)
  • app/desktop/util/spec_utils.py (31.1%): Missing lines 81,83,94,99-100,104-105,110-111,117-121,128,133,144-145,147-148,150,155-174,184,193-194,196-199,201-202,204,209,219-220,235,244-245,247,285,288-289,292-293,296-297,300-304,307-308,310

Summary

  • Total: 258 lines
  • Missing: 121 lines
  • Coverage: 53%

Line-by-line


app/desktop/studio_server/copilot_api.py

Lines 207-215

  207     @app.post("/api/copilot/question_spec")
  208     async def question_spec(
  209         input: SpecQuestionerInput,
  210     ) -> QuestionSet:
! 211         api_key = get_copilot_api_key()
  212         client = get_authenticated_client(api_key)
  213 
  214         questioner_input = SpecQuestionerInputServerApi.from_dict(input.model_dump())

Lines 240-248

  240     @app.post("/api/copilot/refine_spec_with_question_answers")
  241     async def submit_question_answers(
  242         request: SubmitAnswersRequest,
  243     ) -> RefineSpecWithQuestionAnswersResponse:
! 244         api_key = get_copilot_api_key()
  245         client = get_authenticated_client(api_key)
  246 
  247         submit_input = SubmitAnswersRequestServerApi.from_dict(request.model_dump())

Lines 289-321

  289 
  290         All models are validated before any saves occur. If validation fails,
  291         no data is persisted.
  292         """
! 293         task = task_from_id(project_id, task_id)
  294 
  295         # Generate tag suffixes
! 296         eval_tag_suffix = request.name.lower().replace(" ", "_")
! 297         eval_tag = f"eval_{eval_tag_suffix}"
! 298         train_tag = f"eval_train_{eval_tag_suffix}"
! 299         golden_tag = f"eval_golden_{eval_tag_suffix}"
  300 
  301         # Extract spec_type from properties (discriminated union)
! 302         spec_type = request.properties["spec_type"]
  303 
  304         # Determine eval properties
! 305         template = spec_eval_template(spec_type)
! 306         output_scores = [spec_eval_output_score(request.name)]
! 307         eval_set_filter_id = f"tag::{eval_tag}"
! 308         eval_configs_filter_id = f"tag::{golden_tag}"
! 309         evaluation_data_type = spec_eval_data_type(
  310             spec_type, request.evaluate_full_trace
  311         )
  312 
  313         # Build models but don't save yet, collect all models first
! 314         models_to_save: list[Eval | EvalConfig | TaskRun | Spec] = []
  315 
  316         # 1. Create the Eval
! 317         eval_model = Eval(
  318             parent=task,
  319             name=request.name,
  320             description=None,
  321             template=template,

Lines 324-335

  324             eval_configs_filter_id=eval_configs_filter_id,
  325             template_properties=None,
  326             evaluation_data_type=evaluation_data_type,
  327         )
! 328         models_to_save.append(eval_model)
  329 
  330         # 2. Create judge eval config
! 331         eval_config = EvalConfig(
  332             parent=eval_model,
  333             name=generate_memorable_name(),
  334             config_type=EvalConfigType.llm_as_judge,
  335             model_name=request.judge_info.task_metadata.model_name,

Lines 338-359

  338                 "eval_steps": [request.judge_info.prompt],
  339                 "task_description": request.task_description,
  340             },
  341         )
! 342         models_to_save.append(eval_config)
  343 
  344         # Set as default config after ID is assigned
! 345         eval_model.current_config_id = eval_config.id
  346 
  347         # 3. Generate examples via copilot API
! 348         api_key = get_copilot_api_key()
! 349         task_input_schema = (
  350             str(task.input_json_schema) if task.input_json_schema else ""
  351         )
! 352         task_output_schema = (
  353             str(task.output_json_schema) if task.output_json_schema else ""
  354         )
! 355         all_examples = await generate_copilot_examples(
  356             api_key=api_key,
  357             target_task_info=TaskInfoApi(
  358                 task_prompt=request.task_prompt_with_few_shot,
  359                 task_input_schema=task_input_schema,

Lines 372-380

  372             spec_definition=request.definition,
  373         )
  374 
  375         # 4. Create TaskRuns for eval, train, and golden datasets
! 376         task_runs = create_dataset_task_runs(
  377             all_examples=all_examples,
  378             reviewed_examples=request.reviewed_examples,
  379             eval_tag=eval_tag,
  380             train_tag=train_tag,

Lines 380-393

  380             train_tag=train_tag,
  381             golden_tag=golden_tag,
  382             spec_name=request.name,
  383         )
! 384         for run in task_runs:
! 385             run.parent = task
! 386         models_to_save.extend(task_runs)
  387 
  388         # 5. Create the Spec using pre-computed definition and properties from client
! 389         spec = Spec(
  390             parent=task,
  391             name=request.name,
  392             definition=request.definition,
  393             properties=request.properties,

Lines 395-430

  395             status=SpecStatus.active,
  396             tags=[],
  397             eval_id=eval_model.id,
  398         )
! 399         models_to_save.append(spec)
  400 
  401         # All models are now created and validated via Pydantic.
  402         # Save everything, with cleanup on failure.
! 403         saved_models: list[Eval | EvalConfig | TaskRun | Spec] = []
! 404         try:
! 405             eval_model.save_to_file()
! 406             saved_models.append(eval_model)
  407 
! 408             eval_config.save_to_file()
! 409             saved_models.append(eval_config)
  410 
! 411             for run in task_runs:
! 412                 run.save_to_file()
! 413                 saved_models.append(run)
  414 
! 415             spec.save_to_file()
! 416             saved_models.append(spec)
! 417         except Exception:
  418             # Clean up any models that were successfully saved before the error
! 419             for model in reversed(saved_models):
! 420                 try:
! 421                     model.delete()
! 422                 except Exception:
  423                     # Log cleanup error but continue, the original error is more important
! 424                     logger.exception(
  425                         f"Failed to delete {type(model).__name__} during cleanup"
  426                     )
! 427             raise
  428 
! 429         return spec

app/desktop/util/spec_utils.py

Lines 77-87

  77         topic_generation_task_info: Task info for topic generation
  78         input_generation_task_info: Task info for input generation
  79         spec_definition: The rendered spec definition
  80     """
! 81     client = get_authenticated_client(api_key)
  82 
! 83     generate_input = GenerateBatchInput.from_dict(
  84         {
  85             "target_task_info": target_task_info.model_dump(),
  86             "topic_generation_task_info": topic_generation_task_info.model_dump(),
  87             "input_generation_task_info": input_generation_task_info.model_dump(),

Lines 90-115

   90             "num_topics": NUM_TOPICS,
   91         }
   92     )
   93 
!  94     result = await generate_batch_v1_copilot_generate_batch_post.asyncio(
   95         client=client,
   96         body=generate_input,
   97     )
   98 
!  99     if result is None:
! 100         raise HTTPException(
  101             status_code=500, detail="Failed to generate batch: No response"
  102         )
  103 
! 104     if isinstance(result, HTTPValidationError):
! 105         raise HTTPException(
  106             status_code=422,
  107             detail=f"Validation error: {result.to_dict()}",
  108         )
  109 
! 110     if not isinstance(result, GenerateBatchOutput):
! 111         raise HTTPException(
  112             status_code=500,
  113             detail=f"Failed to generate batch: Unexpected response type {type(result)}",
  114         )

Lines 113-125

  113             detail=f"Failed to generate batch: Unexpected response type {type(result)}",
  114         )
  115 
  116     # Convert result to flat list of SampleApi
! 117     examples: list[SampleApi] = []
! 118     data_dict = result.to_dict().get("data_by_topic", {})
! 119     for topic_examples in data_dict.values():
! 120         for ex in topic_examples:
! 121             examples.append(
  122                 SampleApi(
  123                     input=ex.get("input", ""),
  124                     output=ex.get("output", ""),
  125                 )

Lines 124-137

  124                     output=ex.get("output", ""),
  125                 )
  126             )
  127 
! 128     return examples
  129 
  130 
  131 def spec_eval_output_score(spec_name: str) -> EvalOutputScore:
  132     """Create an EvalOutputScore for a spec."""
! 133     return EvalOutputScore(
  134         name=spec_name,
  135         type=TaskOutputRatingType.pass_fail,
  136         instruction=f"Evaluate if the model's behaviour meets the spec: {spec_name}.",
  137     )

Lines 140-178

  140 def spec_eval_data_type(
  141     spec_type: SpecType, evaluate_full_trace: bool = False
  142 ) -> EvalDataType:
  143     """Determine the eval data type for a spec."""
! 144     if spec_type == SpecType.reference_answer_accuracy:
! 145         return EvalDataType.reference_answer
  146 
! 147     if evaluate_full_trace:
! 148         return EvalDataType.full_trace
  149     else:
! 150         return EvalDataType.final_answer
  151 
  152 
  153 def spec_eval_template(spec_type: SpecType) -> EvalTemplateId | None:
  154     """Get the eval template for a spec type."""
! 155     match spec_type:
! 156         case SpecType.appropriate_tool_use:
! 157             return EvalTemplateId.tool_call
! 158         case SpecType.reference_answer_accuracy:
! 159             return EvalTemplateId.rag
! 160         case SpecType.factual_correctness:
! 161             return EvalTemplateId.factual_correctness
! 162         case SpecType.toxicity:
! 163             return EvalTemplateId.toxicity
! 164         case SpecType.bias:
! 165             return EvalTemplateId.bias
! 166         case SpecType.maliciousness:
! 167             return EvalTemplateId.maliciousness
! 168         case SpecType.jailbreak:
! 169             return EvalTemplateId.jailbreak
! 170         case SpecType.issue:
! 171             return EvalTemplateId.issue
! 172         case SpecType.desired_behaviour:
! 173             return EvalTemplateId.desired_behaviour
! 174         case (
  175             SpecType.tone
  176             | SpecType.formatting
  177             | SpecType.localization
  178             | SpecType.hallucinations

Lines 180-188

  180             | SpecType.nsfw
  181             | SpecType.taboo
  182             | SpecType.prompt_leakage
  183         ):
! 184             return None
  185 
  186 
  187 def sample_and_remove(examples: list[SampleApi], n: int) -> list[SampleApi]:
  188     """Randomly sample and remove n items from a list.

Lines 189-213

  189 
  190     Mutates the input list by removing the sampled elements.
  191     Uses swap-and-pop for O(1) removal.
  192     """
! 193     sampled: list[SampleApi] = []
! 194     count = min(n, len(examples))
  195 
! 196     for _ in range(count):
! 197         if not examples:
! 198             break
! 199         random_index = random.randint(0, len(examples) - 1)
  200         # Swap with last element and pop
! 201         examples[random_index], examples[-1] = examples[-1], examples[random_index]
! 202         sampled.append(examples.pop())
  203 
! 204     return sampled
  205 
  206 
  207 def create_task_run_from_sample(sample: SampleApi, tag: str) -> TaskRun:
  208     """Create a TaskRun from a SampleApi (without parent set)."""
! 209     data_source = DataSource(
  210         type=DataSourceType.synthetic,
  211         properties={
  212             "adapter_name": KILN_ADAPTER_NAME,
  213             "model_name": KILN_COPILOT_MODEL_NAME,

Lines 215-224

  215         },
  216     )
  217 
  218     # Access input using model_dump since SampleApi uses alias
! 219     sample_dict = sample.model_dump(by_alias=True)
! 220     return TaskRun(
  221         input=sample_dict["input"],
  222         input_source=data_source,
  223         output=TaskOutput(
  224             output=sample.output,

Lines 231-239

  231 def create_task_run_from_reviewed(
  232     example: ReviewedExample, tag: str, spec_name: str
  233 ) -> TaskRun:
  234     """Create a TaskRun from a reviewed example with rating (without parent set)."""
! 235     data_source = DataSource(
  236         type=DataSourceType.synthetic,
  237         properties={
  238             "adapter_name": KILN_ADAPTER_NAME,
  239             "model_name": KILN_COPILOT_MODEL_NAME,

Lines 240-251

  240             "model_provider": KILN_COPILOT_MODEL_PROVIDER,
  241         },
  242     )
  243 
! 244     rating_key = f"named::{spec_name}"
! 245     rating_value = 1.0 if example.user_says_meets_spec else 0.0
  246 
! 247     return TaskRun(
  248         input=example.input,
  249         input_source=data_source,
  250         output=TaskOutput(
  251             output=example.output,

Lines 281-311

  281     - Golden dataset (reviewed examples + unrated examples to reach MIN_GOLDEN_EXAMPLES)
  282 
  283     Returns TaskRuns without parent set - caller must set parent.
  284     """
! 285     task_runs: list[TaskRun] = []
  286 
  287     # Sample examples for eval and train datasets
! 288     eval_examples = sample_and_remove(all_examples, MIN_EVAL_EXAMPLES)
! 289     train_examples = sample_and_remove(all_examples, MIN_TRAIN_EXAMPLES)
  290 
  291     # Create TaskRuns for eval examples
! 292     for example in eval_examples:
! 293         task_runs.append(create_task_run_from_sample(example, eval_tag))
  294 
  295     # Create TaskRuns for train examples
! 296     for example in train_examples:
! 297         task_runs.append(create_task_run_from_sample(example, train_tag))
  298 
  299     # Create unrated golden examples from remaining pool if needed
! 300     unrated_golden_count = max(0, MIN_GOLDEN_EXAMPLES - len(reviewed_examples))
! 301     if unrated_golden_count > 0:
! 302         unrated_golden_examples = sample_and_remove(all_examples, unrated_golden_count)
! 303         for example in unrated_golden_examples:
! 304             task_runs.append(create_task_run_from_sample(example, golden_tag))
  305 
  306     # Create TaskRuns for reviewed examples with ratings
! 307     for reviewed in reviewed_examples:
! 308         task_runs.append(create_task_run_from_reviewed(reviewed, golden_tag, spec_name))
  309 
! 310     return task_runs
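As a rough illustration of the partitioning arithmetic in create_dataset_task_runs above: eval and train sets are drawn from the generated pool first, then unrated golden examples top up the reviewed ones to the golden minimum. The MIN_* values below are hypothetical placeholders, not the real constants from spec_utils.py.

```python
# Hypothetical minimums; the real MIN_* constants are not shown in this excerpt.
MIN_EVAL_EXAMPLES = 5
MIN_TRAIN_EXAMPLES = 5
MIN_GOLDEN_EXAMPLES = 10

def partition_counts(total_generated: int, num_reviewed: int) -> tuple[int, int, int]:
    """Return (eval, train, unrated_golden) counts drawn from the generated pool."""
    remaining = total_generated
    eval_n = min(MIN_EVAL_EXAMPLES, remaining)
    remaining -= eval_n
    train_n = min(MIN_TRAIN_EXAMPLES, remaining)
    remaining -= train_n
    # Reviewed examples count toward the golden set; top up with unrated ones.
    unrated_golden = min(max(0, MIN_GOLDEN_EXAMPLES - num_reviewed), remaining)
    return eval_n, train_n, unrated_golden

print(partition_counts(30, 4))  # → (5, 5, 6)
```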



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the spec creation logic, moving it from the TypeScript frontend (spec_persistence.ts) to the Python backend (spec_api.py). This is a significant and positive change, centralizing business logic on the server. A new endpoint /api/projects/{project_id}/tasks/{task_id}/spec_with_copilot is introduced to handle spec creation with Kiln Copilot, including eval creation, example generation, and judge configuration.

The review identified a few critical issues in the new backend implementation. Specifically, the persistence logic lacks transactionality, which could lead to an inconsistent state if an error occurs during file saving. There's also a bug in how JSON schemas are serialized before being sent to an external API. Additionally, there are some opportunities to improve code clarity and maintainability in both the frontend and backend code.

Overall, this is a great refactoring, but the identified issues, especially the critical ones, should be addressed before merging.

* These examples form the golden dataset for the spec's eval.
* user_says_meets_spec is optional in the UI (not yet reviewed) but required when sent to backend.
*/
export type ReviewRow = {
@sfierro (author) replied:

moved from spec_persistence


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/spec_utils.ts`:
- Around line 22-35: Prettier formatting failed for the file containing the
ReviewRow type; run a code formatter (e.g., run prettier --write on the file) to
fix whitespace/formatting issues in the ReviewRow declaration and surrounding
comments, then stage and re-commit the updated file so the CI pipeline passes.
♻️ Duplicate comments (4)
libs/server/kiln_server/spec_api.py (4)

83-96: Rename input to avoid shadowing the built‑in

Shadowing input() is a frequent lint/source-of-confusion hotspot. Prefer an internal name with an alias so the external API remains unchanged.

♻️ Suggested refactor
 class ReviewedExample(BaseModel):
@@
-    input: str = Field(alias="input")
+    input_str: str = Field(alias="input")
@@
 def _create_task_run_from_reviewed(
     example: ReviewedExample, tag: str, spec_name: str
 ) -> TaskRun:
@@
-        input=example.input,
+        input=example.input_str,
Pydantic v2 Field alias and populate_by_name usage for renaming internal fields while keeping external names

Also applies to: 306-324
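The rename-with-alias pattern suggested above can be sketched in isolation (assuming Pydantic v2; the model name mirrors the one in the diff, but this is a standalone illustration):

```python
from pydantic import BaseModel, ConfigDict, Field

class ReviewedExample(BaseModel):
    # Internal name avoids shadowing the input() builtin at call sites;
    # the wire format still uses "input" via the alias.
    model_config = ConfigDict(populate_by_name=True)

    input_str: str = Field(alias="input")
    output: str

ex = ReviewedExample.model_validate({"input": "question", "output": "answer"})
assert ex.input_str == "question"
# by_alias=True restores the external key on serialization
assert ex.model_dump(by_alias=True) == {"input": "question", "output": "answer"}
```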


28-33: Keep libs/server free of desktop-only dependencies

Importing app.desktop... couples the library to the desktop app and breaks portability/packaging. Please move this endpoint to app/desktop/studio_server or extract the copilot client/types into a shared package and depend on that instead.


1-2: Serialize schemas with json.dumps (str() isn’t JSON)

str(dict) yields single quotes and is not valid JSON; downstream APIs expecting JSON strings will misparse it. Use json.dumps instead.

🐛 Proposed fix
-import random
+import json
+import random
@@
-            task_input_schema=str(task.input_json_schema)
+            task_input_schema=json.dumps(task.input_json_schema)
             if task.input_json_schema
             else "",
-            task_output_schema=str(task.output_json_schema)
+            task_output_schema=json.dumps(task.output_json_schema)
             if task.output_json_schema
             else "",
Python str(dict) vs json.dumps — why str(dict) output is not valid JSON

Based on learnings, avoid str() for schema serialization.

Also applies to: 526-533
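A quick stdlib demonstration of why this matters: str() of a dict emits Python repr with single quotes, which a strict JSON parser rejects, while json.dumps produces a valid JSON string.

```python
import json

schema = {"type": "object", "properties": {"name": {"type": "string"}}}

repr_string = str(schema)         # "{'type': 'object', ...}" (not JSON)
json_string = json.dumps(schema)  # '{"type": "object", ...}' (valid JSON)

assert json.loads(json_string) == schema
try:
    json.loads(repr_string)
except json.JSONDecodeError:
    print("repr form rejected by the JSON parser")
```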


563-569: Add rollback to avoid partial persistence

If any save fails, you can leave orphaned Eval/Config/TaskRun files. A rollback cleanup (like the old TS flow) is needed to keep storage consistent.

🧹 Suggested rollback pattern
-        eval_model.save_to_file()
-        eval_config.save_to_file()
-        for run in task_runs:
-            run.save_to_file()
-        spec.save_to_file()
+        saved_models: list[Eval | EvalConfig | TaskRun | Spec] = []
+        try:
+            eval_model.save_to_file()
+            saved_models.append(eval_model)
+            eval_config.save_to_file()
+            saved_models.append(eval_config)
+            for run in task_runs:
+                run.save_to_file()
+                saved_models.append(run)
+            spec.save_to_file()
+            saved_models.append(spec)
+        except Exception:
+            for model in reversed(saved_models):
+                try:
+                    model.delete()
+                except Exception:
+                    pass
+            raise
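The suggested rollback shape can be sketched with in-memory stand-ins for the Kiln models (the save_to_file/delete method names mirror the diff above, but the class itself is hypothetical):

```python
saved_paths: list[str] = []  # stands in for files written to disk

class FakeModel:
    def __init__(self, name: str, fail: bool = False):
        self.name = name
        self.fail = fail

    def save_to_file(self) -> None:
        if self.fail:
            raise IOError(f"disk error while saving {self.name}")
        saved_paths.append(self.name)

    def delete(self) -> None:
        saved_paths.remove(self.name)

def save_all_or_rollback(models: list[FakeModel]) -> None:
    saved: list[FakeModel] = []
    try:
        for m in models:
            m.save_to_file()
            saved.append(m)
    except Exception:
        # Undo in reverse order; the original error stays primary.
        for m in reversed(saved):
            try:
                m.delete()
            except Exception:
                pass
        raise

try:
    save_all_or_rollback(
        [FakeModel("eval"), FakeModel("config", fail=True), FakeModel("spec")]
    )
except IOError:
    pass

assert saved_paths == []  # the partially saved "eval" was rolled back
```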

Base automatically changed from sfierro/spec_questions to scosman/spec_questions January 25, 2026 00:44
Base automatically changed from scosman/spec_questions to sfierro/specs January 25, 2026 00:47

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `app/desktop/studio_server/copilot_api.py`:
- Around lines 490-498: The code passes Python repr strings to
`_generate_copilot_examples` by calling `str(task.input_json_schema)` and
`str(task.output_json_schema)`. Replace these with proper JSON serialization
(`json.dumps(task.input_json_schema)` and `json.dumps(task.output_json_schema)`)
and import `json` at the top of the file, keeping the existing conditional so an
empty string is still passed when the schema is falsy. Update the
`task_input_schema` and `task_output_schema` arguments in the
`_generate_copilot_examples` call accordingly so the Copilot API receives valid
JSON strings.
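The distinction matters because `str()` on a dict produces a Python repr with single quotes, which JSON parsers reject. A minimal sketch illustrating the difference (the `schema` dict here is hypothetical, standing in for a task's `input_json_schema`):

```python
import json

# Hypothetical schema dict, as a task's input_json_schema might hold.
schema = {"type": "object", "properties": {"name": {"type": "string"}}}

# str() yields a Python repr with single quotes -- not parseable as JSON.
repr_string = str(schema)

# json.dumps() yields valid JSON the Copilot API can consume.
json_string = json.dumps(schema)


def is_valid_json(s: str) -> bool:
    """Return True if s parses as JSON."""
    try:
        json.loads(s)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(repr_string))  # False
print(is_valid_json(json_string))  # True
```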

In `@app/web_ui/src/routes/`(app)/specs/[project_id]/[task_id]/spec_builder/+page.svelte:
- Around lines 364-416: Remove the debug `console.log` calls in the spec creation
flow: delete the `console.log` statements referencing `use_kiln_copilot`,
`judge_info`, `data`, and `api_error` inside the `if (use_kiln_copilot)` branch
(they appear around the POST to `/spec_with_copilot`). Leave the error handling
(`throw api_error`) intact; if persisted logging is needed, replace the calls
with a proper logger rather than `console.log`.
- Line 39: Update the import of `ReviewRow` in `+page.svelte` to point to the
parent directory where `spec_utils.ts` actually lives; replace the current
relative import `"./spec_utils.ts"` with `"../spec_utils.ts"` so the `ReviewRow`
type resolves from the `spec_utils.ts` used by the spec_builder page.
🧹 Nitpick comments (7)
app/desktop/util/spec_creation.py (4)

34-38: Redundant Field(alias="input") when field name matches alias.

The alias="input" is unnecessary here since the field is already named input. The alias only has effect when the alias differs from the field name. This can be simplified.

Suggested simplification
 class SampleApi(BaseModel):
     """A sample input/output pair."""

-    input: str = Field(alias="input")
+    input: str
     output: str

41-54: Same redundant alias issue and missing model_config in SampleApi.

ReviewedExample has populate_by_name=True configured, but SampleApi doesn't. If these models are used with aliased JSON keys, consider adding model_config to SampleApi as well for consistency. Also, the alias="input" is redundant when the field name is already input.

Suggested consistency fix
 class SampleApi(BaseModel):
     """A sample input/output pair."""

-    input: str = Field(alias="input")
+    input: str
     output: str

+    model_config = {"populate_by_name": True}
+

 class ReviewedExample(BaseModel):
     """A reviewed example from the spec review process.

     Extends SampleApi with review-specific fields for tracking
     model and user judgments on spec compliance.
     """

-    input: str = Field(alias="input")
+    input: str
     output: str
     model_says_meets_spec: bool
     user_says_meets_spec: bool
     feedback: str

     model_config = {"populate_by_name": True}

79-110: Missing default case in match statement.

The spec_eval_template function uses a match statement without a wildcard `case _:` default. While Python won't raise an error, it will implicitly return None for unhandled SpecType values. If new SpecType variants are added in the future, this could silently return None unexpectedly. Consider adding an explicit default case for clarity.

Add explicit default case
         case (
             SpecType.tone
             | SpecType.formatting
             | SpecType.localization
             | SpecType.hallucinations
             | SpecType.completeness
             | SpecType.nsfw
             | SpecType.taboo
             | SpecType.prompt_leakage
         ):
             return None
+        case _:
+            return None

133-154: Unnecessary model_dump when field name equals alias.

Since the field name is input and the alias is also "input", you can access sample.input directly without needing model_dump(by_alias=True). The comment suggests uncertainty about alias behavior that doesn't apply here.

Simplify direct field access
 def create_task_run_from_sample(sample: SampleApi, tag: str) -> TaskRun:
     """Create a TaskRun from a SampleApi (without parent set)."""
     data_source = DataSource(
         type=DataSourceType.synthetic,
         properties={
             "adapter_name": KILN_ADAPTER_NAME,
             "model_name": KILN_COPILOT_MODEL_NAME,
             "model_provider": KILN_COPILOT_MODEL_PROVIDER,
         },
     )

-    # Access input using model_dump since SampleApi uses alias
-    sample_dict = sample.model_dump(by_alias=True)
     return TaskRun(
-        input=sample_dict["input"],
+        input=sample.input,
         input_source=data_source,
         output=TaskOutput(
             output=sample.output,
             source=data_source,
         ),
         tags=[tag],
     )
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/spec_builder/+page.svelte (2)

350-362: Potential runtime error if value is not a string.

The filter uses value.trim() but value is typed as string | null. While the !== null check filters nulls, if values contains non-string values at runtime (e.g., numbers or booleans from misconfiguration), this would throw. Consider adding a type guard.

Add type safety
     // Build properties object with spec_type, filtering out null and empty values
     const filteredValues = Object.fromEntries(
       Object.entries(values).filter(
-        ([_, value]) => value !== null && value.trim() !== "",
+        ([_, value]) => typeof value === "string" && value.trim() !== "",
       ),
     )

394-412: Non-copilot path has incomplete implementation (TODO comment).

The TODO at line 406 indicates this endpoint doesn't create the eval with eval tags. This may result in incomplete spec creation for non-copilot flows. Consider either implementing the missing functionality or documenting this as a known limitation.

Would you like me to help implement the non-copilot eval creation, or should this be tracked as a separate issue?

app/desktop/studio_server/copilot_api.py (1)

527-553: Cleanup logic re-raises the original exception without context.

The cleanup logic properly attempts to delete saved models on failure, but the bare raise will propagate the original exception. Consider wrapping in an HTTPException with a 500 status to ensure consistent API error responses, or at minimum log the original error before cleanup.

Improve error handling
         saved_models: list[Eval | EvalConfig | TaskRun | Spec] = []
         try:
             eval_model.save_to_file()
             saved_models.append(eval_model)

             eval_config.save_to_file()
             saved_models.append(eval_config)

             for run in task_runs:
                 run.save_to_file()
                 saved_models.append(run)

             spec.save_to_file()
             saved_models.append(spec)
-        except Exception:
+        except Exception as e:
+            logger.exception("Failed to save spec with copilot, rolling back")
             # Clean up any models that were successfully saved before the error
             for model in reversed(saved_models):
                 try:
                     model.delete()
                 except Exception:
                     # Log cleanup error but continue, the original error is more important
                     logger.exception(
                         f"Failed to delete {type(model).__name__} during cleanup"
                     )
-            raise
+            raise HTTPException(
+                status_code=500,
+                detail=f"Failed to create spec: {str(e)}",
+            ) from e

@sfierro sfierro merged commit fdc0a3b into sfierro/specs Jan 27, 2026
10 checks passed
@sfierro sfierro deleted the sfierro/spec-persistence branch January 27, 2026 04:48