
Persistence moved to python #979

Merged
sfierro merged 11 commits into sfierro/specs from sfierro/spec-persistence
Jan 27, 2026

Conversation


@sfierro sfierro commented Jan 23, 2026

Summary by CodeRabbit

  • New Features
    • Copilot-assisted spec creation: option to generate, review, and save specs using AI-generated examples and prompts directly from the spec builder UI.
  • User Experience
    • Improved reviewed-examples workflow and review row identifiers for more reliable pass/fail and feedback interactions.
  • Refactor
    • Under-the-hood API and model updates to support the Copilot workflow (no change to core spec UI behavior).



coderabbitai bot commented Jan 23, 2026

Walkthrough

Adds Kiln Copilot end-to-end support: new POST /api/projects/{project_id}/tasks/{task_id}/spec_with_copilot endpoint and request model, frontend wiring to choose Copilot vs standard spec creation, new backend Copilot utilities and models, migration of spec persistence logic into Copilot utilities, and updated types/schema across client and server.

Changes

Cohort / File(s) Summary
API Schema & Types
app/web_ui/src/lib/api_schema.d.ts, app/web_ui/src/lib/types.ts
New API path /api/projects/{project_id}/tasks/{task_id}/spec_with_copilot and CreateSpecWithCopilotRequest/ReviewedExample schemas; split PromptGenerationResultApi into -Input/-Output; added TypeScript aliases (PromptGenerationResultApi, TaskMetadataApi, ReviewedExample, SampleApi, SubsampleBatchOutputItemApi).
Frontend Spec Builder UI
app/web_ui/src/routes/.../spec_builder/+page.svelte, .../review_examples.svelte
Switched judge_info type to PromptGenerationResultApi; migrated ReviewRow to use row_id and import from spec_utils.ts; saveSpec now builds payload client-side and conditionally POSTs to /spec_with_copilot (Copilot) or /spec (legacy).
Frontend Utilities
app/web_ui/src/routes/.../spec_builder/spec_utils.ts
Added exported ReviewRow type (includes row_id) and removed local ReviewedExample/JudgeInfo usages.
Removed Frontend Persistence
app/web_ui/src/routes/.../spec_builder/spec_persistence.ts
Deleted large orchestration module (createSpec and helpers) — persistence/eval orchestration moved server-side and into util modules; multiple exported types/functions removed.
Backend Copilot API & Models
app/desktop/studio_server/copilot_api.py, app/desktop/studio_server/copilot_models.py
New CreateSpecWithCopilotRequest Pydantic model and /spec_with_copilot endpoint; added many Copilot DTOs (TaskInfoApi, SampleApi, ReviewedExample, PromptGenerationResultApi, Clarify/Refine/GenerateBatch inputs/outputs). Endpoint orchestrates eval/config/task-run/spec creation and rollback on failure.
Backend Copilot Utilities
app/desktop/util/spec_utils.py
New utilities: get_copilot_api_key(), generate_copilot_examples(), SampleApi/ReviewedExample models, sampling and TaskRun creation helpers, dataset construction functions used by copilot flow.
API Client Models
app/desktop/studio_server/api_client/.../models/*
Added TaskInfo model and updated GenerateBatchInput / ClarifySpecInput to use TaskInfo fields (nested target_task_info, topic_generation_task_info, input_generation_task_info, target_specification); exported TaskInfo in package init.
Server Tests & Minor docs
app/desktop/studio_server/test_copilot_api.py, libs/server/kiln_server/spec_api.py, app/desktop/studio_server/test_questions_models.py
Adjusted tests to use new config path and updated payload shapes; added TODO comments in spec_api about eval generation behavior; minor import/doc updates for question models.

Sequence Diagram(s)

sequenceDiagram
    participant UI as Spec Builder UI
    participant FE as Frontend Payload Builder
    participant API as Backend Copilot API
    participant Copilot as Kiln Copilot Service
    participant DB as Database

    UI->>FE: saveSpec(use_kiln_copilot=true)
    FE->>FE: build definition & properties
    FE->>API: POST /spec_with_copilot (CreateSpecWithCopilotRequest)
    API->>API: validate request & create Eval/EvalConfig
    API->>Copilot: generate_batch (target/task infos, spec)
    Copilot-->>API: list[SampleApi]
    API->>DB: create TaskRuns (eval/train/golden) and persist TaskRuns
    API->>DB: create Spec and link Eval
    API-->>FE: Spec response (Spec)
    FE-->>UI: spec_id or error
    Note over API,DB: on any failure => rollback created Eval/TaskRuns
sequenceDiagram
    participant UI as Spec Builder UI
    participant FE as Frontend Payload Builder
    participant API as Backend Spec API
    participant DB as Database

    UI->>FE: saveSpec(use_kiln_copilot=false)
    FE->>FE: build definition & properties
    FE->>API: POST /spec (legacy payload)
    API->>DB: create Spec (default priority/status)
    API-->>FE: Spec response
    FE-->>UI: spec_id

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

  • scosman
  • chiang-daniel
  • tawnymanticore

Poem

🐰 Hop, click, and bind — a Copilot tale unfolds,
From UI fields to server dreams and models bright as gold.
I nibble payloads, stitch the flow, then tidy up the trace,
With rows and samples, tags and run — a spec that's full of grace.
Hooray! — the rabbit cheers for code that finds its place.

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
  • Description check ⚠️ Warning: The PR description is entirely empty, missing the required template sections, including 'What does this PR do?' and other mandatory checklist items. Resolution: add a comprehensive description explaining the persistence migration, including the 'What does this PR do?' section, related issues, CLA confirmation, and completed checklists.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 41.30%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
  • Title check ✅ Passed: The title 'Persistence moved to python' accurately reflects the main change: spec persistence logic was migrated from TypeScript (spec_persistence.ts deleted) to Python (spec_utils.py added).






@gemini-code-assist
Contributor

Summary of Changes

Hello @sfierro, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the architecture for creating new specifications within the application. By moving the intricate persistence and example generation logic for Kiln Copilot-assisted spec creation from the frontend to a new backend Python endpoint, the change centralizes critical business logic. This improves maintainability, enhances data integrity by ensuring server-side validation and atomic operations, and streamlines the client-side code by offloading complex orchestration.

Highlights

  • Backend Centralization of Spec Creation: The complex logic for creating specifications with Kiln Copilot, including eval generation, example batching, and judge configuration, has been migrated from the frontend (TypeScript) to the backend (Python).
  • New API Endpoint: A dedicated FastAPI endpoint, /api/projects/{project_id}/tasks/{task_id}/spec_with_copilot, has been introduced to handle the comprehensive spec creation process on the server side.
  • Frontend Simplification: The spec_persistence.ts file, which previously contained the client-side persistence logic, has been removed, simplifying the frontend codebase.
  • API Schema Updates: The API schema (api_schema.d.ts) and frontend types (types.ts) have been updated to reflect the new spec_with_copilot endpoint and its associated request/response models, including new ReviewedExample and refined PromptGenerationResultApi types.


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review: /gemini review performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help: /gemini help displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.


Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


github-actions bot commented Jan 23, 2026

📊 Coverage Report

Overall Coverage: 91%

Diff: origin/sfierro/specs...HEAD

  • app/desktop/studio_server/copilot_api.py (41.5%): Missing lines 211,244,293,296-299,302,305-309,314,317,328,331,342,345,348-349,352,355,376,384-386,389,399,403-406,408-409,411-413,415-417,419-422,424,427,429
  • app/desktop/studio_server/copilot_models.py (100%)
  • app/desktop/util/spec_utils.py (31.1%): Missing lines 81,83,94,99-100,104-105,110-111,117-121,128,133,144-145,147-148,150,155-174,184,193-194,196-199,201-202,204,209,219-220,235,244-245,247,285,288-289,292-293,296-297,300-304,307-308,310

Summary

  • Total: 258 lines
  • Missing: 121 lines
  • Coverage: 53%

Line-by-line


app/desktop/studio_server/copilot_api.py

Lines 207-215

  207     @app.post("/api/copilot/question_spec")
  208     async def question_spec(
  209         input: SpecQuestionerInput,
  210     ) -> QuestionSet:
! 211         api_key = get_copilot_api_key()
  212         client = get_authenticated_client(api_key)
  213 
  214         questioner_input = SpecQuestionerInputServerApi.from_dict(input.model_dump())

Lines 240-248

  240     @app.post("/api/copilot/refine_spec_with_question_answers")
  241     async def submit_question_answers(
  242         request: SubmitAnswersRequest,
  243     ) -> RefineSpecWithQuestionAnswersResponse:
! 244         api_key = get_copilot_api_key()
  245         client = get_authenticated_client(api_key)
  246 
  247         submit_input = SubmitAnswersRequestServerApi.from_dict(request.model_dump())

Lines 289-321

  289 
  290         All models are validated before any saves occur. If validation fails,
  291         no data is persisted.
  292         """
! 293         task = task_from_id(project_id, task_id)
  294 
  295         # Generate tag suffixes
! 296         eval_tag_suffix = request.name.lower().replace(" ", "_")
! 297         eval_tag = f"eval_{eval_tag_suffix}"
! 298         train_tag = f"eval_train_{eval_tag_suffix}"
! 299         golden_tag = f"eval_golden_{eval_tag_suffix}"
  300 
  301         # Extract spec_type from properties (discriminated union)
! 302         spec_type = request.properties["spec_type"]
  303 
  304         # Determine eval properties
! 305         template = spec_eval_template(spec_type)
! 306         output_scores = [spec_eval_output_score(request.name)]
! 307         eval_set_filter_id = f"tag::{eval_tag}"
! 308         eval_configs_filter_id = f"tag::{golden_tag}"
! 309         evaluation_data_type = spec_eval_data_type(
  310             spec_type, request.evaluate_full_trace
  311         )
  312 
  313         # Build models but don't save yet, collect all models first
! 314         models_to_save: list[Eval | EvalConfig | TaskRun | Spec] = []
  315 
  316         # 1. Create the Eval
! 317         eval_model = Eval(
  318             parent=task,
  319             name=request.name,
  320             description=None,
  321             template=template,

Lines 324-335

  324             eval_configs_filter_id=eval_configs_filter_id,
  325             template_properties=None,
  326             evaluation_data_type=evaluation_data_type,
  327         )
! 328         models_to_save.append(eval_model)
  329 
  330         # 2. Create judge eval config
! 331         eval_config = EvalConfig(
  332             parent=eval_model,
  333             name=generate_memorable_name(),
  334             config_type=EvalConfigType.llm_as_judge,
  335             model_name=request.judge_info.task_metadata.model_name,

Lines 338-359

  338                 "eval_steps": [request.judge_info.prompt],
  339                 "task_description": request.task_description,
  340             },
  341         )
! 342         models_to_save.append(eval_config)
  343 
  344         # Set as default config after ID is assigned
! 345         eval_model.current_config_id = eval_config.id
  346 
  347         # 3. Generate examples via copilot API
! 348         api_key = get_copilot_api_key()
! 349         task_input_schema = (
  350             str(task.input_json_schema) if task.input_json_schema else ""
  351         )
! 352         task_output_schema = (
  353             str(task.output_json_schema) if task.output_json_schema else ""
  354         )
! 355         all_examples = await generate_copilot_examples(
  356             api_key=api_key,
  357             target_task_info=TaskInfoApi(
  358                 task_prompt=request.task_prompt_with_few_shot,
  359                 task_input_schema=task_input_schema,

Lines 372-380

  372             spec_definition=request.definition,
  373         )
  374 
  375         # 4. Create TaskRuns for eval, train, and golden datasets
! 376         task_runs = create_dataset_task_runs(
  377             all_examples=all_examples,
  378             reviewed_examples=request.reviewed_examples,
  379             eval_tag=eval_tag,
  380             train_tag=train_tag,

Lines 380-393

  380             train_tag=train_tag,
  381             golden_tag=golden_tag,
  382             spec_name=request.name,
  383         )
! 384         for run in task_runs:
! 385             run.parent = task
! 386         models_to_save.extend(task_runs)
  387 
  388         # 5. Create the Spec using pre-computed definition and properties from client
! 389         spec = Spec(
  390             parent=task,
  391             name=request.name,
  392             definition=request.definition,
  393             properties=request.properties,

Lines 395-430

  395             status=SpecStatus.active,
  396             tags=[],
  397             eval_id=eval_model.id,
  398         )
! 399         models_to_save.append(spec)
  400 
  401         # All models are now created and validated via Pydantic.
  402         # Save everything, with cleanup on failure.
! 403         saved_models: list[Eval | EvalConfig | TaskRun | Spec] = []
! 404         try:
! 405             eval_model.save_to_file()
! 406             saved_models.append(eval_model)
  407 
! 408             eval_config.save_to_file()
! 409             saved_models.append(eval_config)
  410 
! 411             for run in task_runs:
! 412                 run.save_to_file()
! 413                 saved_models.append(run)
  414 
! 415             spec.save_to_file()
! 416             saved_models.append(spec)
! 417         except Exception:
  418             # Clean up any models that were successfully saved before the error
! 419             for model in reversed(saved_models):
! 420                 try:
! 421                     model.delete()
! 422                 except Exception:
  423                     # Log cleanup error but continue, the original error is more important
! 424                     logger.exception(
  425                         f"Failed to delete {type(model).__name__} during cleanup"
  426                     )
! 427             raise
  428 
! 429         return spec

app/desktop/util/spec_utils.py

Lines 77-87

  77         topic_generation_task_info: Task info for topic generation
  78         input_generation_task_info: Task info for input generation
  79         spec_definition: The rendered spec definition
  80     """
! 81     client = get_authenticated_client(api_key)
  82 
! 83     generate_input = GenerateBatchInput.from_dict(
  84         {
  85             "target_task_info": target_task_info.model_dump(),
  86             "topic_generation_task_info": topic_generation_task_info.model_dump(),
  87             "input_generation_task_info": input_generation_task_info.model_dump(),

Lines 90-115

   90             "num_topics": NUM_TOPICS,
   91         }
   92     )
   93 
!  94     result = await generate_batch_v1_copilot_generate_batch_post.asyncio(
   95         client=client,
   96         body=generate_input,
   97     )
   98 
!  99     if result is None:
! 100         raise HTTPException(
  101             status_code=500, detail="Failed to generate batch: No response"
  102         )
  103 
! 104     if isinstance(result, HTTPValidationError):
! 105         raise HTTPException(
  106             status_code=422,
  107             detail=f"Validation error: {result.to_dict()}",
  108         )
  109 
! 110     if not isinstance(result, GenerateBatchOutput):
! 111         raise HTTPException(
  112             status_code=500,
  113             detail=f"Failed to generate batch: Unexpected response type {type(result)}",
  114         )

Lines 113-125

  113             detail=f"Failed to generate batch: Unexpected response type {type(result)}",
  114         )
  115 
  116     # Convert result to flat list of SampleApi
! 117     examples: list[SampleApi] = []
! 118     data_dict = result.to_dict().get("data_by_topic", {})
! 119     for topic_examples in data_dict.values():
! 120         for ex in topic_examples:
! 121             examples.append(
  122                 SampleApi(
  123                     input=ex.get("input", ""),
  124                     output=ex.get("output", ""),
  125                 )

Lines 124-137

  124                     output=ex.get("output", ""),
  125                 )
  126             )
  127 
! 128     return examples
  129 
  130 
  131 def spec_eval_output_score(spec_name: str) -> EvalOutputScore:
  132     """Create an EvalOutputScore for a spec."""
! 133     return EvalOutputScore(
  134         name=spec_name,
  135         type=TaskOutputRatingType.pass_fail,
  136         instruction=f"Evaluate if the model's behaviour meets the spec: {spec_name}.",
  137     )

Lines 140-178

  140 def spec_eval_data_type(
  141     spec_type: SpecType, evaluate_full_trace: bool = False
  142 ) -> EvalDataType:
  143     """Determine the eval data type for a spec."""
! 144     if spec_type == SpecType.reference_answer_accuracy:
! 145         return EvalDataType.reference_answer
  146 
! 147     if evaluate_full_trace:
! 148         return EvalDataType.full_trace
  149     else:
! 150         return EvalDataType.final_answer
  151 
  152 
  153 def spec_eval_template(spec_type: SpecType) -> EvalTemplateId | None:
  154     """Get the eval template for a spec type."""
! 155     match spec_type:
! 156         case SpecType.appropriate_tool_use:
! 157             return EvalTemplateId.tool_call
! 158         case SpecType.reference_answer_accuracy:
! 159             return EvalTemplateId.rag
! 160         case SpecType.factual_correctness:
! 161             return EvalTemplateId.factual_correctness
! 162         case SpecType.toxicity:
! 163             return EvalTemplateId.toxicity
! 164         case SpecType.bias:
! 165             return EvalTemplateId.bias
! 166         case SpecType.maliciousness:
! 167             return EvalTemplateId.maliciousness
! 168         case SpecType.jailbreak:
! 169             return EvalTemplateId.jailbreak
! 170         case SpecType.issue:
! 171             return EvalTemplateId.issue
! 172         case SpecType.desired_behaviour:
! 173             return EvalTemplateId.desired_behaviour
! 174         case (
  175             SpecType.tone
  176             | SpecType.formatting
  177             | SpecType.localization
  178             | SpecType.hallucinations

Lines 180-188

  180             | SpecType.nsfw
  181             | SpecType.taboo
  182             | SpecType.prompt_leakage
  183         ):
! 184             return None
  185 
  186 
  187 def sample_and_remove(examples: list[SampleApi], n: int) -> list[SampleApi]:
  188     """Randomly sample and remove n items from a list.

Lines 189-213

  189 
  190     Mutates the input list by removing the sampled elements.
  191     Uses swap-and-pop for O(1) removal.
  192     """
! 193     sampled: list[SampleApi] = []
! 194     count = min(n, len(examples))
  195 
! 196     for _ in range(count):
! 197         if not examples:
! 198             break
! 199         random_index = random.randint(0, len(examples) - 1)
  200         # Swap with last element and pop
! 201         examples[random_index], examples[-1] = examples[-1], examples[random_index]
! 202         sampled.append(examples.pop())
  203 
! 204     return sampled
  205 
  206 
  207 def create_task_run_from_sample(sample: SampleApi, tag: str) -> TaskRun:
  208     """Create a TaskRun from a SampleApi (without parent set)."""
! 209     data_source = DataSource(
  210         type=DataSourceType.synthetic,
  211         properties={
  212             "adapter_name": KILN_ADAPTER_NAME,
  213             "model_name": KILN_COPILOT_MODEL_NAME,

Lines 215-224

  215         },
  216     )
  217 
  218     # Access input using model_dump since SampleApi uses alias
! 219     sample_dict = sample.model_dump(by_alias=True)
! 220     return TaskRun(
  221         input=sample_dict["input"],
  222         input_source=data_source,
  223         output=TaskOutput(
  224             output=sample.output,

Lines 231-239

  231 def create_task_run_from_reviewed(
  232     example: ReviewedExample, tag: str, spec_name: str
  233 ) -> TaskRun:
  234     """Create a TaskRun from a reviewed example with rating (without parent set)."""
! 235     data_source = DataSource(
  236         type=DataSourceType.synthetic,
  237         properties={
  238             "adapter_name": KILN_ADAPTER_NAME,
  239             "model_name": KILN_COPILOT_MODEL_NAME,

Lines 240-251

  240             "model_provider": KILN_COPILOT_MODEL_PROVIDER,
  241         },
  242     )
  243 
! 244     rating_key = f"named::{spec_name}"
! 245     rating_value = 1.0 if example.user_says_meets_spec else 0.0
  246 
! 247     return TaskRun(
  248         input=example.input,
  249         input_source=data_source,
  250         output=TaskOutput(
  251             output=example.output,

Lines 281-311

  281     - Golden dataset (reviewed examples + unrated examples to reach MIN_GOLDEN_EXAMPLES)
  282 
  283     Returns TaskRuns without parent set - caller must set parent.
  284     """
! 285     task_runs: list[TaskRun] = []
  286 
  287     # Sample examples for eval and train datasets
! 288     eval_examples = sample_and_remove(all_examples, MIN_EVAL_EXAMPLES)
! 289     train_examples = sample_and_remove(all_examples, MIN_TRAIN_EXAMPLES)
  290 
  291     # Create TaskRuns for eval examples
! 292     for example in eval_examples:
! 293         task_runs.append(create_task_run_from_sample(example, eval_tag))
  294 
  295     # Create TaskRuns for train examples
! 296     for example in train_examples:
! 297         task_runs.append(create_task_run_from_sample(example, train_tag))
  298 
  299     # Create unrated golden examples from remaining pool if needed
! 300     unrated_golden_count = max(0, MIN_GOLDEN_EXAMPLES - len(reviewed_examples))
! 301     if unrated_golden_count > 0:
! 302         unrated_golden_examples = sample_and_remove(all_examples, unrated_golden_count)
! 303         for example in unrated_golden_examples:
! 304             task_runs.append(create_task_run_from_sample(example, golden_tag))
  305 
  306     # Create TaskRuns for reviewed examples with ratings
! 307     for reviewed in reviewed_examples:
! 308         task_runs.append(create_task_run_from_reviewed(reviewed, golden_tag, spec_name))
  309 
! 310     return task_runs
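As a rough illustration of the partitioning arithmetic in create_dataset_task_runs above: eval and train sets are drawn from the generated pool first, then unrated golden examples top up the reviewed ones to the golden minimum. The MIN_* values below are hypothetical placeholders, not the real constants from spec_utils.py.

```python
# Hypothetical minimums; the real MIN_* constants are not shown in this excerpt.
MIN_EVAL_EXAMPLES = 5
MIN_TRAIN_EXAMPLES = 5
MIN_GOLDEN_EXAMPLES = 10

def partition_counts(total_generated: int, num_reviewed: int) -> tuple[int, int, int]:
    """Return (eval, train, unrated_golden) counts drawn from the generated pool."""
    remaining = total_generated
    eval_n = min(MIN_EVAL_EXAMPLES, remaining)
    remaining -= eval_n
    train_n = min(MIN_TRAIN_EXAMPLES, remaining)
    remaining -= train_n
    # Reviewed examples count toward the golden set; top up with unrated ones.
    unrated_golden = min(max(0, MIN_GOLDEN_EXAMPLES - num_reviewed), remaining)
    return eval_n, train_n, unrated_golden

print(partition_counts(30, 4))  # → (5, 5, 6)
```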



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the spec creation logic, moving it from the TypeScript frontend (spec_persistence.ts) to the Python backend (spec_api.py). This is a significant and positive change, centralizing business logic on the server. A new endpoint /api/projects/{project_id}/tasks/{task_id}/spec_with_copilot is introduced to handle spec creation with Kiln Copilot, including eval creation, example generation, and judge configuration.

The review identified a few critical issues in the new backend implementation. Specifically, the persistence logic lacks transactionality, which could lead to an inconsistent state if an error occurs during file saving. There's also a bug in how JSON schemas are serialized before being sent to an external API. Additionally, there are some opportunities to improve code clarity and maintainability in both the frontend and backend code.

Overall, this is a great refactoring, but the identified issues, especially the critical ones, should be addressed before merging.

* These examples form the golden dataset for the spec's eval.
* user_says_meets_spec is optional in the UI (not yet reviewed) but required when sent to backend.
*/
export type ReviewRow = {
@sfierro (author) replied:

moved from spec_persistence


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/spec_utils.ts`:
- Around line 22-35: Prettier formatting failed for the file containing the
ReviewRow type; run a code formatter (e.g., run prettier --write on the file) to
fix whitespace/formatting issues in the ReviewRow declaration and surrounding
comments, then stage and re-commit the updated file so the CI pipeline passes.
♻️ Duplicate comments (4)
libs/server/kiln_server/spec_api.py (4)

83-96: Rename input to avoid shadowing the built‑in

Shadowing input() is a frequent lint/source-of-confusion hotspot. Prefer an internal name with an alias so the external API remains unchanged.

♻️ Suggested refactor
 class ReviewedExample(BaseModel):
@@
-    input: str = Field(alias="input")
+    input_str: str = Field(alias="input")
@@
 def _create_task_run_from_reviewed(
     example: ReviewedExample, tag: str, spec_name: str
 ) -> TaskRun:
@@
-        input=example.input,
+        input=example.input_str,
Pydantic v2 Field alias and populate_by_name usage for renaming internal fields while keeping external names

Also applies to: 306-324
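The rename-with-alias pattern suggested above can be sketched in isolation (assuming Pydantic v2; the model name mirrors the one in the diff, but this is a standalone illustration):

```python
from pydantic import BaseModel, ConfigDict, Field

class ReviewedExample(BaseModel):
    # Internal name avoids shadowing the input() builtin at call sites;
    # the wire format still uses "input" via the alias.
    model_config = ConfigDict(populate_by_name=True)

    input_str: str = Field(alias="input")
    output: str

ex = ReviewedExample.model_validate({"input": "question", "output": "answer"})
assert ex.input_str == "question"
# by_alias=True restores the external key on serialization
assert ex.model_dump(by_alias=True) == {"input": "question", "output": "answer"}
```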


28-33: Keep libs/server free of desktop-only dependencies

Importing app.desktop... couples the library to the desktop app and breaks portability/packaging. Please move this endpoint to app/desktop/studio_server or extract the copilot client/types into a shared package and depend on that instead.


1-2: Serialize schemas with json.dumps (str() isn’t JSON)

str(dict) yields single quotes and is not valid JSON; downstream APIs expecting JSON strings will misparse it. Use json.dumps instead.

🐛 Proposed fix
-import random
+import json
+import random
@@
-            task_input_schema=str(task.input_json_schema)
+            task_input_schema=json.dumps(task.input_json_schema)
             if task.input_json_schema
             else "",
-            task_output_schema=str(task.output_json_schema)
+            task_output_schema=json.dumps(task.output_json_schema)
             if task.output_json_schema
             else "",
Python str(dict) vs json.dumps — why str(dict) output is not valid JSON

Based on learnings, avoid str() for schema serialization.

Also applies to: 526-533
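A quick stdlib demonstration of why this matters: str() of a dict emits Python repr with single quotes, which a strict JSON parser rejects, while json.dumps produces a valid JSON string.

```python
import json

schema = {"type": "object", "properties": {"name": {"type": "string"}}}

repr_string = str(schema)         # "{'type': 'object', ...}" (not JSON)
json_string = json.dumps(schema)  # '{"type": "object", ...}' (valid JSON)

assert json.loads(json_string) == schema
try:
    json.loads(repr_string)
except json.JSONDecodeError:
    print("repr form rejected by the JSON parser")
```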


563-569: Add rollback to avoid partial persistence

If any save fails, you can leave orphaned Eval/Config/TaskRun files. A rollback cleanup (like the old TS flow) is needed to keep storage consistent.

🧹 Suggested rollback pattern
-        eval_model.save_to_file()
-        eval_config.save_to_file()
-        for run in task_runs:
-            run.save_to_file()
-        spec.save_to_file()
+        saved_models: list[Eval | EvalConfig | TaskRun | Spec] = []
+        try:
+            eval_model.save_to_file()
+            saved_models.append(eval_model)
+            eval_config.save_to_file()
+            saved_models.append(eval_config)
+            for run in task_runs:
+                run.save_to_file()
+                saved_models.append(run)
+            spec.save_to_file()
+            saved_models.append(spec)
+        except Exception:
+            for model in reversed(saved_models):
+                try:
+                    model.delete()
+                except Exception:
+                    pass
+            raise
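The suggested rollback shape can be sketched with in-memory stand-ins for the Kiln models (the save_to_file/delete method names mirror the diff above, but the class itself is hypothetical):

```python
saved_paths: list[str] = []  # stands in for files written to disk

class FakeModel:
    def __init__(self, name: str, fail: bool = False):
        self.name = name
        self.fail = fail

    def save_to_file(self) -> None:
        if self.fail:
            raise IOError(f"disk error while saving {self.name}")
        saved_paths.append(self.name)

    def delete(self) -> None:
        saved_paths.remove(self.name)

def save_all_or_rollback(models: list[FakeModel]) -> None:
    saved: list[FakeModel] = []
    try:
        for m in models:
            m.save_to_file()
            saved.append(m)
    except Exception:
        # Undo in reverse order; the original error stays primary.
        for m in reversed(saved):
            try:
                m.delete()
            except Exception:
                pass
        raise

try:
    save_all_or_rollback(
        [FakeModel("eval"), FakeModel("config", fail=True), FakeModel("spec")]
    )
except IOError:
    pass

assert saved_paths == []  # the partially saved "eval" was rolled back
```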

Base automatically changed from sfierro/spec_questions to scosman/spec_questions January 25, 2026 00:44
Base automatically changed from scosman/spec_questions to sfierro/specs January 25, 2026 00:47

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `app/desktop/studio_server/copilot_api.py`:
- Around lines 490-498: The code passes Python repr strings to
`_generate_copilot_examples` by calling `str(task.input_json_schema)` and
`str(task.output_json_schema)`. Replace these with proper JSON serialization
(`json.dumps(task.input_json_schema)` and `json.dumps(task.output_json_schema)`)
and import `json` at the top of the file, keeping the existing conditional so an
empty string is still passed when the schema is falsy. Update the
`task_input_schema` and `task_output_schema` arguments in the
`_generate_copilot_examples` call accordingly so the Copilot API receives valid
JSON strings.
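The distinction matters because `str()` on a dict produces a Python repr with single quotes, which JSON parsers reject. A minimal sketch illustrating the difference (the `schema` dict here is hypothetical, standing in for a task's `input_json_schema`):

```python
import json

# Hypothetical schema dict, as a task's input_json_schema might hold.
schema = {"type": "object", "properties": {"name": {"type": "string"}}}

# str() yields a Python repr with single quotes -- not parseable as JSON.
repr_string = str(schema)

# json.dumps() yields valid JSON the Copilot API can consume.
json_string = json.dumps(schema)


def is_valid_json(s: str) -> bool:
    """Return True if s parses as JSON."""
    try:
        json.loads(s)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(repr_string))  # False
print(is_valid_json(json_string))  # True
```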

In `@app/web_ui/src/routes/`(app)/specs/[project_id]/[task_id]/spec_builder/+page.svelte:
- Around lines 364-416: Remove the debug `console.log` calls in the spec creation
flow: delete the `console.log` statements referencing `use_kiln_copilot`,
`judge_info`, `data`, and `api_error` inside the `if (use_kiln_copilot)` branch
(they appear around the POST to `/spec_with_copilot`). Leave the error handling
(`throw api_error`) intact; if persisted logging is needed, replace the calls
with a proper logger rather than `console.log`.
- Line 39: Update the import of `ReviewRow` in `+page.svelte` to point to the
parent directory where `spec_utils.ts` actually lives; replace the current
relative import `"./spec_utils.ts"` with `"../spec_utils.ts"` so the `ReviewRow`
type resolves from the `spec_utils.ts` used by the spec_builder page.
🧹 Nitpick comments (7)
app/desktop/util/spec_creation.py (4)

34-38: Redundant Field(alias="input") when field name matches alias.

The alias="input" is unnecessary here since the field is already named input. The alias only has effect when the alias differs from the field name. This can be simplified.

Suggested simplification
 class SampleApi(BaseModel):
     """A sample input/output pair."""

-    input: str = Field(alias="input")
+    input: str
     output: str

41-54: Same redundant alias issue and missing model_config in SampleApi.

ReviewedExample has populate_by_name=True configured, but SampleApi doesn't. If these models are used with aliased JSON keys, consider adding model_config to SampleApi as well for consistency. Also, the alias="input" is redundant when the field name is already input.

Suggested consistency fix
 class SampleApi(BaseModel):
     """A sample input/output pair."""

-    input: str = Field(alias="input")
+    input: str
     output: str

+    model_config = {"populate_by_name": True}
+

 class ReviewedExample(BaseModel):
     """A reviewed example from the spec review process.

     Extends SampleApi with review-specific fields for tracking
     model and user judgments on spec compliance.
     """

-    input: str = Field(alias="input")
+    input: str
     output: str
     model_says_meets_spec: bool
     user_says_meets_spec: bool
     feedback: str

     model_config = {"populate_by_name": True}

79-110: Missing default case in match statement.

The spec_eval_template function uses a match statement without a wildcard `case _:` default. While Python won't raise an error, it will implicitly return None for unhandled SpecType values. If new SpecType variants are added in the future, this could silently return None unexpectedly. Consider adding an explicit default case for clarity.

Add explicit default case
         case (
             SpecType.tone
             | SpecType.formatting
             | SpecType.localization
             | SpecType.hallucinations
             | SpecType.completeness
             | SpecType.nsfw
             | SpecType.taboo
             | SpecType.prompt_leakage
         ):
             return None
+        case _:
+            return None

133-154: Unnecessary model_dump when field name equals alias.

Since the field name is input and the alias is also "input", you can access sample.input directly without needing model_dump(by_alias=True). The comment suggests uncertainty about alias behavior that doesn't apply here.

Simplify direct field access
 def create_task_run_from_sample(sample: SampleApi, tag: str) -> TaskRun:
     """Create a TaskRun from a SampleApi (without parent set)."""
     data_source = DataSource(
         type=DataSourceType.synthetic,
         properties={
             "adapter_name": KILN_ADAPTER_NAME,
             "model_name": KILN_COPILOT_MODEL_NAME,
             "model_provider": KILN_COPILOT_MODEL_PROVIDER,
         },
     )

-    # Access input using model_dump since SampleApi uses alias
-    sample_dict = sample.model_dump(by_alias=True)
     return TaskRun(
-        input=sample_dict["input"],
+        input=sample.input,
         input_source=data_source,
         output=TaskOutput(
             output=sample.output,
             source=data_source,
         ),
         tags=[tag],
     )
app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/spec_builder/+page.svelte (2)

350-362: Potential runtime error if value is not a string.

The filter uses value.trim() but value is typed as string | null. While the !== null check filters nulls, if values contains non-string values at runtime (e.g., numbers or booleans from misconfiguration), this would throw. Consider adding a type guard.

Add type safety
     // Build properties object with spec_type, filtering out null and empty values
     const filteredValues = Object.fromEntries(
       Object.entries(values).filter(
-        ([_, value]) => value !== null && value.trim() !== "",
+        ([_, value]) => typeof value === "string" && value.trim() !== "",
       ),
     )

394-412: Non-copilot path has incomplete implementation (TODO comment).

The TODO at line 406 indicates this endpoint doesn't create the eval with eval tags. This may result in incomplete spec creation for non-copilot flows. Consider either implementing the missing functionality or documenting this as a known limitation.

Would you like me to help implement the non-copilot eval creation, or should this be tracked as a separate issue?

app/desktop/studio_server/copilot_api.py (1)

527-553: Cleanup logic re-raises the original exception without context.

The cleanup logic properly attempts to delete saved models on failure, but the bare raise will propagate the original exception. Consider wrapping in an HTTPException with a 500 status to ensure consistent API error responses, or at minimum log the original error before cleanup.

Improve error handling
         saved_models: list[Eval | EvalConfig | TaskRun | Spec] = []
         try:
             eval_model.save_to_file()
             saved_models.append(eval_model)

             eval_config.save_to_file()
             saved_models.append(eval_config)

             for run in task_runs:
                 run.save_to_file()
                 saved_models.append(run)

             spec.save_to_file()
             saved_models.append(spec)
-        except Exception:
+        except Exception as e:
+            logger.exception("Failed to save spec with copilot, rolling back")
             # Clean up any models that were successfully saved before the error
             for model in reversed(saved_models):
                 try:
                     model.delete()
                 except Exception:
                     # Log cleanup error but continue, the original error is more important
                     logger.exception(
                         f"Failed to delete {type(model).__name__} during cleanup"
                     )
-            raise
+            raise HTTPException(
+                status_code=500,
+                detail=f"Failed to create spec: {str(e)}",
+            ) from e

@sfierro sfierro merged commit fdc0a3b into sfierro/specs Jan 27, 2026
10 checks passed
@sfierro sfierro deleted the sfierro/spec-persistence branch January 27, 2026 04:48