
Conversation


@lemon07r lemon07r commented Feb 8, 2026

This pull request introduces significant improvements to the build pipeline, dataset transformation capabilities, and documentation for vellumforge2. The most notable changes are the addition of a robust offline dataset transformation command, enhanced support for dual dataset generation, and a modernized Makefile for streamlined development and CI workflows. Documentation is updated to reflect new features, usage, and configuration tips.

Build & CI Improvements:

  • Modernized Makefile with new targets for quality checks, coverage, linting, and CI, including colorized output and improved help descriptions; adds targets for dependency verification, pre-commit checks, and security scanning.
  • Added .golangci.yml for consistent linting configuration across environments.

Dataset Transformation Features:

  • Introduced a new transform command to convert SFT datasets to DPO and regenerate rejected responses, supporting checkpoint/resume and dual dataset (reasoning/plain) workflows.
  • Enhanced dataset writer logic to support dual output when reasoning capture is enabled, both for generation and resume flows.

Documentation Updates:

  • Expanded README.md with detailed instructions for the new transform command, checkpoint/resume with custom config and env files, and guidance for handling long-form generation timeouts and retries.
  • Updated version references in documentation to v1.6.0.

Session Resume Enhancements:

  • Improved resume logic to support loading environment variables from a file, enabling more flexible configuration for checkpoint resumption.

Miscellaneous:

  • Minor import ordering and refactoring for clarity and maintainability.

Summary by cubic

Adds streaming generation and dual dataset output with reasoning capture, plus a new transform CLI to convert SFT→DPO and regenerate rejected responses. Improves long-run reliability with per-model HTTP timeouts, better resume, and output validation; updates Makefile, linting, and docs.

  • New Features

    • Transform CLI: SFT→DPO and rejected regeneration with checkpoint/resume and optional dual outputs (reasoning/plain).
    • Streaming chat completions to bypass gateway timeouts; captures reasoning_content; per-model HTTP timeouts via config.
    • Dual dataset writer and HF uploader: writes dataset.jsonl and dataset_reasoning.jsonl; utils for tags and meta cleanup.
    • SFT format option (alpaca or sharegpt) for SFT mode outputs.
    • Resume and validation: --env-file support for checkpoints; stronger checks for incomplete/refusal; stress-test suite and provider configs.
  • Build & CI

    • Modernized Makefile with deps/fmt/vet/lint/test/coverage/pre-commit/security/CI targets and colorized help.
    • Added golangci-lint configuration.
    • Docs updated for transform, streaming, dual dataset, and resume flags; version bumped to v1.6.0.

Written for commit d1d7786. Summary will update on new commits.

… training

- Implemented streaming tests for reasoning models in `reasoning_streaming_test.go`.
- Developed comprehensive reasoning tests in `reasoning_test.go` to validate logical thinking prompts.
- Created a streaming API in `streaming.go` to handle reasoning content in responses.
- Introduced utility functions in `reasoning.go` and `think_tags.go` for managing reasoning content and tags.
- Added tests for utility functions to ensure correct handling of think tags.
- Developed `DualDatasetWriter` in `dual_dataset.go` to write both regular and reasoning-aware datasets simultaneously.
- Enhanced error handling and logging throughout the new implementations.
…upport and increase HTTP timeout settings for long-form generation
- Add use_streaming config option to ModelConfig
- Integrate ChatCompletionStreaming into orchestrator worker for chosen/rejected generation
- Apply HTTP timeout settings to streaming requests
- Update example config with streaming documentation
- Enable streaming in config.sft.toml for long-form generation

Streaming keeps the connection alive with data flowing, bypassing 504 gateway timeouts
that occur with long-running requests (16k+ tokens taking 5-10+ minutes).
- Add Writer interface for unified dataset writer abstraction
- Update DatasetWriter and DualDatasetWriter to implement Writer interface
- Integrate DualDatasetWriter into orchestrator based on enable_reasoning_capture config
- Capture ReasoningContent from API responses (both chosen and rejected)
- Pass reasoning content through to write methods for dual dataset output
- Update config.sft.toml to enable dual dataset mode with streaming
- Document dual dataset feature in example configs

Dual dataset mode generates two files simultaneously:
1. dataset.jsonl - Regular responses (content only)
2. dataset_reasoning.jsonl - Responses with <think> tags containing reasoning

Reasoning capture requires streaming mode (use_streaming=true) to access the
reasoning_content field from reasoning models like Kimi-K2-Thinking.
…esence

The worker was incorrectly trying to use the rejected model in SFT mode if one was
configured, even though SFT only needs chosen responses. Added an explicit check
for DatasetModeSFT to skip rejected-model generation.

This fixes a regression where SFT mode would try to connect to localhost:8080
for rejected responses that weren't needed.
- Updated .gitignore to include environment files for different providers.
- Simplified .golangci.yml by removing unused linters and settings.
- Added functions to validate API response completeness in validation.go.
- Integrated validation checks into worker.go to handle incomplete outputs and finish reasons.
- Created example environment files for chutes and nahcrof APIs.
- Generated detailed stress test results in RESULTS.md, highlighting performance comparisons between nahcrof and chutes.
- Expanded TEST_README.md with comprehensive setup instructions and analysis of test results.
- Implemented analyze_results.py to process and summarize test outcomes.
- Developed run_test.sh script to automate testing and analysis for both API providers.
…nd rejected response regeneration with checkpoint support
…ntegrated with progress bars during transformation and a dedicated cleanup script.

lemon07r commented Feb 8, 2026

@codex review

@cubic-dev-ai cubic-dev-ai bot left a comment

7 issues found across 41 files

Confidence score: 3/5

  • There are a couple of concrete user-impacting risks: resume behavior in internal/dataset/transform.go can silently create new outputs and skip completed jobs, leading to incomplete datasets.
  • internal/config/loader.go appears to override an explicit http_timeout_seconds = 0, which is a functional regression for users who rely on disabling timeouts.
  • Overall this is a sizable set of behavioral concerns, so risk is moderate despite no critical blockers called out.
  • Pay close attention to internal/dataset/transform.go, internal/config/loader.go, and tests/api-reliability-test/run_test.sh: resume correctness, timeout semantics, and test failure propagation.
Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="tests/api-reliability-test/run_test.sh">

<violation number="1" location="tests/api-reliability-test/run_test.sh:2">
P2: `set -e` doesn’t catch failures inside pipelines; the `timeout ... | tee` command can fail without stopping the script. Enable `pipefail` (and optionally `-u`) so test failures propagate correctly.</violation>
</file>

<file name="internal/dataset/transform.go">

<violation number="1" location="internal/dataset/transform.go:197">
P2: Resuming should not create a new output dataset when the original file is missing; it will skip completed jobs from the checkpoint and produce an incomplete dataset. Fail fast when the resume output file does not exist.</violation>

<violation number="2" location="internal/dataset/transform.go:438">
P2: When resuming reasoning output, silently creating a new file will cause the checkpoint to skip already completed jobs and produce an incomplete reasoning dataset. Resume should error if the reasoning output file is missing.</violation>
</file>

<file name="tests/api-reliability-test/analyze_results.py">

<violation number="1" location="tests/api-reliability-test/analyze_results.py:62">
P2: Avoid a bare except when parsing dataset lines; it hides unexpected errors and can mask incorrect analysis results. Catch the expected JSON decode failure explicitly.</violation>
</file>

<file name="internal/config/loader.go">

<violation number="1" location="internal/config/loader.go:103">
P2: This default overrides an explicit `http_timeout_seconds = 0`, even though the config docs specify that 0 disables timeouts. As written, users can no longer set "no timeout" behavior, which is a functional regression for configs relying on that setting.</violation>
</file>

<file name="internal/api/reasoning_stream_test.go">

<violation number="1" location="internal/api/reasoning_stream_test.go:81">
P2: Increase the scanner buffer size for streaming responses so large SSE chunks don't trigger `bufio.ErrTooLong`.</violation>
</file>

<file name="internal/api/streaming.go">

<violation number="1" location="internal/api/streaming.go:102">
P3: The marshal → unmarshal → add key → re-marshal round-trip to inject `"stream": true` is inefficient and fragile. Consider adding a `Stream bool` field (with `json:"stream,omitempty"`) to `ChatCompletionRequest` instead, which avoids the double-serialization overhead and potential floating-point representation issues.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@@ -0,0 +1,92 @@
#!/bin/bash
set -e
@cubic-dev-ai cubic-dev-ai bot Feb 8, 2026

P2: set -e doesn’t catch failures inside pipelines; the timeout ... | tee command can fail without stopping the script. Enable pipefail (and optionally -u) so test failures propagate correctly.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At tests/api-reliability-test/run_test.sh, line 2:

<comment>`set -e` doesn’t catch failures inside pipelines; the `timeout ... | tee` command can fail without stopping the script. Enable `pipefail` (and optionally `-u`) so test failures propagate correctly.</comment>

<file context>
@@ -0,0 +1,92 @@
+#!/bin/bash
+set -e
+
+# API Reliability Stress Test Runner
</file context>


if opts.OutputReasoningPath != "" {
if opts.Resume {
reasoningFile, err = os.OpenFile(opts.OutputReasoningPath, os.O_APPEND|os.O_WRONLY, 0o644)
@cubic-dev-ai cubic-dev-ai bot Feb 8, 2026

P2: When resuming reasoning output, silently creating a new file will cause the checkpoint to skip already completed jobs and produce an incomplete reasoning dataset. Resume should error if the reasoning output file is missing.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At internal/dataset/transform.go, line 438:

<comment>When resuming reasoning output, silently creating a new file will cause the checkpoint to skip already completed jobs and produce an incomplete reasoning dataset. Resume should error if the reasoning output file is missing.</comment>

<file context>
@@ -0,0 +1,957 @@
+
+	if opts.OutputReasoningPath != "" {
+		if opts.Resume {
+			reasoningFile, err = os.OpenFile(opts.OutputReasoningPath, os.O_APPEND|os.O_WRONLY, 0o644)
+			if err != nil {
+				if os.IsNotExist(err) {
</file context>

// Open output file (append on resume, create otherwise).
var outputFile *os.File
if opts.Resume {
outputFile, err = os.OpenFile(opts.OutputPath, os.O_APPEND|os.O_WRONLY, 0o644)
@cubic-dev-ai cubic-dev-ai bot Feb 8, 2026

P2: Resuming should not create a new output dataset when the original file is missing; it will skip completed jobs from the checkpoint and produce an incomplete dataset. Fail fast when the resume output file does not exist.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At internal/dataset/transform.go, line 197:

<comment>Resuming should not create a new output dataset when the original file is missing; it will skip completed jobs from the checkpoint and produce an incomplete dataset. Fail fast when the resume output file does not exist.</comment>

<file context>
@@ -0,0 +1,957 @@
+	// Open output file (append on resume, create otherwise).
+	var outputFile *os.File
+	if opts.Resume {
+		outputFile, err = os.OpenFile(opts.OutputPath, os.O_APPEND|os.O_WRONLY, 0o644)
+		if err != nil {
+			if os.IsNotExist(err) {
</file context>
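Both transform.go comments describe the same fix: when --resume is set, a missing output file should be an error rather than a silent recreate. A minimal sketch under that assumption (the helper name and error text are illustrative, not the PR's actual code):

package dataset

import (
	"errors"
	"fmt"
	"os"
)

// openTransformOutput is a hypothetical helper showing the fail-fast behaviour:
// on resume it only appends to an existing file and errors out if it is missing.
func openTransformOutput(path string, resume bool) (*os.File, error) {
	if resume {
		f, err := os.OpenFile(path, os.O_APPEND|os.O_WRONLY, 0o644)
		if errors.Is(err, os.ErrNotExist) {
			return nil, fmt.Errorf("resume requested but %s does not exist; restore it or rerun without --resume", path)
		}
		return f, err
	}
	return os.Create(path)
}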

'length': len(output_text),
'ending': output_text[-60:]
})
except:
@cubic-dev-ai cubic-dev-ai bot Feb 8, 2026

P2: Avoid a bare except when parsing dataset lines; it hides unexpected errors and can mask incorrect analysis results. Catch the expected JSON decode failure explicitly.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At tests/api-reliability-test/analyze_results.py, line 62:

<comment>Avoid a bare except when parsing dataset lines; it hides unexpected errors and can mask incorrect analysis results. Catch the expected JSON decode failure explicitly.</comment>

<file context>
@@ -0,0 +1,430 @@
+                                'length': len(output_text),
+                                'ending': output_text[-60:]
+                            })
+                except:
+                    pass
+    
</file context>

}
// Default HTTP timeout: 120 seconds
// For very long-form generation (>4000 tokens), set this higher (e.g., 600-1200)
if model.HTTPTimeoutSeconds == 0 {
@cubic-dev-ai cubic-dev-ai bot Feb 8, 2026

P2: This default overrides an explicit http_timeout_seconds = 0, even though the config docs specify that 0 disables timeouts. As written, users can no longer set "no timeout" behavior, which is a functional regression for configs relying on that setting.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At internal/config/loader.go, line 103:

<comment>This default overrides an explicit `http_timeout_seconds = 0`, even though the config docs specify that 0 disables timeouts. As written, users can no longer set "no timeout" behavior, which is a functional regression for configs relying on that setting.</comment>

<file context>
@@ -93,6 +98,11 @@ func applyDefaults(cfg *Config) {
 		}
+		// Default HTTP timeout: 120 seconds
+		// For very long-form generation (>4000 tokens), set this higher (e.g., 600-1200)
+		if model.HTTPTimeoutSeconds == 0 {
+			model.HTTPTimeoutSeconds = 120 // 2 minutes default
+		}
</file context>
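One common way to keep the documented "0 disables the timeout" behaviour while still defaulting unset values is to make "unset" and "explicit 0" distinguishable. The sketch below assumes a pointer-typed field, which is not how the PR currently declares it:

package config

// ModelConfig here is a pared-down stand-in; only the timeout field is shown.
type ModelConfig struct {
	HTTPTimeoutSeconds *int `toml:"http_timeout_seconds"`
}

const defaultHTTPTimeoutSeconds = 120 // 2 minutes; raise for long-form generation

func applyTimeoutDefault(m *ModelConfig) {
	if m.HTTPTimeoutSeconds == nil {
		v := defaultHTTPTimeoutSeconds
		m.HTTPTimeoutSeconds = &v
	}
	// An explicit http_timeout_seconds = 0 survives and keeps meaning "no timeout".
}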

t.Logf("Response status: %d", resp.StatusCode)

// Read streaming response
scanner := bufio.NewScanner(resp.Body)
@cubic-dev-ai cubic-dev-ai bot Feb 8, 2026

P2: Increase the scanner buffer size for streaming responses so large SSE chunks don't trigger bufio.ErrTooLong.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At internal/api/reasoning_stream_test.go, line 81:

<comment>Increase the scanner buffer size for streaming responses so large SSE chunks don't trigger `bufio.ErrTooLong`.</comment>

<file context>
@@ -0,0 +1,300 @@
+	t.Logf("Response status: %d", resp.StatusCode)
+
+	// Read streaming response
+	scanner := bufio.NewScanner(resp.Body)
+	chunkCount := 0
+	hasReasoningContent := false
</file context>

}

// Add stream parameter (not part of ChatCompletionRequest struct)
reqMap := make(map[string]interface{})
@cubic-dev-ai cubic-dev-ai bot Feb 8, 2026

P3: The marshal → unmarshal → add key → re-marshal round-trip to inject "stream": true is inefficient and fragile. Consider adding a Stream bool field (with json:"stream,omitempty") to ChatCompletionRequest instead, which avoids the double-serialization overhead and potential floating-point representation issues.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At internal/api/streaming.go, line 102:

<comment>The marshal → unmarshal → add key → re-marshal round-trip to inject `"stream": true` is inefficient and fragile. Consider adding a `Stream bool` field (with `json:"stream,omitempty"`) to `ChatCompletionRequest` instead, which avoids the double-serialization overhead and potential floating-point representation issues.</comment>

<file context>
@@ -0,0 +1,372 @@
+	}
+
+	// Add stream parameter (not part of ChatCompletionRequest struct)
+	reqMap := make(map[string]interface{})
+	reqBytes, err := json.Marshal(req)
+	if err != nil {
</file context>
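A minimal sketch of that suggestion, assuming an OpenAI-compatible request shape (the real ChatCompletionRequest in internal/api/types.go has more fields than shown here):

package api

import "encoding/json"

type ChatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type ChatCompletionRequest struct {
	Model    string        `json:"model"`
	Messages []ChatMessage `json:"messages"`
	// Stream lives on the struct, so the streaming path sets it and marshals once.
	Stream bool `json:"stream,omitempty"`
}

func marshalStreamingRequest(req ChatCompletionRequest) ([]byte, error) {
	req.Stream = true // no map round-trip, no re-encoding of other fields
	return json.Marshal(req)
}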

Copilot AI left a comment

Pull request overview

This PR expands vellumforge2 with (1) streaming + reasoning capture support (dual dataset outputs), and (2) an offline dataset transform command for SFT→DPO conversion / rejected-response regeneration, alongside updated build tooling and new API reliability stress-test assets.

Changes:

  • Add streaming support + reasoning capture plumbing (reasoning_content parsing, dual dataset writing, config flags).
  • Introduce vellumforge2 transform for offline dataset conversion/regeneration with checkpoint/resume support.
  • Modernize developer tooling/docs (Makefile targets, golangci config, README updates, stress-test scripts/docs).

Reviewed changes

Copilot reviewed 40 out of 41 changed files in this pull request and generated 12 comments.

Summary per file:

  • tests/api-reliability-test/run_test.sh: Runner script for reliability stress tests + analysis invocation
  • tests/api-reliability-test/config.nahcrof.toml: Stress-test config for nahcrof provider
  • tests/api-reliability-test/analyze_results.py: Parses sessions/logs to summarize reliability + write results.json
  • tests/api-reliability-test/TEST_README.md: Detailed setup/run instructions for the reliability test package
  • tests/api-reliability-test/RESULTS.md: Captured stress-test results writeup
  • tests/api-reliability-test/.env.nahcrof.example: Example env template for nahcrof API key
  • tests/api-reliability-test/.env.chutes.example: Example env template for chutes API key
  • pkg/models/types.go: Adds SFT format options + expands SFTRecord and GenerationResult for reasoning fields
  • internal/writer/session.go: Adds reasoning dataset output path helper
  • internal/writer/interface.go: Introduces writer.Writer interface supporting single/dual dataset modes
  • internal/writer/dual_dataset_test.go: Unit tests for applying reasoning into Alpaca/ShareGPT SFT records
  • internal/writer/dual_dataset.go: DualDatasetWriter writing both plain + reasoning JSONL outputs
  • internal/writer/dataset.go: Updates DatasetWriter method signatures to match new interface
  • internal/util/think_tags_test.go: Tests for think-tag detection/extraction/stripping helpers
  • internal/util/think_tags.go: Implements think-tag detection/extraction/stripping helpers
  • internal/util/reasoning.go: Helpers for wrapping/combining/stripping reasoning via <think> tags
  • internal/util/meta_cleanup_test.go: Tests for stripping meta chatter from LLM outputs
  • internal/util/meta_cleanup.go: Implements meta cleanup heuristic for rejected generations
  • internal/orchestrator/worker.go: Switches to streaming when enabled, captures reasoning, adds completion validation
  • internal/orchestrator/validation.go: Adds finish_reason validation + incomplete-output detection; refines refusal detection
  • internal/orchestrator/sft_format_test.go: Tests for writing SFT records in Alpaca vs ShareGPT formats
  • internal/orchestrator/orchestrator.go: Uses writer.Writer interface + resume progress rendering changes
  • internal/hfhub/uploader.go: Includes optional reasoning dataset in HF uploads (skips if missing)
  • internal/dataset/transform.go: Implements offline dataset transform modes with worker pool + checkpointing
  • internal/config/loader.go: Adds defaults for SFT format + per-model HTTP timeout
  • internal/config/config.go: Adds new generation/model config fields + secret mapping for new providers
  • internal/checkpoint/manager.go: Minor import formatting
  • internal/api/types.go: Extends API types to include reasoning_content + reasoning token usage fields
  • internal/api/streaming.go: Adds SSE streaming client to assemble content + reasoning_content
  • internal/api/reasoning_test.go: Integration tests for reasoning capabilities and reasoning_content parsing
  • internal/api/reasoning_streaming_test.go: Integration tests for reasoning_content in streaming mode
  • internal/api/reasoning_stream_test.go: Raw HTTP streaming experiments for reasoning_content field discovery
  • internal/api/reasoning_debug_test.go: Debug integration tests for raw responses / temperature variation
  • internal/api/client.go: Moves timeout control to per-request contexts; adds per-model timeout enforcement
  • configs/config.example.toml: Documents new config options (sft_format, streaming, timeouts, reasoning capture)
  • configs/.env.example: Adds CHUTES_API_KEY and NAHCROF_API_KEY templates
  • cmd/vellumforge2/main.go: Adds transform command + env-file support on resume; dual-writer selection
  • README.md: Documents transform usage + resume flags + timeout tuning; bumps version to v1.6.0
  • Makefile: Adds new targets, formatting checks, CI helpers, colorized output; bumps version
  • .golangci.yml: Adds repository lint configuration
  • .gitignore: Improves env ignore patterns; ignores results.json; un-ignores config.nahcrof.toml


Comment on lines +288 to +292
├── analyze_results.py # Results analysis script
├── config.nahcrof.toml # nahcrof test config
├── config.chutes.toml # chutes test config
├── .env.nahcrof.example # API key template (nahcrof)
├── .env.chutes.example # API key template (chutes)
Copilot AI Feb 8, 2026

The README and runner script reference config.chutes.toml, but that file isn’t present in tests/api-reliability-test/ in this PR. As-is, run_test.sh chutes will fail. Either add the missing config file or update the docs/script to match the files that are actually checked in.

transformCmd.Flags().StringVar(&configPath, "config", "config.toml", "Path to configuration file")
transformCmd.Flags().StringVar(&envFile, "env-file", ".env", "Path to environment file")
transformCmd.Flags().BoolVarP(&verbose, "verbose", "v", false, "Enable verbose logging")
transformCmd.Flags().StringVar(&transformMode, "mode", string(dataset.TransformSFTToDPO), "Transform mode: 'sft-to-dpo' or 'regen-rejected'")
Copilot AI Feb 8, 2026

--mode is given a default (sft-to-dpo), but later this flag is marked required. That combination is confusing for users because the help implies it’s optional while Cobra will still error unless it’s explicitly set. Consider removing the default or dropping the required constraint.

Suggested change
transformCmd.Flags().StringVar(&transformMode, "mode", string(dataset.TransformSFTToDPO), "Transform mode: 'sft-to-dpo' or 'regen-rejected'")
transformCmd.Flags().StringVar(&transformMode, "mode", "", "Transform mode: 'sft-to-dpo' or 'regen-rejected'")

Comment on lines +313 to +316
if err := parentCtx.Err(); err != nil && err != context.Canceled {
return err
}

Copilot AI Feb 8, 2026

In both transform modes, Ctrl+C / context cancellation will currently return nil because cancellation is explicitly ignored (err != nil && err != context.Canceled). That makes interrupted transforms look successful and can hide partial output. Consider returning context.Canceled (or ctx.Err()) after saving the checkpoint so the CLI can report an interrupted run correctly.

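A sketch of the suggested handling; saveCheckpoint stands in for whatever the transform loop actually calls to persist progress and is not the PR's real helper:

// If the parent context was cancelled, persist progress and surface the
// cancellation instead of returning nil.
if err := parentCtx.Err(); err != nil {
	if saveErr := saveCheckpoint(); saveErr != nil {
		return fmt.Errorf("saving checkpoint after cancellation: %w", saveErr)
	}
	return err // context.Canceled (or DeadlineExceeded) reaches the CLI
}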
Comment on lines +68 to +72
// Very short outputs are likely incomplete (already caught by refusal check)
if len(trimmed) < 100 {
return true, "output too short (< 100 chars)"
}

Copilot AI Feb 8, 2026

isIncompleteOutput treats any output < 100 chars as incomplete. This will hard-fail legitimate short completions (especially outside story-generation use cases) and overlaps with the refusal check (<50 chars). Consider removing the length-based rejection or making it conditional (e.g., only when streaming is enabled, or only when max_output_tokens is large / long-form prompts).

Suggested change
// Very short outputs are likely incomplete (already caught by refusal check)
if len(trimmed) < 100 {
return true, "output too short (< 100 chars)"
}

Comment on lines +62 to +63
except:
pass
Copilot AI Feb 8, 2026

Avoid bare except: here; it will silently swallow unexpected errors and can hide malformed JSONL lines (skewing the stats). Catch specific exceptions (e.g., json.JSONDecodeError) and consider counting/reporting skipped lines.

Comment on lines +19 to +21
define print_step
@echo -e "$(COLOR_BOLD)$(COLOR_BLUE)==> $(1)$(COLOR_RESET)"
endef
Copilot AI Feb 8, 2026

echo -e is not portable across /bin/sh implementations (e.g., dash prints -e literally). Since print_step is used by most targets, consider using printf for ANSI color output to keep the Makefile working consistently across environments.

Comment on lines +35 to +37
## ci: Full CI pipeline (format check + all validations + coverage)
ci: fmt-check vet lint test-coverage build
$(call print_step,"✅ CI pipeline completed successfully!")
Copilot AI Feb 8, 2026

The ci target currently runs test-coverage, which runs the full test suite (including integration tests that can make real API calls). This can be costly/flaky when CI has credentials available. Consider using the test (short) target in ci and keeping integration tests behind an explicit target/job.

var responseCreated int64
var finishReason string

scanner := bufio.NewScanner(httpResp.Body)
Copilot AI Feb 8, 2026

bufio.Scanner has a relatively small max token size by default. Some providers may emit large SSE data: lines, which can cause scanner errors and truncated streaming. Consider calling scanner.Buffer(...) with a larger max (similar to the JSONL scanner usage elsewhere).

Suggested change
scanner := bufio.NewScanner(httpResp.Body)
scanner := bufio.NewScanner(httpResp.Body)
// Increase scanner buffer to handle large SSE data: lines
scanner.Buffer(make([]byte, 0, 64*1024), 10*1024*1024)

Comment on lines +45 to +47
# Run the test
timeout 1800 ./bin/vellumforge2 run --config "$config" --env-file "$env_file" 2>&1 | \
tee "tests/api-reliability-test/test_${provider}.log"
Copilot AI Feb 8, 2026

This script assumes the timeout utility is available. On macOS it’s typically missing (users often have gtimeout via coreutils instead). Consider adding a small portability shim (detect timeout vs gtimeout) and/or documenting the dependency so the script works across the platforms you build for.

import json
import os
import glob
from pathlib import Path
Copilot AI Feb 8, 2026

Import of 'Path' is not used.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d1d7786d6f


Comment on lines +265 to +269
func (dw *DualDatasetWriter) Close() error {
var errs []error

if err := dw.regularFile.Close(); err != nil {
errs = append(errs, fmt.Errorf("regular dataset: %w", err))


P1: Flush buffered records before closing dual writer

DualDatasetWriter.Close closes the files directly without persisting dw.records. In MO-DPO runs, writeMODPORecord writes through WriteRecord/UpdateRecord (buffered path), so enabling generation.enable_reasoning_capture switches to this writer and silently drops all buffered rows at shutdown, leaving dataset.jsonl (and no reasoning equivalent) incomplete or empty despite successful generation logs.

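A sketch of what Close could look like with the buffered rows flushed first; flushRecords and the reasoningFile field are assumptions standing in for however the writer actually buffers MO-DPO records:

func (dw *DualDatasetWriter) Close() error {
	var errs []error

	// Persist any buffered records before the underlying files are closed.
	if err := dw.flushRecords(); err != nil {
		errs = append(errs, fmt.Errorf("flushing buffered records: %w", err))
	}
	if err := dw.regularFile.Close(); err != nil {
		errs = append(errs, fmt.Errorf("regular dataset: %w", err))
	}
	if err := dw.reasoningFile.Close(); err != nil {
		errs = append(errs, fmt.Errorf("reasoning dataset: %w", err))
	}
	return errors.Join(errs...)
}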

"model", modelCfg.ModelName,
"timeout", timeout)
}
ctx, cancel = context.WithTimeout(ctx, timeout)


P1: Apply HTTP timeout per attempt, not across queue/retries

Creating a timeout-scoped context at function entry makes the timeout budget cover rate-limiter waiting plus every retry/backoff, not just an HTTP attempt. In low-RPM or high-concurrency runs, Wait can exhaust the 120s budget before any request is sent; and when an attempt times out, subsequent retries immediately hit ctx.Done() and return, so configured max_retries cannot recover from timeout failures.

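A sketch of scoping the timeout to each attempt rather than to the whole call, so rate-limiter waits and backoff do not consume the budget and each retry starts fresh; doRequest, maxRetries, and backoff are illustrative names, not the client's real API:

var lastErr error
for attempt := 0; attempt <= maxRetries; attempt++ {
	// Each attempt gets its own timeout derived from the uncancelled parent.
	attemptCtx, cancel := context.WithTimeout(parentCtx, timeout)
	resp, err := doRequest(attemptCtx, req)
	cancel()
	if err == nil {
		return resp, nil
	}
	lastErr = err
	if parentCtx.Err() != nil {
		return nil, parentCtx.Err() // caller cancelled; stop retrying
	}
	backoff(attempt) // later attempts are not poisoned by an expired context
}
return nil, lastErr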
