
feat: Add evals-related configurations and CI commands #29

Merged
krislavten merged 2 commits into main from
feature-evals-result
Feb 4, 2026

Conversation

@Galekanyun1123a
Contributor


Summary of changes compared to the main branch

Commits

6c460d0 feat: Add evals-related configurations and CI commands

Change statistics

  • 52 files changed
  • +8,954 lines, -35 lines

Main changes

1. New evaluation system (evals/)

| Module | Files | Description |
|---|---|---|
| harness/ | 5 | Core evaluation framework (environment, runner, reporter, transcripts) |
| graders/ | 11 | Graders (SDK, Server, E2E, LLM, dynamic project) |
| tasks/ | 8 | Definitions for the 6 evaluation tasks |
| rubrics/ | 9 | Scoring rubrics for the LLM Judge |

2. CI/CD configuration

  • .github/workflows/evals.yml - GitHub Actions evaluation pipeline (441 lines)
    • Posts results as PR comments
    • Compares results against the main branch
    • Shows detailed scores

3. Project configuration

  • package.json - new eval scripts and dependencies
  • vitest.evals.config.ts - evaluation test configuration
  • pnpm-lock.yaml - dependency lockfile

4. Minor changes

  • packages/sdk/src/index.ts - small SDK change
  • packages/server/src/cli.ts - small Server CLI change

Evaluation tasks (consolidated to 6)

| ID | Task | Notes |
|---|---|---|
| 001 | Server basics test | merged from former 002+003 |
| 002 | Issue detection test | merged from former 004+005 |
| 003 | Error fix test | - |
| 004 | Full integration test | - |
| 005 | Skill compliance evaluation | LLM Judge |
| 006 | Dynamic project evaluation | LLM-generated project + detection |

@github-actions
Contributor

github-actions bot commented Feb 3, 2026

📊 Agent-aware Evaluation Results

📊 Results

  • Overall score: 93%
  • Passed: 6/6

📈 Per-task score breakdown

| Task | Grader scores |
|---|---|
| 001-server-basic | server: 100% |
| 002-issue-detection | server: 100% |
| 003-error-fix | server: 100%, e2e: 100% |
| 004-full-integration | build: 100%, sdk: 100%, server: 100%, e2e: 100% |
| 005-skill-compliance | server: 100%, llm: 95% |
| 006-dynamic-project-eval | dynamic-project: 50%, server: 100%, e2e: 100%, llm: 60% |

Generated by Agent-aware Evals

@krislavten
Contributor

@claude review this PR

@kris-dong-dong

kris-dong-dong bot commented Feb 4, 2026

Claude encountered an error - View job


I'll analyze this and get back to you.

@Galekanyun1123a
Contributor Author

@claude review this PR

@kris-dong-dong

kris-dong-dong bot commented Feb 4, 2026

Claude encountered an error - View job


I'll analyze this and get back to you.

@krislavten
Contributor

@claude review this PR

@kris-dong-dong

kris-dong-dong bot commented Feb 4, 2026

Claude finished @krislavten's task in 2m 19s - View job


🔍 PR Review Complete

  • Read project structure and configuration
  • Review evaluation harness (runner, environment, reporter)
  • Review graders implementation
  • Review tasks and rubrics
  • Review CI/CD workflow
  • Check for security issues and best practices
  • Provide comprehensive review feedback

PR Summary

This PR adds a comprehensive evaluation system for the Agent-aware project, implementing ~8,900 lines of new code across 52 files. The system follows best practices outlined in Anthropic's evaluation guide and includes:

  • Evaluation Harness: Runner, environment isolation, transcript recording, and reporting
  • Graders: Build, SDK, Server, E2E, LLM (using OpenAI/Anthropic/AWS Bedrock), and Dynamic Project graders
  • 6 Evaluation Tasks: Covering server functionality, issue detection, error fixing, integration, skill compliance, and dynamic project generation
  • CI/CD Workflow: GitHub Actions with PR comments showing comparison against main branch

Review Feedback

Strengths

  1. Well-Architected Design - The evaluation system follows a clean separation of concerns with modular graders, isolated test environments, and incremental reporting.

  2. Comprehensive Type Definitions (evals/harness/types.ts) - Strong TypeScript types for all evaluation components, improving maintainability.

  3. Multiple LLM Provider Support (evals/graders/llm/client.ts) - Flexible support for OpenAI, Anthropic, and AWS Bedrock with proper proxy configuration.

  4. Concurrency Control (evals/harness/runner.ts:233-260) - Smart parallel execution with port offset management to avoid conflicts.

  5. CI Workflow (.github/workflows/evals.yml) - Includes main branch comparison, artifact archival, and detailed PR comments.
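
The port-offset concurrency pattern praised above can be sketched roughly as follows; the function names, the stride value, and the slot-based scheme are illustrative assumptions, not the PR's actual implementation in evals/harness/runner.ts.

```typescript
// Hypothetical sketch: each concurrent worker slot gets a disjoint port range
// so parallel eval runs never bind the same port.
export function portForWorker(basePort: number, slot: number, stride = 10): number {
  return basePort + slot * stride;
}

// Run `items` with at most `limit` tasks in flight, passing each task the
// worker slot it runs on. Results are returned in the original item order.
export async function runWithConcurrency<T, R>(
  items: T[],
  limit: number,
  run: (item: T, slot: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor; safe because JS callbacks are single-threaded
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    async (_, slot) => {
      while (next < items.length) {
        const i = next++;
        results[i] = await run(items[i], slot);
      }
    },
  );
  await Promise.all(workers);
  return results;
}
```

A runner built this way would start each task's server on `portForWorker(basePort, slot)`, guaranteeing no two in-flight tasks collide.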


Issues to Address

1. Missing Error Handler in Process Spawn (Medium)

File: evals/graders/dynamic-project/index.ts:186-250

The startDevServer function doesn't handle the case where the spawned process emits an error after the timeout resolves successfully.

proc.on('error', (err) => {
  if (!started) {
    reject(new Error(`Dev server error: ${err.message}`));
  }
  // After started=true, errors are silently ignored
});

Suggestion: Add logging for post-startup errors to aid debugging.



2. Duplicate updateSdkEndpoint Function (Low)

Files:

  • evals/harness/environment.ts:132-165
  • evals/graders/dynamic-project/index.ts:19-56

The same function is duplicated in two locations with nearly identical logic.

Suggestion: Extract to evals/graders/shared/index.ts and import where needed.



3. README Task List Mismatch (Low)

File: evals/README.md:49-59

The README lists 9 tasks (001-009), but the actual implementation only has 6 tasks (001-006). The task IDs and names have been renumbered/consolidated but the README wasn't fully updated.

Suggestion: Update the README to match the actual 6 tasks defined in evals/tasks/index.ts.



4. Hardcoded Timeout Values (Low)

Files: Multiple locations

Several timeout values are hardcoded:

  • evals/harness/environment.ts:56 - 30000 (waitForServer)
  • evals/graders/dynamic-project/index.ts:243 - 120000 (startDevServer)
  • evals/graders/dynamic-project/index.ts:288 - 180000 (installDependencies)

Suggestion: Consider making these configurable through EvalConfig for CI environments where timeouts may need adjustment.
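
One way this could look, sketched below; the field names, environment-variable names, and defaults are assumptions for illustration, not part of the PR.

```typescript
// Hypothetical timeout configuration, overridable per CI environment.
export interface EvalTimeouts {
  waitForServerMs: number;
  startDevServerMs: number;
  installDependenciesMs: number;
}

// Defaults match the values currently hardcoded in the PR.
const DEFAULTS: EvalTimeouts = {
  waitForServerMs: 30_000,
  startDevServerMs: 120_000,
  installDependenciesMs: 180_000,
};

// Read an override from the environment, falling back to the default
// when the variable is unset or not a positive number.
export function resolveTimeouts(
  env: Record<string, string | undefined> = process.env,
): EvalTimeouts {
  const read = (key: string, fallback: number): number => {
    const raw = env[key];
    const parsed = raw === undefined ? NaN : Number(raw);
    return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
  };
  return {
    waitForServerMs: read("EVAL_TIMEOUT_WAIT_FOR_SERVER_MS", DEFAULTS.waitForServerMs),
    startDevServerMs: read("EVAL_TIMEOUT_START_DEV_SERVER_MS", DEFAULTS.startDevServerMs),
    installDependenciesMs: read("EVAL_TIMEOUT_INSTALL_DEPS_MS", DEFAULTS.installDependenciesMs),
  };
}
```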


5. Missing Graceful Shutdown (Low)

File: evals/harness/environment.ts:207-221

The cleanup function uses SIGTERM followed by a fixed 1-second wait. In CI environments, processes may need more time to clean up.

cleanup: async () => {
  if (env.serverProcess) {
    env.serverProcess.kill('SIGTERM');
  }
  // ...
  await sleep(1000);  // Fixed wait may not be sufficient

Suggestion: Consider waiting for process exit events rather than using a fixed sleep.


Minor Suggestions

  1. Console Logging - Consider using a structured logger instead of console.log for better filtering in CI output.

  2. Type Assertion (evals/graders/dynamic-project/index.ts:391) - The type assertion (env as { testAppPath: string }).testAppPath suggests the interface could be improved to make this property mutable.

  3. Error Message Truncation (evals/harness/reporter.ts:181-182) - Error messages are truncated to 100 chars. Consider making this configurable or showing full errors in detailed view.
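
As a sketch of suggestion 3, truncation could take a configurable limit, with an unbounded limit serving the detailed view; the helper name and defaults are assumptions, not the reporter's actual code.

```typescript
// Hypothetical configurable truncation for reporter error messages.
// Pass Infinity as the limit to show the full error in a detailed view.
export function truncateError(message: string, maxLength = 100): string {
  if (message.length <= maxLength) return message;
  // Reserve one character for the ellipsis so the result fits the limit.
  return message.slice(0, Math.max(0, maxLength - 1)) + "…";
}
```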


Security Review

The implementation appears secure with no obvious vulnerabilities:

  • No command injection risks (user input is not directly used in shell commands)
  • API keys are properly read from environment variables
  • No sensitive data is logged or stored inappropriately
  • Temporary directories are properly cleaned up

Overall Assessment

This is a well-designed evaluation system that will provide valuable automated testing for the Agent-aware project. The architecture is clean, extensible, and follows industry best practices. The issues identified are minor and don't block merging.

Recommendation: Approve with suggestions to address the README mismatch and code duplication when convenient.


krislavten merged commit 4283969 into main on Feb 4, 2026
4 checks passed