
feat: Add evals-related configurations and CI commands #29

Merged
krislavten merged 2 commits into main from
feature-evals-result
Feb 4, 2026

Conversation

@Galekanyun1123a
Contributor


Summary of changes compared to the main branch

Commits

6c460d0 feat: Add evals-related configurations and CI commands

Change statistics

  • 52 files changed
  • +8,954 lines, -35 lines

Main changes

1. New evaluation system (evals/)

| Module | Files | Description |
|---|---|---|
| harness/ | 5 | Core evaluation framework (environment, runner, reporter, transcripts) |
| graders/ | 11 | Graders (SDK, Server, E2E, LLM, dynamic project) |
| tasks/ | 8 | Definitions for the 6 evaluation tasks |
| rubrics/ | 9 | Scoring rubrics for the LLM Judge |

2. CI/CD configuration

  • .github/workflows/evals.yml - GitHub Actions evaluation pipeline (441 lines)
    • Posts results as PR comments
    • Compares results against the main branch
    • Shows detailed scores

3. Project configuration

  • package.json - new eval scripts and dependencies
  • vitest.evals.config.ts - evaluation test configuration
  • pnpm-lock.yaml - dependency lockfile

4. Minor changes

  • packages/sdk/src/index.ts - small SDK change
  • packages/server/src/cli.ts - small Server CLI change

Evaluation tasks (consolidated to 6)

| ID | Task | Notes |
|---|---|---|
| 001 | Server basics test | merged from former 002+003 |
| 002 | Issue detection test | merged from former 004+005 |
| 003 | Error fix test | - |
| 004 | Full integration test | - |
| 005 | Skill compliance evaluation | LLM Judge |
| 006 | Dynamic project evaluation | LLM-generated project + detection |

@github-actions
Contributor

github-actions bot commented Feb 3, 2026

📊 Agent-aware Evaluation Results

📊 Results

  • Overall score: 93%
  • Passed: 6/6

📈 Per-task score breakdown

| Task | Grader scores |
|---|---|
| 001-server-basic | server: 100% |
| 002-issue-detection | server: 100% |
| 003-error-fix | server: 100%, e2e: 100% |
| 004-full-integration | build: 100%, sdk: 100%, server: 100%, e2e: 100% |
| 005-skill-compliance | server: 100%, llm: 95% |
| 006-dynamic-project-eval | dynamic-project: 50%, server: 100%, e2e: 100%, llm: 60% |

Generated by Agent-aware Evals

@krislavten
Contributor

@claude review this PR

@kris-dong-dong

kris-dong-dong bot commented Feb 4, 2026

Claude encountered an error - View job


I'll analyze this and get back to you.

@Galekanyun1123a
Contributor Author

@claude review this PR

@kris-dong-dong

kris-dong-dong bot commented Feb 4, 2026

Claude encountered an error - View job


I'll analyze this and get back to you.

@krislavten
Contributor

@claude review this PR

@kris-dong-dong

kris-dong-dong bot commented Feb 4, 2026

Claude finished @krislavten's task in 2m 19s - View job


🔍 PR Review Complete

  • Read project structure and configuration
  • Review evaluation harness (runner, environment, reporter)
  • Review graders implementation
  • Review tasks and rubrics
  • Review CI/CD workflow
  • Check for security issues and best practices
  • Provide comprehensive review feedback

PR Summary

This PR adds a comprehensive evaluation system for the Agent-aware project, implementing ~8,900 lines of new code across 52 files. The system follows best practices outlined in Anthropic's evaluation guide and includes:

  • Evaluation Harness: Runner, environment isolation, transcript recording, and reporting
  • Graders: Build, SDK, Server, E2E, LLM (using OpenAI/Anthropic/AWS Bedrock), and Dynamic Project graders
  • 6 Evaluation Tasks: Covering server functionality, issue detection, error fixing, integration, skill compliance, and dynamic project generation
  • CI/CD Workflow: GitHub Actions with PR comments showing comparison against main branch

Review Feedback

Strengths

  1. Well-Architected Design - The evaluation system follows a clean separation of concerns with modular graders, isolated test environments, and incremental reporting.

  2. Comprehensive Type Definitions (evals/harness/types.ts) - Strong TypeScript types for all evaluation components, improving maintainability.

  3. Multiple LLM Provider Support (evals/graders/llm/client.ts) - Flexible support for OpenAI, Anthropic, and AWS Bedrock with proper proxy configuration.

  4. Concurrency Control (evals/harness/runner.ts:233-260) - Smart parallel execution with port offset management to avoid conflicts.

  5. CI Workflow (.github/workflows/evals.yml) - Includes main branch comparison, artifact archival, and detailed PR comments.
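
The port-offset concurrency pattern praised above can be sketched roughly as follows; the function names, the stride value, and the slot-based scheme are illustrative assumptions, not the PR's actual implementation in evals/harness/runner.ts.

```typescript
// Hypothetical sketch: each concurrent worker slot gets a disjoint port range
// so parallel eval runs never bind the same port.
export function portForWorker(basePort: number, slot: number, stride = 10): number {
  return basePort + slot * stride;
}

// Run `items` with at most `limit` tasks in flight, passing each task the
// worker slot it runs on. Results are returned in the original item order.
export async function runWithConcurrency<T, R>(
  items: T[],
  limit: number,
  run: (item: T, slot: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor; safe because JS callbacks are single-threaded
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    async (_, slot) => {
      while (next < items.length) {
        const i = next++;
        results[i] = await run(items[i], slot);
      }
    },
  );
  await Promise.all(workers);
  return results;
}
```

A runner built this way would start each task's server on `portForWorker(basePort, slot)`, guaranteeing no two in-flight tasks collide.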


Issues to Address

1. Missing Error Handler in Process Spawn (Medium)

File: evals/graders/dynamic-project/index.ts:186-250

The startDevServer function doesn't handle the case where the spawned process emits an error after the timeout resolves successfully.

proc.on('error', (err) => {
  if (!started) {
    reject(new Error(`Dev server error: ${err.message}`));
  }
  // After started=true, errors are silently ignored
});

Suggestion: Add logging for post-startup errors to aid debugging.



2. Duplicate updateSdkEndpoint Function (Low)

Files:

  • evals/harness/environment.ts:132-165
  • evals/graders/dynamic-project/index.ts:19-56

The same function is duplicated in two locations with nearly identical logic.

Suggestion: Extract to evals/graders/shared/index.ts and import where needed.



3. README Task List Mismatch (Low)

File: evals/README.md:49-59

The README lists 9 tasks (001-009), but the actual implementation only has 6 tasks (001-006). The task IDs and names have been renumbered/consolidated but the README wasn't fully updated.

Suggestion: Update the README to match the actual 6 tasks defined in evals/tasks/index.ts.



4. Hardcoded Timeout Values (Low)

Files: Multiple locations

Several timeout values are hardcoded:

  • evals/harness/environment.ts:56 - 30000 (waitForServer)
  • evals/graders/dynamic-project/index.ts:243 - 120000 (startDevServer)
  • evals/graders/dynamic-project/index.ts:288 - 180000 (installDependencies)

Suggestion: Consider making these configurable through EvalConfig for CI environments where timeouts may need adjustment.
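
One way this could look, sketched below; the field names, environment-variable names, and defaults are assumptions for illustration, not part of the PR.

```typescript
// Hypothetical timeout configuration, overridable per CI environment.
export interface EvalTimeouts {
  waitForServerMs: number;
  startDevServerMs: number;
  installDependenciesMs: number;
}

// Defaults match the values currently hardcoded in the PR.
const DEFAULTS: EvalTimeouts = {
  waitForServerMs: 30_000,
  startDevServerMs: 120_000,
  installDependenciesMs: 180_000,
};

// Read an override from the environment, falling back to the default
// when the variable is unset or not a positive number.
export function resolveTimeouts(
  env: Record<string, string | undefined> = process.env,
): EvalTimeouts {
  const read = (key: string, fallback: number): number => {
    const raw = env[key];
    const parsed = raw === undefined ? NaN : Number(raw);
    return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
  };
  return {
    waitForServerMs: read("EVAL_TIMEOUT_WAIT_FOR_SERVER_MS", DEFAULTS.waitForServerMs),
    startDevServerMs: read("EVAL_TIMEOUT_START_DEV_SERVER_MS", DEFAULTS.startDevServerMs),
    installDependenciesMs: read("EVAL_TIMEOUT_INSTALL_DEPS_MS", DEFAULTS.installDependenciesMs),
  };
}
```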


5. Missing Graceful Shutdown (Low)

File: evals/harness/environment.ts:207-221

The cleanup function uses SIGTERM followed by a fixed 1-second wait. In CI environments, processes may need more time to clean up.

cleanup: async () => {
  if (env.serverProcess) {
    env.serverProcess.kill('SIGTERM');
  }
  // ...
  await sleep(1000);  // Fixed wait may not be sufficient

Suggestion: Consider waiting for process exit events rather than using a fixed sleep.


Minor Suggestions

  1. Console Logging - Consider using a structured logger instead of console.log for better filtering in CI output.

  2. Type Assertion (evals/graders/dynamic-project/index.ts:391) - The type assertion (env as { testAppPath: string }).testAppPath suggests the interface could be improved to make this property mutable.

  3. Error Message Truncation (evals/harness/reporter.ts:181-182) - Error messages are truncated to 100 chars. Consider making this configurable or showing full errors in detailed view.
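
As a sketch of suggestion 3, truncation could take a configurable limit, with an unbounded limit serving the detailed view; the helper name and defaults are assumptions, not the reporter's actual code.

```typescript
// Hypothetical configurable truncation for reporter error messages.
// Pass Infinity as the limit to show the full error in a detailed view.
export function truncateError(message: string, maxLength = 100): string {
  if (message.length <= maxLength) return message;
  // Reserve one character for the ellipsis so the result fits the limit.
  return message.slice(0, Math.max(0, maxLength - 1)) + "…";
}
```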


Security Review

The implementation appears secure with no obvious vulnerabilities:

  • No command injection risks (user input is not directly used in shell commands)
  • API keys are properly read from environment variables
  • No sensitive data is logged or stored inappropriately
  • Temporary directories are properly cleaned up

Overall Assessment

This is a well-designed evaluation system that will provide valuable automated testing for the Agent-aware project. The architecture is clean, extensible, and follows industry best practices. The issues identified are minor and don't block merging.

Recommendation: Approve with suggestions to address the README mismatch and code duplication when convenient.


krislavten merged commit 4283969 into main on Feb 4, 2026
4 checks passed