submit a new scheduler #3
base: master
Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This change refactors and extends the scheduling subsystem (the resource, task, algorithm, and cron managers), adds batch-task and start-sequence endpoints with client support, extends the TaskInfo fields, adjusts the worker registration flow, adds Docker documentation and a compose file, and introduces scheduler tests plus several utility functions (including logging and minor API adjustments).
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant C as Client
    participant HTTP as Master HTTP Service
    participant BL as SchedulerBl
    participant TM as TaskManager
    participant AM as AlgorithmManager
    participant RM as ResourceManager
    participant BR as Broker/Workers
    participant WK as Worker
    rect rgb(241,248,255)
        C->>HTTP: POST /tasks/create(/batch)
        HTTP->>BL: QueueTask / BatchQueueTask
        BL->>TM: QueueTask(Waiting)
        TM-->>BL: ok
        note right of BL: waits for scheduling after enqueue
    end
    rect rgb(245,255,245)
        BL->>BL: ticker fires TryScheduleNextTasks
        BL->>TM: refresh task-to-group mapping
        TM-->>BL: mapping / task list
        BL->>RM: query Idle/Concurrent groups
        RM-->>BL: group set
        BL->>AM: ScheduleNextTasks (filter + select)
        AM-->>BL: next batch of tasks to start
        BL->>BL: push into startChan
    end
    rect rgb(255,250,240)
        BL->>BL: waitingStartedTask consumes startChan
        BL->>RM: GetAgentAndAssignTask
        RM->>BR: ApplyAgent(force?)
        BR-->>RM: agent, status, workers
        RM-->>BL: agent, status
        BL->>WK: Start Task
        WK-->>BL: result (success/failure)
        BL->>TM: AddTaskStartSequence
    end
    alt task completes
        WK-->>HTTP: callback/poll reports completion
        HTTP->>BL: ComputeTaskStatus/LoadTask reaches completion
        BL->>TM: MarkTaskComplete
        BL->>RM: ReleaseByTaskID
    else cancelled
        C->>HTTP: POST /tasks/cancel
        HTTP->>BL: CancelTask
        BL->>RM: ReleaseByTaskID
        BL->>TM: Remove/Mark
    end
```
```mermaid
sequenceDiagram
    autonumber
    participant C as Client
    participant HTTP as Master HTTP Service
    participant CM as CronManager
    participant BL as SchedulerBl
    participant TC as TaskCreator
    rect rgb(245,245,255)
        C->>HTTP: POST /tasks/create {cron_expr}
        HTTP->>BL: QueueTask
        BL->>CM: AddCronTask(task)
        CM-->>BL: ok
    end
    rect rgb(255,245,245)
        note over CM: fires on schedule
        CM->>TC: queueTemplateHandler(copy)
        TC-->>CM: new task ID
        CM->>BL: (indirectly) new task enqueued
    end
```
```mermaid
sequenceDiagram
    autonumber
    participant W as Worker
    participant GRPC as Master gRPC
    participant WM as WorkerManager
    participant BL as SchedulerBl
    participant RM as ResourceManager
    W->>GRPC: SayHelloMaster
    GRPC->>WM: Add WorkerClient
    GRPC->>BL: Scheduler.ChangeWorkerStatus(worker, Idle)
    BL->>RM: ChangeWorkerStatus(Idle)
    RM-->>BL: ok
    BL->>BL: TryScheduleNextTasks()
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Pull Request Overview
This PR introduces a new scheduler system for the Vermeer graph computing platform. The implementation provides priority-based, age-based, and dependency-based scheduling capabilities to improve task execution management.
- Implements a comprehensive scheduler with priority/elder/dependency-based algorithms
- Adds cron-based task scheduling capabilities
- Introduces batch task creation and sequence tracking functionality
Reviewed Changes
Copilot reviewed 33 out of 34 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| vermeer/test/scheduler/ | New test suite for validating scheduler functionality including priority, batch, and routine tests |
| vermeer/apps/master/schedules/ | Core scheduler implementation with task manager, resource manager, algorithm manager, and cron manager |
| vermeer/client/ | Enhanced client with batch task creation and sequence tracking API support |
| vermeer/apps/master/services/ | HTTP handlers for batch task creation and start sequence retrieval |
| vermeer/apps/master/bl/ | Business logic updates for scheduler integration and task lifecycle management |
| vermeer/apps/structure/task.go | Extended task structure with scheduler-specific fields (priority, preorders, exclusive, cron) |
| config files | Configuration updates for new worker groups and scheduler parameters |
Comments suppressed due to low confidence (1)
vermeer/test/scheduler/routine.go:1
- Method name has a typo: 'Waitng' should be 'Waiting'
package scheduler
```go
func (lt *LoadTaskLocal) TaskLoadBodyWithNum(num int) map[string]string {
    vertexBackends := []string{"db", "mem"}
```
Copilot AI, Sep 15, 2025
The condition and assignment logic appears backwards - if num is 10 or less, it's set to 30, which seems counterintuitive for a function parameter named 'num'. Consider clarifying this logic or adding a comment explaining the intention.
Suggested comment:

```go
// If num is less than or equal to 10, set it to 30 to ensure a minimum number of files are loaded.
```
```go
priorityCost := priorityParam * int64(task.Priority)
// step 3: resource cost
graph := structure.GraphManager.GetGraphByName(task.SpaceName, task.GraphName)
resourceCost := resourceParam / max(1, graph.VertexCount+graph.EdgeCount) // Avoid division by zero, ensure at least 1
```
Copilot AI, Sep 15, 2025
The max function is not defined in this context. Use Go's built-in math.Max function or implement a simple max function for int64 values.
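A note on this suggestion: since Go 1.21, `max` is a predeclared builtin for ordered types, so the line may already compile on newer toolchains, while `math.Max` only takes float64 and would force conversions. For older toolchains, a tiny helper avoids that round-trip; a minimal sketch (the name `maxInt64` is illustrative, not from the repository):

```go
// maxInt64 returns the larger of two int64 values; used here to clamp the
// divisor to at least 1 and avoid division by zero.
func maxInt64(a, b int64) int64 {
    if a > b {
        return a
    }
    return b
}
```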
```go
// If the Compute Task starts successfully, poll tasksGet, parse the
// response, and break once the status reports completion.
var taskResp *client.TaskResponse
var err error
for i := 0; i < ctb.waitSecond; i++ {
```
Copilot AI, Sep 15, 2025
Variable shadowing: the inner loop variable 'i' shadows the outer loop variable 'i'. Use a different variable name for the inner loop, such as 'j' or 'attempt'.
Suggested change:

```diff
-for i := 0; i < ctb.waitSecond; i++ {
+for attempt := 0; attempt < ctb.waitSecond; attempt++ {
```
```go
if taskInfo == nil {
    return errors.New("the argument `taskInfo` is nil")
}
```
Copilot AI, Sep 15, 2025
Race condition: calling RemoveTask immediately after IsTaskOngoing without proper synchronization could lead to inconsistent state if the task changes between these calls.
Suggested change:

```diff
+s.Lock()
+defer s.Unlock()
```
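To make that locking concrete, here is a minimal sketch of folding the check and the removal into a single critical section; the type, field, and method names are illustrative stand-ins, not the repository's actual API:

```go
package schedules

import (
    "fmt"
    "sync"
)

// taskManager is an illustrative stand-in for the real manager type.
type taskManager struct {
    mu      sync.Mutex
    ongoing map[int32]bool
}

// removeIfNotOngoing performs the ongoing-check and the removal under one
// lock, closing the check-then-act window the review describes.
func (s *taskManager) removeIfNotOngoing(taskID int32) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.ongoing[taskID] {
        return fmt.Errorf("task %d is still ongoing", taskID)
    }
    delete(s.ongoing, taskID)
    return nil
}
```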
```go
// Check if all dependencies are satisfied
allDepsSatisfied := true
for _, dep := range depends {
    if depTask, exists := allTaskIDs[dep]; !exists || depTask.State != structure.TaskStateWaiting {
```
Copilot AI, Sep 15, 2025
Logic error: checking if dependency task state is 'Waiting' instead of 'Complete'. Dependencies should be satisfied when the dependent task is complete, not waiting.
Suggested change:

```diff
-if depTask, exists := allTaskIDs[dep]; !exists || depTask.State != structure.TaskStateWaiting {
+if depTask, exists := allTaskIDs[dep]; !exists || depTask.State != structure.TaskStateComplete {
```
```go
if err != nil {
    return false, err
}
resp, err := vc.post(vc.httpAddr+"/admin/workers/alloc/"+groupName+"/$DEFAULT/"+graphName, reader)
```
Copilot AI, Sep 15, 2025
The hardcoded '$DEFAULT' string should be defined as a constant to improve maintainability and avoid potential typos.
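A minimal sketch of the suggested constant; the names `DefaultSpaceSegment` and `allocPath` are hypothetical, chosen only to show the shape of the change:

```go
// DefaultSpaceSegment is the path segment for the default space.
const DefaultSpaceSegment = "$DEFAULT"

// allocPath builds the worker-allocation URL from its parts.
func allocPath(httpAddr, groupName, graphName string) string {
    return httpAddr + "/admin/workers/alloc/" + groupName + "/" + DefaultSpaceSegment + "/" + graphName
}
```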
Actionable comments posted: 29
🧹 Nitpick comments (38)
vermeer/.gitignore (2)
87-89: Anchor the ignore rule to the repository root to avoid matching nested paths

As written, `test/case/` matches any directory named `test/case` at any depth. If the intent is to ignore only `vermeer/test/case/`, add a leading `/` to anchor it.

```diff
-test/case/
+/test/case/
```
87-89: Confirm whether the test-case directory should really be ignored; to keep the directory structure or some files, ignore the contents and re-include selectively

The PR goal mentions that tests still need to be added; ignoring the whole `test/case/` directory may keep test cases or benchmark data out of version control. If only large files or build products should be ignored, use finer-grained rules or keep a placeholder file.

Optional approach (keep the directory, ignore only its contents, keep .gitkeep):

```diff
-/test/case/
+/test/case/*
+!/test/case/.gitkeep
```

To keep scripts/docs, re-include them as well:

```diff
+!/test/case/**/*.md
+!/test/case/**/*.sh
```

Please confirm the intent: ignore the test-case outputs, or keep the test cases out of version control entirely?
vermeer/apps/graphio/local_file.go (3)
67-92: Structured debug logging plus whitespace trimming for robustness (vertex branch)

- TrimSpace both ends of the range, so embedded spaces neither break `Atoi` nor skew the zero-padding width.
- Use structured logging for easier search and aggregation (also log the number of generated partitions).

```diff
 ss := strings.Split(files[bIdx+1:eIdx], ",")
 if len(ss) != 2 {
     s := fmt.Sprintf("MakeTask LoadTypeLocal parse file error: %s", files)
     logrus.Errorf(s)
     return nil, errors.New(s)
 }
-s, err := strconv.Atoi(ss[0])
+sStr := strings.TrimSpace(ss[0])
+eStr := strings.TrimSpace(ss[1])
+s, err := strconv.Atoi(sStr)
 if err != nil {
     s := fmt.Sprintf("MakeTask LoadTypeLocal parse file error: %s, %s", files, err)
     logrus.Errorf(s)
     return nil, errors.New(s)
 }
-e, err := strconv.Atoi(ss[1])
+e, err := strconv.Atoi(eStr)
 if err != nil {
     s := fmt.Sprintf("MakeTask LoadTypeLocal parse file error: %s, %s", files, err)
     logrus.Errorf(s)
     return nil, errors.New(s)
 }
-logrus.Debugf("MakeTask LoadTypeLocal parse file: %s, s:%d, e:%d", files, s, e)
+logrus.WithFields(logrus.Fields{
+    "file_spec": files,
+    "start":     s,
+    "end":       e,
+    "count":     e - s + 1,
+}).Debug("MakeTask LoadTypeLocal parsed file range")
 for i := s; i <= e; i++ {
     part := LoadPartition{}
     part.Init(partID, taskID, LoadPartTypeVertex)
     part.IpAddr = ip
     part.Params = make(map[string]string)
-    part.Params["load.file_path"] = files[:bIdx] + common.ItoaPad(i, len(ss[1]))
+    part.Params["load.file_path"] = files[:bIdx] + common.ItoaPad(i, len(eStr))
     loadParts = append(loadParts, part)
     partID += 1
 }
```
111-137: Keep observability symmetric with the vertex branch; include `err` in error messages and trim whitespace (edge branch)

- The edge branch should emit the same debug information to ease troubleshooting.
- The current log on `strconv.Atoi` failure omits `err`, which is not enough to diagnose.
- Apply the same TrimSpace and zero-padding-width fixes.

```diff
 ss := strings.Split(files[bIdx+1:eIdx], ",")
 if len(ss) != 2 {
     s := fmt.Sprintf("MakeTask LoadTypeLocal parse file error: %s", files)
     logrus.Errorf(s)
     return nil, errors.New(s)
 }
-s, err := strconv.Atoi(ss[0])
+sStr := strings.TrimSpace(ss[0])
+eStr := strings.TrimSpace(ss[1])
+s, err := strconv.Atoi(sStr)
 if err != nil {
-    s := fmt.Sprintf("MakeTask LoadTypeLocal parse file error: %s", files)
+    s := fmt.Sprintf("MakeTask LoadTypeLocal parse file error: %s, %s", files, err)
     logrus.Errorf(s)
     return nil, errors.New(s)
 }
-e, err := strconv.Atoi(ss[1])
+e, err := strconv.Atoi(eStr)
 if err != nil {
-    s := fmt.Sprintf("MakeTask LoadTypeLocal parse file error: %s", files)
+    s := fmt.Sprintf("MakeTask LoadTypeLocal parse file error: %s, %s", files, err)
     logrus.Errorf(s)
     return nil, errors.New(s)
 }
+logrus.WithFields(logrus.Fields{
+    "file_spec": files,
+    "start":     s,
+    "end":       e,
+    "count":     e - s + 1,
+}).Debug("MakeTask LoadTypeLocal parsed file range")
 for i := s; i <= e; i++ {
     part := LoadPartition{}
     part.Init(partID, taskID, LoadPartTypeEdge)
     part.IpAddr = ip
     part.Params = make(map[string]string)
-    part.Params["load.file_path"] = files[:bIdx] + common.ItoaPad(i, len(ss[1]))
+    part.Params["load.file_path"] = files[:bIdx] + common.ItoaPad(i, len(eStr))
     loadParts = append(loadParts, part)
     partID += 1
 }
```
50-139: Extract the shared parsing logic to reduce duplication

The vertex/edge parsing and partition generation are nearly identical; extracting a helper (taking the partType, key, and log prefix) would reduce duplication and maintenance cost.

Example signature:

```go
func appendRangeParts(ip, files string, partType LoadPartType, taskID int32, partID *int32) ([]LoadPartition, error)
```
vermeer/docker-compose.yaml (1)
29-29: Missing newline at end of file. For compatibility and convention, add a trailing newline.
vermeer/README.md (1)
34-34: Fix the directory name in the description

"useconfig folder" appears to be a typo for "config folder".

Apply this change:

```diff
-The parameter env specifies the name of the configuration file in the useconfig folder.
+The parameter env specifies the name of the configuration file in the config folder.
```

vermeer/apps/structure/task.go (1)
59-64: Define safer defaults and comments for the new fields

- As a slice, `Preorders` defaulting to `nil` is generally safe for JSON serialization and iteration, but a shared reference before an external append can introduce ambiguity. Initialize it to an empty slice where the task is created.
- The comment already explains the `Exclusive` semantics; consider clarifying the scope of exclusivity (global/space/graph level) in the README/API docs.

This can be added where `CreateTask` initializes the task (that change lands elsewhere in this file; example below):

```diff
 task.Workers = make([]*TaskWorker, 0)
 task.SpaceName = spaceName
+// scheduler defaults
+task.Preorders = []int32{}
+task.Priority = 0
+task.Exclusive = false
+task.CronExpr = ""
```

vermeer/test/functional/load_local.go (2)
57-57: The log message may be misleading

The log records `num-1`, but the actual file range is `[0, num-1]`, i.e. `num` files in total. Adjust the message to reflect the number of files actually loaded.

```diff
-logrus.Infof("load with num: " + strconv.Itoa(num-1))
+logrus.Infof("load with %d files: [0, %d]", num, num-1)
```

67-68: Use fmt.Sprintf for readability

The string concatenation is hard to read; `fmt.Sprintf` improves it.

```diff
-"load.vertex_files": "{\"127.0.0.1\":\"" + "test/case/vertex/vertex_[0," + strconv.Itoa(num-1) + "]" + "\"}",
-"load.edge_files": "{\"127.0.0.1\":\"" + "test/case/edge/edge_[0," + strconv.Itoa(num-1) + "]" + "\"}",
+"load.vertex_files": fmt.Sprintf(`{"127.0.0.1":"test/case/vertex/vertex_[0,%d]"}`, num-1),
+"load.edge_files": fmt.Sprintf(`{"127.0.0.1":"test/case/edge/edge_[0,%d]"}`, num-1),
```

Remember to import the `fmt` package at the top of the file.

vermeer/apps/master/schedules/broker.go (1)
101-109: The condition is convoluted; simplify for readability

```diff
-if !(forceApply != nil && len(forceApply) > 0 && forceApply[0]) {
-    if b.isAgentBusy(agent) {
-        return nil, AgentStatusAgentBusy, nil, nil
-    }
-
-    if b.isWorkerBusy(workers, agent) {
-        return nil, AgentStatusWorkerBusy, nil, nil
-    }
+shouldCheckBusy := len(forceApply) == 0 || !forceApply[0]
+if shouldCheckBusy {
+    if b.isAgentBusy(agent) {
+        return nil, AgentStatusAgentBusy, nil, nil
+    }
+
+    if b.isWorkerBusy(workers, agent) {
+        return nil, AgentStatusWorkerBusy, nil, nil
+    }
 }
```

vermeer/client/structure.go (2)
41-49: The new StartSequence request/response types align with the server side; consider unifying the naming. Rename TaskStartSequenceResp to TaskStartSequenceResponse for consistency with the existing BaseResponse/TasksResponse types (optional; can be deferred to avoid an API break).
50-51: The batch-create response is a bare array; consider the same envelope style as the other responses

`type TaskBatchCreateResponse []TaskResponse` carries no top-level errcode/message, unlike the other *Response types, so callers cannot handle errors uniformly. One option:

```go
type TaskBatchCreateResponse struct {
    client.BaseResponse
    Results []client.TaskResponse `json:"results,omitempty"`
}
```

If the server side is already settled, keep the current shape and document the contract.
vermeer/test/scheduler/batch.go (1)
1-14: The batch test is an empty stub; fill it in to assert execution in submission order

Suggested minimal implementation: use CreateTaskBatch to create a batch of compute tasks (different submission times, identical priority), then poll GetTaskStartSequence and verify the start order matches the submission order.

Possible skeleton (pseudocode, relying on existing client method names):

```go
func TestBatch(t *testing.T, expectRes *functional.ExpectRes, healthCheck *functional.HealthCheck, masterHttp *client.VermeerClient, graphName []string, factor string, waitSecond int) {
    require.Len(t, graphName, 1)
    g := graphName[0]
    req := client.TaskCreateBatchRequest{
        {TaskType: "compute", GraphName: g, Params: map[string]string{"algorithm": "pagerank"}},
        {TaskType: "compute", GraphName: g, Params: map[string]string{"algorithm": "pagerank"}},
        {TaskType: "compute", GraphName: g, Params: map[string]string{"algorithm": "pagerank"}},
    }
    res, err := masterHttp.CreateTaskBatch(req)
    require.NoError(t, err)
    require.Len(t, res, len(req))
    // Collect the new task IDs and query the start sequence
    var ids []int32
    for _, r := range res {
        ids = append(ids, r.Task.ID)
    }
    seqResp, err := masterHttp.GetTaskStartSequence(client.TaskStartSequenceRequest{QueryTasks: ids})
    require.NoError(t, err)
    require.Equal(t, ids, seqResp.Sequence, "batch tasks should start in submission order")
}
```

Want me to fill in the actual client calls and assertion details?
vermeer/apps/master/bl/task_bl.go (2)
186-188: Errors from cancelling a cron task (if any) are not handled; log them

If Scheduler.CancelCronTask returns an error, log it at least at Warn level; if it is an idempotent no-op, the current code is fine.

Please confirm the signature of CancelCronTask; if it returns an error, change to:

```diff
-Scheduler.CancelCronTask(task)
+if err := Scheduler.CancelCronTask(task); err != nil {
+    logrus.Warnf("cancel cron for task %d failed: %v", task.ID, err)
+}
```
248-254: Make the error-handling policy for batch enqueue explicit. The code currently just returns errs; on partial failure, decide whether to retry the failed items, roll back the enqueued ones, or mark the failures as errored. Settle on one policy (here or in the caller) and cover it with tests.
vermeer/test/scheduler/test_scheduler.go (2)
36-45: The second graph never runs CheckGraph, which may cause flaky tests

Most later sub-tests depend on the graph having finished loading; also run CheckGraph on graphName+"_2", or wait for its status to become ready.

Suggested:

```diff
 loadTest2 := functional.LoadTaskLocal{}
 loadTest2.Init(graphName+"_2", expectRes, &masterHttp, waitSecond, t, &healthCheck)
 loadTest2.SendLoadRequest(loadTest2.TaskLoadBodyWithNum(20))
-// loadTest2.CheckGraph()
+loadTest2.CheckGraph()
```
53-57: Use t.Cleanup to guarantee resource cleanup

If an assertion fails midway, the graph-deletion logic never runs. Register the cleanup with t.Cleanup for more stable tests.

```diff
-// Finally, delete graph
-deleteGraph := functional.DeleteGraph{}
-deleteGraph.DeleteGraph(t, &masterHttp, graphName+"_1")
-deleteGraph.DeleteGraph(t, &masterHttp, graphName+"_2")
+t.Cleanup(func() {
+    deleteGraph := functional.DeleteGraph{}
+    deleteGraph.DeleteGraph(t, &masterHttp, graphName+"_1")
+    deleteGraph.DeleteGraph(t, &masterHttp, graphName+"_2")
+})
```

vermeer/test/scheduler/routine.go (2)
22-23: The hardcoded cron expression may make the test flaky

The test uses the "* * * * *" (every minute) cron expression and then waits 2 minutes to verify the task fires. Problems with this approach:

- The wait is long and slows down the test run
- It depends on the system clock, and time-boundary effects can make the test unstable

Pass the cron expression in as a parameter, or use a shorter interval (e.g. every second) to speed the test up.

```diff
-// every 1 minute
-taskComputeBody["cron_expr"] = "* * * * *"
+// every 30 seconds for faster testing
+taskComputeBody["cron_expr"] = "*/30 * * * * *"
```

Adjust the wait accordingly:

```diff
-time.Sleep(2 * time.Minute)
+time.Sleep(35 * time.Second)
```
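One caveat about the six-field expression above: robfig/cron/v3 (which this PR pulls in via go.mod) parses five-field expressions by default, so a leading seconds column only works if the scheduler constructs its cron with the seconds-aware option. A minimal sketch, assuming the cron manager can opt in:

```go
package scheduler

import (
    "fmt"

    "github.com/robfig/cron/v3"
)

// newSecondsCron opts in to a leading seconds field, so an expression like
// "*/30 * * * * *" parses; the default 5-field parser would reject it.
func newSecondsCron() (*cron.Cron, error) {
    c := cron.New(cron.WithSeconds())
    if _, err := c.AddFunc("*/30 * * * * *", func() {
        fmt.Println("fires every 30 seconds")
    }); err != nil {
        return nil, err
    }
    return c, nil
}
```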
41-41: Missing verification after cancelling the task

GetTaskCancel is called without checking the return value or confirming the task was actually cancelled. Add assertions to ensure the cancellation succeeded.

```diff
-masterHttp.GetTaskCancel(int(taskid))
+success, err := masterHttp.GetTaskCancel(int(taskid))
+require.NoError(t, err)
+require.True(t, success, "Failed to cancel task")
```

vermeer/test/functional/compute_base.go (3)
100-109: The new method lacks error handling and logging

SendComputeReqAsyncNotWait returns immediately after creating the task but never logs the successful creation. Add a log line to ease debugging.

```diff
 func (ctb *ComputeTaskBase) SendComputeReqAsyncNotWait(params map[string]string) int32 {
     //create Compute Task
     resp, err := ctb.masterHttp.CreateTaskAsync(client.TaskCreateRequest{
         TaskType:  "compute",
         GraphName: ctb.graphName,
         Params:    params,
     })
     require.NoError(ctb.t, err)
+    ctb.t.Logf("Created async task ID: %d", resp.Task.ID)
     return int32(resp.Task.ID)
 }
```
116-120: The graph_name override logic may be confusing

During batch creation, a non-empty params[i]["graph_name"] overrides ctb.graphName. This implicit override can make debugging harder; add a comment stating it is intentional.

```diff
 for i := 0; i < len(params); i++ {
+    // Each task may specify its own graph name, for cross-graph test scenarios
     graph := ctb.graphName
     if params[i]["graph_name"] != "" {
         graph = params[i]["graph_name"]
     }
```
143-143: Reusing the loop variable may cause confusion

On line 143 the inner loop reuses the variable `i`, shadowing the outer loop's `i` (line 138). It causes no error in this context, but it is confusing.

```diff
-for i := 0; i < ctb.waitSecond; i++ {
+for j := 0; j < ctb.waitSecond; j++ {
```

vermeer/client/client.go (1)
153-171: The new AllocGroupGraph method hardcodes a path segment

The method hardcodes "$DEFAULT" as part of the path. This may be a specific convention, but its meaning is undocumented.

```diff
+// AllocGroupGraph allocates a graph to a worker group
+// The "$DEFAULT" segment in the path represents the default space/namespace
 func (vc *VermeerClient) AllocGroupGraph(graphName string, groupName string) (bool, error) {
     reader, err := Request2Reader(struct{}{})
     if err != nil {
         return false, err
     }
     resp, err := vc.post(vc.httpAddr+"/admin/workers/alloc/"+groupName+"/$DEFAULT/"+graphName, reader)
```

vermeer/apps/master/services/http_tasks.go (2)
257-260: The sequence-building logic can be simplified

The code creates a slice with capacity 1 and then appends; this is inefficient.

```diff
-sequence := make([]int32, 0, 1)
+sequence := make([]int32, 0, len(r))
 for _, task := range r {
     sequence = append(sequence, int32(task.ID))
 }
```
255-256: Suggestion: avoid using the global Scheduler directly; inject it for testability

vermeer/apps/master/bl/bl.go:46 defines the global `var Scheduler = &ScheduleBl{}`; ScheduleBl in vermeer/apps/master/bl/scheduler_bl.go:31 embeds structure.MutexLocker (so concurrency is already partly guarded), but the global instance still reduces unit-test replaceability and increases coupling.

- Suggestion: extract a minimal Scheduler interface (e.g. the methods actually called: TaskStartSequence, QueueTask, CloseCurrent, CancelCronTask, PeekSpaceTail, IsDispatchPaused) and inject it into the handlers/BL via a constructor or field; keep the existing global instance as the default production implementation.
- Example call sites to change: vermeer/apps/master/services/http_tasks.go:255 (`r := Scheduler.TaskStartSequence(...)`) and the repository's other direct `Scheduler.*` references, switching them to the injected interface instance.
vermeer/apps/master/schedules/scheduler_cron_manager.go (3)
12-13: The map field comments disagree with the keys

The comment on line 12 says "cron expression to TaskInfo", but the map key is actually the task ID (int32), not a cron expression. The comment on line 13 has the same problem.

```diff
-cronTasks map[int32][]*structure.TaskInfo // cron expression to TaskInfo. Origin task ID to copied tasks
-crons     map[int32][]*cron.Cron          // cron expression to cron jobs
+cronTasks map[int32][]*structure.TaskInfo // Origin task ID to copied tasks
+crons     map[int32][]*cron.Cron          // Origin task ID to cron jobs
```
48-61: The taskInfo nil check inside the closure comes too late

Line 49 checks `taskInfo == nil`, but the check runs inside the closure while taskInfo was captured when the closure was created. If taskInfo can be nil when AddFunc is called, return an error at lines 37-39 instead.

```diff
 _, err := cronJob.AddFunc(taskInfo.CronExpr, func() {
-    if taskInfo == nil {
-        return
-    }
-
     // CREATE a new task from the original task, using taskbl, it is handled in queueTemplateHandler
     // copy a new taskInfo
     newID, err := t.queueTemplateHandler(taskInfo)
```
86-109: DeleteTaskByGraph may become a performance problem

When a graph has many tasks, this method scans every task to find matches. In a large deployment this can become a bottleneck.

Consider maintaining an extra index (graph -> task IDs) to speed up the lookup:

```go
// Add to the SchedulerCronManager struct
graphTasks map[string][]int32 // spaceName:graphName -> task IDs
```

vermeer/test/scheduler/priority.go (3)
113-113: The hardcoded group name may cause test conflicts

Using the hardcoded group name "test" can collide when tests run concurrently.

```diff
-masterHttp.AllocGroupGraph(graphName[0]+"_3", "test")
+// Use a unique group name to avoid conflicts between concurrent tests
+groupName := fmt.Sprintf("test_%d", time.Now().UnixNano())
+masterHttp.AllocGroupGraph(graphName[0]+"_3", groupName)
```
87-87: The comment does not match what the test does

The comment says "send two tasks with different size", but the test actually sends two identical tasks.

```diff
-// send two tasks with different size
+// send two identical tasks to test concurrent execution
```
125-132: The map copy can be simplified

The map is currently copied with a loop; a more concise form:

```diff
-new_body := make(map[string]string)
-for k, v := range taskComputeBody {
-    new_body[k] = v
-}
+new_body := make(map[string]string, len(taskComputeBody))
+for k, v := range taskComputeBody {
+    new_body[k] = v
+}
```

Or use the maps package (if the Go version supports it):

```go
import "maps"

new_body := maps.Clone(taskComputeBody)
```

vermeer/apps/master/schedules/scheduler_resource_manager.go (2)
108-114: A missing nil check may cause a panic

Lines 109-111 check whether each worker is nil, but the `workers` map returned by `rm.broker.ApplyAgent` can itself be nil. Check before iterating.

```diff
+if workers == nil {
+    return agent, status, nil
+}
 for _, worker := range workers {
```
176-181: The public method's design could be improved

ChangeWorkerStatus is a public method, but it merely calls the private changeWorkerStatus. Since it is just a locking wrapper, either handle the lock directly in the private method or add more validation logic to the public one.
vermeer/apps/master/schedules/scheduler_task_manager.go (2)

99-107: Typo: GetAllTasksWaitng

The method name should be GetAllTasksWaiting, not GetAllTasksWaitng (missing an 'i').

```diff
-func (t *SchedulerTaskManager) GetAllTasksWaitng() []*structure.TaskInfo {
+func (t *SchedulerTaskManager) GetAllTasksWaiting() []*structure.TaskInfo {
```

137-140: Inconsistent log levels

Line 137 logs the function call at Info level, but line 139 logs the details at Debug level. Keep them consistent: either both Debug (since this is a debug/test feature) or both Info.
vermeer/apps/master/bl/scheduler_bl.go (1)

244-247: Leftover commented-out unlock in BatchQueueTask

Line 247 contains a commented-out `defer s.Unlock(s.Lock())`. It looks like leftover code and should be removed. Also confirm whether PauseDispatch and ResumeDispatch already take the necessary locks.
vermeer/apps/master/schedules/scheduler_algorithm_manager.go (1)

24-28: Typo: the schuduledSchedulerAlgorithms field name

`schuduledSchedulerAlgorithms` should be `scheduledSchedulerAlgorithms` ("scheduled", not "schuduled").

```diff
 type SchedulerAlgorithmManager struct {
     filteredSchedulerAlgorithms  map[string]SchedulerAlgorithm
-    schuduledSchedulerAlgorithms map[string]SchedulerAlgorithm
+    scheduledSchedulerAlgorithms map[string]SchedulerAlgorithm
     dispatchPaused               bool
 }
```

This typo appears in several places throughout the file; fix them all.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
vermeer/go.sum is excluded by !**/*.sum
📒 Files selected for processing (33)
- vermeer/.gitignore (1 hunks)
- vermeer/README.md (1 hunks)
- vermeer/README.zh-CN.md (1 hunks)
- vermeer/apps/graphio/local_file.go (1 hunks)
- vermeer/apps/master/bl/grpc_handlers.go (2 hunks)
- vermeer/apps/master/bl/scheduler_bl.go (5 hunks)
- vermeer/apps/master/bl/task_bl.go (4 hunks)
- vermeer/apps/master/bl/task_creator.go (1 hunks)
- vermeer/apps/master/bl/worker_bl.go (1 hunks)
- vermeer/apps/master/master_main.go (1 hunks)
- vermeer/apps/master/schedules/broker.go (2 hunks)
- vermeer/apps/master/schedules/scheduler_algorithm_manager.go (1 hunks)
- vermeer/apps/master/schedules/scheduler_cron_manager.go (1 hunks)
- vermeer/apps/master/schedules/scheduler_resource_manager.go (1 hunks)
- vermeer/apps/master/schedules/scheduler_task_manager.go (1 hunks)
- vermeer/apps/master/services/http_tasks.go (2 hunks)
- vermeer/apps/master/services/router.go (1 hunks)
- vermeer/apps/master/workers/worker_manager.go (1 hunks)
- vermeer/apps/structure/task.go (1 hunks)
- vermeer/client/client.go (3 hunks)
- vermeer/client/structure.go (2 hunks)
- vermeer/config/master.ini (2 hunks)
- vermeer/config/worker04.ini (1 hunks)
- vermeer/docker-compose.yaml (1 hunks)
- vermeer/go.mod (1 hunks)
- vermeer/test/functional/compute_base.go (1 hunks)
- vermeer/test/functional/compute_task.go (1 hunks)
- vermeer/test/functional/load_local.go (2 hunks)
- vermeer/test/scheduler/batch.go (1 hunks)
- vermeer/test/scheduler/priority.go (1 hunks)
- vermeer/test/scheduler/routine.go (1 hunks)
- vermeer/test/scheduler/test_scheduler.go (1 hunks)
- vermeer/vermeer_test.go (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (19)
- vermeer/apps/master/bl/grpc_handlers.go (1)
  - vermeer/apps/master/schedules/scheduler_resource_manager.go (1): WorkerOngoingStatusIdle (13-13)
- vermeer/apps/master/services/router.go (1)
  - vermeer/apps/master/services/http_tasks.go (6): TasksHandler (34-36), TaskCreateHandler (64-66), TaskCreateBatchHandler (118-120), TaskCreateSyncHandler (84-86), ComputeValueHandler (173-175), TaskStartSequenceHandler (235-237)
- vermeer/apps/master/master_main.go (1)
  - vermeer/apps/master/services/router.go (1): SetAdminRouters (74-96)
- vermeer/test/scheduler/routine.go (1)
  - vermeer/test/functional/compute_task.go (1): MakeComputeTask (47-128)
- vermeer/apps/master/schedules/scheduler_cron_manager.go (1)
  - vermeer/apps/structure/task.go (1): TaskInfo (41-64)
- vermeer/test/scheduler/batch.go (1)
  - vermeer/client/client.go (1): VermeerClient (31-36)
- vermeer/client/structure.go (1)
  - vermeer/apps/master/services/http_tasks.go (1): TaskCreateRequest (68-72)
- vermeer/apps/master/schedules/broker.go (2)
  - vermeer/apps/structure/task.go (1): TaskInfo (41-64)
  - vermeer/apps/master/workers/worker_manager.go (1): WorkerClient (44-59)
- vermeer/apps/master/schedules/scheduler_resource_manager.go (2)
  - vermeer/apps/master/schedules/broker.go (4): Broker (56-60), Agent (42-45), AgentStatus (30-30), AgentStatusError (34-34)
  - vermeer/apps/structure/task.go (1): TaskInfo (41-64)
- vermeer/apps/master/bl/task_creator.go (2)
  - vermeer/apps/structure/task.go (1): TaskInfo (41-64)
  - vermeer/client/structure.go (1): TaskInfo (52-62)
- vermeer/client/client.go (1)
  - vermeer/client/structure.go (5): Request2Reader (181-187), TaskCreateBatchRequest (175-175), TaskBatchCreateResponse (50-50), TaskStartSequenceResp (45-48), TaskStartSequenceRequest (41-43)
- vermeer/test/scheduler/test_scheduler.go (5)
  - vermeer/client/client.go (1): VermeerClient (31-36)
  - vermeer/test/functional/load_local.go (1): LoadTaskLocal (29-31)
  - vermeer/test/scheduler/priority.go (1): TestPriority (146-179)
  - vermeer/test/scheduler/batch.go (1): TestBatch (9-14)
  - vermeer/test/scheduler/routine.go (1): TestRoutine (46-54)
- vermeer/apps/master/schedules/scheduler_task_manager.go (1)
  - vermeer/apps/structure/task.go (1): TaskInfo (41-64)
- vermeer/test/scheduler/priority.go (3)
  - vermeer/client/client.go (1): VermeerClient (31-36)
  - vermeer/test/functional/compute_task.go (1): MakeComputeTask (47-128)
  - vermeer/test/functional/load_local.go (1): LoadTaskLocal (29-31)
- vermeer/test/functional/compute_base.go (3)
  - vermeer/apps/master/services/http_tasks.go (2): TaskCreateRequest (68-72), TaskCreateBatchRequest (122-122)
  - vermeer/client/structure.go (5): TaskCreateRequest (169-173), TaskInfo (52-62), TaskCreateBatchRequest (175-175), BaseResponse (27-30), TaskResponse (36-39)
  - vermeer/apps/structure/task.go (1): TaskInfo (41-64)
- vermeer/vermeer_test.go (1)
  - vermeer/test/scheduler/test_scheduler.go (1): TestScheduler (15-60)
- vermeer/apps/master/services/http_tasks.go (1)
  - vermeer/client/structure.go (2): TaskStartSequenceRequest (41-43), TaskStartSequenceResp (45-48)
- vermeer/apps/master/schedules/scheduler_algorithm_manager.go (1)
  - vermeer/apps/structure/task.go (1): TaskInfo (41-64)
- vermeer/apps/master/bl/scheduler_bl.go (5)
  - vermeer/apps/master/schedules/scheduler_resource_manager.go (3): SchedulerResourceManager (20-28), WorkerOngoingStatusDeleted (17-17), WorkerOngoingStatus (10-10)
  - vermeer/apps/master/schedules/scheduler_algorithm_manager.go (1): SchedulerAlgorithmManager (24-28)
  - vermeer/apps/master/schedules/scheduler_task_manager.go (1): SchedulerTaskManager (10-18)
  - vermeer/apps/master/schedules/scheduler_cron_manager.go (1): SchedulerCronManager (11-16)
  - vermeer/apps/structure/task.go (1): TaskInfo (41-64)
🪛 YAMLlint (1.37.1)
vermeer/docker-compose.yaml
[error] 29-29: no new line character at the end of file
(new-line-at-end-of-file)
🪛 markdownlint-cli2 (0.17.2)
vermeer/README.md
9-9: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
17-17: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
24-24: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
vermeer/README.zh-CN.md
9-9: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
16-16: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
22-22: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: check-dependency-license
- GitHub Check: Analyze (java)
- GitHub Check: Analyze (go)
🔇 Additional comments (22)
vermeer/apps/master/bl/task_creator.go (1)
102-122: Must fix: avoid shallow copies of Params/Preorders (shared references and data races)

`task.Params = src.Params` and `task.Preorders = src.Preorders` are shallow copies: the new and old tasks share the same map/slice, so later modifications affect each other and can race under concurrency. Deep-copy instead.

File: vermeer/apps/master/bl/task_creator.go, lines 102-122

```diff
 func (bc *baseCreator) CopyTaskInfo(src *structure.TaskInfo) (*structure.TaskInfo, error) {
 @@
-    task.Params = src.Params
+    // deep copy params
+    if src.Params != nil {
+        cp := make(map[string]string, len(src.Params))
+        for k, v := range src.Params {
+            cp[k] = v
+        }
+        task.Params = cp
+    }
 @@
-    task.Preorders = src.Preorders
+    // deep copy preorders
+    if src.Preorders != nil {
+        task.Preorders = append([]int32(nil), src.Preorders...)
+    }
 }
```

The automated repository search returned no matches; please manually confirm whether similar assignments exist elsewhere and replace them with deep copies.
vermeer/config/master.ini (1)
28-28: Confirm the unit of ticker_interval and validate min/max bounds

vermeer/config/master.ini line 28: ticker_interval=1

The search could not locate the LoadConfig/startTicker implementation that reads and validates this value. Ensure LoadConfig/startTicker validates the unit (seconds) and the minimum/maximum bounds (e.g. at least 1s), and document the default and recommended values in the README; if this is already implemented, list the files/functions for review.
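If that validation is indeed missing, a minimal sketch of what it could look like once the value is read from master.ini; the bounds and the function name are illustrative assumptions:

```go
import "time"

// clampTickerInterval treats the configured value as seconds and clamps
// it into a sane range before constructing the ticker.
func clampTickerInterval(seconds int) time.Duration {
    if seconds < 1 {
        seconds = 1 // floor: avoid a busy loop
    }
    if seconds > 3600 {
        seconds = 3600 // ceiling: avoid an effectively frozen scheduler
    }
    return time.Duration(seconds) * time.Second
}
```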
vermeer/go.mod (1)
75-75: Make github.com/robfig/cron/v3 a direct dependency and upgrade to v3.0.1

An online lookup shows v3.0.1 as the latest stable release (as of 2025-09-15). vermeer/go.mod currently lists it as indirect (`github.com/robfig/cron/v3 v3.0.0 // indirect`). Action: add or replace it in the require block as `github.com/robfig/cron/v3 v3.0.1` (dropping `// indirect`), run `go mod tidy` to update go.sum, and commit the changes.

vermeer/apps/master/schedules/broker.go (1)
76-113: Confirmed: the new `workers` return value of ApplyAgent is received correctly by the caller. Only one call site was found in the repository: vermeer/apps/master/schedules/scheduler_resource_manager.go:85, which calls `agent, status, workers, err := rm.broker.ApplyAgent(taskInfo, !taskInfo.Exclusive)`; `workers` is received, and no other callers were found.
vermeer/apps/master/services/router.go (2)
40-44: Confirm authorization and rate limiting for the new scheduling routes

- vermeer/apps/master/services/router.go (around lines 45 and 51) registers the routes with authFilters; confirm that the filter enforces role/permission checks for /create/batch and /start_sequence (if not, add explicit checks).
- No rate-limiting implementation was found in the repository: add rate limits or concurrency caps for batch creation and start-sequence retrieval (or queue the requests), and define thresholds and a degradation strategy.

38-44: Confirm the handlers behind /create/batch and /start_sequence

vermeer/apps/master/services/http_tasks.go defines TaskCreateBatchHandler (lines 118-120) and TaskStartSequenceHandler (lines 235-237), but the respective receiver-method implementations were not found. Confirm whether the embedded common.SenHandler supplies the behavior the router needs and satisfies the interface; if not, add or move the missing handler methods.
vermeer/vermeer_test.go (5)
34-35: Introducing the scheduler test suite looks OK

99-101: The scheduler branch added to the test switch integrates well

108-111: The new testScheduler entry point is cleanly structured

114-116: Comment-only change; nothing to do

170-170: Comment-only change; nothing to do

vermeer/client/structure.go (1)
175-176: The batch-create request type alias is simple and direct; OK

vermeer/apps/master/bl/task_bl.go (1)
24-26: Importing strconv/strings for scheduler-parameter parsing is reasonable

vermeer/apps/master/bl/worker_bl.go (1)
66-76: The CloseCurrent signature changed; audit all call sites across the repository and adjust the error-setting/closing order

vermeer/apps/master/bl/worker_bl.go (lines 66-76) changed from CloseCurrent(taskInfo.ID) to CloseCurrent(taskInfo.ID, workerName). The automated search returned no results, so it could not be confirmed that the other call sites were updated.

- Run the following from the repository root to find remaining single-argument calls and fix them:

```bash
git grep -n -- 'CloseCurrent(' || true
rg -nP --hidden --no-ignore 'CloseCurrent\s*\(' -S -g '!vendor/**' || true
rg -nP --hidden --no-ignore 'CloseCurrent\s*\([^,)]*\)' -S -g '!vendor/**' || true
```

- Suggested ordering: call Scheduler.CloseCurrent(...) first, and only call taskMgr.SetError(...) (or record the failure) based on its result, to avoid the inconsistent state of "marked as errored but never actually closed".
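A minimal sketch of that ordering, assuming CloseCurrent returns an error and SetError takes a message; both signatures are assumptions rather than confirmed API:

```go
// Close first; only mark the task as errored once the close succeeded,
// so state never says "errored" for a task that is still running.
if err := Scheduler.CloseCurrent(taskInfo.ID, workerName); err != nil {
    logrus.Errorf("close current task %d failed: %v", taskInfo.ID, err)
    return
}
taskMgr.SetError(taskInfo, "closed on worker exit")
```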
vermeer/client/client.go (2)
251-274: The batch-create response handling is correct

The new CreateTaskBatch method follows the client's established pattern; error handling and response parsing are solid.

318-341: GetTaskStartSequence is implemented cleanly

The new method follows the established error-handling pattern; request construction and response parsing are clear.
vermeer/apps/master/schedules/scheduler_cron_manager.go (1)
72-84: DeleteTask is implemented correctly. The deletion logic is clear: it stops all related cron jobs and cleans up the mappings.
vermeer/test/scheduler/priority.go (1)
41-42: The comment matches the implementation; no change needed. The scheduler in vermeer/apps/master/schedules/scheduler_algorithm_manager.go sorts by Priority in descending order (higher value = higher priority), so the test's comment and assertion that "priority 1 runs before priority 0" are correct. See scheduler_algorithm_manager.go lines 173-175 and vermeer/test/scheduler/priority.go lines 39-41.
vermeer/apps/master/services/http_tasks.go (1)
115-115: Confirm: is the type rename a backward-incompatible change?

File: vermeer/apps/master/services/http_tasks.go:115

In short: renaming only the server-side Go type (TaskResp -> TaskCreateResponse) does not by itself break compatibility; what matters is whether the public JSON response shape, the OpenAPI/JSON-Schema identifiers, and any generated clients change.

Checklist:

- Check whether the OpenAPI/JSON-Schema (components.schemas or definitions) renames the TaskResp schema key to TaskCreateResponse; if so, it is a breaking change.
- Confirm that the actual HTTP response (JSON field names and structure) and status codes are unchanged.
- Confirm whether auto-generated SDKs/clients exist (CI/generation scripts) and whether the rename would change generated model/class names and break deployed clients.
- Compatibility suggestion: if compatibility is required, keep the old schema name as an alias or mark it deprecated, or release the change under the API versioning policy (major bump or compatibility statement) and notify users.

Verify the points above before deciding to accept the rename with a compatibility note, or revert it.
vermeer/apps/master/bl/scheduler_bl.go (2)
391-393: Variable naming error

Line 392 uses an undefined `task` variable; it should hold the return value of `s.taskManager.RemoveTask(taskInfo.ID)`.

```diff
-task := s.taskManager.RemoveTask(taskInfo.ID)
+err := s.taskManager.RemoveTask(taskInfo.ID)
 s.cronManager.DeleteTask(taskInfo.ID)
-// err := s.taskManager.CancelTask(taskInfo)
 isInQueue := false
-if task != nil {
+if err == nil {
```

Likely an incorrect or invalid review comment.
350-354: Fix the TOCTOU: implement atomic state transitions inside the TaskManager (mutex or atomic CAS)

The TODO at lines 350-354 of vermeer/apps/master/bl/scheduler_bl.go is right: there is a race (TOCTOU) between checking taskInfo.State and later setting it. Centralize atomic state transitions in the TaskManager and replace the scattered check/assign pairs. Example:

```go
// Atomic state transition inside the TaskManager
func (tm *TaskManager) TransitionState(taskInfo *TaskInfo, from, to TaskState) error {
    tm.mu.Lock()
    defer tm.mu.Unlock()
    if taskInfo.State != from {
        return fmt.Errorf("invalid state transition: %v -> %v", taskInfo.State, to)
    }
    taskInfo.State = to
    return nil
}
```

Call sites then go through TransitionState, so the check and the assignment happen together and the TOCTOU disappears.
vermeer/apps/master/schedules/scheduler_algorithm_manager.go (1)
396-401: The dependency check in DependsSchedulerAlgorithm looks wrong

In FilterNextTasks, line 397 checks that a dependency's state is not TaskStateComplete, but the logic may be incomplete: if the dependency is absent from `taskIDs` (not in the current task list), it should be assumed complete.

```diff
 for _, dep := range depends {
-    if depTask, exists := taskIDs[dep]; exists && depTask.State != structure.TaskStateComplete {
+    // If the dependency is in the current list and not complete, it is unsatisfied
+    if depTask, exists := taskIDs[dep]; exists {
+        if depTask.State != structure.TaskStateComplete {
+            allDepsSatisfied = false
+            break
+        }
+    }
+    // If the dependency is not in the list, assume it is complete
 }
```

Likely an incorrect or invalid review comment.
```go
func (s *ScheduleBl) Init() {
    logrus.Info("Initializing ScheduleBl...")
    s.LoadConfig()
    startChan := make(chan *structure.TaskInfo, s.startChanSize)
    s.startChan = startChan

    s.resourceManager = &schedules.SchedulerResourceManager{}
    s.resourceManager.Init()
    s.taskManager = &schedules.SchedulerTaskManager{}
    s.taskManager.Init()
    s.algorithmManager = &schedules.SchedulerAlgorithmManager{}
    s.algorithmManager.Init()
    s.cronManager = &schedules.SchedulerCronManager{}
    s.cronManager.Init(s.QueueTaskFromTemplate)
    go s.startTicker()
    go s.waitingStartedTask()
}
```
The Init method lacks error handling

Initialization of the individual managers can fail, but the current implementation returns no error. Add an error-handling mechanism.

```diff
-func (s *ScheduleBl) Init() {
+func (s *ScheduleBl) Init() error {
     logrus.Info("Initializing ScheduleBl...")
-    s.LoadConfig()
+    if err := s.LoadConfig(); err != nil {
+        return fmt.Errorf("failed to load config: %w", err)
+    }
     // ... rest of the method
+    return nil
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
func (s *ScheduleBl) Init() error {
    logrus.Info("Initializing ScheduleBl...")
    if err := s.LoadConfig(); err != nil {
        return fmt.Errorf("failed to load config: %w", err)
    }
    startChan := make(chan *structure.TaskInfo, s.startChanSize)
    s.startChan = startChan
    s.resourceManager = &schedules.SchedulerResourceManager{}
    s.resourceManager.Init()
    s.taskManager = &schedules.SchedulerTaskManager{}
    s.taskManager.Init()
    s.algorithmManager = &schedules.SchedulerAlgorithmManager{}
    s.algorithmManager.Init()
    s.cronManager = &schedules.SchedulerCronManager{}
    s.cronManager.Init(s.QueueTaskFromTemplate)
    go s.startTicker()
    go s.waitingStartedTask()
    return nil
}
```
🤖 Prompt for AI Agents
In vermeer/apps/master/bl/scheduler_bl.go around lines 49 to 65, the Init method
calls several manager Init methods without checking for failures; change Init to
return an error and add error handling: update ScheduleBl.Init signature to
return error, make each manager.Init return error (or adapt to their existing
error-returning variants), call each Init and if any returns non-nil capture and
return that error (and log it with context), only start
cronManager.Init(s.QueueTaskFromTemplate) and the goroutines when all inits
succeeded, and update all callers of ScheduleBl.Init to handle the returned
error accordingly.
```go
// for scheduler
taskInfo.Priority = 0
taskInfo.Preorders = make([]int32, 0)
taskInfo.Exclusive = false // default to false, can be set to true if needed
// taskInfo.Exclusive = true
if params != nil {
    if priority, ok := params["priority"]; ok {
        if p, err := strconv.Atoi(priority); err == nil {
            taskInfo.Priority = int32(p)
        } else {
            logrus.Warnf("priority convert to int32 error:%v", err)
        }
    }
    if preorders, ok := params["preorders"]; ok {
        preorderList := strings.Split(preorders, ",")
        for _, preorder := range preorderList {
            if pid, err := strconv.Atoi(preorder); err == nil {
                taskInfo.Preorders = append(taskInfo.Preorders, int32(pid))
            } else {
                logrus.Warnf("preorder convert to int32 error:%v", err)
            }
        }
    }
    if exclusive, ok := params["exclusive"]; ok {
        if ex, err := strconv.ParseBool(exclusive); err == nil {
            taskInfo.Exclusive = ex
        } else {
            logrus.Warnf("exclusive convert to bool error:%v", err)
        }
    }
    if cronExpr, ok := params["cron_expr"]; ok {
        taskInfo.CronExpr = cronExpr
    }
}
```
🛠️ Refactor suggestion

Make the scheduling-parameter parsing more robust (whitespace/ranges/duplicates)

- preorders are not trimmed; empty tokens trigger repeated warnings. TrimSpace and skip empties.
- Constrain priority to a sane range (e.g. non-negative) to keep outliers from disturbing scheduling.
- Validate exclusive/cron_expr (boolean/cron format) and fall back to defaults when invalid.

Suggested local change:

```diff
-if preorders, ok := params["preorders"]; ok {
-    preorderList := strings.Split(preorders, ",")
-    for _, preorder := range preorderList {
-        if pid, err := strconv.Atoi(preorder); err == nil {
-            taskInfo.Preorders = append(taskInfo.Preorders, int32(pid))
-        } else {
-            logrus.Warnf("preorder convert to int32 error:%v", err)
-        }
-    }
-}
+if s, ok := params["preorders"]; ok {
+    for _, tok := range strings.Split(s, ",") {
+        tok = strings.TrimSpace(tok)
+        if tok == "" {
+            continue
+        }
+        if pid, err := strconv.Atoi(tok); err == nil {
+            taskInfo.Preorders = append(taskInfo.Preorders, int32(pid))
+        } else {
+            logrus.Warnf("preorders item '%s' invalid: %v", tok, err)
+        }
+    }
+}
@@
 if priority, ok := params["priority"]; ok {
     if p, err := strconv.Atoi(priority); err == nil {
-        taskInfo.Priority = int32(p)
+        if p < 0 {
+            logrus.Warnf("priority < 0, fallback to 0")
+            p = 0
+        }
+        taskInfo.Priority = int32(p)
     } else {
         logrus.Warnf("priority convert to int32 error:%v", err)
     }
 }
```

If the cron expression needs pre-validation, integrate a parser here and log a warning on failure.
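A minimal sketch of that pre-validation, using robfig/cron/v3's standard five-field parser; the fallback behavior (silently dropping the invalid expression) is an assumption:

```go
if cronExpr, ok := params["cron_expr"]; ok {
    // ParseStandard accepts standard 5-field cron specs and returns an
    // error for malformed ones, so bad expressions never reach the manager.
    if _, err := cron.ParseStandard(cronExpr); err != nil {
        logrus.Warnf("invalid cron_expr %q ignored: %v", cronExpr, err)
    } else {
        taskInfo.CronExpr = cronExpr
    }
}
```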
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
// for scheduler
taskInfo.Priority = 0
taskInfo.Preorders = make([]int32, 0)
taskInfo.Exclusive = false // default to false, can be set to true if needed
// taskInfo.Exclusive = true
if params != nil {
    if priority, ok := params["priority"]; ok {
        if p, err := strconv.Atoi(priority); err == nil {
            if p < 0 {
                logrus.Warnf("priority < 0, fallback to 0")
                p = 0
            }
            taskInfo.Priority = int32(p)
        } else {
            logrus.Warnf("priority convert to int32 error:%v", err)
        }
    }
    if s, ok := params["preorders"]; ok {
        for _, tok := range strings.Split(s, ",") {
            tok = strings.TrimSpace(tok)
            if tok == "" {
                continue
            }
            if pid, err := strconv.Atoi(tok); err == nil {
                taskInfo.Preorders = append(taskInfo.Preorders, int32(pid))
            } else {
                logrus.Warnf("preorders item '%s' invalid: %v", tok, err)
            }
        }
    }
    if exclusive, ok := params["exclusive"]; ok {
        if ex, err := strconv.ParseBool(exclusive); err == nil {
            taskInfo.Exclusive = ex
        } else {
            logrus.Warnf("exclusive convert to bool error:%v", err)
        }
    }
    if cronExpr, ok := params["cron_expr"]; ok {
        taskInfo.CronExpr = cronExpr
    }
}
```
```go
SendComputeReqAsyncNotWait(params map[string]string) int32
SendComputeReqAsyncBatchPriority(params []map[string]string) ([]int32, []int32)
```
🛠️ Refactor suggestion

The new async methods need documentation

The interface gains two async methods, but nothing documents their behavior, the meaning of their return values, or how they differ from the existing SendComputeReqAsync.

Suggested comments above the interface methods:

```go
// SendComputeReqAsync sends an async compute request and waits for completion
SendComputeReqAsync(params map[string]string)

// SendComputeReqAsyncNotWait sends an async compute request without waiting; returns the task ID
SendComputeReqAsyncNotWait(params map[string]string) int32

// SendComputeReqAsyncBatchPriority sends a batch of async compute requests with priorities
// Returns: (task ID list, priority list)
SendComputeReqAsyncBatchPriority(params []map[string]string) ([]int32, []int32)
```
SendComputeReqAsyncBatchPriority(params []map[string]string) ([]int32, []int32)🤖 Prompt for AI Agents
In vermeer/test/functional/compute_task.go around lines 39 to 40, the newly
added asynchronous interface methods lack documentation describing their
behavior and return values; add Go-style comments above each method declaration
explaining: SendComputeReqAsync waits for completion; SendComputeReqAsyncNotWait
does not wait and returns the task ID (int32); SendComputeReqAsyncBatchPriority
sends a batch with per-request priority and returns (taskID slice, priority
slice) ([]int32, []int32). Ensure comments are concise, use proper Go comment
format beginning with the function name, and mention any differences from the
existing SendComputeReqAsync.
```go
if num <= 10 {
    num = 30
}
```
🛠️ Refactor suggestion

Consider a saner default, or return an error

When num <= 10 the function silently forces num = 30; this silent mutation of a parameter can confuse callers. Either return an error or document the behavior explicitly.

Option: return an error:

```diff
 func (lt *LoadTaskLocal) TaskLoadBodyWithNum(num int) map[string]string {
     vertexBackends := []string{"db", "mem"}
     if num <= 10 {
-        num = 30
+        logrus.Warnf("num %d is too small, minimum value is 11", num)
+        return nil
     }
```

Or add a doc comment above the function:

```go
// TaskLoadBodyWithNum creates load configuration with specified number of files.
// If num <= 10, it will be automatically adjusted to 30 to ensure minimum test coverage.
func (lt *LoadTaskLocal) TaskLoadBodyWithNum(num int) map[string]string {
```

🤖 Prompt for AI Agents
In vermeer/test/functional/load_local.go around lines 53-55, the function
silently overrides num <= 10 to 30 which is confusing; change the API to return
an error instead of mutating the input: update the function signature to return
(map[string]string, error), validate num at the start and return a descriptive
error when num <= 10, remove the silent assignment, and update all call sites to
handle the error; alternatively, if you must keep the current signature, add a
clear doc comment above the function stating that num will be auto-adjusted to
30 when <= 10 (but prefer the error-return approach for clearer behavior).
```go
fmt.Printf("Test Small start with task: %s\n", computeTask)
bTime := time.Now()
computeTest, err := functional.MakeComputeTask(computeTask)
computeTaskSmall, err := functional.MakeComputeTask(computeTask)
```
Unused error variable

Line 52 declares `err`, but the error from the first MakeComputeTask call is never checked before it is reassigned.

```diff
 computeTest, err := functional.MakeComputeTask(computeTask)
-computeTaskSmall, err := functional.MakeComputeTask(computeTask)
 require.NoError(t, err)
+computeTaskSmall, err := functional.MakeComputeTask(computeTask)
+require.NoError(t, err)
```
+ require.NoError(t, err)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
computeTest, err := functional.MakeComputeTask(computeTask)
require.NoError(t, err)
computeTaskSmall, err := functional.MakeComputeTask(computeTask)
require.NoError(t, err)
```
🤖 Prompt for AI Agents
In vermeer/test/scheduler/priority.go around line 52, the call
`computeTaskSmall, err := functional.MakeComputeTask(computeTask)` declares
`err` but the error from the previous `MakeComputeTask` call was not checked
before this reassignment; fix by checking the first call's error immediately
after it (returning or failing the test as appropriate) instead of shadowing it,
or use a new variable name for the second call (e.g., `computeTaskSmall, err2 :=
...`) and check both errors, ensuring each MakeComputeTask error is handled
before proceeding.
```go
queue := []int32{}
queue = append(queue, int32(taskid+1))
result, err := masterHttp.GetTaskStartSequence(queue)
require.NoError(t, err)
require.Equal(t, 1, len(result.Sequence))
require.Greater(t, result.Sequence[0], int32(0))
```
💡 Verification agent

🧩 Analysis chain

The verification logic is not rigorous

The code creates a task with ID `taskid` but verifies `taskid+1`. The logic assumes the cron trigger creates a new task whose ID is the original ID + 1. That assumption is fragile:

- The task-ID generation strategy may change
- Concurrent test runs can interleave the ID sequence

Query all tasks related to the original cron task through the API instead of relying on the ID-increment assumption.

🏁 Script executed:

```bash
#!/bin/bash
# Look for an API that lists the tasks related to a cron task
rg -n "cron.*task|task.*cron" --type go -A 3 -B 3
```

Length of output: 9776
Do not assume the cron-spawned copy's ID equals the original task ID + 1; locate the copy explicitly

The test in vermeer/test/scheduler/routine.go (around lines 34-39) validates the new task via taskid+1, which is unsound: the cron manager copies the original task and generates a new ID through queueTemplateHandler (vermeer/apps/master/schedules/scheduler_cron_manager.go:12,45-66), and the copy's CronExpr is cleared on creation (vermeer/apps/master/bl/task_creator.go:116), so neither auto-increment nor sequential allocation can be relied on (concurrency breaks the sequence).

- Instead, use an available task-listing API (or list and filter within the test, by original task ID / space / graph / creation time / params) to find the real ID of the copy spawned by the cron trigger, then call GetTaskStartSequence with that real ID; if no query endpoint exists, implement a list-and-filter lookup in the test rather than hardcoding +1.
🤖 Prompt for AI Agents
In vermeer/test/scheduler/routine.go around lines 34–39, the test incorrectly
assumes the cron-created replica task ID equals original taskid+1; instead,
modify the test to locate the replica explicitly by calling the tasks/listing
API (or implementing a list+filter in the test) and filter results by
identifying attributes (e.g., original task ID reference, space, graph, creation
timestamp window, CronExpr cleared, or task parameters) to find the copied
task's actual ID; then pass that real ID to masterHttp.GetTaskStartSequence and
assert the returned sequence rather than relying on a +1 id assumption.
```go
var computeTask = "pagerank"
```
🛠️ Refactor suggestion

Hardcoded values in the test function reduce flexibility

TestRoutine hardcodes computeTask = "pagerank", yet its signature has an unused factor parameter. This looks like a design leftover.

If factor was meant to select the compute-task type, fix it:

```diff
-func TestRoutine(t *testing.T, expectRes *functional.ExpectRes, healthCheck *functional.HealthCheck, masterHttp *client.VermeerClient, graphName []string, factor string, waitSecond int) {
-    var computeTask = "pagerank"
+func TestRoutine(t *testing.T, expectRes *functional.ExpectRes, healthCheck *functional.HealthCheck, masterHttp *client.VermeerClient, graphName []string, computeTask string, waitSecond int) {
+    if computeTask == "" {
+        computeTask = "pagerank" // default
+    }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
func TestRoutine(t *testing.T, expectRes *functional.ExpectRes, healthCheck *functional.HealthCheck, masterHttp *client.VermeerClient, graphName []string, computeTask string, waitSecond int) {
    if computeTask == "" {
        computeTask = "pagerank" // default
    }
```
🤖 Prompt for AI Agents
In vermeer/test/scheduler/routine.go around lines 47-48, the test hardcodes
computeTask = "pagerank" while the TestRoutine signature accepts an unused
factor parameter; either use factor to set the computeTask or remove the unused
parameter. If factor should determine task type, replace the hardcoded
assignment with logic that sets computeTask = factor (validate non-empty and
provide a sensible default if needed); otherwise remove the factor parameter
from the function signature and any callsites and update the test accordingly.
Note: Unit test generation is an Early Access feature. Expect some limitations and changes as we gather feedback and continue to improve it. Generating unit tests... This may take up to 20 minutes.
Caution: CodeRabbit failed during planning: Script execution failed: sed: -e expression #1, char 20: unknown command: `'

Caution: An unexpected error occurred while opening a pull request: Repository rule violations found. Changes must be made through a pull request.
Force-pushed 96f1aa0 to dad0b39
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
vermeer/apps/master/services/http_tasks.go (1)
27-31: 重复导入同一包(具名+点导入)会导致编译错误,统一为具名导入并显式引用同时使用
bl和点导入会编译失败。移除点导入,并在代码中使用bl.Scheduler。- "vermeer/apps/master/bl" - . "vermeer/apps/master/bl" + "vermeer/apps/master/bl"并修改调用处:
- r := Scheduler.TaskStartSequence(req.QueryTasks) + r := bl.Scheduler.TaskStartSequence(req.QueryTasks)
🧹 Nitpick comments (7)
vermeer/apps/master/schedules/broker.go (1)
101-109: The boolean variadic check is hard to read; simplify the condition

The current check is equivalent to "run this branch when no value was passed or the value is false"; a clearer form:

```diff
-if !(forceApply != nil && len(forceApply) > 0 && forceApply[0]) {
+if len(forceApply) == 0 || !forceApply[0] {
```

vermeer/apps/master/services/http_tasks.go (1)
257-263: Micro-optimization: preallocate the right capacity to avoid slice growth

```diff
-sequence := make([]int32, 0, 1)
+sequence := make([]int32, 0, len(r))
```

vermeer/apps/master/schedules/scheduler_resource_manager.go (1)
123-125: The task is assigned to the agent twice (duplicated responsibility)

Broker.ApplyAgent already calls `agent.AssignTask(taskInfo)`; calling it again here writes the same mapping twice. The result is identical, but it adds coupling and confusion. Remove the assignment here.

```diff
-// Assign the task to the agent
-agent.AssignTask(taskInfo)
```

vermeer/test/scheduler/priority.go (1)
252-263: The goroutine logs an undefined/meaningless err variable

This branch has no usable err (the call returns no error), so the log is misleading. Remove it or log a fixed message.

```diff
 go func(body map[string]string) {
     defer wg.Done()
     taskID := computeTest.SendComputeReqAsyncNotWait(body)
     if taskID != 0 {
         taskIDs <- taskID
     } else {
-        logrus.Errorf("Failed to submit task: %v", err)
+        logrus.Error("Failed to submit task")
     }
 }(taskBodies[i])
```

vermeer/apps/master/schedules/scheduler_task_manager.go (2)
72-79: taskToworkerGroupMap is assigned twice; keep a single responsibility

After `AssignGroup(taskInfo)` is called, the same key is written again manually, which is redundant and can drift out of sync. Remove the manual assignment.

```diff
 for _, taskInfo := range t.GetAllTasksNotComplete() {
     if taskInfo == nil {
         continue
     }
     t.AssignGroup(taskInfo)
-    t.taskToworkerGroupMap[taskInfo.ID] = workerMgr.ApplyGroup(taskInfo.SpaceName, taskInfo.GraphName)
 }
```
195-201: The capacity estimate is off; use the length of notCompleteTasks

The result slice is preallocated using the length of allTaskMap, which may over-allocate.

```diff
 func (t *SchedulerTaskManager) GetAllTasksNotComplete() []*structure.TaskInfo {
-    tasks := make([]*structure.TaskInfo, 0, len(t.allTaskMap))
+    tasks := make([]*structure.TaskInfo, 0, len(t.notCompleteTasks))
```

vermeer/apps/master/bl/scheduler_bl.go (1)
121-129: A stoppable ticker is safer. time.Tick cannot be stopped and can leak in a long-running process. Use time.NewTicker and stop it on shutdown.
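A minimal sketch of a stoppable variant; the stop channel and the tickerInterval field are assumptions about ScheduleBl's internals:

```go
func (s *ScheduleBl) startTicker(stop <-chan struct{}) {
    t := time.NewTicker(s.tickerInterval)
    defer t.Stop() // unlike time.Tick, NewTicker can be released
    for {
        select {
        case <-t.C:
            s.TryScheduleNextTasks()
        case <-stop:
            return
        }
    }
}
```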
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
vermeer/go.sum is excluded by !**/*.sum
📒 Files selected for processing (32)
- vermeer/apps/graphio/local_file.go (1 hunks)
- vermeer/apps/master/bl/compute_task.go (1 hunks)
- vermeer/apps/master/bl/grpc_handlers.go (2 hunks)
- vermeer/apps/master/bl/load_task.go (1 hunks)
- vermeer/apps/master/bl/scheduler_bl.go (5 hunks)
- vermeer/apps/master/bl/task_bl.go (4 hunks)
- vermeer/apps/master/bl/task_creator.go (1 hunks)
- vermeer/apps/master/bl/worker_bl.go (1 hunks)
- vermeer/apps/master/master_main.go (1 hunks)
- vermeer/apps/master/schedules/broker.go (2 hunks)
- vermeer/apps/master/schedules/scheduler_algorithm_manager.go (1 hunks)
- vermeer/apps/master/schedules/scheduler_cron_manager.go (1 hunks)
- vermeer/apps/master/schedules/scheduler_resource_manager.go (1 hunks)
- vermeer/apps/master/schedules/scheduler_task_manager.go (1 hunks)
- vermeer/apps/master/services/http_tasks.go (2 hunks)
- vermeer/apps/master/services/router.go (1 hunks)
- vermeer/apps/master/workers/worker_manager.go (1 hunks)
- vermeer/apps/structure/task.go (1 hunks)
- vermeer/client/client.go (3 hunks)
- vermeer/client/structure.go (2 hunks)
- vermeer/config/master.ini (1 hunks)
- vermeer/config/worker04.ini (1 hunks)
- vermeer/go.mod (1 hunks)
- vermeer/test/functional/compute_base.go (1 hunks)
- vermeer/test/functional/compute_task.go (1 hunks)
- vermeer/test/functional/http_interface.go (1 hunks)
- vermeer/test/functional/load_local.go (2 hunks)
- vermeer/test/scheduler/batch.go (1 hunks)
- vermeer/test/scheduler/priority.go (1 hunks)
- vermeer/test/scheduler/routine.go (1 hunks)
- vermeer/test/scheduler/test_scheduler.go (1 hunks)
- vermeer/vermeer_test.go (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (15)
- vermeer/client/structure.go
- vermeer/apps/master/bl/grpc_handlers.go
- vermeer/test/scheduler/batch.go
- vermeer/test/scheduler/routine.go
- vermeer/apps/graphio/local_file.go
- vermeer/config/master.ini
- vermeer/test/functional/load_local.go
- vermeer/config/worker04.ini
- vermeer/vermeer_test.go
- vermeer/apps/master/bl/task_creator.go
- vermeer/apps/master/master_main.go
- vermeer/client/client.go
- vermeer/apps/master/services/router.go
- vermeer/apps/structure/task.go
- vermeer/apps/master/schedules/scheduler_cron_manager.go
🧰 Additional context used
🧬 Code graph analysis (10)
vermeer/test/scheduler/test_scheduler.go (5)
vermeer/client/client.go (1)
VermeerClient(31-36)
vermeer/test/scheduler/priority.go (1)
TestPriority(308-349)
vermeer/test/scheduler/batch.go (1)
TestBatch(19-24)
vermeer/test/scheduler/routine.go (1)
TestRoutine(56-64)
vermeer/test/functional/http_interface.go (2)
CancelTask(56-57)
DeleteGraph(32-33)
vermeer/apps/master/schedules/broker.go (2)
vermeer/apps/structure/task.go (1)
TaskInfo(41-64)
vermeer/apps/master/workers/worker_manager.go (1)
WorkerClient(44-59)
vermeer/test/functional/compute_base.go (2)
vermeer/apps/master/services/http_tasks.go (2)
TaskCreateRequest(68-72)
TaskCreateBatchRequest(122-122)
vermeer/client/structure.go (5)
TaskCreateRequest(169-173)
TaskInfo(52-62)
TaskCreateBatchRequest(175-175)
BaseResponse(27-30)
TaskResponse(36-39)
vermeer/test/scheduler/priority.go (3)
vermeer/test/functional/compute_task.go (1)
MakeComputeTask(48-129)
vermeer/test/functional/load_local.go (1)
LoadTaskLocal(29-31)
vermeer/test/functional/http_interface.go (1)
CancelTask(56-57)
vermeer/apps/master/services/http_tasks.go (1)
vermeer/client/structure.go (2)
TaskStartSequenceRequest(41-43)
TaskStartSequenceResp(45-48)
vermeer/apps/master/bl/scheduler_bl.go (6)
vermeer/apps/master/schedules/scheduler_resource_manager.go (3)
SchedulerResourceManager(28-36)
WorkerOngoingStatusDeleted(21-21)
WorkerOngoingStatus(14-14)
vermeer/apps/master/schedules/scheduler_algorithm_manager.go (1)
SchedulerAlgorithmManager(33-37)
vermeer/apps/master/schedules/scheduler_task_manager.go (1)
SchedulerTaskManager(14-26)
vermeer/apps/master/schedules/scheduler_cron_manager.go (1)
SchedulerCronManager(15-20)
vermeer/apps/structure/task.go (1)
TaskInfo(41-64)
vermeer/apps/master/bl/task_canceler.go (1)
NewTaskCanceler(35-51)
vermeer/apps/master/schedules/scheduler_resource_manager.go (2)
vermeer/apps/master/schedules/broker.go (4)
Broker(56-60)
Agent(42-45)
AgentStatus(30-30)
AgentStatusError(34-34)
vermeer/apps/structure/task.go (1)
TaskInfo(41-64)
vermeer/apps/master/schedules/scheduler_algorithm_manager.go (1)
vermeer/apps/structure/task.go (1)
TaskInfo(41-64)
vermeer/apps/master/schedules/scheduler_task_manager.go (1)
vermeer/client/structure.go (1)
TaskInfo(52-62)
vermeer/test/functional/http_interface.go (1)
vermeer/client/client.go (1)
VermeerClient(31-36)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Analyze (go)
- GitHub Check: check-dependency-license
- GitHub Check: Analyze (java)
🔇 Additional comments (15)
vermeer/apps/master/workers/worker_manager.go (1)
580-582: A missing read lock creates a race risk. This newly added public method iterates wm.workersByName directly, but unlike the other getters it does not hold the manager's mutex. Concurrent AddWorker/RemoveWorker calls race with this access and can, in the worst case, panic during map iteration. Follow the pattern of the other public read methods in this file: take the lock first, then return the result.

func (wm *workerManager) GetGroupWorkers(workerGroup string) []*WorkerClient {
-	return wm.getGroupWorkers(workerGroup)
+	defer wm.Unlock(wm.Lock())
+	return wm.getGroupWorkers(workerGroup)
}

vermeer/apps/master/schedules/broker.go (1)
76-114: Do not express an optional parameter as a variadic bool; it invites confusion. A variadic bool used as an "optional switch" is unintuitive. Prefer two public methods or an options struct for API clarity. This was raised in a previous review; please adopt one of those approaches (a sketch of the options variant follows).
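A minimal sketch of the options-struct variant, reusing the Agent/AgentStatus/WorkerClient types from this package; ApplyOptions and ApplyAgentOpts are hypothetical names introduced here for illustration:

// ApplyOptions makes the optional behavior explicit at the call site.
type ApplyOptions struct {
	Force bool // when true, skip the busy checks before assigning an agent
}

// ApplyAgentOpts is a hypothetical wrapper: call sites state their intent
// explicitly instead of passing a bare variadic bool.
func (b *Broker) ApplyAgentOpts(taskInfo *structure.TaskInfo, opts ApplyOptions) (*Agent, AgentStatus, map[string]*workers.WorkerClient, error) {
	return b.ApplyAgent(taskInfo, opts.Force)
}

// Call site reads unambiguously:
//   agent, status, ws, err := b.ApplyAgentOpts(task, ApplyOptions{Force: !task.Exclusive})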
vermeer/test/functional/compute_base.go (1)
186-195: Add validation: confirm the sequence contains exactly the created task IDs. Only the length is checked today; membership and duplicates are not. Add a set-based check to ensure nothing is missing, extra, or duplicated.

require.NoError(ctb.t, err)
sequence := response.Sequence
for i, id := range sequence {
	fmt.Printf("Task %d started at position %d in the sequence\n", id, i+1)
}
require.Equal(ctb.t, len(taskIds), len(sequence))
+
+ // verify sequence membership is consistent and duplicate-free
+ seen := make(map[int32]bool, len(taskIds))
+ want := make(map[int32]bool, len(taskIds))
+ for _, id := range taskIds {
+     want[id] = true
+ }
+ for _, id := range sequence {
+     require.True(ctb.t, want[id], "unexpected task ID in sequence: %d", id)
+     require.False(ctb.t, seen[id], "duplicate task ID in sequence: %d", id)
+     seen[id] = true
+ }
+ require.Equal(ctb.t, len(taskIds), len(seen), "missing IDs in sequence")

vermeer/apps/master/schedules/scheduler_resource_manager.go (3)
115-121: Skipping the busy check entirely for non-exclusive tasks may be too permissive; confirm the intent.
ApplyAgent(taskInfo, !taskInfo.Exclusive) passes true, which skips both the agent-level and worker-level busy checks (see the logic inside broker.ApplyAgent) and may over-subscribe resources within the same group. Confirm whether non-exclusive semantics should skip all busy checks, or only the agent-level check while keeping the worker-conflict check.

128-136: The status-setting branches are redundant; decide based on exclusivity and the current state. The code checks map existence first and then sets the status in each branch, duplicating logic. It can be simplified based on whether the task is exclusive and how many tasks the group is currently running.

-	exclusive := taskInfo.Exclusive
-	runningStatus := WorkerOngoingStatusRunning
-	if _, exists := rm.runningWorkerGroupTasks[agent.GroupName()]; !exists && exclusive {
-		rm.runningWorkerGroupTasks[agent.GroupName()] = []int32{}
-		runningStatus = WorkerOngoingStatusRunning
-		rm.workerGroupStatus[agent.GroupName()] = runningStatus
-	} else {
-		runningStatus = WorkerOngoingStatusConcurrentRunning
-		rm.workerGroupStatus[agent.GroupName()] = runningStatus
-	}
+	exclusive := taskInfo.Exclusive
+	if _, exists := rm.runningWorkerGroupTasks[agent.GroupName()]; !exists {
+		rm.runningWorkerGroupTasks[agent.GroupName()] = []int32{}
+	}
+	runningStatus := WorkerOngoingStatusConcurrentRunning
+	if exclusive && len(rm.runningWorkerGroupTasks[agent.GroupName()]) == 0 {
+		runningStatus = WorkerOngoingStatusRunning
+	}
+	rm.workerGroupStatus[agent.GroupName()] = runningStatus

57-82: ReleaseByTaskID deletes a group's status entry and then keeps reasoning about that status, which is both a logic and a concurrency-consistency problem; it also fetches the group's workers twice, hurting performance.

- Calling delete(rm.workerGroupStatus, workerGroup) first and only then updating worker statuses based on whether tasks remain leaves the group status inconsistent and makes the follow-up checks semantically muddled.
- workerMgr.GetGroupWorkers(workerGroup) is called separately in both branches; cache the result.

Suggested change: adjust the status instead of deleting the entry, and cache the group's workers:

for workerGroup, status := range rm.workerGroupStatus {
	if (status == WorkerOngoingStatusRunning || status == WorkerOngoingStatusConcurrentRunning) && rm.isTaskRunningOnWorkerGroup(workerGroup, taskID) {
-		delete(rm.workerGroupStatus, workerGroup)
+		// remove the task from the group's running-task list
		if tasks, exists := rm.runningWorkerGroupTasks[workerGroup]; exists {
			for i, id := range tasks {
				if id == taskID {
					rm.runningWorkerGroupTasks[workerGroup] = append(tasks[:i], tasks[i+1:]...)
					if len(rm.runningWorkerGroupTasks[workerGroup]) == 0 {
						delete(rm.runningWorkerGroupTasks, workerGroup)
					}
					break
				}
			}
		}
-		if tasks, exists := rm.runningWorkerGroupTasks[workerGroup]; !exists || len(tasks) == 0 {
-			for _, worker := range workerMgr.GetGroupWorkers(workerGroup) {
-				rm.changeWorkerStatus(worker.Name, WorkerOngoingStatusIdle)
-			}
-		} else {
-			for _, worker := range workerMgr.GetGroupWorkers(workerGroup) {
-				rm.changeWorkerStatus(worker.Name, WorkerOngoingStatusConcurrentRunning)
-			}
-		}
+		groupWorkers := workerMgr.GetGroupWorkers(workerGroup)
+		if tasks, exists := rm.runningWorkerGroupTasks[workerGroup]; !exists || len(tasks) == 0 {
+			// no tasks left in the group: mark idle
+			for _, worker := range groupWorkers {
+				rm.changeWorkerStatus(worker.Name, WorkerOngoingStatusIdle)
+			}
+			rm.workerGroupStatus[workerGroup] = WorkerOngoingStatusIdle
+		} else {
+			// the group still has tasks: mark concurrently running
+			for _, worker := range groupWorkers {
+				rm.changeWorkerStatus(worker.Name, WorkerOngoingStatusConcurrentRunning)
+			}
+			rm.workerGroupStatus[workerGroup] = WorkerOngoingStatusConcurrentRunning
+		}
	}
}

vermeer/apps/master/bl/scheduler_bl.go (2)

371-373: Typo: "received". The log message should read "received a nil task...", and the loop should not be interrupted (the current implementation already chooses continue, which is reasonable).

- logrus.Warnf("recieved a nil task from startChan")
+ logrus.Warnf("received a nil task from startChan")

57-73: Init should return an error and let callers handle it, to avoid silent initialization failures. The sub-manager initializations and the config load can each fail; have Init return an error and leave the handling strategy (e.g. abort startup or degrade) to the caller.
vermeer/test/scheduler/priority.go (1)
73-76: The error from the first MakeComputeTask call is never checked; err is overwritten by the next assignment. Assert on the first returned err immediately, otherwise the first error is silently swallowed.

 computeTest, err := functional.MakeComputeTask(computeTask)
-require.NoError(t, err)
-computeTaskSmall, err := functional.MakeComputeTask(computeTask)
-require.NoError(t, err)
+require.NoError(t, err)
+computeTaskSmall, err := functional.MakeComputeTask(computeTask)
+require.NoError(t, err)

vermeer/apps/master/schedules/scheduler_algorithm_manager.go (1)

3-12: Compatible with Go 1.21+; keep the slices import. go.mod already pins Go 1.23.0 (>= 1.21), so the standard-library slices package is safe to use; no need to replace it with a hand-written contains.

vermeer/test/functional/http_interface.go (1)

85-93: The direct-cancel test hook is implemented sensibly. The new DirectCancelTask at lines 86-92 dispatches the cancellation directly and verifies the state, giving later by-task-ID test scenarios a clear entry point; no issues found.

vermeer/apps/master/bl/compute_task.go (1)

145-147: Marking completion in the scheduler keeps states consistent. Calling Scheduler.taskManager.MarkTaskComplete at lines 145-146 ensures the business layer's completion state is propagated to the scheduler, consistent with the subsequent FinishTask operation. Nice.

vermeer/apps/master/bl/load_task.go (1)

207-209: Syncing scheduler state after a load task finishes is necessary. The addition at lines 207-209 lets the scheduler promptly learn that the load task has completed, aligned with the compute branch and closing the state loop.

vermeer/apps/master/bl/worker_bl.go (1)

73-74: Passing workerName matches the new CloseCurrent API. Line 73 now passes workerName, matching the new CloseCurrent signature and helping the scheduler release the corresponding worker precisely.

vermeer/test/functional/compute_task.go (1)

39-41: The newly added interface methods lack GoDoc comments. The new async interface at lines 39-41 still has no Go-style comments, as noted previously; please add them so future maintainers understand the differences.
func (s *ScheduleBl) BatchQueueTask(taskInfos []*structure.TaskInfo) ([]bool, []error) {
	if len(taskInfos) == 0 {
		return []bool{}, []error{}
	}
	s.PauseDispatch()
	defer s.ResumeDispatch()
	// defer s.Unlock(s.Lock())
	errors := make([]error, len(taskInfos))
	oks := make([]bool, len(taskInfos))
	for _, taskInfo := range taskInfos {
		ok, err := s.QueueTask(taskInfo)
		if err != nil {
			logrus.Errorf("failed to queue task '%d': %v", taskInfo.ID, err)
			errors = append(errors, err)
		}
		oks = append(oks, ok)
	}
	return oks, errors
}
BatchQueueTask fills its result slices incorrectly; half of each returned slice is zero values
The slices are initialized with a fixed length and then appended to, doubling their length and leaving zero values in the first half. Assign by index instead.
- errors := make([]error, len(taskInfos))
- oks := make([]bool, len(taskInfos))
-
- for _, taskInfo := range taskInfos {
- ok, err := s.QueueTask(taskInfo)
- if err != nil {
- logrus.Errorf("failed to queue task '%d': %v", taskInfo.ID, err)
- errors = append(errors, err)
- }
- oks = append(oks, ok)
- }
+ errors := make([]error, len(taskInfos))
+ oks := make([]bool, len(taskInfos))
+ for i, taskInfo := range taskInfos {
+ ok, err := s.QueueTask(taskInfo)
+ if err != nil {
+ logrus.Errorf("failed to queue task '%d': %v", taskInfo.ID, err)
+ }
+ oks[i] = ok
+ errors[i] = err
+ }

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
func (s *ScheduleBl) BatchQueueTask(taskInfos []*structure.TaskInfo) ([]bool, []error) {
	if len(taskInfos) == 0 {
		return []bool{}, []error{}
	}
	s.PauseDispatch()
	defer s.ResumeDispatch()
	// defer s.Unlock(s.Lock())
	errors := make([]error, len(taskInfos))
	oks := make([]bool, len(taskInfos))
	for i, taskInfo := range taskInfos {
		ok, err := s.QueueTask(taskInfo)
		if err != nil {
			logrus.Errorf("failed to queue task '%d': %v", taskInfo.ID, err)
		}
		oks[i] = ok
		errors[i] = err
	}
	return oks, errors
}
🤖 Prompt for AI Agents
In vermeer/apps/master/bl/scheduler_bl.go around lines 303 to 326, the
BatchQueueTask function initializes oks and errors with fixed lengths then uses
append, which doubles slice length and leaves leading zero values; change the
loop to use index-based assignment (e.g., iterate with for i, taskInfo := range
taskInfos) and assign oks[i] = ok and errors[i] = err (only set the error when
non-nil), remove all append calls for those slices so the returned slices have
correct lengths and contents.
func (s *ScheduleBl) CancelTask(taskInfo *structure.TaskInfo) error {
	if taskInfo == nil {
		return errors.New("the argument `taskInfo` is nil")
	}
	defer s.Unlock(s.Lock())
	isHeadTask := s.taskManager.IsTaskOngoing(taskInfo.ID)
	task := s.taskManager.RemoveTask(taskInfo.ID)
	s.cronManager.DeleteTask(taskInfo.ID)
	// err := s.taskManager.CancelTask(taskInfo)
	isInQueue := false
	if task != nil {
		logrus.Infof("removed task '%d' from space queue", taskInfo.ID)
		isInQueue = true
	}
	if isInQueue && !isHeadTask {
		if err := taskMgr.SetState(taskInfo, structure.TaskStateCanceled); err != nil {
			return err
		}
		logrus.Infof("set task '%d' to TaskStateCanceled", taskInfo.ID)
	} else {
		logrus.Infof("sending task '%d' to task canceler", taskInfo.ID)
		return s.handleCancelTask(taskInfo)
	}
	return nil
}
Align CancelTask with RemoveTask's new signature and complete the error handling
Together with the fix to SchedulerTaskManager.RemoveTask, this code must receive the returned task pointer and error, and use them to decide whether the task was in the queue.
- isHeadTask := s.taskManager.IsTaskOngoing(taskInfo.ID)
- task := s.taskManager.RemoveTask(taskInfo.ID)
+ isHeadTask := s.taskManager.IsTaskOngoing(taskInfo.ID)
+ removedTask, rmErr := s.taskManager.RemoveTask(taskInfo.ID)
s.cronManager.DeleteTask(taskInfo.ID)
@@
- isInQueue := false
- if task != nil {
+ isInQueue := false
+ if rmErr == nil && removedTask != nil {
logrus.Infof("removed task '%d' from space queue", taskInfo.ID)
isInQueue = true
}

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
func (s *ScheduleBl) CancelTask(taskInfo *structure.TaskInfo) error {
	if taskInfo == nil {
		return errors.New("the argument `taskInfo` is nil")
	}
	defer s.Unlock(s.Lock())
	isHeadTask := s.taskManager.IsTaskOngoing(taskInfo.ID)
	removedTask, rmErr := s.taskManager.RemoveTask(taskInfo.ID)
	s.cronManager.DeleteTask(taskInfo.ID)
	// err := s.taskManager.CancelTask(taskInfo)
	isInQueue := false
	if rmErr == nil && removedTask != nil {
		logrus.Infof("removed task '%d' from space queue", taskInfo.ID)
		isInQueue = true
	}
	if isInQueue && !isHeadTask {
		if err := taskMgr.SetState(taskInfo, structure.TaskStateCanceled); err != nil {
			return err
		}
		logrus.Infof("set task '%d' to TaskStateCanceled", taskInfo.ID)
	} else {
		logrus.Infof("sending task '%d' to task canceler", taskInfo.ID)
		return s.handleCancelTask(taskInfo)
	}
	return nil
}
🤖 Prompt for AI Agents
In vermeer/apps/master/bl/scheduler_bl.go around lines 456-485, CancelTask must
be updated to match the new RemoveTask signature that returns (task
*structure.Task, err error); call task, err :=
s.taskManager.RemoveTask(taskInfo.ID), check and return the error immediately if
non-nil, determine isInQueue by task != nil only after a successful removal, and
replace the stray taskMgr reference with s.taskManager when calling SetState;
keep existing logging and the head-task logic otherwise.
if priority, ok := params["priority"]; ok {
	if p, err := strconv.ParseInt(priority, 10, 32); err == nil {
		if p < 0 {
			return nil, fmt.Errorf("priority should be non-negative")
		}
		taskInfo.Priority = int32(p)
	} else {
		logrus.Warnf("priority convert to int32 error:%v", err)
		return nil, err
	}
}
if preorders, ok := params["preorders"]; ok {
	preorderList := strings.Split(preorders, ",")
	for _, preorder := range preorderList {
		if pid, err := strconv.ParseInt(preorder, 10, 32); err == nil {
			if taskMgr.GetTaskByID(int32(pid)) == nil {
				return nil, fmt.Errorf("preorder task id %d not exists", pid)
			}
			taskInfo.Preorders = append(taskInfo.Preorders, int32(pid))
		} else {
			logrus.Warnf("preorder convert to int32 error:%v", err)
			return nil, err
		}
	}
}
if exclusive, ok := params["exclusive"]; ok {
	if ex, err := strconv.ParseBool(exclusive); err == nil {
		taskInfo.Exclusive = ex
	} else {
		logrus.Warnf("exclusive convert to bool error:%v", err)
		return nil, err
	}
}
if cronExpr, ok := params["cron_expr"]; ok {
	if err := Scheduler.cronManager.CheckCronExpression(cronExpr); err != nil {
		logrus.Warnf("cron_expr parse error:%v", err)
		return nil, err
	}
	taskInfo.CronExpr = cronExpr
}
}
Not trimming whitespace makes the optional scheduling parameters fail to parse
priority, preorders, exclusive, and cron_expr are all optional, but the current implementation returns a parse error for inputs containing spaces or empty strings; for preorders in particular, the common "1, 2" now triggers invalid syntax and fails the whole task creation. TrimSpace each value before parsing and skip empty strings; this tolerates common formatting while still rejecting genuinely invalid values.
- if priority, ok := params["priority"]; ok {
- if p, err := strconv.ParseInt(priority, 10, 32); err == nil {
+ if priority, ok := params["priority"]; ok {
+ priority = strings.TrimSpace(priority)
+ if priority != "" {
+ if p, err := strconv.ParseInt(priority, 10, 32); err == nil {
if p < 0 {
return nil, fmt.Errorf("priority should be non-negative")
}
taskInfo.Priority = int32(p)
- } else {
- logrus.Warnf("priority convert to int32 error:%v", err)
- return nil, err
+ } else {
+ logrus.Warnf("priority convert to int32 error:%v", err)
+ return nil, err
+ }
}
}
if preorders, ok := params["preorders"]; ok {
preorderList := strings.Split(preorders, ",")
for _, preorder := range preorderList {
+ preorder = strings.TrimSpace(preorder)
+ if preorder == "" {
+ continue
+ }
if pid, err := strconv.ParseInt(preorder, 10, 32); err == nil {
if taskMgr.GetTaskByID(int32(pid)) == nil {
return nil, fmt.Errorf("preorder task id %d not exists", pid)
}
taskInfo.Preorders = append(taskInfo.Preorders, int32(pid))
@@
- if exclusive, ok := params["exclusive"]; ok {
- if ex, err := strconv.ParseBool(exclusive); err == nil {
+ if exclusive, ok := params["exclusive"]; ok {
+ exclusive = strings.TrimSpace(exclusive)
+ if exclusive != "" {
+ if ex, err := strconv.ParseBool(exclusive); err == nil {
taskInfo.Exclusive = ex
- } else {
- logrus.Warnf("exclusive convert to bool error:%v", err)
- return nil, err
+ } else {
+ logrus.Warnf("exclusive convert to bool error:%v", err)
+ return nil, err
+ }
}
}
if cronExpr, ok := params["cron_expr"]; ok {
- if err := Scheduler.cronManager.CheckCronExpression(cronExpr); err != nil {
- logrus.Warnf("cron_expr parse error:%v", err)
- return nil, err
- }
- taskInfo.CronExpr = cronExpr
+ cronExpr = strings.TrimSpace(cronExpr)
+ if cronExpr != "" {
+ if err := Scheduler.cronManager.CheckCronExpression(cronExpr); err != nil {
+ logrus.Warnf("cron_expr parse error:%v", err)
+ return nil, err
+ }
+ taskInfo.CronExpr = cronExpr
+ }
}

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if priority, ok := params["priority"]; ok {
	priority = strings.TrimSpace(priority)
	if priority != "" {
		if p, err := strconv.ParseInt(priority, 10, 32); err == nil {
			if p < 0 {
				return nil, fmt.Errorf("priority should be non-negative")
			}
			taskInfo.Priority = int32(p)
		} else {
			logrus.Warnf("priority convert to int32 error:%v", err)
			return nil, err
		}
	}
}
if preorders, ok := params["preorders"]; ok {
	preorderList := strings.Split(preorders, ",")
	for _, preorder := range preorderList {
		preorder = strings.TrimSpace(preorder)
		if preorder == "" {
			continue
		}
		if pid, err := strconv.ParseInt(preorder, 10, 32); err == nil {
			if taskMgr.GetTaskByID(int32(pid)) == nil {
				return nil, fmt.Errorf("preorder task id %d not exists", pid)
			}
			taskInfo.Preorders = append(taskInfo.Preorders, int32(pid))
		} else {
			logrus.Warnf("preorder convert to int32 error:%v", err)
			return nil, err
		}
	}
}
if exclusive, ok := params["exclusive"]; ok {
	exclusive = strings.TrimSpace(exclusive)
	if exclusive != "" {
		if ex, err := strconv.ParseBool(exclusive); err == nil {
			taskInfo.Exclusive = ex
		} else {
			logrus.Warnf("exclusive convert to bool error:%v", err)
			return nil, err
		}
	}
}
if cronExpr, ok := params["cron_expr"]; ok {
	cronExpr = strings.TrimSpace(cronExpr)
	if cronExpr != "" {
		if err := Scheduler.cronManager.CheckCronExpression(cronExpr); err != nil {
			logrus.Warnf("cron_expr parse error:%v", err)
			return nil, err
		}
		taskInfo.CronExpr = cronExpr
	}
}
🤖 Prompt for AI Agents
In vermeer/apps/master/bl/task_bl.go around lines 72 to 112, optional params
(priority, preorders, exclusive, cron_expr) are parsed without trimming
whitespace which causes valid inputs like "1, 2" or values with surrounding
spaces to fail; before parsing call strings.TrimSpace on each param value, for
preorders split by "," then TrimSpace each entry and skip empty strings before
strconv.ParseInt, and for priority/exclusive/cron_expr use the trimmed value
when parsing and checking so whitespace-only values are ignored or handled
appropriately while preserving existing error returns for truly invalid formats.
graph := structure.GraphManager.GetGraphByName(task.SpaceName, task.GraphName)
resourceCost := int64(0)
if graph == nil {
	resourceCost = resourceParam // if graph not found, use max resource cost
} else {
	resourceCost = resourceParam / max(1, graph.VertexCount+graph.EdgeCount) // Avoid division by zero, ensure at least 1
}
// step 4: some random value
Using the built-in max requires Go 1.21; prefer an explicit function for compatibility
If the repository has not yet moved to Go 1.21, max is undefined. For compatibility, switch to a local implementation.
- resourceCost = resourceParam / max(1, graph.VertexCount+graph.EdgeCount) // Avoid division by zero, ensure at least 1
+ denom := graph.VertexCount + graph.EdgeCount
+ if denom < 1 {
+ denom = 1
+ }
+ resourceCost = resourceParam / denom

Or add a general-purpose helper:
func maxI64(a, b int64) int64 {
if a > b { return a }
return b
}

and replace the call site with:

resourceCost = resourceParam / maxI64(1, graph.VertexCount+graph.EdgeCount)

🤖 Prompt for AI Agents
In vermeer/apps/master/schedules/scheduler_algorithm_manager.go around lines 319
to 326, the code uses the built-in max which requires Go 1.21 and may be
undefined for older toolchains; replace it with a local integer max
implementation (e.g. add a small helper like maxI64 or inline conditional) and
use that helper when computing resourceCost to avoid build errors while
preserving the existing logic and division-by-zero protection.
func (t *SchedulerTaskManager) MarkTaskComplete(taskID int32) error {
	if _, exists := t.allTaskMap[taskID]; !exists {
		return errors.New("task not found")
	}
	delete(t.notCompleteTasks, taskID)
	return nil
}
MarkTaskComplete lacks concurrency protection
In a concurrent environment, writes to notCompleteTasks need a lock.
func (t *SchedulerTaskManager) MarkTaskComplete(taskID int32) error {
+ defer t.Unlock(t.Lock())
if _, exists := t.allTaskMap[taskID]; !exists {
return errors.New("task not found")
}
delete(t.notCompleteTasks, taskID)
return nil
}

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In vermeer/apps/master/schedules/scheduler_task_manager.go around lines 125-131,
the MarkTaskComplete method accesses and mutates t.notCompleteTasks without
synchronization; wrap the check and delete in the scheduler's mutex to prevent
race conditions: acquire the manager lock at the start of the method (e.g.,
t.mu.Lock()), perform the existence check against t.allTaskMap and the delete
from t.notCompleteTasks while holding the lock, and then release the lock (use
defer to unlock) so both the read and write are protected.
// Only for debug or test, get task start sequence
func (t *SchedulerTaskManager) GetTaskStartSequence(queryTasks []int32) []*structure.TaskInfo {
	if len(t.startTaskQueue) == 0 {
		return nil
	}
	if len(queryTasks) == 0 {
		return t.startTaskQueue
	}
	tasks := make([]*structure.TaskInfo, 0, len(queryTasks))
	taskSet := make(map[int32]struct{})
	for _, id := range queryTasks {
		taskSet[id] = struct{}{}
	}
	for _, task := range t.startTaskQueue {
		if _, exists := taskSet[task.ID]; exists {
			tasks = append(tasks, task)
		}
	}
	logrus.Infof("GetTaskStartSequence: return %d tasks", len(tasks))
	for _, task := range tasks {
		logrus.Debugf("TaskID: %d", task.ID)
	}
	return tasks
}
GetTaskStartSequence reads and writes a shared slice without concurrency protection
Take a lock, and consider returning a copy so callers cannot mutate the internal slice.
func (t *SchedulerTaskManager) GetTaskStartSequence(queryTasks []int32) []*structure.TaskInfo {
+ defer t.Unlock(t.Lock())
if len(t.startTaskQueue) == 0 {
return nil
}
if len(queryTasks) == 0 {
-		return t.startTaskQueue
+		// return a copy
+		out := make([]*structure.TaskInfo, len(t.startTaskQueue))
+		copy(out, t.startTaskQueue)
+		return out
	}

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In vermeer/apps/master/schedules/scheduler_task_manager.go around lines 223-246,
GetTaskStartSequence reads and returns t.startTaskQueue without any concurrency
protection and returns the internal slice directly; wrap accesses with the
SchedulerTaskManager's mutex (use RLock/RUnlock for read-only access), build and
return a new slice that copies the selected *structure.TaskInfo pointers (never
return t.startTaskQueue directly), and perform logging after releasing the lock
so you don't hold the mutex while logging.
func (t *SchedulerTaskManager) IsTaskOngoing(taskID int32) bool {
	// Check if the task is currently ongoing
	task, exists := t.allTaskMap[taskID]
	if !exists {
		return false
	}
	return task.State == structure.TaskStateCreated
}
IsTaskOngoing lacks concurrency protection, and the state check may be too narrow

- Take a lock for the read.
- Judging "ongoing" by Created alone is likely incomplete (e.g. Loading/Running); refine per business needs later. A sketch of a state-set check follows the diff below.
func (t *SchedulerTaskManager) IsTaskOngoing(taskID int32) bool {
- // Check if the task is currently ongoing
- task, exists := t.allTaskMap[taskID]
+ defer t.Unlock(t.Lock())
+ task, exists := t.allTaskMap[taskID]
if !exists {
return false
}
return task.State == structure.TaskStateCreated
}

Committable suggestion skipped: line range outside the PR's diff.
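A minimal sketch of the set-membership idea; the element type name structure.TaskState and any states beyond TaskStateCreated are assumptions to be confirmed against the real task lifecycle:

// ongoingStates is an assumed list; confirm it against the actual lifecycle.
var ongoingStates = []structure.TaskState{
	structure.TaskStateCreated,
	// add loading/running/etc. constants here once the lifecycle is settled
}

func (t *SchedulerTaskManager) IsTaskOngoing(taskID int32) bool {
	defer t.Unlock(t.Lock())
	task, exists := t.allTaskMap[taskID]
	if !exists {
		return false
	}
	for _, st := range ongoingStates {
		if task.State == st {
			return true
		}
	}
	return false
}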
🤖 Prompt for AI Agents
In vermeer/apps/master/schedules/scheduler_task_manager.go around lines 260 to
267, IsTaskOngoing reads t.allTaskMap without any concurrency protection and
only treats TaskStateCreated as "ongoing"; add the appropriate mutex/RWMutex
lock/unlock around the map access (use RLock/RUnlock for a read) to prevent race
conditions, and extend the state check to consider other in-progress states
(e.g., Loading, Running, any business-specific intermediate states) by testing
membership in a set of ongoing states rather than equating to a single state.
github.com/prometheus/client_model v0.2.1-0.20210607210712-147c58e9608a // indirect
github.com/prometheus/common v0.32.1 // indirect
github.com/prometheus/procfs v0.7.3 // indirect
github.com/robfig/cron/v3 v3.0.0 // indirect
A mislabeled dependency declaration may break the build
The new scheduler imports github.com/robfig/cron/v3 directly, so the dependency must not be marked // indirect. Otherwise the next go mod tidy may remove it, leaving the build without a required module. Declare it as a direct dependency and update go.sum accordingly.
- github.com/robfig/cron/v3 v3.0.0 // indirect
+ github.com/robfig/cron/v3 v3.0.0

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
github.com/robfig/cron/v3 v3.0.0
🤖 Prompt for AI Agents
In vermeer/go.mod around line 75, the cron dependency is incorrectly marked as
"// indirect" even though the code directly imports github.com/robfig/cron/v3;
remove the "// indirect" annotation so it is a direct dependency, then run "go
mod tidy" (or "go get github.com/robfig/cron/[email protected]" then "go mod tidy") to
update go.sum and ensure the dependency is recorded and vendored correctly, and
commit the updated go.mod and go.sum.
loadTest2.Init(graphName+"_2", expectRes, &masterHttp, waitSecond, t, &healthCheck)
loadTest2.SendLoadRequest(loadTest2.TaskLoadBodyWithNum(20))
// loadTest2.CheckGraph()

TestPriority(t, expectRes, &healthCheck, &masterHttp, []string{graphName + "_1", graphName + "_2"}, factor, waitSecond)

TestBatch(t, expectRes, &healthCheck, &masterHttp, []string{graphName + "_1"}, factor, waitSecond)

TestRoutine(t, expectRes, &healthCheck, &masterHttp, []string{graphName + "_2"}, factor, waitSecond)
Wait for the second graph to finish loading before starting the scheduling tests
Line 49 currently only sends the load request without calling CheckGraph, yet the subsequent priority/batch/routine tests immediately start using graphName+"_2", racing against a graph that may not be loaded yet. Restore the wait step to avoid flaky test failures.
loadTest2 := functional.LoadTaskLocal{}
loadTest2.Init(graphName+"_2", expectRes, &masterHttp, waitSecond, t, &healthCheck)
loadTest2.SendLoadRequest(loadTest2.TaskLoadBodyWithNum(20))
- // loadTest2.CheckGraph()
+ loadTest2.CheckGraph()

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
loadTest2 := functional.LoadTaskLocal{}
loadTest2.Init(graphName+"_2", expectRes, &masterHttp, waitSecond, t, &healthCheck)
loadTest2.SendLoadRequest(loadTest2.TaskLoadBodyWithNum(20))
loadTest2.CheckGraph()

TestPriority(t, expectRes, &healthCheck, &masterHttp, []string{graphName + "_1", graphName + "_2"}, factor, waitSecond)

TestBatch(t, expectRes, &healthCheck, &masterHttp, []string{graphName + "_1"}, factor, waitSecond)

TestRoutine(t, expectRes, &healthCheck, &masterHttp, []string{graphName + "_2"}, factor, waitSecond)
🤖 Prompt for AI Agents
In vermeer/test/scheduler/test_scheduler.go around lines 47 to 55, the test
sends a load request for graphName+"_2" but does not wait for it to finish
(missing CheckGraph), causing race conditions when subsequent
TestPriority/TestBatch/TestRoutine use the graph; call loadTest2.CheckGraph()
after SendLoadRequest (or otherwise wait for the graph to be fully loaded)
before running the scheduling tests to ensure the graph is ready and avoid
intermittent failures.
Actionable comments posted: 1
🧹 Nitpick comments (6)
vermeer/apps/master/schedules/broker.go (2)
129-146: Use the gRPC connectivity constants and simplify the readiness check. This avoids string comparison and a redundant flag; short-circuit returns read more clearly.

-func (b *Broker) isWorkersReady(workers map[string]*workers.WorkerClient) bool {
-	ok := false
-	for _, w := range workers {
-		if w.Connection == nil {
-			ok = false
-			break
-		}
-
-		if w.Connection.GetState().String() == "READY" {
-			ok = true
-		} else {
-			ok = false
-			break
-		}
-	}
-	return ok
-}
+func (b *Broker) isWorkersReady(wm map[string]*workers.WorkerClient) bool {
+	for _, w := range wm {
+		if w == nil || w.Connection == nil {
+			return false
+		}
+		if w.Connection.GetState() != connectivity.Ready {
+			return false
+		}
+	}
+	return true
+}

This also requires adding to the imports at the top of this file:

import "google.golang.org/grpc/connectivity"

228-239: Avoid shadowing the package name with a local variable. The local variable workers shadows the workers package, hurting readability; rename it.

- workers := make(map[string]*workers.WorkerClient)
+ workerMap := make(map[string]*workers.WorkerClient)
@@
- workers[w.Name] = wc
+ workerMap[w.Name] = wc
@@
- return b.retrieveAgent(graph.WorkerGroup), workers, nil
+ return b.retrieveAgent(graph.WorkerGroup), workerMap, nil

vermeer/apps/master/schedules/scheduler_resource_manager.go (1)

123-125: Duplicate task assignment
Broker.ApplyAgent already calls agent.AssignTask(taskInfo); calling it again here is redundant.

- // Assign the task to the agent
- agent.AssignTask(taskInfo)

vermeer/apps/master/bl/scheduler_bl.go (3)
121-129: Replace time.Tick with time.NewTicker to avoid an unstoppable ticker
time.Tick cannot be shut down; use time.NewTicker with defer ticker.Stop().

-func (s *ScheduleBl) startTicker() {
-	// Create a ticker with the specified interval
-	ticker := time.Tick(time.Duration(s.tickerInterval) * time.Second)
-
-	for range ticker {
+func (s *ScheduleBl) startTicker() {
+	ticker := time.NewTicker(time.Duration(s.tickerInterval) * time.Second)
+	defer ticker.Stop()
+	for range ticker.C {
		logrus.Debug("Ticker ticked")
		s.TryScheduleNextTasks()
	}
}

420-424: TOCTOU risk: the state check and state set should be as atomic as possible
There is a potential race between if taskInfo.State != Waiting and the subsequent SetState(Created). If feasible, have the task manager expose an atomic state-transition interface to eliminate the race.

228-245: Separate the persistence TaskManager from the scheduling TaskManager
ScheduleBl.QueueTask currently uses both the package-level taskMgr (persisting state, fetching tasks, SetError) and the instance field s.taskManager (the scheduling queue), blurring responsibilities. Encapsulate the persistence operations inside ScheduleBl (e.g. a new s.storeMgr field), or add a persistence interface to SchedulerTaskManager so everything goes through s.taskManager, avoiding direct references to the global variable. A sketch of the interface split follows.
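A minimal sketch of the split; taskStore, its method set, and the structure.TaskState type name are hypothetical and must be matched against the real taskMgr API:

// taskStore is a hypothetical narrow interface over the persistence-side
// manager, so ScheduleBl no longer touches the global taskMgr directly.
type taskStore interface {
	SetState(taskInfo *structure.TaskInfo, state structure.TaskState) error
	GetTaskByID(taskID int32) *structure.TaskInfo
}

type ScheduleBl struct {
	// ...existing fields (resourceManager, taskManager, cronManager, ...)...
	storeMgr taskStore // injected once at Init time, e.g. with the global taskMgr
}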
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
vermeer/apps/master/bl/scheduler_bl.go (5 hunks)
vermeer/apps/master/schedules/broker.go (8 hunks)
vermeer/apps/master/schedules/scheduler_resource_manager.go (1 hunks)
vermeer/apps/master/schedules/scheduler_task_manager.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- vermeer/apps/master/schedules/scheduler_task_manager.go
🧰 Additional context used
🧬 Code graph analysis (3)
vermeer/apps/master/schedules/scheduler_resource_manager.go (2)
vermeer/apps/master/schedules/broker.go (4)
Broker(55-59)
Agent(41-44)
AgentStatus(29-29)
AgentStatusError(33-33)
vermeer/apps/structure/task.go (1)
TaskInfo(41-64)
vermeer/apps/master/bl/scheduler_bl.go (6)
vermeer/apps/master/schedules/scheduler_resource_manager.go (3)
SchedulerResourceManager(28-36)
WorkerOngoingStatusDeleted(21-21)
WorkerOngoingStatus(14-14)
vermeer/apps/master/schedules/scheduler_algorithm_manager.go (1)
SchedulerAlgorithmManager(33-37)
vermeer/apps/master/schedules/scheduler_task_manager.go (1)
SchedulerTaskManager(14-26)
vermeer/apps/master/schedules/scheduler_cron_manager.go (1)
SchedulerCronManager(15-20)
vermeer/apps/structure/task.go (1)
TaskInfo(41-64)
vermeer/apps/master/bl/task_canceler.go (1)
NewTaskCanceler(35-51)
vermeer/apps/master/schedules/broker.go (2)
vermeer/apps/structure/task.go (1)
TaskInfo(41-64)
vermeer/apps/master/workers/worker_manager.go (1)
WorkerClient(44-59)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: check-dependency-license
- GitHub Check: Analyze (java)
- GitHub Check: Analyze (go)
🔇 Additional comments (8)
vermeer/apps/master/schedules/broker.go (2)
26-27: Dot import removed and workers package references unified — fixed. Dropping the dot import and consistently using the workers prefix removes the duplicate-import and name-collision risk. Nice.

75-113: Make the optional parameter explicit; avoid a variadic bool
forceApply ...bool is hard to read and easy to misuse. Prefer a single bool parameter or an options struct. The minimal change is as follows (call sites keep passing a single bool):

-func (b *Broker) ApplyAgent(taskInfo *structure.TaskInfo, forceApply ...bool) (*Agent, AgentStatus, map[string]*workers.WorkerClient, error) {
+func (b *Broker) ApplyAgent(taskInfo *structure.TaskInfo, force bool) (*Agent, AgentStatus, map[string]*workers.WorkerClient, error) {
@@
-	if !(forceApply != nil && len(forceApply) > 0 && forceApply[0]) {
+	if !force {

Please confirm that every call site has been adapted to pass a single bool (e.g. !taskInfo.Exclusive). A script to locate the call sites:

#!/bin/bash
rg -nP -C2 '\bApplyAgent\s*\(' --type=go

vermeer/apps/master/schedules/scheduler_resource_manager.go (2)

54-82: When releasing a task, do not delete the group status; cache the group members and write the status back once. The code currently deletes workerGroupStatus first and then recomputes from the running tasks, which is inconsistent and fetches the group members repeatedly. Keep the entry, remove the task ID, then set Idle/Concurrent based on the remaining tasks, caching the members to avoid duplicate calls.

func (rm *SchedulerResourceManager) ReleaseByTaskID(taskID int32) {
	defer rm.Unlock(rm.Lock())
	for workerGroup, status := range rm.workerGroupStatus {
		if (status == WorkerOngoingStatusRunning || status == WorkerOngoingStatusConcurrentRunning) && rm.isTaskRunningOnWorkerGroup(workerGroup, taskID) {
-			delete(rm.workerGroupStatus, workerGroup)
-			if tasks, exists := rm.runningWorkerGroupTasks[workerGroup]; exists {
+			// remove taskID from the running list
+			if tasks, exists := rm.runningWorkerGroupTasks[workerGroup]; exists {
				for i, id := range tasks {
					if id == taskID {
						rm.runningWorkerGroupTasks[workerGroup] = append(tasks[:i], tasks[i+1:]...)
-						if len(rm.runningWorkerGroupTasks[workerGroup]) == 0 {
-							delete(rm.runningWorkerGroupTasks, workerGroup)
-						}
						break
					}
				}
			}
-			if tasks, exists := rm.runningWorkerGroupTasks[workerGroup]; !exists || len(tasks) == 0 {
-				for _, worker := range workerMgr.GetGroupWorkers(workerGroup) {
-					rm.changeWorkerStatus(worker.Name, WorkerOngoingStatusIdle)
-				}
-			} else {
-				for _, worker := range workerMgr.GetGroupWorkers(workerGroup) {
-					rm.changeWorkerStatus(worker.Name, WorkerOngoingStatusConcurrentRunning)
-				}
-			}
+			// compute the new group status and propagate it
+			gw := workerMgr.GetGroupWorkers(workerGroup)
+			if tasks, exists := rm.runningWorkerGroupTasks[workerGroup]; !exists || len(tasks) == 0 {
+				delete(rm.runningWorkerGroupTasks, workerGroup)
+				rm.workerGroupStatus[workerGroup] = WorkerOngoingStatusIdle
+				for _, worker := range gw {
+					rm.changeWorkerStatus(worker.Name, WorkerOngoingStatusIdle)
+				}
+			} else {
+				rm.workerGroupStatus[workerGroup] = WorkerOngoingStatusConcurrentRunning
+				for _, worker := range gw {
+					rm.changeWorkerStatus(worker.Name, WorkerOngoingStatusConcurrentRunning)
+				}
+			}
		}
	}
}

126-136: The concurrency-status logic is redundant and inaccurate; compute it from exclusivity and the current running count. Initializing the slice and setting the status are duplicated; decide Running vs ConcurrentRunning from the exclusive flag and the group's current running count.

-	exclusive := taskInfo.Exclusive
-	runningStatus := WorkerOngoingStatusRunning
-	if _, exists := rm.runningWorkerGroupTasks[agent.GroupName()]; !exists && exclusive {
-		rm.runningWorkerGroupTasks[agent.GroupName()] = []int32{}
-		runningStatus = WorkerOngoingStatusRunning
-		rm.workerGroupStatus[agent.GroupName()] = runningStatus
-	} else {
-		runningStatus = WorkerOngoingStatusConcurrentRunning
-		rm.workerGroupStatus[agent.GroupName()] = runningStatus
-	}
+	exclusive := taskInfo.Exclusive
+	if _, exists := rm.runningWorkerGroupTasks[agent.GroupName()]; !exists {
+		rm.runningWorkerGroupTasks[agent.GroupName()] = []int32{}
+	}
+	var runningStatus WorkerOngoingStatus
+	if exclusive && len(rm.runningWorkerGroupTasks[agent.GroupName()]) == 0 {
+		runningStatus = WorkerOngoingStatusRunning
+	} else {
+		runningStatus = WorkerOngoingStatusConcurrentRunning
+	}
+	rm.workerGroupStatus[agent.GroupName()] = runningStatus

vermeer/apps/master/bl/scheduler_bl.go (4)

57-73: Init lacks failure propagation and a stop mechanism (can be deferred). The initialization chain and the goroutines cannot observe failure or exit; have Init return an error and give the ticker/waiter a Stop mechanism (or a context) for shutdown.

At minimum, change it to:

-func (s *ScheduleBl) Init() {
+func (s *ScheduleBl) Init() error {
	logrus.Info("Initializing ScheduleBl...")
-	s.LoadConfig()
+	s.LoadConfig() // or have this return an error and bubble it up
@@
-	go s.startTicker()
-	go s.waitingStartedTask()
-}
+	go s.startTicker()
+	go s.waitingStartedTask()
+	return nil
+}

A sketch of a context-based stop for the background goroutines follows.
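A minimal sketch of the context-based shutdown, applied here to the startChan consumer; it assumes the owner of ScheduleBl creates and cancels the context, and that the context package is imported:

// waitingStartedTask drains startChan until the context is cancelled.
// Sketch only: the real loop also assigns an agent and starts the task.
func (s *ScheduleBl) waitingStartedTask(ctx context.Context) {
	for {
		select {
		case task := <-s.startChan:
			if task == nil {
				logrus.Warnf("received a nil task from startChan")
				continue
			}
			// ... start the task as today ...
		case <-ctx.Done(): // shutdown signal from the caller of Init
			return
		}
	}
}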
369-376: Fix the spelling and keep the continue. The log should say "received".

- logrus.Warnf("recieved a nil task from startChan")
+ logrus.Warnf("received a nil task from startChan")

303-326: The batch-queue result slices are filled incorrectly (length doubles and contains zero values). The slices are preallocated and then appended to, producing the wrong length. Assign by index.

-	errors := make([]error, len(taskInfos))
-	oks := make([]bool, len(taskInfos))
-
-	for _, taskInfo := range taskInfos {
-		ok, err := s.QueueTask(taskInfo)
-		if err != nil {
-			logrus.Errorf("failed to queue task '%d': %v", taskInfo.ID, err)
-		}
-		errors = append(errors, err)
-		oks = append(oks, ok)
-	}
+	errors := make([]error, len(taskInfos))
+	oks := make([]bool, len(taskInfos))
+	for i, taskInfo := range taskInfos {
+		ok, err := s.QueueTask(taskInfo)
+		if err != nil {
+			logrus.Errorf("failed to queue task '%d': %v", taskInfo.ID, err)
+		}
+		oks[i] = ok
+		errors[i] = err
+	}

463-485: No need to handle a RemoveTask error here
The current RemoveTask signature returns only *TaskInfo with no error, so there is nothing to receive or handle; if it later changes to (task, err), add error handling then.

Likely an incorrect or invalid review comment.
Actionable comments posted: 1
🧹 Nitpick comments (5)
vermeer/apps/master/schedules/scheduler_resource_manager.go (5)
224-227: The Deleted branch does not write the group status back, which can leave a stale status. After a worker is deleted, the group status is not recomputed and may diverge from reality.

-	} else if status == WorkerOngoingStatusDeleted {
-		delete(rm.workerStatus, workerName)
+	} else if status == WorkerOngoingStatusDeleted {
+		workerInfo := workerMgr.GetWorkerInfo(workerName)
+		delete(rm.workerStatus, workerName)
+		if workerInfo != nil && workerInfo.Group != "" {
+			gws := workerMgr.GetGroupWorkers(workerInfo.Group)
+			anyRunning, anyConcurrent, allIdle := false, false, true
+			for _, w := range gws {
+				switch st := rm.workerStatus[w.Name]; st {
+				case WorkerOngoingStatusRunning:
+					anyRunning = true; allIdle = false
+				case WorkerOngoingStatusConcurrentRunning:
+					anyConcurrent = true; allIdle = false
+				case WorkerOngoingStatusIdle:
+				default:
+					allIdle = false
+				}
+			}
+			var newStatus WorkerOngoingStatus
+			switch {
+			case anyRunning:
+				newStatus = WorkerOngoingStatusRunning
+			case anyConcurrent:
+				newStatus = WorkerOngoingStatusConcurrentRunning
+			case allIdle:
+				newStatus = WorkerOngoingStatusIdle
+			default:
+				newStatus = rm.workerGroupStatus[workerInfo.Group]
+			}
+			rm.changeWorkerGroupStatus(workerInfo.Group, newStatus)
+		}

231-234: Lower the log level for status changes to reduce noise. Group-status changes are frequent; use Debug level.

- logrus.Infof("Change worker group '%s' status to '%s'", workerGroup, status)
+ logrus.Debugf("Change worker group '%s' status to '%s'", workerGroup, status)
42-47: Create the Broker via dependency injection for easier testing and replacement. The new(Broker).Init() approach couples tightly and hinders mocking and swapping implementations.
Add an optional parameter to Init, or provide a setter, so the broker can be injected from outside; a sketch follows.
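A minimal sketch of the injection idea, reusing the SchedulerResourceManager and Broker types from this package; the agentBroker interface, the InitWithBroker name, and the rm.broker field are hypothetical additions:

// agentBroker captures the Broker calls the resource manager actually makes,
// so tests can substitute a fake implementation.
type agentBroker interface {
	ApplyAgent(taskInfo *structure.TaskInfo, forceApply ...bool) (*Agent, AgentStatus, map[string]*workers.WorkerClient, error)
}

// InitWithBroker lets tests inject a fake; passing nil keeps today's default.
func (rm *SchedulerResourceManager) InitWithBroker(b agentBroker) {
	if b == nil {
		b = new(Broker) // fall back to the real implementation
	}
	rm.broker = b // assumes the broker field's type is widened to agentBroker
}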
153-163: Queries based on cached status can be briefly inconsistent. GetIdleWorkerGroups reads only rm.workerGroupStatus; if the status has not yet been updated by the write-back path, idle groups can briefly go unreported. Consider deriving the answer from runningWorkerGroupTasks or adding a read-only snapshot mechanism.

170-180: Same as above: align the concurrent-group query with the running-task data. Cross-check against runningWorkerGroupTasks to avoid false positives/negatives from status drift; a sketch follows.
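A minimal sketch of deriving idle groups from runningWorkerGroupTasks instead of the cached status map; idleWorkerGroupsDerived is a hypothetical method name:

// idleWorkerGroupsDerived treats a group as idle iff it has no entry (or an
// empty entry) in runningWorkerGroupTasks, sidestepping stale cached statuses.
func (rm *SchedulerResourceManager) idleWorkerGroupsDerived() []string {
	defer rm.Unlock(rm.Lock())
	idle := make([]string, 0, len(rm.workerGroupStatus))
	for group := range rm.workerGroupStatus {
		if tasks, ok := rm.runningWorkerGroupTasks[group]; !ok || len(tasks) == 0 {
			idle = append(idle, group)
		}
	}
	return idle
}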
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
vermeer/apps/master/schedules/scheduler_resource_manager.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
vermeer/apps/master/schedules/scheduler_resource_manager.go (2)
vermeer/apps/master/schedules/broker.go (4)
Broker(55-59)
Agent(41-44)
AgentStatus(29-29)
AgentStatusError(33-33)
vermeer/apps/structure/task.go (1)
TaskInfo(41-64)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: check-dependency-license
- GitHub Check: Analyze (java)
- GitHub Check: Analyze (go)
🔇 Additional comments (4)
vermeer/apps/master/schedules/scheduler_resource_manager.go (4)
126-136: Streamline the exclusive/concurrent status decision and avoid redundant assignments. The code initializes the slice and then appends immediately, and writes the group status in two places. Decide once based on whether the task is exclusive and whether the current count is 0, then write the status once.

-	exclusive := taskInfo.Exclusive
-	runningStatus := WorkerOngoingStatusRunning
-	if _, exists := rm.runningWorkerGroupTasks[agent.GroupName()]; !exists && exclusive {
-		rm.runningWorkerGroupTasks[agent.GroupName()] = []int32{}
-		runningStatus = WorkerOngoingStatusRunning
-		rm.workerGroupStatus[agent.GroupName()] = runningStatus
-	} else {
-		runningStatus = WorkerOngoingStatusConcurrentRunning
-		rm.workerGroupStatus[agent.GroupName()] = runningStatus
-	}
-	rm.runningWorkerGroupTasks[agent.GroupName()] = append(rm.runningWorkerGroupTasks[agent.GroupName()], taskInfo.ID)
+	exclusive := taskInfo.Exclusive
+	currentCount := len(rm.runningWorkerGroupTasks[agent.GroupName()])
+	runningStatus := WorkerOngoingStatusRunning
+	if !(exclusive && currentCount == 0) {
+		runningStatus = WorkerOngoingStatusConcurrentRunning
+	}
+	rm.workerGroupStatus[agent.GroupName()] = runningStatus
+	rm.runningWorkerGroupTasks[agent.GroupName()] = append(rm.runningWorkerGroupTasks[agent.GroupName()], taskInfo.ID)

57-79: Fix ReleaseByTaskID's status deletion and duplicated lookups (and handle the Paused status)

- Do not delete rm.workerGroupStatus entries while iterating; they still need to be written back afterwards.
- The release logic should not require the group status to be Running/Concurrent; a group in Paused should be released too.
- Avoid calling workerMgr.GetGroupWorkers repeatedly; cache it once.

Suggested change:

-	for workerGroup, status := range rm.workerGroupStatus {
-		if (status == WorkerOngoingStatusRunning || status == WorkerOngoingStatusConcurrentRunning) && rm.isTaskRunningOnWorkerGroup(workerGroup, taskID) {
-			delete(rm.workerGroupStatus, workerGroup)
+	for workerGroup := range rm.workerGroupStatus {
+		if rm.isTaskRunningOnWorkerGroup(workerGroup, taskID) {
			if tasks, exists := rm.runningWorkerGroupTasks[workerGroup]; exists {
				for i, id := range tasks {
					if id == taskID {
						rm.runningWorkerGroupTasks[workerGroup] = append(tasks[:i], tasks[i+1:]...)
						if len(rm.runningWorkerGroupTasks[workerGroup]) == 0 {
							delete(rm.runningWorkerGroupTasks, workerGroup)
						}
						break
					}
				}
			}
-			if tasks, exists := rm.runningWorkerGroupTasks[workerGroup]; !exists || len(tasks) == 0 {
-				for _, worker := range workerMgr.GetGroupWorkers(workerGroup) {
-					rm.changeWorkerStatus(worker.Name, WorkerOngoingStatusIdle)
-				}
-			} else {
-				for _, worker := range workerMgr.GetGroupWorkers(workerGroup) {
-					rm.changeWorkerStatus(worker.Name, WorkerOngoingStatusConcurrentRunning)
-				}
-			}
+			gws := workerMgr.GetGroupWorkers(workerGroup)
+			if tasks, exists := rm.runningWorkerGroupTasks[workerGroup]; !exists || len(tasks) == 0 {
+				rm.changeWorkerGroupStatus(workerGroup, WorkerOngoingStatusIdle)
+				for _, w := range gws {
+					rm.changeWorkerStatus(w.Name, WorkerOngoingStatusIdle)
+				}
+			} else {
+				rm.changeWorkerGroupStatus(workerGroup, WorkerOngoingStatusConcurrentRunning)
+				for _, w := range gws {
+					rm.changeWorkerStatus(w.Name, WorkerOngoingStatusConcurrentRunning)
+				}
+			}
		}
	}

188-221: Group-status derivation is incomplete: prefer Running, then any-Concurrent, otherwise Idle. The status is currently written back only when the group is "all Idle" or "all Concurrent"; mixed states are never updated and Running has no priority, so the group status can diverge from reality and mislead scheduling.

Change it to: any Running => Running; otherwise any Concurrent => Concurrent; otherwise Idle.

-	gws := workerMgr.GetGroupWorkers(groupName)
-	allIdle := true
-	allConcurrent := true
-	for _, w := range gws {
-		st := rm.workerStatus[w.Name]
-		if st != WorkerOngoingStatusIdle {
-			allIdle = false
-		}
-		if st != WorkerOngoingStatusConcurrentRunning {
-			allConcurrent = false
-		}
-	}
-	if allConcurrent || allIdle {
-		newStatus := WorkerOngoingStatusIdle
-		if allConcurrent {
-			newStatus = WorkerOngoingStatusConcurrentRunning
-		}
-		logrus.Debugf("Change worker group '%s' status to '%s' (derived from %d workers)", groupName, newStatus, len(gws))
-		rm.changeWorkerGroupStatus(groupName, newStatus)
-	}
+	gws := workerMgr.GetGroupWorkers(groupName)
+	anyRunning, anyConcurrent, allIdle := false, false, true
+	for _, w := range gws {
+		switch st := rm.workerStatus[w.Name]; st {
+		case WorkerOngoingStatusRunning:
+			anyRunning = true
+			allIdle = false
+		case WorkerOngoingStatusConcurrentRunning:
+			anyConcurrent = true
+			allIdle = false
+		case WorkerOngoingStatusIdle:
+			// keep allIdle
+		default:
+			allIdle = false
+		}
+	}
+	var newStatus WorkerOngoingStatus
+	switch {
+	case anyRunning:
+		newStatus = WorkerOngoingStatusRunning
+	case anyConcurrent:
+		newStatus = WorkerOngoingStatusConcurrentRunning
+	case allIdle:
+		newStatus = WorkerOngoingStatusIdle
+	default:
+		newStatus = rm.workerGroupStatus[groupName]
+	}
+	logrus.Debugf("Change worker group '%s' status to '%s' (derived from %d workers)", groupName, newStatus, len(gws))
+	rm.changeWorkerGroupStatus(groupName, newStatus)

115-116: Confirmed: forceApply's semantics are the inverse of Exclusive, and the usage is correct
When ApplyAgent's second optional boolean, forceApply, is true it skips the busy checks (allowing concurrent scheduling). Passing !taskInfo.Exclusive therefore matches exactly: non-exclusive tasks schedule concurrently, exclusive tasks run the checks. No change needed.
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
vermeer/apps/master/bl/scheduler_bl.go (1)
423-427: Fix the TOCTOU race: the state check and state set must be atomic. The problem is confirmed. Line 423 checks taskInfo.State != TaskStateWaiting without holding the lock, while taskMgr.SetState only acquires the lock internally, leaving a race window in which another goroutine can change the state. Worse, SetState itself neither verifies the precondition nor validates the transition (e.g. allowing only Waiting→Created). Suggestions:

- Either protect the whole check-and-set with a lock at the outer level,
- or implement transition validation inside SetState (permitting only legal state transitions); a sketch follows.
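A minimal sketch of a compare-and-set transition; the TaskManager type name, the mu mutex field, and the structure.TaskState type name are assumptions about the persistence-side manager behind taskMgr:

// TrySetState atomically moves a task from `from` to `to`; it fails instead
// of overwriting if another goroutine changed the state first.
func (m *TaskManager) TrySetState(task *structure.TaskInfo, from, to structure.TaskState) error {
	m.mu.Lock() // assumed mutex guarding task state
	defer m.mu.Unlock()
	if task.State != from {
		return fmt.Errorf("task %d: state is %v, want %v", task.ID, task.State, from)
	}
	task.State = to
	return nil
}

// The caller replaces the unlocked check + SetState pair with one atomic call:
//   err := taskMgr.TrySetState(taskInfo, structure.TaskStateWaiting, structure.TaskStateCreated)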
♻️ Duplicate comments (11)
vermeer/apps/master/bl/task_bl.go (1)
68-116: 调度参数解析需增强健壮性(TrimSpace/跳过空项/错误边界)并移除不可达检查
- 未 TrimSpace:如 "1, 2" 会解析失败;空串会报错。
- preorders 未跳过空项与去重。
- priority 使用
ParseInt(..., 32)后再判断> math.MaxInt32永远为假,属冗余。- exclusive/cron_expr 同样需要 TrimSpace 后再判空。
建议按下述方式改造(含移除不可达检查):
@@ - if priority, ok := params["priority"]; ok { - if p, err := strconv.ParseInt(priority, 10, 32); err == nil { - if p < 0 { - return nil, fmt.Errorf("priority should be non-negative") - } - if p > math.MaxInt32 { - return nil, fmt.Errorf("priority exceeds maximum value: %d", math.MaxInt32) - } - taskInfo.Priority = int32(p) - } else { - logrus.Warnf("priority convert to int32 error:%v", err) - return nil, err - } - } + if v, ok := params["priority"]; ok { + v = strings.TrimSpace(v) + if v != "" { + if p, err := strconv.ParseInt(v, 10, 32); err == nil { + if p < 0 { + return nil, fmt.Errorf("priority should be non-negative") + } + taskInfo.Priority = int32(p) + } else { + logrus.Warnf("priority convert to int32 error: %v", err) + return nil, err + } + } + } @@ - if preorders, ok := params["preorders"]; ok { - preorderList := strings.Split(preorders, ",") - for _, preorder := range preorderList { - if pid, err := strconv.ParseInt(preorder, 10, 32); err == nil { + if preorders, ok := params["preorders"]; ok { + seen := make(map[int32]struct{}) + for _, tok := range strings.Split(preorders, ",") { + tok = strings.TrimSpace(tok) + if tok == "" { + continue + } + if pid64, err := strconv.ParseInt(tok, 10, 32); err == nil { + pid := int32(pid64) + if _, dup := seen[p] ; dup { + continue + } + seen[pid] = struct{}{} if taskMgr.GetTaskByID(int32(pid)) == nil { - return nil, fmt.Errorf("preorder task with ID %d does not exist", pid) + return nil, fmt.Errorf("preorder task with ID %d does not exist", pid) } - taskInfo.Preorders = append(taskInfo.Preorders, int32(pid)) + taskInfo.Preorders = append(taskInfo.Preorders, pid) } else { - logrus.Warnf("preorder convert to int32 error:%v", err) + logrus.Warnf("preorders item '%s' invalid: %v", tok, err) return nil, err } } } @@ - if exclusive, ok := params["exclusive"]; ok { - if ex, err := strconv.ParseBool(exclusive); err == nil { + if exclusive, ok := params["exclusive"]; ok { + exclusive = strings.TrimSpace(exclusive) + if exclusive != "" { + if ex, err := strconv.ParseBool(exclusive); err == nil { taskInfo.Exclusive = ex - } else { - logrus.Warnf("exclusive convert to bool error:%v", err) - return nil, err + } else { + logrus.Warnf("exclusive convert to bool error:%v", err) + return nil, err + } } } @@ - if cronExpr, ok := params["cron_expr"]; ok { - if err := Scheduler.cronManager.CheckCronExpression(cronExpr); err != nil { + if cronExpr, ok := params["cron_expr"]; ok { + cronExpr = strings.TrimSpace(cronExpr) + if cronExpr != "" { + if err := Scheduler.cronManager.CheckCronExpression(cronExpr); err != nil { logrus.Warnf("cron_expr parse error:%v", err) return nil, err - } - taskInfo.CronExpr = cronExpr + } + taskInfo.CronExpr = cronExpr + } }另外:若移除
math.MaxInt32相关判断,请同步删除顶部import "math"(见下方补充 diff)。vermeer/apps/master/schedules/scheduler_task_manager.go (6)
176-187: GetLastTask 无锁访问且为 O(n) 扫描
- 需至少加锁保护读。
- 如队列大,该方法可能成为热点(可后续引入按 space 的索引以 O(1) 获取)。
-func (t *SchedulerTaskManager) GetLastTask(spaceName string) *structure.TaskInfo { +func (t *SchedulerTaskManager) GetLastTask(spaceName string) *structure.TaskInfo { + defer t.Unlock(t.Lock()) if len(t.allTaskQueue) == 0 { return nil } for i := len(t.allTaskQueue) - 1; i >= 0; i-- { if t.allTaskQueue[i].SpaceName == spaceName { return t.allTaskQueue[i] } } return nil }
267-274: IsTaskOngoing lacks concurrency protection, and the state check may be too narrow

```diff
-func (t *SchedulerTaskManager) IsTaskOngoing(taskID int32) bool {
-	// Check if the task is currently ongoing
-	task, exists := t.allTaskMap[taskID]
+func (t *SchedulerTaskManager) IsTaskOngoing(taskID int32) bool {
+	defer t.Unlock(t.Lock())
+	task, exists := t.allTaskMap[taskID]
 	if !exists {
 		return false
 	}
-	return task.State == structure.TaskStateCreated
+	// TODO: add any other "ongoing" states the business logic needs (e.g. running)
+	return task.State == structure.TaskStateCreated
 }
```
89-99: AddTaskStartSequence lacks concurrency protection and an atomic existence check

```diff
-func (t *SchedulerTaskManager) AddTaskStartSequence(taskID int32) error {
+func (t *SchedulerTaskManager) AddTaskStartSequence(taskID int32) error {
+	defer t.Unlock(t.Lock())
 	if common.GetConfig("debug_mode").(string) != "debug" {
 		logrus.Warn("TaskStartSequence called but debug features are disabled")
 		return nil
 	}
 	if _, exists := t.allTaskMap[taskID]; !exists {
 		return errors.New("task not found")
 	}
 	t.startTaskQueue = append(t.startTaskQueue, t.allTaskMap[taskID])
 	return nil
 }
```
107-123: RemoveTask's signature does not match its callers, and the locking order is wrong (race / compile error)
- The map is read before the lock is taken — a TOCTOU race.
- Callers (scheduler_bl.go: CancelTask/CloseCurrent) use the return value as a task pointer, but the current signature only returns error, which breaks the build.
Return the removed task pointer instead and hold the lock throughout:

```diff
-func (t *SchedulerTaskManager) RemoveTask(taskID int32) error {
-	if _, exists := t.allTaskMap[taskID]; !exists {
-		return errors.New("task not found")
-	}
-	defer t.Unlock(t.Lock())
-	delete(t.allTaskMap, taskID)
+func (t *SchedulerTaskManager) RemoveTask(taskID int32) (*structure.TaskInfo, error) {
+	defer t.Unlock(t.Lock())
+	task, exists := t.allTaskMap[taskID]
+	if !exists {
+		return nil, errors.New("task not found")
+	}
+	delete(t.allTaskMap, taskID)
 	// remove from queue
 	for i, task := range t.allTaskQueue {
 		if task.ID == taskID {
 			t.allTaskQueue = append(t.allTaskQueue[:i], t.allTaskQueue[i+1:]...)
 			break
 		}
 	}
 	delete(t.taskToworkerGroupMap, taskID)
 	delete(t.notCompleteTasks, taskID)
-	return nil
+	return task, nil
}
```

Then update every call site to handle the returned (task, err). Given the existing comments and usage, treating task != nil at the call sites as "was present in the queue" is the sensible criterion.

131-138: MarkTaskComplete reads and writes shared state before taking the lock

```diff
-func (t *SchedulerTaskManager) MarkTaskComplete(taskID int32) error {
-	if _, exists := t.allTaskMap[taskID]; !exists {
-		return errors.New("task not found")
-	}
-	defer t.Unlock(t.Lock())
+func (t *SchedulerTaskManager) MarkTaskComplete(taskID int32) error {
+	defer t.Unlock(t.Lock())
+	if _, exists := t.allTaskMap[taskID]; !exists {
+		return errors.New("task not found")
+	}
 	delete(t.notCompleteTasks, taskID)
 	return nil
 }
```

230-253: GetTaskStartSequence reads a shared slice without protection and exposes the internal slice directly

```diff
-func (t *SchedulerTaskManager) GetTaskStartSequence(queryTasks []int32) []*structure.TaskInfo {
-	if len(t.startTaskQueue) == 0 {
+func (t *SchedulerTaskManager) GetTaskStartSequence(queryTasks []int32) []*structure.TaskInfo {
+	defer t.Unlock(t.Lock())
+	if len(t.startTaskQueue) == 0 {
 		return nil
 	}
 	if len(queryTasks) == 0 {
-		return t.startTaskQueue
+		out := make([]*structure.TaskInfo, len(t.startTaskQueue))
+		copy(out, t.startTaskQueue)
+		return out
 	}
 	tasks := make([]*structure.TaskInfo, 0, len(queryTasks))
 	taskSet := make(map[int32]struct{})
 	for _, id := range queryTasks {
 		taskSet[id] = struct{}{}
 	}
 	for _, task := range t.startTaskQueue {
 		if _, exists := taskSet[task.ID]; exists {
 			tasks = append(tasks, task)
 		}
 	}
 	logrus.Infof("GetTaskStartSequence: return %d tasks", len(tasks))
 	for _, task := range tasks {
 		logrus.Debugf("TaskID: %d", task.ID)
 	}
 	return tasks
 }
```

vermeer/apps/master/bl/scheduler_bl.go (4)
58-74: Init lacks error propagation, so initialization failures continue silently

```diff
-func (s *ScheduleBl) Init() {
+func (s *ScheduleBl) Init() error {
 	logrus.Info("Initializing ScheduleBl...")
-	s.LoadConfig()
+	if err := s.LoadConfig(); err != nil {
+		return fmt.Errorf("load config failed: %w", err)
+	}
@@
 	s.resourceManager = &schedules.SchedulerResourceManager{}
 	s.resourceManager.Init()
@@
 	s.taskManager = &schedules.SchedulerTaskManager{}
 	s.taskManager.Init()
@@
 	s.algorithmManager = &schedules.SchedulerAlgorithmManager{}
 	s.algorithmManager.Init()
@@
 	s.cronManager = &schedules.SchedulerCronManager{}
 	s.cronManager.Init(s.QueueTaskFromTemplate)
 	go s.startTicker()
 	go s.waitingStartedTask()
-}
+	return nil
+}
```

Tip: if none of the above Init() calls can return an error, it is enough to propagate only LoadConfig()'s error here and keep a basic panic/recover guard.

306-329: BatchQueueTask fills its result slices incorrectly (doubled length, zero-valued front half)

```diff
-	errors := make([]error, len(taskInfos))
-	oks := make([]bool, len(taskInfos))
-
-	for _, taskInfo := range taskInfos {
-		ok, err := s.QueueTask(taskInfo)
-		if err != nil {
-			logrus.Errorf("failed to queue task '%d': %v", taskInfo.ID, err)
-		}
-		errors = append(errors, err)
-		oks = append(oks, ok)
-	}
+	errors := make([]error, len(taskInfos))
+	oks := make([]bool, len(taskInfos))
+	for i, taskInfo := range taskInfos {
+		ok, err := s.QueueTask(taskInfo)
+		if err != nil {
+			logrus.Errorf("failed to queue task '%d': %v", taskInfo.ID, err)
+		}
+		oks[i] = ok
+		errors[i] = err
+	}
```
331-355: CloseCurrent uses RemoveTask's old signature and ignores its error
This needs to follow the RemoveTask(taskID) (*TaskInfo, error) change, and handle the returned error as well:

```diff
-	s.taskManager.RemoveTask(taskId)
+	removed, rmErr := s.taskManager.RemoveTask(taskId)
+	if rmErr != nil {
+		return rmErr
+	}
```

459-488: CancelTask uses RemoveTask's old signature without handling the error — a race plus a compile error
After changing SchedulerTaskManager.RemoveTask(taskID) (*TaskInfo, error), adjust accordingly:

```diff
-	isHeadTask := s.taskManager.IsTaskOngoing(taskInfo.ID)
-	task := s.taskManager.RemoveTask(taskInfo.ID)
-	s.cronManager.DeleteTask(taskInfo.ID)
+	isHeadTask := s.taskManager.IsTaskOngoing(taskInfo.ID)
+	removedTask, rmErr := s.taskManager.RemoveTask(taskInfo.ID)
+	s.cronManager.DeleteTask(taskInfo.ID)
@@
-	isInQueue := false
-	if task != nil {
+	isInQueue := false
+	if rmErr == nil && removedTask != nil {
 		logrus.Infof("removed task '%d' from space queue", taskInfo.ID)
 		isInQueue = true
 	}
```

Also: IsTaskOngoing/RemoveTask should be protected by the same lock (currently s and taskManager do not share one). Take the lock inside taskManager.IsTaskOngoing/RemoveTask (see the corresponding suggestion) to eliminate the race.
🧹 Nitpick comments (4)
vermeer/apps/master/bl/task_bl.go (1)
20-27: Remove the now-unused math dependency
After the changes above, math is no longer used; remove it to avoid build warnings.

```diff
 import (
 	"errors"
 	"fmt"
-	"math"
 	"sort"
 	"strconv"
 	"strings"
```

vermeer/apps/master/schedules/scheduler_task_manager.go (2)
70-80: RefreshTaskToWorkerGroupMap assigns twice and ignores the error
It calls AssignGroup and then also writes taskToworkerGroupMap manually — redundant logic — and AssignGroup's error goes unhandled.

```diff
-	t.AssignGroup(taskInfo)
-	t.taskToworkerGroupMap[taskInfo.ID] = workerMgr.ApplyGroup(taskInfo.SpaceName, taskInfo.GraphName)
+	if err := t.AssignGroup(taskInfo); err != nil {
+		logrus.Warnf("assign group failed for task %d: %v", taskInfo.ID, err)
+	}
```
147-154: AssignGroup's concurrency contract is unclear (a public method that relies on the caller holding the lock)
The method does not take the lock itself, yet it is usually called on paths that already hold it; any path that calls it directly will race. Options (see the sketch after this list):
- Make it unexported as assignGroup and state in a comment that it "must be called with the lock held", or
- Have the method take the lock itself, and make sure it is never called again from an already-locked path, to avoid deadlock.
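A runnable mini-example of the first option, with illustrative names (the real method lives on SchedulerTaskManager):

```go
package main

import "sync"

// Sketch of the exported/unexported split: the exported method owns the lock,
// the unexported core assumes the caller already holds it.
type manager struct {
	mu     sync.Mutex
	groups map[int32]string
}

// AssignGroup is safe to call from any goroutine.
func (m *manager) AssignGroup(taskID int32, group string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.assignGroup(taskID, group)
}

// assignGroup must only be called with m.mu held; already-locked paths
// call it directly and avoid the double-lock deadlock.
func (m *manager) assignGroup(taskID int32, group string) {
	m.groups[taskID] = group
}

func main() {
	m := &manager{groups: make(map[int32]string)}
	m.AssignGroup(1, "group-a")
}
```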
vermeer/apps/master/bl/scheduler_bl.go (1)
80-116: Parse config with ParseBool, handle bad values robustly, and unify the default

```diff
-// softSchedule
-softSchedule := common.GetConfigDefault("soft_schedule", "true").(string)
-if softSchedule == "true" {
-	s.softSchedule = true
-} else {
-	s.softSchedule = false
-}
+// softSchedule
+softStr := common.GetConfigDefault("soft_schedule", "true").(string)
+if b, err := strconv.ParseBool(softStr); err == nil {
+	s.softSchedule = b
+} else {
+	logrus.Warnf("invalid soft_schedule=%q, fallback to true", softStr)
+	s.softSchedule = true
+}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
vermeer/apps/master/bl/scheduler_bl.go (6 hunks)
vermeer/apps/master/bl/task_bl.go (4 hunks)
vermeer/apps/master/schedules/scheduler_task_manager.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
vermeer/apps/master/bl/scheduler_bl.go (6)
vermeer/apps/master/schedules/scheduler_resource_manager.go (3)
SchedulerResourceManager (28-36)
WorkerOngoingStatusDeleted (21-21)
WorkerOngoingStatus (14-14)
vermeer/apps/master/schedules/scheduler_algorithm_manager.go (1)
SchedulerAlgorithmManager (33-37)
vermeer/apps/master/schedules/scheduler_task_manager.go (1)
SchedulerTaskManager (15-27)
vermeer/apps/master/schedules/scheduler_cron_manager.go (1)
SchedulerCronManager (15-20)
vermeer/apps/structure/task.go (1)
TaskInfo (41-64)
vermeer/apps/master/bl/task_canceler.go (1)
NewTaskCanceler (35-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Analyze (java)
- GitHub Check: Analyze (go)
- GitHub Check: check-dependency-license
🔇 Additional comments (2)
vermeer/apps/master/bl/scheduler_bl.go (2)
371-381: Changing the nil-task early return into a continue: done correctly. Consider adding a small one-off test that sends nil into startChan and verifies the scheduling goroutine does not terminate.
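A minimal one-off test along those lines, as a standalone sketch — the channel and consumer here only model waitingStartedTask and startChan, not the real types:

```go
package main

import (
	"fmt"
	"time"
)

type task struct{ id int32 }

// consume models the skip-on-nil behavior under review:
// a nil task is skipped, not a reason to return.
func consume(ch <-chan *task, handled chan<- int32) {
	for t := range ch {
		if t == nil {
			continue
		}
		handled <- t.id
	}
}

func main() {
	ch := make(chan *task, 2)
	handled := make(chan int32, 1)
	go consume(ch, handled)

	ch <- nil          // must not kill the goroutine
	ch <- &task{id: 7} // must still be handled

	select {
	case id := <-handled:
		fmt.Println("handled task", id) // prints 7
	case <-time.After(time.Second):
		fmt.Println("FAIL: consumer stopped after nil task")
	}
}
```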
523-555: External query methods are simple proxies: LGTM
```go
func (s *ScheduleBl) startTicker() {
	// Create a ticker with the specified interval
	ticker := time.Tick(time.Duration(s.tickerInterval) * time.Second)

	for range ticker {
		logrus.Debug("Ticker ticked")
		s.TryScheduleNextTasks()
	}
}
```
time.Tick leaks the underlying timer; use NewTicker with controllable shutdown
-func (s *ScheduleBl) startTicker() {
- // Create a ticker with the specified interval
- ticker := time.Tick(time.Duration(s.tickerInterval) * time.Second)
-
- for range ticker {
+func (s *ScheduleBl) startTicker() {
+ ticker := time.NewTicker(time.Duration(s.tickerInterval) * time.Second)
+ defer ticker.Stop()
+ for range ticker.C {
logrus.Debug("Ticker ticked")
s.TryScheduleNextTasks()
}
}🤖 Prompt for AI Agents
In vermeer/apps/master/bl/scheduler_bl.go around lines 122-130, replace the
time.Tick usage (which can leak timers) with time.NewTicker and ensure it is
stopped when the scheduler shuts down: create a ticker with
time.NewTicker(time.Duration(s.tickerInterval)*time.Second), range over
ticker.C, and call ticker.Stop() when the ScheduleBl is stopped (e.g. when a
provided stop channel is closed or context is cancelled); if ScheduleBl has no
shutdown signal, add one (a stop channel or context+cancel) so you can break the
loop and stop the ticker to avoid leaks.
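A standalone sketch of what the prompt describes end to end. The stop channel is an assumed addition (here modeled by a stand-in struct), not an existing ScheduleBl field:

```go
package main

import (
	"fmt"
	"time"
)

// scheduler models ScheduleBl plus the assumed extra field: a stop channel
// closed by a shutdown path so the ticker goroutine can exit cleanly.
type scheduler struct {
	tickerInterval int
	stop           chan struct{}
}

func (s *scheduler) tryScheduleNextTasks() { fmt.Println("tick: scheduling pass") }

func (s *scheduler) startTicker() {
	ticker := time.NewTicker(time.Duration(s.tickerInterval) * time.Second)
	defer ticker.Stop() // unlike time.Tick, the timer really is released
	for {
		select {
		case <-ticker.C:
			s.tryScheduleNextTasks()
		case <-s.stop:
			return
		}
	}
}

func main() {
	s := &scheduler{tickerInterval: 1, stop: make(chan struct{})}
	go s.startTicker()
	time.Sleep(2500 * time.Millisecond) // let two ticks fire
	close(s.stop)                       // shutdown: goroutine exits, timer stopped
	time.Sleep(100 * time.Millisecond)
}
```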
| logrus.Infof("scheduling task '%d' with type '%s' to start channel", task.ID, task.Type) | ||
| select { | ||
| case s.startChan <- task: | ||
| logrus.Infof("task '%d' sent to start channel", task.ID) | ||
| default: | ||
| errMsg := fmt.Sprintf("start channel is full, cannot schedule task %d", task.ID) | ||
| logrus.Errorf(errMsg) | ||
| taskMgr.SetError(task, errMsg) | ||
| } | ||
| } |
Marking the task as errored when startChan is full is unreasonable; just warn and let the next tick retry
Treating "queue full" as a fatal task error can hurt legitimate workloads. Warn and skip instead, and let the next scheduling pass retry.
- default:
- errMsg := fmt.Sprintf("start channel is full, cannot schedule task %d", task.ID)
- logrus.Errorf(errMsg)
- taskMgr.SetError(task, errMsg)
+ default:
+ logrus.Warnf("start channel is full, cannot schedule task %d; will retry next tick", task.ID)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| logrus.Infof("scheduling task '%d' with type '%s' to start channel", task.ID, task.Type) | |
| select { | |
| case s.startChan <- task: | |
| logrus.Infof("task '%d' sent to start channel", task.ID) | |
| default: | |
| errMsg := fmt.Sprintf("start channel is full, cannot schedule task %d", task.ID) | |
| logrus.Errorf(errMsg) | |
| taskMgr.SetError(task, errMsg) | |
| } | |
| } | |
| logrus.Infof("scheduling task '%d' with type '%s' to start channel", task.ID, task.Type) | |
| select { | |
| case s.startChan <- task: | |
| logrus.Infof("task '%d' sent to start channel", task.ID) | |
| default: | |
| logrus.Warnf("start channel is full, cannot schedule task %d; will retry next tick", task.ID) | |
| } |
🤖 Prompt for AI Agents
In vermeer/apps/master/bl/scheduler_bl.go around lines 200–209, currently when
s.startChan is full the code logs an error and calls taskMgr.SetError(task,...),
which wrongly marks the task as failed; instead, change the behavior to not set
the task error: log a warning (or increment a retry/drop metric) stating the
start channel is full and skip sending so the task remains schedulable and will
be retried on the next tick; leave the default case non-blocking and avoid
mutating task state or marking it as error.
```go
	defer s.Unlock(s.Lock())
	s.resourceManager.ChangeWorkerStatus(workerName, status)

	logrus.Infof("worker '%s' status changed to '%s'", workerName, status)
	// After changing the worker status, we may need to reschedule tasks
	s.TryScheduleNextTasks(true)

	return true, nil
}
```
ChangeWorkerStatus ignores the underlying return values; propagate the result and error upward
-func (s *ScheduleBl) ChangeWorkerStatus(workerName string, status schedules.WorkerOngoingStatus) (bool, error) {
+func (s *ScheduleBl) ChangeWorkerStatus(workerName string, status schedules.WorkerOngoingStatus) (bool, error) {
defer s.Unlock(s.Lock())
- s.resourceManager.ChangeWorkerStatus(workerName, status)
+ ok, err := s.resourceManager.ChangeWorkerStatus(workerName, status)
- logrus.Infof("worker '%s' status changed to '%s'", workerName, status)
+ if err != nil {
+ logrus.Errorf("change worker '%s' to '%s' failed: %v", workerName, status, err)
+ return false, err
+ }
+ logrus.Infof("worker '%s' status changed to '%s'", workerName, status)
// After changing the worker status, we may need to reschedule tasks
s.TryScheduleNextTasks(true)
- return true, nil
+ return ok, nil
}🤖 Prompt for AI Agents
In vermeer/apps/master/bl/scheduler_bl.go around lines 360 to 368, the call to
s.resourceManager.ChangeWorkerStatus(workerName, status) ignores its return
values and error; capture its return (e.g., success, err), handle the error by
logging and returning the error (or propagating it) instead of unconditionally
logging success and returning true, and ensure the deferred unlock still runs;
update the function to return the underlying result and error from
ChangeWorkerStatus when appropriate.
```go
	// stop the cron job if exists
	Scheduler.CancelCronTask(task)
```
CancelTask does not handle the error returned when cancelling the cron job
The return value of Scheduler.CancelCronTask(task) is ignored, which can swallow important failure information. At minimum, log a warning.
- // stop the cron job if exists
- Scheduler.CancelCronTask(task)
+ // stop the cron job if exists
+ if err := Scheduler.CancelCronTask(task); err != nil {
+ logrus.Warnf("cancel cron for task %d failed: %v", task.ID, err)
🤖 Prompt for AI Agents
In vermeer/apps/master/bl/task_bl.go around lines 202-204, the return
value/error from Scheduler.CancelCronTask(task) is ignored; capture its error
return and at minimum log a warning when cancellation fails (including task
identifier and error details) so failures aren't silently swallowed; adjust call
to check the returned error and use the existing logger to record a warning or
error with context.
```go
	_, errs := Scheduler.BatchQueueTask(tasksInfo)
	return errs
```
🧩 Analysis chain
Batch enqueue wiring is correct, but confirm the returned length and ordering are stable
Only errs is used here. Make sure the errs returned by Scheduler.BatchQueueTask corresponds 1:1 with the inputs and has a fixed length, so callers cannot misalign results.
🏁 Script executed:
#!/bin/bash
# Inspect all call sites and the signature of BatchQueueTask
rg -nP 'BatchQueueTask\s*\(' -C2

Length of output: 693
Further verification of BatchQueueTask's implementation and return-value semantics is needed.
Need to look at the actual, complete BatchQueueTask implementation.
🏁 Script executed:
#!/bin/bash
# View the complete implementation of BatchQueueTask
cat vermeer/apps/master/bl/scheduler_bl.go | grep -n "func (s \*ScheduleBl) BatchQueueTask" -A 50

Length of output: 1677
🏁 Script executed:
#!/bin/bash
# Search for other call sites that handle the []bool return value
rg -nP 'BatchQueueTask\s*\(' --type=go -A 3 -B 1

Length of output: 625
🏁 Script executed:
#!/bin/bash
# Check for comments or docs explaining the return values
rg -nP '@Return|return.*bool.*error' vermeer/apps/master/bl/ --type=go -B2

Length of output: 981
🏁 Script executed:
#!/bin/bash
# Look for test cases
fd -e _test.go vermeer/apps/master/bl/ -x grep -l "BatchQueueTask"

Length of output: 397
🏁 Script executed:
#!/bin/bash
# Inspect the task-enqueue logic
cat vermeer/apps/master/bl/scheduler_bl.go | grep -n "struct.TaskInfo\|QueueTask" -B2 -A2 | head -100

Length of output: 2005
🏁 Script executed:
#!/bin/bash
# Confirm how the return values are used across the file
rg -nP '\[\]bool|\[\]error' vermeer/apps/master/bl/scheduler_bl.go -B2 -A2

Length of output: 450
🏁 Script executed:
#!/bin/bash
# Find the concrete implementation of the queue-insert loop
rg -nP 'for.*range.*taskInfos|for.*len.*taskInfos' vermeer/apps/master/bl/scheduler_bl.go -B3 -A5

Length of output: 308
🏁 Script executed:
#!/bin/bash
# Confirm the length relationship between the two return values
rg -nP 'append.*true|append.*false|append.*err' vermeer/apps/master/bl/scheduler_bl.go -B2 -A2 | head -50

Length of output: 149
🏁 Script executed:
#!/bin/bash
# Search the code around BatchQueueTask across the whole file
cat vermeer/apps/master/bl/scheduler_bl.go | sed -n '300,350p'

Length of output: 1506
🏁 Script executed:
#!/bin/bash
# Confirm the calling context in task_bl.go
cat vermeer/apps/master/bl/task_bl.go | sed -n '260,280p'

Length of output: 650
🏁 Script executed:
#!/bin/bash
# Finally, check whether the return values are inspected after tasksInfo
cat vermeer/apps/master/bl/task_bl.go | sed -n '250,300p'

Length of output: 1585
🏁 Script executed:
#!/bin/bash
# Check whether the []bool returned by BatchQueueTask is used anywhere
rg -nP '(?<=BatchQueueTask.*\()\s*\[0-9a-zA-Z_.*\]\s*bool|:\s*=.*BatchQueueTask' vermeer/apps/master/bl/ --type=go

Length of output: 84
🏁 Script executed:
#!/bin/bash
# Check task_bl.go for a definition of BatchQueueTask
rg -nP 'func.*BatchQueueTask' vermeer/apps/master/bl/task_bl.go -B5 -A15

Length of output: 0
🏁 Script executed:
#!/bin/bash
# Inspect the meaning of the return values: any conditionals based on the bool?
rg -nP '\[.*\].*bool.*if|bool.*\[.*\].*if' vermeer/apps/master/bl/scheduler_bl.go -B3 -A3

Length of output: 0
🏁 Script executed:
#!/bin/bash
# Look more precisely at the append operations inside BatchQueueTask
ast-grep --pattern 'func (s *ScheduleBl) BatchQueueTask($_, $_) ($_) {
  $$$
  append($$$)
  $$$
}'

Length of output: 0
🏁 Script executed:
#!/bin/bash
# View the context of each return statement
rg -nP 'return.*\[\]bool.*\[\]error' vermeer/apps/master/bl/scheduler_bl.go -B20

Length of output: 664
🏁 Script executed:
#!/bin/bash
# Search the full batch-enqueue implementation to confirm the length relationship
sed -n '306,500p' vermeer/apps/master/bl/scheduler_bl.go

Length of output: 5945
🏁 Script executed:
#!/bin/bash
# Confirm the relationship between the input length and the return lengths
rg -nP 'len\(taskInfos\)|len\(.*error.*\)|len\(.*bool.*\)' vermeer/apps/master/bl/scheduler_bl.go -B2 -A2

Length of output: 344
🏁 Script executed:
#!/bin/bash
# Inspect how the task-handling function uses the return values
grep -n "TaskBatchQueueTask\|return errs\|BatchQueueTask" vermeer/apps/master/bl/task_bl.go -B5 -A5

Length of output: 396
🏁 Script executed:
#!/bin/bash
# Final confirmation: view the full function containing the problematic code in task_bl.go
grep -n "func.*{" vermeer/apps/master/bl/task_bl.go | grep -B1 "268:" || sed -n '250,290p' vermeer/apps/master/bl/task_bl.go

Length of output: 1201
🏁 Script executed:
#!/bin/bash
# Confirm the function name
sed -n '240,275p' vermeer/apps/master/bl/task_bl.go

Length of output: 1003
Fix the slice-length bug in BatchQueueTask (serious bug)
In scheduler_bl.go:316-325, the slices are pre-allocated with make([]error, len(taskInfos)) and then append() is called inside the loop, so both returned slices end up twice the input length rather than 1:1 with the inputs.

```go
// current (wrong)
errors := make([]error, len(taskInfos)) // length is already len(taskInfos)
oks := make([]bool, len(taskInfos))
for _, taskInfo := range taskInfos {
	ok, err := s.QueueTask(taskInfo)
	errors = append(errors, err) // doubles the length
	oks = append(oks, ok)
}
```

Use index assignment instead of append(), or start from an empty slice and then append().
🤖 Prompt for AI Agents
In vermeer/apps/master/bl/task_bl.go around lines 268-269 we call
Scheduler.BatchQueueTask; the real bug is in scheduler_bl.go around lines
316-325 where errors and oks are created with make([]..., len(taskInfos)) and
then appended to in a loop, which doubles the slice length. Fix by either (A)
allocate with length and use index assignment inside the loop (errors[i] = err,
oks[i] = ok) ensuring you iterate with index, or (B) allocate an empty slice
with capacity (make([]error, 0, len(taskInfos))) and continue using append; pick
one approach and remove the mixed make+append pattern so the returned slices
have one entry per input task.
```go
	defer t.Unlock(t.Lock())

	// Add the task to the task map
	t.allTaskMap[taskInfo.ID] = taskInfo
	t.allTaskQueue = append(t.allTaskQueue, taskInfo)
	t.notCompleteTasks[taskInfo.ID] = taskInfo
	t.AssignGroup(taskInfo)
	return true, nil
```
QueueTask ignores the AssignGroup error and does not guard against duplicate enqueues
- Duplicate task IDs are not checked, so an existing task can be overwritten.
- AssignGroup's returned error is unhandled; a failed group assignment is silently ignored.
Suggestion:
- // Add the task to the task map
- t.allTaskMap[taskInfo.ID] = taskInfo
- t.allTaskQueue = append(t.allTaskQueue, taskInfo)
- t.notCompleteTasks[taskInfo.ID] = taskInfo
- t.AssignGroup(taskInfo)
+ // reject duplicate
+ if _, exists := t.allTaskMap[taskInfo.ID]; exists {
+ return false, errors.New("task already queued")
+ }
+ t.allTaskMap[taskInfo.ID] = taskInfo
+ t.allTaskQueue = append(t.allTaskQueue, taskInfo)
+ t.notCompleteTasks[taskInfo.ID] = taskInfo
+ if err := t.AssignGroup(taskInfo); err != nil {
+ return false, err
+ }
return true, nil🤖 Prompt for AI Agents
In vermeer/apps/master/schedules/scheduler_task_manager.go around lines 56 to
63, QueueTask currently adds taskInfo to maps/queue without checking for
duplicate task IDs and calls AssignGroup while ignoring its error; change the
flow to first check if taskInfo.ID already exists and return a duplicate error
if so, then call AssignGroup and if it returns an error propagate that error (do
not add to maps/queue), and only after successful AssignGroup add the task to
allTaskMap, allTaskQueue and notCompleteTasks; alternatively, if you must call
AssignGroup after adding, capture its error and rollback by removing the task
from all maps/queue before returning the error.
```go
func (t *SchedulerTaskManager) GetTaskByID(taskID int32) (*structure.TaskInfo, error) {
	task, exists := t.allTaskMap[taskID]
	if !exists {
		return nil, errors.New("task not found")
	}
	return task, nil
}
```
GetTaskByID lacks concurrency protection
-func (t *SchedulerTaskManager) GetTaskByID(taskID int32) (*structure.TaskInfo, error) {
- task, exists := t.allTaskMap[taskID]
+func (t *SchedulerTaskManager) GetTaskByID(taskID int32) (*structure.TaskInfo, error) {
+ defer t.Unlock(t.Lock())
+ task, exists := t.allTaskMap[taskID]
if !exists {
return nil, errors.New("task not found")
}
return task, nil
}🤖 Prompt for AI Agents
vermeer/apps/master/schedules/scheduler_task_manager.go around lines 162-168:
GetTaskByID accesses t.allTaskMap without concurrency protection; wrap the map
read with the appropriate mutex (use t.mu.RLock() and t.mu.RUnlock() if t.mu is
an RWMutex, or t.mu.Lock()/Unlock() if only a Mutex) to prevent race conditions,
and if the SchedulerTaskManager struct lacks a mutex field add a sync.RWMutex
named mu and use it for this method.
```go
func (t *SchedulerTaskManager) GetAllTasks() []*structure.TaskInfo {
	tasks := make([]*structure.TaskInfo, 0, len(t.allTaskMap))
	for _, task := range t.allTaskMap {
		tasks = append(tasks, task)
	}
	return tasks
}
```
GetAllTasks accesses the map without a lock
-func (t *SchedulerTaskManager) GetAllTasks() []*structure.TaskInfo {
- tasks := make([]*structure.TaskInfo, 0, len(t.allTaskMap))
- for _, task := range t.allTaskMap {
+func (t *SchedulerTaskManager) GetAllTasks() []*structure.TaskInfo {
+ defer t.Unlock(t.Lock())
+ tasks := make([]*structure.TaskInfo, 0, len(t.allTaskMap))
+ for _, task := range t.allTaskMap {
tasks = append(tasks, task)
}
return tasks
🤖 Prompt for AI Agents
In vermeer/apps/master/schedules/scheduler_task_manager.go around lines 194 to
200, the GetAllTasks iterates over t.allTaskMap without any synchronization,
causing a data race; protect the map access by acquiring the manager's mutex
before iterating and releasing it after — preferably use a read lock
(t.mu.RLock() / t.mu.RUnlock()) if the struct has a sync.RWMutex named mu,
otherwise use t.mu.Lock() / t.mu.Unlock() for a sync.Mutex; keep the current
slice creation and append logic inside the locked region so the map cannot be
mutated concurrently.
```go
func (t *SchedulerTaskManager) GetAllTasksNotComplete() []*structure.TaskInfo {
	tasks := make([]*structure.TaskInfo, 0, len(t.allTaskMap))
	for _, task := range t.notCompleteTasks {
		tasks = append(tasks, task)
	}
	return tasks
}
```
GetAllTasksNotComplete accesses the map without a lock
-func (t *SchedulerTaskManager) GetAllTasksNotComplete() []*structure.TaskInfo {
- tasks := make([]*structure.TaskInfo, 0, len(t.allTaskMap))
- for _, task := range t.notCompleteTasks {
+func (t *SchedulerTaskManager) GetAllTasksNotComplete() []*structure.TaskInfo {
+ defer t.Unlock(t.Lock())
+ tasks := make([]*structure.TaskInfo, 0, len(t.notCompleteTasks))
+ for _, task := range t.notCompleteTasks {
tasks = append(tasks, task)
}
return tasks
🤖 Prompt for AI Agents
In vermeer/apps/master/schedules/scheduler_task_manager.go around lines 202-208,
GetAllTasksNotComplete iterates over t.notCompleteTasks and reads
len(t.allTaskMap) without holding a lock; protect accesses by acquiring the
scheduler's read lock (e.g. t.mu.RLock()) before reading len and iterating,
compute capacity and copy items while holding the lock, then release the lock
(t.mu.RUnlock()) before returning to avoid concurrent map access panics or
races.
```go
func (t *SchedulerTaskManager) GetTaskToWorkerGroupMap() map[int32]string {
	// Return a copy of the worker group map to avoid external modifications
	taskNotComplete := t.GetAllTasksNotComplete()
	groupMap := make(map[int32]string, len(taskNotComplete))
	for _, task := range taskNotComplete {
		if group, exists := t.taskToworkerGroupMap[task.ID]; exists {
			groupMap[task.ID] = group
		}
	}
	return groupMap
}
```
GetTaskToWorkerGroupMap reads shared state without the lock — a race
Iterate notCompleteTasks and build the copy while holding the lock.
-func (t *SchedulerTaskManager) GetTaskToWorkerGroupMap() map[int32]string {
- // Return a copy of the worker group map to avoid external modifications
- taskNotComplete := t.GetAllTasksNotComplete()
- groupMap := make(map[int32]string, len(taskNotComplete))
- for _, task := range taskNotComplete {
- if group, exists := t.taskToworkerGroupMap[task.ID]; exists {
- groupMap[task.ID] = group
- }
- }
- return groupMap
-}
+func (t *SchedulerTaskManager) GetTaskToWorkerGroupMap() map[int32]string {
+ defer t.Unlock(t.Lock())
+ groupMap := make(map[int32]string, len(t.notCompleteTasks))
+ for id := range t.notCompleteTasks {
+ if group, exists := t.taskToworkerGroupMap[id]; exists {
+ groupMap[id] = group
+ }
+ }
+ return groupMap
🤖 Prompt for AI Agents
In vermeer/apps/master/schedules/scheduler_task_manager.go around lines 255–265,
the method reads shared state without holding the SchedulerTaskManager lock
which can cause a race; modify GetTaskToWorkerGroupMap to acquire the manager's
mutex (use RLock for read-only access), iterate the internal not-complete tasks
and t.taskToworkerGroupMap while the lock is held to build the new map copy,
then defer the corresponding Unlock/RUnlock before returning so the returned map
is a safe snapshot created under the lock.
Force-pushed afe1c2b to 5ee6421 (Compare)
Actionable comments posted: 4
♻️ Duplicate comments (14)
vermeer/apps/master/bl/task_bl.go (3)
268-269: Confirm that batch results map 1:1 to the inputs (length and order)
Only errs is used here. Make sure the errs returned by Scheduler.BatchQueueTask(tasksInfo) matches the input length and is aligned with input order; otherwise callers cannot locate failed items. An earlier pass found the "pre-allocate then append, doubling the length" bug in the implementation — please confirm it is fixed.
68-116: Optional-parameter parsing is too fragile (no TrimSpace; empty strings cause failures)
priority/preorders/exclusive/cron_expr are not trimmed, and empty strings are treated as errors; common inputs such as "1, 2" or " true " fail the whole create request. TrimSpace first, skip empty strings, and only report genuinely invalid values; for preorders, trim each item and skip empty ones. See the fix below:

```diff
@@
 	if params != nil {
 		if priority, ok := params["priority"]; ok {
-			if p, err := strconv.ParseInt(priority, 10, 32); err == nil {
+			priority = strings.TrimSpace(priority)
+			if priority != "" {
+				if p, err := strconv.ParseInt(priority, 10, 32); err == nil {
 					if p < 0 {
 						return nil, fmt.Errorf("priority should be non-negative")
 					}
 					taskInfo.Priority = int32(p)
-			} else {
-				logrus.Warnf("priority convert to int32 error:%v", err)
-				return nil, err
+				} else {
+					logrus.Warnf("priority convert to int32 error:%v", err)
+					return nil, err
+				}
 			}
 		}
 		if preorders, ok := params["preorders"]; ok {
 			preorderList := strings.Split(preorders, ",")
 			for _, preorder := range preorderList {
+				preorder = strings.TrimSpace(preorder)
+				if preorder == "" {
+					continue
+				}
 				if pid, err := strconv.ParseInt(preorder, 10, 32); err == nil {
 					if taskMgr.GetTaskByID(int32(pid)) == nil {
 						return nil, fmt.Errorf("preorder task with ID %d does not exist", pid)
 					}
 					taskInfo.Preorders = append(taskInfo.Preorders, int32(pid))
 				} else {
 					logrus.Warnf("preorder convert to int32 error:%v", err)
 					return nil, err
 				}
 			}
 		}
 		if exclusive, ok := params["exclusive"]; ok {
-			if ex, err := strconv.ParseBool(exclusive); err == nil {
-				taskInfo.Exclusive = ex
-			} else {
-				logrus.Warnf("exclusive convert to bool error:%v", err)
-				return nil, err
-			}
+			exclusive = strings.TrimSpace(exclusive)
+			if exclusive != "" {
+				if ex, err := strconv.ParseBool(exclusive); err == nil {
+					taskInfo.Exclusive = ex
+				} else {
+					logrus.Warnf("exclusive convert to bool error:%v", err)
+					return nil, err
+				}
+			}
 		}
 		if cronExpr, ok := params["cron_expr"]; ok {
-			if err := Scheduler.cronManager.CheckCronExpression(cronExpr); err != nil {
-				logrus.Warnf("cron_expr parse error:%v", err)
-				return nil, err
-			}
-			taskInfo.CronExpr = cronExpr
+			cronExpr = strings.TrimSpace(cronExpr)
+			if cronExpr != "" {
+				if err := Scheduler.cronManager.CheckCronExpression(cronExpr); err != nil {
+					logrus.Warnf("cron_expr parse error:%v", err)
+					return nil, err
+				}
+				taskInfo.CronExpr = cronExpr
+			}
 		}
 	}
```
202-204: The error returned when cancelling the cron job is ignored, so failures can be swallowed
Scheduler.CancelCronTask(task)'s error is dropped, which makes troubleshooting hard. At minimum log a warning.

```diff
-	// stop the cron job if exists
-	Scheduler.CancelCronTask(task)
+	// stop the cron job if exists
+	if err := Scheduler.CancelCronTask(task); err != nil {
+		logrus.Warnf("cancel cron for task %d failed: %v", task.ID, err)
+	}
```

vermeer/test/scheduler/test_scheduler.go (1)
63-67: Wait for the second graph to finish loading before running the later scheduling tests
loadTest2.CheckGraph() is not called, yet graphName+"_2" is used right after, risking races and flaky failures. Restore the wait step.

```diff
-	// loadTest2.CheckGraph()
+	loadTest2.CheckGraph()
```

vermeer/apps/master/schedules/scheduler_resource_manager.go (2)
125-142: Avoid calling the external broker while holding the lock, to prevent blocking/deadlock
rm.broker.ApplyAgent is called right after defer rm.Unlock(rm.Lock()). If the external call blocks or tries to re-enter the lock, the whole system hangs. Apply first, then take the lock to update local state and assign the task.

```diff
 func (rm *SchedulerResourceManager) GetAgentAndAssignTask(taskInfo *structure.TaskInfo) (*Agent, AgentStatus, error) {
@@
-	defer rm.Unlock(rm.Lock())
-
-	agent, status, workers, err := rm.broker.ApplyAgent(taskInfo, !taskInfo.Exclusive)
+	// do not hold rm's lock across the external call
+	agent, status, workers, err := rm.broker.ApplyAgent(taskInfo, !taskInfo.Exclusive)
 	if err != nil {
 		return nil, AgentStatusError, err
 	}
 	if agent == nil {
 		return nil, status, nil
 	}
-	// Assign the task to the agent
-	agent.AssignTask(taskInfo)
+	// from here on only local state is updated; hold the lock
+	defer rm.Unlock(rm.Lock())
+	// Assign the task to the agent
+	agent.AssignTask(taskInfo)
@@
-	if _, exists := rm.runningWorkerGroupTasks[agent.GroupName()]; !exists && exclusive {
-		rm.runningWorkerGroupTasks[agent.GroupName()] = []int32{}
-		runningStatus = WorkerOngoingStatusRunning
-		rm.workerGroupStatus[agent.GroupName()] = runningStatus
-	} else {
-		runningStatus = WorkerOngoingStatusConcurrentRunning
-		rm.workerGroupStatus[agent.GroupName()] = runningStatus
-	}
+	if _, exists := rm.runningWorkerGroupTasks[agent.GroupName()]; !exists {
+		rm.runningWorkerGroupTasks[agent.GroupName()] = []int32{}
+	}
+	if exclusive && len(rm.runningWorkerGroupTasks[agent.GroupName()]) == 0 {
+		runningStatus = WorkerOngoingStatusRunning
+	} else {
+		runningStatus = WorkerOngoingStatusConcurrentRunning
+	}
+	rm.workerGroupStatus[agent.GroupName()] = runningStatus
```

(also covers the redundant initialization / status logic)
71-99: When releasing a task, do not delete the group status first; re-derive it and write it back, and avoid fetching group members repeatedly
The code currently runs delete(rm.workerGroupStatus, workerGroup) first and then still relies on that entry's existence/length — inconsistent logic — and it calls workerMgr.GetGroupWorkers multiple times. Instead: remove the task from runningWorkerGroupTasks first, decide the group's new status from the remaining count, then write it back once, caching the group members.

```diff
 func (rm *SchedulerResourceManager) ReleaseByTaskID(taskID int32) {
 	defer rm.Unlock(rm.Lock())
 	for workerGroup, status := range rm.workerGroupStatus {
 		if (status == WorkerOngoingStatusRunning || status == WorkerOngoingStatusConcurrentRunning) &&
 			rm.isTaskRunningOnWorkerGroup(workerGroup, taskID) {
-			delete(rm.workerGroupStatus, workerGroup)
 			if tasks, exists := rm.runningWorkerGroupTasks[workerGroup]; exists {
 				for i, id := range tasks {
 					if id == taskID {
 						rm.runningWorkerGroupTasks[workerGroup] = append(tasks[:i], tasks[i+1:]...)
-						if len(rm.runningWorkerGroupTasks[workerGroup]) == 0 {
-							delete(rm.runningWorkerGroupTasks, workerGroup)
-						}
 						break
 					}
 				}
 			}
-			if tasks, exists := rm.runningWorkerGroupTasks[workerGroup]; !exists || len(tasks) == 0 {
-				for _, worker := range workerMgr.GetGroupWorkers(workerGroup) {
-					rm.changeWorkerStatus(worker.Name, WorkerOngoingStatusIdle)
-				}
-			} else {
-				for _, worker := range workerMgr.GetGroupWorkers(workerGroup) {
-					rm.changeWorkerStatus(worker.Name, WorkerOngoingStatusConcurrentRunning)
-				}
-			}
+			gws := workerMgr.GetGroupWorkers(workerGroup)
+			newStatus := WorkerOngoingStatusConcurrentRunning
+			if tasks, exists := rm.runningWorkerGroupTasks[workerGroup]; !exists || len(tasks) == 0 {
+				newStatus = WorkerOngoingStatusIdle
+				delete(rm.runningWorkerGroupTasks, workerGroup)
+			}
+			rm.workerGroupStatus[workerGroup] = newStatus
+			for _, worker := range gws {
+				rm.changeWorkerStatus(worker.Name, newStatus)
+			}
 		}
 	}
 }
```

vermeer/test/scheduler/priority.go (1)
90-93: err is redeclared without being checked in between, so the first error can be missed
The first MakeComputeTask error is not checked before the second assignment, risking a false pass. Validate the first call's err immediately, or use separate variable names.

```diff
-	computeTest, err := functional.MakeComputeTask(computeTask)
-	computeTaskSmall, err := functional.MakeComputeTask(computeTask)
-	require.NoError(t, err)
+	computeTest, err := functional.MakeComputeTask(computeTask)
+	require.NoError(t, err)
+	computeTaskSmall, err := functional.MakeComputeTask(computeTask)
+	require.NoError(t, err)
```

vermeer/apps/master/bl/scheduler_bl.go (4)
122-130: time.Tick leaks the timer; switch to NewTicker with controllable shutdown

```diff
-func (s *ScheduleBl) startTicker() {
-	// Create a ticker with the specified interval
-	ticker := time.Tick(time.Duration(s.tickerInterval) * time.Second)
-
-	for range ticker {
+func (s *ScheduleBl) startTicker() {
+	ticker := time.NewTicker(time.Duration(s.tickerInterval) * time.Second)
+	defer ticker.Stop()
+	for range ticker.C {
 		logrus.Debug("Ticker ticked")
 		s.TryScheduleNextTasks()
 	}
 }
```
200-209: Marking the task as errored when startChan is full hurts legitimate workloads; just warn and wait for the next pass

```diff
 	default:
-		errMsg := fmt.Sprintf("start channel is full, cannot schedule task %d", task.ID)
-		logrus.Errorf(errMsg)
-		taskMgr.SetError(task, errMsg)
+		logrus.Warnf("start channel is full, cannot schedule task %d; will retry next tick", task.ID)
```
316-329: BatchQueueTask pre-allocates the slices and then appends, so the results are doubled in length with a zero-valued front half

```diff
-	errors := make([]error, len(taskInfos))
-	oks := make([]bool, len(taskInfos))
-
-	for _, taskInfo := range taskInfos {
-		ok, err := s.QueueTask(taskInfo)
-		if err != nil {
-			logrus.Errorf("failed to queue task '%d': %v", taskInfo.ID, err)
-		}
-		errors = append(errors, err)
-		oks = append(oks, ok)
-	}
+	errors := make([]error, len(taskInfos))
+	oks := make([]bool, len(taskInfos))
+	for i, taskInfo := range taskInfos {
+		ok, err := s.QueueTask(taskInfo)
+		if err != nil {
+			logrus.Errorf("failed to queue task '%d': %v", taskInfo.ID, err)
+		}
+		oks[i] = ok
+		errors[i] = err
+	}
```
357-368: ChangeWorkerStatus ignores the underlying return value and error, so success can be logged spuriously

```diff
 func (s *ScheduleBl) ChangeWorkerStatus(workerName string, status schedules.WorkerOngoingStatus) (bool, error) {
 	defer s.Unlock(s.Lock())
-	s.resourceManager.ChangeWorkerStatus(workerName, status)
-
-	logrus.Infof("worker '%s' status changed to '%s'", workerName, status)
+	ok, err := s.resourceManager.ChangeWorkerStatus(workerName, status)
+	if err != nil {
+		logrus.Errorf("change worker '%s' to '%s' failed: %v", workerName, status, err)
+		return false, err
+	}
+	logrus.Infof("worker '%s' status changed to '%s'", workerName, status)
 	// After changing the worker status, we may need to reschedule tasks
 	s.TryScheduleNextTasks(true)
-	return true, nil
+	return ok, nil
 }
```
466-488: CancelTask is not aligned with RemoveTask's new signature, nor does it use the error to decide queue membership

```diff
-	isHeadTask := s.taskManager.IsTaskOngoing(taskInfo.ID)
-	task := s.taskManager.RemoveTask(taskInfo.ID)
+	isHeadTask := s.taskManager.IsTaskOngoing(taskInfo.ID)
+	removedTask, rmErr := s.taskManager.RemoveTask(taskInfo.ID)
 	s.cronManager.DeleteTask(taskInfo.ID)
@@
-	isInQueue := false
-	if task != nil {
+	isInQueue := false
+	if rmErr == nil && removedTask != nil {
 		logrus.Infof("removed task '%d' from space queue", taskInfo.ID)
 		isInQueue = true
 	}
```
58-74: Init lacks error propagation and failure rollback; a failed initialization leaves an inconsistent state
None of the Manager.Init calls nor LoadConfig are error-checked or propagated, and the goroutines are started before success is verified, which can lead to nil pointers or half-initialized operation.

```diff
-func (s *ScheduleBl) Init() {
+func (s *ScheduleBl) Init() error {
 	logrus.Info("Initializing ScheduleBl...")
-	s.LoadConfig()
+	if err := s.LoadConfig(); err != nil {
+		return fmt.Errorf("load config failed: %w", err)
+	}
-	startChan := make(chan *structure.TaskInfo, s.startChanSize)
-	s.startChan = startChan
+	s.startChan = make(chan *structure.TaskInfo, s.startChanSize)
 	s.resourceManager = &schedules.SchedulerResourceManager{}
-	s.resourceManager.Init()
+	if err := s.resourceManager.Init(); err != nil { return err }
 	s.taskManager = &schedules.SchedulerTaskManager{}
-	s.taskManager.Init()
+	if err := s.taskManager.Init(); err != nil { return err }
 	s.algorithmManager = &schedules.SchedulerAlgorithmManager{}
-	s.algorithmManager.Init()
+	if err := s.algorithmManager.Init(); err != nil { return err }
 	s.cronManager = &schedules.SchedulerCronManager{}
-	s.cronManager.Init(s.QueueTaskFromTemplate)
+	if err := s.cronManager.Init(s.QueueTaskFromTemplate); err != nil { return err }
 	go s.startTicker()
 	go s.waitingStartedTask()
-}
+	return nil
+}
```

If the relevant Init methods currently return no error, either adjust their signatures or perform robustness self-checks here and return an error. Based on learnings
vermeer/apps/master/schedules/scheduler_algorithm_manager.go (1)
336-343: The builtin max requires Go 1.21; replace it with a local implementation for compatibility
resourceParam / max(1, graph.VertexCount+graph.EdgeCount) will fail to compile on older Go versions.

```diff
-	resourceCost = resourceParam / max(1, graph.VertexCount+graph.EdgeCount) // Avoid division by zero, ensure at least 1
+	den := graph.VertexCount + graph.EdgeCount
+	if den < 1 {
+		den = 1
+	}
+	resourceCost = resourceParam / den // Avoid division by zero, ensure at least 1
```

Or add a helper and reuse it:

```go
func maxI64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}
```
🧹 Nitpick comments (9)
vermeer/test/scheduler/batch.go (1)
36-41: Fill in the batch scheduling test case; otherwise this test is an empty shell
Currently it is only a placeholder comment, covering none of the key paths (BatchQueueTask, start-sequence verification, and so on), so regressions cannot be caught (for example, batch results mapping 1:1 to inputs). A minimal version: batch-enqueue several tasks on the same graph, poll the start sequence, and assert the FIFO/dependency order matches expectations; then add a case validating concurrency across graphs. Want a full test-case skeleton based on the existing client/functional wrappers? A rough sketch follows.
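A hedged skeleton of the 1:1 assertion. The submit callback stands in for whatever batch endpoint the suite adopts, so no concrete client API is assumed; the package name and assertion flow are illustrative:

```go
package scheduler

import (
	"testing"

	"github.com/stretchr/testify/require"
)

// testBatchSketch decouples the assertions from the concrete client call:
// `submit` wraps the real batch endpoint once it exists.
func testBatchSketch(t *testing.T, submit func([]map[string]string) ([]int32, []error)) {
	bodies := make([]map[string]string, 0, 3)
	for i := 0; i < 3; i++ {
		// Fresh map per element — see the aliasing note on priority.go below.
		bodies = append(bodies, map[string]string{"exclusive": "false"})
	}
	ids, errs := submit(bodies)
	require.Len(t, errs, len(bodies), "one error slot per input, in input order")
	require.Len(t, ids, len(bodies), "one task ID per input")
	for i, err := range errs {
		require.NoErrorf(t, err, "batch item %d failed to queue", i)
	}
}
```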
vermeer/apps/master/bl/task_bl.go (3)
23-26: Remove the unreachable MaxInt32 upper-bound check and simplify the parsing logic
strconv.ParseInt(priority, 10, 32) already enforces the 32-bit range, so the subsequent > math.MaxInt32 comparison can never trigger, and it drags in an unnecessary math dependency. Delete the check and the extra import.

```diff
@@
-import "math"
@@
-	if p > math.MaxInt32 {
-		return nil, fmt.Errorf("priority exceeds maximum value: %d", math.MaxInt32)
-	}
```

Also applies to: 74-84
198-200: Permission check: should admins be allowed to cancel other users' tasks?
Currently only the creator may cancel; admins are rejected too. If the product expects admins to manage tasks across users, restrict the creator check to non-admins:

```diff
-	if task.CreateUser != tb.Cred.User() {
+	if !tb.Cred.IsAdmin() && task.CreateUser != tb.Cred.User() {
 		return fmt.Errorf("cannot cancel the task with id '%v' as it was not created by you", taskID)
 	}
```
158-160: Guard the limit parameter to avoid out-of-range slicing
For robustness, clamp limit < 0 to zero first.

```diff
-	if len(tasks) > limit {
+	if limit < 0 {
+		limit = 0
+	}
+	if len(tasks) > limit {
 		return tasks[:limit], nil
 	}
```

vermeer/test/scheduler/test_scheduler.go (1)
20-30: Master address concatenation is not robust (the scheme can be duplicated)
"http://"+masterHttpAddr produces an invalid URL when masterHttpAddr already carries a scheme. Add the scheme only when needed.

```diff
@@
 import (
 	"fmt"
 	"net/http"
 	"testing"
 	"time"
+	"strings"
@@
-	masterHttp := client.VermeerClient{}
-	masterHttp.Init("http://"+masterHttpAddr, http.DefaultClient)
+	masterHttp := client.VermeerClient{}
+	addr := masterHttpAddr
+	if !strings.HasPrefix(masterHttpAddr, "http://") && !strings.HasPrefix(masterHttpAddr, "https://") {
+		addr = "http://" + masterHttpAddr
+	}
+	masterHttp.Init(addr, http.DefaultClient)
```

Also applies to: 49-51
vermeer/apps/master/schedules/scheduler_resource_manager.go (1)
208-238: Derive the group status more robustly: base it on whether any member is Running/Concurrent, not on "all Idle / all Concurrent"
The current "only change when all Idle or all Concurrent" rule can leave a stale status in mixed states. Suggested rule: if any member is Running => group Running; else if any member is ConcurrentRunning => group ConcurrentRunning; else Idle.

```diff
-	gws := workerMgr.GetGroupWorkers(groupName)
-	allIdle := true
-	allConcurrent := true
-	for _, w := range gws {
-		st := rm.workerStatus[w.Name]
-		if st != WorkerOngoingStatusIdle {
-			allIdle = false
-		}
-		if st != WorkerOngoingStatusConcurrentRunning {
-			allConcurrent = false
-		}
-	}
-	if allConcurrent || allIdle {
-		newStatus := WorkerOngoingStatusIdle
-		if allConcurrent {
-			newStatus = WorkerOngoingStatusConcurrentRunning
-		}
-	}
+	gws := workerMgr.GetGroupWorkers(groupName)
+	anyRunning := false
+	anyConcurrent := false
+	for _, w := range gws {
+		st := rm.workerStatus[w.Name]
+		if st == WorkerOngoingStatusRunning {
+			anyRunning = true
+		} else if st == WorkerOngoingStatusConcurrentRunning {
+			anyConcurrent = true
+		}
+	}
+	newStatus := WorkerOngoingStatusIdle
+	if anyRunning {
+		newStatus = WorkerOngoingStatusRunning
+	} else if anyConcurrent {
+		newStatus = WorkerOngoingStatusConcurrentRunning
+	}
 	logrus.Debugf("Change worker group '%s' status to '%s' (derived from %d workers)", groupName, newStatus, len(gws))
 	rm.changeWorkerGroupStatus(groupName, newStatus)
```

vermeer/test/scheduler/priority.go (1)
299-307: Fixed sleeps are brittle; poll for the cancelled state with a timeout instead
A fixed sleep can be too short or too long, causing flaky failures or slow runs. Poll within the timeout window until the state is Canceled.

```diff
-	logrus.Info("Waiting for tasks to settle...")
-	time.Sleep(time.Duration(waitSecond) * time.Second)
+	logrus.Info("Waiting for tasks to settle...")
+	deadline := time.Now().Add(time.Duration(waitSecond) * time.Second)
+	for {
+		checkTask, err := masterHttp.GetTask(int(submittedTaskIDs[numTasks-1]))
+		require.NoError(t, err, "Error fetching task status after cancellation.")
+		if checkTask != nil && structure.TaskState(checkTask.Task.Status) == structure.TaskStateCanceled {
+			break
+		}
+		if time.Now().After(deadline) {
+			require.Fail(t, "Timeout waiting for task to be canceled")
+			break
+		}
+		time.Sleep(200 * time.Millisecond)
+	}
-	checkTask, err := masterHttp.GetTask(int(submittedTaskIDs[numTasks-1]))
-	require.NoError(t, err, "Error fetching task status after cancellation.")
-	require.NotNil(t, checkTask, "Task should exist after cancellation.")
-
-	if structure.TaskState(checkTask.Task.Status) != structure.TaskStateCanceled {
-		logrus.Warn("No tasks were cancelled; check scheduler behavior.")
-		require.Fail(t, "Expected at least some tasks to be cancelled.")
-	}
```

vermeer/apps/master/bl/scheduler_bl.go (2)
332-350: CloseCurrent does not handle ChangeWorkerStatus's error return

```diff
-	s.ChangeWorkerStatus(workerName, schedules.WorkerOngoingStatusDeleted)
+	if ok, err := s.ChangeWorkerStatus(workerName, schedules.WorkerOngoingStatusDeleted); err != nil || !ok {
+		return fmt.Errorf("remove worker '%s' failed: %w", workerName, err)
+	}
```
423-427: TOCTOU risk: the state check and state change are not atomic
Checking for Waiting and then calling SetState to Created is not atomic; under a race, another path may modify the state in between. Provide a CAS/atomic state-transition API on the task manager, or perform both the check and the transition inside a single locked region. A sketch follows.
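A standalone sketch of the CAS idea, with stand-in types (taskState and manager model structure.TaskState and the task manager; compareAndSetState is a hypothetical method, not existing API):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type taskState int

const (
	stateWaiting taskState = iota
	stateCreated
)

type task struct{ state taskState }

type manager struct {
	mu    sync.Mutex
	tasks map[int32]*task
}

// compareAndSetState performs the check and the transition under one lock,
// closing the TOCTOU window between "is it Waiting?" and "mark it Created".
func (m *manager) compareAndSetState(id int32, from, to taskState) (bool, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	t, ok := m.tasks[id]
	if !ok {
		return false, errors.New("task not found")
	}
	if t.state != from {
		return false, nil // lost the race; caller skips or retries
	}
	t.state = to
	return true, nil
}

func main() {
	m := &manager{tasks: map[int32]*task{1: {state: stateWaiting}}}
	ok, err := m.compareAndSetState(1, stateWaiting, stateCreated)
	fmt.Println(ok, err) // true <nil>
	ok, _ = m.compareAndSetState(1, stateWaiting, stateCreated)
	fmt.Println(ok) // false: already transitioned
}
```

The scheduler would then skip a task whenever the transition reports false, instead of acting on a stale Waiting check.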
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
vermeer/apps/master/bl/scheduler_bl.go (6 hunks)
vermeer/apps/master/bl/task_bl.go (4 hunks)
vermeer/apps/master/schedules/scheduler_algorithm_manager.go (1 hunks)
vermeer/apps/master/schedules/scheduler_cron_manager.go (1 hunks)
vermeer/apps/master/schedules/scheduler_resource_manager.go (1 hunks)
vermeer/apps/master/schedules/scheduler_task_manager.go (1 hunks)
vermeer/docker-compose.yaml (1 hunks)
vermeer/test/scheduler/batch.go (1 hunks)
vermeer/test/scheduler/priority.go (1 hunks)
vermeer/test/scheduler/routine.go (1 hunks)
vermeer/test/scheduler/test_scheduler.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- vermeer/docker-compose.yaml
- vermeer/apps/master/schedules/scheduler_task_manager.go
- vermeer/test/scheduler/routine.go
🧰 Additional context used
🧬 Code graph analysis (7)
vermeer/test/scheduler/batch.go (1)
vermeer/client/client.go (1)
VermeerClient(31-36)
vermeer/test/scheduler/test_scheduler.go (6)
vermeer/client/client.go (1)
VermeerClient (31-36)
vermeer/test/functional/load_local.go (1)
LoadTaskLocal (29-31)
vermeer/test/scheduler/priority.go (1)
TestPriority (325-366)
vermeer/test/scheduler/batch.go (1)
TestBatch (36-41)
vermeer/test/scheduler/routine.go (1)
TestRoutine (73-81)
vermeer/test/functional/http_interface.go (2)
CancelTask (56-57)
DeleteGraph (32-33)
vermeer/apps/master/schedules/scheduler_resource_manager.go (2)
vermeer/apps/master/schedules/broker.go (4)
Broker (55-59)
Agent (41-44)
AgentStatus (29-29)
AgentStatusError (33-33)
TaskInfo(41-64)
vermeer/apps/master/schedules/scheduler_cron_manager.go (1)
vermeer/apps/structure/task.go (1)
TaskInfo(41-64)
vermeer/test/scheduler/priority.go (4)
vermeer/client/client.go (1)
VermeerClient(31-36)vermeer/test/functional/compute_task.go (1)
MakeComputeTask(48-129)vermeer/test/functional/load_local.go (1)
LoadTaskLocal(29-31)vermeer/test/functional/http_interface.go (1)
CancelTask(56-57)
vermeer/apps/master/bl/scheduler_bl.go (6)
vermeer/apps/master/schedules/scheduler_resource_manager.go (3)
SchedulerResourceManager(45-53)WorkerOngoingStatusDeleted(38-38)WorkerOngoingStatus(31-31)vermeer/apps/master/schedules/scheduler_algorithm_manager.go (1)
SchedulerAlgorithmManager(50-54)vermeer/apps/master/schedules/scheduler_task_manager.go (1)
SchedulerTaskManager(32-44)vermeer/apps/master/schedules/scheduler_cron_manager.go (1)
SchedulerCronManager(32-37)vermeer/apps/structure/task.go (1)
TaskInfo(41-64)vermeer/apps/master/bl/task_canceler.go (1)
NewTaskCanceler(35-51)
vermeer/apps/master/schedules/scheduler_algorithm_manager.go (1)
vermeer/apps/structure/task.go (1)
TaskInfo(41-64)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Analyze (go)
- GitHub Check: Analyze (java)
- GitHub Check: check-dependency-license
🔇 Additional comments (3)
vermeer/apps/master/schedules/scheduler_algorithm_manager.go (1)
515-516: No fallback implementation needed; the toolchain already supports it
Verification shows the project uses Go 1.25.1, well above the 1.21 minimum that slices.Contains() requires. The toolchain fully supports this usage; there is no compatibility issue.
vermeer/test/scheduler/priority.go (1)
223-237: The earlier review comment does not hold — the implementation explicitly returns -1 and the assertion is correctly aligned
Verification shows SendComputeReqAsyncNotWaitWithError in vermeer/test/functional/compute_base.go:129 explicitly defines the error return: return -1, err. The test's assertion on taskID == -1 is correctly aligned with the implementation contract. -1 is an explicit sentinel value, not an arbitrary choice. No change needed. Likely an incorrect or invalid review comment.
vermeer/apps/master/bl/scheduler_bl.go (1)
231-246: Two different task-manager types are mixed; clarify the responsibility boundary or unify them behind one interface
Verification confirms that taskMgr (the global structure.TaskManager) and s.taskManager (the instance field *schedules.SchedulerTaskManager) really are different types, handling persistence/state management and scheduling operations respectively. The current code mixes the two in QueueTask and other methods:
- the global taskMgr for SetState/GetTaskByID/SetError operations
- the instance s.taskManager for scheduling operations such as QueueTask
This reflects an architectural design issue rather than a simple refactor. Suggestions (see the sketch below):
- Clarify the responsibility boundary and interaction pattern between the two managers
- Consider giving SchedulerTaskManager a unified operations interface, reducing direct dependence on the external taskMgr
- Or fold task-state management into SchedulerTaskManager and route all operations through the instance
```go
// Need to put DependsSchedulerAlgorithm before WaitingSchedulerAlgorithm
func (am *SchedulerAlgorithmManager) Init() {
	am.filteredSchedulerAlgorithms = make(map[string]SchedulerAlgorithm)
	am.scheduledSchedulerAlgorithms = make(map[string]SchedulerAlgorithm)
	am.dispatchPaused = false
	// Register filter and schedule algorithms
	am.RegisterFilterAlgorithm(&DependsSchedulerAlgorithm{})
	am.RegisterFilterAlgorithm(&WaitingSchedulerAlgorithm{})
	// Register default SchedulerAlgorithms
	am.RegisterSchedulerAlgorithm(&PriorityElderSchedulerAlgorithm{})
}
```
Storing filter algorithms in a map makes execution order non-deterministic, violating the "Depends before Waiting" design constraint
Go maps are unordered, so the required ordering cannot be guaranteed, and a task whose dependencies are unmet may be selected early.
Keep the filters in a slice to fix their order, and execute them in that order:
-type SchedulerAlgorithmManager struct {
- filteredSchedulerAlgorithms map[string]SchedulerAlgorithm
- scheduledSchedulerAlgorithms map[string]SchedulerAlgorithm
- dispatchPaused bool
-}
+type SchedulerAlgorithmManager struct {
+ filterFilters []SchedulerAlgorithm // 按顺序执行
+ scheduler SchedulerAlgorithm // 仅支持一个调度器
+ dispatchPaused bool
+}
@@
-func (am *SchedulerAlgorithmManager) Init() {
- am.filteredSchedulerAlgorithms = make(map[string]SchedulerAlgorithm)
- am.scheduledSchedulerAlgorithms = make(map[string]SchedulerAlgorithm)
+func (am *SchedulerAlgorithmManager) Init() error {
+ am.filterFilters = make([]SchedulerAlgorithm, 0, 4)
am.dispatchPaused = false
// Register filter and schedule algorithms
- am.RegisterFilterAlgorithm(&DependsSchedulerAlgorithm{})
- am.RegisterFilterAlgorithm(&WaitingSchedulerAlgorithm{})
+ am.RegisterFilterAlgorithm(&DependsSchedulerAlgorithm{})
+ am.RegisterFilterAlgorithm(&WaitingSchedulerAlgorithm{})
// Register default SchedulerAlgorithms
- am.RegisterSchedulerAlgorithm(&PriorityElderSchedulerAlgorithm{})
+ am.RegisterSchedulerAlgorithm(&PriorityElderSchedulerAlgorithm{})
+ return nil
}
@@
-func (am *SchedulerAlgorithmManager) RegisterSchedulerAlgorithm(schedulerAlgorithm SchedulerAlgorithm) {
+func (am *SchedulerAlgorithmManager) RegisterSchedulerAlgorithm(schedulerAlgorithm SchedulerAlgorithm) {
if schedulerAlgorithm == nil {
return
}
- name := schedulerAlgorithm.Name()
- if _, exists := am.scheduledSchedulerAlgorithms[name]; exists {
- return // SchedulerAlgorithm already registered
- }
- // only support one scheduling algorithm for now
- if len(am.scheduledSchedulerAlgorithms) > 0 {
- return // Only one scheduling algorithm can be registered
- }
+ if am.scheduler != nil { return }
schedulerAlgorithm.Init()
- am.scheduledSchedulerAlgorithms[name] = schedulerAlgorithm
+ am.scheduler = schedulerAlgorithm
}
@@
-func (am *SchedulerAlgorithmManager) RegisterFilterAlgorithm(filterAlgorithm SchedulerAlgorithm) {
+func (am *SchedulerAlgorithmManager) RegisterFilterAlgorithm(filterAlgorithm SchedulerAlgorithm) {
if filterAlgorithm == nil {
return
}
- name := filterAlgorithm.Name()
- if _, exists := am.filteredSchedulerAlgorithms[name]; exists {
- return // SchedulerAlgorithm already registered
- }
filterAlgorithm.Init()
- am.filteredSchedulerAlgorithms[name] = filterAlgorithm
+ am.filterFilters = append(am.filterFilters, filterAlgorithm)
}
@@
- filteredTasks := allTasks
- for _, algorithm := range am.filteredSchedulerAlgorithms {
+ filteredTasks := allTasks
+ for _, algorithm := range am.filterFilters {
var err error
filteredTasks, err = algorithm.FilterNextTasks(filteredTasks, taskToWorkerGroupMap, idleWorkerGroups, concurrentWorkerGroups, softSchedule)
if err != nil {
return nil, err
}
}
@@
- for _, algorithm := range am.scheduledSchedulerAlgorithms {
- tasks, err := algorithm.ScheduleNextTasks(filteredTasks, taskToWorkerGroupMap, idleWorkerGroups, concurrentWorkerGroups, softSchedule)
+ if am.scheduler != nil {
+ tasks, err := am.scheduler.ScheduleNextTasks(filteredTasks, taskToWorkerGroupMap, idleWorkerGroups, concurrentWorkerGroups, softSchedule)
if err != nil {
return nil, err
}
return tasks, nil // Return the scheduled tasks
}Also applies to: 153-160
🤖 Prompt for AI Agents
In vermeer/apps/master/schedules/scheduler_algorithm_manager.go around lines
60-70 (and similarly at lines 153-160) the filter algorithms are stored in a map
which makes execution order non-deterministic and violates the "Depends before
Waiting" constraint; change the implementation to maintain filter algorithms in
a slice (to preserve registration order) and iterate that slice in
registration/execution order, while optionally keeping a map only for fast
existence/lookup if needed; update RegisterFilterAlgorithm and any iteration
sites to append to the slice instead of inserting into the map, ensure
removal/lookup logic accounts for both structures if both are used, and add
tests or comments to assert the required ordering (DependsSchedulerAlgorithm
must be registered before WaitingSchedulerAlgorithm).
```go
import (
	"errors"
	"vermeer/apps/structure"

	"github.com/robfig/cron/v3"
	"github.com/sirupsen/logrus"
)
```
Missing concurrency safety: the internal maps are read and written without a lock, risking data races and runtime crashes
cronTasks/crons are accessed concurrently in Add/Delete/ByGraph without locking, which easily triggers "concurrent map read and map write". Introduce an RWMutex: take Lock for writes, RLock for scans, and in DeleteTaskByGraph collect the IDs first, then delete them one by one.
@@
-import (
+import (
     "errors"
     "vermeer/apps/structure"

     "github.com/robfig/cron/v3"
     "github.com/sirupsen/logrus"
+    "sync"
 )
@@
 type SchedulerCronManager struct {
+    mu        sync.RWMutex
     cronTasks map[int32][]*structure.TaskInfo // cron expression to TaskInfo. Origin task ID to copied tasks
     crons     map[int32][]*cron.Cron          // cron expression to cron jobs
     // queueTemplateHandler is a function that handles the task queue
     queueTemplateHandler func(*structure.TaskInfo) (int32, error)
 }
@@
 func (t *SchedulerCronManager) AddCronTask(taskInfo *structure.TaskInfo) error {
+    t.mu.Lock()
+    defer t.mu.Unlock()
     if taskInfo == nil {
         return errors.New("the argument `taskInfo` is nil")
     }
@@
     t.cronTasks[taskInfo.ID] = append(t.cronTasks[taskInfo.ID], taskInfo)
     t.crons[taskInfo.ID] = append(t.crons[taskInfo.ID], cronJob)
     cronJob.Start()
     logrus.Infof("Added cron task for task ID %d with expression %s", taskInfo.ID, taskInfo.CronExpr)
     return nil
 }
@@
 func (t *SchedulerCronManager) DeleteTask(taskID int32) error {
+    t.mu.Lock()
+    defer t.mu.Unlock()
     if _, exists := t.cronTasks[taskID]; !exists {
         return errors.New("task not found in cron tasks")
     }
@@
 func (t *SchedulerCronManager) DeleteTaskByGraph(spaceName, graphName string) error {
     if spaceName == "" || graphName == "" {
         return errors.New("the argument `spaceName` or `graphName` is empty")
     }
-    var toDelete []int32
-    for taskID, tasks := range t.cronTasks {
-        for _, task := range tasks {
-            if task.SpaceName == spaceName && task.GraphName == graphName {
-                toDelete = append(toDelete, taskID)
-                break
-            }
-        }
-    }
+    var toDelete []int32
+    t.mu.RLock()
+    for taskID, tasks := range t.cronTasks {
+        for _, task := range tasks {
+            if task.SpaceName == spaceName && task.GraphName == graphName {
+                toDelete = append(toDelete, taskID)
+                break
+            }
+        }
+    }
+    t.mu.RUnlock()
     for _, taskID := range toDelete {
         if err := t.DeleteTask(taskID); err != nil {
             logrus.Errorf("Failed to delete cron task for task ID %d: %v", taskID, err)
             return err
         }
     }
     logrus.Infof("Deleted cron tasks for space %s and graph %s", spaceName, graphName)
     return nil
 }

Also applies to: 32-37, 75-109, 117-129, 138-161
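As a self-contained illustration of the locking pattern suggested above, the sketch below reduces the manager to a mutex-guarded map. cronRegistry and its fields are hypothetical stand-ins, not the PR's actual SchedulerCronManager:

package main

import (
    "fmt"
    "sync"
)

// cronRegistry is a simplified stand-in: task ID -> graph name.
type cronRegistry struct {
    mu    sync.RWMutex
    tasks map[int32]string
}

func (r *cronRegistry) add(id int32, graph string) {
    r.mu.Lock() // writers take the exclusive lock
    defer r.mu.Unlock()
    r.tasks[id] = graph
}

func (r *cronRegistry) delete(id int32) {
    r.mu.Lock()
    defer r.mu.Unlock()
    delete(r.tasks, id)
}

// deleteByGraph collects matching IDs under the read lock, then releases
// it before calling delete; holding RLock while delete tries to take
// Lock in the same goroutine would deadlock a sync.RWMutex.
func (r *cronRegistry) deleteByGraph(graph string) {
    var toDelete []int32
    r.mu.RLock()
    for id, g := range r.tasks {
        if g == graph {
            toDelete = append(toDelete, id)
        }
    }
    r.mu.RUnlock()
    for _, id := range toDelete {
        r.delete(id)
    }
}

func main() {
    r := &cronRegistry{tasks: map[int32]string{1: "g1", 2: "g2", 3: "g1"}}
    r.add(4, "g2")
    r.deleteByGraph("g1")
    fmt.Println(len(r.tasks)) // 2: only the "g2" entries remain
}

The collect-then-delete split is what makes the suggested DeleteTaskByGraph safe: scanning and mutating under a single lock scope would either race or self-deadlock.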
params := make([]map[string]string, 0)
// default is false, actually do not need to set
taskComputeBody["exclusive"] = "false"
params = append(params, taskComputeBody)
params = append(params, taskComputeBody)
Appending the same map instance to the slice twice creates aliasing and data-race risks
params appends the same map reference twice in succession; any later mutation is visible through both entries, causing test flakiness that is hard to trace.
 params = append(params, taskComputeBody)
-params = append(params, taskComputeBody)
+body2 := make(map[string]string, len(taskComputeBody))
+for k, v := range taskComputeBody {
+    body2[k] = v
+}
+params = append(params, body2)
🤖 Prompt for AI Agents
In vermeer/test/scheduler/priority.go around lines 137 to 142, the code appends
the same map instance (taskComputeBody) twice into params, causing aliasing and
race/test flakiness; replace the second append with a fresh map copy (or
construct a new map) so each element in params is an independent map (i.e.,
clone taskComputeBody into a new map before appending or build two separate map
instances), and optionally omit setting "exclusive" when default false.
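The aliasing failure mode is easy to reproduce in isolation. This runnable sketch (hypothetical keys, unrelated to the actual test suite) shows two slice entries backed by one map changing together, and a copy decoupling them:

package main

import "fmt"

func main() {
    body := map[string]string{"exclusive": "false"}

    // Appending the same map twice stores two references to one map.
    params := []map[string]string{body, body}
    params[0]["priority"] = "high"
    fmt.Println(params[1]["priority"]) // "high": both entries changed

    // Copying before the second append gives each entry its own map.
    copyBody := make(map[string]string, len(body))
    for k, v := range body {
        copyBody[k] = v
    }
    safe := []map[string]string{body, copyBody}
    safe[0]["priority"] = "low"
    fmt.Println(safe[1]["priority"]) // still "high": the copy is independent
}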
go func(body map[string]string) {
    defer wg.Done()
    taskID := computeTest.SendComputeReqAsyncNotWait(body)
    if taskID != 0 {
        taskIDs <- taskID
    } else {
        logrus.Errorf("Failed to submit task: %v", err)
    }
}(taskBodies[i])
The goroutine logs the stale outer variable err, producing a misleading message
The err printed here is not the cause of this submission failure. Log the taskID, or switch to an API that returns an error; if no error is returned, log only the failing request body.
 if taskID != 0 {
     taskIDs <- taskID
 } else {
-    logrus.Errorf("Failed to submit task: %v", err)
+    logrus.Errorf("Failed to submit task, body: %+v", body)
 }
🤖 Prompt for AI Agents
In vermeer/test/scheduler/priority.go around lines 272 to 280, the goroutine
logs an outer variable err when task submission fails which is misleading;
update the code to stop referencing the outer err and instead either (a) change
SendComputeReqAsyncNotWait to return (taskID int, err error) and log the
returned err on failure, or (b) if the function only returns an int, log the
failing taskID (0) and include the submitted body for context (e.g.,
logrus.Errorf("Failed to submit task, body=%v, got taskID=%d", body, taskID)).
Ensure wg.Done remains in defer and that the correct local variables are used in
the log call.
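Option (a) from the prompt, returning an error from the submit call, might look like the sketch below. Here submit is a hypothetical stand-in for SendComputeReqAsyncNotWait, not the test helper's real signature:

package main

import (
    "fmt"
    "sync"
)

// submit is a hypothetical replacement that returns an error, so the
// goroutine can report why a submission failed.
func submit(body map[string]string) (int32, error) {
    if body["task_type"] == "" {
        return 0, fmt.Errorf("missing task_type in body %v", body)
    }
    return 42, nil
}

func main() {
    bodies := []map[string]string{
        {"task_type": "compute"},
        {}, // will fail and be logged with its own error
    }

    var wg sync.WaitGroup
    taskIDs := make(chan int32, len(bodies))
    for _, b := range bodies {
        wg.Add(1)
        go func(body map[string]string) { // pass body in; never reuse an outer err
            defer wg.Done()
            taskID, err := submit(body)
            if err != nil { // log the error from *this* call
                fmt.Printf("failed to submit task, body=%v: %v\n", body, err)
                return
            }
            taskIDs <- taskID
        }(b)
    }
    wg.Wait()
    close(taskIDs)
    for id := range taskIDs {
        fmt.Println("submitted task", id)
    }
}

Passing body as the goroutine argument also sidesteps loop-variable capture, which is the same class of mistake as reusing the outer err.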
Purpose of the PR
Main Changes
Verifying these changes
I have written some test scripts in the /test folder.
Does this PR potentially affect the following parts?
Documentation Status
Doc - TODO
Doc - Done
Doc - No Need
Summary by CodeRabbit