
Conversation

@Future-Outlier (Member) commented Jan 2, 2026

Co-authored-by: @chiayi [email protected]
Co-authored-by: @KunWuLuan [email protected]

Why are these changes needed?

This web server serves the history server's frontend and fetches data from the event server (processor).
As a follow-up, we should enable autoscaling for the web server using Kubernetes HPA.

Note: most of the code is copied from #4187 and #4253.

Diagram

How it works at a high level:

flowchart TB
    subgraph storage [MinIO S3 Storage]
        S3[(S3 Bucket)]
        JobEvents[job_events/]
        NodeEvents[node_events/]
    end

    subgraph historyserver [History Server - Modified]
        Main[main.go]
        SH[ServerHandler]
        EH[EventHandler]
        SR[StorageReader]
        
        Main --> SH
        Main -->|"go EH.Run()"| EH
        SH -->|"Reference"| EH
        EH --> SR
        SR --> S3
        SR --> JobEvents
        SR --> NodeEvents
        
        EH --> TaskMap[ClusterTaskMap]
        EH --> ActorMap[ClusterActorMap]
    end

    subgraph endpoints [New Endpoints]
        T1["/api/v0/tasks (from EventHandler)"]
        A1["/logical/actors (from EventHandler)"]
    end
    
    SH --> T1
    SH --> A1
    T1 -.->|"Query"| TaskMap
    A1 -.->|"Query"| ActorMap

Screenshot proof

Take http://localhost:8080/api/v0/tasks as an example.

image

Logs in the historyserver:

image

How to test and develop in your local environment

  1. Check out this branch.
  2. kind create cluster --image=kindest/node:v1.29.0
  3. Build your ray-operator and run it (either a binary or a deployment works).
  4. kubectl apply -f historyserver/config/minio.yaml
  5. Build the collector and history server images and load them into your k8s cluster:
    1. cd historyserver
    2. make localimage-historyserver; kind load docker-image historyserver:v0.1.0
    3. make localimage-collector; kind load docker-image collector:v0.1.0
  6. kubectl apply -f historyserver/config/raycluster.yaml
  7. kubectl apply -f historyserver/config/rayjob.yaml
  8. kubectl delete -f historyserver/config/raycluster.yaml
  9. kubectl apply -f config/historyserver.yaml
  10. Hit the history server's endpoints:
    1. kubectl port-forward svc/historyserver 8080:30080
    2. curl -c cookies.txt "http://localhost:8080/enter_cluster/default/raycluster-historyserver/session_2026-01-06_07-07-00_383444_1"
    3. cat cookies.txt
    4. curl -b cookies.txt http://localhost:8080/api/v0/tasks
    5. Note: change the session directory to the correct one; log in to the MinIO console to find the right session.
      1. ref: https://github.com/ray-project/kuberay/blob/master/historyserver/docs/set_up_collector.md#deploy-minio-for-log-and-event-storage
  11. You can test the following endpoints:
echo "=== Health Check ==="
curl "http://localhost:8080/readz"
curl "http://localhost:8080/livez"

echo "=== Clusters List ==="
curl "http://localhost:8080/clusters"

SESSION="session_2026-01-08_07-01-21_915465_1" # change to your session
curl -c ~/cookies.txt "http://localhost:8080/enter_cluster/default/raycluster-historyserver/$SESSION"

echo "=== All Tasks ==="
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks"

echo "=== Tasks by job_id ==="
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks?filter_keys=job_id&filter_values=01000000"

<img width="793" height="272" alt="image" src="https://github.com/user-attachments/assets/e8258fa2-c6fa-4ec9-90a1-657cbbef2c44" />

echo "=== Task by task_id ==="
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks?filter_keys=task_id&filter_values=YOUR_TASK_ID"

<img width="785" height="150" alt="image" src="https://github.com/user-attachments/assets/058f8f5a-072c-493f-83d9-c59e47635b2c" />


echo "=== All Actors ==="
curl -b ~/cookies.txt "http://localhost:8080/logical/actors"

<img width="798" height="163" alt="image" src="https://github.com/user-attachments/assets/67b57583-4c75-4939-b8e2-a11c91505605" />

echo "=== Single Actor ==="
curl -b ~/cookies.txt "http://localhost:8080/logical/actors/YOUR_ACTOR_ID"
<img width="787" height="483" alt="image" src="https://github.com/user-attachments/assets/c2886452-5927-483b-86db-7079220aaae0" />

echo "=== Nodes ==="
curl -b ~/cookies.txt "http://localhost:8080/nodes?view=summary" | jq .

<img width="781" height="447" alt="image" src="https://github.com/user-attachments/assets/80c752ab-b0c1-46d5-9d4a-208bdee937c4" />



Related issue number

#3966

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

chiayi and others added 3 commits December 3, 2025 10:58
Future-Outlier and others added 5 commits January 6, 2026 16:13
@Future-Outlier (Member, Author) left a comment

cc @chiayi @KunWuLuan
to help review, thank you!

@Future-Outlier added the P0 label ("Critical issue that should be fixed ASAP") on Jan 7, 2026
@Future-Outlier changed the title from "[WIP][history server] Web Server" to "[history server] Web Server + Event Processor" on Jan 7, 2026
@Future-Outlier marked this pull request as ready for review on Jan 7, 2026, 03:40
Signed-off-by: Future-Outlier <[email protected]>
if err != nil {
	logrus.Fatalf("Error starting server: %v", err)
	os.Exit(1)
}

Graceful shutdown incorrectly treated as fatal error

Medium Severity

When server.Shutdown is called for graceful shutdown, ListenAndServe returns http.ErrServerClosed. The error check if err != nil treats this as a fatal error and calls logrus.Fatalf, causing the program to exit with code 1 even during normal graceful shutdown. The check needs to exclude http.ErrServerClosed from fatal error handling.
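A minimal sketch of the suggested check, assuming the standard library *http.Server started from main.go (the variable names and address here are illustrative, not the PR's actual code):

```go
package main

import (
	"errors"
	"net/http"

	"github.com/sirupsen/logrus"
)

func main() {
	server := &http.Server{Addr: ":8080"}
	// http.ErrServerClosed is returned after a graceful Shutdown/Close call,
	// so it is excluded from fatal error handling.
	if err := server.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
		logrus.Fatalf("Error starting server: %v", err)
	}
}
```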


Signed-off-by: Future-Outlier <[email protected]>
Comment on lines +10 to +24
const (
	NIL TaskStatus = "NIL"
	PENDING_ARGS_AVAIL TaskStatus = "PENDING_ARGS_AVAIL"
	PENDING_NODE_ASSIGNMENT TaskStatus = "PENDING_NODE_ASSIGNMENT"
	PENDING_OBJ_STORE_MEM_AVAIL TaskStatus = "PENDING_OBJ_STORE_MEM_AVAIL"
	PENDING_ARGS_FETCH TaskStatus = "PENDING_ARGS_FETCH"
	SUBMITTED_TO_WORKER TaskStatus = "SUBMITTED_TO_WORKER"
	PENDING_ACTOR_TASK_ARGS_FETCH TaskStatus = "PENDING_ACTOR_TASK_ARGS_FETCH"
	PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY TaskStatus = "PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY"
	RUNNING TaskStatus = "RUNNING"
	RUNNING_IN_RAY_GET TaskStatus = "RUNNING_IN_RAY_GET"
	RUNNING_IN_RAY_WAIT TaskStatus = "RUNNING_IN_RAY_WAIT"
	FINISHED TaskStatus = "FINISHED"
	FAILED TaskStatus = "FAILED"
)

}
if err := h.storeEvent(currEventData); err != nil {
	return err
}

Event processor failure causes event processing to block

High Severity

When storeEvent returns an error (e.g., from a malformed event), ProcessEvents returns the error and the processor goroutine terminates. However, the main event reader loop at line 153 continues sending events to all channels including the dead processor's channel. Once the channel buffer (size 20) fills up, the main loop blocks indefinitely at the send operation. A single corrupted event file in S3 will cause the history server to stop processing any new events. As noted in the PR discussion, this can lead to a crash loop scenario.
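One possible mitigation, sketched under the assumption that the processor drains a buffered channel as described above (the function, channel, and parameter names are illustrative, not the PR's actual identifiers):

```go
package processor

import "github.com/sirupsen/logrus"

// processEvents keeps draining the channel even when a single event cannot be
// stored: the bad event is logged and skipped instead of terminating the
// goroutine, so the upstream reader loop never blocks on a full channel.
func processEvents(events <-chan []byte, store func([]byte) error) {
	for ev := range events {
		if err := store(ev); err != nil {
			logrus.Errorf("skipping malformed event: %v", err)
			continue
		}
	}
}
```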

Additional Locations (1)


if storedTask.AttemptNumber < currTask.AttemptNumber {
	storedTask.AttemptNumber = currTask.AttemptNumber
}
clusterTaskMapObject.TaskMap[taskId] = storedTask

Task update discards all fields except attempt number

Medium Severity

When a task event arrives with a higher AttemptNumber than the stored task, the code updates storedTask.AttemptNumber from currTask but then saves storedTask back to the map instead of currTask. This means all other updated fields from the newer task attempt (NodeID, WorkerID, State, ErrorType, ErrorMessage, etc.) are discarded. Only the attempt number is preserved from the newer event while the rest of the stale data remains, causing incorrect task information to be displayed for retried tasks.
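A sketch of the kind of fix the comment implies; the TaskRecord type and field set below are hypothetical stand-ins for the PR's actual task struct:

```go
package processor

// TaskRecord is a hypothetical stand-in for the task type in this PR; the
// point is that every field travels with the record, not just AttemptNumber.
type TaskRecord struct {
	AttemptNumber int
	NodeID        string
	WorkerID      string
	State         string
	ErrorType     string
	ErrorMessage  string
}

// upsertTask stores the whole newer record when a same-or-later attempt
// arrives, instead of copying only the attempt number onto the stale entry.
// (Assumes the latest event for a given attempt carries the freshest state.)
func upsertTask(taskMap map[string]TaskRecord, taskID string, curr TaskRecord) {
	stored, ok := taskMap[taskID]
	if !ok || curr.AttemptNumber >= stored.AttemptNumber {
		taskMap[taskID] = curr
	}
}
```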


}

// Construct the full path to the static directory
fullPath := filepath.Join(s.dashboardDir, prefix, "static", path)

Path traversal vulnerability in static file handler

High Severity

The staticFileHandler constructs file paths using user-controlled input without path traversal validation. Both the path URL parameter and version cookie value are directly used in filepath.Join to build fullPath. An attacker could supply path traversal sequences (e.g., ../../../etc/passwd in the path parameter, or ../../etc in the dashboard_version cookie) to access arbitrary files outside the intended dashboard directory. While filepath.Join cleans the path, it does not prevent escaping the base directory, allowing reads of sensitive files on the server.
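A minimal containment check along the lines the comment suggests; safeJoin is a hypothetical helper, not part of this PR:

```go
package server

import (
	"path/filepath"
	"strings"
)

// safeJoin joins untrusted path elements under base and rejects any result
// that escapes the base directory after cleaning, e.g. "../../etc/passwd".
func safeJoin(base string, elems ...string) (string, bool) {
	cleanBase := filepath.Clean(base)
	full := filepath.Join(append([]string{cleanBase}, elems...)...)
	if full != cleanBase && !strings.HasPrefix(full, cleanBase+string(filepath.Separator)) {
		return "", false
	}
	return full, true
}
```

The static file handler could then build the base from s.dashboardDir, prefix, and "static", and respond with 400/404 whenever safeJoin reports an escape.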

Additional Locations (1)


@win5923 (Collaborator) commented Jan 8, 2026

LGTM! Just a question about something you mentioned:

> As a follow-up, we should enable autoscaling for the web server using Kubernetes HPA.

How does the event processor handle data consistency when multiple replicas are deployed? Currently each pod runs its own historyserver with in-memory state. Won't this cause inconsistent responses depending on which pod handles the request?

@Future-Outlier (Member, Author) commented Jan 8, 2026

TODO:

  1. Support live clusters
  2. Fix other endpoints like getTaskSummarize
  3. Delete dead code
  4. Address the Cursor bug bot's review comments

@Future-Outlier (Member, Author) commented:
> LGTM! Just a question about something you mentioned:
>
> > As a follow-up, we should enable autoscaling for the web server using Kubernetes HPA.
>
> How does the event processor handle data consistency when multiple replicas are deployed? Currently each pod runs its own historyserver with in-memory state. Won't this cause inconsistent responses depending on which pod handles the request?

Yes, it will, and this will be solved in the beta version.
We will need to store processed events in a database.
Good point, thank you!


Labels

P0 Critical issue that should be fixed ASAP
