
Conversation

@Future-Outlier (Member) commented Jan 2, 2026

Co-authored-by: @chiayi [email protected]
Co-authored-by: @KunWuLuan [email protected]

Why are these changes needed?

This web server serves the history server's frontend and fetches data from the event server (processor).
As a follow-up, we should enable autoscaling for the web server using Kubernetes HPA.

Note: most of the code is copied from #4187 and #4253.

Diagram

How it works at a high level:

flowchart TB
    subgraph storage [MinIO S3 Storage]
        S3[(S3 Bucket)]
        JobEvents[job_events/]
        NodeEvents[node_events/]
    end

    subgraph historyserver [History Server - Modified]
        Main[main.go]
        SH[ServerHandler]
        EH[EventHandler]
        SR[StorageReader]
        
        Main --> SH
        Main -->|"go EH.Run()"| EH
        SH -->|"Reference"| EH
        EH --> SR
        SR --> S3
        SR --> JobEvents
        SR --> NodeEvents
        
        EH --> TaskMap[ClusterTaskMap]
        EH --> ActorMap[ClusterActorMap]
    end

    subgraph endpoints [New Endpoints]
        T1["/api/v0/tasks (from EventHandler)"]
        A1["/logical/actors (from EventHandler)"]
    end
    
    SH --> T1
    SH --> A1
    T1 -.->|"Query"| TaskMap
    A1 -.->|"Query"| ActorMap

Screenshot proof

Take http://localhost:8080/api/v0/tasks as an example.

image

Logs in the historyserver:

image

How to test and develop in your local environment

  1. Check out this branch.
  2. kind create cluster --image=kindest/node:v1.29.0
  3. Build your ray-operator and run it (either a binary or a deployment works).
  4. kubectl apply -f historyserver/config/minio.yaml
  5. Build the collector and history server images and load them into your k8s cluster:
    1. cd historyserver
    2. make localimage-historyserver; kind load docker-image historyserver:v0.1.0
    3. make localimage-collector; kind load docker-image collector:v0.1.0
  6. kubectl apply -f historyserver/config/raycluster.yaml
  7. kubectl apply -f historyserver/config/rayjob.yaml
  8. kubectl delete -f historyserver/config/raycluster.yaml
  9. kubectl apply -f config/historyserver.yaml
  10. Hit the history server's endpoints:
    1. kubectl port-forward svc/historyserver 8080:30080
    2. curl -c cookies.txt "http://localhost:8080/enter_cluster/default/raycluster-historyserver/session_2026-01-06_07-07-00_383444_1"
    3. cat cookies.txt
    4. curl -b cookies.txt http://localhost:8080/api/v0/tasks
    5. Note: change the session directory to the correct one; log in to the MinIO console to find the right session.
      1. ref: https://github.com/ray-project/kuberay/blob/master/historyserver/docs/set_up_collector.md#deploy-minio-for-log-and-event-storage
  11. You can test the following endpoints:
echo "=== Health Check ==="
curl "http://localhost:8080/readz"
curl "http://localhost:8080/livez"

echo "=== Clusters List ==="
curl "http://localhost:8080/clusters"

SESSION="session_2026-01-08_07-01-21_915465_1" # change to your session
curl -c ~/cookies.txt "http://localhost:8080/enter_cluster/default/raycluster-historyserver/$SESSION"

echo "=== All Tasks ==="
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks"

echo "=== Tasks by job_id ==="
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks?filter_keys=job_id&filter_values=01000000"

<img width="793" height="272" alt="image" src="https://github.com/user-attachments/assets/e8258fa2-c6fa-4ec9-90a1-657cbbef2c44" />

echo "=== Task by task_id ==="
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks?filter_keys=task_id&filter_values=YOUR_TASK_ID"

<img width="785" height="150" alt="image" src="https://github.com/user-attachments/assets/058f8f5a-072c-493f-83d9-c59e47635b2c" />


echo "=== All Actors ==="
curl -b ~/cookies.txt "http://localhost:8080/logical/actors"

<img width="798" height="163" alt="image" src="https://github.com/user-attachments/assets/67b57583-4c75-4939-b8e2-a11c91505605" />

echo "=== Single Actor ==="
curl -b ~/cookies.txt "http://localhost:8080/logical/actors/YOUR_ACTOR_ID"
<img width="787" height="483" alt="image" src="https://github.com/user-attachments/assets/c2886452-5927-483b-86db-7079220aaae0" />

echo "=== Nodes ==="
curl -b ~/cookies.txt "http://localhost:8080/nodes?view=summary" | jq .

<img width="781" height="447" alt="image" src="https://github.com/user-attachments/assets/80c752ab-b0c1-46d5-9d4a-208bdee937c4" />



Related issue number

#3966

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

chiayi and others added 3 commits December 3, 2025 10:58
Future-Outlier and others added 5 commits January 6, 2026 16:13
@Future-Outlier (Member, Author) left a comment

cc @chiayi @KunWuLuan
to help review, thank you!

@Future-Outlier added the P0 label ("Critical issue that should be fixed ASAP") on Jan 7, 2026
@Future-Outlier changed the title from "[WIP][history server] Web Server" to "[history server] Web Server + Event Processor" on Jan 7, 2026
@Future-Outlier marked this pull request as ready for review on Jan 7, 2026, 03:40
Signed-off-by: Future-Outlier <[email protected]>
if err != nil {
	logrus.Fatalf("Error starting server: %v", err)
	os.Exit(1)
}

Graceful shutdown incorrectly treated as fatal error

Medium Severity

When server.Shutdown is called for graceful shutdown, ListenAndServe returns http.ErrServerClosed. The error check if err != nil treats this as a fatal error and calls logrus.Fatalf, causing the program to exit with code 1 even during normal graceful shutdown. The check needs to exclude http.ErrServerClosed from fatal error handling.
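A minimal sketch of the suggested check, assuming the standard library *http.Server started from main.go (the variable names and address here are illustrative, not the PR's actual code):

```go
package main

import (
	"errors"
	"net/http"

	"github.com/sirupsen/logrus"
)

func main() {
	server := &http.Server{Addr: ":8080"}
	// http.ErrServerClosed is returned after a graceful Shutdown/Close call,
	// so it is excluded from fatal error handling.
	if err := server.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
		logrus.Fatalf("Error starting server: %v", err)
	}
}
```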


Signed-off-by: Future-Outlier <[email protected]>
Comment on lines +10 to +24
const (
	NIL TaskStatus = "NIL"
	PENDING_ARGS_AVAIL TaskStatus = "PENDING_ARGS_AVAIL"
	PENDING_NODE_ASSIGNMENT TaskStatus = "PENDING_NODE_ASSIGNMENT"
	PENDING_OBJ_STORE_MEM_AVAIL TaskStatus = "PENDING_OBJ_STORE_MEM_AVAIL"
	PENDING_ARGS_FETCH TaskStatus = "PENDING_ARGS_FETCH"
	SUBMITTED_TO_WORKER TaskStatus = "SUBMITTED_TO_WORKER"
	PENDING_ACTOR_TASK_ARGS_FETCH TaskStatus = "PENDING_ACTOR_TASK_ARGS_FETCH"
	PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY TaskStatus = "PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY"
	RUNNING TaskStatus = "RUNNING"
	RUNNING_IN_RAY_GET TaskStatus = "RUNNING_IN_RAY_GET"
	RUNNING_IN_RAY_WAIT TaskStatus = "RUNNING_IN_RAY_WAIT"
	FINISHED TaskStatus = "FINISHED"
	FAILED TaskStatus = "FAILED"
)

}
if err := h.storeEvent(currEventData); err != nil {
	return err
}

Event processor failure causes event processing to block

High Severity

When storeEvent returns an error (e.g., from a malformed event), ProcessEvents returns the error and the processor goroutine terminates. However, the main event reader loop at line 153 continues sending events to all channels including the dead processor's channel. Once the channel buffer (size 20) fills up, the main loop blocks indefinitely at the send operation. A single corrupted event file in S3 will cause the history server to stop processing any new events. As noted in the PR discussion, this can lead to a crash loop scenario.
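One possible mitigation, sketched under the assumption that the processor drains a buffered channel as described above (the function, channel, and parameter names are illustrative, not the PR's actual identifiers):

```go
package processor

import "github.com/sirupsen/logrus"

// processEvents keeps draining the channel even when a single event cannot be
// stored: the bad event is logged and skipped instead of terminating the
// goroutine, so the upstream reader loop never blocks on a full channel.
func processEvents(events <-chan []byte, store func([]byte) error) {
	for ev := range events {
		if err := store(ev); err != nil {
			logrus.Errorf("skipping malformed event: %v", err)
			continue
		}
	}
}
```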

Additional Locations (1)


if storedTask.AttemptNumber < currTask.AttemptNumber {
	storedTask.AttemptNumber = currTask.AttemptNumber
}
clusterTaskMapObject.TaskMap[taskId] = storedTask

Task update discards all fields except attempt number

Medium Severity

When a task event arrives with a higher AttemptNumber than the stored task, the code updates storedTask.AttemptNumber from currTask but then saves storedTask back to the map instead of currTask. This means all other updated fields from the newer task attempt (NodeID, WorkerID, State, ErrorType, ErrorMessage, etc.) are discarded. Only the attempt number is preserved from the newer event while the rest of the stale data remains, causing incorrect task information to be displayed for retried tasks.
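A sketch of the kind of fix the comment implies; the TaskRecord type and field set below are hypothetical stand-ins for the PR's actual task struct:

```go
package processor

// TaskRecord is a hypothetical stand-in for the task type in this PR; the
// point is that every field travels with the record, not just AttemptNumber.
type TaskRecord struct {
	AttemptNumber int
	NodeID        string
	WorkerID      string
	State         string
	ErrorType     string
	ErrorMessage  string
}

// upsertTask stores the whole newer record when a same-or-later attempt
// arrives, instead of copying only the attempt number onto the stale entry.
// (Assumes the latest event for a given attempt carries the freshest state.)
func upsertTask(taskMap map[string]TaskRecord, taskID string, curr TaskRecord) {
	stored, ok := taskMap[taskID]
	if !ok || curr.AttemptNumber >= stored.AttemptNumber {
		taskMap[taskID] = curr
	}
}
```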


}

// Construct the full path to the static directory
fullPath := filepath.Join(s.dashboardDir, prefix, "static", path)

Path traversal vulnerability in static file handler

High Severity

The staticFileHandler constructs file paths using user-controlled input without path traversal validation. Both the path URL parameter and version cookie value are directly used in filepath.Join to build fullPath. An attacker could supply path traversal sequences (e.g., ../../../etc/passwd in the path parameter, or ../../etc in the dashboard_version cookie) to access arbitrary files outside the intended dashboard directory. While filepath.Join cleans the path, it does not prevent escaping the base directory, allowing reads of sensitive files on the server.
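A minimal containment check along the lines the comment suggests; safeJoin is a hypothetical helper, not part of this PR:

```go
package server

import (
	"path/filepath"
	"strings"
)

// safeJoin joins untrusted path elements under base and rejects any result
// that escapes the base directory after cleaning, e.g. "../../etc/passwd".
func safeJoin(base string, elems ...string) (string, bool) {
	cleanBase := filepath.Clean(base)
	full := filepath.Join(append([]string{cleanBase}, elems...)...)
	if full != cleanBase && !strings.HasPrefix(full, cleanBase+string(filepath.Separator)) {
		return "", false
	}
	return full, true
}
```

The static file handler could then build the base from s.dashboardDir, prefix, and "static", and respond with 400/404 whenever safeJoin reports an escape.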

Additional Locations (1)


@win5923 (Collaborator) commented Jan 8, 2026

LGTM! Just a question about something you mentioned:

> As a follow-up, we should enable autoscaling for the web server using Kubernetes HPA.

How does the event processor handle data consistency when multiple replicas are deployed? Currently each pod runs its own historyserver with in-memory state. Won't this cause inconsistent responses depending on which pod handles the request?

@Future-Outlier (Member, Author) commented Jan 8, 2026

TODO:

  1. Support live clusters
  2. Fix other endpoints like getTaskSummarize
  3. Delete dead code
  4. Address the Cursor bug bot's review comments

@Future-Outlier (Member, Author) commented:
> LGTM! Just a question about something you mentioned:
>
> > As a follow-up, we should enable autoscaling for the web server using Kubernetes HPA.
>
> How does the event processor handle data consistency when multiple replicas are deployed? Currently each pod runs its own historyserver with in-memory state. Won't this cause inconsistent responses depending on which pod handles the request?

Yes, it will, and this will be solved in the beta version.
We will need to store processed events in a database.
Good point, thank you!


Labels

P0 Critical issue that should be fixed ASAP
