-
Notifications
You must be signed in to change notification settings - Fork 392
feat(dashboard): enhance dashboard UI and fix Ray runner state reporting #6008
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Greptile OverviewGreptile SummaryThis PR enhances the Daft dashboard with improved UI and fixes critical state reporting issues for Ray queries. Key ChangesBackend Improvements:
Frontend Enhancements:
Code Quality:
Implementation NotesThe core fix addresses a state machine issue where Ray queries that failed couldn't transition to the Failed state, causing backend 400 errors. The solution makes The Ray dashboard URL extraction uses Minor IssuesAll findings are non-blocking style/documentation issues (see inline comments for details). Confidence Score: 4/5
Important Files ChangedFile Analysis
Sequence DiagramsequenceDiagram
participant User
participant Runner as Runner (Native/Ray)
participant Context as DaftContext
participant Subscriber as DashboardSubscriber
participant Backend as Dashboard Backend
participant Frontend as Dashboard Frontend
User->>Runner: Execute query
Runner->>Context: _notify_query_start(query_id, metadata)
Note over Runner: metadata includes runner, entrypoint, ray_dashboard_url
Context->>Subscriber: on_query_start(query_id, metadata)
Subscriber->>Backend: POST /query/{id}/start
Backend->>Frontend: WebSocket update
Runner->>Context: _notify_optimization_start(query_id)
Context->>Subscriber: on_optimization_start(query_id)
Subscriber->>Backend: POST /query/{id}/plan/start
Backend->>Frontend: WebSocket update (status: Optimizing)
Runner->>Runner: Optimize plan
Runner->>Context: _notify_optimization_end(query_id, optimized_plan)
Context->>Subscriber: on_optimization_end(query_id, plan)
Subscriber->>Backend: POST /query/{id}/plan/end
Backend->>Frontend: WebSocket update (status: Setup)
Runner->>Context: _notify_exec_start(query_id, physical_plan)
Context->>Subscriber: on_exec_start(query_id, physical_plan)
Subscriber->>Backend: POST /query/{id}/exec/start
Backend->>Frontend: WebSocket update (status: Executing)
loop For each result
Runner->>Context: _notify_exec_emit_stats(query_id, node_id, stats)
Context->>Subscriber: on_exec_emit_stats(query_id, stats)
Subscriber->>Backend: POST /query/{id}/exec/op/{op_id}/emit_stats
Backend->>Frontend: WebSocket update (progress data)
end
alt Success
Runner->>Context: _notify_query_end(query_id, Finished)
Context->>Subscriber: on_query_end(query_id, result)
Subscriber->>Backend: POST /query/{id}/end (Finished)
Backend->>Frontend: WebSocket update (status: Finished)
else Failure
Runner->>Context: _notify_query_end(query_id, Failed)
Context->>Subscriber: on_query_end(query_id, result)
Subscriber->>Backend: POST /query/{id}/end (Failed)
Note over Backend: Accepts Failed from Executing state
Backend->>Frontend: WebSocket update (status: Failed)
end
Frontend->>User: Display query status and Ray dashboard link
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5 files reviewed, 5 comments
|
|
||
| # Optimize the logical plan. | ||
| ctx._notify_query_start(query_id, PyQueryMetadata(output_schema._schema, builder.repr_json())) | ||
| import sys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[P2] Import statements should be at the top of the file. Move import sys to the imports section at the beginning of the file (around line 3-4) per the project's import style guidelines.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: daft/runners/native_runner.py
Line: 99:99
Comment:
[P2] Import statements should be at the top of the file. Move `import sys` to the imports section at the beginning of the file (around line 3-4) per the project's import style guidelines.
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree
daft/runners/native_runner.py
Outdated
| # Try to emit stats for operator 0? | ||
| # If NativeExecutor doesn't start operator 0, this might fail or be ignored. | ||
| # Let's comment it out for now to see baseline behavior. | ||
| ctx._notify_exec_emit_stats(query_id, 0, {"rows in": total_rows, "rows out": total_rows}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[P3] The comment says "Let's comment it out for now to see baseline behavior" but the code on line 129 is actually executing. Either remove this misleading comment or comment out the code if it was intended to be disabled.
Prompt To Fix With AI
This is a comment left during a code review.
Path: daft/runners/native_runner.py
Line: 126:129
Comment:
[P3] The comment says "Let's comment it out for now to see baseline behavior" but the code on line 129 is actually executing. Either remove this misleading comment or comment out the code if it was intended to be disabled.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stats shouldn't be emitted here, they're already emmited within the NativeExecutor itself
src/daft-dashboard/src/engine.rs
Outdated
| tracing::info!("Received exec_start for query {}", query_id); | ||
| let query_info = state.queries.get_mut(&query_id); | ||
| let Some(mut query_info) = query_info else { | ||
| tracing::error!("Query {} not found in exec_start", query_id); | ||
| return StatusCode::BAD_REQUEST; | ||
| }; | ||
|
|
||
| // Debug state | ||
| tracing::info!("Query {} state: {:?}", query_id, query_info.state); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[P3] Debug logging statements appear to have been left in from development. The comment "Debug state" at line 181 and tracing::info! calls at lines 174 and 182 look like temporary debugging code. Consider removing or converting to tracing::debug! if this level of verbosity isn't needed in production.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/daft-dashboard/src/engine.rs
Line: 174:182
Comment:
[P3] Debug logging statements appear to have been left in from development. The comment "Debug state" at line 181 and `tracing::info!` calls at lines 174 and 182 look like temporary debugging code. Consider removing or converting to `tracing::debug!` if this level of verbosity isn't needed in production.
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| // @ts-ignore | ||
| columnHelper.accessor("ray_dashboard_url", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[P2] Using @ts-ignore suppresses TypeScript type checking instead of fixing the underlying type issue. The ray_dashboard_url property is properly defined as optional in the QuerySummary type. Consider properly typing the accessor or using @ts-expect-error with a specific explanation if this is a known limitation of the column helper library.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/daft-dashboard/frontend/src/app/queries/page.tsx
Line: 123:124
Comment:
[P2] Using `@ts-ignore` suppresses TypeScript type checking instead of fixing the underlying type issue. The `ray_dashboard_url` property is properly defined as optional in the `QuerySummary` type. Consider properly typing the accessor or using `@ts-expect-error` with a specific explanation if this is a known limitation of the column helper library.
How can I resolve this? If you propose a fix, please make it concise.7b32c5d to
1a47668
Compare
|
@srilman tagging you on this one |
srilman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify a couple of points?
daft/runners/native_runner.py
Outdated
| # Try to emit stats for operator 0? | ||
| # If NativeExecutor doesn't start operator 0, this might fail or be ignored. | ||
| # Let's comment it out for now to see baseline behavior. | ||
| ctx._notify_exec_emit_stats(query_id, 0, {"rows in": total_rows, "rows out": total_rows}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stats shouldn't be emitted here, they're already emmited within the NativeExecutor itself
daft/runners/ray_runner.py
Outdated
| ): | ||
| if result.metadata() is not None: | ||
| total_rows += result.metadata().num_rows | ||
| ctx._notify_exec_emit_stats(query_id, 0, {"rows in": total_rows, "rows out": total_rows}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, we shouldn't emit stats here as well, because they are expected to be in a specific format per operator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There has been an update here. Could you please check again if it meets the expectations?
| output_schema: PySchema | ||
| unoptimized_plan: str | ||
| runner: str | ||
| ray_dashboard_url: str | None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why emit the ray dashboard URL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah makes sense, good idea
0b386f5 to
9838c15
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #6008 +/- ##
==========================================
+ Coverage 72.95% 73.58% +0.62%
==========================================
Files 970 972 +2
Lines 126744 126949 +205
==========================================
+ Hits 92471 93416 +945
+ Misses 34273 33533 -740
🚀 New features to boost your workflow:
|
9838c15 to
0867a8e
Compare
|
@srilman I have updated all your comments. You can check if there are any other issues. |
srilman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just had a couple of clarifying questions
| ) | ||
|
|
||
| try: | ||
| total_rows = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why collect this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a debug code; I deleted something here.
| # Log Dashboard URL if configured | ||
| dashboard_url = os.environ.get("DAFT_DASHBOARD_URL") | ||
| if dashboard_url: | ||
| print(f"Daft Dashboard: {dashboard_url}/query/{query_id}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove print
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, I changed print to logger, mainly to clearly show users how to access the dashboard. What do you think?
| ))) | ||
| .await?; | ||
| fn on_exec_start(&self, query_id: QueryID, physical_plan: QueryPlan) -> DaftResult<()> { | ||
| let execution_id = format!("{}-driver", query_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand the point of execution_id when its just query_id with a fixed or randomly generated tag, but its only created once. Why not just store query_id directly if its already a random UUID?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, yes, I agree. The modifications have been made.
| from daft import udf | ||
|
|
||
|
|
||
| @pytest.fixture(scope="module") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would these tests work without anything to actually launch a dashboard server or mocking the server for testing?
| for node_id in &active_nodes { | ||
| let runtime_stats = &node_stats_map[node_id]; | ||
| let event = runtime_stats.snapshot(); | ||
| let event = runtime_stats.flush(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't flush, that makes it do atomic synchronization which is less efficient
| ); | ||
| } | ||
|
|
||
| // Emit final stats to all subscribers before finishing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be necessary, since the finalize_node step should already emit the final stats for that node.
|
also @Jay-ju if possible, could we split this PR into smaller pieces? this modifies a lot of small aspects of observability, and since we're also actively working on it, this pr will end up having a lot of merge conflicts |
0867a8e to
1789f71
Compare
@srilman I have split this PR into two PRs: one for the frontend and one for the backend, and I have also resolved some conflicts: Could you please take another look when you have time? |

Frontend Enhancements:
Backend Fixes & Improvements:
Code Cleanup:
Changes Made
Related Issues