Contributor

@Jay-ju Jay-ju commented Jan 11, 2026

  1. Frontend Enhancements:

    • All Queries Page: Updated table header to use white background (bg-white) with black text and grey separators, improving readability.
    • Query Detail Page:
      • Added Entrypoint (command line) and Engine (Swordfish/Flotilla) fields to the metadata section.
      • Added a direct link to the Ray Dashboard for Ray-based queries.
      • Improved metadata visibility by using high-contrast text (text-zinc-100).
      • Progress Table: Refined table headers with dark theme (bg-zinc-800), white text, and clear column separators. Added hover effects for better interactivity.
    • Engine Naming: Standardized engine display names (Native -> Swordfish, Ray -> Flotilla).
  2. Backend Fixes & Improvements:

    • State Management: Fixed an issue where failed Ray queries were not correctly reporting their terminal state to the dashboard (causing 400 errors). Now allows transitions to Failed state from active states.
    • Metadata Propagation: Updated RayRunner to capture and transmit entrypoint and ray_dashboard_url to the dashboard backend.
    • Python API: Exposed repr_json on DistributedPhysicalPlan in __init__.pyi to fix mypy errors and support plan visualization.
  3. Code Cleanup:

    • Removed unused imports and debug logging.
    • Standardized sys and os imports in ray_runner.py.
    • Fixed mypy type definition errors in daft/__init__.pyi related to context notification methods.

Changes Made

image image

Related Issues

@github-actions github-actions bot added the feat label Jan 11, 2026
Contributor

greptile-apps bot commented Jan 11, 2026

Greptile Overview

Greptile Summary

This PR enhances the Daft dashboard with improved UI and fixes critical state reporting issues for Ray queries.

Key Changes

Backend Improvements:

  • Fixes Ray runner state management by allowing terminal state (Failed/Canceled) transitions from active states (Executing, Setup, Optimizing), resolving 400 errors when queries fail
  • Changes timestamp precision from u64 to f64 throughout the stack for millisecond-level accuracy
  • Adds runner, ray_dashboard_url, and entrypoint fields to query metadata for better tracking
  • Removes the Ray runner restriction from dashboard subscriber
  • Adds comprehensive query lifecycle notifications (notify_exec_start, notify_exec_end, notify_exec_operator_start, etc.)

Frontend Enhancements:

  • Adds Duration, Entrypoint, Engine, and Ray UI columns to the queries table
  • Implements direct Ray Dashboard links for Ray-based queries with job ID appending
  • Improves table styling with white headers, better borders, and hover effects
  • Standardizes engine naming (Native → Swordfish, Ray → Flotilla)
  • Enhances timestamp formatting to show milliseconds
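
As a rough illustration of what millisecond-level timestamps buy the UI, here is a minimal Python sketch. It assumes the f64 timestamps are epoch seconds with a fractional part (the summary does not state the unit), and both helper names are hypothetical; the actual dashboard formats timestamps in the TypeScript frontend.

```python
from datetime import datetime, timezone


def format_timestamp_ms(epoch_seconds: float) -> str:
    """Render a float epoch timestamp (assumed to be in seconds) with millisecond precision, in UTC."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S.") + f"{dt.microsecond // 1000:03d}"


def format_duration(start: float, end: float) -> str:
    """Render the elapsed time between two float timestamps, in seconds, to three decimals."""
    return f"{end - start:.3f}s"
```

With integer (u64) second timestamps, sub-second queries would all display a duration of zero; a float representation keeps the fractional part.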

Code Quality:

  • Exposes repr_json() on DistributedPhysicalPlan (currently returns dummy JSON)
  • Updates Python type stubs to match new API

Implementation Notes

The core fix addresses a state machine issue where Ray queries that failed couldn't transition to the Failed state, causing backend 400 errors. The solution makes plan_info and exec_info optional in Failed/Canceled states and allows transitions from any active state (lines 330-346 in engine.rs).
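
The relaxed transition rule can be sketched as follows. This is a Python approximation of the idea, not the Rust code in engine.rs: the `PENDING` state, the `NEXT_STATE` table, and `can_transition` are assumptions made for illustration, while the other state names are taken from the summary and sequence diagram.

```python
from enum import Enum, auto


class QueryState(Enum):
    PENDING = auto()      # assumed initial state, not named in the PR
    OPTIMIZING = auto()
    SETUP = auto()
    EXECUTING = auto()
    FINISHED = auto()
    FAILED = auto()
    CANCELED = auto()


# States from which a query may jump straight to Failed or Canceled.
ACTIVE_STATES = {QueryState.OPTIMIZING, QueryState.SETUP, QueryState.EXECUTING}
TERMINAL_ERROR_STATES = {QueryState.FAILED, QueryState.CANCELED}

# The normal path: each non-terminal state has exactly one successor.
NEXT_STATE = {
    QueryState.PENDING: QueryState.OPTIMIZING,
    QueryState.OPTIMIZING: QueryState.SETUP,
    QueryState.SETUP: QueryState.EXECUTING,
    QueryState.EXECUTING: QueryState.FINISHED,
}


def can_transition(current: QueryState, target: QueryState) -> bool:
    """Return True if the dashboard should accept moving from current to target."""
    if target in TERMINAL_ERROR_STATES:
        # Failure or cancellation is accepted from any active state, so the
        # plan and execution details may not exist yet and must be optional.
        return current in ACTIVE_STATES
    return NEXT_STATE.get(current) is target
```

Under a stricter rule that only accepted Failed after a completed execution phase, a query failing during optimization or setup would be rejected, which matches the 400 responses described above.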

The Ray dashboard URL extraction uses ray.worker.get_dashboard_url() and attempts to append the job ID when available, falling back gracefully on errors.
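
A minimal sketch of the URL-building step with its fallbacks is shown below. `build_ray_job_url` is a hypothetical helper, the `/#/jobs/<job_id>` route is an assumption about the Ray Dashboard's URL layout, and the sketch also assumes the dashboard address may arrive without a scheme. It deliberately takes plain strings so it can be exercised without a Ray cluster.

```python
from typing import Optional


def build_ray_job_url(dashboard_url: Optional[str], job_id: Optional[str]) -> Optional[str]:
    """Build a link to a Ray job page, degrading gracefully when data is missing."""
    if not dashboard_url:
        # No dashboard available: the caller simply omits the link.
        return None
    base = dashboard_url.strip()
    if not base.startswith(("http://", "https://")):
        # Assumption: the address may be a bare "host:port".
        base = "http://" + base
    base = base.rstrip("/")
    if not job_id:
        # Without a job ID, fall back to the dashboard root.
        return base
    # Assumption: the Ray Dashboard serves job pages under /#/jobs/<job_id>.
    return f"{base}/#/jobs/{job_id}"
```

In the runner, the two arguments would come from the Ray runtime, and any exception raised while fetching them would be caught so that metadata reporting never blocks query execution.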

Minor Issues

All findings are non-blocking style/documentation issues (see inline comments for details).

Confidence Score: 4/5

  • Safe to merge with minor style improvements recommended
  • The core functionality changes are sound: the state transition fix properly addresses the Ray query failure reporting issue, metadata propagation is implemented consistently across the stack, and frontend changes are purely additive UI enhancements. The timestamp precision change from u64 to f64 is handled correctly throughout. However, there are several minor style issues: inline import in native_runner.py violates project guidelines, misleading comment about commented-out code that actually executes, debug logging left in production code, @ts-ignore suppressing type errors, and undocumented gravitino import removal. These are all non-blocking style/cleanup issues that don't affect correctness.
  • daft/runners/native_runner.py (inline import and misleading comment), daft/__init__.py (undocumented gravitino change), src/daft-dashboard/frontend/src/app/queries/page.tsx (@ts-ignore)

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| daft/runners/native_runner.py | 3/5 | Adds entrypoint tracking and query lifecycle notifications; contains inline import violation and misleading comment about code that is actually executing |
| daft/runners/ray_runner.py | 4/5 | Adds comprehensive query lifecycle tracking with Ray dashboard URL extraction and proper error handling |
| daft/__init__.py | 4/5 | Comments out gravitino imports (unrelated change not mentioned in PR description) |
| src/daft-dashboard/src/engine.rs | 4/5 | Changes timestamps to f64, adds new metadata fields, relaxes state transition requirements for terminal states, includes debug logging |
| src/daft-dashboard/src/state.rs | 5/5 | Updates state structs to use f64 timestamps and makes plan_info/exec_info optional for Failed/Canceled states |
| src/daft-dashboard/frontend/src/app/queries/page.tsx | 3/5 | Adds new columns for duration, entrypoint, engine, and Ray UI link; includes @ts-ignore for type error |

Sequence Diagram

sequenceDiagram
    participant User
    participant Runner as Runner (Native/Ray)
    participant Context as DaftContext
    participant Subscriber as DashboardSubscriber
    participant Backend as Dashboard Backend
    participant Frontend as Dashboard Frontend
    
    User->>Runner: Execute query
    Runner->>Context: _notify_query_start(query_id, metadata)
    Note over Runner: metadata includes runner, entrypoint, ray_dashboard_url
    Context->>Subscriber: on_query_start(query_id, metadata)
    Subscriber->>Backend: POST /query/{id}/start
    Backend->>Frontend: WebSocket update
    
    Runner->>Context: _notify_optimization_start(query_id)
    Context->>Subscriber: on_optimization_start(query_id)
    Subscriber->>Backend: POST /query/{id}/plan/start
    Backend->>Frontend: WebSocket update (status: Optimizing)
    
    Runner->>Runner: Optimize plan
    Runner->>Context: _notify_optimization_end(query_id, optimized_plan)
    Context->>Subscriber: on_optimization_end(query_id, plan)
    Subscriber->>Backend: POST /query/{id}/plan/end
    Backend->>Frontend: WebSocket update (status: Setup)
    
    Runner->>Context: _notify_exec_start(query_id, physical_plan)
    Context->>Subscriber: on_exec_start(query_id, physical_plan)
    Subscriber->>Backend: POST /query/{id}/exec/start
    Backend->>Frontend: WebSocket update (status: Executing)
    
    loop For each result
        Runner->>Context: _notify_exec_emit_stats(query_id, node_id, stats)
        Context->>Subscriber: on_exec_emit_stats(query_id, stats)
        Subscriber->>Backend: POST /query/{id}/exec/op/{op_id}/emit_stats
        Backend->>Frontend: WebSocket update (progress data)
    end
    
    alt Success
        Runner->>Context: _notify_query_end(query_id, Finished)
        Context->>Subscriber: on_query_end(query_id, result)
        Subscriber->>Backend: POST /query/{id}/end (Finished)
        Backend->>Frontend: WebSocket update (status: Finished)
    else Failure
        Runner->>Context: _notify_query_end(query_id, Failed)
        Context->>Subscriber: on_query_end(query_id, result)
        Subscriber->>Backend: POST /query/{id}/end (Failed)
        Note over Backend: Accepts Failed from Executing state
        Backend->>Frontend: WebSocket update (status: Failed)
    end
    
    Frontend->>User: Display query status and Ray dashboard link

Contributor

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 5 comments



# Optimize the logical plan.
ctx._notify_query_start(query_id, PyQueryMetadata(output_schema._schema, builder.repr_json()))
import sys
Contributor

[P2] Import statements should be at the top of the file. Move import sys to the imports section at the beginning of the file (around line 3-4) per the project's import style guidelines.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Path: daft/runners/native_runner.py
Line: 99:99

Contributor

Agree

Comment on lines 126 to 129
# Try to emit stats for operator 0?
# If NativeExecutor doesn't start operator 0, this might fail or be ignored.
# Let's comment it out for now to see baseline behavior.
ctx._notify_exec_emit_stats(query_id, 0, {"rows in": total_rows, "rows out": total_rows})
Contributor

[P3] The comment says "Let's comment it out for now to see baseline behavior" but the code on line 129 is actually executing. Either remove this misleading comment or comment out the code if it was intended to be disabled.

Path: daft/runners/native_runner.py
Line: 126:129

Contributor

Stats shouldn't be emitted here; they're already emitted within the NativeExecutor itself.

Comment on lines 174 to 182
tracing::info!("Received exec_start for query {}", query_id);
let query_info = state.queries.get_mut(&query_id);
let Some(mut query_info) = query_info else {
tracing::error!("Query {} not found in exec_start", query_id);
return StatusCode::BAD_REQUEST;
};

// Debug state
tracing::info!("Query {} state: {:?}", query_id, query_info.state);
Contributor

[P3] Debug logging statements appear to have been left in from development. The comment "Debug state" at line 181 and tracing::info! calls at lines 174 and 182 look like temporary debugging code. Consider removing or converting to tracing::debug! if this level of verbosity isn't needed in production.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Path: src/daft-dashboard/src/engine.rs
Line: 174:182

Comment on lines +123 to +124
// @ts-ignore
columnHelper.accessor("ray_dashboard_url", {
Contributor

[P2] Using @ts-ignore suppresses TypeScript type checking instead of fixing the underlying type issue. The ray_dashboard_url property is properly defined as optional in the QuerySummary type. Consider properly typing the accessor or using @ts-expect-error with a specific explanation if this is a known limitation of the column helper library.

Path: src/daft-dashboard/frontend/src/app/queries/page.tsx
Line: 123:124

@Jay-ju Jay-ju force-pushed the jay/dashboard-ui-improvements branch from 7b32c5d to 1a47668 Compare January 12, 2026 12:33
@kevinzwang kevinzwang requested a review from srilman January 12, 2026 19:04
@kevinzwang
Contributor

@srilman tagging you on this one

Contributor

@srilman srilman left a comment

Could you clarify a couple of points?

Comment on lines 126 to 129
# Try to emit stats for operator 0?
# If NativeExecutor doesn't start operator 0, this might fail or be ignored.
# Let's comment it out for now to see baseline behavior.
ctx._notify_exec_emit_stats(query_id, 0, {"rows in": total_rows, "rows out": total_rows})
Contributor

Stats shouldn't be emitted here; they're already emitted within the NativeExecutor itself.

):
if result.metadata() is not None:
total_rows += result.metadata().num_rows
ctx._notify_exec_emit_stats(query_id, 0, {"rows in": total_rows, "rows out": total_rows})
Contributor

Similarly, we shouldn't emit stats here, because they are expected to be in a specific per-operator format.

Contributor Author

There has been an update here. Could you please check again if it meets the expectations?

output_schema: PySchema
unoptimized_plan: str
runner: str
ray_dashboard_url: str | None
Contributor

Why emit the ray dashboard URL?

Contributor Author

[screenshot] Here is a link to the Ray task.

Contributor

Ah makes sense, good idea

@Jay-ju Jay-ju force-pushed the jay/dashboard-ui-improvements branch 5 times, most recently from 0b386f5 to 9838c15 Compare January 14, 2026 10:05

codecov bot commented Jan 14, 2026

Codecov Report

❌ Patch coverage is 24.89209% with 522 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.58%. Comparing base (1d7b41a) to head (0867a8e).
⚠️ Report is 31 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/daft-dashboard/src/engine.rs | 0.00% | 194 Missing ⚠️ |
| src/daft-context/src/subscribers/dashboard.rs | 0.00% | 127 Missing ⚠️ |
| src/daft-distributed/src/python/dashboard.rs | 0.00% | 62 Missing ⚠️ |
| src/daft-context/src/lib.rs | 35.71% | 54 Missing ⚠️ |
| src/daft-context/src/python.rs | 48.71% | 40 Missing ⚠️ |
| src/daft-dashboard/src/state.rs | 0.00% | 15 Missing ⚠️ |
| daft/runners/ray_runner.py | 75.47% | 13 Missing ⚠️ |
| src/daft-distributed/src/statistics/stats.rs | 0.00% | 8 Missing ⚠️ |
| src/daft-distributed/src/python/mod.rs | 80.00% | 3 Missing ⚠️ |
| daft/context.py | 80.00% | 2 Missing ⚠️ |
... and 2 more
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #6008      +/-   ##
==========================================
+ Coverage   72.95%   73.58%   +0.62%     
==========================================
  Files         970      972       +2     
  Lines      126744   126949     +205     
==========================================
+ Hits        92471    93416     +945     
+ Misses      34273    33533     -740     
| Files with missing lines | Coverage Δ | |
|---|---|---|
| daft/runners/native_runner.py | 84.52% <100.00%> (+0.77%) | ⬆️ |
| src/daft-context/src/subscribers/mod.rs | 62.50% <100.00%> (+22.50%) | ⬆️ |
| src/daft-distributed/src/pipeline_node/mod.rs | 31.77% <100.00%> (+4.49%) | ⬆️ |
| ...l-execution/src/runtime_stats/subscribers/query.rs | 91.66% <100.00%> (+2.19%) | ⬆️ |
| daft/context.py | 87.95% <80.00%> (-1.09%) | ⬇️ |
| daft/runners/flotilla.py | 47.39% <50.00%> (-0.25%) | ⬇️ |
| src/daft-local-execution/src/runtime_stats/mod.rs | 91.84% <88.88%> (-0.11%) | ⬇️ |
| src/daft-distributed/src/python/mod.rs | 42.53% <80.00%> (+5.34%) | ⬆️ |
| src/daft-distributed/src/statistics/stats.rs | 32.96% <0.00%> (-3.18%) | ⬇️ |
| daft/runners/ray_runner.py | 68.06% <75.47%> (+0.65%) | ⬆️ |
... and 6 more

... and 126 files with indirect coverage changes


@Jay-ju Jay-ju force-pushed the jay/dashboard-ui-improvements branch from 9838c15 to 0867a8e Compare January 14, 2026 12:05
Contributor Author

Jay-ju commented Jan 14, 2026

@srilman I have addressed all your comments. Please check whether there are any other issues.

Contributor

@srilman srilman left a comment

Just had a couple of clarifying questions

)

try:
total_rows = 0
Contributor

Why collect this?

Contributor Author

This was debug code; I have removed it.

# Log Dashboard URL if configured
dashboard_url = os.environ.get("DAFT_DASHBOARD_URL")
if dashboard_url:
print(f"Daft Dashboard: {dashboard_url}/query/{query_id}")
Contributor

Remove print

Contributor Author

Here, I changed print to logger, mainly to clearly show users how to access the dashboard. What do you think?
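
For reference, the logger-based variant could look roughly like the sketch below. The `DAFT_DASHBOARD_URL` variable and the `/query/{query_id}` path come from the snippet under review; the logger name, the function name, and the injectable `environ` parameter are hypothetical and exist only to make the sketch easy to test.

```python
import logging
import os
from typing import Mapping, Optional

# Hypothetical logger name; a real module would typically use __name__.
logger = logging.getLogger("daft.dashboard_example")


def log_dashboard_url(query_id: str, environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Log the dashboard link for a query if a dashboard URL is configured."""
    base = environ.get("DAFT_DASHBOARD_URL")
    if not base:
        return None
    url = f"{base.rstrip('/')}/query/{query_id}"
    # INFO level lets users surface or silence the message through logging config,
    # unlike print, which always writes to stdout.
    logger.info("Daft Dashboard: %s", url)
    return url
```

One trade-off to keep in mind: with Python's default logging configuration, INFO messages are not shown, so users would only see the link if logging is configured accordingly.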

)))
.await?;
fn on_exec_start(&self, query_id: QueryID, physical_plan: QueryPlan) -> DaftResult<()> {
let execution_id = format!("{}-driver", query_id);
Contributor

I'm not sure I understand the point of execution_id when it's just query_id with a fixed or randomly generated tag, and it's only created once. Why not just store query_id directly if it's already a random UUID?

Contributor Author

Hmm, yes, I agree. The modifications have been made.

from daft import udf


@pytest.fixture(scope="module")
Contributor

How would these tests work without anything to actually launch a dashboard server or mocking the server for testing?
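
One way to address this, sketched here with the standard library only, is a stub HTTP server that records every POST it receives so a test can assert on the requests the subscriber sent. All names are hypothetical, and the `/query/q1/end` path and payload in the usage note are only illustrative, loosely following the endpoint shown in the sequence diagram above.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RecordingHandler(BaseHTTPRequestHandler):
    """Accepts any POST and records (path, decoded JSON body) on the server object."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        self.server.received.append((self.path, json.loads(body or b"{}")))
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep test output quiet


def start_stub_dashboard():
    """Start the stub on a free local port; returns (server, base_url)."""
    server = HTTPServer(("127.0.0.1", 0), RecordingHandler)
    server.received = []
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_port}"


def post_json(url, payload):
    """POST a JSON payload, bypassing any proxy settings, and return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return response.status
```

In a pytest fixture, the test would start the stub, point `DAFT_DASHBOARD_URL` at the returned base URL, run a query, and then inspect `server.received`, for example checking that a POST to a path like `/query/q1/end` arrived, before calling `server.shutdown()`.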

for node_id in &active_nodes {
let runtime_stats = &node_stats_map[node_id];
let event = runtime_stats.snapshot();
let event = runtime_stats.flush();
Contributor

This shouldn't flush; that forces atomic synchronization, which is less efficient.

);
}

// Emit final stats to all subscribers before finishing
Contributor

This shouldn't be necessary, since the finalize_node step should already emit the final stats for that node.

Contributor

srilman commented Jan 16, 2026

also @Jay-ju if possible, could we split this PR into smaller pieces? this modifies a lot of small aspects of observability, and since we're also actively working on it, this pr will end up having a lot of merge conflicts

@Jay-ju Jay-ju force-pushed the jay/dashboard-ui-improvements branch from 0867a8e to 1789f71 Compare January 16, 2026 07:18
Contributor Author

Jay-ju commented Jan 21, 2026

also @Jay-ju if possible, could we split this PR into smaller pieces? this modifies a lot of small aspects of observability, and since we're also actively working on it, this pr will end up having a lot of merge conflicts

@srilman I have split this PR into two PRs: one for the frontend and one for the backend, and I have also resolved some conflicts:
backend: #6008
frontend: #6063

Could you please take another look when you have time?
