Skip to content

Conversation

@Jay-ju
Copy link
Contributor

@Jay-ju Jay-ju commented Jan 26, 2026

Changes Made

image It often happens that the head node still has a lot of resources, but the scheduling cannot be applied. Therefore, I have removed the restriction here.

Related Issues

@github-actions github-actions bot added the perf label Jan 26, 2026
@Jay-ju
Copy link
Contributor Author

Jay-ju commented Jan 26, 2026

@colin-ho @desmondcheongzx Please take a look at the changes here when you have time.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 26, 2026

Greptile Overview

Greptile Summary

This PR removes the worker_threads(1) restriction from the scheduler runtime initialization, allowing the tokio multi-threaded runtime to use its default thread count (typically equal to the number of CPU cores).

Key Changes:

  • Removed the single-thread constraint on the Daft-Scheduler runtime
  • The runtime now defaults to tokio's standard behavior: creating worker threads equal to available CPU cores
  • Addresses resource utilization issues where the head node had available resources but scheduling was bottlenecked by the single-threaded constraint

Impact:

  • Improves scheduler throughput by allowing concurrent async operations across multiple threads
  • Better utilizes available CPU resources on the head/scheduler node
  • Should reduce scheduling bottlenecks when the head node has spare capacity

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - it's a well-scoped performance optimization
  • The change is minimal (removing one line), well-justified by the PR description, and aligns with standard tokio runtime practices. The scheduler runtime was artificially constrained to a single thread when created, which is unusual for a multi-threaded runtime. Removing this constraint allows the runtime to use its default behavior (worker threads = CPU cores), which should improve scheduler performance without introducing bugs or breaking changes.
  • No files require special attention

Important Files Changed

Filename Overview
src/daft-distributed/src/utils/runtime.rs Removed single-thread restriction from scheduler runtime, allowing tokio to use default multi-threaded behavior (typically CPU core count)

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant PR as PlanRunner
    participant Runtime as get_or_init_runtime()
    participant Tokio as Tokio Runtime
    participant Scheduler as Scheduler Actor
    participant Workers as Worker Pool
    
    App->>PR: run_plan()
    PR->>Runtime: Initialize runtime
    
    Note over Runtime,Tokio: Before PR: worker_threads(1)<br/>After PR: Default (# of CPU cores)
    
    Runtime->>Tokio: new_multi_thread()
    Runtime->>Tokio: enable_all()
    Runtime->>Tokio: thread_name_fn("Daft-Scheduler")
    Tokio-->>Runtime: Runtime created
    
    PR->>Runtime: block_on_current_thread()
    Runtime->>Scheduler: spawn_scheduler_actor()
    
    loop Task Execution
        Scheduler->>Workers: Schedule tasks
        Workers-->>Scheduler: Task results
        Note over Scheduler,Workers: With more threads, scheduler<br/>can handle more concurrent<br/>async operations
    end
    
    Scheduler-->>PR: Results
    PR-->>App: PlanResult
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@codecov
Copy link

codecov bot commented Jan 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.90%. Comparing base (538c593) to head (055dca1).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6089      +/-   ##
==========================================
- Coverage   72.91%   72.90%   -0.02%     
==========================================
  Files         973      973              
  Lines      126184   126195      +11     
==========================================
- Hits        92011    92002       -9     
- Misses      34173    34193      +20     
Files with missing lines Coverage Δ
src/daft-distributed/src/utils/runtime.rs 0.00% <ø> (ø)

... and 11 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Collaborator

@colin-ho colin-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the high cpu coming from? Is it ray or flotilla scheduler? When you remove the worker thread restriction, do you see increased rate of scheduling?

I'm worried that simply increasing worker threads is masking an underlying problem. The reason this was 1 is because the flotilla scheduler should not be doing any meaningful compute. It also makes it much easier to debug when there's only one thread. If possible, I actually wanted to move this to the current thread runtime instead of multi thread.

@srilman
Copy link
Contributor

srilman commented Feb 2, 2026

cc @Jay-ju

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants