Skip to content

[core] Fix actor task queue blocked after cancelling head task#60850

Open
Yicheng-Lu-llll wants to merge 2 commits intoray-project:masterfrom
Yicheng-Lu-llll:fix/actor-cancel-send-pending-tasks-v2
Open

[core] Fix actor task queue blocked after cancelling head task#60850
Yicheng-Lu-llll wants to merge 2 commits intoray-project:masterfrom
Yicheng-Lu-llll:fix/actor-cancel-send-pending-tasks-v2

Conversation

@Yicheng-Lu-llll
Copy link
Member

@Yicheng-Lu-llll Yicheng-Lu-llll commented Feb 9, 2026

Description

When the head actor task in the queue is waiting for dependency resolution and gets cancelled via ray.cancel(), subsequent tasks in the queue may get stuck forever.

  1. Task A is submitted with an unresolved dependency, queued at the head of pending_requests_
  2. Task B (no dependencies) is submitted, queued behind A (actor tasks are sent in FIFO order)
  3. Task B's dependency resolution callback fires, but it can't be sent yet (A is still at the head)
  4. ray.cancel(ref_a) is called, which removes A from the queue
  5. Bug: CancelTask doesn't call SendPendingTasks(), so B remains stuck in the queue forever
  6. Task B's callback already fired, so nothing will trigger SendPendingTasks again

Fix:
In CancelTask, after removing the cancelled task from the queue, call SendPendingTasks() to unblock any subsequent ready tasks.

Test

before:

============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /home/ubuntu/.conda/envs/docs/bin/python
cachedir: .pytest_cache
rootdir: /home/ubuntu/ray
configfile: pytest.ini
plugins: asyncio-1.3.0, anyio-4.11.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1 item                                                               

python/ray/tests/test_actor_cancel.py::test_cancel_head_unblocks_queue FAILED [100%]

=================================== FAILURES ===================================
_______________________ test_cancel_head_unblocks_queue ________________________

shutdown_only = None

    def test_cancel_head_unblocks_queue(shutdown_only):
        """
        Test that when the head task of an actor's queue is cancelled,
        subsequent tasks with resolved dependencies can proceed.
    
        Scenario:
        - Task A depends on a long-running task (unresolved dependency)
        - Task B has no dependencies (resolved immediately)
        - Task B is queued behind Task A
        - Cancel Task A
        - Task B should now execute
        """
    
        @ray.remote
        class Actor:
            def process(self, data):
                return f"processed: {data}"
    
        @ray.remote
        def long_running_task():
            time.sleep(3600)
    
        ray.init()
        actor = Actor.remote()
    
        # Task A with unresolved dependency
        ref_long = long_running_task.remote()
        ref_a = actor.process.remote(ref_long)
    
        # Task B with no dependencies, queued behind A
        ref_b = actor.process.remote("ready_value")
    
        # Wait for Task B's dependency resolution to complete. After resolution,
        # Task B won't trigger SendPendingTasks again (it only happens in the
        # resolution callback). So if CancelTask doesn't call SendPendingTasks
        # properly, Task B will remain stuck in the queue forever.
        # TODO(Yicheng-Lu-llll): Use a deterministic approach if an API becomes
        # available to query whether a task's dependencies are resolved while
        # still queued on the client side.
        time.sleep(1)
    
        # Cancel Task A
        ray.cancel(ref_a)
    
        # Task B should complete
>       result = ray.get(ref_b, timeout=5)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^

python/ray/tests/test_actor_cancel.py:645: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
python/ray/_private/auto_init_hook.py:22: in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
python/ray/_private/client_mode_hook.py:104: in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
python/ray/_private/worker.py:2974: in get
    values, debugger_breakpoint = worker.get_objects(
python/ray/_private/worker.py:982: in get_objects
    ] = self.core_worker.get_objects(
python/ray/_raylet.pyx:2976: in ray._raylet.CoreWorker.get_objects
    check_status(op_status)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   raise GetTimeoutError(message)
E   ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready.

python/ray/includes/common.pxi:106: GetTimeoutError
----------------------------- Captured stderr call -----------------------------
2026-02-09 04:55:28,695 INFO worker.py:1997 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
=========================== short test summary info ============================
FAILED python/ray/tests/test_actor_cancel.py::test_cancel_head_unblocks_queue - ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready.
============================== 1 failed in 16.63s ==============================

after:

(docs) ubuntu@devbox:~/ray$  python -m pytest python/ray/tests/test_actor_cancel.py::test_cancel_head_unblocks_queue -v
================================================= test session starts =================================================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /home/ubuntu/.conda/envs/docs/bin/python
cachedir: .pytest_cache
rootdir: /home/ubuntu/ray
configfile: pytest.ini
plugins: asyncio-1.3.0, anyio-4.11.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1 item                                                                                                      

python/ray/tests/test_actor_cancel.py::test_cancel_head_unblocks_queue 
PASSED                                   [100%]

================================================= 1 passed in 12.12s ==================================================
(docs) ubuntu@devbox:~/ray$ 

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request correctly fixes a bug where the actor task queue could remain blocked if the head task was cancelled while still queued (e.g., waiting for dependency resolution). The changes ensure that the pending calls counter is properly decremented and that subsequent tasks in the queue are triggered for submission. The added test case in python/ray/tests/test_actor_cancel.py effectively reproduces and validates the fix.

Signed-off-by: yicheng <yicheng@anyscale.com>
@Yicheng-Lu-llll Yicheng-Lu-llll force-pushed the fix/actor-cancel-send-pending-tasks-v2 branch from 38578cc to 6398c2c Compare February 9, 2026 04:48
@Yicheng-Lu-llll Yicheng-Lu-llll marked this pull request as ready for review February 9, 2026 05:03
@Yicheng-Lu-llll Yicheng-Lu-llll requested a review from a team as a code owner February 9, 2026 05:03
@ray-gardener ray-gardener bot added the community-contribution Contributed by the community label Feb 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant