
fix: workflow incorrectly marked as completed while nodes are still executing#43

Open
tomerqodo wants to merge 5 commits into greptile_combined_100_qodo_grep_sentry_1_base_fix_workflow_incorrectly_marked_as_completed_while_nodes_are_still_executing_pr133 from greptile_combined_100_qodo_grep_sentry_1_head_fix_workflow_incorrectly_marked_as_completed_while_nodes_are_still_executing_pr133

Conversation

@tomerqodo

Benchmark PR from qodo-benchmark#133

@greptile-apps

greptile-apps bot commented Jan 10, 2026

Greptile Overview

Greptile Summary

Overview

This PR attempts to fix a race condition where workflows are incorrectly marked as completed while nodes are still executing. The fix adds a start_execution() call in the skip propagation flow and reorders operations in enqueue_node().

Changes Made

  1. skip_propagator.py: Added start_execution() call before enqueue_node() when a node with taken edges is discovered (lines 62-63)
  2. graph_state_manager.py: Swapped the order of ready_queue.put() and state assignment in enqueue_node() (lines 53-54)
  3. test_skip_propagator.py: Added comprehensive unit tests for skip propagation logic (new file, 309 lines)
  4. __init__.py: Added package marker file for the test directory (new file)

Root Cause of the Bug

The workflow completion check (is_execution_complete()) evaluates: ready_queue.empty() AND executing_nodes.count == 0

A race condition existed where:

  1. Thread A (Dispatcher) checks both conditions
  2. Thread B (Event Handler) calls enqueue_node() → queue becomes non-empty
  3. Thread B calls start_execution() → executing_nodes incremented
  4. If Thread A checked between these operations, it could see empty queue + zero executing nodes → false completion
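The window can be illustrated with a minimal, single-threaded model of the completion check. This is a sketch, not the engine's real code: `MiniStateManager` and its attribute names are hypothetical stand-ins for the `ready_queue` and `executing_nodes` described above.

```python
import queue


class MiniStateManager:
    """Toy model of the completion check described above (names are illustrative)."""

    def __init__(self):
        self.ready_queue = queue.Queue()
        self.executing_count = 0

    def is_execution_complete(self):
        # Mirrors: ready_queue.empty() AND executing_nodes.count == 0
        return self.ready_queue.empty() and self.executing_count == 0


sm = MiniStateManager()

# Upstream node just finished: the queue is empty and nothing is marked
# executing, but the downstream node has NOT been enqueued yet. If the
# dispatcher runs its check at exactly this point, it sees "complete".
assert sm.is_execution_complete()  # premature completion: the race window

# With the executing counter bumped before the handoff completes, the
# check can no longer observe (empty queue, zero executing) mid-transition.
sm.executing_count += 1           # start_execution(downstream)
sm.ready_queue.put("downstream")  # enqueue_node(downstream)
assert not sm.is_execution_complete()
```

The sketch is deterministic; in the real engine the same interleaving depends on thread scheduling, which is why the bug is intermittent.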

Critical Issues Found

1. Order Inconsistency Across Codebase (Logic Bug)

The new code in skip_propagator.py calls start_execution() → enqueue_node(), but ALL other locations use the opposite order:

  • event_handlers.py (3 locations): enqueue_node() → start_execution()
  • graph_engine.py (2 locations): enqueue_node() → start_execution()

This inconsistency creates a maintenance burden and the potential for subtle bugs. If the new order is correct for race-condition prevention, all call sites should be updated.

2. Duplicate Enqueueing Bug (Critical Logic Bug)

Nodes with multiple incoming edges can be enqueued multiple times. When multiple edges become TAKEN, each triggers enqueueing without checking if the node is already queued/executing:

Scenario:

```
NodeA --edge1--> NodeC
NodeB --edge2--> NodeC

1. NodeA completes → edge1 TAKEN → NodeC.is_ready() = true → enqueue NodeC
2. NodeB completes → edge2 TAKEN → NodeC.is_ready() = true → enqueue NodeC AGAIN
```

Both skip_propagator.py (new code) and edge_processor.py (existing) lack idempotency checks. The node will execute multiple times.

Fix needed: Add state check in enqueue_node():

```python
if self._graph.nodes[node_id].state != NodeState.TAKEN:
    self._ready_queue.put(node_id)
    self._graph.nodes[node_id].state = NodeState.TAKEN
```
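
A runnable sketch of that guard, using toy `Node`/`NodeState` types and a plain `queue.Queue` in place of the engine's real classes (all names here are illustrative):

```python
import queue
from enum import Enum


class NodeState(Enum):
    UNKNOWN = "unknown"
    TAKEN = "taken"


class Node:
    def __init__(self):
        self.state = NodeState.UNKNOWN


nodes = {"node_c": Node()}
ready_queue = queue.Queue()


def enqueue_node(node_id):
    # Idempotent version: a second call for an already-TAKEN node is a no-op
    if nodes[node_id].state != NodeState.TAKEN:
        ready_queue.put(node_id)
        nodes[node_id].state = NodeState.TAKEN


enqueue_node("node_c")  # edge1 becomes TAKEN
enqueue_node("node_c")  # edge2 becomes TAKEN; would duplicate without the guard
print(ready_queue.qsize())  # 1
```

With the guard, the second call observes state=TAKEN and returns without touching the queue, so the node executes at most once per workflow run.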

3. Operation Order Change in enqueue_node() (Logic Bug)

Swapping put() and state assignment changes error handling semantics. If put() throws an exception (e.g., queue full), the original order ensured state was already updated; new order leaves state inconsistent.
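The two failure modes can be sketched with a bounded queue whose put_nowait() raises queue.Full. This is an assumed failure mode for illustration only; whether the engine's real ready queue can ever reject a put is not established by this PR.

```python
import queue


def enqueue_new_order(state, ready, node_id):
    # PR's order: put() first, then mark TAKEN
    ready.put_nowait(node_id)
    state[node_id] = "TAKEN"


def enqueue_old_order(state, ready, node_id):
    # Original order: mark TAKEN first, then put()
    state[node_id] = "TAKEN"
    ready.put_nowait(node_id)


def run(fn):
    state = {"node_1": "UNKNOWN"}
    ready = queue.Queue(maxsize=1)
    ready.put_nowait("other")  # fill the queue so the next put_nowait() raises
    try:
        fn(state, ready, "node_1")
    except queue.Full:
        pass
    return state["node_1"]


print(run(enqueue_new_order))  # UNKNOWN: node neither queued nor marked
print(run(enqueue_old_order))  # TAKEN: node marked but never queued
```

Neither ordering is exception-safe on its own; each leaves a different inconsistency behind, which is why the reordering deserves an explicit justification or a try/except.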

4. Test Gaps (Style Issue)

Tests verify that both start_execution() and enqueue_node() are called but don't verify their order, which is critical for the race condition fix.

Recommendation

Do not merge as-is. While the PR addresses a real race condition, it introduces multiple critical bugs and inconsistencies that could cause nodes to execute multiple times or create subtle timing issues.

Confidence Score: 1/5

  • This PR has critical logic bugs that could cause nodes to execute multiple times and introduces inconsistencies across the codebase
  • Score of 1 reflects multiple critical issues: (1) duplicate enqueueing bug allows nodes to execute multiple times when they have multiple incoming edges - this is a severe correctness issue, (2) order inconsistency between skip_propagator and rest of codebase creates maintenance hazards and potential race conditions, (3) operation reordering in enqueue_node changes error handling semantics, (4) tests don't verify the critical ordering requirement. While the PR attempts to fix a real race condition, it introduces bugs that are equally serious or worse.
  • api/core/workflow/graph_engine/graph_traversal/skip_propagator.py needs duplicate enqueueing protection, api/core/workflow/graph_engine/graph_state_manager.py needs idempotency checks and operation order justification

Important Files Changed

File Analysis

| Filename | Score | Overview |
|----------|-------|----------|
| api/core/workflow/graph_engine/graph_traversal/skip_propagator.py | 2/5 | Added start_execution() call before enqueue_node() to prevent race condition, but introduces order inconsistency with rest of codebase and potential duplicate enqueueing bug |
| api/core/workflow/graph_engine/graph_state_manager.py | 2/5 | Swapped operation order in enqueue_node(), which changes error handling behavior; lacks duplicate enqueueing protection |
| api/tests/unit_tests/core/workflow/graph_engine/graph_traversal/test_skip_propagator.py | 3/5 | Comprehensive unit tests added, but missing order verification and duplicate enqueueing edge cases |

Sequence Diagram

```mermaid
sequenceDiagram
    participant D as Dispatcher
    participant EH as EventHandler
    participant EP as EdgeProcessor
    participant SP as SkipPropagator
    participant SM as StateManager
    participant RQ as ReadyQueue
    participant W as Worker

    Note over D,W: Node Completion Flow (Fixed)

    W->>EH: NodeRunSucceededEvent
    EH->>EP: process_node_success(node_id)
    EP->>SM: mark_edge_taken(edge_id)
    EP->>SM: is_node_ready(downstream_node)
    SM-->>EP: true (edge is TAKEN)
    EP-->>EH: [downstream_node]

    Note over EH,RQ: Critical Race Condition Window
    EH->>SM: enqueue_node(downstream_node)
    SM->>RQ: put(downstream_node)
    SM->>SM: set state=TAKEN
    EH->>SM: start_execution(downstream_node)
    SM->>SM: add to executing_nodes

    Note over D,RQ: Dispatcher checks completion
    D->>RQ: empty()?
    RQ-->>D: false (node in queue)
    D->>SM: get_executing_count()
    SM-->>D: 1 (node marked executing)
    Note over D: execution_complete = false ✓

    EH->>SM: finish_execution(completed_node)

    Note over D,W: Skip Propagation Flow (New Code)

    EP->>SP: skip_branch_paths(unselected_edges)
    SP->>SM: mark_edge_skipped(edge_id)
    SP->>SP: propagate_skip_from_edge(edge_id)
    SP->>SM: analyze_edge_states(incoming_edges)
    SM-->>SP: {has_taken: true, ...}

    Note over SP,RQ: NEW: start_execution called BEFORE enqueue
    SP->>SM: start_execution(downstream_node)
    SM->>SM: add to executing_nodes
    SP->>SM: enqueue_node(downstream_node)
    SM->>RQ: put(downstream_node)
    SM->>SM: set state=TAKEN

    Note over SP,SM: Potential Issue: Duplicate Enqueueing
    Note over SP: If multiple edges processed concurrently,<br/>node can be enqueued multiple times<br/>(no idempotency check)
```


@greptile-apps greptile-apps bot left a comment


5 files reviewed, 5 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +62 to 63 in api/core/workflow/graph_engine/graph_traversal/skip_propagator.py

```python
self._state_manager.start_execution(downstream_node_id)
self._state_manager.enqueue_node(downstream_node_id)
```


The order of start_execution() before enqueue_node() is inconsistent with the rest of the codebase. Throughout the system, these calls follow the pattern:

  1. enqueue_node(node_id) first
  2. start_execution(node_id) second

Evidence from existing code:

  • event_handlers.py:196-197: enqueue_node → start_execution
  • event_handlers.py:279-280: enqueue_node → start_execution
  • event_handlers.py:309-310: enqueue_node → start_execution
  • graph_engine.py:334-335: enqueue_node → start_execution
  • graph_engine.py:338-339: enqueue_node → start_execution

Only this location uses the reverse order, creating an inconsistency that could lead to subtle bugs or make the code harder to maintain. While calling start_execution() first may prevent a specific race condition (ensuring executing_nodes is incremented before the queue becomes non-empty), this should be done consistently across the entire codebase.

Suggested change

```diff
-self._state_manager.start_execution(downstream_node_id)
-self._state_manager.enqueue_node(downstream_node_id)
+# Enqueue node
+self._state_manager.enqueue_node(downstream_node_id)
+self._state_manager.start_execution(downstream_node_id)
```

If the reverse order is intentionally correct here (and it likely is for fixing the race condition), then ALL other call sites should be updated to match this pattern for consistency.


Comment on lines 52 to +54 in api/core/workflow/graph_engine/graph_state_manager.py

```diff
 with self._lock:
-    self._graph.nodes[node_id].state = NodeState.TAKEN
     self._ready_queue.put(node_id)
+    self._graph.nodes[node_id].state = NodeState.TAKEN
```


The order of operations inside enqueue_node() was changed - put() is now called before setting the node state to TAKEN. While both operations are protected by self._lock, this change has implications:

Potential Issue: If ready_queue.put() raises an exception (e.g., if the queue implementation has capacity limits or validation), the node state will not be set to TAKEN, leaving the node in an inconsistent state where it may be in the queue but not marked properly.

Original order was safer:

```python
self._graph.nodes[node_id].state = NodeState.TAKEN  # Update state first
self._ready_queue.put(node_id)                      # Then enqueue
```

With the original order, even if the enqueue operation fails, the state has already been updated. With the new order, if the enqueue succeeds but the state update fails (unlikely, but possible with property setters), the node is queued but never marked TAKEN.

Question: What was the specific reason for changing this order? If it's to ensure the queue reflects reality before the state changes, please document this reasoning in a comment.


Comment on lines +80 to +81 in api/tests/unit_tests/core/workflow/graph_engine/graph_traversal/test_skip_propagator.py

```python
mock_state_manager.start_execution.assert_called_once_with("node_2")
mock_state_manager.enqueue_node.assert_called_once_with("node_2")
```


The test verifies that both start_execution() and enqueue_node() are called, but does NOT verify the order of these calls. Since the order matters for preventing race conditions (as evidenced by the fix in this PR), the test should explicitly verify the calling order.

Consider using assert_has_calls() to verify the exact sequence:

Suggested change

```diff
-mock_state_manager.start_execution.assert_called_once_with("node_2")
-mock_state_manager.enqueue_node.assert_called_once_with("node_2")
+# Assert - verify both the calls and their order
+mock_state_manager.start_execution.assert_called_once_with("node_2")
+mock_state_manager.enqueue_node.assert_called_once_with("node_2")
+# Verify order: start_execution must be called before enqueue_node
+calls = mock_state_manager.method_calls
+start_idx = next(i for i, call in enumerate(calls) if call[0] == 'start_execution')
+enqueue_idx = next(i for i, call in enumerate(calls) if call[0] == 'enqueue_node')
+assert start_idx < enqueue_idx, "start_execution should be called before enqueue_node"
```

This ensures the fix for the race condition is maintained in future changes.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!


Comment on lines 59 to 64 in api/core/workflow/graph_engine/graph_traversal/skip_propagator.py

```python
# If any edge is taken, node may still execute
if edge_states["has_taken"]:
    # Enqueue node
    self._state_manager.start_execution(downstream_node_id)
    self._state_manager.enqueue_node(downstream_node_id)
    return
```


Critical Bug: Node can be enqueued multiple times

When a node has multiple incoming edges that are processed as TAKEN, propagate_skip_from_edge() will be called once per edge. Each call checks has_taken and enqueues the node again, causing the same node to appear in the ready queue multiple times.

Example Scenario:

```
Node A has 2 incoming edges (edge1, edge2)
1. Edge1 marked TAKEN → propagate_skip_from_edge(edge1) called
   - Sees has_taken=true → enqueues Node A
2. Edge2 marked TAKEN → propagate_skip_from_edge(edge2) called
   - Sees has_taken=true → enqueues Node A AGAIN
```

Result: Node A is in the queue twice, will be executed twice by workers.

Fix: Check if the node is already enqueued/executing before enqueueing:

Suggested change

```diff
 # If any edge is taken, node may still execute
 if edge_states["has_taken"]:
-    # Enqueue node
-    self._state_manager.start_execution(downstream_node_id)
-    self._state_manager.enqueue_node(downstream_node_id)
+    # Only enqueue if not already enqueued or executing
+    if not self._state_manager.is_executing(downstream_node_id):
+        self._state_manager.start_execution(downstream_node_id)
+        self._state_manager.enqueue_node(downstream_node_id)
     return
```

Alternatively, check the node state to prevent duplicate enqueueing:

```python
if self._state_manager.get_node_state(downstream_node_id) == NodeState.UNKNOWN:
    # Node hasn't been processed yet
    self._state_manager.start_execution(downstream_node_id)
    self._state_manager.enqueue_node(downstream_node_id)
```

@greptile-apps

greptile-apps bot commented Jan 10, 2026

Additional Comments (1)

api/core/workflow/graph_engine/graph_state_manager.py
Pre-existing Issue: No protection against duplicate enqueueing

The enqueue_node() method doesn't check if a node is already enqueued (state=TAKEN) before adding it to the ready queue. This can cause nodes to be executed multiple times when they have multiple incoming edges.

Scenario:

```
NodeA --edge1--> NodeC
NodeB --edge2--> NodeC

1. NodeA completes → edge1 marked TAKEN → NodeC is ready → enqueued (state=TAKEN)
2. NodeB completes → edge2 marked TAKEN → NodeC still reports as ready → enqueued AGAIN
```

Result: NodeC appears in the queue twice and will be executed by two workers.

Recommendation: Add idempotency check:

```python
def enqueue_node(self, node_id: str) -> None:
    """
    Mark a node as TAKEN and add it to the ready queue.

    This combines the state transition and enqueueing operations
    that always occur together when preparing a node for execution.

    Args:
        node_id: The ID of the node to enqueue
    """
    with self._lock:
        # Only enqueue if not already enqueued
        if self._graph.nodes[node_id].state != NodeState.TAKEN:
            self._ready_queue.put(node_id)
            self._graph.nodes[node_id].state = NodeState.TAKEN
```

Note: This is a pre-existing issue, not introduced by this PR, but should be addressed.

