2 changes: 1 addition & 1 deletion api/core/workflow/graph_engine/graph_state_manager.py
@@ -50,8 +50,8 @@ def enqueue_node(self, node_id: str) -> None:
             node_id: The ID of the node to enqueue
         """
         with self._lock:
-            self._graph.nodes[node_id].state = NodeState.TAKEN
             self._ready_queue.put(node_id)
+            self._graph.nodes[node_id].state = NodeState.TAKEN
Comment on lines 52 to +54
Path: api/core/workflow/graph_engine/graph_state_manager.py
Line: 52:54

Comment:
The order of operations inside `enqueue_node()` was changed: `put()` is now called before the node state is set to `TAKEN`. Both operations are protected by `self._lock`, but the change still has implications:

**Potential Issue:** If `ready_queue.put()` raises an exception (e.g., if the queue implementation has capacity limits or validation), the node state will not be set to `TAKEN`, leaving the node in an inconsistent state where it may be in the queue but not marked properly.

**Original order was safer:**
```python
self._graph.nodes[node_id].state = NodeState.TAKEN  # Update state first
self._ready_queue.put(node_id)                        # Then enqueue
```

With this order, the state is already `TAKEN` by the time the node becomes visible in the queue, so a worker that dequeues it sees the updated state. If the state update happens after the enqueue, and the enqueue succeeds but the state update fails (unlikely, but possible with property setters), the node is queued but not marked `TAKEN`.

**Question:** What was the specific reason for changing this order? If it's to ensure the queue reflects reality before the state changes, please document this reasoning in a comment.
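
Whichever order is kept, one way to avoid a half-applied update is to roll the state back when the enqueue raises. The following is a minimal sketch of that idea, not the project's implementation: `TinyStateManager`, `state_of`, the local `NodeState` enum, and the bounded queue are all hypothetical stand-ins introduced for illustration.

```python
import queue
import threading
from enum import Enum


class NodeState(Enum):
    """Local stand-in for the project's NodeState (assumption)."""

    UNKNOWN = "unknown"
    TAKEN = "taken"


class TinyStateManager:
    """Hypothetical, simplified state manager used only for this sketch."""

    def __init__(self, node_ids, maxsize=0):
        self._lock = threading.RLock()
        self._states = {node_id: NodeState.UNKNOWN for node_id in node_ids}
        self._ready_queue = queue.Queue(maxsize=maxsize)

    def enqueue_node(self, node_id):
        with self._lock:
            previous = self._states[node_id]
            # Update the state first so the node is never queued while unmarked.
            self._states[node_id] = NodeState.TAKEN
            try:
                # put_nowait raises queue.Full instead of blocking under the lock.
                self._ready_queue.put_nowait(node_id)
            except Exception:
                # Roll back so a failed enqueue does not leave a stale TAKEN mark.
                self._states[node_id] = previous
                raise

    def state_of(self, node_id):
        with self._lock:
            return self._states[node_id]
```

With a queue of capacity 1, a second `enqueue_node()` call raises `queue.Full` and the second node stays `UNKNOWN`, so the state and the queue contents agree on both the success and the failure path.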


     def mark_node_skipped(self, node_id: str) -> None:
         """
@@ -59,6 +59,7 @@ def propagate_skip_from_edge(self, edge_id: str) -> None:
         # If any edge is taken, node may still execute
         if edge_states["has_taken"]:
             # Enqueue node
+            self._state_manager.start_execution(downstream_node_id)
             self._state_manager.enqueue_node(downstream_node_id)
Comment on lines +62 to 63
Path: api/core/workflow/graph_engine/graph_traversal/skip_propagator.py
Line: 62:63

Comment:
The order of `start_execution()` before `enqueue_node()` is inconsistent with the rest of the codebase. Throughout the system, these calls follow the pattern:

1. `enqueue_node(node_id)` first
2. `start_execution(node_id)` second

**Evidence from existing code:**
- `event_handlers.py:196-197`: `enqueue_node` → `start_execution`
- `event_handlers.py:279-280`: `enqueue_node` → `start_execution`
- `event_handlers.py:309-310`: `enqueue_node` → `start_execution`
- `graph_engine.py:334-335`: `enqueue_node` → `start_execution`
- `graph_engine.py:338-339`: `enqueue_node` → `start_execution`

**Only this location uses the reverse order**, creating an inconsistency that could lead to subtle bugs or make the code harder to maintain. While calling `start_execution()` first may prevent a specific race condition (ensuring `executing_nodes` is incremented before the queue becomes non-empty), this should be done consistently across the entire codebase.

```suggestion
            # Enqueue node
            self._state_manager.enqueue_node(downstream_node_id)
            self._state_manager.start_execution(downstream_node_id)
```

**If the reverse order is intentionally correct here** (and it likely is for fixing the race condition), then ALL other call sites should be updated to match this pattern for consistency.
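
To make the ordering concern concrete, here is a small single-threaded sketch of the window that calling `start_execution()` first is meant to close. It assumes, hypothetically, that completion is detected as "ready queue empty and nothing executing"; `TinyTracker`, `is_complete`, and `completion_seen_mid_handoff` are illustrative names, not part of the codebase.

```python
import queue


class TinyTracker:
    """Hypothetical tracker: a ready queue plus the set of executing node IDs."""

    def __init__(self):
        self.ready_queue = queue.Queue()
        self.executing = set()

    def enqueue_node(self, node_id):
        self.ready_queue.put(node_id)

    def start_execution(self, node_id):
        self.executing.add(node_id)

    def is_complete(self):
        # Assumed completion rule: nothing queued and nothing executing.
        return self.ready_queue.empty() and not self.executing


def completion_seen_mid_handoff(start_first):
    """Simulate a worker that dequeues the node as soon as it is queued."""
    tracker = TinyTracker()
    if start_first:
        tracker.start_execution("n")
        tracker.enqueue_node("n")
        tracker.ready_queue.get_nowait()  # worker picks the node up
        return tracker.is_complete()
    tracker.enqueue_node("n")
    tracker.ready_queue.get_nowait()  # worker picks the node up early
    seen = tracker.is_complete()  # checked before start_execution runs
    tracker.start_execution("n")
    return seen
```

Under these assumptions, `completion_seen_mid_handoff(start_first=False)` reports a false "complete" reading while the node is still in flight, whereas `start_first=True` does not. If the real engine detects completion differently, the window may not exist in the same form.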

             return
Comment on lines 59 to 64
Path: api/core/workflow/graph_engine/graph_traversal/skip_propagator.py
Line: 59:64

Comment:
**Critical Bug: Node can be enqueued multiple times**

When a node has multiple incoming edges that are processed as TAKEN, `propagate_skip_from_edge()` will be called once per edge. Each call checks `has_taken` and enqueues the node again, causing the same node to appear in the ready queue multiple times.

**Example Scenario:**
```
Node A has 2 incoming edges (edge1, edge2)
1. Edge1 marked TAKEN → propagate_skip_from_edge(edge1) called
   - Sees has_taken=true → enqueues Node A
2. Edge2 marked TAKEN → propagate_skip_from_edge(edge2) called  
   - Sees has_taken=true → enqueues Node A AGAIN
```

Result: Node A is in the queue twice and will be executed twice by workers.

**Fix:** Check if the node is already enqueued/executing before enqueueing:

```suggestion
        # If any edge is taken, node may still execute
        if edge_states["has_taken"]:
            # Only enqueue if not already enqueued or executing
            if not self._state_manager.is_executing(downstream_node_id):
                self._state_manager.start_execution(downstream_node_id)
                self._state_manager.enqueue_node(downstream_node_id)
            return
```

Alternatively, check the node state to prevent duplicate enqueueing:
```python
if self._state_manager.get_node_state(downstream_node_id) == NodeState.UNKNOWN:
    # Node hasn't been processed yet
    self._state_manager.start_execution(downstream_node_id)
    self._state_manager.enqueue_node(downstream_node_id)
```
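
Another way to express the same guard is to make the enqueue itself idempotent, so callers do not need to check first. This is only a sketch under assumptions: `DedupReadyQueue` and `enqueue_once` are hypothetical names, and the real state manager may already track this through node state.

```python
import queue
import threading


class DedupReadyQueue:
    """Hypothetical ready queue that accepts each node ID at most once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()
        self._queue = queue.Queue()

    def enqueue_once(self, node_id):
        """Enqueue node_id unless it was enqueued before; return True if queued."""
        with self._lock:
            if node_id in self._seen:
                return False
            self._seen.add(node_id)
            self._queue.put(node_id)
            return True

    def pending(self):
        """Return the approximate number of queued node IDs."""
        return self._queue.qsize()
```

Calling `enqueue_once("A")` twice queues the node only on the first call; the second call returns `False` and leaves the queue unchanged. The check and the insert happen under one lock, so two callers cannot both pass the check.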


@@ -0,0 +1 @@
"""Tests for graph traversal components."""
@@ -0,0 +1,308 @@
"""Unit tests for skip propagator."""

from typing import Any
from unittest.mock import MagicMock, create_autospec

from core.workflow.graph import Edge, Graph
from core.workflow.graph_engine.graph_state_manager import GraphStateManager
from core.workflow.graph_engine.graph_traversal.skip_propagator import SkipPropagator


class TestSkipPropagator:
    """Test suite for SkipPropagator."""

    def test_propagate_skip_from_edge_with_unknown_edges_stops_processing(self) -> None:
        """When there are unknown incoming edges, propagation should stop."""
        # Arrange
        mock_graph = create_autospec(Graph)
        mock_state_manager = create_autospec(GraphStateManager)

        # Create a mock edge
        mock_edge = MagicMock(spec=Edge)
        mock_edge.id = "edge_1"
        mock_edge.head = "node_2"

        # Setup graph edges dict
        mock_graph.edges = {"edge_1": mock_edge}

        # Setup incoming edges
        incoming_edges = [MagicMock(spec=Edge), MagicMock(spec=Edge)]
        mock_graph.get_incoming_edges.return_value = incoming_edges

        # Setup state manager to return has_unknown=True
        mock_state_manager.analyze_edge_states.return_value = {
            "has_unknown": True,
            "has_taken": False,
            "all_skipped": False,
        }

        propagator = SkipPropagator(mock_graph, mock_state_manager)

        # Act
        propagator.propagate_skip_from_edge("edge_1")

        # Assert
        mock_graph.get_incoming_edges.assert_called_once_with("node_2")
        mock_state_manager.analyze_edge_states.assert_called_once_with(incoming_edges)
        # Should not call any other state manager methods
        mock_state_manager.enqueue_node.assert_not_called()
        mock_state_manager.start_execution.assert_not_called()
        mock_state_manager.mark_node_skipped.assert_not_called()

    def test_propagate_skip_from_edge_with_taken_edge_enqueues_node(self) -> None:
        """When there is at least one taken edge, node should be enqueued."""
        # Arrange
        mock_graph = create_autospec(Graph)
        mock_state_manager = create_autospec(GraphStateManager)

        # Create a mock edge
        mock_edge = MagicMock(spec=Edge)
        mock_edge.id = "edge_1"
        mock_edge.head = "node_2"

        mock_graph.edges = {"edge_1": mock_edge}
        incoming_edges = [MagicMock(spec=Edge)]
        mock_graph.get_incoming_edges.return_value = incoming_edges

        # Setup state manager to return has_taken=True
        mock_state_manager.analyze_edge_states.return_value = {
            "has_unknown": False,
            "has_taken": True,
            "all_skipped": False,
        }

        propagator = SkipPropagator(mock_graph, mock_state_manager)

        # Act
        propagator.propagate_skip_from_edge("edge_1")

        # Assert
        mock_state_manager.start_execution.assert_called_once_with("node_2")
        mock_state_manager.enqueue_node.assert_called_once_with("node_2")
Comment on lines +80 to +81
Path: api/tests/unit_tests/core/workflow/graph_engine/graph_traversal/test_skip_propagator.py
Line: 80:81

Comment:
The test verifies that both `start_execution()` and `enqueue_node()` are called, but does NOT verify the order of these calls. Since the order matters for preventing race conditions (as evidenced by the fix in this PR), the test should explicitly verify the calling order.

Consider comparing the positions of the two calls in `method_calls` to verify the sequence:

```suggestion
        # Assert - verify both the calls and their order
        mock_state_manager.start_execution.assert_called_once_with("node_2")
        mock_state_manager.enqueue_node.assert_called_once_with("node_2")
        # Verify order: start_execution must be called before enqueue_node
        calls = mock_state_manager.method_calls
        start_idx = next(i for i, call in enumerate(calls) if call[0] == 'start_execution')
        enqueue_idx = next(i for i, call in enumerate(calls) if call[0] == 'enqueue_node')
        assert start_idx < enqueue_idx, "start_execution should be called before enqueue_node"
```

This ensures the fix for the race condition is maintained in future changes.
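
For comparison, the same ordering check can be written with `assert_has_calls()` on the parent mock, which records calls to its child methods in order. This is a sketch using a plain `MagicMock`; `run_handoff` is a hypothetical stand-in for the code under test, not a function from the PR.

```python
from unittest.mock import MagicMock, call


def run_handoff(state_manager, node_id):
    """Hypothetical stand-in for the code under test."""
    state_manager.start_execution(node_id)
    state_manager.enqueue_node(node_id)


state_manager = MagicMock()
run_handoff(state_manager, "node_2")

# Passes only if the two calls appear in this order, back to back.
state_manager.assert_has_calls(
    [call.start_execution("node_2"), call.enqueue_node("node_2")]
)
```

Note that `assert_has_calls()` requires the expected calls to be consecutive in the recorded call list, so it is stricter than the index comparison: any other call on the mock between the two would make it fail.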

        mock_state_manager.mark_node_skipped.assert_not_called()

    def test_propagate_skip_from_edge_with_all_skipped_propagates_to_node(self) -> None:
        """When all incoming edges are skipped, should propagate skip to node."""
        # Arrange
        mock_graph = create_autospec(Graph)
        mock_state_manager = create_autospec(GraphStateManager)

        # Create a mock edge
        mock_edge = MagicMock(spec=Edge)
        mock_edge.id = "edge_1"
        mock_edge.head = "node_2"

        mock_graph.edges = {"edge_1": mock_edge}
        incoming_edges = [MagicMock(spec=Edge)]
        mock_graph.get_incoming_edges.return_value = incoming_edges

        # Setup state manager to return all_skipped=True
        mock_state_manager.analyze_edge_states.return_value = {
            "has_unknown": False,
            "has_taken": False,
            "all_skipped": True,
        }

        propagator = SkipPropagator(mock_graph, mock_state_manager)

        # Act
        propagator.propagate_skip_from_edge("edge_1")

        # Assert
        mock_state_manager.mark_node_skipped.assert_called_once_with("node_2")
        mock_state_manager.enqueue_node.assert_not_called()
        mock_state_manager.start_execution.assert_not_called()

    def test_propagate_skip_to_node_marks_node_and_outgoing_edges_skipped(self) -> None:
        """_propagate_skip_to_node should mark node and all outgoing edges as skipped."""
        # Arrange
        mock_graph = create_autospec(Graph)
        mock_state_manager = create_autospec(GraphStateManager)

        # Create outgoing edges
        edge1 = MagicMock(spec=Edge)
        edge1.id = "edge_2"
        edge1.head = "node_downstream_1"  # Set head for propagate_skip_from_edge

        edge2 = MagicMock(spec=Edge)
        edge2.id = "edge_3"
        edge2.head = "node_downstream_2"

        # Setup graph edges dict for propagate_skip_from_edge
        mock_graph.edges = {"edge_2": edge1, "edge_3": edge2}
        mock_graph.get_outgoing_edges.return_value = [edge1, edge2]

        # Setup get_incoming_edges to return empty list to stop recursion
        mock_graph.get_incoming_edges.return_value = []

        propagator = SkipPropagator(mock_graph, mock_state_manager)

        # Act: call the private method directly
        propagator._propagate_skip_to_node("node_1")

        # Assert
        mock_state_manager.mark_node_skipped.assert_called_once_with("node_1")
        mock_state_manager.mark_edge_skipped.assert_any_call("edge_2")
        mock_state_manager.mark_edge_skipped.assert_any_call("edge_3")
        assert mock_state_manager.mark_edge_skipped.call_count == 2
        # Should recursively propagate from each edge
        # Since propagate_skip_from_edge is called, we need to verify it was called
        # But we can't directly verify due to recursion. We'll trust the logic.

    def test_skip_branch_paths_marks_unselected_edges_and_propagates(self) -> None:
        """skip_branch_paths should mark all unselected edges as skipped and propagate."""
        # Arrange
        mock_graph = create_autospec(Graph)
        mock_state_manager = create_autospec(GraphStateManager)

        # Create unselected edges
        edge1 = MagicMock(spec=Edge)
        edge1.id = "edge_1"
        edge1.head = "node_downstream_1"

        edge2 = MagicMock(spec=Edge)
        edge2.id = "edge_2"
        edge2.head = "node_downstream_2"

        unselected_edges = [edge1, edge2]

        # Setup graph edges dict
        mock_graph.edges = {"edge_1": edge1, "edge_2": edge2}
        # Setup get_incoming_edges to return empty list to stop recursion
        mock_graph.get_incoming_edges.return_value = []

        propagator = SkipPropagator(mock_graph, mock_state_manager)

        # Act
        propagator.skip_branch_paths(unselected_edges)

        # Assert
        mock_state_manager.mark_edge_skipped.assert_any_call("edge_1")
        mock_state_manager.mark_edge_skipped.assert_any_call("edge_2")
        assert mock_state_manager.mark_edge_skipped.call_count == 2
        # propagate_skip_from_edge should be called for each edge
        # We can't directly verify due to the mock, but the logic is covered

    def test_propagate_skip_from_edge_recursively_propagates_through_graph(self) -> None:
        """Skip propagation should recursively propagate through the graph."""
        # Arrange
        mock_graph = create_autospec(Graph)
        mock_state_manager = create_autospec(GraphStateManager)

        # Create edge chain: edge_1 -> node_2 -> edge_3 -> node_4
        edge1 = MagicMock(spec=Edge)
        edge1.id = "edge_1"
        edge1.head = "node_2"

        edge3 = MagicMock(spec=Edge)
        edge3.id = "edge_3"
        edge3.head = "node_4"

        mock_graph.edges = {"edge_1": edge1, "edge_3": edge3}

        # Setup get_incoming_edges to return different values based on node
        def get_incoming_edges_side_effect(node_id: Any) -> Any:
            if node_id == "node_2":
                return [edge1]
            elif node_id == "node_4":
                return [edge3]
            return []

        mock_graph.get_incoming_edges.side_effect = get_incoming_edges_side_effect

        # Setup get_outgoing_edges to return different values based on node
        def get_outgoing_edges_side_effect(node_id: Any) -> Any:
            if node_id == "node_2":
                return [edge3]
            elif node_id == "node_4":
                return []  # No outgoing edges, stops recursion
            return []

        mock_graph.get_outgoing_edges.side_effect = get_outgoing_edges_side_effect

        # Setup state manager to return all_skipped for both nodes
        mock_state_manager.analyze_edge_states.return_value = {
            "has_unknown": False,
            "has_taken": False,
            "all_skipped": True,
        }

        propagator = SkipPropagator(mock_graph, mock_state_manager)

        # Act
        propagator.propagate_skip_from_edge("edge_1")

        # Assert
        # Should mark node_2 as skipped
        mock_state_manager.mark_node_skipped.assert_any_call("node_2")
        # Should mark edge_3 as skipped
        mock_state_manager.mark_edge_skipped.assert_any_call("edge_3")
        # Should propagate to node_4
        mock_state_manager.mark_node_skipped.assert_any_call("node_4")
        assert mock_state_manager.mark_node_skipped.call_count == 2

    def test_propagate_skip_from_edge_with_mixed_edge_states_handles_correctly(self) -> None:
        """Test with mixed edge states (some unknown, some taken, some skipped)."""
        # Arrange
        mock_graph = create_autospec(Graph)
        mock_state_manager = create_autospec(GraphStateManager)

        mock_edge = MagicMock(spec=Edge)
        mock_edge.id = "edge_1"
        mock_edge.head = "node_2"

        mock_graph.edges = {"edge_1": mock_edge}
        incoming_edges = [MagicMock(spec=Edge), MagicMock(spec=Edge), MagicMock(spec=Edge)]
        mock_graph.get_incoming_edges.return_value = incoming_edges

        # Test 1: has_unknown=True, has_taken=False, all_skipped=False
        mock_state_manager.analyze_edge_states.return_value = {
            "has_unknown": True,
            "has_taken": False,
            "all_skipped": False,
        }

        propagator = SkipPropagator(mock_graph, mock_state_manager)

        # Act
        propagator.propagate_skip_from_edge("edge_1")

        # Assert - should stop processing
        mock_state_manager.enqueue_node.assert_not_called()
        mock_state_manager.mark_node_skipped.assert_not_called()

        # Reset mocks for next test
        mock_state_manager.reset_mock()
        mock_graph.reset_mock()

        # Test 2: has_unknown=False, has_taken=True, all_skipped=False
        mock_state_manager.analyze_edge_states.return_value = {
            "has_unknown": False,
            "has_taken": True,
            "all_skipped": False,
        }

        # Act
        propagator.propagate_skip_from_edge("edge_1")

        # Assert - should enqueue node
        mock_state_manager.start_execution.assert_called_once_with("node_2")
        mock_state_manager.enqueue_node.assert_called_once_with("node_2")

        # Reset mocks for next test
        mock_state_manager.reset_mock()
        mock_graph.reset_mock()

        # Test 3: has_unknown=False, has_taken=False, all_skipped=True
        mock_state_manager.analyze_edge_states.return_value = {
            "has_unknown": False,
            "has_taken": False,
            "all_skipped": True,
        }

        # Act
        propagator.propagate_skip_from_edge("edge_1")

        # Assert - should propagate skip
        mock_state_manager.mark_node_skipped.assert_called_once_with("node_2")
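
Two of the tests above note that the recursive `propagate_skip_from_edge` calls cannot be verified directly. One possible approach, sketched here on a hypothetical `TinyPropagator` rather than the real `SkipPropagator`, is to spy on the method with `patch.object(..., wraps=...)` so that recursive calls made through `self` are recorded while the real logic still runs.

```python
from unittest.mock import call, patch


class TinyPropagator:
    """Hypothetical recursive propagator used only to illustrate spying."""

    def __init__(self, downstream):
        # Maps an edge ID to the edge IDs reached by propagating from it.
        self._downstream = downstream

    def propagate(self, edge_id):
        for next_edge_id in self._downstream.get(edge_id, []):
            self.propagate(next_edge_id)


propagator = TinyPropagator({"edge_1": ["edge_2", "edge_3"]})

# wraps= keeps the original behavior; the spy records every call made via self.
with patch.object(propagator, "propagate", wraps=propagator.propagate) as spy:
    propagator.propagate("edge_1")

recorded_calls = spy.call_args_list
```

Here `recorded_calls` contains the top-level call followed by the two recursive ones, so a test can assert on them with `call("edge_2")` and `call("edge_3")`. Whether this works unchanged for `SkipPropagator` depends on it invoking the method through `self`, which is an assumption.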