
fix: workflow incorrectly marked as completed while nodes are still executing #83

Open
tomerqodo wants to merge 5 commits into qodo_combined_100_qodo_grep_sentry_1_base_fix_workflow_incorrectly_marked_as_completed_while_nodes_are_still_executing_pr133 from qodo_combined_100_qodo_grep_sentry_1_head_fix_workflow_incorrectly_marked_as_completed_while_nodes_are_still_executing_pr133

Conversation

@tomerqodo

Benchmark PR from qodo-benchmark#133

@qodo-code-review

Code Review by Qodo (Alpha)

🐞 Bugs (1) 📘 Rule Violations (1) 📎 Requirement Gaps (0) 💡 Suggestions (0)


Action Required

1. Invalid node IDs queued 🐞 Bug
Description
• enqueue_node() now pushes node_id into the ready queue before touching
  self._graph.nodes[node_id]; if the node ID is invalid, a KeyError is raised after the enqueue,
  leaving a poisoned queue item behind.
• Workers dereference self._graph.nodes[node_id] outside the worker’s try: block, so a poisoned
  queue item can crash the worker thread and stall execution.
• Resume flow enqueues paused_nodes without validating the node still exists, increasing the
  chance of this regression surfacing (e.g., stale snapshots or graph changes between pause/resume).

reliability

Code

api/core/workflow/graph_engine/graph_state_manager.py[R52-54]

        with self._lock:
-            self._graph.nodes[node_id].state = NodeState.TAKEN
            self._ready_queue.put(node_id)
+            self._graph.nodes[node_id].state = NodeState.TAKEN
Evidence
The new enqueue_node() ordering enqueues first and only then indexes graph.nodes[node_id], so an
invalid node_id can be inserted into the ready queue before the method fails. Worker threads then
pop from the queue and immediately index graph.nodes[node_id] (outside any error handling), which
can raise KeyError and kill the thread. The resume path feeds paused_nodes directly into
enqueue_node() without checking membership in graph.nodes, and paused_nodes are
stored/returned without validation.

api/core/workflow/graph_engine/graph_state_manager.py[42-55]
api/core/workflow/graph_engine/worker.py[100-112]
api/core/workflow/graph_engine/graph_engine.py[317-340]
api/core/workflow/runtime/graph_runtime_state.py[342-352]
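
The failure mode described in the evidence can be reproduced in isolation. The following is a minimal sketch, not the project's code: `Node`, `NodeState.UNKNOWN`, and `enqueue_node_unsafe` are hypothetical stand-ins, and only the ordering of the two statements is taken from the diff above.

```python
import queue
from dataclasses import dataclass
from enum import Enum


class NodeState(Enum):
    # Hypothetical stand-in; only TAKEN is named in the review.
    UNKNOWN = "unknown"
    TAKEN = "taken"


@dataclass
class Node:
    # Hypothetical stand-in for a graph node.
    state: NodeState = NodeState.UNKNOWN


def enqueue_node_unsafe(
    nodes: dict[str, Node], ready: "queue.Queue[str]", node_id: str
) -> None:
    """Mirror the ordering in the diff: enqueue first, then index the mapping."""
    ready.put(node_id)
    nodes[node_id].state = NodeState.TAKEN  # raises KeyError for an unknown ID


nodes = {"a": Node()}
ready: "queue.Queue[str]" = queue.Queue()

try:
    enqueue_node_unsafe(nodes, ready, "missing")
except KeyError:
    pass  # the caller sees the error, but the queue was already mutated

# The invalid ID is still sitting in the queue for a consumer to pick up.
poisoned_id = ready.get_nowait()
```

A consumer that indexes the node mapping with `poisoned_id` outside any error handling would hit the same `KeyError`, which is the worker crash the evidence describes.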

Agent Prompt
### Issue description
`GraphStateManager.enqueue_node()` currently enqueues `node_id` into the ready queue before verifying that `node_id` exists in `self._graph.nodes`. If the node ID is invalid/stale, the method raises after the enqueue, leaving a bad ID in the queue; workers then crash when dereferencing `graph.nodes[node_id]`.

### Issue Context
This is particularly risky in resume flows: `GraphEngine._start_execution(resume=True)` enqueues `paused_nodes` without validating membership in the current graph, and `GraphRuntimeState` does not validate paused IDs.
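
One possible way to guard the resume path is to split the paused IDs into those still present in the graph and those that are stale before anything is enqueued. The sketch below is an assumption about how such a check could look; `partition_paused_nodes` is a hypothetical helper, not an existing function in the codebase, and what to do with stale IDs (log, raise, or skip) is left to the caller.

```python
from collections.abc import Iterable, Mapping


def partition_paused_nodes(
    paused_nodes: Iterable[str], graph_nodes: Mapping[str, object]
) -> tuple[list[str], list[str]]:
    """Split paused node IDs into (still in the graph, stale), preserving order."""
    valid: list[str] = []
    stale: list[str] = []
    for node_id in paused_nodes:
        if node_id in graph_nodes:
            valid.append(node_id)
        else:
            stale.append(node_id)
    return valid, stale


# Example: "ghost" was paused in an older snapshot and no longer exists.
valid, stale = partition_paused_nodes(
    ["a", "ghost", "b"], {"a": object(), "b": object()}
)
```

Only the IDs in `valid` would then be passed to `enqueue_node()`.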

### Fix Focus Areas
- api/core/workflow/graph_engine/graph_state_manager.py[42-55]
- api/core/workflow/graph_engine/graph_engine.py[317-340]
- api/core/workflow/graph_engine/worker.py[100-112]
- api/core/workflow/runtime/graph_runtime_state.py[342-352]

### Suggested implementation sketch
- In `enqueue_node()`:
  - resolve/validate the node first (`node = self._graph.nodes.get(node_id)`; if None -> raise or no-op depending on desired behavior)
  - then `self._ready_queue.put(node_id)`
  - then set `node.state = NodeState.TAKEN`
- (Optional hardening) Add a new `schedule_node()` API that performs enqueue + execution-tracking updates atomically and use it consistently across call sites.
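
The ordering in the sketch above could look roughly like this. It is a self-contained illustration, not the real `GraphStateManager`: `Graph`, `Node`, `NodeState.UNKNOWN`, and `GraphStateManagerSketch` are hypothetical stand-ins, and raising `KeyError` on an unknown ID is one of the two behaviors the sketch leaves open (the other being a no-op).

```python
import queue
import threading
from dataclasses import dataclass, field
from enum import Enum


class NodeState(Enum):
    # Hypothetical stand-in; only TAKEN is named in the review.
    UNKNOWN = "unknown"
    TAKEN = "taken"


@dataclass
class Node:
    state: NodeState = NodeState.UNKNOWN


@dataclass
class Graph:
    nodes: dict[str, Node] = field(default_factory=dict)


class GraphStateManagerSketch:
    """Illustrative only: validates the node before touching the queue."""

    def __init__(self, graph: Graph) -> None:
        self._graph = graph
        self._ready_queue: queue.Queue[str] = queue.Queue()
        self._lock = threading.RLock()

    def enqueue_node(self, node_id: str) -> None:
        with self._lock:
            # 1. Resolve the node first so an invalid ID fails before any mutation.
            node = self._graph.nodes.get(node_id)
            if node is None:
                raise KeyError(f"unknown node id: {node_id!r}")
            # 2. Only now publish the ID to consumers.
            self._ready_queue.put(node_id)
            # 3. Mark the already-resolved node; no second lookup that could fail.
            node.state = NodeState.TAKEN
```

With this ordering, a failed call leaves both the queue and the node states untouched, so callers can handle the error without cleaning up a partially applied enqueue.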



Remediation Recommended

2. `Any` used in tests 📘 Rule Violation
Description
• The new unit test introduces typing.Any and uses it in annotations for
  get_incoming_edges_side_effect, weakening static typing and making refactors/type-checking less
  effective.
• This conflicts with the requirement to use strong typing and avoid overly-permissive types unless
  strictly necessary.
• It may also mask incorrect mock usage/signatures that stronger types would catch earlier.

reliability

Code

api/tests/unit_tests/core/workflow/graph_engine/graph_traversal/test_skip_propagator.py[3]

+from typing import Any
Evidence
PR Compliance ID 9 requires strong typing and avoiding Any. The added test file explicitly imports
Any and uses it in the get_incoming_edges_side_effect function signature, demonstrating the
introduction of permissive typing in new code.

AGENTS.md
api/tests/unit_tests/core/workflow/graph_engine/graph_traversal/test_skip_propagator.py[3-3]
api/tests/unit_tests/core/workflow/graph_engine/graph_traversal/test_skip_propagator.py[205-210]

Agent Prompt
## Issue description
The new unit tests introduce `typing.Any` and use it in annotations, which violates the strong-typing guideline and reduces the effectiveness of type checking.

## Issue Context
These helper functions are only used as mock side effects and can be typed precisely (e.g., node IDs are `str`, and the functions return lists of `Edge`-like objects).
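
As an illustration of the precise typing suggested here, a mock side effect can take a `str` node ID and return a list of edge objects without any use of `Any`. This is a sketch under assumptions: the `Edge` dataclass and its fields are hypothetical stand-ins for the project's edge type, and `get_incoming_edges` is assumed to be the mocked method, inferred from the helper's name.

```python
from dataclasses import dataclass
from unittest.mock import MagicMock


@dataclass(frozen=True)
class Edge:
    # Hypothetical stand-in for the project's edge type.
    id: str
    tail: str
    head: str


# Test fixture: incoming edges keyed by the node they point to.
INCOMING_EDGES: dict[str, list[Edge]] = {
    "node_b": [Edge(id="e1", tail="node_a", head="node_b")],
}


def get_incoming_edges_side_effect(node_id: str) -> list[Edge]:
    """Typed replacement for an Any-annotated side effect."""
    return INCOMING_EDGES.get(node_id, [])


mock_graph = MagicMock()
mock_graph.get_incoming_edges.side_effect = get_incoming_edges_side_effect

incoming = mock_graph.get_incoming_edges("node_b")
```

With these annotations a type checker can flag a call that passes a non-string ID or treats the result as something other than a list of edges.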

## Fix Focus Areas
- api/tests/unit_tests/core/workflow/graph_engine/graph_traversal/test_skip_propagator.py[3-3]
- api/tests/unit_tests/core/workflow/graph_engine/graph_traversal/test_skip_propagator.py[205-223]


@@ -0,0 +1,308 @@
"""Unit tests for skip propagator."""

from typing import Any

Remediation Recommended

1. `Any` used in tests 📘 Rule Violation


Comment on lines 52 to +54
        with self._lock:
-            self._graph.nodes[node_id].state = NodeState.TAKEN
            self._ready_queue.put(node_id)
+            self._graph.nodes[node_id].state = NodeState.TAKEN

Action Required

2. Invalid node IDs queued 🐞 Bug

