Skip to content

fix(python_strategy): clean up stale RUNNING_STRATEGIES and surface scheduled-start failures#1297

Open
RobinAThomas wants to merge 2 commits intomarketcalls:mainfrom
RobinAThomas:fix/stale-running-strategies-cleanup
Open

fix(python_strategy): clean up stale RUNNING_STRATEGIES and surface scheduled-start failures#1297
RobinAThomas wants to merge 2 commits intomarketcalls:mainfrom
RobinAThomas:fix/stale-running-strategies-cleanup

Conversation

@RobinAThomas
Copy link
Copy Markdown
Contributor

@RobinAThomas RobinAThomas commented Apr 24, 2026

Summary

Two related bugs cause scheduled Python strategies to silently fail to start the day after a successful run.

Bug 1 — start_strategy_process does not validate stale in-memory entries

if strategy_id in RUNNING_STRATEGIES:
    return False, "Strategy already running"

The RUNNING_STRATEGIES dict is only cleaned by stop_strategy_process(). Any code path that bypasses that — natural exit, internal squareoff, scheduled stop that did not fire, container hiccup — leaves a stale entry that blocks every future start until the worker is restarted.

Bug 2 — scheduled_start_strategy discards the failure

logger.info(f"... all checks passed, starting")
start_strategy_process(strategy_id)   # return value thrown away

When Bug 1 fires, there is no log line, no last_error on the config, no SSE broadcast. No log file is ever created (start_strategy_process bails before opening one), making this very hard to diagnose. From the user side it looks like the scheduler simply did not fire.

Reproduction

  1. Schedule a strategy to start at 09:00 and stop at 16:00.
  2. Let the strategy self-terminate before 16:00 (e.g. internal squareoff that calls sys.exit) so the child process exits without going through stop_strategy_process().
  3. Next morning at 09:00 the scheduler ticks — strategy silently does not start, no new log file is created, and the strategy card still shows yesterday's stop time.

Fix

  • start_strategy_process: when a RUNNING_STRATEGIES entry already exists, validate liveness via Popen.poll() / psutil.is_running() / psutil.pid_exists() depending on the stored object type. If the process is dead, log a warning, close the log handle, drop the stale entry, and fall through to start a fresh one. If the process is alive, behaviour is unchanged.
  • scheduled_start_strategy: capture (success, message) and on failure log an error, set last_error on the config, and broadcast an error status update via SSE so the UI reflects reality.

Both changes are surgical (50 + / 2 −) and live entirely inside the existing PROCESS_LOCK critical section, so concurrency semantics are unchanged.

Testing

  • Verified hot-patched on a production deployment (v2.0.0.2) that hit this exact bug today; restart was clean and the patch loads without error.
  • python -m py_compile blueprints/python_strategy.py passes.
  • Existing callers of start_strategy_process (stop_strategy_process, the manual start route at line 1653, the post-login restoration paths at 2589/2654) are unaffected — they all pass through the same gateway and now also benefit from the stale-entry cleanup.

Summary by cubic

Fixes silent no-starts for scheduled Python strategies. Cleans stale RUNNING_STRATEGIES entries and surfaces scheduled-start failures in logs, the UI, and persisted config.

  • Bug Fixes
    • start_strategy_process: validate existing process (poll/is_running/pid_exists); if dead, warn, close the log handle, drop the entry, then start fresh.
    • scheduled_start_strategy: handle (success, message); on failure log an error, set and persist last_error on the config, and broadcast an SSE error update.

Written for commit e534957. Summary will update on new commits.

…cheduled-start failures

Two related bugs caused scheduled strategies to silently fail to start
the day after a successful run:

1. start_strategy_process() rejects with 'Strategy already running' as
   soon as strategy_id is in the in-memory RUNNING_STRATEGIES dict,
   without checking whether the tracked process is actually alive.
   The dict is only cleaned by stop_strategy_process(); any path that
   bypasses that (natural exit, scheduler not firing scheduled_stop,
   container hiccup) leaves a stale entry that blocks all future starts
   until the worker is restarted.

2. scheduled_start_strategy() discards the (success, message) return
   value from start_strategy_process(), so when (1) fires there is no
   log line, no last_error on the config, no SSE broadcast - the user
   sees the scheduler tick happen and then... nothing. No log file is
   ever created (start_strategy_process bails before opening one),
   making this very hard to diagnose.

Fix:

- start_strategy_process(): when a RUNNING_STRATEGIES entry already
  exists, validate liveness via Popen.poll() / psutil.is_running() /
  psutil.pid_exists() depending on stored type. If the process is
  dead, log a warning, close the log handle, drop the stale entry and
  fall through to start a fresh one.

- scheduled_start_strategy(): capture the result and on failure log
  an error, set last_error on the config, and broadcast an 'error'
  status update via SSE so the UI reflects reality.

Reproduction:
- Start a strategy via the scheduler at 09:00.
- Let it self-terminate (e.g. via internal squareoff at 16:00) so the
  child process exits without going through stop_strategy_process().
- Next morning at 09:00, scheduler ticks - silent failure, no log
  file created, last run shows yesterday's stop time.

After the fix the next-day start either reuses the slot cleanly, or
if it genuinely fails for another reason the failure is now visible
in /app/log and on the strategy card.
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="blueprints/python_strategy.py">

<violation number="1" location="blueprints/python_strategy.py:1105">
P2: Scheduled auto-start incorrectly reports idempotent "already running" results as errors, causing false error state/SSE updates.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread blueprints/python_strategy.py Outdated
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
@marketcalls
Copy link
Copy Markdown
Owner

@cubic-dev-ai what is the current status?

@cubic-dev-ai
Copy link
Copy Markdown
Contributor

cubic-dev-ai Bot commented Apr 24, 2026

@cubic-dev-ai what is the current status?

@marketcalls The PR is currently in a ready state for final review.

Here is where we stand:

  • Fixes Applied: The core logic to prevent silent start failures is implemented. I've confirmed that start_strategy_process now checks for process liveness (using poll(), is_running(), or pid_exists) instead of blindly trusting the RUNNING_STRATEGIES dictionary.
  • Improved Visibility: Failures in scheduled_start_strategy are now properly caught, logged, and broadcasted to the UI via SSE, resolving the "silent failure" issue described in the PR.
  • Applied Suggestions: The latest commit includes a refinement I previously suggested for blueprints/python_strategy.py.

The changes are surgical and contained within the PROCESS_LOCK, preserving existing thread-safety. You can now proceed with merging if the manual tests on the production-like environment (v2.0.0.2) remain stable.

For more details on how I handle these reviews, you can check Cubic's documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants