[Bug Report] StorageMonitor exits prematurely when new futures are added to a running monitor #1449

@neilSchroeder

Description

Original report

I'm attempting to use lithops with cubed to process a very large zarr store in the cloud.

The graph for this has 5 steps. The middle three steps each have roughly 3300 jobs that lithops attempts to submit to AWS Lambda. The first couple of steps work fine and all jobs submit and run successfully, but on the third step lithops submits 500 jobs (the limit I've set), all of which finish successfully, and then it hangs and fails to submit any more jobs, repeatedly informing me that it's waiting for 2800 more jobs to run.

Is this a known issue? Is there something I can do to fix this?

I'm running Python 3.12 and am happy to provide more details as necessary.


Bug Report: StorageMonitor exits prematurely when new futures are added to a running monitor

When using a FunctionExecutor to run multiple sequential map() calls within the same executor context, the StorageMonitor can exit prematurely, leaving newly added futures unmonitored. This causes jobs to hang indefinitely at "Waiting for 0% of N function activations to complete".
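
For context, a minimal sketch of the usage pattern that triggers this for me (illustrative only -- `work` and the input sizes are placeholders, not my actual cubed workload):

```python
import lithops

def work(x):
    return x * 2  # placeholder task

fexec = lithops.FunctionExecutor()  # AWS Lambda compute + S3 storage per my config

# First map(): submits, runs, and completes fine
futures_1 = fexec.map(work, list(range(3300)))
fexec.get_result(futures_1)

# Second map() on the same executor: new futures are registered while the
# StorageMonitor from the previous step may be exiting -- this is where the
# run hangs at "Waiting for 0% of N function activations to complete"
futures_2 = fexec.map(work, list(range(3300)))
fexec.get_result(futures_2)
```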

Environment

  • Lithops: 3.6.2
  • Python: 3.12.11
  • Backend: AWS Lambda (us-west-2)
  • Storage: AWS S3

Analysis

This bug appears to be a race condition in lithops.monitor, involving the interaction between StorageMonitor.run(), Monitor._all_ready(), and Monitor.add_futures().

The race condition seems to go like this (a simplified sketch follows the list below):

  1. Job N completes: all futures in self.futures are done
  2. _all_ready() returns True: The while loop condition becomes false
  3. Job N+1 starts: add_futures() is called to add new futures to self.futures
  4. Monitor exits: The while loop exits before checking the newly added futures
  5. Result: The new futures from Job N+1 are never monitored
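
A simplified model of this pattern (paraphrased, not the actual lithops code -- the class and attribute names below only mimic lithops/monitor.py):

```python
import threading
import time

class SimplifiedStorageMonitor(threading.Thread):
    """Paraphrased model of the storage monitor loop, not the real implementation."""

    def __init__(self, futures):
        super().__init__(daemon=True)
        self.futures = list(futures)

    def add_futures(self, fs):
        # Called from the main thread when a new job (N+1) is submitted
        self.futures.extend(fs)

    def _all_ready(self):
        # True once every tracked future has completed
        return all(f["done"] for f in self.futures)

    def run(self):
        # The race: if _all_ready() is evaluated right after job N finishes
        # but before add_futures() has registered job N+1, this loop exits
        # and the new futures are never polled again.
        while not self._all_ready():
            time.sleep(0.1)  # poll call-status objects in storage
        print("Storage job monitor finished")  # analogous to the monitor.py:481 log line
```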

Debug Logs

2025-11-29 16:36:49,038 [DEBUG] monitor.py:147 -- ExecutorID a31010-20 - Pending: 0 - Running: 0 - Done: 3328
... [a bunch of invoker/wait/futures/etc. outputs here]...

# the very next monitor statements say:
2025-11-29 16:37:00,166 [DEBUG] monitor.py:147 -- ExecutorID a31010-20 - Pending: 3226 - Running: 54 - Done: 3376 # note that 3376 is more than the number pending so the monitor exits
2025-11-29 16:37:00,175 [DEBUG] monitor.py:481 -- ExecutorID a31010-20 - Storage job monitor finished
...
2025-11-29 16:37:39,674 [INFO] wait.py:105 -- ExecutorID a31010-20 - Waiting for 0% of 3280 function activations to complete
# ^^ Hangs here forever - monitor has exited, 3280 tasks never processed

Second potential bug:

JobMonitor.is_alive() calls self.monitor.is_alive() but does not return the result, and no other monitor class appears to define is_alive. The method therefore always returns None, which is falsy, and may be contributing to the issue above by causing incorrect decisions about whether to create or start a new monitor.
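
Paraphrasing the method (illustrative class names, not verbatim source), the problem appears to be simply a missing return:

```python
# As I read lithops.monitor.JobMonitor.is_alive() today (paraphrased):
class JobMonitorCurrent:
    def is_alive(self):
        self.monitor.is_alive()  # result discarded, so the method returns None

# What I believe was intended, assuming it should report whether the
# underlying monitor thread is still running:
class JobMonitorFixed:
    def is_alive(self):
        return self.monitor.is_alive()
```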
