Description
Original report
I'm attempting to use lithops with cubed to process a very large zarr store in the cloud.
The graph for this workload has 5 steps, and the middle three steps each have roughly 3300 jobs that lithops submits to AWS Lambda. The first couple of steps work fine and all jobs submit and run successfully, but on the third step it submits 500 jobs (the limit I've set), all of which finish successfully, and then it hangs: no further jobs are submitted, and it repeatedly reports that it is waiting for the remaining ~2800 jobs to run.
Is this a known issue? Is there something I can do to fix this?
I'm running on python 3.12 and am happy to provide more details as necessary.
Bug Report: StorageMonitor exits prematurely when new futures are added to a running monitor
When using a FunctionExecutor to run multiple sequential map() calls within the same executor context, the StorageMonitor can exit prematurely, leaving newly added futures unmonitored. This causes jobs to hang indefinitely at "Waiting for 0% of N function activations to complete".
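For reference, a minimal sketch of the usage pattern that appears to trigger this, assuming the standard FunctionExecutor API (the mapped function and job sizes here are placeholders, and whether the race actually fires depends on timing between the two jobs):

```python
import lithops

def work(x):
    return x * 2

fexec = lithops.FunctionExecutor()  # same executor reused across sequential jobs

# Job N: completes normally
futures_a = fexec.map(work, range(3328))
fexec.get_result(fs=futures_a)

# Job N+1: submitted on the same executor; if the storage monitor has already
# exited after Job N, these futures can sit at "Waiting for 0% ..." forever
futures_b = fexec.map(work, range(3328))
fexec.get_result(fs=futures_b)
```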
Environment
- Lithops: 3.6.2
- Python: 3.12.11
- Backend: AWS Lambda (us-west-2)
- Storage: AWS S3
Analysis
This bug appears to be a race condition in lithops.monitor involving the interaction of StorageMonitor.run, Monitor._all_ready(), and Monitor.add_futures().
The race condition seems to go like this (see the sketch below):
- Job N completes: all futures in self.futures are done
- _all_ready() returns True: the while loop condition becomes false
- Job N+1 starts: add_futures() is called to add new futures to self.futures
- Monitor exits: the while loop exits before checking the newly added futures
- Result: the new futures from Job N+1 are never monitored
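A simplified, paraphrased sketch of the suspected window (this is not the actual lithops source; the class structure and attribute names just follow the ones referenced above):

```python
import threading
import time

class StorageMonitorSketch(threading.Thread):
    """Paraphrase of the suspected loop structure in lithops.monitor."""

    def __init__(self):
        super().__init__(daemon=True)
        self.futures = []

    def add_futures(self, futures):
        # Called by the executor when Job N+1 is submitted
        self.futures.extend(futures)

    def _all_ready(self):
        return all(f.done for f in self.futures)

    def run(self):
        # (1) Job N's futures all complete, so _all_ready() returns True and the
        #     loop condition becomes false.
        while not self._all_ready():
            time.sleep(2)  # poll storage and update future states
        # (2) If add_futures() runs after that final _all_ready() check but before
        #     the thread returns, the monitor exits here with Job N+1's futures
        #     still sitting unobserved in self.futures.
```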
Debug Logs
2025-11-29 16:36:49,038 [DEBUG] monitor.py:147 -- ExecutorID a31010-20 - Pending: 0 - Running: 0 - Done: 3328
... [a bunch of invoker/wait/futures/etc. outputs here]...
# the very next monitor statements say:
2025-11-29 16:37:00,166 [DEBUG] monitor.py:147 -- ExecutorID a31010-20 - Pending: 3226 - Running: 54 - Done: 3376 # note that 3376 is more than the number pending, so the monitor exits
2025-11-29 16:37:00,175 [DEBUG] monitor.py:481 -- ExecutorID a31010-20 - Storage job monitor finished
...
2025-11-29 16:37:39,674 [INFO] wait.py:105 -- ExecutorID a31010-20 - Waiting for 0% of 3280 function activations to complete
# ^^ Hangs here forever - monitor has exited, 3280 tasks never processed
Second potential bug:
JobMonitor.is_alive() calls self.monitor.is_alive() but does not return anything, and no other monitor appears to define is_alive. The method therefore always returns None, which is falsy, and may be contributing to the issue above by causing incorrect decisions about whether to create or start a new monitor.
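A sketch of the suspected problem and a possible fix (paraphrased; the real method body in lithops/monitor.py may differ):

```python
# Suspected current behavior: the result of the underlying thread's is_alive()
# is discarded, so JobMonitor.is_alive() implicitly returns None (falsy).
def is_alive(self):
    self.monitor.is_alive()

# Possible fix: propagate the monitor thread's liveness, guarding against the
# case where no monitor has been created yet.
def is_alive(self):
    return self.monitor is not None and self.monitor.is_alive()
```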