[FIX] reduce peak memory usage during single-cell extraction#384

Open
sophiamaedler wants to merge 11 commits into main from improve_mem_extraction
Conversation

@sophiamaedler (Collaborator) commented Mar 9, 2026

Reported by @vvarlamova.

  • Implemented automatic flushing and gc.collect() after N batches to clean up caches.
  • Capped the maximum number of in-flight results kept (i.e. computed results that have not yet been written to disk).

Copilot AI review requested due to automatic review settings March 9, 2026 06:37

Copilot AI left a comment


Pull request overview

Reduces peak memory usage during multiprocessing single-cell extraction by streaming extraction results directly from the worker pool iterator instead of materializing the full result list in memory.

Changes:

  • Replace list(tqdm(pool.imap(...))) with direct iteration over tqdm(pool.imap(...)) to avoid accumulating all batch results at once.
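The pattern can be sketched with a toy pool (a thread-backed multiprocessing.dummy.Pool stands in for the real process pool, and extract_batch / the write step are placeholders, not the project's actual functions):

```python
from multiprocessing.dummy import Pool  # thread-backed stand-in for the real worker pool

def extract_batch(batch_id):
    # Stand-in for the per-batch single-cell extraction.
    return [batch_id * 10 + i for i in range(3)]

written = []

with Pool(4) as pool:
    # Before: results = list(tqdm(pool.imap(extract_batch, range(100))))
    # holds every batch result in memory until the write loop starts.
    # After: iterate the imap generator directly, so each batch can be
    # written and released before the next one is fetched.
    for batch_result in pool.imap(extract_batch, range(100)):
        written.append(len(batch_result))  # stand-in for the HDF5 write

print(sum(written))  # 300 cells written, never all resident at once
```

Peak memory then scales with the number of batches in flight rather than the total number of batches.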


@sophiamaedler (Collaborator, Author)

Did a quick benchmark on a small dataset, which shows improved memory usage.
[image: memory-usage benchmark on the small dataset]

Will need to repeat on a larger dataset.

@sophiamaedler (Collaborator, Author)

sophiamaedler commented Mar 14, 2026

I ran this on a larger example with the current implementation of the code vs. the implementation in main:
extraction changed from list(tqdm(pool.imap(...))) to streamed tqdm(pool.imap_unordered(...)).

[image: benchmark on the larger example]

This has improved run time but has not reduced memory usage during the extraction process.

We also see continuously increasing main-process memory requirements. The working theory is that:

  • Workers produce large batch results in parallel.
  • Main process is the single writer (_write_to_hdf5 under a lock).
  • If producer throughput > writer throughput, pending results accumulate in parent-side buffers/queues.
  • Parent RSS rises over time even without a classic leak.
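The accumulation this theory predicts can be reproduced with a toy producer/writer simulation (the rates are invented for illustration, not measured):

```python
# Toy model: workers complete `produce_rate` batches per tick, the single
# writer drains `write_rate` batches per tick. Pending results accumulate
# whenever producers outpace the writer -- no leak required.
def pending_over_time(produce_rate, write_rate, ticks):
    pending, history = 0, []
    for _ in range(ticks):
        pending += produce_rate              # batches returned by workers
        pending -= min(pending, write_rate)  # batches written to disk
        history.append(pending)
    return history

growth = pending_over_time(produce_rate=3, write_rate=1, ticks=5)
print(growth)  # [2, 4, 6, 8, 10] -- buffer grows linearly, mirroring rising parent RSS

balanced = pending_over_time(produce_rate=1, write_rate=1, ticks=5)
print(balanced)  # [0, 0, 0, 0, 0] -- no growth when the writer keeps up
```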

@sophiamaedler (Collaborator, Author)

I tested this theory by monitoring the time spent waiting for worker results versus the time spent writing each batch to HDF5 in the main process.

The results strongly support a writer backpressure bottleneck:

| batches | avg_wait_for_result_s | avg_write_s | fast_fetch_fraction | write_wait_ratio |
|---------|-----------------------|-------------|---------------------|------------------|
| 25      | 0.1445                | 0.9259      | 0.92                | 6.41             |
| 50      | 0.0729                | 0.8739      | 0.96                | 11.99            |
| 75      | 0.0493                | 0.8597      | 0.97                | 17.44            |
| 100     | 0.0374                | 0.8602      | 0.98                | 23.00            |
| 125     | 0.0302                | 0.8566      | 0.98                | 28.36            |

Interpretation:

  • avg_wait_for_result_s is low and keeps decreasing, so the main process rarely waits for workers.
  • avg_write_s stays much higher, so writing is the slow stage.
  • fast_fetch_fraction is very high (0.92-0.98), meaning results are usually immediately available.
  • write_wait_ratio increases over time, indicating the writer is increasingly the bottleneck relative to result availability.

Conclusion: workers are producing faster than the single-writer path can drain, which is consistent with the observed memory growth from queued/in-flight batch results.
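The measurement itself can be sketched like this (toy workloads: a near-instant worker and a 10 ms writer stand in for the real extraction and the _write_to_hdf5 call; the real instrumentation times pool.imap_unordered fetches the same way):

```python
import time
from multiprocessing.dummy import Pool  # thread-backed pool for illustration

def fast_worker(i):
    return i  # workers finish almost immediately in this toy setup

def slow_write(result):
    time.sleep(0.01)  # stand-in for the HDF5 write under the lock

wait_s = write_s = 0.0
with Pool(4) as pool:
    it = pool.imap_unordered(fast_worker, range(20))
    while True:
        t0 = time.perf_counter()
        try:
            result = next(it)        # time spent blocked waiting on workers
        except StopIteration:
            break
        t1 = time.perf_counter()
        slow_write(result)           # time spent in the single writer
        t2 = time.perf_counter()
        wait_s += t1 - t0
        write_s += t2 - t1

print(f"write_wait_ratio = {write_s / max(wait_s, 1e-9):.1f}")
```

A ratio well above 1 means results pile up faster than the writer drains them, exactly the pattern in the table above.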

@sophiamaedler (Collaborator, Author)

I instrumented HDF5 writing to break down per-batch time into:

  • file open + dataset lookup (avg_open_lookup_s)
  • actual write loop (avg_write_loop_s)
  • file close (avg_close_s)

Early results:

| batches | avg_open_lookup_s | avg_write_loop_s | avg_close_s | avg_total_s |
|---------|-------------------|------------------|-------------|-------------|
| 25      | 0.0011            | 1.0393           | 0.0206      | 1.0610      |
| 50      | 0.0011            | 1.0921           | 0.0165      | 1.1096      |
| 75      | 0.0009            | 1.0113           | 0.0160      | 1.0282      |
| 100     | 0.0009            | 0.9927           | 0.0160      | 1.0096      |
| 125     | 0.0017            | 1.0007           | 0.0152      | 1.0176      |
| 150     | 0.0016            | 1.0045           | 0.0156      | 1.0217      |

Interpretation:

  • Open/lookup is ~1 ms per batch.
  • Close is ~15-21 ms per batch.
  • The write loop is ~1.0 s per batch and dominates total time (>95%).

Conclusion:

  • The bottleneck is actual HDF5 dataset writes/compression, not file open/lookup overhead.
  • Increasing batch size to amortize open cost is unlikely to significantly improve throughput.
  • Optimization should target write-loop cost (compression/chunking/write pattern).
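One way to get a per-phase breakdown like this is a small phase-timer accumulator. This is a sketch with stdlib timers and a sleep standing in for the write loop; in the real code the three phases would wrap the h5py file open plus dataset lookup, the dataset write loop, and the file close:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

phase_totals = defaultdict(float)

@contextmanager
def timed(phase):
    # Accumulate wall time per named phase of the write path.
    t0 = time.perf_counter()
    try:
        yield
    finally:
        phase_totals[phase] += time.perf_counter() - t0

def write_batch(batch):
    with timed("open_lookup"):
        pass  # h5py file open + dataset lookup would go here
    with timed("write_loop"):
        time.sleep(0.005)  # stand-in for dataset writes/compression
    with timed("close"):
        pass  # file close would go here

for batch in range(10):
    write_batch(batch)

total = sum(phase_totals.values())
for phase, t in phase_totals.items():
    print(f"{phase}: {t:.4f}s ({100 * t / total:.0f}%)")
```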

@sophiamaedler (Collaborator, Author)

sophiamaedler commented Mar 15, 2026

Limit the number of in-flight batches we keep (i.e. completed results waiting to be written to disk):

  • Limits the number of in-flight returned result batches (pending_results) to a maximum N.
  • Submits a new batch task only when one completed result has been consumed/written.
  • Uses an auto-calculated N by default, based on a RAM budget targeting 85% utilization (or a user-specified value).
[image: memory usage for different N values]

Setting lower N values does indeed cap main-process RSS during the cell-extraction step. Memory still spikes significantly at the beginning, when mapping input images and masks to memory-mapped temp arrays.
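The scheme can be sketched as bounded submission over apply_async. Names like pending_results and max_inflight follow the description above, but the code is illustrative: a thread-backed pool stands in for the real one, and the RAM-budget auto-calculation of N is omitted:

```python
from collections import deque
from multiprocessing.dummy import Pool  # thread-backed pool for illustration

def extract_batch(batch_id):
    return [batch_id] * 3  # stand-in for single-cell extraction

def run_with_backpressure(batches, max_inflight=4):
    written, peak_pending = [], 0
    with Pool(4) as pool:
        pending_results = deque()
        for batch in batches:
            pending_results.append(pool.apply_async(extract_batch, (batch,)))
            peak_pending = max(peak_pending, len(pending_results))
            # Backpressure: once the cap is reached, block on the oldest
            # result and write it before submitting more work.
            if len(pending_results) >= max_inflight:
                written.append(pending_results.popleft().get())  # write step
        while pending_results:  # drain the tail
            written.append(pending_results.popleft().get())
    return written, peak_pending

results, peak = run_with_backpressure(range(50), max_inflight=4)
print(len(results), peak)  # 50 4
```

Because submission only proceeds after a write completes, pending results can never exceed N, which is what caps the parent's RSS regardless of how fast the workers produce.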


Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.


