Update inference_cli.py Long videos and low RAM refactor #462
Conversation
Will-I4M
commented
Dec 31, 2025
- Multi-GPU streaming refactor: added --recycle_workers_every and changed dispatch to process per-GPU sub-segments in cycles, respawning workers each cycle to release allocator memory.
- Spill-based stitching: introduced disk spill per chunk and incremental stitcher that blends overlaps while streaming to a persistent writer (or PNGs) without holding full segments in RAM.
- Output controls: ffmpeg writer now supports configurable codec/pix_fmt/bitrate/CRF/preset and higher bit depths; spill files stored in uint8/uint16 based on --output_bitdepth.
- Resilience: added OOM retry with cleanup/backoff for chunk processing; disk saves use atomic writes with retries.
- Diagnostics: RAM telemetry per chunk/phase, per-PID monitors for workers, and detailed logging around chunk processing, saving, and stitching.
- Input handling: explicit freeing of input tensors (new_frames) between streamed chunks to reduce transient RAM.
- CLI additions: --spill_dir, --output_bitdepth, --video_codec, --video_pix_fmt, --video_crf, --video_bitrate, --video_preset, --recycle_workers_every.
- Minor safety: pre-parse CUDA device; auto-enable streaming for long videos when --chunk_size is unset to avoid full loads.
Thank you for the PR @Will-I4M - there are quite a lot of changes. Did you test it with single and multi GPU? Including on long videos? Is there anything in the code update I need to consider when doing the review? I'll try to get to it next week, I'm unavailable this week. Thanks again.
Indeed, my apologies for pushing several feature changes all at once. Perhaps it would be wiser to split inference_cli into two parts, keeping the common sections shared and specializing one version for low RAM usage. In this PR, even without --debug, the code prints several messages about RAM usage, which isn't necessarily desirable for everyone, even though I find it very useful given SeedVR2's high RAM and VRAM requirements.

I tested it in single-GPU and multi-GPU configurations. My main focus in this PR was RAM usage. There is generally better resilience to VRAM-related OOMs, but that's not the biggest change. In the initial version, all chunks are kept in RAM after processing until the final step; that's the biggest change, because in this version only what's necessary (stitching, etc.) is kept in RAM. Also, in the initial version, in multi-GPU mode, if a single process crashed during computation (out of order...), the others would still continue unnecessarily until the end. The code is now more resilient and retries after fairly aggressive memory-cleanup attempts.

I tested several strategies to free up RAM, all of which proved unsuccessful until this latest version: the system no longer swaps unnecessarily. I was able to process several hours of video using the 7B model (fp16 / 4K / 21 frames of context) on 4x3090 in a reasonable time, something I couldn't do before (even for 25-minute durations), so I think this work can be useful to the SeedVR2 user community.
Pardon my ignorance, is this something that would, or could, benefit the ComfyUI version as well? |
Hi. BTW, your file still has the bug where "--prepend_frames" doesn't work.
I have the same problem as above with --video_backend ffmpeg hanging. |