| 1 | +# Copyright (c) Meta Platforms, Inc. and affiliates. |
| 2 | +# All rights reserved. |
| 3 | +# |
| 4 | +# This source code is licensed under the BSD-style license found in the |
| 5 | +# LICENSE file in the root directory of this source tree. |
| 6 | + |
| 7 | +""" |
| 8 | +==================================== |
| 9 | +Performance Tips and Best Practices |
| 10 | +==================================== |
| 11 | + |
| 12 | +This tutorial consolidates performance optimization techniques for video |
| 13 | +decoding with TorchCodec. Learn when and how to apply various strategies |
| 14 | +to increase performance. |
| 15 | +""" |
| 16 | + |
| 17 | + |
| 18 | +# %% |
| 19 | +# Overview |
| 20 | +# -------- |
| 21 | +# |
| 22 | +# When decoding videos with TorchCodec, several techniques can significantly |
| 23 | +# improve performance depending on your use case. This guide covers: |
| 24 | +# |
| 25 | +# 1. **Batch APIs** - Decode multiple frames at once |
| 26 | +# 2. **Approximate Mode & Keyframe Mappings** - Trade accuracy for speed |
| 27 | +# 3. **Multi-threading** - Parallelize decoding across videos or chunks |
| 28 | +# 4. **CUDA Acceleration (BETA)** - Use GPU decoding for supported formats |
| 29 | +# |
| 30 | +# We'll explore each technique and when to use it. |
| 31 | + |
| 32 | +# %% |
| 33 | +# 1. Use Batch APIs When Possible |
| 34 | +# -------------------------------- |
| 35 | +# |
| 36 | +# If you need to decode multiple frames, the batch methods are faster than |
| 37 | +# decoding frames one at a time. TorchCodec's batch APIs reduce overhead and can leverage internal optimizations. |
| 38 | +# |
| 39 | +# **Key Methods:** |
| 40 | +# |
| 41 | +# - :meth:`~torchcodec.decoders.VideoDecoder.get_frames_at` for specific indices |
| 42 | +# - :meth:`~torchcodec.decoders.VideoDecoder.get_frames_in_range` for ranges |
| 43 | +# - :meth:`~torchcodec.decoders.VideoDecoder.get_frames_played_at` for timestamps |
| 44 | +# - :meth:`~torchcodec.decoders.VideoDecoder.get_frames_played_in_range` for time ranges |
| 45 | +# |
| 46 | +# **When to use:** |
| 47 | +# |
| 48 | +# - Decoding multiple frames |
| 49 | + |
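A minimal sketch of the batch pattern described above, assuming a local video file whose path the caller supplies. The helper names `evenly_spaced_indices` and `decode_batch` are hypothetical and exist only for illustration; `VideoDecoder` and `get_frames_at` are the TorchCodec APIs listed in this section.

```python
def evenly_spaced_indices(num_frames: int, num_samples: int) -> list[int]:
    """Return up to num_samples frame indices spread evenly over [0, num_frames)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, num_frames)
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


def decode_batch(video_path: str, num_samples: int = 8):
    """Decode num_samples evenly spaced frames with a single batched call."""
    # Imported lazily so the index helper works without TorchCodec installed.
    from torchcodec.decoders import VideoDecoder

    decoder = VideoDecoder(video_path)
    indices = evenly_spaced_indices(len(decoder), num_samples)
    # One batched call instead of a Python loop over decoder[i].
    return decoder.get_frames_at(indices=indices)
```

Calling `decode_batch("my_video.mp4")` (the path is a placeholder) should return a batch whose `data` tensor stacks the requested frames.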
| 50 | +# %% |
| 51 | +# .. note:: |
| 52 | +# |
| 53 | +# For complete examples with runnable code demonstrating batch decoding, |
| 54 | +# iteration, and frame retrieval, see: |
| 55 | +# |
| 56 | +# - :ref:`sphx_glr_generated_examples_decoding_basic_example.py` |
| 57 | + |
| 58 | +# %% |
| 59 | +# 2. Approximate Mode & Keyframe Mappings |
| 60 | +# ---------------------------------------- |
| 61 | +# |
| 62 | +# By default, TorchCodec uses ``seek_mode="exact"``, which performs a scan when |
| 63 | +# the decoder is created to build an accurate internal index of frames. This |
| 64 | +# ensures frame-accurate seeking but takes longer for decoder initialization, |
| 65 | +# especially on long videos. |
| 66 | + |
| 67 | +# %% |
| 68 | +# **Approximate Mode** |
| 69 | +# ~~~~~~~~~~~~~~~~~~~~ |
| 70 | +# |
| 71 | +# Setting ``seek_mode="approximate"`` skips the initial scan and relies on the |
| 72 | +# video file's metadata headers. This dramatically speeds up |
| 73 | +# :class:`~torchcodec.decoders.VideoDecoder` creation, particularly for long |
| 74 | +# videos, but may result in slightly less accurate seeking in some cases. |
| 75 | +# |
| 76 | +# |
| 77 | +# **Which mode should you use?** |
| 78 | +# |
| 79 | +# - If you care about exactness of frame seeking, use ``"exact"``. |
| 80 | +# - If you can sacrifice exactness of seeking for speed, which is usually the case when doing clip sampling, use ``"approximate"``. |
| 81 | +# - If your videos don't have a variable framerate and their metadata is correct, then ``"approximate"`` mode is a net win: it will be just as accurate as ``"exact"`` mode while still being significantly faster. |
| 82 | +# - If your video is small enough and you're decoding a lot of frames, ``"exact"`` mode may actually be faster. |
| 83 | + |
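One way to check which mode wins for a given video is to time decoder creation in both modes. The sketch below assumes a local video path; `time_call` and `compare_seek_modes` are hypothetical helper names, and the timings will depend on the file and the machine.

```python
import time


def time_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def compare_seek_modes(video_path: str) -> dict[str, float]:
    """Measure VideoDecoder creation time for each seek mode."""
    # Imported lazily so time_call stays usable without TorchCodec installed.
    from torchcodec.decoders import VideoDecoder

    timings = {}
    for mode in ("exact", "approximate"):
        _, elapsed = time_call(VideoDecoder, video_path, seek_mode=mode)
        timings[mode] = elapsed
    return timings
```

This only measures creation time. For a fair comparison on your workload, also time the frame retrieval calls you actually make.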
| 84 | +# %% |
| 85 | +# **Custom Frame Mappings** |
| 86 | +# ~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 87 | +# |
| 88 | +# For advanced use cases, you can pre-compute a custom mapping between desired |
| 89 | +# frame indices and actual keyframe locations. This speeds up :class:`~torchcodec.decoders.VideoDecoder` |
| 90 | +# instantiation while maintaining the frame-seeking accuracy of ``seek_mode="exact"``. |
| 91 | +# |
| 92 | +# **When to use:** |
| 93 | +# |
| 94 | +# - Frame accuracy is critical, so approximate mode cannot be used |
| 95 | +# - Videos can be preprocessed once and then decoded many times |
| 96 | +# |
| 97 | +# **Performance impact:** Enables consistent, predictable performance for repeated |
| 98 | +# random access without the overhead of exact mode's scanning. |
| 99 | + |
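A rough sketch of the preprocess-once workflow, assuming `ffprobe` is on the `PATH`. It also assumes, and you should check this against the custom frame mappings example, that the decoder accepts ffprobe's JSON frame listing through a `custom_frame_mappings` argument together with `seek_mode="custom_frame_mappings"`. The helper names are hypothetical.

```python
import subprocess


def build_ffprobe_command(video_path: str, stream_index: int = 0) -> list[str]:
    """Build an ffprobe command that lists pts, duration and key_frame per frame as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-i", video_path,
        "-select_streams", str(stream_index),
        "-show_frames",
        "-show_entries", "frame=pts,duration,key_frame",
        "-of", "json",
    ]


def decoder_from_mappings(video_path: str, stream_index: int = 0):
    """Create a decoder from frame mappings produced by ffprobe."""
    # Imported lazily so the command builder works without TorchCodec installed.
    from torchcodec.decoders import VideoDecoder

    result = subprocess.run(
        build_ffprobe_command(video_path, stream_index),
        capture_output=True, text=True, check=True,
    )
    # Assumption: the JSON text can be passed directly as the mapping.
    return VideoDecoder(
        video_path,
        stream_index=stream_index,
        seek_mode="custom_frame_mappings",
        custom_frame_mappings=result.stdout,
    )
```

In practice the ffprobe output would be saved to disk once and reused across runs, since re-running ffprobe before every decode would cancel out the benefit.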
| 100 | +# %% |
| 101 | +# .. note:: |
| 102 | +# |
| 103 | +# For complete benchmarks showing actual speedup numbers, accuracy comparisons, |
| 104 | +# and implementation examples, see: |
| 105 | +# |
| 106 | +# - :ref:`sphx_glr_generated_examples_decoding_approximate_mode.py` |
| 107 | +# |
| 108 | +# - :ref:`sphx_glr_generated_examples_decoding_custom_frame_mappings.py` |
| 109 | + |
| 110 | +# %% |
| 111 | +# 3. Multi-threading for Parallel Decoding |
| 112 | +# ----------------------------------------- |
| 113 | +# |
| 114 | +# When decoding a large number of frames from a single video, several parallelization strategies can speed up decoding: |
| 115 | +# |
| 116 | +# - FFmpeg-based parallelism: Using FFmpeg's internal threading capabilities |
| 117 | +# - Multiprocessing: Distributing work across multiple processes |
| 118 | +# - Multithreading: Using multiple threads within a single process |
| 119 | + |
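The multithreading strategy can be sketched as follows: split the frame indices into contiguous chunks and let each thread decode one chunk with its own decoder. This is an illustrative sketch, not the implementation used in the parallel decoding example; `split_into_chunks`, `decode_chunk` and `decode_in_parallel` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(indices: list[int], num_chunks: int) -> list[list[int]]:
    """Split indices into at most num_chunks contiguous chunks of near-equal size."""
    if not indices or num_chunks < 1:
        return []
    chunk_size = -(-len(indices) // num_chunks)  # ceiling division
    return [indices[i:i + chunk_size] for i in range(0, len(indices), chunk_size)]


def decode_chunk(video_path: str, chunk: list[int]):
    """Decode one chunk of frame indices with a decoder owned by the calling thread."""
    # Imported lazily so the chunking helper works without TorchCodec installed.
    from torchcodec.decoders import VideoDecoder

    decoder = VideoDecoder(video_path)
    return decoder.get_frames_at(indices=chunk)


def decode_in_parallel(video_path, indices, num_workers=4, decode_fn=decode_chunk):
    """Decode chunks concurrently; results come back in chunk order."""
    chunks = split_into_chunks(indices, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(lambda chunk: decode_fn(video_path, chunk), chunks))
```

Creating one decoder per chunk avoids sharing decoder state across threads, at the cost of paying the decoder creation time once per chunk.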
| 120 | +# %% |
| 121 | +# .. note:: |
| 122 | +# |
| 123 | +# For complete examples comparing sequential, FFmpeg-based, |
| 124 | +# multi-process, and multi-threaded approaches, see: |
| 125 | +# |
| 126 | +# - :ref:`sphx_glr_generated_examples_decoding_parallel_decoding.py` |
| 127 | + |
| 128 | +# %% |
| 129 | +# 4. BETA: CUDA Acceleration |
| 130 | +# --------------------------- |
| 131 | +# |
| 132 | +# TorchCodec supports GPU-accelerated decoding using NVIDIA's hardware decoder |
| 133 | +# (NVDEC) on supported hardware. This keeps decoded tensors in GPU memory, |
| 134 | +# avoiding expensive CPU-GPU transfers for downstream GPU operations. |
| 135 | +# |
| 136 | +# **When to use:** |
| 137 | +# |
| 138 | +# - Decoding high-resolution videos |
| 139 | +# - Decoding large batches of videos that saturate the CPU |
| 140 | +# - GPU-intensive pipelines with transforms like scaling and cropping |
| 141 | +# - CPU is saturated and you want to free it up for other work |
| 142 | +# |
| 143 | +# **When NOT to use:** |
| 144 | +# |
| 145 | +# - You need bit-exact results |
| 146 | +# - Decoding low-resolution videos, where the PCIe transfer latency outweighs the decoding speedup |
| 147 | +# - GPU is already busy and CPU is idle |
| 148 | +# |
| 149 | +# **Performance impact:** CUDA decoding can significantly outperform CPU decoding, |
| 150 | +# especially for high-resolution videos and when combined with GPU-based transforms. |
| 151 | +# Actual speedup varies by hardware, resolution, and codec. |
| 152 | + |
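A small sketch of how the guidance above might translate into a device choice. `pick_device` and `make_decoder` are hypothetical helpers. The sketch assumes that passing `device="cuda"` to `VideoDecoder` selects GPU decoding, and that `torch.cuda.is_available()` is an adequate check; a TorchCodec build without CUDA support would still fail at decoder creation.

```python
def pick_device(cuda_available: bool, need_bit_exact: bool) -> str:
    """Fall back to CPU when CUDA is unavailable or bit-exact results are required."""
    if need_bit_exact or not cuda_available:
        return "cpu"
    return "cuda"


def make_decoder(video_path: str, need_bit_exact: bool = False):
    """Create a VideoDecoder on the device chosen by pick_device."""
    # Imported lazily so pick_device works without torch or TorchCodec installed.
    import torch
    from torchcodec.decoders import VideoDecoder

    device = pick_device(torch.cuda.is_available(), need_bit_exact)
    return VideoDecoder(video_path, device=device)
```

The helper deliberately ignores resolution and GPU load. Those factors from the lists above are workload specific and are best settled by benchmarking.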
| 153 | +# %% |
| 154 | +# .. note:: |
| 155 | +# |
| 156 | +# For installation instructions, detailed examples, and visual comparisons |
| 157 | +# between CPU and CUDA decoding, see: |
| 158 | +# |
| 159 | +# - :ref:`sphx_glr_generated_examples_decoding_basic_cuda_example.py` |