Commit e529764

1 parent 4a9c00c commit e529764

1 file changed, +39 -0 lines changed

src/torchcodec/decoders/_core/VideoDecoder.h

Lines changed: 39 additions & 0 deletions
@@ -418,6 +418,45 @@ class VideoDecoder {
 // FRAME TENSOR ALLOCATION APIs
 // --------------------------------------------------------------------------

+// Note [Frame Tensor allocation and height and width]
+//
+// We always allocate [N]HWC tensors. The low-level decoding functions all
+// assume HWC tensors, since this is what FFmpeg natively handles. It's up to
+// the high-level decoding entry-points to permute that back to CHW, by calling
+// MaybePermuteHWC2CHW().
+//
+// Also, importantly, the way we figure out the height and width of the
+// output frame varies and depends on the decoding entry-point:
+// - In all cases, if the user requested specific height and width from the
+//   options, we honor that. Otherwise we fall into one of the categories below.
+// - In batch decoding APIs (e.g. getFramesAtIndices), we get height and width
+//   from the stream metadata, which itself got its value from the CodecContext,
+//   when the stream was added.
+// - In single-frame APIs:
+//   - On CPU we get height and width from the AVFrame.
+//   - On GPU, we get height and width from the metadata (same as batch APIs).
+//
+// These 2 strategies are encapsulated within
+// getHeightAndWidthFromOptionsOrMetadata() and
+// getHeightAndWidthFromOptionsOrAVFrame(). The reason they exist is to make it
+// very obvious which logic is used in which place, and they allow for `git
+// grep`ing.
+//
+// The source of truth for height and width really is the AVFrame: it's the
+// decoded output from FFmpeg. The info from the metadata (i.e. from the
+// CodecContext) may not be as accurate. However, the AVFrame is only available
+// late in the call stack, when the frame is decoded, while the CodecContext is
+// available early when a stream is added. This is why we use the CodecContext
+// for pre-allocating batched output tensors (we could pre-allocate those only
+// once we decode the first frame to get the info from the AVFrame, but that's
+// a more complex logic).
+//
+// Because the sources for height and width may disagree, we may end up with
+// conflicts: e.g. if we pre-allocate a batch output tensor based on the
+// metadata info, but the decoded AVFrame has a different height and width.
+// It is very important to check the height and width assumptions where the
+// tensors' memory is used/filled in order to avoid segfaults.
+
 std::tuple<int, int> getHeightAndWidthFromOptionsOrMetadata(
     const VideoDecoder::VideoStreamDecoderOptions& options,
     const VideoDecoder::StreamMetadata& metadata);
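
The precedence rules in the note (user options first, then either stream metadata or the decoded frame) and the final warning about mismatched sizes can be illustrated with a small standalone sketch. This is not the code from this commit: `SketchOptions`, `SketchMetadata`, `SketchFrame` and the three functions below are hypothetical stand-ins for `VideoStreamDecoderOptions`, `StreamMetadata`, FFmpeg's `AVFrame` and the real helpers, reduced to the height and width fields.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>
#include <tuple>

// Hypothetical, simplified stand-ins for the real torchcodec/FFmpeg types.
struct SketchOptions {
  std::optional<int> height; // set only if the user requested a size
  std::optional<int> width;
};
struct SketchMetadata {
  int height; // assumed to be copied from the CodecContext at stream setup
  int width;
};
struct SketchFrame {
  int height; // size of the frame that was actually decoded
  int width;
};

// Strategy used by batch APIs and single-frame GPU decoding:
// user options win, otherwise fall back to the stream metadata.
std::tuple<int, int> heightAndWidthFromOptionsOrMetadata(
    const SketchOptions& options,
    const SketchMetadata& metadata) {
  return std::make_tuple(
      options.height.value_or(metadata.height),
      options.width.value_or(metadata.width));
}

// Strategy used by single-frame CPU decoding:
// user options win, otherwise fall back to the decoded frame.
std::tuple<int, int> heightAndWidthFromOptionsOrFrame(
    const SketchOptions& options,
    const SketchFrame& frame) {
  return std::make_tuple(
      options.height.value_or(frame.height),
      options.width.value_or(frame.width));
}

// Guard to run before copying pixel data into a pre-allocated buffer.
// Throwing here turns a potential out-of-bounds write into a clear error.
void checkDimsMatchAllocation(
    int allocatedHeight,
    int allocatedWidth,
    int actualHeight,
    int actualWidth) {
  assert(allocatedHeight > 0 && allocatedWidth > 0);
  if (allocatedHeight != actualHeight || allocatedWidth != actualWidth) {
    throw std::runtime_error(
        "Allocated " + std::to_string(allocatedHeight) + "x" +
        std::to_string(allocatedWidth) + " but got " +
        std::to_string(actualHeight) + "x" + std::to_string(actualWidth));
  }
}
```

With metadata reporting 270x480 and a decoded frame of 272x480, the two strategies return different sizes when no options are set. In that case the guard throws, where an unchecked copy could write past the end of the pre-allocated buffer.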

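The first paragraph of the note says frames are filled in HWC order and permuted to CHW by the high-level entry points. As an illustration of that index mapping only, here is a sketch on a flat byte buffer. The function `permuteHWC2CHW` is a hypothetical helper, not the `MaybePermuteHWC2CHW()` from the codebase, and it copies the data, whereas a tensor permute typically returns a view without copying.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Copy an interleaved HWC buffer into planar CHW order.
// Hypothetical helper for illustration; it only shows the index arithmetic.
std::vector<std::uint8_t> permuteHWC2CHW(
    const std::vector<std::uint8_t>& hwc,
    int height,
    int width,
    int channels) {
  if (height < 0 || width < 0 || channels < 0) {
    throw std::invalid_argument("dimensions must be non-negative");
  }
  const std::size_t H = static_cast<std::size_t>(height);
  const std::size_t W = static_cast<std::size_t>(width);
  const std::size_t C = static_cast<std::size_t>(channels);
  // Same spirit as the note: verify the size assumption before touching memory.
  if (hwc.size() != H * W * C) {
    throw std::invalid_argument("buffer size does not match height*width*channels");
  }

  std::vector<std::uint8_t> chw(hwc.size());
  for (std::size_t h = 0; h < H; ++h) {
    for (std::size_t w = 0; w < W; ++w) {
      for (std::size_t c = 0; c < C; ++c) {
        const std::size_t src = (h * W + w) * C + c; // HWC offset
        const std::size_t dst = (c * H + h) * W + w; // CHW offset
        assert(dst < chw.size());
        chw[dst] = hwc[src];
      }
    }
  }
  return chw;
}
```

For a 1x2 RGB image stored as `R0 G0 B0 R1 G1 B1`, the result is `R0 R1 G0 G1 B0 B1`.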