@@ -418,6 +418,45 @@ class VideoDecoder {
418418// FRAME TENSOR ALLOCATION APIs
419419// --------------------------------------------------------------------------
420420
421+ // Note [Frame Tensor allocation and height and width]
422+ //
423+ // We always allocate [N]HWC tensors. The low-level decoding functions all
424+ // assume HWC tensors, since this is what FFmpeg natively handles. It's up to
425+ // the high-level decoding entry-points to permute that back to CHW, by calling
426+ // MaybePermuteHWC2CHW().
427+ //
428+ // Also, importantly, the way we figure out the height and width of the
429+ // output frame varies and depends on the decoding entry-point:
430+ // - In all cases, if the user requested specific height and width from the
431+ // options, we honor that. Otherwise we fall into one of the categories below.
432+ // - In Batch decoding APIs (e.g. getFramesAtIndices), we get height and width
433+ // from the stream metadata, which itself got its value from the CodecContext,
434+ // when the stream was added.
435+ // - In single-frame APIs:
436+ // - On CPU we get height and width from the AVFrame.
437+ // - On GPU, we get height and width from the metadata (same as batch APIs)
438+ //
439+ // These 2 strategies are encapsulated within
440+ // getHeightAndWidthFromOptionsOrMetadata() and
441+ // getHeightAndWidthFromOptionsOrAVFrame(). The reason they exist is to make it
442+ // very obvious which logic is used in which place, and they allow for `git
443+ // grep`ing.
444+ //
445+ // The source of truth for height and width really is the AVFrame: it's the
446+ // decoded output from FFmpeg. The info from the metadata (i.e. from the
447+ // CodecContext) may not be as accurate. However, the AVFrame is only available
448+ // late in the call stack, when the frame is decoded, while the CodecContext is
449+ // available early when a stream is added. This is why we use the CodecContext
450+ // for pre-allocating batched output tensors (we could pre-allocate those only
451+ // once we decode the first frame to get the info from the AVFrame, but that's
452+ // more complex logic).
453+ //
454+ // Because the sources for height and width may disagree, we may end up with
455+ // conflicts: e.g. if we pre-allocate a batch output tensor based on the
456+ // metadata info, but the decoded AVFrame has a different height and width.
457+ // It is very important to check the height and width assumptions where the
458+ // tensors' memory is used/filled in order to avoid segfaults.
459+
421460std::tuple<int, int> getHeightAndWidthFromOptionsOrMetadata(
422461 const VideoDecoder::VideoStreamDecoderOptions& options,
423462 const VideoDecoder::StreamMetadata& metadata);
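
To make the note above concrete, here is a minimal standalone sketch of the two resolution strategies and of the kind of dimension check the last paragraph asks for. It is not the code from this change: `Options`, `Metadata`, `DecodedFrame`, the shortened function names, and `frameFitsAllocation` are all hypothetical stand-ins, and the sketch assumes the options and metadata expose height and width as optional integers.

```cpp
#include <cassert>
#include <optional>
#include <tuple>

// Hypothetical stand-ins for VideoStreamDecoderOptions, StreamMetadata and
// AVFrame. Only the fields needed for this sketch are modelled.
struct Options {
  std::optional<int> height;
  std::optional<int> width;
};
struct Metadata {
  std::optional<int> height;
  std::optional<int> width;
};
struct DecodedFrame {
  int height;
  int width;
};

// Batch APIs (and single-frame GPU decoding): user options win, otherwise
// fall back to the stream metadata. value() throws if the metadata is absent.
std::tuple<int, int> heightAndWidthFromOptionsOrMetadata(
    const Options& options, const Metadata& metadata) {
  int height =
      options.height.has_value() ? *options.height : metadata.height.value();
  int width =
      options.width.has_value() ? *options.width : metadata.width.value();
  return std::make_tuple(height, width);
}

// Single-frame CPU decoding: user options win, otherwise use the dimensions
// of the frame that was actually decoded.
std::tuple<int, int> heightAndWidthFromOptionsOrFrame(
    const Options& options, const DecodedFrame& frame) {
  assert(frame.height > 0 && frame.width > 0);
  int height = options.height.has_value() ? *options.height : frame.height;
  int width = options.width.has_value() ? *options.width : frame.width;
  return std::make_tuple(height, width);
}

// Guard to run before writing a decoded frame into a pre-allocated HWC slot:
// returns false when the two sources disagree, so the caller can raise an
// error instead of writing out of bounds.
bool frameFitsAllocation(
    const std::tuple<int, int>& allocatedHeightAndWidth,
    const std::tuple<int, int>& frameHeightAndWidth) {
  return allocatedHeightAndWidth == frameHeightAndWidth;
}
```

In this sketch, a stream whose metadata reports 1080x1920 but whose decoded frame is 1088x1920 would fail `frameFitsAllocation` unless the user requested an explicit output size, which is the conflict described in the note.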