@@ -885,6 +885,58 @@ VideoDecoder::FrameBatchOutput VideoDecoder::getFramesPlayedInRange(
   return frameBatchOutput;
 }

+// Note [Audio Decoding Design]
+// This note explains why audio decoding is implemented the way it is, and why
+// it inherently differs from video decoding.
+//
+// Like for video, FFmpeg exposes the concept of a frame for audio streams. An
+// audio frame is a contiguous sequence of samples, where a sample consists of
+// `numChannels` values. An audio frame, or a sequence thereof, is always
+// converted into a tensor of shape `(numChannels, numSamplesPerChannel)`.
+//
+// The notion of 'frame' in audio isn't what users want to interact with. Users
+// want to interact with samples. The C++ and core APIs return frames, because
+// we want those to be close to FFmpeg concepts, but the higher-level public
+// APIs expose samples. As a result:
+// - We don't expose index-based APIs for audio, because index-based APIs
+//   explicitly expose the concept of an audio frame. For now, we think
+//   exposing time-based APIs is more natural.
+// - We never perform a scan for audio streams. We don't need to, since we
+//   won't be converting timestamps to indices. That's why we enforce the
+//   "seek_mode" to be "approximate" (which is slightly misleading, because
+//   technically the output frames / samples will be perfectly exact).
+//
+// Audio frames have variable dimensions: within the same stream, one frame
+// may contain 1024 samples and the next one may contain 512 [1]. This makes
+// it impossible to stack audio frames the way we stack video frames. That's
+// why audio frames are *concatenated* along the samples dimension, not
+// stacked. This is also why we cannot re-use the pre-allocation logic we have
+// for videos in getFramesPlayedInRange(): that would require constant (and
+// known) frame dimensions.
+//
+// [IMPORTANT!] There is one key invariant that we must respect when decoding
+// audio frames:
+//
+// BEFORE DECODING FRAME i, WE MUST DECODE ALL FRAMES j < i.
+//
+// Always. Why? We don't know. What we know is that if we don't, we get
+// clipped, incorrect audio as output [1]. All other (correct) libraries like
+// TorchAudio or Decord do something similar, whether it was intended or not.
+// This has a few implications:
+// - The **only** place we're allowed to seek to in an audio stream is the
+//   stream's beginning. This ensures that if we need a frame, we'll have
+//   decoded all previous frames.
+// - Because of that, we don't allow the public APIs to seek. Public APIs can
+//   call next() and `getFramesPlayedInRangeAudio()`, but they cannot manually
+//   seek.
+// - We try not to seek when we can avoid it. Typically, if the next frame we
+//   need is in the future, we don't seek back to the beginning; we just
+//   decode all the frames in between.
+//
+// [1] If you're brave and curious, you can read the long "Seek offset for
+// audio" note in https://github.com/pytorch/torchcodec/pull/507/files, which
+// sums up past (and failed) attempts at working around this issue.
 VideoDecoder::AudioFramesOutput VideoDecoder::getFramesPlayedInRangeAudio(
     double startSeconds,
     std::optional<double> stopSecondsOptional) {
@@ -911,7 +963,7 @@ VideoDecoder::AudioFramesOutput VideoDecoder::getFramesPlayedInRangeAudio(
       streamInfo.lastDecodedAvFrameDuration) {
     // If we need to seek backwards, then we have to seek back to the beginning
     // of the stream.
-    // TODO-AUDIO: document why this is needed in a big comment.
+    // See [Audio Decoding Design].
     setCursorPtsInSecondsInternal(INT64_MIN);
   }

@@ -955,6 +1007,8 @@ VideoDecoder::AudioFramesOutput VideoDecoder::getFramesPlayedInRangeAudio(
 // --------------------------------------------------------------------------

 void VideoDecoder::setCursorPtsInSeconds(double seconds) {
+  // We don't allow public audio decoding APIs to seek, see [Audio Decoding
+  // Design].
   validateActiveStream(AVMEDIA_TYPE_VIDEO);
   setCursorPtsInSecondsInternal(seconds);
 }
@@ -995,6 +1049,7 @@ bool VideoDecoder::canWeAvoidSeeking() const {
   if (streamInfo.avMediaType == AVMEDIA_TYPE_AUDIO) {
     // For audio, we only need to seek if a backwards seek was requested within
     // getFramesPlayedInRangeAudio(), when setCursorPtsInSeconds() was called.
+    // For more context, see [Audio Decoding Design].
     return !cursorWasJustSet_;
   }
   int64_t lastDecodedAvFramePts =