
Commit 901e18a

Add note about audio decoding design
1 parent ae19a78 commit 901e18a

1 file changed: +56 -1 lines changed


src/torchcodec/decoders/_core/VideoDecoder.cpp

Lines changed: 56 additions & 1 deletion
@@ -885,6 +885,58 @@ VideoDecoder::FrameBatchOutput VideoDecoder::getFramesPlayedInRange(
   return frameBatchOutput;
 }

+// Note [Audio Decoding Design]
+// This note explains why audio decoding is implemented the way it is, and why
+// it inherently differs from video decoding.
+//
+// Like for video, FFmpeg exposes the concept of a frame for audio streams. An
+// audio frame is a contiguous sequence of samples, where a sample consists of
+// `numChannels` values. An audio frame, or a sequence thereof, is always
+// converted into a single tensor of shape
+// `(numChannels, numSamplesPerChannel)`.
+//
+// The notion of 'frame' in audio isn't what users want to interact with. Users
+// want to interact with samples. The C++ and core APIs return frames, because
+// we want those to be close to FFmpeg concepts, but the higher-level public
+// APIs expose samples. As a result:
+// - We don't expose index-based APIs for audio, because index-based APIs
+//   explicitly expose the concept of an audio frame. For now, we think
+//   exposing time-based APIs is more natural.
+// - We never perform a scan for audio streams. We don't need to, since we
+//   won't be converting timestamps to indices. That's why we enforce
+//   "seek_mode" to be "approximate" (which is slightly misleading, because
+//   technically the output frames / samples will be perfectly exact).
+//
+// Audio frames are of variable dimensions: in the same stream, a frame can
+// contain 1024 samples and the next one may contain 512 [1]. This makes it
+// impossible to stack audio frames in the same way we can stack video frames.
+// That's why audio frames are *concatenated* along the samples dimension, not
+// stacked. This is also why we cannot re-use the same pre-allocation logic we
+// have for videos in getFramesPlayedInRange(): this would require constant
+// (and known) frame dimensions.
+//
+// [IMPORTANT!] There is one key invariant that we must respect when decoding
+// audio frames:
+//
+//   BEFORE DECODING FRAME i, WE MUST DECODE ALL FRAMES j < i.
+//
+// Always. Why? We don't know. What we know is that if we don't, we get
+// clipped, incorrect audio as output [1]. All other (correct) libraries like
+// TorchAudio or Decord do something similar, whether it was intended or not.
+// This has a few implications:
+// - The **only** place we're allowed to seek to in an audio stream is the
+//   stream's beginning. This ensures that if we need a frame, we'll have
+//   decoded all previous frames.
+// - Because of that, we don't allow the public APIs to seek. Public APIs can
+//   call next() and `getFramesPlayedInRangeAudio()`, but they cannot manually
+//   seek.
+// - We try not to seek when we can avoid it: typically, if the next frame we
+//   need is in the future, we don't seek back to the beginning, we just decode
+//   all the frames in-between.
+//
+// [1] If you're brave and curious, you can read the long "Seek offset for
+// audio" note in https://github.com/pytorch/torchcodec/pull/507/files, which
+// sums up past (and failed) attempts at working around this issue.
 VideoDecoder::AudioFramesOutput VideoDecoder::getFramesPlayedInRangeAudio(
     double startSeconds,
     std::optional<double> stopSecondsOptional) {
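
To make the "concatenate, don't stack" point in the note above concrete, here is a minimal libtorch sketch. It only illustrates the tensor shapes involved; the frame sizes and variable names are made up for illustration and this is not torchcodec code.

    #include <torch/torch.h>
    #include <vector>

    int main() {
      // Two audio frames from the same stream: both have 2 channels, but the
      // number of samples per channel differs (1024 vs 512), so their shapes
      // differ and torch::stack would throw.
      std::vector<torch::Tensor> audioFrames = {
          torch::rand({2, 1024}), torch::rand({2, 512})};

      // Concatenating along the samples dimension (dim 1) works regardless of
      // each frame's length and yields one (numChannels, totalSamples) tensor.
      torch::Tensor samples = torch::cat(audioFrames, /*dim=*/1);
      // samples.sizes() == [2, 1536]

      // Video frames, by contrast, all share the same (C, H, W) shape, so they
      // can be stacked into a new batch dimension: (N, C, H, W).
      std::vector<torch::Tensor> videoFrames = {
          torch::rand({3, 4, 4}), torch::rand({3, 4, 4})};
      torch::Tensor batch = torch::stack(videoFrames, /*dim=*/0);
      return 0;
    }

This is also why the video-style pre-allocation (allocate the full batch up front and write each decoded frame into its slot) has no audio equivalent: the total number of samples is only known once every frame has been decoded.
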
@@ -911,7 +963,7 @@ VideoDecoder::AudioFramesOutput VideoDecoder::getFramesPlayedInRangeAudio(
           streamInfo.lastDecodedAvFrameDuration) {
     // If we need to seek backwards, then we have to seek back to the beginning
     // of the stream.
-    // TODO-AUDIO: document why this is needed in a big comment.
+    // See [Audio Decoding Design].
     setCursorPtsInSecondsInternal(INT64_MIN);
   }
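
The seek in the hunk above, `setCursorPtsInSecondsInternal(INT64_MIN)`, is the only seek the audio path ever performs. Below is a compilable sketch of that forward-only policy under the invariant from the note; `AudioFrame`, `getFirstFrameAt`, and the two callbacks are hypothetical stand-ins for illustration, not torchcodec's actual types or functions.

    #include <functional>
    #include <optional>

    struct AudioFrame {
      double ptsSeconds;
      double durationSeconds;
    };

    // Return the first frame overlapping `startSeconds`, never skipping ahead:
    // before frame i is decoded, every frame j < i has been decoded too.
    // (Assumes the stream contains startSeconds; EOF handling omitted.)
    AudioFrame getFirstFrameAt(
        double startSeconds,
        std::optional<AudioFrame>& lastDecoded,
        const std::function<void()>& seekToStreamBeginning,
        const std::function<AudioFrame()>& decodeNextAudioFrame) {
      // The only allowed seek target is the beginning of the stream, and only
      // when the request points before the end of the last decoded frame.
      if (lastDecoded.has_value() &&
          startSeconds <
              lastDecoded->ptsSeconds + lastDecoded->durationSeconds) {
        seekToStreamBeginning();
        lastDecoded.reset();
      }
      // Otherwise (or after rewinding), decode forward frame by frame until we
      // reach the requested position; no intermediate frame is skipped.
      AudioFrame frame = decodeNextAudioFrame();
      while (frame.ptsSeconds + frame.durationSeconds <= startSeconds) {
        frame = decodeNextAudioFrame();
      }
      lastDecoded = frame;
      return frame;
    }

The cost of this design is that a request landing in the past triggers a full re-decode from the start of the stream, which is the trade-off the note accepts in exchange for correct output.
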

@@ -955,6 +1007,8 @@ VideoDecoder::AudioFramesOutput VideoDecoder::getFramesPlayedInRangeAudio(
 // --------------------------------------------------------------------------

 void VideoDecoder::setCursorPtsInSeconds(double seconds) {
+  // We don't allow public audio decoding APIs to seek, see [Audio Decoding
+  // Design]
   validateActiveStream(AVMEDIA_TYPE_VIDEO);
   setCursorPtsInSecondsInternal(seconds);
 }
@@ -995,6 +1049,7 @@ bool VideoDecoder::canWeAvoidSeeking() const {
   if (streamInfo.avMediaType == AVMEDIA_TYPE_AUDIO) {
     // For audio, we only need to seek if a backwards seek was requested within
     // getFramesPlayedInRangeAudio(), when setCursorPtsInSeconds() was called.
+    // For more context, see [Audio Decoding Design]
     return !cursorWasJustSet_;
   }
   int64_t lastDecodedAvFramePts =
