@@ -892,37 +892,36 @@ VideoDecoder::FrameBatchOutput VideoDecoder::getFramesPlayedInRange(
 // Like for video, FFmpeg exposes the concept of a frame for audio streams. An
 // audio frame is a contiguous sequence of samples, where a sample consists of
 // `numChannels` values. An audio frame, or a sequence thereof, is always
-// converted into a tensor of shape `(numChannels, numSamplesPerChannel)`
-// tensors.
+// converted into a tensor of shape `(numChannels, numSamplesPerChannel)`.
 //
 // The notion of 'frame' in audio isn't what users want to interact with. Users
 // want to interact with samples. The C++ and core APIs return frames, because
 // we want those to be close to FFmpeg concepts, but the higher-level public
 // APIs expose samples. As a result:
-// - We don't expose index-based APIs for audio, because exposing index-based
-//   APIs explicitly exposes the concept of audio frame. For now, we think
-//   exposing time-based APIs is more natural.
+// - We don't expose index-based APIs for audio, because that would mean
+//   exposing the concept of audio frame. For now, we think exposing time-based
+//   APIs is more natural.
 // - We never perform a scan for audio streams. We don't need to, since we won't
-//   be converting timestamps to indices. That's why we enforce the "seek_mode"
+//   be converting timestamps to indices. That's why we enforce the seek_mode
 //   to be "approximate" (which is slightly misleading, because technically the
-//   output frames / samples will be at their exact positions. But this
-//   incongruence is only exposed at the C++/core private levels).
+//   output samples will be at their exact positions. But this incongruence is
+//   only exposed at the C++/core private levels).
 //
 // Audio frames are of variable dimensions: in the same stream, a frame can
 // contain 1024 samples and the next one may contain 512 [1]. This makes it
 // impossible to stack audio frames in the same way we can stack video frames.
-// That's why audio frames are *concatenated* along the samples dimension, not
-// stacked. This is also why we cannot re-use the same pre-allocation logic we
-// have for videos in getFramesPlayedInRange(): this would require constant (and
-// known) frame dimensions.
+// This is one of the main reasons we cannot reuse the same pre-allocation logic
+// we have for videos in getFramesPlayedInRange(): pre-allocating a batch
+// requires constant (and known) frame dimensions. That's also why audio frames
+// are *concatenated* along the samples dimension, not stacked.
 //
 // [IMPORTANT!] There is one key invariant that we must respect when decoding
 // audio frames:
 //
 // BEFORE DECODING FRAME i, WE MUST DECODE ALL FRAMES j < i.
 //
 // Always. Why? We don't know. What we know is that if we don't, we get clipped,
-// incorrect audio as output [1]. All other (correct) libraries like TorchAudio
+// incorrect audio as output [2]. All other (correct) libraries like TorchAudio
 // or Decord do something similar, whether it was intended or not. This has a
 // few implications:
 // - The **only** place we're allowed to seek to in an audio stream is the
@@ -935,7 +934,7 @@ VideoDecoder::FrameBatchOutput VideoDecoder::getFramesPlayedInRange(
 //   need is in the future, we don't seek back to the beginning, we just decode
 //   all the frames in-between.
 //
-// [1] If you're brave and curious, you can read the long "Seek offset for
+// [2] If you're brave and curious, you can read the long "Seek offset for
 // audio" note in https://github.com/pytorch/torchcodec/pull/507/files, which
 // sums up past (and failed) attempts at working around this issue.
 VideoDecoder::AudioFramesOutput VideoDecoder::getFramesPlayedInRangeAudio(
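The concatenation-vs-stacking point in the comment above can be sketched with a small, illustrative NumPy example (not torchcodec code; `frame_a` and `frame_b` are hypothetical stand-ins for two decoded audio frames with different sample counts):

```python
import numpy as np

num_channels = 2

# Two hypothetical decoded audio frames: consecutive frames in the same
# stream may carry different numbers of samples (e.g. 1024, then 512).
frame_a = np.zeros((num_channels, 1024))
frame_b = np.ones((num_channels, 512))

# Stacking into a batch requires identical shapes, so it fails here:
# np.stack([frame_a, frame_b])  # -> ValueError: all input arrays must
#                               #    have the same shape

# Concatenating along the samples dimension (axis 1) works, and yields a
# single (numChannels, numSamplesPerChannel) array:
samples = np.concatenate([frame_a, frame_b], axis=1)
print(samples.shape)  # (2, 1536)
```

This also shows why a batch cannot be pre-allocated up front as it is for video: the total sample count is only known once every frame has been decoded.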