@@ -892,37 +892,36 @@ VideoDecoder::FrameBatchOutput VideoDecoder::getFramesPlayedInRange(
 // Like for video, FFmpeg exposes the concept of a frame for audio streams. An
 // audio frame is a contiguous sequence of samples, where a sample consists of
 // `numChannels` values. An audio frame, or a sequence thereof, is always
-// converted into a tensor of shape `(numChannels, numSamplesPerChannel)`
-// tensors.
+// converted into a tensor of shape `(numChannels, numSamplesPerChannel)`.
 //
 // The notion of 'frame' in audio isn't what users want to interact with. Users
 // want to interact with samples. The C++ and core APIs return frames, because
 // we want those to be close to FFmpeg concepts, but the higher-level public
 // APIs expose samples. As a result:
-// - We don't expose index-based APIs for audio, because exposing index-based
-//   APIs explicitly exposes the concept of audio frame. For now, we think
-//   exposing time-based APIs is more natural.
+// - We don't expose index-based APIs for audio, because that would mean
+//   exposing the concept of audio frame. For now, we think exposing time-based
+//   APIs is more natural.
 // - We never perform a scan for audio streams. We don't need to, since we won't
-//   be converting timestamps to indices. That's why we enforce the "seek_mode"
+//   be converting timestamps to indices. That's why we enforce the seek_mode
 //   to be "approximate" (which is slightly misleading, because technically the
-//   output frames / samples will be at their exact positions. But this
-//   incongruence is only exposed at the C++/core private levels).
+//   output samples will be at their exact positions. But this incongruence is
+//   only exposed at the C++/core private levels).
 //
 // Audio frames are of variable dimensions: in the same stream, a frame can
 // contain 1024 samples and the next one may contain 512 [1]. This makes it
 // impossible to stack audio frames in the same way we can stack video frames.
-// That's why audio frames are *concatenated* along the samples dimension, not
-// stacked. This is also why we cannot re-use the same pre-allocation logic we
-// have for videos in getFramesPlayedInRange(): this would require constant (and
-// known) frame dimensions.
+// This is one of the main reasons we cannot reuse the same pre-allocation logic
+// we have for videos in getFramesPlayedInRange(): pre-allocating a batch
+// requires constant (and known) frame dimensions. That's also why audio frames
+// are *concatenated* along the samples dimension, not stacked.
 //
 // [IMPORTANT!] There is one key invariant that we must respect when decoding
 // audio frames:
 //
 // BEFORE DECODING FRAME i, WE MUST DECODE ALL FRAMES j < i.
 //
 // Always. Why? We don't know. What we know is that if we don't, we get clipped,
-// incorrect audio as output [1]. All other (correct) libraries like TorchAudio
+// incorrect audio as output [2]. All other (correct) libraries like TorchAudio
 // or Decord do something similar, whether it was intended or not. This has a
 // few implications:
 // - The **only** place we're allowed to seek to in an audio stream is the
@@ -935,7 +934,7 @@ VideoDecoder::FrameBatchOutput VideoDecoder::getFramesPlayedInRange(
 //   need is in the future, we don't seek back to the beginning, we just decode
 //   all the frames in-between.
 //
-// [1] If you're brave and curious, you can read the long "Seek offset for
+// [2] If you're brave and curious, you can read the long "Seek offset for
 // audio" note in https://github.com/pytorch/torchcodec/pull/507/files, which
 // sums up past (and failed) attempts at working around this issue.
 VideoDecoder::AudioFramesOutput VideoDecoder::getFramesPlayedInRangeAudio(
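The concatenation-vs-stacking point in the comment above can be sketched with a small, illustrative NumPy example (not torchcodec code; `frame_a` and `frame_b` are hypothetical stand-ins for two decoded audio frames with different sample counts):

```python
import numpy as np

num_channels = 2

# Two hypothetical decoded audio frames: consecutive frames in the same
# stream may carry different numbers of samples (e.g. 1024, then 512).
frame_a = np.zeros((num_channels, 1024))
frame_b = np.ones((num_channels, 512))

# Stacking into a batch requires identical shapes, so it fails here:
# np.stack([frame_a, frame_b])  # -> ValueError: all input arrays must
#                               #    have the same shape

# Concatenating along the samples dimension (axis 1) works, and yields a
# single (numChannels, numSamplesPerChannel) array:
samples = np.concatenate([frame_a, frame_b], axis=1)
print(samples.shape)  # (2, 1536)
```

This also shows why a batch cannot be pre-allocated up front as it is for video: the total sample count is only known once every frame has been decoded.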