Skip to content

Commit f55dcc0

Browse files
committed
Update comment
1 parent 390fd7c commit f55dcc0

File tree

1 file changed

+58
-40
lines changed

1 file changed

+58
-40
lines changed

src/torchcodec/_core/BetaCudaDeviceInterface.h

Lines changed: 58 additions & 40 deletions
Original file line numberDiff line numberDiff line change
@@ -93,7 +93,8 @@ class BetaCudaDeviceInterface : public DeviceInterface {
9393

9494
} // namespace facebook::torchcodec
9595

96-
// Note: [sendPacket, receiveFrame, frame ordering and NVCUVID callbacks]
96+
/* clang-format off */
97+
// Note: [General design, sendPacket, receiveFrame, frame ordering and NVCUVID callbacks]
9798
//
9899
// At a high level, this decoding interface mimics the FFmpeg send/receive
99100
// architecture:
@@ -120,46 +121,63 @@ class BetaCudaDeviceInterface : public DeviceInterface {
120121
// - B1 is a B-frame (bi-directional) which needs both I0 and P2 to be decoded
121122
// - P2 is a P-frame (predicted frame) which only needs I0 to be decodec.
122123
//
123-
// Because B1 needs both I0 and P2 to be properly decoded, the decode order must
124-
// be: I0 P2 B1 P4 B3 ... which is different from the display order.
124+
// Because B1 needs both I0 and P2 to be properly decoded, the decode order
125+
// (packet order), defined by the encoder, must be: I0 P2 B1 P4 B3 ... which is
126+
// different from the display order.
125127
//
126-
// We don't have to worry about the decode order: it's up to the parser to
127-
// figure that out. But we have to make sure that receiveFrame() returns frames
128-
// in display order.
129-
//
130-
// SendPacket(AVPacket)'s job is just to send the packet to the NVCUVID parser
131-
// by calling cuvidParseVideoData(packet). When cuvidParseVideoData(packet) is
132-
// called, it may trigger callbacks, particularly:
128+
// SendPacket(AVPacket)'s job is just to pass down the packet to the NVCUVID
129+
// parser by calling cuvidParseVideoData(packet). When
130+
// cuvidParseVideoData(packet) is called, it may trigger callbacks,
131+
// particularly:
132+
// - streamPropertyChange(videoFormat): triggered once at the start of the
133+
// stream, and possibly later if the stream properties change (e.g.
134+
// resolution).
133135
// - frameReadyForDecoding(picParams)): triggered **in decode order** when the
134136
// parser has accumulated enough data to decode a frame. We send that frame to
135-
// the NVDEC hardware for **async** decoding. While that frame is being
136-
// decoded, we store a light reference (a Slot) to that frame in the
137-
// readyFrames_, and mark that slot as BEING_DECODED. The value that uniquely
138-
// identifies that frame in the readyFrames_ is its "slotId", which is given
139-
// to us by NVCUVID in the callback parameter: picParams->CurrPicIdx.
137+
// the NVDEC hardware for **async** decoding.
140138
// - frameReadyInDisplayOrder(dispInfo)): triggered **in display order** when a
141-
// frame is ready to be "displayed" (returned). When it is triggered, we look
142-
// up the corresponding frame/slot in the readyFrames_, using
143-
// dispInfo->picture_index to match it against a given BEING_DECODED slotId.
144-
// We mark that frame/slot as READY_FOR_OUTPUT.
145-
// Crucially, this callback also tells us the pts of that frame. We store
146-
// the pts and other relevant info the slot.
147-
//
148-
// Said differently, from the perspective of the readyFrames_, at any point in
149-
// time a slot/frame in the readyFrames_ can be in 3 states:
150-
// - empty: no slot for that slotId exists in the readyFrames_
151-
// - BEING_DECODED: frameReadyForDecoding was triggered for that frame, and the
152-
// frame was sent to NVDEC for async decoding. We don't know its pts because
153-
// the parser didn't trigger frameReadyInDisplayOrder() for that frame yet.
154-
// - READY_FOR_OUTPUT: frameReadyInDisplayOrder was triggered for that frame, it
155-
// is decoded and ready to be mapped and returned. We know its pts.
156-
//
157-
// Because frameReadyInDisplayOrder is triggered in display order, we know that
158-
// if a slot is READY_FOR_OUTPUT, then all frames with a lower pts are also
159-
// READY_FOR_OUTPUT, or already returned. So when receiveFrame() is called, we
160-
// just need to look for the READY_FOR_OUTPUT slot with the lowest pts, and
161-
// return that frame. This guarantees that receiveFrame() returns frames in
162-
// display order. If no slot is READY_FOR_OUTPUT, then we return EAGAIN to
163-
// indicate the caller should send more packets.
164-
//
165-
// Simple, innit?
139+
// frame is ready to be "displayed" (returned). At that point, the parser also
140+
// gives us the pts of that frame. We store (a reference to) that frame in a
141+
// FIFO queue: readyFrames_.
142+
//
143+
// When receiveFrame(AVFrame) is called, if readyFrames_ is not empty, we pop
144+
// the front of the queue, which is the next frame in display order, and map it
145+
// to an AVFrame by calling cuvidMapVideoFrame(). If readyFrames_ is empty we
146+
// return EAGAIN to indicate the caller should send more packets.
147+
//
148+
// There is potentially a small inefficiency due to the callback design: in
149+
// order for us to know that a frame is ready in display order, we need the
150+
// frameReadyInDisplayOrder callback to be triggered. This can only happen
151+
// within cuvidParseVideoData(packet) in sendPacket(). This means there may be
152+
// the following sequence of calls:
153+
//
154+
// sendPacket(relevantAVPacket)
155+
// cuvidParseVideoData(relevantAVPacket)
156+
// frameReadyForDecoding()
157+
// cuvidDecodePicture() Send frame to NVDEC for async decoding
158+
//
159+
// receiveFrame() -> EAGAIN Frame is potentially already decoded
160+
// and could be returned, but we don't
161+
// know because frameReadyInDisplayOrder
162+
// hasn't been triggered yet. We'll only
163+
// know after sending another,
164+
// potentially irrelevant packet.
165+
//
166+
// sendPacket(irrelevantAVPacket)
167+
// cuvidParseVideoData(irrelevantAVPacket)
168+
// frameReadyInDisplayOrder() Only now do we know that our target
169+
// frame is ready.
170+
//
171+
// receiveFrame() return target frame
172+
//
173+
// How much this matters in practice is unclear, but probably negligible in
174+
// general. Particularly when frames are decoded consecutively anyway, the
175+
// "irrelevantPacket" is actually relevant for a future target frame.
176+
//
177+
// Note that the alternative is to *not* rely on the frameReadyInDisplayOrder
178+
// callback. It's technically possible, but it would mean we now have to solve
179+
// two hard, *codec-dependent* problems that the callback was solving for us:
180+
// - we have to guess the frame's pts ourselves
181+
// - we have to re-order the frames ourselves to preserve display order.
182+
//
183+
/* clang-format on */

0 commit comments

Comments
 (0)