Commit 5d194e5

1 parent 2a78b84 commit 5d194e5

2 files changed: 77 additions, 0 deletions

src/torchcodec/_core/BetaCudaDeviceInterface.cpp

Lines changed: 5 additions & 0 deletions
@@ -533,6 +533,11 @@ void BetaCudaDeviceInterface::FrameBuffer::markSlotReadyAndSetInfo(
       slotId,
       ". This should never happen.");
 
+  TORCH_CHECK(
+      it->second.state == SlotState::BEING_DECODED,
+      "Slot ",
+      slotId,
+      " is not in BEING_DECODED state. This should never happen.");
   it->second.state = SlotState::READY_FOR_OUTPUT;
   it->second.dispInfo = *dispInfo;
 }
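
The added check guards the BEING_DECODED to READY_FOR_OUTPUT transition. A minimal standalone sketch of that guard follows. It is illustrative only: `GuardedSlotMap`, `markAsBeingDecoded`, `markSlotReady` and `stateOf` are hypothetical names, `std::logic_error` stands in for `TORCH_CHECK`, and a plain `pts` field stands in for the stored `dispInfo`.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>

enum class SlotState { BEING_DECODED, READY_FOR_OUTPUT };

// Hypothetical, simplified slot: the real Slot stores dispInfo, not just a pts.
struct Slot {
  SlotState state = SlotState::BEING_DECODED;
  int64_t pts = -1;
};

// Hypothetical stand-in for the FrameBuffer's slot map.
class GuardedSlotMap {
 public:
  // Registers a slot that has just been sent for decoding.
  void markAsBeingDecoded(int slotId) {
    map_[slotId] = Slot{};
  }

  // Moves a slot to READY_FOR_OUTPUT, refusing any slot that is missing
  // or that is not currently BEING_DECODED.
  void markSlotReady(int slotId, int64_t pts) {
    auto it = map_.find(slotId);
    if (it == map_.end()) {
      throw std::logic_error(
          "Could not find slot " + std::to_string(slotId));
    }
    if (it->second.state != SlotState::BEING_DECODED) {
      throw std::logic_error(
          "Slot " + std::to_string(slotId) +
          " is not in BEING_DECODED state.");
    }
    it->second.state = SlotState::READY_FOR_OUTPUT;
    it->second.pts = pts;
  }

  // Throws std::out_of_range if the slot does not exist.
  SlotState stateOf(int slotId) const {
    return map_.at(slotId).state;
  }

 private:
  std::unordered_map<int, Slot> map_;
};
```

In this sketch, marking the same slot ready twice throws on the second call, which is the situation the new check is meant to surface early.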

src/torchcodec/_core/BetaCudaDeviceInterface.h

Lines changed: 72 additions & 0 deletions
@@ -94,6 +94,7 @@ class BetaCudaDeviceInterface : public DeviceInterface {
     }
 
    private:
+    // Map of slotId to Slot
     std::unordered_map<int, Slot> map_;
   };
 

@@ -125,3 +126,74 @@ class BetaCudaDeviceInterface : public DeviceInterface {
 };
 
 } // namespace facebook::torchcodec
+
+// Note: [sendPacket, receiveFrame, frame ordering and NVCUVID callbacks]
+//
+// At a high level, this decoding interface mimics the FFmpeg send/receive
+// architecture:
+// - sendPacket(AVPacket) sends an AVPacket from the FFmpeg demuxer to the
+//   NVCUVID parser.
+// - receiveFrame(AVFrame) is a non-blocking call:
+//   - if a frame is ready **in display order**, it must return it. By display
+//     order, we mean that receiveFrame() must return frames with increasing pts
+//     values when called successively.
+//   - if no frame is ready, it must return AVERROR(EAGAIN) to indicate the
+//     caller should send more packets.
+//
+// The rest of this note assumes you have a reasonable level of familiarity with
+// the sendPacket/receiveFrame calling pattern. If you don't, look up the core
+// decoding loop in SingleVideoDecoder.
+//
+// The frame re-ordering problem:
+// Depending on the codec and on the encoding parameters, a packet from a video
+// stream may contain exactly one frame, more than one frame, or a fraction of a
+// frame. And, there may be non-linear frame dependencies because of B-frames,
+// which need both past *and* future frames to be decoded. Consider the
+// following stream, with frames presented in display order: I0 B1 P2 B3 P4 ...
+// - I0 is an I-frame (also key frame, can be decoded independently)
+// - B1 is a B-frame (bi-directional) which needs both I0 and P2 to be decoded
+// - P2 is a P-frame (predicted frame) which only needs I0 to be decoded.
+//
+// Because B1 needs both I0 and P2 to be properly decoded, the decode order must
+// be: I0 P2 B1 P4 B3 ... which is different from the display order.
+//
+// We don't have to worry about the decode order: it's up to the parser to
+// figure that out. But we have to make sure that receiveFrame() returns frames
+// in display order.
+//
+// sendPacket(AVPacket)'s job is just to send the packet to the NVCUVID parser
+// by calling cuvidParseVideoData(packet). When cuvidParseVideoData(packet) is
+// called, it may trigger callbacks, particularly:
+// - frameReadyForDecoding(picParams): triggered **in decode order** when the
+//   parser has accumulated enough data to decode a frame. We send that frame to
+//   the NVDEC hardware for **async** decoding. While that frame is being
+//   decoded, we store a light reference (a Slot) to that frame in the
+//   frameBuffer_, and mark that slot as BEING_DECODED. The value that uniquely
+//   identifies that frame in the frameBuffer_ is its "slotId", which is given
+//   to us by NVCUVID in the callback parameter: picParams->CurrPicIdx.
+// - frameReadyInDisplayOrder(dispInfo): triggered **in display order** when a
+//   frame is ready to be "displayed" (returned). When it is triggered, we look
+//   up the corresponding frame/slot in the frameBuffer_, using
+//   dispInfo->picture_index to match it against a given BEING_DECODED slotId.
+//   We mark that frame/slot as READY_FOR_OUTPUT.
+//   Crucially, this callback also tells us the pts of that frame. We store
+//   the pts and other relevant info in the slot.
+//
+// Said differently, from the perspective of the frameBuffer_, at any point in
+// time a slot/frame in the frameBuffer_ can be in 3 states:
+// - empty: no slot for that slotId exists in the frameBuffer_
+// - BEING_DECODED: frameReadyForDecoding was triggered for that frame, and the
+//   frame was sent to NVDEC for async decoding. We don't know its pts because
+//   the parser didn't trigger frameReadyInDisplayOrder() for that frame yet.
+// - READY_FOR_OUTPUT: frameReadyInDisplayOrder was triggered for that frame, it
+//   is decoded and ready to be mapped and returned. We know its pts.
+//
+// Because frameReadyInDisplayOrder is triggered in display order, we know that
+// if a slot is READY_FOR_OUTPUT, then all frames with a lower pts are also
+// READY_FOR_OUTPUT, or already returned. So when receiveFrame() is called, we
+// just need to look for the READY_FOR_OUTPUT slot with the lowest pts, and
+// return that frame. This guarantees that receiveFrame() returns frames in
+// display order. If no slot is READY_FOR_OUTPUT, then we return EAGAIN to
+// indicate the caller should send more packets.
+//
+// Simple, innit?
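
The selection rule described in the note (return the READY_FOR_OUTPUT slot with the lowest pts, otherwise signal EAGAIN) can be sketched without any NVCUVID or FFmpeg dependency. This is a rough illustration, not the code from this commit: `SketchFrameBuffer` and its method names are hypothetical, the callbacks are reduced to plain method calls, a `false` return stands in for `AVERROR(EAGAIN)`, and a frame is represented only by its pts.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

enum class SlotState { BEING_DECODED, READY_FOR_OUTPUT };

// Hypothetical, simplified slot holding only what the sketch needs.
struct Slot {
  SlotState state;
  int64_t pts;
};

class SketchFrameBuffer {
 public:
  // Called in decode order: the pts is not known yet.
  void onFrameReadyForDecoding(int slotId) {
    map_[slotId] = Slot{SlotState::BEING_DECODED, -1};
  }

  // Called in display order: the pts becomes known here.
  // map_.at() throws std::out_of_range for an unknown slotId.
  void onFrameReadyInDisplayOrder(int slotId, int64_t pts) {
    Slot& slot = map_.at(slotId);
    slot.state = SlotState::READY_FOR_OUTPUT;
    slot.pts = pts;
  }

  // Returns true and writes the lowest ready pts to ptsOut, freeing the
  // slot. Returns false (the sketch's EAGAIN) when no slot is ready.
  bool receiveFrame(int64_t& ptsOut) {
    auto best = map_.end();
    for (auto it = map_.begin(); it != map_.end(); ++it) {
      if (it->second.state != SlotState::READY_FOR_OUTPUT) {
        continue;
      }
      if (best == map_.end() || it->second.pts < best->second.pts) {
        best = it;
      }
    }
    if (best == map_.end()) {
      return false;
    }
    ptsOut = best->second.pts;
    map_.erase(best);
    return true;
  }

 private:
  std::unordered_map<int, Slot> map_;
};
```

Feeding this sketch the decode order I0 P2 B1 while firing the display callback in the order I0 B1 P2 yields pts values in increasing order from successive `receiveFrame` calls, and `false` whenever every remaining slot is still BEING_DECODED.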
