@@ -94,6 +94,7 @@ class BetaCudaDeviceInterface : public DeviceInterface {
}

private:
+ // Map of slotId to Slot
std::unordered_map<int , Slot> map_;
};

@@ -125,3 +126,74 @@ class BetaCudaDeviceInterface : public DeviceInterface {
};

} // namespace facebook::torchcodec
+
+ // Note: [sendPacket, receiveFrame, frame ordering and NVCUVID callbacks]
+ //
+ // At a high level, this decoding interface mimics the FFmpeg send/receive
+ // architecture:
+ // - sendPacket(AVPacket) sends an AVPacket from the FFmpeg demuxer to the
+ //   NVCUVID parser.
+ // - receiveFrame(AVFrame) is a non-blocking call:
+ //   - if a frame is ready **in display order**, it must return it. By display
+ //     order, we mean that receiveFrame() must return frames with increasing pts
+ //     values when called successively.
+ //   - if no frame is ready, it must return AVERROR(EAGAIN) to indicate the
+ //     caller should send more packets.
+ //
+ // The rest of this note assumes you have a reasonable level of familiarity with
+ // the sendPacket/receiveFrame calling pattern. If you don't, look up the core
+ // decoding loop in SingleVideoDecoder.
+ //
+ // The frame re-ordering problem:
+ // Depending on the codec and on the encoding parameters, a packet from a video
+ // stream may contain exactly one frame, more than one frame, or a fraction of a
+ // frame. There may also be non-linear frame dependencies because of B-frames,
+ // which need both past *and* future frames to be decoded. Consider the
+ // following stream, with frames presented in display order: I0 B1 P2 B3 P4 ...
+ // - I0 is an I-frame (also a key frame; it can be decoded independently).
+ // - B1 is a B-frame (bi-directional) which needs both I0 and P2 to be decoded.
+ // - P2 is a P-frame (predicted frame) which only needs I0 to be decoded.
+ //
+ // Because B1 needs both I0 and P2 to be properly decoded, the decode order must
+ // be: I0 P2 B1 P4 B3 ... which is different from the display order.
+ //
+ // We don't have to worry about the decode order: it's up to the parser to
+ // figure that out. But we have to make sure that receiveFrame() returns frames
+ // in display order.
+ //
+ // sendPacket(AVPacket)'s job is just to send the packet to the NVCUVID parser
+ // by calling cuvidParseVideoData(packet). When cuvidParseVideoData(packet) is
+ // called, it may trigger callbacks, particularly:
+ // - frameReadyForDecoding(picParams): triggered **in decode order** when the
+ //   parser has accumulated enough data to decode a frame. We send that frame to
+ //   the NVDEC hardware for **async** decoding. While that frame is being
+ //   decoded, we store a light reference (a Slot) to that frame in the
+ //   frameBuffer_, and mark that slot as BEING_DECODED. The value that uniquely
+ //   identifies that frame in the frameBuffer_ is its "slotId", which is given
+ //   to us by NVCUVID in the callback parameter: picParams->CurrPicIdx.
+ // - frameReadyInDisplayOrder(dispInfo): triggered **in display order** when a
+ //   frame is ready to be "displayed" (returned). When it is triggered, we look
+ //   up the corresponding frame/slot in the frameBuffer_, using
+ //   dispInfo->picture_index to match it against a given BEING_DECODED slotId.
+ //   We mark that frame/slot as READY_FOR_OUTPUT.
+ //   Crucially, this callback also tells us the pts of that frame. We store
+ //   the pts and other relevant info in the slot.
+ //
+ // Said differently, from the perspective of the frameBuffer_, at any point in
+ // time a slot/frame in the frameBuffer_ can be in 3 states:
+ // - empty: no slot for that slotId exists in the frameBuffer_
+ // - BEING_DECODED: frameReadyForDecoding was triggered for that frame, and the
+ //   frame was sent to NVDEC for async decoding. We don't know its pts because
+ //   the parser didn't trigger frameReadyInDisplayOrder() for that frame yet.
+ // - READY_FOR_OUTPUT: frameReadyInDisplayOrder was triggered for that frame, it
+ //   is decoded and ready to be mapped and returned. We know its pts.
+ //
+ // Because frameReadyInDisplayOrder is triggered in display order, we know that
+ // if a slot is READY_FOR_OUTPUT, then all frames with a lower pts are also
+ // READY_FOR_OUTPUT, or already returned. So when receiveFrame() is called, we
+ // just need to look for the READY_FOR_OUTPUT slot with the lowest pts, and
+ // return that frame. This guarantees that receiveFrame() returns frames in
+ // display order. If no slot is READY_FOR_OUTPUT, then we return EAGAIN to
+ // indicate the caller should send more packets.
+ //
+ // Simple, innit?
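
The slot state machine described in the note can be sketched in isolation. This is only an illustration under stated assumptions, not the code in this PR: `FrameBufferSketch`, `onReadyForDecoding`, `onReadyInDisplayOrder`, `kEagain` and `exampleDecodeVsDisplayOrder` are hypothetical names, the two callbacks are plain methods instead of real NVCUVID callbacks, `-EAGAIN` stands in for FFmpeg's `AVERROR(EAGAIN)`, and a frame is reduced to its pts.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-ins for illustration; not the real torchcodec types.
enum class SlotState { BEING_DECODED, READY_FOR_OUTPUT };

struct Slot {
  SlotState state = SlotState::BEING_DECODED;
  std::int64_t pts = -1; // unknown until the display-order callback fires
};

// Stand-in for AVERROR(EAGAIN), so the sketch needs no FFmpeg headers.
constexpr int kEagain = -EAGAIN;

class FrameBufferSketch {
 public:
  // Plays the role of frameReadyForDecoding: called in decode order.
  void onReadyForDecoding(int slotId) { slots_[slotId] = Slot{}; }

  // Plays the role of frameReadyInDisplayOrder: called in display order,
  // and the only place where the pts becomes known.
  void onReadyInDisplayOrder(int slotId, std::int64_t pts) {
    auto it = slots_.find(slotId);
    if (it == slots_.end()) {
      return; // unknown slot: simply ignored in this sketch
    }
    it->second.state = SlotState::READY_FOR_OUTPUT;
    it->second.pts = pts;
  }

  // Plays the role of receiveFrame: hands out the READY_FOR_OUTPUT slot with
  // the lowest pts and frees it, or returns kEagain if no slot is ready.
  int receiveFrame(std::int64_t& ptsOut) {
    auto best = slots_.end();
    for (auto it = slots_.begin(); it != slots_.end(); ++it) {
      if (it->second.state != SlotState::READY_FOR_OUTPUT) {
        continue;
      }
      if (best == slots_.end() || it->second.pts < best->second.pts) {
        best = it;
      }
    }
    if (best == slots_.end()) {
      return kEagain;
    }
    ptsOut = best->second.pts;
    slots_.erase(best); // the slot goes back to the "empty" state
    return 0;
  }

 private:
  std::unordered_map<int, Slot> slots_; // slotId -> Slot
};

// Usage: display order I0 B1 P2 (pts 0, 1, 2), decode order I0 P2 B1.
inline void exampleDecodeVsDisplayOrder() {
  FrameBufferSketch fb;
  std::int64_t pts = -1;

  fb.onReadyForDecoding(0); // I0
  fb.onReadyForDecoding(1); // P2 is decoded before B1
  fb.onReadyForDecoding(2); // B1
  assert(fb.receiveFrame(pts) == kEagain); // nothing is displayable yet

  fb.onReadyInDisplayOrder(0, 0); // I0
  fb.onReadyInDisplayOrder(2, 1); // B1
  fb.onReadyInDisplayOrder(1, 2); // P2

  assert(fb.receiveFrame(pts) == 0 && pts == 0);
  assert(fb.receiveFrame(pts) == 0 && pts == 1);
  assert(fb.receiveFrame(pts) == 0 && pts == 2);
  assert(fb.receiveFrame(pts) == kEagain); // buffer drained
}
```

The linear scan for the lowest pts keeps the sketch short; with only a handful of slots in flight it is probably cheap enough, though an ordered container would work as well.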