We want to support this user-facing API:

``` python
decoder = VideoDecoder(
    "vid.mp4",
    transforms=[
        # (first transform elided)
        torchvision.transforms.v2.Resize(
            width=640,
            height=480,
        ),
        torchvision.transforms.v2.RandomCrop(
            width=32,
            height=32,
        ),
    ]
)
```

What the user is asking for, in English:

 1. I want to decode frames from the file `"vid.mp4"`.
 2. I want each decoded frame to pass through the following
    transforms:
    a. Add or remove frames as necessary to ensure a constant 30 frames
       per second.
    b. Resize each frame to 640x480.
    c. Take a random 32x32 crop of each frame.

These three transforms are instructive, as they force us to consider:

 2. Transforms that involve randomness. The main question we need to resolve
    is when the random value is fixed: once, upon Python `VideoDecoder`
    creation, or anew for each decoded frame? I made the call above that it
    should be once, upon `VideoDecoder` creation, but we need to make sure
    that lines up with what we think users will want.
 3. Transforms that are supported by FFmpeg but not supported by
    TorchVision.
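
To make the randomness question in point 2 concrete, here is a minimal, dependency-free sketch of the "resolve once, at decoder creation" policy. `RandomCropSpec` and `resolve_transform_params` are invented names for illustration, not TorchCodec or TorchVision APIs.

``` python
# Hypothetical sketch: fix all random values once, at decoder-creation
# time, so every decoded frame sees the same crop offsets.
import random


class RandomCropSpec:
    """Stand-in for torchvision.transforms.v2.RandomCrop."""

    def __init__(self, width, height):
        self.width = width
        self.height = height


def resolve_transform_params(transforms, frame_width, frame_height, seed=None):
    """Sample random parameters once, at VideoDecoder-creation time.

    Per-frame randomness would instead re-run this sampling for every
    decoded frame.
    """
    rng = random.Random(seed)
    resolved = []
    for t in transforms:
        if isinstance(t, RandomCropSpec):
            x = rng.randrange(frame_width - t.width + 1)
            y = rng.randrange(frame_height - t.height + 1)
            # FFmpeg's crop filter takes out_w:out_h:x:y.
            resolved.append(f"crop={t.width}:{t.height}:{x}:{y}")
    return resolved
```

Under this policy, calling the decoder twice with the same inputs and seed yields identical crop strings; the alternative policy would re-sample inside the per-frame decode path instead.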
First let's consider implementing the "easy" case of Resize.

 1. We add an optional `transforms` parameter to the initialization of
    `VideoDecoder`. It is a sequence of TorchVision Transforms.
 2. During `VideoDecoder` object creation, we walk the list, capturing two
    pieces of information:
    a. The transform name that the C++ layer will understand. (We will
       have to decide if we want to just use the FFmpeg filter name
       here, the fully resolved Transform name, or introduce a new
       naming layer.)
    b. The parameters in a format that the C++ layer will understand. We
       obtain them by calling `make_params()` on the Transform object.
 3. We add an optional `transforms` parameter to `core.add_video_stream()`.
    This parameter will be a vector, but whether the vector contains strings,
    tensors, or some combination of them is TBD.
 4. The `custom_ops.cpp` and `pybind_ops.cpp` layer is responsible for turning
    the values passed from the Python layer into transform objects that the
    C++ layer knows about. We will have one class per transform we support.
    Each class will have:
    c. A virtual member function that knows how to produce a string that
       can be passed to FFmpeg's filtergraph.
 5. We add a vector of such transforms to
    `SingleStreamDecoder::addVideoStream`. We store the vector as a field in
    `SingleStreamDecoder`.
 6. We need to reconcile `FilterGraph`, `FiltersContext` and this vector of
    transforms. They are all related, but it's not clear to me what the
    exact relationship should be.
 7. The actual string we pass to FFmpeg's filtergraph comes from calling
    the virtual member function on each transform object.

For the transforms that do not exist in TorchVision, we can build on the above:

 1. We define a new module, `torchcodec.decoders.transforms`.
 2. All transforms we define in there inherit from
    `torchvision.transforms.v2.Transform`.
 3. We implement the minimum needed to hook the new transforms into the
    machinery defined above.

Open questions:

 1. Is `torchcodec.transforms` the right namespace?
 2. For random transforms, when should the value be fixed?
 3. Transforms such as Resize don't actually implement a `make_params()`
    method. How does TorchVision get their parameters? How will TorchCodec?
 4. How do we communicate the transform names and parameters to the C++
    layer? We need to support transforms with an arbitrary number of
    parameters.
 5. How does this generalize to `AudioDecoder`? Ideally we would be able to
    support TorchAudio's transforms in a similar way.
 6. What is the relationship between the C++ transform objects, `FilterGraph`,
    and `FiltersContext`?