We want to support this user-facing API:

What the user is asking for, in English:

1. I want to decode frames from the file `"vid.mp4"`.
2. For each decoded frame, I want each frame to pass through the following
   transforms:
   a. Add or remove frames as necessary to ensure a constant 30 frames
      per second.
   b. Resize the frame to 640x480. Use the algorithm that is
      TorchVision's default.
   c. Inside the resized frame, crop the image to 32x32. The x and y
      coordinates are chosen randomly upon the creation of the Python
      `VideoDecoder` object. All decoded frames use the same values for x
      and y.

These three transforms are instructive, as they force us to consider:

1. How "easy" TorchVision transforms will be handled, where all values are
   static. Resize is such an example.
2. Transforms that involve randomness. The main question we need to resolve
   is when the random value is resolved. I think this comes down to: once
   upon Python `VideoDecoder` creation, or different for each frame decoded?
   I made the call above that it should be once upon Python `VideoDecoder`
   creation, but we need to make sure that lines up with what we think
   users will want.
3. Transforms that are supported by FFmpeg but not supported by
   TorchVision. In particular, FPS is something that multiple users have
   asked for.
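The randomness question in item 2 can be made concrete with a small self-contained sketch (all names here are hypothetical stand-ins, not the actual torchcodec API): if the crop origin is sampled once at decoder creation, every decoded frame is cropped at the same (x, y), which is the behavior chosen above.

```python
import random

class RandomCrop:
    """Hypothetical stand-in for a random-crop transform spec."""
    def __init__(self, width, height):
        self.width, self.height = width, height

class Decoder:
    """Sketch of 'resolve randomness once at creation': the crop origin is
    sampled in __init__ and reused for every frame thereafter."""
    def __init__(self, crop, frame_width=640, frame_height=480, seed=None):
        rng = random.Random(seed)
        self.x = rng.randrange(0, frame_width - crop.width)
        self.y = rng.randrange(0, frame_height - crop.height)

    def crop_origin_for_frame(self, frame_index):
        # Same (x, y) regardless of which frame is being decoded.
        return (self.x, self.y)

d = Decoder(RandomCrop(32, 32), seed=0)
print(d.crop_origin_for_frame(0) == d.crop_origin_for_frame(99))  # True
```

The per-frame alternative would instead sample inside `crop_origin_for_frame`, giving each decoded frame a different crop window.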

First let's consider implementing the "easy" case of Resize.

1. We add an optional `transforms` parameter to the initialization of
   `VideoDecoder`. It is a sequence of TorchVision Transforms.
2. During `VideoDecoder` object creation, we walk the list, capturing two
   pieces of information:
   a. The transform name that the C++ layer will understand. (We will
      have to decide if we want to just use the FFmpeg filter name
      here, the fully resolved Transform name, or introduce a new
      naming layer.)
   b. The parameters in a format that the C++ layer will understand. We
      obtain them by calling `make_params()` on the Transform object.
3. We add an optional transforms parameter to `core.add_video_stream()`. This
   parameter will be a vector, but whether the vector contains strings,
   tensors, or some combination of them is TBD.
4. The `custom_ops.cpp` and `pybind_ops.cpp` layer is responsible for turning
   the values passed from the Python layer into transform objects that the
   C++ layer knows about. We will have one class per transform we support.
   Each class will have:
   a. A name which matches the FFmpeg filter name.
   b. One member for each supported parameter.
   c. A virtual member function that knows how to produce a string that
      can be passed to FFmpeg's filtergraph.
5. We add a vector of such transforms to
   `SingleStreamDecoder::addVideoStream`. We store the vector as a field in
   `SingleStreamDecoder`.
6. We need to reconcile `FilterGraph`, `FiltersContext` and this vector of
   transforms. They are all related, but it's not clear to me what the
   exact relationship should be.
7. The actual string we pass to FFmpeg's filtergraph comes from calling
   the virtual member function on each transform object.
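Steps 4 and 7 can be sketched in miniature (in Python rather than C++, purely for illustration; the class and method names are assumptions, not the real TorchCodec design): each transform object carries an FFmpeg filter name plus its parameters, and the filtergraph string is the comma-joined chain of per-transform strings.

```python
class ResizeFilter:
    """Models the per-transform C++ object from step 4: a name matching the
    FFmpeg filter name, one member per parameter, and a method producing a
    filtergraph fragment (the 'virtual member function' in step 4c)."""
    name = "scale"  # FFmpeg's resize filter is called "scale"

    def __init__(self, width, height):
        self.width, self.height = width, height

    def to_filter_string(self):
        return f"{self.name}={self.width}:{self.height}"

class CropFilter:
    name = "crop"  # FFmpeg crop filter: crop=w:h:x:y

    def __init__(self, width, height, x, y):
        self.width, self.height, self.x, self.y = width, height, x, y

    def to_filter_string(self):
        return f"{self.name}={self.width}:{self.height}:{self.x}:{self.y}"

def build_filtergraph_spec(transforms):
    # Step 7: the full filtergraph string is the comma-joined chain.
    return ",".join(t.to_filter_string() for t in transforms)

spec = build_filtergraph_spec([ResizeFilter(640, 480), CropFilter(32, 32, 100, 50)])
print(spec)  # scale=640:480,crop=32:32:100:50
```

The `scale=w:h` and `crop=w:h:x:y` spellings are FFmpeg's actual filter syntax; everything else here is a hypothetical shape for the C++ classes.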

For the transforms that do not exist in TorchVision, we can build on the above:

1. We define a new module, `torchcodec.decoders.transforms`.
2. All transforms we define in there inherit from
   `torchvision.transforms.v2.Transform`.
3. We implement the minimum needed to hook the new transforms into the
   machinery defined above.
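As a concrete instance of such a TorchCodec-only transform, an FPS transform could look roughly like the sketch below. In the real implementation it would inherit from `torchvision.transforms.v2.Transform`; here a plain class stands in so the sketch is self-contained, and the hook method name is an assumption.

```python
class FPS:
    """Hypothetical TorchCodec-only transform: resample the stream to a
    constant frame rate. It has no tensor-level behavior of its own; it
    exists to carry parameters down to the C++/FFmpeg layer."""

    def __init__(self, fps):
        self.fps = fps

    # Hypothetical hook the decoder machinery would call to obtain the
    # FFmpeg filtergraph fragment ("fps" is FFmpeg's real filter name).
    def _ffmpeg_filter_spec(self):
        return f"fps={self.fps}"

print(FPS(30)._ffmpeg_filter_spec())  # fps=30
```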

Open questions:

1. Is `torchcodec.transforms` the right namespace?
2. For random transforms, when should the value be fixed?
3. Transforms such as Resize don't actually implement a `make_params()`
   method. How does TorchVision get their parameters? How will TorchCodec?
4. How do we communicate the transform names and parameters to the C++
   layer? We need to support transforms with an arbitrary number of parameters.
5. How does this generalize to `AudioDecoder`? Ideally we would be able to
   support TorchAudio's transforms in a similar way.
6. What is the relationship between the C++ transform objects, `FilterGraph`,
   and `FiltersContext`?