decoder_native_transforms.md

We want to support this user-facing API:

What the user is asking for, in English:

1. I want to decode frames from the file `"vid.mp4"`.
2. For each decoded frame, I want the frame to pass through the following
   transforms:
   a. Add or remove frames as necessary to ensure a constant 30 frames
      per second.
   b. Resize the frame to 640x480, using TorchVision's default resize
      algorithm.
   c. Inside the resized frame, crop the image to 32x32. The x and y
      coordinates are chosen randomly upon creation of the Python
      `VideoDecoder` object. All decoded frames use the same values for x
      and y.

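A minimal sketch of what that user-facing call might look like. All names here are hypothetical (`FPS` in particular does not exist in TorchVision), and the dataclasses below are only stand-ins for the real TorchCodec/TorchVision classes:

```python
from dataclasses import dataclass

# Stand-in stubs; the real classes would come from torchcodec.decoders and
# torchvision.transforms.v2. All names here are hypothetical.
@dataclass
class FPS:             # hypothetical TorchCodec-defined transform
    frames_per_second: int

@dataclass
class Resize:          # stands in for torchvision.transforms.v2.Resize
    size: tuple

@dataclass
class RandomCrop:      # stands in for torchvision.transforms.v2.RandomCrop
    size: tuple

@dataclass
class VideoDecoder:    # stands in for torchcodec.decoders.VideoDecoder
    path: str
    transforms: list

# The user's request, expressed through the proposed API:
decoder = VideoDecoder(
    "vid.mp4",
    transforms=[FPS(30), Resize((480, 640)), RandomCrop((32, 32))],
)
```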
These three transforms are instructive, as they force us to consider:

1. How "easy" TorchVision transforms will be handled, where all values are
   static. Resize is such an example.
2. Transforms that involve randomness. The main question we need to resolve
   is when the random value is fixed: once upon Python `VideoDecoder`
   creation, or anew for each decoded frame? I made the call above that it
   should be once upon `VideoDecoder` creation, but we need to make sure
   that lines up with what users will want.
3. Transforms that are supported by FFmpeg but not by TorchVision. In
   particular, FPS is something that multiple users have asked for.

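To make consideration 2 concrete, here is a small sketch (a hypothetical class, not the proposed API) of resolving randomness once at decoder creation: the crop offsets are drawn in `__init__`, and every frame decoded afterwards reuses them.

```python
import random

class RandomCropAtInit:
    """Crop offsets are drawn once, at construction time; every frame
    decoded afterwards uses the same x and y."""

    def __init__(self, crop_w, crop_h, frame_w, frame_h, seed=None):
        rng = random.Random(seed)
        self.crop_w, self.crop_h = crop_w, crop_h
        # Resolved once, here -- not per decoded frame.
        self.x = rng.randint(0, frame_w - crop_w)
        self.y = rng.randint(0, frame_h - crop_h)

    def to_filter(self):
        # FFmpeg's crop filter takes w:h:x:y.
        return f"crop={self.crop_w}:{self.crop_h}:{self.x}:{self.y}"

crop = RandomCropAtInit(32, 32, frame_w=640, frame_h=480, seed=0)
# The filter string is identical for every frame of this decoder instance:
assert crop.to_filter() == crop.to_filter()
```

The alternative policy (new offsets per frame) would instead draw from the RNG inside the per-frame decode path; the class above is just the "once at creation" option made explicit.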
First let's consider implementing the "easy" case of Resize.

1. We add an optional `transforms` parameter to the initialization of
   `VideoDecoder`. It is a sequence of TorchVision transforms.
2. During `VideoDecoder` object creation, we walk the list, capturing two
   pieces of information:
   a. The transform name that the C++ layer will understand. (We will
      have to decide whether to use the FFmpeg filter name here, the
      fully resolved transform name, or introduce a new naming layer.)
   b. The parameters in a format that the C++ layer will understand. We
      obtain them by calling `make_params()` on the transform object.
3. We add an optional `transforms` parameter to `core.add_video_stream()`.
   This parameter will be a vector, but whether the vector contains
   strings, tensors, or some combination of them is TBD.
4. The `custom_ops.cpp` and `pybind_ops.cpp` layer is responsible for
   turning the values passed from the Python layer into transform objects
   that the C++ layer knows about. We will have one class per transform we
   support. Each class will have:
   a. A name which matches the FFmpeg filter name.
   b. One member for each supported parameter.
   c. A virtual member function that knows how to produce a string that
      can be passed to FFmpeg's filtergraph.
5. We add a vector of such transforms to
   `SingleStreamDecoder::addVideoStream`. We store the vector as a field
   in `SingleStreamDecoder`.
6. We need to reconcile `FilterGraph`, `FiltersContext`, and this vector
   of transforms. They are all related, but it's not clear to me what the
   exact relationship should be.
7. The actual string we pass to FFmpeg's filtergraph comes from calling
   the virtual member function on each transform object.

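The transform classes in step 4 are C++, but their shape can be sketched in Python: one class per supported transform, a name matching the FFmpeg filter, one member per parameter, and a method that renders the filtergraph fragment (step 7). Class and method names below are illustrative only:

```python
class ResizeTransform:
    name = "scale"  # matches the FFmpeg filter name

    def __init__(self, width, height):
        # One member per supported parameter.
        self.width = width
        self.height = height

    def to_filter(self):
        # Renders the fragment FFmpeg's filtergraph expects, e.g. scale=640:480.
        # In the C++ design this is the virtual member function of step 4c.
        return f"{self.name}={self.width}:{self.height}"

def filtergraph_string(transforms):
    # Step 7: FFmpeg separates filters in a chain with commas.
    return ",".join(t.to_filter() for t in transforms)

print(filtergraph_string([ResizeTransform(640, 480)]))  # scale=640:480
```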
For the transforms that do not exist in TorchVision, we can build on the above:

1. We define a new module, `torchcodec.decoders.transforms`.
2. All transforms we define there inherit from
   `torchvision.transforms.v2.Transform`.
3. We implement the minimum needed to hook the new transforms into the
   machinery defined above.

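A sketch of what such a TorchCodec-defined transform might look like, using FPS as the example. The base class here is only a stand-in for `torchvision.transforms.v2.Transform`, and the `make_params()` signature is simplified relative to the real one:

```python
class Transform:
    """Stand-in for torchvision.transforms.v2.Transform (simplified)."""
    def make_params(self):
        return {}

class FPS(Transform):
    """Hypothetical torchcodec.decoders.transforms.FPS; would map to
    FFmpeg's fps filter."""

    def __init__(self, frames_per_second):
        self.frames_per_second = frames_per_second

    def make_params(self):
        # Parameters in a form the C++ layer could consume.
        return {"fps": self.frames_per_second}

assert FPS(30).make_params() == {"fps": 30}
```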
Open questions:

1. Is `torchcodec.transforms` the right namespace?
2. For random transforms, when should the random value be fixed?
3. Transforms such as Resize don't actually implement a `make_params()`
   method. How does TorchVision get their parameters? How will TorchCodec?
4. How do we communicate the transform names and parameters to the C++
   layer? We need to support transforms with an arbitrary number of
   parameters.
5. How does this generalize to `AudioDecoder`? Ideally we would be able to
   support TorchAudio's transforms in a similar way.
6. What is the relationship between the C++ transform objects,
   `FilterGraph`, and `FiltersContext`?
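For open question 4, one possible encoding (purely illustrative; the real choice of strings vs. tensors is TBD per step 3 above) is one string per transform, with the name followed by its key=value parameters, which handles an arbitrary number of parameters for free:

```python
def encode_transform(name, **params):
    # One string per transform: "name, key=value, ...". Only an
    # illustration of the shape such an encoding could take.
    parts = [name] + [f"{k}={v}" for k, v in params.items()]
    return ", ".join(parts)

specs = [
    encode_transform("fps", fps=30),
    encode_transform("scale", width=640, height=480),
    encode_transform("crop", w=32, h=32, x=100, y=50),
]
print(specs[1])  # scale, width=640, height=480
```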
