Commit 567bdc5
Initial design draft
1 parent ee77f57 commit 567bdc5

decoder_native_transforms.md (+99 -0)

We want to support this user-facing API:

```python
decoder = VideoDecoder(
    "vid.mp4",
    transforms=[
        torchcodec.transforms.FPS(
            fps=30,
        ),
        torchvision.transforms.v2.Resize(
            size=(480, 640),  # (height, width)
        ),
        torchvision.transforms.v2.RandomCrop(
            size=(32, 32),
        ),
    ],
)
```

What the user is asking for, in English:

1. I want to decode frames from the file "vid.mp4".
2. I want each decoded frame to pass through the following transforms:
   a. Add or remove frames as necessary to ensure a constant 30 frames
      per second.
   b. Resize the frame to 640x480, using TorchVision's default
      algorithm.
   c. Inside the resized frame, crop the image to 32x32. The x and y
      coordinates are chosen randomly upon the creation of the Python
      VideoDecoder object. All decoded frames use the same values for x
      and y.

These three transforms are instructive, as they force us to consider:

1. How "easy" TorchVision transforms will be handled, where all values
   are static. Resize is such an example.
2. Transforms that involve randomness. The main question we need to
   resolve is when the random value is fixed: once upon Python
   VideoDecoder creation, or anew for each decoded frame? I made the
   call above that it should be once upon VideoDecoder creation (a
   sketch follows this list), but we need to make sure that lines up
   with what we think users will want.
3. Transforms that are supported by FFmpeg but not by TorchVision. In
   particular, FPS is something that multiple users have asked for.

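To make the randomness question in point 2 concrete, here is a minimal
sketch of the once-at-creation behavior. It assumes a recent TorchVision
where `Transform.make_params()` is public (the plan below relies on it);
the freezing logic itself is illustrative, not the actual implementation.

```python
import torch
from torchvision.transforms import v2

crop = v2.RandomCrop(size=(32, 32))

# make_params() draws the random values (for RandomCrop, the crop
# location). Calling it once here, at decoder-creation time, and caching
# the result pins x and y for the decoder's lifetime; calling it per
# frame would re-randomize the crop location on every decoded frame.
post_resize_frame = torch.empty(3, 480, 640)  # (C, H, W) after Resize
frozen_params = crop.make_params([post_resize_frame])
print(frozen_params)  # includes the randomly chosen top/left corner
```
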
First let's consider implementing the "easy" case of Resize.

1. We add an optional `transforms` parameter to the initialization of
   VideoDecoder. It is a sequence of TorchVision Transforms.
2. During VideoDecoder object creation, we walk the list, capturing two
   pieces of information:
   a. The transform name that the C++ layer will understand. (We will
      have to decide whether to use the FFmpeg filter name here, the
      fully resolved Transform name, or introduce a new naming layer.)
   b. The parameters, in a format that the C++ layer will understand.
      We obtain them by calling `make_params()` on the Transform object.
3. We add an optional transforms parameter to core.add_video_stream().
   This parameter will be a vector, but whether the vector contains
   strings, tensors, or some combination of them is TBD.
4. The custom_ops.cpp and pybind_ops.cpp layer is responsible for
   turning the values passed from the Python layer into transform
   objects that the C++ layer knows about. We will have one class per
   supported transform. Each class will have:
   a. A name which matches the FFmpeg filter name.
   b. One member for each supported parameter.
   c. A virtual member function that knows how to produce a string that
      can be passed to FFmpeg's filtergraph.
5. We add a vector of such transforms to
   SingleStreamDecoder::addVideoStream. We store the vector as a field
   in SingleStreamDecoder.
6. We need to reconcile FilterGraph, FiltersContext, and this vector of
   transforms. They are all related, but it's not clear to me what the
   exact relationship should be.
7. The actual string we pass to FFmpeg's filtergraph comes from calling
   the virtual member function on each transform object. (A sketch of
   the walk in step 2 and the string-building in step 7 follows this
   list.)

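A minimal Python sketch of steps 2 and 7. The `_FILTER_NAMES` mapping is
a hypothetical stand-in for whatever naming layer we choose, and
rendering the filtergraph string in Python is for illustration only; in
the design above that rendering happens in the C++ transform classes.

```python
from typing import Any, Sequence

from torchvision.transforms import v2

# Hypothetical mapping from Transform classes to FFmpeg filter names
# (step 2a); whether we use FFmpeg names, Transform names, or a new
# naming layer is still undecided. torchcodec.transforms.FPS would map
# to FFmpeg's "fps" filter here.
_FILTER_NAMES: dict[type, str] = {
    v2.Resize: "scale",
    v2.RandomCrop: "crop",
}


def walk_transforms(
    transforms: Sequence[v2.Transform],
) -> list[tuple[str, dict[str, Any]]]:
    """Step 2: reduce each Transform to a (filter_name, params) pair."""
    specs = []
    for t in transforms:
        name = _FILTER_NAMES[type(t)]  # KeyError means "unsupported"
        if isinstance(t, v2.Resize):
            # Resize has no randomness, so its parameters are plain
            # attributes rather than make_params() output (see the open
            # questions below). TorchVision stores size as a list.
            height, width = t.size
            params: dict[str, Any] = {"width": width, "height": height}
        else:
            params = {}  # would come from t.make_params(...); format TBD
        specs.append((name, params))
    return specs


def to_filtergraph(specs: list[tuple[str, dict[str, Any]]]) -> str:
    """Step 7, done here in Python for illustration only: each C++
    transform object would render its own filter string."""
    parts = []
    for name, params in specs:
        args = ":".join(f"{k}={v}" for k, v in params.items())
        parts.append(f"{name}={args}" if args else name)
    return ",".join(parts)


# walk + render for [v2.Resize(size=(480, 640))] yields
# "scale=width=640:height=480", a valid FFmpeg scale filter string.
```
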
For the transforms that do not exist in TorchVision, we can build on the above:

1. We define a new module, torchcodec.decoders.transforms.
2. All transforms we define in there inherit from
   torchvision.transforms.v2.Transform.
3. We implement the minimum needed to hook the new transforms into the
   machinery defined above. (A sketch of what FPS could look like
   follows this list.)

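For illustration, a minimal sketch of an FPS transform under this plan.
The class body is an assumption about what "the minimum needed" might
be, not a settled design; in particular, whether forward() should raise
or be a no-op is open.

```python
from typing import Any

from torchvision.transforms import v2


class FPS(v2.Transform):
    """Hypothetical decoder-native transform: resample the stream to a
    constant frame rate. It carries parameters for FFmpeg's fps filter
    and has no meaningful per-frame tensor logic of its own."""

    def __init__(self, fps: int):
        super().__init__()
        self.fps = fps

    def forward(self, *inputs: Any) -> Any:
        # Frame-rate conversion needs the whole stream, not a single
        # decoded frame, so eager application outside a decoder is an
        # error under this sketch.
        raise RuntimeError(
            "FPS can only be applied via VideoDecoder(transforms=...)"
        )
```
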
Open questions:

1. Is torchcodec.transforms the right namespace?
2. For random transforms, when should the value be fixed?
3. Transforms such as Resize don't actually implement a make_params()
   method. How does TorchVision get their parameters? How will
   TorchCodec? (See the sketch after this list.)
4. How do we communicate the transform names and parameters to the C++
   layer? We need to support transforms with an arbitrary number of
   parameters.
5. How does this generalize to AudioDecoder? Ideally we would be able
   to support TorchAudio's transforms in a similar way.
6. What is the relationship between the C++ transform objects,
   FilterGraph, and FiltersContext?

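On question 3, a hedged observation: for non-random transforms the v2
base class's make_params() just returns an empty dict, and the static
values live in attributes set by __init__. A sketch of reading them
directly; the attribute names are TorchVision's current ones and worth
re-verifying:

```python
from torchvision.transforms import v2

resize = v2.Resize(size=(480, 640))

# Resize draws no random values, so make_params() has nothing to report;
# its parameters are ordinary constructor-set attributes instead.
print(resize.size)           # [480, 640] (canonicalized to a list)
print(resize.interpolation)  # InterpolationMode.BILINEAR by default
```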
