# Decoder Native Transforms

## API

We want to support this user-facing API:

```python
decoder = VideoDecoder(
    "vid.mp4",
    transforms=[
        torchcodec.transforms.FPS(
            fps=30,
        ),
        torchvision.transforms.v2.Resize(
            size=(640, 480)
        ),
        torchvision.transforms.v2.RandomCrop(
            size=(32, 32)
        ),
    ]
)
```

What the user is asking for, in English:

1. I want to decode frames from the file `"vid.mp4"`.
2. I want each decoded frame to pass through the following transforms:
   1. Add or remove frames as necessary to ensure a constant 30 frames per
      second (a toy sketch of this resampling follows the list).
   2. Resize the frame to 640x480, using TorchVision's default resizing
      algorithm.
   3. Inside the resized frame, crop the image to 32x32. The x and y
      coordinates are chosen randomly when the Python `VideoDecoder` object
      is created. All decoded frames use the same values for x and y.
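
To make the `FPS` transform in step 2.1 concrete, here is a toy sketch of
constant-rate resampling over a list of presentation timestamps. It is purely
illustrative and none of it is TorchCodec code; in practice FFmpeg's `fps`
filter does this work.

```python
# Toy illustration only: resample a list of source presentation timestamps
# (in seconds) to a constant output rate by repeating or dropping frames.
def resample_to_constant_fps(src_pts: list[float], fps: float = 30.0) -> list[float]:
    out, t, i = [], src_pts[0], 0
    while t <= src_pts[-1]:
        # Advance to the latest source frame at or before the output tick.
        while i + 1 < len(src_pts) and src_pts[i + 1] <= t:
            i += 1
        out.append(src_pts[i])  # this source frame is emitted at time t
        t += 1.0 / fps
    return out

# A ~15 fps source resampled to 30 fps duplicates each frame:
# resample_to_constant_fps([0.0, 0.066, 0.133], fps=30)
# -> [0.0, 0.0, 0.066, 0.066, 0.133]
```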

## Design Considerations

These three transforms are instructive, as they force us to consider:

1. How "easy" TorchVision transforms will be handled, where all values are
   static. `Resize` is such an example.
2. Transforms that involve randomness. The main question is when the random
   value gets fixed: once upon Python `VideoDecoder` creation, or separately
   for each decoded frame? I made the call above that it should be once upon
   `VideoDecoder` creation (see the sketch after this list), but we need to
   make sure that lines up with what we think users will want.
3. Transforms that are supported by FFmpeg but not supported by TorchVision.
   In particular, FPS is something that multiple users have asked for.
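
As a minimal sketch of the "once upon creation" choice in point 2, assuming
we can read the frame dimensions from stream metadata at construction time,
the decoder could sample crop parameters a single time and reuse them for
every frame. The helper below is hypothetical.

```python
import torch

# Hypothetical helper: sample RandomCrop-style parameters exactly once,
# at VideoDecoder creation, from the stream's frame dimensions.
def freeze_crop_params(frame_height: int, frame_width: int, size: tuple) -> dict:
    crop_h, crop_w = size
    top = int(torch.randint(0, frame_height - crop_h + 1, ()))
    left = int(torch.randint(0, frame_width - crop_w + 1, ()))
    return {"top": top, "left": left, "height": crop_h, "width": crop_w}

# Sampled once here; every decoded frame then reuses the same crop window.
params = freeze_crop_params(480, 640, size=(32, 32))
```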

## Implementation Sketch

First let's consider implementing the "easy" case of `Resize`.

1. We add an optional `transforms` parameter to the initialization of
   `VideoDecoder`. It is a sequence of TorchVision Transforms.
2. During `VideoDecoder` object creation, we walk the list, capturing two
   pieces of information (see the sketch after this list):
   1. The transform name that the C++ layer will understand. (We will have
      to decide whether to just use the FFmpeg filter name here, the fully
      resolved Transform name, or introduce a new naming layer.)
   2. The parameters in a format that the C++ layer will understand. We
      obtain them by calling `make_params()` on the Transform object.
3. We add an optional transforms parameter to `core.add_video_stream()`.
   This parameter will be a vector, but whether the vector contains strings,
   tensors, or some combination of them is TBD.
4. The `custom_ops.cpp` and `pybind_ops.cpp` layer is responsible for
   turning the values passed from the Python layer into transform objects
   that the C++ layer knows about. We will have one class per transform we
   support. Each class will have:
   1. A name which matches the FFmpeg filter name.
   2. One member for each supported parameter.
   3. A virtual member function that knows how to produce a string that can
      be passed to FFmpeg's filtergraph.

   > **Review comment:** Sounds reasonable to stick to filtergraph for this,
   > but I wonder how this affects our current use of libswscale for
   > color-conversion? Maybe we'll have to declare that any use of the […]
   >
   > **Reply:** I think we can do a bit of both. Some transforms (such as
   > […]) We can also start with just filtergraph, and do libswscale as an
   > optimization.

5. We add a vector of such transforms to
   `SingleStreamDecoder::addVideoStream`. We store the vector as a field in
   `SingleStreamDecoder`.
6. We need to reconcile `FilterGraph`, `FiltersContext` and this vector of
   transforms. They are all related, but it's not clear to me what the
   exact relationship should be.
7. The actual string we pass to FFmpeg's filtergraph comes from calling
   the virtual member function on each transform object.
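
A minimal Python sketch of steps 2, 4 and 7 together (the real per-transform
classes would live in C++). All names here, including `to_filter_string()`,
are hypothetical illustrations of the design's shape, not a committed API.

```python
from dataclasses import dataclass

# Hypothetical per-transform specs mirroring the C++ classes in step 4:
# a name matching the FFmpeg filter, one member per parameter, and a
# method rendering the filtergraph fragment (step 4.3).
@dataclass
class ResizeSpec:
    width: int
    height: int
    name: str = "scale"  # FFmpeg filter name

    def to_filter_string(self) -> str:
        return f"{self.name}={self.width}:{self.height}"

@dataclass
class CropSpec:
    width: int
    height: int
    x: int
    y: int
    name: str = "crop"

    def to_filter_string(self) -> str:
        return f"{self.name}={self.width}:{self.height}:{self.x}:{self.y}"

# Step 7: the filtergraph string is just the joined fragments.
specs = [ResizeSpec(640, 480), CropSpec(32, 32, x=17, y=5)]
filtergraph = ",".join(spec.to_filter_string() for spec in specs)
# -> "scale=640:480,crop=32:32:17:5"
```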

For the transforms that do not exist in TorchVision, we can build on the
above (a sketch follows this list):

1. We define a new module, `torchcodec.decoders.transforms`.
2. All transforms we define in there inherit from
   `torchvision.transforms.v2.Transform`.

   > **Review comment:** This is a good solution for Design Consideration 3.
   > Do you foresee any issues from adding a dependency on torchvision in
   > torchcodec?
   >
   > **Reply:** Good call out. I'd like to explore if we can avoid an
   > explicit dependency and instead go a duck-typing route. We absolutely
   > want to accept TorchVision transforms. But it would be great if we
   > could avoid an explicit dependency from TorchCodec to TorchVision.

3. We implement the minimum needed to hook the new transforms into the
   machinery defined above.
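
A minimal sketch of such a transform, following the proposal above that it
subclass `torchvision.transforms.v2.Transform`. The `FPS` name comes from the
API example at the top; `to_filter_string()` is the same hypothetical hook as
in the earlier sketch, not an existing API.

```python
from torchvision.transforms import v2

class FPS(v2.Transform):
    """Decoder-native transform with no TorchVision equivalent (sketch)."""

    def __init__(self, fps: int):
        super().__init__()
        self.fps = fps

    # Hypothetical hook: the decoder never applies this transform eagerly;
    # it only asks for the FFmpeg filtergraph fragment it maps onto.
    def to_filter_string(self) -> str:
        return f"fps={self.fps}"
```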

## Open Questions

1. Is `torchcodec.transforms` the right namespace?
2. For random transforms, when should the value be fixed?

   > **Review comment:** It will probably be transform-dependent, but I
   > agree with your call that in general this will be per-video. I think
   > that ideally this would be "per-scene", but that's not in scope.

3. Transforms such as `Resize` don't actually implement a `make_params()`
   method. How does TorchVision get their parameters? How will TorchCodec?
   (See the review discussion after this list.)
4. Should the name at the bridge layer between Python and C++ just be the
   FFmpeg filter name?
5. How do we communicate the transform names and parameters to the C++
   layer? We need to support transforms with an arbitrary number of
   parameters. (One possible encoding is sketched after this list.)
6. How does this generalize to `AudioDecoder`? Ideally we would be able to
   support TorchAudio's transforms in a similar way.
7. What is the relationship between the C++ transform objects, `FilterGraph`
   and `FiltersContext`?
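
For question 5, one possible encoding (purely illustrative, not a committed
format) is a flat list of strings, one per transform, where the first token
is the filter name and the remaining tokens are key=value pairs; this
trivially supports an arbitrary number of parameters per transform.

```python
# Hypothetical wire format for core.add_video_stream(): one string per
# transform; the C++ side would split on ';' and '=' to recover parameters.
transform_specs = [
    "fps;fps=30",
    "scale;width=640;height=480",
    "crop;w=32;h=32;x=17;y=5",
]
```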

> **Review comment** (on open question 3): Three things come to mind:
>
> 1. `make_params()` doesn't exist, the relevant attributes are just
>    attributes on the Transform object (mentioned below as well). That's
>    not an issue, we can easily work around that.
> 2. `make_params()` in its current state requires the input frames to be
>    passed (`flat_inputs`). Those are expected to be materialized tensors,
>    which we definitely won't have by the time we need to call
>    `make_params()` in TorchCodec. Most of the time, all of what's actually
>    needed from `flat_inputs` are the H and W of the frames. So we should
>    be able to get around that by slightly modifying the torchvision part,
>    and accept H and W instead of `flat_inputs` as input to `make_params()`
>    (or a similar new hook). We'll probably want to keep this private for
>    the time being.
> 3. `make_params()` is often dependent on the H and W of the input frames,
>    as described above. For most streams that's not an issue because H and
>    W are constant. For streams with variable resolution, that quickly
>    becomes very hairy. Do we call `make_params()` again on the new H and
>    W? Where? We're risking sampling new random values that may lead to
>    nonsensical transforms for a given video!
>
> **Reply:** @NicolasHug, on the three things:
>
> 1. […] `if isinstance(t, Resize)`, as that means there is no uniform way
>    to discover a transform's parameters. We'll need to know that for
>    transform X it has params Y and Z.
> 2. […] `Resize`, and punt on these issues until we do `RandomCrop`.
> 3. […] `crop` actually accept expressions, not just number values:
>    https://ffmpeg.org/ffmpeg-filters.html#crop. Some are evaluated
>    per-frame, so we may be able to use those to deal with variable
>    resolution videos. For first versions, I also think it's reasonable to
>    say variable resolution videos are not fully supported.
actually accept expressions, not just number values: https://ffmpeg.org/ffmpeg-filters.html#crop. Some are evaluated per-frame, so we may be able to use those to deal with variable resolution videos. For first versions, I also think it's reasonable to say variable resolution videos are not fully supported.