decoder_native_transforms.md (new file, 103 additions)

# Decoder Native Transforms

## API
We want to support this user-facing API:

```python
decoder = VideoDecoder(
    "vid.mp4",
    transforms=[
        torchcodec.transforms.FPS(
            fps=30,
        ),
        torchvision.transforms.v2.Resize(
            size=(480, 640),
        ),
        torchvision.transforms.v2.RandomCrop(
            size=(32, 32),
        ),
    ],
)
```
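
For illustration, decoding with the transforms above might look like this. This
is a hypothetical usage sketch: it assumes the transforms are applied in the
order given and the default C, H, W frame layout, so the final crop determines
the output shape.

```python
# The frame returned by the decoder has already gone through FPS resampling,
# the resize, and the 32x32 random crop.
frame = decoder[0]
print(frame.shape)  # expected: torch.Size([3, 32, 32])
```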

What the user is asking for, in English:

1. I want to decode frames from the file `"vid.mp4"`.
2. I want each decoded frame to pass through the following transforms:
1. Add or remove frames as necessary to ensure a constant 30 frames per second.
2. Resize the frame to 640x480 using TorchVision's default resizing algorithm.
3. Inside the resized frame, crop the image to 32x32. The x and y coordinates are
chosen randomly upon the creation of the Python `VideoDecoder` object. All decoded
frames use the same values for x and y.

## Design Considerations
These three transforms are instructive, as they force us to consider:

1. How "easy" TorchVision transforms will be handled, where all values are
static. `Resize` is such an example.
2. Transforms that involve randomness. The main question we need to resolve
is when the random value is resolved. I think this comes down to: once
upon Python `VideoDecoder` creation, or different for each frame decoded?
I made the call above that it should be once upon Python `VideoDecoder`
creation, but we need to make sure that lines up with what we think
users will want.
3. Transforms that are supported by FFmpeg but not supported by
TorchVision. In particular, FPS is something that multiple users have
asked for.
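
To make consideration 2 concrete, here is a minimal illustration of what
"resolved once upon `VideoDecoder` creation" means for a random crop. This is
not part of the proposal; the class and the `resolve()` hook are hypothetical.

```python
import random

class IllustrativeRandomCrop:
    """Hypothetical stand-in for RandomCrop, showing when randomness is fixed."""

    def __init__(self, height: int, width: int):
        self.height = height
        self.width = width

    def resolve(self, frame_height: int, frame_width: int) -> dict:
        # Called exactly once, when the Python VideoDecoder is created.
        # Every frame decoded afterwards reuses the same top/left offsets.
        return {
            "top": random.randint(0, frame_height - self.height),
            "left": random.randint(0, frame_width - self.width),
        }
```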

## Implementation Sketch
First let's consider implementing the "easy" case of `Resize`.

1. We add an optional `transforms` parameter to the initialization of
`VideoDecoder`. It is a sequence of TorchVision Transforms.
2. During `VideoDecoder` object creation, we walk the list, capturing two
pieces of information:
1. The transform name that the C++ layer will understand. (We will
have to decide if we want to just use the FFmpeg filter name
here, the fully resolved Transform name, or introduce a new
naming layer.)
2. The parameters in a format that the C++ layer will understand. We
obtain them by calling `make_params()` on the Transform object.
Comment on lines +59 to +60 (Contributor):

Three things come to mind:

- For non-random transforms, `make_params()` doesn't exist; the relevant parameters are just attributes on the Transform object (mentioned below as well). That's not an issue, we can easily work around it.
- `make_params()` in its current state requires the input frames to be passed (`flat_inputs`). Those are expected to be materialized tensors, which we definitely won't have by the time we need to call `make_params()` in TorchCodec. Most of the time, all that's actually needed from `flat_inputs` is the H and W of the frames. So we should be able to get around that by slightly modifying the TorchVision side to accept H and W instead of `flat_inputs` as input to `make_params()` (or a similar new hook). We'll probably want to keep this private for the time being.
- The logic in `make_params()` often depends on the H and W of the input frames, as described above. For most streams that's not an issue because H and W are constant. For streams with variable resolution, it quickly becomes very hairy. Do we call `make_params()` again on the new H and W? Where? We risk sampling new random values that may lead to nonsensical transforms for a given video!

Reply (Contributor Author):

@NicolasHug, on the three things:

- We can work around it, but it does mean we'll probably need to do a lot of `if isinstance(t, Resize)`, since there is no uniform way to discover a transform's parameters. We'll need to know that transform X has params Y and Z.
- Noted. I think we can start the implementation with `Resize`, and punt on these issues until we do `RandomCrop`.
- FFmpeg filters such as `crop` actually accept expressions, not just numeric values: https://ffmpeg.org/ffmpeg-filters.html#crop. Some are evaluated per-frame, so we may be able to use those to deal with variable-resolution videos. For the first versions, I also think it's reasonable to say variable-resolution videos are not fully supported.

3. We add an optional `transforms` parameter to `core.add_video_stream()`. This
parameter will be a vector, but whether the vector contains strings,
tensors, or some combination of them is TBD. (A rough Python-side sketch of
steps 1 through 3 appears after this list.)
4. The `custom_ops.cpp` and `pybind_ops.cpp` layer is responsible for turning
the values passed from the Python layer into transform objects that the
C++ layer knows about. We will have one class per transform we support.
Each class will have:
1. A name which matches the FFmpeg filter name.
2. One member for each supported parameter.
3. A virtual member function that knows how to produce a string that
can be passed to FFmpeg's filtergraph.
Comment (Contributor):

Sounds reasonable to stick to filtergraph for this, but I wonder how this affects our current use of libswscale for color conversion? Maybe we'll have to declare that any use of the `transforms` parameter means filtergraph is used for color conversion, despite libswscale being faster according to our past benchmarks?

Reply (Contributor Author):

I think we can do a bit of both. Some transforms (such as `Resize`) we can support on libswscale. I assume there are many more involved transforms that are not going to work on libswscale. We can add a virtual member function to the C++ class that returns a libswscale string, if possible. If all transforms are supportable on libswscale and all of its other criteria are met, we use that. If not, we have to use filtergraph.

We can also start with just filtergraph, and do libswscale as an optimization.

5. We add a vector of such transforms to
`SingleStreamDecoder::addVideoStream`. We store the vector as a field in
`SingleStreamDecoder`.
6. We need to reconcile `FilterGraph`, `FiltersContext` and this vector of
transforms. They are all related, but it's not clear to me what the
exact relationship should be.
7. The actual string we pass to FFmpeg's filtergraph comes from calling
the virtual member function on each transform object.
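
Putting steps 1 through 3 together, a rough Python-side sketch might look like
the following. Everything here is illustrative: the helper name, the
`(name, params)` tuple format, and the use of FFmpeg filter names (`scale`,
`crop`) are placeholders, and the `make_params()` issues raised in the comments
above are glossed over.

```python
from torchvision.transforms import v2

def _make_core_transform_specs(transforms):
    """Walk the user-provided transforms and build the (name, params) pairs
    handed to core.add_video_stream(). Purely illustrative; the real bridge
    format is TBD (step 3)."""
    specs = []
    for t in transforms:
        if isinstance(t, v2.Resize):
            # Resize has no meaningful make_params(); its parameters are plain
            # attributes. Assumes size was given as (height, width).
            height, width = t.size
            specs.append(("scale", {"height": height, "width": width}))
        elif isinstance(t, v2.RandomCrop):
            # A random transform: the crop offsets would be resolved here,
            # once, via make_params() or a similar new hook.
            height, width = t.size
            specs.append(("crop", {"height": height, "width": width}))
        else:
            raise ValueError(f"Unsupported transform: {t}")
    return specs
```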

For the transforms that do not exist in TorchVision, we can build on the above:

1. We define a new module, `torchcodec.decoders.transforms`.
2. All transforms we define in there inherit from
`torchvision.transforms.v2.Transform`.
Comment (Contributor):

This is a good solution for Design Consideration 3. Do you foresee any issues from adding a dependency on torchvision in torchcodec?

Reply (Contributor Author):

Good call-out. I'd like to explore whether we can avoid an explicit dependency and instead go a duck-typing route. We absolutely want to accept TorchVision transforms, but it would be great if we could do so without an explicit dependency from TorchCodec to TorchVision.

3. We implement the minimum needed to hook the new transforms into the
machinery defined above. (A rough sketch of such a transform follows this list.)
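
As a rough illustration of points 1 and 2, a decoder-native `FPS` transform
might look something like the sketch below. This is not a settled design: the
inheritance from `torchvision.transforms.v2.Transform` (versus the duck-typing
route discussed above) and the attribute names are assumptions.

```python
from torchvision.transforms import v2

class FPS(v2.Transform):
    """Resample the stream to a constant frame rate.

    Only meaningful as a decoder-native transform: the decoder interprets it
    (for example, as FFmpeg's fps filter) rather than applying it to
    already-decoded tensors, so no tensor-level transform() is implemented.
    """

    def __init__(self, fps: float):
        super().__init__()
        self.fps = fps
```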

## Open Questions

1. Is `torchcodec.transforms` the right namespace?
2. For random transforms, when should the value be fixed?
Comment (Contributor):

It will probably be transform-dependent, but I agree with your call that in general this will be per-video. I think that ideally this would be "per-scene", but that's not in scope.

3. Transforms such as Resize don't actually implement a `make_params()`
method. How does TorchVision get their parameters? How will TorchCodec?
Comment on lines +93 to +94 (Contributor):

`make_params()` is mostly relevant for random transforms, either those that are randomly applied or those that do random sampling. `Resize` isn't random; the relevant parameters are just stored as public attributes on the Transform object.

4. Should the name at the bridge layer between Python and C++ just be the FFmpeg filter name?
5. How do we communicate the transformation names and parameters to the C++
layer? We need to support transforms with an arbitrary number of parameters.
6. How does this generalize to `AudioDecoder`? Ideally we would be able to
support TorchAudio's transforms in a similar way.
7. What is the relationship between the C++ transform objects, `FilterGraph`
and `FiltersContext`?