We want to support this user-facing API:

    decoder = VideoDecoder(
        "vid.mp4",
        transforms=[
            torchcodec.transforms.FPS(
                fps=30,
            ),
            torchvision.transforms.v2.Resize(
                size=(480, 640),
            ),
            torchvision.transforms.v2.RandomCrop(
                size=(32, 32),
            ),
        ]
    )

What the user is asking for, in English (a usage sketch follows the list):

 1. I want to decode frames from the file "vid.mp4".
 2. I want each decoded frame to pass through the following transforms:
     a. Add or remove frames as necessary to ensure a constant 30 frames
        per second.
     b. Resize the frame to 640x480, using TorchVision's default resizing
        algorithm.
     c. Inside the resized frame, crop the image to 32x32. The x and y
        coordinates are chosen randomly when the Python VideoDecoder
        object is created. All decoded frames use the same values for x
        and y.
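
To make the expected semantics concrete, here is a hypothetical usage
sketch. The `transforms` parameter and torchcodec.transforms.FPS are the
proposal, not existing API; indexing follows the existing VideoDecoder
behavior, and the shapes assume the pipeline above:

    import torchcodec  # hypothetical: torchcodec.transforms does not exist yet
    import torchvision.transforms.v2 as v2
    from torchcodec.decoders import VideoDecoder

    decoder = VideoDecoder(
        "vid.mp4",
        transforms=[
            torchcodec.transforms.FPS(fps=30),  # hypothetical transform
            v2.Resize(size=(480, 640)),
            v2.RandomCrop(size=(32, 32)),
        ],
    )

    # Every frame has been resampled to 30 fps, resized to 640x480, and
    # cropped to the same randomly chosen 32x32 window.
    first = decoder[0]
    print(first.shape)  # torch.Size([3, 32, 32])

    second = decoder[1]
    assert second.shape == first.shape  # crop window fixed at creation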

These three transforms are instructive, as they force us to consider:

 1. How "easy" TorchVision transforms will be handled, where all values
    are static. Resize is one such example.
 2. Transforms that involve randomness. The main question is when the
    random value gets fixed: once upon Python VideoDecoder creation, or
    anew for each decoded frame? I made the call above that it should be
    once upon VideoDecoder creation, but we need to make sure that lines
    up with what users will want.
 3. Transforms that are supported by FFmpeg but not by TorchVision. In
    particular, FPS is something that multiple users have asked for.

First, let's consider implementing the "easy" case of Resize:

 1. We add an optional `transforms` parameter to the initialization of
    VideoDecoder. It is a sequence of TorchVision Transforms.
 2. During VideoDecoder object creation, we walk the list, capturing two
    pieces of information (see the sketch after this list):
     a. The transform name that the C++ layer will understand. (We will
        have to decide whether to use the FFmpeg filter name here, the
        fully resolved Transform name, or to introduce a new naming
        layer.)
     b. The parameters in a format that the C++ layer will understand. We
        obtain them by calling `make_params()` on the Transform object.
 3. We add an optional transforms parameter to core.add_video_stream().
    This parameter will be a vector, but whether the vector contains
    strings, tensors, or some combination of them is TBD.
 4. The custom_ops.cpp and pybind_ops.cpp layer is responsible for turning
    the values passed from the Python layer into transform objects that
    the C++ layer knows about. We will have one class per transform we
    support. Each class will have:
     a. A name which matches the FFmpeg filter name.
     b. One member for each supported parameter.
     c. A virtual member function that knows how to produce a string that
        can be passed to FFmpeg's filtergraph.
 5. We add a vector of such transforms to
    SingleStreamDecoder::addVideoStream. We store the vector as a field in
    SingleStreamDecoder.
 6. We need to reconcile FilterGraph, FiltersContext, and this vector of
    transforms. They are all related, but it's not clear to me what the
    exact relationship should be.
 7. The actual string we pass to FFmpeg's filtergraph comes from calling
    the virtual member function on each transform object.
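
A minimal sketch of step 2, assuming a hypothetical `_convert_transforms`
helper, FFmpeg filter names as the transform names, and the newer
TorchVision v2 API where make_params() is public. The exact (name, params)
encoding is what steps 3 and 4 still need to decide:

    from typing import Any, Sequence

    import torch
    import torchvision.transforms.v2 as v2

    def _convert_transforms(
        transforms: Sequence[v2.Transform],
        frame_height: int,
        frame_width: int,
    ) -> list[tuple[str, dict[str, Any]]]:
        """Capture (name, params) pairs for the C++ layer to consume."""
        converted = []
        for transform in transforms:
            if isinstance(transform, v2.Resize):
                # Resize has no make_params(); its parameters live on the
                # object itself. TorchVision stores size as (height, width).
                height, width = transform.size
                converted.append(("scale", {"width": width, "height": height}))
            elif isinstance(transform, v2.RandomCrop):
                # Fix the random crop position once, at decoder creation
                # time, by sampling params against a dummy frame whose
                # size comes from the stream metadata.
                dummy = torch.zeros(3, frame_height, frame_width)
                params = transform.make_params([dummy])
                converted.append(("crop", dict(params)))
            else:
                raise ValueError(f"Unsupported transform: {transform}")
        return converted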

For the transforms that do not exist in TorchVision, we can build on the
above:

 1. We define a new module, torchcodec.decoders.transforms.
 2. All transforms we define in there inherit from
    torchvision.transforms.v2.Transform.
 3. We implement the minimum needed to hook the new transforms into the
    machinery defined above (see the sketch below).
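
A sketch of what the FPS transform might look like; the class body is
illustrative, and it assumes the TorchVision v2 API where transform() is
the public per-input hook:

    import torchvision.transforms.v2 as v2

    class FPS(v2.Transform):
        """Resample the video to a constant frame rate.

        This transform has no pixel-level behavior in Python; it exists so
        that VideoDecoder can translate it into FFmpeg's fps filter.
        """

        def __init__(self, fps: float):
            super().__init__()
            self.fps = fps

        def transform(self, inpt, params):
            # Frame-rate conversion only makes sense inside the decoder;
            # applying FPS eagerly to a single frame is meaningless.
            raise NotImplementedError(
                "FPS can only be used as a VideoDecoder transform"
            )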

Open questions:

 1. Is torchcodec.transforms the right namespace?
 2. For random transforms, when should the value be fixed?
 3. Transforms such as Resize don't actually implement a make_params()
    method. How does TorchVision get their parameters? How will
    TorchCodec?
 4. How do we communicate the transform names and parameters to the C++
    layer? We need to support transforms with an arbitrary number of
    parameters.
 5. How does this generalize to AudioDecoder? Ideally we would be able to
    support TorchAudio's transforms in a similar way.
 6. What is the relationship between the C++ transform objects,
    FilterGraph, and FiltersContext?