We want to support this user-facing API:

    decoder = VideoDecoder(
        "vid.mp4",
        transforms=[
            torchcodec.transforms.FPS(
                fps=30,
            ),
            torchvision.transforms.v2.Resize(
                size=(480, 640),
            ),
            torchvision.transforms.v2.RandomCrop(
                size=(32, 32),
            ),
        ]
    )

What the user is asking for, in English (a usage sketch follows the list):

 1. I want to decode frames from the file "vid.mp4".
 2. I want each decoded frame to pass through the following transforms:
     a. Add or remove frames as necessary to ensure a constant 30 frames
        per second.
     b. Resize the frame to 640x480, using TorchVision's default resizing
        algorithm.
     c. Inside the resized frame, crop the image to 32x32. The x and y
        coordinates are chosen randomly when the Python VideoDecoder
        object is created. All decoded frames use the same values for x
        and y.
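
To make the expected semantics concrete, here is a hypothetical usage
sketch. The `transforms` parameter and torchcodec.transforms.FPS are the
proposal, not existing API; indexing follows the existing VideoDecoder
behavior, and the shapes assume the pipeline above:

    import torchcodec  # hypothetical: torchcodec.transforms does not exist yet
    import torchvision.transforms.v2 as v2
    from torchcodec.decoders import VideoDecoder

    decoder = VideoDecoder(
        "vid.mp4",
        transforms=[
            torchcodec.transforms.FPS(fps=30),  # hypothetical transform
            v2.Resize(size=(480, 640)),
            v2.RandomCrop(size=(32, 32)),
        ],
    )

    # Every frame has been resampled to 30 fps, resized to 640x480, and
    # cropped to the same randomly chosen 32x32 window.
    first = decoder[0]
    print(first.shape)  # torch.Size([3, 32, 32])

    second = decoder[1]
    assert second.shape == first.shape  # crop window fixed at creation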

These three transforms are instructive, as they force us to consider:

 1. How "easy" TorchVision transforms will be handled, where all values
    are static. Resize is one such example.
 2. Transforms that involve randomness. The main question is when the
    random value gets fixed: once upon Python VideoDecoder creation, or
    anew for each decoded frame? I made the call above that it should be
    once upon VideoDecoder creation, but we need to make sure that lines
    up with what users will want.
 3. Transforms that are supported by FFmpeg but not by TorchVision. In
    particular, FPS is something that multiple users have asked for.

First, let's consider implementing the "easy" case of Resize:

 1. We add an optional `transforms` parameter to the initialization of
    VideoDecoder. It is a sequence of TorchVision Transforms.
 2. During VideoDecoder object creation, we walk the list, capturing two
    pieces of information (see the sketch after this list):
     a. The transform name that the C++ layer will understand. (We will
        have to decide whether to use the FFmpeg filter name here, the
        fully resolved Transform name, or to introduce a new naming
        layer.)
     b. The parameters in a format that the C++ layer will understand. We
        obtain them by calling `make_params()` on the Transform object.
 3. We add an optional transforms parameter to core.add_video_stream().
    This parameter will be a vector, but whether the vector contains
    strings, tensors, or some combination of them is TBD.
 4. The custom_ops.cpp and pybind_ops.cpp layer is responsible for turning
    the values passed from the Python layer into transform objects that
    the C++ layer knows about. We will have one class per transform we
    support. Each class will have:
     a. A name which matches the FFmpeg filter name.
     b. One member for each supported parameter.
     c. A virtual member function that knows how to produce a string that
        can be passed to FFmpeg's filtergraph.
 5. We add a vector of such transforms to
    SingleStreamDecoder::addVideoStream. We store the vector as a field in
    SingleStreamDecoder.
 6. We need to reconcile FilterGraph, FiltersContext, and this vector of
    transforms. They are all related, but it's not clear to me what the
    exact relationship should be.
 7. The actual string we pass to FFmpeg's filtergraph comes from calling
    the virtual member function on each transform object.
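
A minimal sketch of step 2, assuming a hypothetical `_convert_transforms`
helper, FFmpeg filter names as the transform names, and the newer
TorchVision v2 API where make_params() is public. The exact (name, params)
encoding is what steps 3 and 4 still need to decide:

    from typing import Any, Sequence

    import torch
    import torchvision.transforms.v2 as v2

    def _convert_transforms(
        transforms: Sequence[v2.Transform],
        frame_height: int,
        frame_width: int,
    ) -> list[tuple[str, dict[str, Any]]]:
        """Capture (name, params) pairs for the C++ layer to consume."""
        converted = []
        for transform in transforms:
            if isinstance(transform, v2.Resize):
                # Resize has no make_params(); its parameters live on the
                # object itself. TorchVision stores size as (height, width).
                height, width = transform.size
                converted.append(("scale", {"width": width, "height": height}))
            elif isinstance(transform, v2.RandomCrop):
                # Fix the random crop position once, at decoder creation
                # time, by sampling params against a dummy frame whose
                # size comes from the stream metadata.
                dummy = torch.zeros(3, frame_height, frame_width)
                params = transform.make_params([dummy])
                converted.append(("crop", dict(params)))
            else:
                raise ValueError(f"Unsupported transform: {transform}")
        return converted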

For the transforms that do not exist in TorchVision, we can build on the
above:

 1. We define a new module, torchcodec.decoders.transforms.
 2. All transforms we define in there inherit from
    torchvision.transforms.v2.Transform.
 3. We implement the minimum needed to hook the new transforms into the
    machinery defined above (see the sketch below).
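
A sketch of what the FPS transform might look like; the class body is
illustrative, and it assumes the TorchVision v2 API where transform() is
the public per-input hook:

    import torchvision.transforms.v2 as v2

    class FPS(v2.Transform):
        """Resample the video to a constant frame rate.

        This transform has no pixel-level behavior in Python; it exists so
        that VideoDecoder can translate it into FFmpeg's fps filter.
        """

        def __init__(self, fps: float):
            super().__init__()
            self.fps = fps

        def transform(self, inpt, params):
            # Frame-rate conversion only makes sense inside the decoder;
            # applying FPS eagerly to a single frame is meaningless.
            raise NotImplementedError(
                "FPS can only be used as a VideoDecoder transform"
            )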

Open questions:

 1. Is torchcodec.transforms the right namespace?
 2. For random transforms, when should the value be fixed?
 3. Transforms such as Resize don't actually implement a make_params()
    method. How does TorchVision get their parameters? How will
    TorchCodec?
 4. How do we communicate the transform names and parameters to the C++
    layer? We need to support transforms with an arbitrary number of
    parameters.
 5. How does this generalize to AudioDecoder? Ideally we would be able to
    support TorchAudio's transforms in a similar way.
 6. What is the relationship between the C++ transform objects,
    FilterGraph, and FiltersContext?