
Conversation

@scotts scotts commented Oct 15, 2025

Adds C++ implementation for the crop transform.

Design points worth discussing:

  1. I'm always setting exact=1 on the FFmpeg crop filter. Based on my reading, this behavior is consistent with torchvision.transforms.v2.functional.crop. Because we're eventually going to accept the TorchVision transform as the public API for crop, users won't be able to set this, so I think we have to be conservative.
  2. We apply the crop in RGB space. Note that in the references I generate from the ffmpeg CLI, I explicitly add format=rgb24 before the filters (see the sketch after this list). If we don't explicitly do that format conversion, the frames we get from the ffmpeg CLI will have been cropped in YUV space and we won't get tensor equality with TorchCodec or TorchVision's transforms. I think it would actually be faster to crop first, as then the colorspace conversion needs to do less work. But I think our ideal correctness comparison should be the result we get from passing an unchanged decoded frame to a TorchVision transform.
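
For reference, the CLI command I use to generate the crop references has roughly this shape; the paths and crop values below are placeholders, not the exact command in generate_reference_resources.py:

    import subprocess

    input_path = "input.mp4"    # placeholder; the real script uses our test videos
    output_path = "output.bmp"  # placeholder reference file name

    cmd = [
        "ffmpeg",
        "-i", input_path,
        # Convert to RGB *before* cropping so the reference matches what
        # TorchCodec and TorchVision produce; exact=1 matches the C++ implementation.
        "-vf", "format=rgb24,crop=300:200:50:35:exact=1",
        "-frames:v", "1",
        output_path,
    ]
    subprocess.run(cmd, check=True)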

Implementation details:

  1. I had to add TorchVision to our test environments.
  2. I had to add Transform::validate(const StreamMetadata&) because sometimes we need to validate transforms against the stream itself. For crop, we need to validate that the (x, y) coordinates are within the original frame's dimensions. (We could let FFmpeg throw an error instead, but that's not a great experience for our users: the error would only surface at decode time, and the message FFmpeg produces is not going to be understandable to them.) I wavered on what validate() should accept to do the validation, and I realized it's probably best for it to be StreamMetadata. If in the future what we need to validate against is not in the metadata, we should add it. I'd rather pass that around than the actual AVStream, and we might need to check against more than just the dimensions in the future.
  3. I'm adding a new kind of reference file, which has the FFmpeg filter embedded in the file name. We can use a different name if we want, but it seemed simple to use the filter expression itself. It looks like Windows does not like these filenames, so we'll have to do something else.
  4. I added a new test file, test_transform_ops.py, and moved all of the existing resize (scale) tests in there as well (a sketch of what these tests look like is below). My rule is that any test of the core API which sets transform_specs goes in the new test file. While test_ops.py is getting kinda long, my main motivation is that this will simplify internal testing. Internally, we want to limit the places where TorchCodec and TorchVision must be used at the same time due to FFmpeg version issues.
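
For concreteness, a sketch of what a test in test_transform_ops.py looks like. This is not the literal test code in this PR: the core-op import path and the create_from_file / get_next_frame helpers are my shorthand here, and the square crop is just to sidestep the argument-ordering discussion below.

    from torchvision.transforms.v2 import functional as F

    # Assumed import path for the core ops; adjust to however test_ops.py imports them.
    from torchcodec._core import add_video_stream, create_from_file, get_next_frame

    from .utils import NASA_VIDEO, assert_frames_equal


    def test_crop_matches_torchvision():
        # Decoder-native crop: a 100x100 crop whose top-left corner is at (x=50, y=50).
        cropped_decoder = create_from_file(str(NASA_VIDEO.path))
        add_video_stream(cropped_decoder, transform_specs="crop, 100, 100, 50, 50")
        cropped_frame, *_ = get_next_frame(cropped_decoder)  # assumes (frame, pts, duration)

        # The same frame decoded untouched, then cropped by TorchVision.
        full_decoder = create_from_file(str(NASA_VIDEO.path))
        add_video_stream(full_decoder)
        full_frame, *_ = get_next_frame(full_decoder)

        expected = F.crop(full_frame, top=50, left=50, height=100, width=100)
        assert_frames_equal(cropped_frame, expected)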

@meta-cla meta-cla bot added the CLA Signed label on Oct 15, 2025
match="out of range",
):
add_video_stream(decoder, transform_specs="resize, 100, 1000000000000")

@scotts (Contributor Author) commented:

New tests start below this line.

FrameDims::FrameDims(int height, int width) : height(height), width(width) {
TORCH_CHECK(height > 0, "FrameDims.height must be > 0, got: ", height);
TORCH_CHECK(width > 0, "FrameDims.width must be > 0, got: ", width);
}
@scotts (Contributor Author) commented:

Drive-by: when passing in a height and width, we should only be able to instantiate a FrameDims object with positive values. If we want a FrameDims object that has 0 for both values, that's just the default constructor. We should never have a FrameDims object with negative values.

@scotts scotts marked this pull request as ready for review October 15, 2025 20:06
@NicolasHug NicolasHug (Contributor) left a comment

Thanks for the PR @scotts! I left a few comments below; this looks good.


On this design discussion point:

  2. We apply the crop in RGB space. Note that in the references I generate from the ffmpeg CLI, I explicitly add format=rgb24 before the filters. If we don't explicitly do that format conversion, the frames we get from the ffmpeg CLI will have been cropped in YUV space and we won't get tensor equality with TorchCodec or TorchVision's transforms. I think it would actually be faster to crop first, as then the colorspace conversion needs to do less work. But I think our ideal correctness comparison should be the result we get from passing an unchanged decoded frame to a TorchVision transform.

I don't remember having an explicit discussion about our inclusion criteria, so let me share my personal thoughts on this. When implementing a new TC transform, I think we should consider these two questions:

  • A. Perf: Is the TC transform faster than the TV transform? We want the answer to be yes. If the answer is no then it's not worth implementing in TC. Users can just use TV.
  • B. Correctness: Can a model tell the difference between a TV transform and its TC counterpart? We want the answer to be no. We know for example that for resize, we'll have to be careful about this. The easiest way to ensure that is to check that the TC output is exactly equal to the TV output - that's what this PR is doing, and this guarantees correctness with 100% certainty. In general though:
    • we won't always be able to check strict equality with TV.
    • strict equality with TV is a sufficient, but not always necessary, condition to validate correctness. For example, the PIL and Tensor backends of TV's resize differ slightly in their output (usually within a 1-pixel difference for bilinear mode, and a bit more for bicubic mode), but they're still indistinguishable to models. I suspect this will also be the case between TV's resize and TC's resize. And maybe it's OK for crop to have slightly different output results, as long as models can't tell the difference.

Back to our discussion point now: should we apply crop in YUV or in RGB?

I could be wrong but I think applying the crop in RGB will have the exact same perf as TV's crop. And if that's the case then we're not meeting criterion A. I think it's worth benchmarking this a bit, and I can help with that if needed.

We know that applying crop in RGB meets criterion B, so if we choose to apply it in YUV we'll have to check that it's still correct. We won't have strict equality with TV, but as I mentioned, this won't always be possible (or necessary!) anyway. Maybe we can have PSNR checks with very strict tolerance. And we should try to validate filtergraph's algorithms (e.g. see if there's any kind of fancy filtering going on, which might be a problem) - that's also something I can help with.
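
A minimal sketch of what such a PSNR check could look like; the helper name and the threshold are illustrative, not part of this PR:

    import torch

    def psnr(a: torch.Tensor, b: torch.Tensor, max_val: float = 255.0) -> float:
        # Peak signal-to-noise ratio between two uint8 frames; higher means closer.
        mse = torch.mean((a.float() - b.float()) ** 2)
        return float(10 * torch.log10(max_val**2 / mse))

    # e.g. assert psnr(tc_cropped_frame, tv_cropped_frame) > 50  # "very strict tolerance"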

Comment on lines 219 to 222
// Where "crop" is the string literal and <width>, <height>, <x> and <y> are
// positive integers. Note that in this spec, we are following the
// filtergraph convention of (width, height). This makes it easier to compare it
// against actual filtergraph strings.
@NicolasHug (Contributor) commented:

IIUC x and y are essentially the top-left coordinates of the crop, let's make that explicit (it could also be e.g. the center coordinates of the crop).

Separately, I was hoping we could use H and W everywhere for consistency. Can you share the comparisons against filtergraph strings that are made easier by using the (W, H) order?

@scotts (Contributor Author) commented:

@NicolasHug, I'm fine with (h, w). It just means that in the tests and in the resource generation, we'll have strings that look like:

        crop_spec = "crop, 200, 300, 50, 35"             # note: h=200, w=300
        crop_filtergraph = "crop=300:200:50:35:exact=1"  # note: w=300, h=200

Comments can cover it. We'll have to deal with this awkwardness somewhere. I don't have a strong preference where, and I think it's probably good to keep all of TorchCodec in (h, w).
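
Just to illustrate the mapping, a hypothetical helper (not code in this PR; the real conversion happens when we build the filtergraph string in C++):

    def crop_spec_to_filtergraph(spec: str) -> str:
        # "crop, <h>, <w>, <x>, <y>"  ->  "crop=<w>:<h>:<x>:<y>:exact=1"
        name, h, w, x, y = [part.strip() for part in spec.split(",")]
        assert name == "crop"
        return f"crop={w}:{h}:{x}:{y}:exact=1"

    crop_spec_to_filtergraph("crop, 200, 300, 50, 35")  # -> "crop=300:200:50:35:exact=1"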

Comment on lines 77 to 78
TORCH_CHECK(x_ <= streamMetadata.width, "Crop x position out of bounds");
TORCH_CHECK(y_ <= streamMetadata.height, "Crop y position out of bounds");
@NicolasHug (Contributor) commented:

Should/can we account for H and W of the crop as well? E.g.

TORCH_CHECK(x_ + W < streamMetadata.width, "Crop x position out of bounds");

"1",
output_path,
]
print("===" + " ".join([str(x) for x in cmd]))
@NicolasHug (Contributor) commented:

Is this a debug leftover? Just checking; maybe it's intended and OK since this is just the generation util.


from .utils import assert_frames_equal, NASA_VIDEO, needs_cuda

torch._dynamo.config.capture_dynamic_output_shape_ops = True
@NicolasHug (Contributor) commented:

I suspect we don't need this line in this file for now, nor the os.environ["TORCH_LOGS"] = "output_code" above, since they relate to torch.compile tests.

@scotts scotts (Contributor Author) commented Oct 16, 2025

@NicolasHug, thanks for the discussion!

I could be wrong but I think applying the crop in RGB will have the exact same perf as TV's crop. And if that's the case then we're not meeting criterion A. I think it's worth benchmarking this a bit, and I can help with that if needed.

I actually think criterion A is too strict, as it's phrased strictly in terms of CPU performance and only considers transforms in isolation. I think we also need to consider:

  1. Overall memory usage. Even if decoder-native crop (and resize) take the same amount of time as the TorchVision transforms, they're going to use less memory overall (rough numbers after this list). That is, if we have a video with 4K frames and we use TorchVision to do the crop or resize, we're going to return a full 4K tensor. If we're cropping or resizing to something like (100, 100), then we're only going to return a (100, 100) tensor. That's quite a big difference. That's a lot less memory that FFmpeg will have to return to us in the AVFrame and much less memory that will be owned by the Python layer. I think we're efficient in getting our data into tensors, but once a tensor is owned by the Python layer, it's going to be governed by the garbage collector. If we can just return a much smaller tensor to begin with, the garbage collector in the Python layer will be under much less pressure. The fewer bytes we have to shuttle around, the better overall performance will be.
  2. Some transforms will enable others. If users want to do a pipeline that is basically [ExpensiveTransform, Crop], but Crop is not decoder-native, then they can't do ExpensiveTransform decoder-natively even if it's available.
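
Rough numbers for the memory point above, assuming 8-bit RGB frames; this is back-of-envelope arithmetic, not a benchmark:

    full_4k_bytes = 3 * 2160 * 3840        # ~24.9 MB per frame if TorchVision does the crop
    cropped_bytes = 3 * 100 * 100          # 30 KB per frame if the decoder crops to (100, 100)
    ratio = full_4k_bytes / cropped_bytes  # ~829x fewer bytes to shuttle around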

Regarding cropping in RGB versus YUV space, this is actually a general question that will apply to all transforms. What I'd like to go with for now: we default to doing the transform in RGB space, as that means we are more likely to be able to have bit-for-bit equality with TorchVision transforms. Whether our users realize it or not, I also think that's what they'll expect. If we get requests for applying the transforms in YUV space, we could provide an API to request it. (Either a decoder-level option, or some special transform which would allow users to have some of their transforms in YUV and some in RGB.)

import torch
from PIL import Image

from .utils import sanitize_filtergraph_expression
@scotts scotts (Contributor Author) commented Oct 17, 2025

Note that pulling in code from utils.py means we have to run this as a module, and since utils.py imports general TorchCodec stuff, we do a full build and install for the reference_resources.yaml workflow. I think that's a fair trade-off to ensure that the unit tests and the resource generation are using the same logic for writing and loading files.

@NicolasHug (Contributor) commented:

Do we expect to pull in more things from utils.py? If it's just for sanitize_filtergraph_expression, I would have a slight preference for just duplicating it in both files, because the python -m stuff isn't immediately obvious. But that's fine too.

@NicolasHug NicolasHug (Contributor) left a comment

Thanks @scotts, LGTM

Agreed on both points above that we also care about memory usage, and about enabling transforms that may come later in the pipeline.

I suspect doing the crop in YUV (or even potentially within the decoder itself, like for JPEG) will lead to lower memory usage and better perf overall. But if, as discussed offline, we reserve the right to change the underlying implementation of the transforms as long as we're confident it won't break models, then having the first implementation in RGB sounds great!

@scotts scotts (Contributor Author) commented Oct 17, 2025

@NicolasHug, yes, I anticipate unifying generate_reference_resources.py with utils.py more. Right now, they agree on the actual file names because they both happen to have the same string. I'd like to make generate_reference_resources.py actually use the file definitions in utils.py.

@scotts scotts merged commit c1798db into meta-pytorch:main Oct 17, 2025
64 of 65 checks passed
@scotts scotts deleted the crop_transform branch October 17, 2025 17:37
NicolasHug pushed a commit to NicolasHug/torchcodec that referenced this pull request Oct 27, 2025