
Conversation

@scotts scotts commented Oct 15, 2025

Adds C++ implementation for the crop transform.

Design points worth discussing:

  1. I'm always setting exact=1 on the FFmpeg crop filter. Based on my reading, this behavior is consistent with torchvision.transforms.v2.functional.crop. Because we're eventually going to accept the TorchVision transform as the public API for crop, users won't be able to set this, so I think we have to be conservative.
  2. We apply the crop in RGB space. Note that in the references I generate from the ffmpeg CLI, I explicitly add format=rgb24 before the filters (see the sketch after this list). If we don't explicitly do that format conversion, the frames we get from the ffmpeg CLI will have been cropped in YUV space and we won't get tensor equality with TorchCodec or TorchVision's transforms. I think it would actually be faster to crop first, as then the colorspace conversion needs to do less work. But I think our ideal correctness comparison should be the result we get from passing an unchanged decoded frame to a TorchVision transform.
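
For reference, the CLI command I use to generate the crop references has roughly this shape; the paths and crop values below are placeholders, not the exact command in generate_reference_resources.py:

    import subprocess

    input_path = "input.mp4"    # placeholder; the real script uses our test videos
    output_path = "output.bmp"  # placeholder reference file name

    cmd = [
        "ffmpeg",
        "-i", input_path,
        # Convert to RGB *before* cropping so the reference matches what
        # TorchCodec and TorchVision produce; exact=1 matches the C++ implementation.
        "-vf", "format=rgb24,crop=300:200:50:35:exact=1",
        "-frames:v", "1",
        output_path,
    ]
    subprocess.run(cmd, check=True)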

Implementation details:

  1. I had to add TorchVision to our test environments.
  2. I had to add Transform::validate(const StreamMetadata&) because sometimes we need to validate transforms against the stream itself. For crop, we need to validate that the (x, y) coordinates are within the original frame's dimensions. (We could let FFmpeg throw an error instead, but that's not a great experience for our users: the error would only surface at decode time, and the message FFmpeg produces is not going to be understandable to them.) I wavered on what validate() should accept to do the validation, and I realized it's probably best for it to be StreamMetadata. If in the future what we need to validate against is not in the metadata, we should add it. I'd rather pass that around than the actual AVStream, and we might need to check against more than just the dimensions in the future.
  3. I'm adding a new kind of reference file, which has the FFmpeg filter embedded in the file name. We can use a different name if we want, but it seemed simple to use the filter expression itself. It looks like Windows does not like these filenames, so we'll have to do something else.
  4. I added a new test file, test_transform_ops.py, and moved all of the existing resize (scale) tests in there as well (a sketch of what these tests look like is below). My rule is that any test of the core API which sets transform_specs goes in the new test file. While test_ops.py is getting kinda long, my main motivation is that this will simplify internal testing. Internally, we want to limit the places where TorchCodec and TorchVision must be used at the same time due to FFmpeg version issues.
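
For concreteness, a sketch of what a test in test_transform_ops.py looks like. This is not the literal test code in this PR: the core-op import path and the create_from_file / get_next_frame helpers are my shorthand here, and the square crop is just to sidestep the argument-ordering discussion below.

    from torchvision.transforms.v2 import functional as F

    # Assumed import path for the core ops; adjust to however test_ops.py imports them.
    from torchcodec._core import add_video_stream, create_from_file, get_next_frame

    from .utils import NASA_VIDEO, assert_frames_equal


    def test_crop_matches_torchvision():
        # Decoder-native crop: a 100x100 crop whose top-left corner is at (x=50, y=50).
        cropped_decoder = create_from_file(str(NASA_VIDEO.path))
        add_video_stream(cropped_decoder, transform_specs="crop, 100, 100, 50, 50")
        cropped_frame, *_ = get_next_frame(cropped_decoder)  # assumes (frame, pts, duration)

        # The same frame decoded untouched, then cropped by TorchVision.
        full_decoder = create_from_file(str(NASA_VIDEO.path))
        add_video_stream(full_decoder)
        full_frame, *_ = get_next_frame(full_decoder)

        expected = F.crop(full_frame, top=50, left=50, height=100, width=100)
        assert_frames_equal(cropped_frame, expected)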

@meta-cla meta-cla bot added the CLA Signed label on Oct 15, 2025
match="out of range",
):
add_video_stream(decoder, transform_specs="resize, 100, 1000000000000")

@scotts (Contributor Author) commented:

New tests start below this line.

FrameDims::FrameDims(int height, int width) : height(height), width(width) {
TORCH_CHECK(height > 0, "FrameDims.height must be > 0, got: ", height);
TORCH_CHECK(width > 0, "FrameDims.width must be > 0, got: ", width);
}
@scotts (Contributor Author) commented:

Drive-by: when passing in a height and width, we should only be able to instantiate a FrameDims object with positive values. If we want a FrameDims object that has 0 for both values, that's just the default constructor. We should never have a FrameDims object with negative values.

@scotts scotts marked this pull request as ready for review October 15, 2025 20:06
@NicolasHug NicolasHug (Contributor) left a comment

Thanks for the PR @scotts! I left a few comments below; this looks good.


On this design discussion point:

  2. We apply the crop in RGB space. Note that in the references I generate from the ffmpeg CLI, I explicitly add format=rgb24 before the filters. If we don't explicitly do that format conversion, the frames we get from the ffmpeg CLI will have been cropped in YUV space and we won't get tensor equality with TorchCodec or TorchVision's transforms. I think it would actually be faster to crop first, as then the colorspace conversion needs to do less work. But I think our ideal correctness comparison should be the result we get from passing an unchanged decoded frame to a TorchVision transform.

I don't remember having an explicit discussion about our inclusion criteria, so let me share my personal thoughts on this. When implementing a new TC transform, I think we should consider these two questions:

  • A. Perf: Is the TC transform faster than the TV transform? We want the answer to be yes. If the answer is no then it's not worth implementing in TC. Users can just use TV.
  • B. Correctness: Can a model tell the difference between a TV transform and its TC counterpart? We want the answer to be no. We know for example that for resize, we'll have to be careful about this. The easiest way to ensure that is to check that the TC output is exactly equal to the TV output - that's what this PR is doing, and this guarantees correctness with 100% certainty. In general though:
    • we won't always be able to check strict equality with TV.
    • strict equality with TV is a sufficient, but not always necessary, condition to validate correctness. For example, the PIL and Tensor backends of TV's resize differ slightly in their output (usually within a 1-pixel difference for bilinear mode, and a bit more for bicubic mode), but they're still indistinguishable to models. I suspect this will also be the case between TV's resize and TC's resize. And maybe it's OK for crop to have slightly different output results, as long as models can't tell the difference.

Back to our discussion point now: should we apply crop in YUV or in RGB?

I could be wrong but I think applying the crop in RGB will have the exact same perf as TV's crop. And if that's the case then we're not meeting criterion A. I think it's worth benchmarking this a bit, and I can help with that if needed.

We know that applying crop in RGB meets criterion B, so if we choose to apply it in YUV we'll have to check that it's still correct. We won't have strict equality with TV, but as I mentioned, this won't always be possible (or necessary!) anyway. Maybe we can have PSNR checks with very strict tolerance. And we should try to validate filtergraph's algorithms (e.g. see if there's any kind of fancy filtering going on, which might be a problem) - that's also something I can help with.
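
A minimal sketch of what such a PSNR check could look like; the helper name and the threshold are illustrative, not part of this PR:

    import torch

    def psnr(a: torch.Tensor, b: torch.Tensor, max_val: float = 255.0) -> float:
        # Peak signal-to-noise ratio between two uint8 frames; higher means closer.
        mse = torch.mean((a.float() - b.float()) ** 2)
        return float(10 * torch.log10(max_val**2 / mse))

    # e.g. assert psnr(tc_cropped_frame, tv_cropped_frame) > 50  # "very strict tolerance"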

Comment on lines 219 to 222
// Where "crop" is the string literal and <width>, <height>, <x> and <y> are
// positive integers. Note that in this spec, we are following the
// filtergraph convention of (width, height). This makes it easier to compare it
// against actual filtergraph strings.
@NicolasHug (Contributor) commented:

IIUC x and y are essentially the top-left coordinates of the crop, let's make that explicit (it could also be e.g. the center coordinates of the crop).

Separately, I was hoping we could use H and W everywhere for consistency. Can you share the comparisons against filtergraph strings that are made easier by using the (W, H) order?

@scotts (Contributor Author) commented:

@NicolasHug, I'm fine with (h, w). It just means that in the tests and in the resource generation, we'll have strings that look like:

        crop_spec = "crop, 200, 300, 50, 35"             # note: h=200, w=300
        crop_filtergraph = "crop=300:200:50:35:exact=1"  # note: w=300, h=200

Comments can cover it. We'll have to deal with this awkwardness somewhere. I don't have a strong preference where, and I think it's probably good to keep all of TorchCodec in (h, w).
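
Just to illustrate the mapping, a hypothetical helper (not code in this PR; the real conversion happens when we build the filtergraph string in C++):

    def crop_spec_to_filtergraph(spec: str) -> str:
        # "crop, <h>, <w>, <x>, <y>"  ->  "crop=<w>:<h>:<x>:<y>:exact=1"
        name, h, w, x, y = [part.strip() for part in spec.split(",")]
        assert name == "crop"
        return f"crop={w}:{h}:{x}:{y}:exact=1"

    crop_spec_to_filtergraph("crop, 200, 300, 50, 35")  # -> "crop=300:200:50:35:exact=1"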

Comment on lines 77 to 78
TORCH_CHECK(x_ <= streamMetadata.width, "Crop x position out of bounds");
TORCH_CHECK(y_ <= streamMetadata.height, "Crop y position out of bounds");
@NicolasHug (Contributor) commented:

Should/can we account for H and W of the crop as well? E.g.

TORCH_CHECK(x_ + W < streamMetadata.width, "Crop x position out of bounds");

"1",
output_path,
]
print("===" + " ".join([str(x) for x in cmd]))
@NicolasHug (Contributor) commented:

Is this a debug leftover? Just checking; maybe it's intended and OK since this is just the generation util.


from .utils import assert_frames_equal, NASA_VIDEO, needs_cuda

torch._dynamo.config.capture_dynamic_output_shape_ops = True
@NicolasHug (Contributor) commented:

I suspect we don't need this line in this file for now, nor the os.environ["TORCH_LOGS"] = "output_code" above, since they relate to torch.compile tests.

@scotts scotts (Contributor Author) commented Oct 16, 2025

@NicolasHug, thanks for the discussion!

I could be wrong but I think applying the crop in RGB will have the exact same perf as TV's crop. And if that's the case then we're not meeting criterion A. I think it's worth benchmarking this a bit, and I can help with that if needed.

I actually think criterion A is too strict, as it's phrased strictly in terms of CPU performance and only considers transforms in isolation. I think we also need to consider:

  1. Overall memory usage. Even if decoder-native crop (and resize) take the same amount of time as the TorchVision transforms, they're going to use less memory overall (rough numbers after this list). That is, if we have a video with 4K frames and we use TorchVision to do the crop or resize, we're going to return a full 4K tensor. If we're cropping or resizing to something like (100, 100), then we're only going to return a (100, 100) tensor. That's quite a big difference. That's a lot less memory that FFmpeg will have to return to us in the AVFrame and much less memory that will be owned by the Python layer. I think we're efficient in getting our data into tensors, but once a tensor is owned by the Python layer, it's going to be governed by the garbage collector. If we can just return a much smaller tensor to begin with, the garbage collector in the Python layer will be under much less pressure. The fewer bytes we have to shuttle around, the better overall performance will be.
  2. Some transforms will enable others. If users want to do a pipeline that is basically [ExpensiveTransform, Crop], but Crop is not decoder-native, then they can't do ExpensiveTransform decoder-natively even if it's available.
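
Rough numbers for the memory point above, assuming 8-bit RGB frames; this is back-of-envelope arithmetic, not a benchmark:

    full_4k_bytes = 3 * 2160 * 3840        # ~24.9 MB per frame if TorchVision does the crop
    cropped_bytes = 3 * 100 * 100          # 30 KB per frame if the decoder crops to (100, 100)
    ratio = full_4k_bytes / cropped_bytes  # ~829x fewer bytes to shuttle around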

Regarding cropping in RGB versus YUV space, this is actually a general question that will apply to all transforms. What I'd like to go with for now: we default to doing the transform in RGB space, as that means we are more likely to be able to have bit-for-bit equality with TorchVision transforms. Whether our users realize it or not, I also think that's what they'll expect. If we get requests for applying the transforms in YUV space, we could provide an API to request it. (Either a decoder-level option, or some special transform which would allow users to have some of their transforms in YUV and some in RGB.)

import torch
from PIL import Image

from .utils import sanitize_filtergraph_expression
@scotts scotts (Contributor Author) commented Oct 17, 2025

Note that pulling in code from utils.py means we have to run this as a module, and since utils.py imports general TorchCodec stuff, we do a full build and install for the reference_resources.yaml workflow. I think that's a fair trade-off to ensure that the unit tests and the resource generation are using the same logic for writing and loading files.

@NicolasHug (Contributor) commented:

Do we expect to pull in more things from utils.py? If it's just for sanitize_filtergraph_expression, I would have a slight preference for just duplicating it in both files, because the python -m stuff isn't immediately obvious. But that's fine too.

@NicolasHug NicolasHug (Contributor) left a comment

Thanks @scotts, LGTM

Agreed on both points above that we also care about memory usage, and about enabling transforms that may come later in the pipeline.

I suspect doing the crop in YUV (or even potentially within the decoder itself, like for JPEG) will lead to lower memory usage and better perf overall. But if, as discussed offline, we reserve the right to change the underlying implementation of the transforms as long as we're confident it won't break models, then having the first implementation in RGB sounds great!

@scotts scotts (Contributor Author) commented Oct 17, 2025

@NicolasHug, yes, I anticipate unifying generate_reference_resources.py with utils.py more. Right now, they agree on the actual file names because they both happen to have the same string. I'd like to make generate_reference_resources.py actually use the file definitions in utils.py.

@scotts scotts merged commit c1798db into meta-pytorch:main Oct 17, 2025
64 of 65 checks passed
@scotts scotts deleted the crop_transform branch October 17, 2025 17:37
NicolasHug pushed a commit to NicolasHug/torchcodec that referenced this pull request Oct 27, 2025