
Conversation

@scotts (Contributor) commented Nov 21, 2025

Implements torchcodec.transforms.RandomCrop and also accepts torchvision.transforms.v2.RandomCrop. The key difference between this capability and Resize is that we need to:

  1. Compute a random location in the image to crop.
  2. Make that computation match exactly what TorchVision does.

Short version of how we accomplish this:

  1. If you give us the TorchVision object, we call make_params() on it to get the computed location.
  2. If you don't, we do the same calculation in TorchCodec. We'll need to use testing and code review to make sure these stay aligned.
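
Roughly, the two paths look like the sketch below. This is illustrative only: the torchcodec-side function names are made up, the make_params() dict keys ("top", "left") are an assumption that may differ across torchvision versions, and the native draw just mirrors the spirit of torchvision's RandomCrop.get_params().

import torch
from torchvision.transforms import v2


def crop_location_from_torchvision(tv_crop: v2.RandomCrop, frame: torch.Tensor):
    # Path 1: let TorchVision compute the crop location. make_params() is the
    # v2 hook mentioned above; the "top"/"left" keys are an assumption.
    params = tv_crop.make_params([frame])
    return params["top"], params["left"]


def crop_location_native(crop_h, crop_w, frame_h, frame_w):
    # Path 2: do the same draw in TorchCodec, in the spirit of
    # torchvision.transforms.RandomCrop.get_params(): top/left are drawn
    # uniformly so the crop stays inside the frame.
    top = int(torch.randint(0, frame_h - crop_h + 1, size=(1,)).item())
    left = int(torch.randint(0, frame_w - crop_w + 1, size=(1,)).item())
    return top, left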

Working on this transform also made me realize that DecoderTransform and its subclasses should not be dataclasses. I initially thought they would just be bags of values, but they're growing to have significant methods and internal state not exposed to users. In a follow-up PR, I'll refactor them into normal classes, much like the TorchVision versions. That felt too disruptive to do in this PR.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 21, 2025
-  int x = checkedToPositiveInt(cropTransformSpec[3]);
-  int y = checkedToPositiveInt(cropTransformSpec[4]);
+  int x = checkedToNonNegativeInt(cropTransformSpec[3]);
+  int y = checkedToNonNegativeInt(cropTransformSpec[4]);
@scotts (Contributor, Author):

The location (0, 0) is a valid image location. 🤦

@scotts marked this pull request as ready for review November 22, 2025
@scotts changed the title from "[WIP] Implement RandomCrop transform" to "Implement RandomCrop transform" Nov 22, 2025
if self._top is None or self._left is None:
# TODO: It would be very strange if only ONE of those is None. But should we
# make it an error? We can continue, but it would probably mean
# something bad happened. Dear reviewer, please register an opinion here:
Contributor (reviewer):

I agree it would appear something bad happened in this case.

But when calling this function, do we expect _top or _left to have any value? My understanding is that these fields are only set when _make_transform_spec is called, which is only called once per DecoderTransform.

@scotts (Contributor, Author):

It depends on whether RandomCrop was created via a TorchVision RandomCrop or instantiated directly. If created by a TorchVision RandomCrop in _from_torchvision(), then _top and _left should have values. If created directly, then neither should have a value, in which case we have to do our random logic.

It has occurred to me that maybe we don't need to call RandomCrop.make_params() in _from_torchvision(). Maybe we should just always set these values in _make_transform_spec().
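
Roughly, the two options look like this. This is a sketch only: the class shape, constructor, and the returned spec are illustrative, and only _from_torchvision(), _make_transform_spec(), _top, and _left come from this thread.

import torch


class RandomCrop:
    # Illustrative-only shape of the transform under discussion.
    def __init__(self, size):
        self._height, self._width = size
        self._top = None   # set either by _from_torchvision() or lazily below
        self._left = None

    @classmethod
    def _from_torchvision(cls, tv_crop):
        crop = cls(tv_crop.size)
        # Option A (current PR): call tv_crop.make_params() here and store the
        # resulting top/left in crop._top / crop._left.
        return crop

    def _make_transform_spec(self, input_dims):
        input_height, input_width = input_dims
        if self._top is None or self._left is None:
            # Option B (floated above): always draw the location here, whether
            # or not the transform came from a TorchVision object.
            self._top = int(torch.randint(0, input_height - self._height + 1, (1,)).item())
            self._left = int(torch.randint(0, input_width - self._width + 1, (1,)).item())
        # Illustrative spec layout; the real spec format lives in the PR.
        return (self._left, self._top, self._height, self._width)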

"v2 transform."
)
else:
input_dims = transform._get_output_dims(input_dims)
Contributor (reviewer):

Is _get_output_dims only used in this function for validation? I believe there are TODOs to move validation to the constructor, which is great, but I do not understand the returned value input_dims here.

@scotts (Contributor, Author) Nov 25, 2025:

It's not actually used for validation. Think of the transforms as a pipeline: A -> B -> C -> D. Each stage may change the dimensions of the frame. We need to track the frame dimensions as we move through the pipeline because some transforms need to know the dimensions of the input frame. RandomCrop is one such transform: in order to randomly determine a location to crop, it needs to know the input frame dimensions to know the bounds to pass to the random number generator.

The dimensions that A receives come from the originally decoded frame, which we can get from the metadata. But the dimensions for B are actually the output of A! That extends to each transform in the pipeline.

This probably deserves a comment. :)
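
Something like this hedged sketch of the dimension threading (the loop, function name, and tuple shape here are illustrative; only _make_transform_spec() and _get_output_dims() are from the actual code):

def resolve_transform_specs(transforms, decoded_height, decoded_width):
    # The first transform sees the decoded frame's dimensions (from metadata);
    # every later transform sees the *output* dimensions of the one before it.
    input_dims = (decoded_height, decoded_width)
    specs = []
    for transform in transforms:
        # e.g. RandomCrop uses input_dims to bound its random top/left draw.
        specs.append(transform._make_transform_spec(input_dims))
        input_dims = transform._get_output_dims(input_dims)
    return specs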
