
Conversation

@scotts scotts commented Oct 27, 2025

Public API for decoder-native resize. The implementation in this PR accepts both torchvision.transforms.v2.Resize and a newly defined torchcodec.transforms.Resize.

In #526, I had initially proposed not using TorchVision transforms, and instead coming up with TorchCodec specific versions. @NicolasHug proposed that we accept TorchVision transforms, and that's what I followed up with in my design in #885.

After discussing the previous iteration of this PR, we agreed we wanted to see what it would look like to accept both. Having implemented this, I agree it's the right thing to do:

  1. We now don't need to require TorchVision, even when using the decoder-native feature.
  2. We have a natural place to document the behavior of each decoder-native transform that we accept, and what its limitations are compared to the TorchVision version of that transform.
  3. We have a more principled mechanism of enforcing how TorchVision transforms map to decoder-native semantics. We still have to dig into the TorchVision object to get the info we need, but the torchcodec.transforms class is a clear representation in code of what is supported. In the old PR, that mapping was buried in the logic that turned the TorchVision transform directly into the specification string the core API needs.

Four points worth discussing:

  1. I made the base class for all TorchCodec defined decoder-native transforms to be DecoderTransform. I think it would be confusing if it was just Transform, and DecoderNativeTransform seems both too long and too obscure.
  2. I made the module path torchcodec.transforms instead of torchcodec.decoder_transforms. That's almost counter to point 1, but I think that there's less chance of confusion with the module path.
  3. Should it be DecoderResize instead of just Resize?
  4. The type annotation that users will see only mentions accepting torchcodec.transforms.DecoderTransform. It does not mention the TorchVision transforms or nn.Module. The text of the docstring will say it, and I think that's enough?
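
For readers of this thread, here is a rough usage sketch of the API described above. The transforms parameter on VideoDecoder and the torchcodec.transforms.Resize class come from this PR; the file name, the size values, and the size keyword are illustrative assumptions.

from torchcodec.decoders import VideoDecoder
import torchcodec.transforms as tc_transforms

# Decoder-native resize using the TorchCodec-defined transform.
# TorchVision is not required for this path.
decoder = VideoDecoder("video.mp4", transforms=[tc_transforms.Resize(size=(270, 480))])
frame = decoder[0]  # frames come back already resized by the decoder

# The same decoder, but passing the TorchVision transform instead
# (requires TorchVision to be installed).
from torchvision.transforms import v2
decoder_tv = VideoDecoder("video.mp4", transforms=[v2.Resize(size=(270, 480))])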

@meta-cla bot added the CLA Signed label Oct 27, 2025

@scotts scotts marked this pull request as ready for review November 10, 2025 02:11
:ref:`sphx_glr_generated_examples_decoding_approximate_mode.py`
transforms (sequence of transform objects, optional): Sequence of transforms to be
applied to the decoded frames by the decoder itself, in order. Accepts both
``torchcodec.transforms.DecoderTransform`` and ``torchvision.transforms.v2.Transform``
Contributor

For it to render as links in the docs:

Suggested change
``torchcodec.transforms.DecoderTransform`` and ``torchvision.transforms.v2.Transform``
:class:`~torchcodec.transforms.DecoderTransform` and :class:`~torchvision.transforms.v2.Transform`

We should also create a doc page for the transforms!

Contributor

@NicolasHug NicolasHug left a comment

Looks great, made a first pass!

transforms (sequence of transform objects, optional): Sequence of transforms to be
applied to the decoded frames by the decoder itself, in order. Accepts both
``torchcodec.transforms.DecoderTransform`` and ``torchvision.transforms.v2.Transform``
objects. All transforms are applied in the output pixel format and colorspace.
Contributor

Do we want to document this behavior? It seems binding, and we discussed that we may want to reserve the right to change the underlying implementation provided the outputs are still valid?

Contributor Author

@scotts scotts Nov 10, 2025

We want to reserve the right to change the underlying implementation, but we may not be able to easily change when we apply the transform with respect to the colorspace conversion. That fact is, I think, implied by what we consider to be our reference: a fully decoded frame passed to a TorchVision transform. In that scenario, the transform is always applied after the colorspace conversion.

Then I think the questions are:

  1. Do we want to document that we consider passing untransformed frames to TorchVision transforms as our reference? I think we do, because I think that's implied by accepting the TorchVision transforms, and it's an easy way to explain the feature to users.
  2. Is when the transform is applied useful to users? I thought it was, but if it's of little value, we could potentially just not talk about it.

Given how far away the tolerances were when TorchCodec applied the transform in YUV, but TorchVision applied them in RGB, I think that if we ever changed this behavior, it would have to be an option.
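
To make the reference concrete, a minimal sketch of the comparison being described, assuming the transforms parameter from this PR; the tolerance is illustrative, not the one used in the actual tests.

import torch
from torchvision.transforms import v2
from torchcodec.decoders import VideoDecoder
from torchcodec.transforms import Resize

size = (270, 480)

# Decoder-native path: the decoder applies the resize itself.
native_frame = VideoDecoder("video.mp4", transforms=[Resize(size=size)])[0]

# Reference path: fully decode the frame, then apply the TorchVision transform.
reference_frame = v2.Resize(size=size)(VideoDecoder("video.mp4")[0])

# Both resizes happen after the colorspace conversion (i.e. in RGB), so the
# two results should agree within a small tolerance.
torch.testing.assert_close(native_frame, reference_frame, atol=2, rtol=0)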

decoded frames and applying the same kind of transform.
Most DecoderTransforms have a complementary transform in TorchVision,
specifically in torchvision.transforms.v2. For such transforms, we ensure
Contributor

Nit: add URL

class Resize(DecoderTransform):
"""Resize the decoded frame to a given size.
Complementary TorchVision transform: torchvision.transforms.v2.Resize.
Contributor

Suggested change
Complementary TorchVision transform: torchvision.transforms.v2.Resize.
Complementary TorchVision transform: :class:`~torchvision.transforms.v2.Resize`.

" DecoderTransform. TorchCodec also accept TorchVision "
"v2 transforms, but TorchVision is not installed."
)
if isinstance(transform, v2.Resize):
Contributor

@NicolasHug NicolasHug Nov 10, 2025

I think this fails if tv_available is False? Because v2 wouldn't exist

EDIT ah no that's probably fine because of the if not tv_available: check above.
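
For context, a minimal sketch of the optional-import pattern being discussed; the tv_available name comes from the check mentioned above, while the function and the rest of the logic are illustrative assumptions rather than the PR's actual code.

from torchcodec.transforms import DecoderTransform, Resize

try:
    from torchvision.transforms import v2
    tv_available = True
except ImportError:
    tv_available = False

def _convert_transform(transform):
    # TorchCodec's own transforms pass through untouched.
    if isinstance(transform, DecoderTransform):
        return transform
    if not tv_available:
        raise ValueError(
            "The transform is not a DecoderTransform. TorchCodec also accepts "
            "TorchVision v2 transforms, but TorchVision is not installed."
        )
    # v2 is only referenced past the guard above, so this cannot NameError.
    if isinstance(transform, v2.Resize):
        return Resize._from_torchvision(transform)
    raise ValueError(f"Unsupported transform: {transform}")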

Contributor

Makes me think we should have a dummy job where we don't install TV that ensures TC still works fine...

Contributor Author

On a job which doesn't have TorchVision installed: I agree we need to do something here, but I'd like to punt on this for now. The current testing file imports TorchVision unconditionally. I think we'll want to separate out the tests that require TorchVision from those that don't so that we can test both behaviors, but that will require different .py files. I'd like to deal with that in its own PR.

I actually started to add a step in the current linux wheel test that did not install TorchVision when I realized this.

Contributor

Yes, we can punt on this. I'm hoping we can do something very simple regarding testing: keep all but one test job using torchvision, and just have one small CI job that doesn't install TV and just runs a few tests, basically just ensuring TV is an optional dependency. I'd like to avoid separating tests into different files just for that - we may have more than one optional dependency and that quickly becomes intractable.

applied to the decoded frames by the decoder itself, in order. Accepts both
:class:`~torchcodec.transforms.DecoderTransform` and
`torchvision.transforms.v2.Transform <https://docs.pytorch.org/vision/stable/transforms.html#v2-api-reference-recommended>`_
objects. All transforms are applied
Contributor Author

This and all other references to TorchVision transforms use hard links. I don't think we can get proper Sphinx references when it's in a different project.

Contributor

We can, we'll just need to add a torchvision entry here:

intersphinx_mapping = {
    "python": ("https://docs.python.org/3/", None),
    "torch": ("https://pytorch.org/docs/stable/", None),
    "numpy": ("https://numpy.org/doc/stable/", None),
    "PIL": ("https://pillow.readthedocs.io/en/stable/", None),
    "matplotlib": ("https://matplotlib.org/stable/", None),
}

Feel free to leave that as follow-up / open an issue.
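
For example, presumably an entry along these lines (the URL matches the hard links already used elsewhere in this PR):

intersphinx_mapping = {
    "python": ("https://docs.python.org/3/", None),
    "torch": ("https://pytorch.org/docs/stable/", None),
    "numpy": ("https://numpy.org/doc/stable/", None),
    "PIL": ("https://pillow.readthedocs.io/en/stable/", None),
    "matplotlib": ("https://matplotlib.org/stable/", None),
    "torchvision": ("https://docs.pytorch.org/vision/stable/", None),
}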

def _make_params(self) -> str:
    assert len(self.size) == 2
    return f"resize, {self.size[0]}, {self.size[1]}"

Contributor Author

Note this class method below is new. Because I'm trying to exhaustively catch all of the v2.Resize options we don't support, the code for turning a v2.Resize into a torchcodec.transforms.Resize got more involved. Extrapolated across more transforms, this kind of logic would end up dominating the code in _video_decoder.py. By making this a private class method, we can put all the logic about which v2.Resize options we support, and how to turn a v2.Resize into a torchcodec.transforms.Resize, in one place.

Also, to state it explicitly, _from_torchvision() and _make_params() are private methods so they're not publicly documented. Users shouldn't need to know about them.
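
To illustrate the kind of logic being centralized, here is a hypothetical standalone sketch of the conversion. The attribute names (size, max_size) come from torchvision.transforms.v2.Resize; exactly which options the PR rejects is not shown in this thread, so the specific checks below are assumptions.

from torch import nn
from torchcodec.transforms import Resize

def _resize_from_torchvision(resize_tv: nn.Module) -> Resize:
    from torchvision.transforms import v2

    assert isinstance(resize_tv, v2.Resize)
    # Reject v2.Resize options that decoder-native resize cannot honor
    # (which options are rejected is an assumption in this sketch).
    if resize_tv.max_size is not None:
        raise ValueError("max_size is not supported by decoder-native resize.")
    if resize_tv.size is None or len(resize_tv.size) != 2:
        raise ValueError("Decoder-native resize needs an explicit (height, width) size.")
    return Resize(size=tuple(resize_tv.size))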

@@ -0,0 +1,17 @@
.. _samplers:
Contributor

Suggested change
.. _samplers:
.. _transforms:

Comment on lines 3 to 5
===================
torchcodec.transforms
===================
Contributor

We're getting warnings when the lengths don't match

Suggested change
===================
torchcodec.transforms
===================
=====================
torchcodec.transforms
=====================

should be both faster and more memory efficient than receiving normally
decoded frames and applying the same kind of transform.
Most `DecoderTransform` objects have a complementary transform in TorchVision,
Contributor

Annoyingly, single backticks in rst mean italics. I think you wanted those to be code, like in markdown? For that we need double backticks (there are other instances of single backticks below and maybe in other files?)

Suggested change
Most `DecoderTransform` objects have a complementary transform in TorchVision,
Most ``DecoderTransform`` objects have a complementary transform in TorchVision,

Contributor Author

I saw them render as italics, and I just thought, "Oh, Sphinx makes code just italics? Okay..." :)

" DecoderTransform. TorchCodec also accept TorchVision "
"v2 transforms, but TorchVision is not installed."
)
if isinstance(transform, v2.Resize):
Contributor

Nit, I think I would have been less surprised by v2 being actually optional if this were elif.

Suggested change
if isinstance(transform, v2.Resize):
elif isinstance(transform, v2.Resize):



@classmethod
def _from_torchvision(cls, resize_tv: nn.Module):
    from torchvision.transforms import v2
Contributor

I'd suggest the following:

try:
    from torchvision.transforms import v2
except ImportError as e:
    raise RuntimeError("Couldn't find TorchVision - this should never happen, please report a bug") from e

This should probably be in a helper function, reused across classes. My goal here is mainly to help the reader (us, the devs) understand that this code-path is only expected to be run when TV is already available. Otherwise, the plain import makes it look like v2 could be a hard dep.

Contributor Author

At the moment, there is only one place where it's a bug if we can't find TorchVision, so I'll keep this as not-a-function for now. In the other place where we import TorchVision dynamically, we're fine if it's not there.

Contributor Author

Ohhh, right, but we're going to have one of these for each transform. I'll pull into a function now, and it should just live in this file.
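
A sketch of the helper being agreed on here; the function name is illustrative, and the body follows the corrected try/except from the suggestion above.

def _import_torchvision_v2():
    # Only reached once we've decided to handle a TorchVision transform, so a
    # missing TorchVision at this point is a bug rather than user error.
    try:
        from torchvision.transforms import v2
    except ImportError as e:
        raise RuntimeError(
            "Couldn't find TorchVision - this should never happen, please report a bug."
        ) from e
    return v2

# Each transform's _from_torchvision() can then start with:
#     v2 = _import_torchvision_v2()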

Comment on lines 60 to 62
def _make_params(self) -> str:
    assert len(self.size) == 2
    return f"resize, {self.size[0]}, {self.size[1]}"
Contributor

Can we call it something else than _make_params?

make_params exists for the v2 transforms, but it does something quite different.

:class:`~torchcodec.transforms.DecoderTransform` and
`torchvision.transforms.v2.Transform <https://docs.pytorch.org/vision/stable/transforms.html#v2-api-reference-recommended>`_
objects. All transforms are applied
in the output pixel format and colorspace. Read more about this parameter in:
Contributor

Following up on #1003 (comment)

Do we want to document that we consider passing untransformed frames to TorchVision transforms as our reference? I think we do, because I think that's implied by accepting the TorchVision transforms, and it's an easy way to explain the feature to users.

Agreed, we should document and claim that TV is our ref. I think we have slightly different understandings of what we mean by "TV is our ref"; your definition is slightly stricter than mine (see below).

Is when the transform is applied useful to users? I thought it was, but if it's of little value, we could potentially just not talk about it.

I don't think it adds a lot of value to document, as I don't know if that's a question users are even asking themselves. But I could be wrong and I don't feel strongly about it. What I'm slightly more concerned about is that the comment seems like a contract, and I suspect we may want to relax that behavior in the future. E.g. for crop, we might want to apply it in YUV space instead of RGB if it's faster and if models can't notice the difference.

To me, when we say "TV is our ref", it means "this transform has the same behavior as the TV transform as far as models are concerned". It's not strictly about bitwise equality (we'll never have that). It's only about whether the models can tell the difference. We know they can tell the difference for resize's interpolation mode. But if they can't tell the difference for (e.g.) crop being applied before or after color-conversion, I think we could allow ourselves to make that change of behavior. That allows us more freedom to potentially enable higher perf gains in the future.

None of my comments above are blocking. We can go ahead as-is. I'm happy that for once, I am not the one insisting on strictness :D

Contributor Author

@NicolasHug, that's all fair, and I also think it's fair to err on the side of explaining less about the implementation. If folks start asking about it, we can revisit.

But if they can't tell the difference for (e.g.) crop being applied before or after color-conversion, I think we could allow ourselves to make that change of behavior. That allows us more freedom to potentially enable higher perf gains in the future.

Based on what I did with crop and resize, I actually think that is likely to be the case everywhere: applying the transform in YUV versus RGB will be noticeable by the model. But we can easily punt on that determination by just not saying anything about it. If it becomes something folks ask about, we may need to make it an explicit option, in which case we'll document behavior.

@scotts scotts merged commit 0535b00 into meta-pytorch:main Nov 14, 2025
63 of 70 checks passed
@scotts scotts deleted the transform_api branch November 14, 2025 02:44