@scotts scotts commented Oct 8, 2025

Next step after #902. The design in #885 punted on how the Python layer would communicate the transforms and their parameters to the C++ layer. This PR answers that question: a string. The string format is:

"name1, param1, param2, ...; name2, param1, param2, ...; name3, param1, param2, ..."

In the above, nameX is the name of a transform, and paramX are the parameters that transform accepts. For example, the only transform that we have now is resize, and its spec is currently:

"resize, <height>, <width>"

Where resize is the literal string we expect, and <height> and <width> are integers that will become the height and width. In the future we will add a third parameter for the algorithm. Future transforms may take a different number of parameters with different types; we'll define the exact spec for each transform when we add it.
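To make the format concrete, here is a minimal sketch of how a spec string of this shape could be split into names and parameters. It is not the parser from this PR: `TransformSpec`, `trim`, `splitAndTrim`, and `parseTransformSpecs` are hypothetical names, and the whitespace handling and error cases are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for one parsed transform; not a type from the PR.
struct TransformSpec {
  std::string name;
  std::vector<std::string> params;  // kept as text; each transform converts its own
};

// Remove leading and trailing spaces and tabs.
std::string trim(const std::string& s) {
  const std::size_t first = s.find_first_not_of(" \t");
  if (first == std::string::npos) {
    return "";
  }
  const std::size_t last = s.find_last_not_of(" \t");
  return s.substr(first, last - first + 1);
}

// Split `s` on `delim` and trim each piece. An empty input yields no pieces.
std::vector<std::string> splitAndTrim(const std::string& s, char delim) {
  std::vector<std::string> pieces;
  if (s.empty()) {
    return pieces;
  }
  std::size_t start = 0;
  while (true) {
    const std::size_t end = s.find(delim, start);
    if (end == std::string::npos) {
      pieces.push_back(trim(s.substr(start)));
      break;
    }
    pieces.push_back(trim(s.substr(start, end - start)));
    start = end + 1;
  }
  return pieces;
}

// Parse "name1, p1, p2; name2, p1, ..." into one TransformSpec per transform.
// Assumption: a blank spec or a missing name is treated as an error.
std::vector<TransformSpec> parseTransformSpecs(const std::string& specs) {
  std::vector<TransformSpec> result;
  for (const std::string& spec : splitAndTrim(specs, ';')) {
    if (spec.empty()) {
      throw std::invalid_argument("empty transform spec");
    }
    const std::vector<std::string> fields = splitAndTrim(spec, ',');
    if (fields.front().empty()) {
      throw std::invalid_argument("transform spec is missing a name");
    }
    TransformSpec parsed;
    parsed.name = fields.front();
    parsed.params.assign(fields.begin() + 1, fields.end());
    result.push_back(std::move(parsed));
  }
  return result;
}
```

With this sketch, `parseTransformSpecs("resize, 1024, 768")` yields one entry named `resize` with the parameters `"1024"` and `"768"`. Parameters stay as strings here because future transforms may take different types; converting them would be up to each transform.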

I don't love that we're using strings with our own little specification language, but I'm convinced this is the least bad option:

  • It's possible to use tensors, but it would be uglier and more esoteric. Because tensors are limited to number types, we'd have to map numbers to transform kinds. For example, we could say that 0 -> resize, and then if we wanted to specify a resize operation of height 1024 and width 768, we could say torch.tensor([0, 1024, 768]). But both the Python and C++ sides would need to know this mapping of integer to transform. Yes, that's technically true of strings too, but it's rather obvious what "resize" means. This approach would also need even more machinery than accepting our little string spec language.
  • JSON is overkill as we have a constrained input. I'd rather not parse full JSON.
  • Users aren't actually exposed to this specification language. It exists only in the core API. The VideoDecoder class will be responsible for translating from torchvision.transforms.v2 to these specification strings. Since it's our own code that will generate these specs, we don't need to worry about making something with sharp edges that will cut users.
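As a rough illustration of the last point, the generating side only needs to format and join strings. The sketch below is written in C++ to match the rest of this thread, although the real translation would live in the Python VideoDecoder class. `ResizeRequest`, `makeResizeSpec`, and `joinSpecs` are hypothetical names, not part of this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical description of one resize request; not a type from the PR.
struct ResizeRequest {
  int64_t height;
  int64_t width;
};

// Format a single spec in the "resize, <height>, <width>" shape.
std::string makeResizeSpec(const ResizeRequest& request) {
  return "resize, " + std::to_string(request.height) + ", " +
      std::to_string(request.width);
}

// Join individual specs with "; ". No specs yields the empty string,
// which matches the empty default for transform_specs.
std::string joinSpecs(const std::vector<std::string>& specs) {
  std::string joined;
  for (std::size_t i = 0; i < specs.size(); ++i) {
    if (i > 0) {
      joined += "; ";
    }
    joined += specs[i];
  }
  return joined;
}
```

Because both ends are our own code, the generator can emit exactly the shape the parser expects, and the name `resize` documents itself in a way that an integer code like 0 would not.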

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 8, 2025
std::optional<int64_t> stream_index = std::nullopt,
std::string_view device = "cpu",
std::string_view device_variant = "default",
std::string_view transform_specs = "",

Note that we're using an empty spec, "", as the default rather than making it optional. I find this makes the code simpler and easier to reason about.

videoStreamOptions.deviceVariant = device_variant;

std::vector<Transform*> transforms =
makeTransforms(std::string(transform_specs));

An example of how using a default empty spec makes things simpler than using an optional: we always call this function. If we have an empty spec, we just get back an empty vector.
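A small sketch of that trade-off, under stated assumptions: `countSpecs` is a hypothetical stand-in for makeTransforms that only counts the `;`-separated specs, and `withDefault` and `withOptional` are illustrative call sites, not code from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical stand-in for makeTransforms: counts ';'-separated specs.
// An empty string means "no transforms", so the result is 0.
std::size_t countSpecs(const std::string& specs) {
  if (specs.empty()) {
    return 0;
  }
  std::size_t count = 1;
  for (const char c : specs) {
    if (c == ';') {
      ++count;
    }
  }
  return count;
}

// Default-empty style: the caller always calls through, with no branch.
std::size_t withDefault(const std::string& transformSpecs = "") {
  return countSpecs(transformSpecs);
}

// Optional style: the caller must branch on presence, and "no transforms"
// now has two spellings (std::nullopt and "") that must behave the same.
std::size_t withOptional(
    const std::optional<std::string>& transformSpecs = std::nullopt) {
  if (!transformSpecs.has_value()) {
    return 0;
  }
  return countSpecs(*transformSpecs);
}
```

Both styles return 0 when no transforms are requested, but only the optional version needs the extra check and has to keep two representations of "nothing" consistent.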

@scotts scotts marked this pull request as ready for review October 9, 2025 03:05