Torchvision API infrastructure #6229
Conversation
Initial commit including infrastructure, resize and flip operators and unit tests
Signed-off-by: Marek Dabek <mdabek@nvidia.com>
!build

CI MESSAGE: [44795470]: BUILD STARTED
CI MESSAGE: [44795470]: BUILD PASSED
Greptile Summary

This PR introduces the foundational infrastructure for a DALI-backed Torchvision-compatible transforms API. Key points from the review:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Compose
    participant Pipe as PipelineHWC / PipelineCHW
    participant PWL as PipelineWithLayout
    participant DALI as DALI Pipeline (_pipeline_function)
    participant Op as Operator (_kernel)
    User->>Compose: __call__(data_input)
    Compose->>Compose: VerificationTensorOrImage.verify(data_input)
    alt first call
        Compose->>Compose: _build_pipeline(data_input)
        Note over Compose: PIL Image → PipelineHWC<br/>torch.Tensor → PipelineCHW
    end
    Compose->>Pipe: run(data_input)
    Pipe->>Pipe: convert / unsqueeze input
    Pipe->>PWL: run(tensor)
    PWL->>PWL: _align_data_with_device(tensor)
    alt device == "gpu" (and CUDA available)
        PWL->>DALI: _cuda_run(tensor) via current CUDA stream
    else
        PWL->>DALI: _cpu_run(tensor)
    end
    loop for each op in op_list
        DALI->>Op: op(DataNode)
        Op-->>DALI: transformed DataNode
    end
    DALI-->>PWL: TensorList output
    PWL->>PWL: to_torch_tensor(output)
    PWL-->>Pipe: torch.Tensor
    Pipe->>Pipe: convert back (PIL Image / squeeze)
    Pipe-->>User: PIL Image or torch.Tensor
```
dali/python/nvidia/dali/experimental/torchvision/v2/functional/resize.py (outdated, resolved)
dali/python/nvidia/dali/experimental/torchvision/v2/functional/flips.py (outdated, resolved)
```python
stream = torch.cuda.Stream(0)
with torch.cuda.stream(stream):
    output = self.pipe.run(stream, input_data=data_input)
```
Using a CUDA stream unconditionally might break CPU-only environments. Please guard it with torch.cuda.is_available() or a similar check.
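A minimal sketch of the suggested guard. `cuda_available` and `FakePipe` are stand-ins here for `torch.cuda.is_available()` and the real pipeline wrapper from the diff above; only the control flow is the point:

```python
def cuda_available():
    # Stand-in for torch.cuda.is_available(); pretend a CPU-only build
    return False

class FakePipe:
    # Stand-in for the DALI pipeline wrapper used in the diff above
    def run(self, stream, input_data):
        return ("gpu" if stream is not None else "cpu", input_data)

def run_pipeline(pipe, data_input):
    if cuda_available():
        # Real code: stream = torch.cuda.Stream(0); with torch.cuda.stream(stream): ...
        return pipe.run(object(), input_data=data_input)
    # CPU fallback never touches torch.cuda
    return pipe.run(None, input_data=data_input)

print(run_pipeline(FakePipe(), [1, 2, 3])[0])  # cpu
```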
Do we have CPU-only tests that will cover torchvision? If not, we should definitely add some to catch things like this.

There are CPU-only test cases, but no test suite. I will add it.
```python
if len(tensor_or_tl) == 1:
    tensor_or_tl = tensor_or_tl[0]
return _to_torch_tensor(tensor_or_tl)
```
Why do we want to unpack 1-element tuples? How about something like:

```python
def to_torch_tensor(
    x: Union[tuple, "TensorListGPU", "TensorListCPU"]
) -> Union[torch.Tensor, tuple]:
    if isinstance(x, (TensorListGPU, TensorListCPU)):
        return to_torch_tensor(x.as_tensor())
    elif isinstance(x, tuple):
        return tuple(to_torch_tensor(elem) for elem in x)
    else:
        return torch.from_dlpack(x)
```
We need to unpack it since the torchvision Compose operator returns a tensor or PIL.Image, not a tuple. DALI.Torchvision is supposed to be used as a drop-in replacement for torchvision, and maximum compatibility should be maintained.
I will use the above code as a base for a better implementation of to_torch_tensor.
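Combining the recursive suggestion above with the single-element unpacking that the drop-in semantics require might look like this sketch. `TensorListCPU` and `from_dlpack` are pure-Python stand-ins here; the real code would use DALI's TensorList types and `torch.from_dlpack`:

```python
class TensorListCPU(list):
    # Stand-in for DALI's TensorListCPU
    def as_tensor(self):
        return self[0] if len(self) == 1 else list(self)

def from_dlpack(x):
    # Stand-in for torch.from_dlpack
    return x

def to_torch_tensor(x):
    # Recurse into TensorLists and tuples, converting leaves via DLPack
    if isinstance(x, TensorListCPU):
        return to_torch_tensor(x.as_tensor())
    if isinstance(x, tuple):
        converted = tuple(to_torch_tensor(e) for e in x)
        # Unpack single-element tuples so Compose returns a tensor, not a
        # tuple, matching torchvision's drop-in semantics discussed above
        return converted[0] if len(converted) == 1 else converted
    return from_dlpack(x)
```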
```python
# This is a WAR for DLPack not supporting pinned memory
if output.device.device_type == "cpu":
    output = np.asarray(output)
```

We should at least document this, or add a warning message.
```python
VerificationTensorOrImage.verify(data_input)
# ...
if self.active_pipeline is None:
    self._build_pipeline(data_input)
```
We should verify if we built a CHW or HWC pipeline, and rebuild in case the type changes. We can store the "pipeline kind" as a member, and initialize it during _build_pipeline.
Shouldn't changing the layout in the middle of the pipeline lifecycle be considered a bug?
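A sketch of the rebuild-on-kind-change idea from the comment above. `ComposeSketch`, `_kind_of`, and the string "pipeline" placeholder are illustrative only; the real code would check `isinstance(data_input, PIL.Image.Image)` and call `_build_pipeline`:

```python
import types

class ComposeSketch:
    def __init__(self, ops):
        self.ops = ops
        self.active_pipeline = None
        self._pipeline_kind = None  # "HWC" for PIL input, "CHW" for tensors

    def _kind_of(self, data_input):
        # Stand-in for the PIL-vs-tensor check in the real implementation
        return "CHW" if hasattr(data_input, "shape") else "HWC"

    def __call__(self, data_input):
        kind = self._kind_of(data_input)
        # Rebuild when the input kind changes mid-lifecycle instead of
        # silently running the wrong pipeline
        if self.active_pipeline is None or kind != self._pipeline_kind:
            self._pipeline_kind = kind
            self.active_pipeline = f"pipeline-{kind}"  # placeholder build
        return self.active_pipeline
```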
```python
    [DEPRECATED but used]
    """

    def __call__(self, data_input):
```
this doesn't seem to fully cover the torchvision core semantics. Here's a suggestion (LLM generated):
```python
class ToTensor:
    """
    Convert an image-like tensor to a float32 CHW tensor in [0, 1].

    Intended to be analogous to torchvision's ToTensor for the common image path:
    - input is uint8 in [0, 255] with layout "HWC" (e.g. PIL image converted to array)
    - output is float32 in [0, 1] with layout "CHW"

    Notes
    -----
    - This operates on DALI graph nodes (DataNode).
    - If the input is already floating point, it is only cast to float32 by default
      and not rescaled (to avoid double-scaling). Set `scale=True` if you want
      unconditional division by 255.
    """

    def __init__(
        self,
        *,
        output_layout: Literal["CHW", "HWC"] = "CHW",
        dtype=types.FLOAT,
        scale: bool = True,
        input_layout: Optional[Literal["HWC", "CHW"]] = None,
    ):
        """
        Parameters
        ----------
        output_layout:
            Layout of the output. torchvision uses CHW.
        dtype:
            Output dtype (default: FLOAT / float32).
        scale:
            If True, divide by 255 when input is integer type. If False, only cast/transpose.
        input_layout:
            If provided, forces how ToTensor interprets the input layout.
            If None, ToTensor assumes HWC (typical for PIL path in your Compose).
        """
        self.output_layout = output_layout
        self.dtype = dtype
        self.scale = scale
        self.input_layout = input_layout

    def __call__(self, data_input):
        # 1) Convert to float32 (or requested dtype).
        # If input is integer and scale=True, do scaling in float to get [0, 1].
        x = data_input
        # Cast first; DALI will handle conversion
        x = fn.cast(x, dtype=self.dtype)
        if self.scale:
            # This matches torchvision's ToTensor scaling for uint8-like images.
            # We do it unconditionally after cast; for non-uint8 integer inputs it
            # still scales, which is usually desired for "image" semantics. If you
            # need stricter behavior, you can make this conditional on the original
            # dtype (requires dtype introspection).
            x = x / 255.0

        # 2) Layout conversion
        in_layout = self.input_layout or "HWC"
        out_layout = self.output_layout
        if in_layout == out_layout:
            return x
        if in_layout == "HWC" and out_layout == "CHW":
            # permute HWC -> CHW
            x = fn.transpose(x, perm=[2, 0, 1])
            return x
        if in_layout == "CHW" and out_layout == "HWC":
            x = fn.transpose(x, perm=[1, 2, 0])
            return x
        raise ValueError(f"Unsupported layout conversion: {in_layout} -> {out_layout}")
```
ToTensor is a special case in this implementation, since DALI is returning tensors from a pipeline. I agree that scaling should be optional and casting is missing.
What is more, the DALI implementation is limited and only allows using ToTensor as the final operator of the pipeline; this will be documented.

I removed ToTensor from this PR.
```python
if no_size:
    if orig_h > orig_w:
        target_w = (max_size * orig_w) / orig_h
    else:
        target_h = (max_size * orig_h) / orig_w
```
this works in float, which might be incompatible with torchvision (it can lead to off-by-one errors, etc.).
Here's a suggestion (LLM generated, to be double-checked) following the torch rounding rules:
```python
from __future__ import annotations

import math
from typing import Optional, Sequence, Tuple, Union

SizeArg = Union[int, Sequence[int], None]


def _check_size_arg(size: SizeArg) -> SizeArg:
    # torchvision allows int or a sequence of len 1 or 2; len 1 is treated like int
    if isinstance(size, (list, tuple)):
        if len(size) == 0 or len(size) > 2:
            raise ValueError(f"size must have len 1 or 2, got {len(size)}")
        if len(size) == 1:
            return int(size[0])
        return (int(size[0]), int(size[1]))
    if size is None:
        return None
    return int(size)


def _round_to_int(x: float) -> int:
    # torchvision uses rounding-to-nearest-int for computed sizes.
    # (In practice they do `int(x + 0.5)` for positive x or `round(x)` depending on version.)
    return int(math.floor(x + 0.5))


def torchvision_resize_output_size(
    orig_h: int,
    orig_w: int,
    size: SizeArg,
    max_size: Optional[int] = None,
) -> Tuple[int, int]:
    """
    Compute output size for torchvision.transforms.Resize(size, max_size=max_size)
    for 2D images. Returns (new_h, new_w) as ints.
    """
    size = _check_size_arg(size)

    if size is None:
        if max_size is None:
            raise ValueError("If size is None, max_size must be provided.")
        # Equivalent to "not_larger": the longest edge becomes max_size
        # and the aspect ratio is preserved.
        # This mirrors torchvision v2 functional behavior.
        if orig_w >= orig_h:
            new_w = int(max_size)
            new_h = _round_to_int(max_size * orig_h / orig_w)
        else:
            new_h = int(max_size)
            new_w = _round_to_int(max_size * orig_w / orig_h)
        return new_h, new_w

    # size is (h, w): direct
    if isinstance(size, tuple):
        if max_size is not None:
            raise ValueError("max_size must be None when size is a sequence of length 2.")
        return int(size[0]), int(size[1])

    # size is int: match the shorter edge
    if max_size is not None and max_size <= 0:
        raise ValueError("max_size must be positive.")
    short, long_ = (orig_w, orig_h) if orig_w <= orig_h else (orig_h, orig_w)
    if short == size:
        # torchvision returns the original size (no-op)
        new_h, new_w = orig_h, orig_w
    else:
        scale = float(size) / float(short)
        new_h = _round_to_int(orig_h * scale)
        new_w = _round_to_int(orig_w * scale)
    if max_size is not None:
        new_short, new_long = (new_w, new_h) if new_w <= new_h else (new_h, new_w)
        if new_long > max_size:
            scale = float(max_size) / float(new_long)
            new_h = _round_to_int(new_h * scale)
            new_w = _round_to_int(new_w * scale)
    # safety: clamp to at least 1
    new_h = max(1, int(new_h))
    new_w = max(1, int(new_w))
    return new_h, new_w
```
Added size calculation in Python.
```python
tv_shape_lower = torch.Size([out_tv.shape[1] - 1, out_tv.shape[2] - 1])
tv_shape_upper = torch.Size([out_tv.shape[1] + 1, out_tv.shape[2] + 1])
```
instead, you could follow torch's rounding rules and expect equality? At least to be checked if it is possible to achieve with DALI.
Strongly agree, full equality would be the best.
If the rounding errors are unavoidable, we should probably document where discrepancy comes from and make sure that we know the error bound (is it always +/-1? or can it be +/-2 in some scenarios?). The documentation should warn the users that they won't see bit-exact results.
If equality is not possible, I think using something like isclose would make this test code more readable, as you'll have something along the lines of assert torch.isclose(out_fn.shape[0], out_tv.shape[0], rtol=0, atol=1)
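The tolerance idea above can also be written without the lower/upper bound pair. Since `torch.isclose` expects tensor arguments rather than plain shape ints, a small plain-Python helper (name `shapes_close` is illustrative) may read better in the test:

```python
def shapes_close(shape_a, shape_b, atol=1):
    # Accept per-dimension differences of at most `atol` (off-by-one tolerance)
    return len(shape_a) == len(shape_b) and all(
        abs(a - b) <= atol for a, b in zip(shape_a, shape_b)
    )

assert shapes_close((256, 341), (256, 342))      # off by one: accepted
assert not shapes_close((256, 341), (256, 343))  # off by two: rejected
```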
I would like it to be equal as well, but DALI calculates the output size differently from Torchvision.
The following change in resize_attr_base.cc::AdjustOutputSize:

```diff
- out_size[d] = in_size[d] * scale[d];
+ out_size[d] = std::floor(in_size[d] * scale[d]);
```

would align both libraries, but it would change DALI's behavior.
There is an option to recalculate the output size in Python, but it would add additional overhead to resize, so I took the liberty of not doing that; instead the documentation will note that the calculated size may be off by one.
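The off-by-one described above can be seen directly. The two helpers below are an illustrative model of the behaviors being compared (truncating float product vs round-to-nearest), not the actual resize_attr_base.cc or torchvision code:

```python
import math

def dali_like_edge(in_size, scale):
    # Models the current DALI behavior discussed above: float product, truncated
    return int(in_size * scale)

def torchvision_like_edge(in_size, scale):
    # torchvision-style round-to-nearest for the computed edge
    return int(math.floor(in_size * scale + 0.5))

# Resizing the shorter edge 200 -> 300 (scale 1.5); the other edge is 333:
# 333 * 1.5 = 499.5, so truncation and rounding disagree by one pixel
print(dali_like_edge(333, 1.5), torchvision_like_edge(333, 1.5))  # 499 500
```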
Added size calculation in Python.
szkarpinski left a comment:
Leaving some comments after the first reading. My main concerns are:
- There's a significant amount of hard-coded dependencies on layouts (HWC, CHW) or color formats (RGB, RGBA). I'd like to make sure that this is intentional and we don't anticipate any need to extend this soon.
- The verification of arguments and argument handling seems to overlap with what DALI already does, and duplicated input validation means added maintenance cost and a risk of discrepancies
```python
return torch.from_dlpack(dali_tensor)


def to_torch_tensor(tensor_or_tl: tuple | TensorListGPU | TensorListCPU) -> torch.Tensor:
```
This looks like a useful and generic thing, not strictly related to torchvision. Shouldn't this be a part of main DALI, or Torch plugin?
dali/python/nvidia/dali/experimental/torchvision/v2/operator.py (outdated, resolved)
```python
raise ValueError(f"Values {name} should be positive number, got {values}")


class VerifyIfOrderedPair(ArgumentVerificationRule):
```
dali/python/nvidia/dali/experimental/torchvision/v2/operator.py (5 outdated threads, resolved)
dali/python/nvidia/dali/experimental/torchvision/v2/functional/flips.py (outdated, resolved)
```python
@params("gpu", "cpu")
def test_horizontal_random_flip_probability(device):
    img = make_test_tensor()
    transform = Compose([RandomHorizontalFlip(p=1.0, device=device)])  # always flip
```
Is Compose required to instantiate an operator? I believe that in torchvision you can use Compose to compose multiple operators, but Compose on a single-element list is a no-op and is not required nor encouraged
> Is Compose required to instantiate an operator?

Yes, since Compose is a wrapper for a pipeline. It does not make much sense to instantiate standalone operators (and build a pipeline around them), since it would promote an antipattern and probably hinder performance. It is better to use the functional API, which uses dynamic mode.
Torchvision allows standalone operator instantiation, since operators have low overhead when executed on the CPU.
And what happens when a user tries to use an operator without Compose (which is allowed in torchvision)? Will the error message be meaningful enough to guide the user towards the correct usage?

It will end up with an unrelated DALI exception. I don't think there is an obvious way to check if an operator is inside a pipeline, other than passing a Compose to the pipeline's operators.
I plan to add a note in the documentation that DALI Torchvision operators need to be wrapped in Compose.
Maybe I underestimate the users, but I'm afraid very few will see this note and will get discouraged by the unrelated DALI error: "if writing the simplest single-operator Torchvision transform results in a cryptic error, what's next?!".
I'd strongly suggest wrapping the Torchvision API DataNodes in some thin Python layer that would return a meaningful error message. Or, instead of wrapping, maybe we can override the __call__ of the operator.
If you believe it's not the right moment to do this, we can do it in a follow-up.
I'd echo what Szymon says about the possible false impression that "this barely works" if one encounters such an error. The idea of overriding the __call__ operator in experimental.torchvision.functional feels promising: __call__ just throws, and we could have a private _call that calls super.
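A sketch of the override discussed above: the public `__call__` raises a meaningful error for standalone use, while Compose would invoke a private `_call`. All names here are illustrative, not the actual DALI API:

```python
class TransformSketch:
    def __call__(self, data_input):
        # Standalone invocation gets a clear, actionable error instead of
        # an unrelated DALI exception
        raise RuntimeError(
            f"{type(self).__name__} must be wrapped in Compose; "
            "standalone invocation is not supported by the DALI torchvision API."
        )

    def _call(self, data_input):
        # Compose would invoke this instead of __call__
        return data_input  # placeholder for the real DALI kernel

class RandomHorizontalFlipSketch(TransformSketch):
    def __init__(self, p=0.5):
        self.p = p
```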
Signed-off-by: Marek Dabek <mdabek@nvidia.com>
!build

CI MESSAGE: [45326301]: BUILD STARTED
dali/python/nvidia/dali/experimental/torchvision/v2/operator.py (outdated, resolved)
CI MESSAGE: [45326301]: BUILD FAILED
34bcd56 to 08bbd62 (force-pushed)
CI MESSAGE: [45708918]: BUILD FAILED
CI MESSAGE: [45708918]: BUILD PASSED
```python
output = super().run(_input)
# ...
if data_input.ndim == 3:
    # DALI requires batch size to be present
```
Suggested change:

```diff
- # DALI requires batch size to be present
+ # Removing the batch dimension we added above
```

or similar would be more readable.
```python
    device_id = data_input.device.index
else:
    device_id = torch.cuda.current_device()
stream = torch.cuda.Stream(device=device_id)
```
we are creating a new stream every time, and we don't really synchronize with it or return it. Is this intentional?
```python
def _to_torch_tensor(tensor_or_tl: TensorListGPU | TensorListCPU) -> torch.Tensor:
    if isinstance(tensor_or_tl, (TensorListGPU, TensorListCPU)):
        dali_tensor = tensor_or_tl.as_tensor()
    else:
        dali_tensor = tensor_or_tl

    return torch.from_dlpack(dali_tensor)
```
Are we sure the produced TensorList/Tensor stays alive long enough? Should we clone? Should we document “tensor is only valid until next pipeline run” (if that’s the case)? Just open questions
I think DLPack should receive a PyCapsule that holds the allocation and ties the reference lifetime to the returned object; the new executor is able to transfer ownership of its allocations, so it should be fine.
```python
return output


def adjust_input(func):
```
Layout handling is inconsistent across the stack (HW vs HWC for grayscale)
adjust_input converts PIL L to layout "HW" (no channel dim).
But other parts (e.g., functional.resize) treat "HW" as acceptable alongside "CHW"/"NCHW", and compute original_h/original_w accordingly.
Meanwhile PipelineHWC.run() converts PIL L into a tensor with a channel dim (unsqueeze(-1)), i.e. effectively HWC.
So grayscale is "HW" in dynamic functional path but HWC in pipeline path.
Just verifying that this is intentional. Should we document it? Should we stick to one representation for all cases?
Fixed - HWC will be used.
```python
# TODO:
# assert torch.allclose(out_tv, out_dali_tv, rtol=1, atol=1)
# assert torch.allclose(out_fn, out_dali_fn, rtol=1, atol=1)
```
do we plan to add that?
Yes, but currently changing interpolation or antialiasing returns different results. I turned it on for tensors, for now, since these tests are not running with different interpolation or antialiasing settings.
```python
def test_large_sizes_images(resize, device):
    loop_images_test(resize=resize, device=device)
```
Missing tests for:
- CUDA tensors input to Compose (and correctness / synchronization)
- Compose called repeatedly with different sizes / dtypes / devices
- grayscale layout consistency between functional and operator-based APIs
- non-uint8 dtypes, float tensors, etc.
```python
arg_rules: Sequence[ArgumentVerificationRule] = []
input_rules: Sequence[DataVerificationRule] = []
```
How about making those tuples if they are not meant to be mutable?
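A sketch of the suggestion, assuming these are class-level defaults shared across instances (class and rule names below are placeholders):

```python
class OperatorSketch:
    # Tuples as class-level defaults: immutable, so a shared default cannot
    # be mutated in place by one instance and leak into another
    arg_rules: tuple = ()
    input_rules: tuple = ()

class FlipSketch(OperatorSketch):
    # A subclass overrides by assignment instead of appending to a shared list
    arg_rules = ("VerifyProbability",)  # placeholder rule name
```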
Signed-off-by: Marek Dabek <mdabek@nvidia.com>
38f8d3c to 54a3e57 (force-pushed)
Signed-off-by: Marek Dabek <mdabek@nvidia.com>
Signed-off-by: Marek Dabek <mdabek@nvidia.com>
a1be3e2 to cbf29d1 (force-pushed)
!build

CI MESSAGE: [46236785]: BUILD STARTED
CI MESSAGE: [46236785]: BUILD PASSED
```python
total = 1
for s in shape:
    total *= s
```
nit:
```python
total = math.prod(shape)  # or np.prod(shape)
```
```python
    loop_images_test(resize=resize, device=device)


@cartesian_params((512, 1125, 2048, ([512, 512]), ([2048, 2048])), ("cpu", "gpu"))
```
nit: parentheses without comma don't do anything, do they?
```python
@cartesian_params((512, 1125, 2048, [512, 512], [2048, 2048]), ("cpu", "gpu"))
```

or

```python
@cartesian_params((512, 1125, 2048, (512, 512), (2048, 2048)), ("cpu", "gpu"))
```
```python
Returns the size in a canonical form:

- ``int`` — resize the shorter edge to this value (aspect-ratio preserving)
- ``None`` — use ``max_size`` only (resize so longer edge equals ``max_size``)
```
Great catch :)
```python
orig_h = orig_size[0]
orig_w = orig_size[1]

if isinstance(size, (tuple, list)):
```

Right, I added an exception.
Signed-off-by: Marek Dabek <mdabek@nvidia.com>
423fc2a to 19fdda1 (force-pushed)
!build

CI MESSAGE: [46425218]: BUILD STARTED
CI MESSAGE: [46425218]: BUILD PASSED
Torchvision API infrastructure
Category:
New feature
Description:
This is the first of a series of PRs implementing the Torchvision API.
The full PR has been reduced to include only the infrastructure, resize, and flip operators required for unit testing.
Additional information:
Affected modules and functionalities:
Key points relevant for the review:
Tests:
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: N/A