
Conversation

@rostan-t
Collaborator

Category: Bug fix (non-breaking change which fixes an issue)

Description:

This PR enables arithmetic operations between tensors/batches and scalars. Previously, x + n worked if x was a CPU tensor and n a scalar, but not if x was a GPU tensor.

Arithmetic operations between a GPU tensor/batch and a Python list or tuple are also supported; tensor types need to be copied explicitly.
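For illustration, a minimal usage sketch of the fixed behavior; the ndd alias follows the examples in the review below, and the exact module path is an assumption based on the file layout in this PR:

import nvidia.dali.experimental.dynamic as ndd

x = ndd.tensor([1, 2, 3], device="gpu")
y = x + 5              # scalar is implicitly broadcast to a GPU tensor
z = x * [2, 2, 2]      # lists and tuples are implicitly converted too
print(y.cpu(), z.cpu())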

Additional information:

Affected modules and functionalities:

Dynamic mode tensors and batches.

Key points relevant for the review:

Do the arithmetic operations work as intended?

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-4545

@greptile-apps

greptile-apps bot commented Dec 19, 2025

Greptile Summary

This PR fixes a critical bug that prevented arithmetic operations between GPU tensors/batches and scalars, and refactors the arithmetic operation handling logic for better maintainability.

Key Changes:

  • Fixed TensorListGPU.broadcast method in C++ backend that was incorrectly using CPUBackend types instead of GPUBackend types
  • Extracted arithmetic operation logic into a new _arithmetic.py module, removing duplication from _tensor.py and _batch.py
  • Enhanced device handling to properly convert scalars to GPU tensors when operating with GPU tensors
  • Added explicit validation for implicitly convertible types (numbers, lists, tuples) when targeting GPU
  • Improved error messages for mixed CPU/GPU operations
  • Added comprehensive tests covering scalar operations on both CPU and GPU tensors/batches, plus device compatibility checks

Confidence Score: 5/5

  • This PR is safe to merge - it fixes critical bugs, improves code organization, and includes thorough test coverage
  • The changes are well-structured bug fixes with comprehensive tests. The C++ fix corrects a clear type mismatch bug, and the Python refactoring improves code organization while maintaining backward compatibility. All changes follow existing patterns and include extensive test coverage for the new functionality.
  • No files require special attention

Important Files Changed

Filename | Overview
dali/python/backend_impl.cc | Fixed critical bug in TensorListGPU.broadcast where CPU backend types were incorrectly used instead of GPU backend types
dali/python/nvidia/dali/experimental/dynamic/_arithmetic.py | New module extracting arithmetic operation logic with improved device handling and explicit scalar-to-tensor conversion
dali/test/python/experimental_mode/test_arithm_ops.py | Added comprehensive tests for scalar arithmetic operations on GPU/CPU tensors and batches, plus device compatibility validation

Sequence Diagram

sequenceDiagram
    participant User
    participant Tensor/Batch
    participant _arithm_op
    participant as_tensor
    participant _arithmetic_generic_op

    User->>Tensor/Batch: x + scalar (e.g., gpu_tensor + 5)
    Tensor/Batch->>_arithm_op: __add__(self, scalar)
    
    Note over _arithm_op: Detect device type from<br/>Tensor/Batch arguments
    _arithm_op->>_arithm_op: gpu = any(arg.device == "gpu")
    
    alt scalar argument (not Tensor/Batch)
        _arithm_op->>_arithm_op: Check _implicitly_convertible(scalar)
        alt gpu is True and not implicitly convertible
            _arithm_op-->>User: ValueError: Not implicitly copyable to GPU
        else convertible
            _arithm_op->>as_tensor: as_tensor(scalar, device="gpu" if gpu else None)
            as_tensor-->>_arithm_op: Tensor on correct device
        end
    end
    
    _arithm_op->>_arithm_op: Validate all args on same device
    alt mixed GPU/CPU inputs
        _arithm_op-->>User: ValueError: Cannot mix GPU and CPU inputs
    else all same device
        _arithm_op->>_arithmetic_generic_op: Call with tensor args
        _arithmetic_generic_op-->>User: Result tensor
    end


@greptile-apps greptile-apps bot left a comment


Additional Comments (3)

  1. dali/python/nvidia/dali/experimental/dynamic/_arithmetic.py, line 36 (link)

    syntax: Typo: "implictly" should be "implicitly"

  2. dali/python/nvidia/dali/experimental/dynamic/_arithmetic.py, line 30-46 (link)

    logic: Logic bug: when a scalar appears before a GPU tensor (e.g., 3 + gpu_tensor), the scalar gets converted to CPU before detecting the GPU tensor, causing the final check to fail with "Cannot mix GPU and CPU inputs."

    The algorithm needs two passes (a sketch follows this comment list):

    1. First pass: scan all args to detect whether any GPU tensor exists
    2. Second pass: convert scalars to the appropriate device

    Example failure case:

    gpu_tensor = ndd.tensor([1, 2, 3], device="gpu")
    result = 5 + gpu_tensor  # Will raise ValueError
  3. dali/test/python/experimental_mode/test_arithm_ops.py, line 128 (link)

    style: Test only covers tensor + scalar but not scalar + tensor. Add reverse operation tests:

    # Also test scalar + tensor
    result_reversed = ndd.as_tensor(apply_bin_op(op, scalar, x))
    ref_reversed = apply_bin_op(op, scalar, tensor)
    if not np.allclose(result_reversed.cpu(), ref_reversed):
        msg = f"{scalar} {op} {tensor} = \n{result_reversed}\n!=\n{ref_reversed}"
        raise AssertionError(msg)
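A minimal sketch of the two-pass structure suggested in comment 2, reusing the names that appear in this PR (_implicitly_convertible, as_tensor, _arithmetic_generic_op, Tensor, Batch); the exact signature and error messages are assumptions:

def _arithm_op(op, *args):
    # Pass 1: detect whether any operand already lives on the GPU.
    gpu = any(isinstance(a, (Tensor, Batch)) and a.device == "gpu" for a in args)
    converted = []
    # Pass 2: convert non-tensor operands to the device found above.
    for arg in args:
        if not isinstance(arg, (Tensor, Batch)):
            if gpu and not _implicitly_convertible(arg):
                raise ValueError(f"Type {type(arg)} is not implicitly copyable to the GPU.")
            arg = as_tensor(arg, device="gpu" if gpu else None)
        converted.append(arg)
    if any(a.device != converted[0].device for a in converted):
        raise ValueError("Cannot mix GPU and CPU inputs.")
    return _arithmetic_generic_op(op, *converted)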

5 files reviewed, 3 comments


@mzient mzient self-assigned this Dec 19, 2025

@greptile-apps greptile-apps bot left a comment


Additional Comments (1)

  1. dali/python/nvidia/dali/experimental/dynamic/_arithmetic.py, line 20 (link)

    syntax: using the | operator in isinstance() requires Python 3.10+, but pyproject.toml targets Python 3.8+. Use tuple syntax instead for compatibility (see the sketch below).
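For illustration, the two spellings (Tensor and Batch stand in for whatever types the flagged line actually tests):

isinstance(value, Tensor | Batch)    # union syntax, Python 3.10+ only
isinstance(value, (Tensor, Batch))   # tuple syntax, works on Python 3.8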

5 files reviewed, 1 comment




def _implicitly_convertible(value: Any):
    return isinstance(value, (SupportsInt, SupportsFloat, list, tuple))
Contributor


Hmmm.... I don't think this'll work:

>>> isinstance(cp.array([0,0]), typing.SupportsFloat)  # CuPy, not a scalar
True
>>> isinstance(np.array([0,0]), typing.SupportsFloat)  # NumPy, not a scalar
True
>>> isinstance(np.array(["a","b"]), typing.SupportsFloat)  # NumPy, not a scalar, not even a numeric type
True

Collaborator Author


Fixed by 9319fa9. Now:

>>> isinstance(cp.array([0, 0]), numbers.Real)
False
>>> isinstance(np.array([0, 0]), numbers.Real)
False
>>> isinstance(np.int64(0), numbers.Real)
True
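A plausible shape of the fixed check, inferred from the REPL session above; the exact body of 9319fa9 is an assumption, with list/tuple support kept per the summaries below:

import numbers

def _implicitly_convertible(value):
    # numbers.Real rejects NumPy/CuPy arrays while still accepting
    # NumPy scalars such as np.int64, as the session above shows.
    return isinstance(value, (numbers.Real, list, tuple))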

@rostan-t rostan-t force-pushed the ndd-gpu-arithmetic branch from 8d39fde to 9319fa9 on January 2, 2026 at 18:36
@rostan-t
Collaborator Author

rostan-t commented Jan 2, 2026

!build

@dali-automaton
Collaborator

CI MESSAGE: [41073770]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [41073770]: BUILD PASSED

Comment on lines 36 to 37
if gpu and not _implicitly_convertible(arg):
    raise ValueError(f"Type {type(arg)} is not implicitly copyable to the GPU.")
Contributor

@mzient mzient Jan 9, 2026


What happens if arg is a CuPy array or a torch cuda tensor? Normally they would get converted to a GPU Tensor, but now we'd throw.

Collaborator Author

@rostan-t rostan-t Jan 9, 2026


Fixed by bcefb0d. This also raises an exception earlier in cases like (ndd GPU tensor) + (PyTorch CPU tensor).
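A hedged illustration of the intended behavior after bcefb0d, assuming PyTorch and the ndd alias used elsewhere in this thread; the exact exception type is an assumption:

import torch

x = ndd.tensor([1, 2, 3], device="gpu")
x + torch.tensor([1, 2, 3], device="cuda")  # CUDA tensor: converted to a GPU Tensor, computed on GPU
x + torch.tensor([1, 2, 3])                 # CPU torch tensor mixed with GPU input: raises early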

@rostan-t rostan-t force-pushed the ndd-gpu-arithmetic branch from 03f9013 to fcf79a4 on January 9, 2026 at 16:48

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

Overview

This PR addresses a critical bug preventing arithmetic operations between GPU tensors/batches and scalars. It refactors arithmetic handling into a dedicated module and fixes a backend implementation bug.

Key Changes

1. Critical C++ Backend Fix (backend_impl.cc)

Fixed TensorListGPU.broadcast() method which was incorrectly using CPUBackend type parameters instead of GPUBackend. This prevented GPU tensor broadcasting from working at all. The fix is minimal and correct.

2. New _arithmetic.py Module

Extracted arithmetic operation logic into a centralized module to reduce duplication. The module implements:

  • _implicitly_convertible(): Validates scalar types (numbers, lists, tuples)
  • _arithm_op(): Handles argument conversion and device consistency checking

3. Refactored _tensor.py and _batch.py

Both files now delegate arithmetic operations to the new _arithm_op() function instead of inline implementations.

4. Test Coverage

Added comprehensive tests (test_binary_scalars) covering:

  • Scalar operations on both CPU and GPU tensors
  • Scalar operations on GPU batches with different batch sizes
  • Device mixing detection and error handling
  • PyTorch GPU tensor interoperability

Critical Issue Found

Logic Bug in _arithmetic.py (Lines 34-38)

The scalar-to-tensor conversion logic contains a critical structural flaw:

if not isinstance(arg, (Tensor, Batch)):
    if gpu and _implicitly_convertible(arg):
        arg = as_tensor(arg, device="gpu")
    arg = as_tensor(arg)  # ← ALWAYS EXECUTES (wrong!)

Line 38 should be in an else clause. Currently, every scalar is converted twice: once to the correct device, then again to the default device. The code works by accident because as_tensor() preserves the device of existing Tensor objects, but this is fragile and violates the principle of least surprise.
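The restructuring the review asks for would look roughly like this (same names as the quoted snippet):

if not isinstance(arg, (Tensor, Batch)):
    if gpu and _implicitly_convertible(arg):
        arg = as_tensor(arg, device="gpu")
    else:
        arg = as_tensor(arg)  # CPU path; no longer re-converts an already-placed tensor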

Positive Aspects

  • C++ backend fix is correct and necessary
  • Refactoring improves code maintainability
  • Test coverage for new functionality is good
  • Device validation logic correctly prevents mixed CPU/GPU operations
  • Batch broadcasting now works correctly for GPU tensors

Recommendations

The _arithmetic.py logic should be corrected to explicitly handle the scalar conversion decision, making the intent clear and eliminating accidental reliance on as_tensor() device preservation semantics.

Confidence Score: 2/5

  • This PR should NOT be merged in its current state due to a critical logic bug in _arithmetic.py that violates code quality standards, despite the bug working accidentally in practice.
  • The C++ backend fix (TensorListGPU.broadcast) is correct and critical. However, the _arithmetic.py module contains a structural logic error where line 38 always executes regardless of the condition at line 36. This means scalars are converted twice - once to the correct device, then again unconditionally. While this works due to as_tensor() preserving device for existing Tensor objects, it's fragile, poorly written, and could break if as_tensor() behavior changes. The code violates Python style expectations and makes the intent unclear. This is a moderate issue that doesn't cause runtime failures in current code but represents poor engineering practice and technical debt. The test coverage added is good, but doesn't validate the implementation quality.
  • dali/python/nvidia/dali/experimental/dynamic/_arithmetic.py - Lines 34-38 need restructuring to clarify intent and eliminate accidental reliance on as_tensor() device preservation behavior

Important Files Changed

File Analysis

Filename | Score | Overview
dali/python/nvidia/dali/experimental/dynamic/_arithmetic.py | 2/5 | New module that centralizes arithmetic operation logic. Contains a critical logic bug at line 38 where scalar conversion happens twice due to incorrect indentation; line 38 should be in an else clause. The code works by accident due to as_tensor() preserving device for existing Tensor objects, but is fragile and error-prone.
dali/python/backend_impl.cc | 5/5 | Critical bug fix in TensorListGPU.broadcast method (lines 1612-1613). Corrected type parameters from CPUBackend to GPUBackend, allowing GPU tensor broadcasting to work correctly. Fix is correct and minimal.
dali/python/nvidia/dali/experimental/dynamic/_tensor.py | 5/5 | Updated to import and use the new _arithmetic module for all arithmetic operations. Changes are clean refactoring from inline implementation to delegating to _arithm_op. No new issues introduced.
dali/python/nvidia/dali/experimental/dynamic/_batch.py | 5/5 | Updated to import and use the new _arithmetic module for all arithmetic operations. The broadcast method correctly detects device type and uses the appropriate TensorList backend. Changes are clean refactoring with no new issues.
dali/test/python/experimental_mode/test_arithm_ops.py | 4/5 | New tests added for scalar arithmetic operations on both CPU and GPU tensors/batches, and device compatibility checks. Tests are comprehensive but could benefit from additional edge case coverage (e.g., complex numbers, boolean operations with scalars).

Sequence Diagram

sequenceDiagram
    participant User
    participant Tensor as GPU Tensor
    participant _arithm_op as _arithm_op()
    participant as_tensor as as_tensor()
    participant Backend as Backend

    User->>Tensor: tensor + 3 (scalar)
    Tensor->>_arithm_op: __add__(self, 3)
    
    _arithm_op->>_arithm_op: gpu = True (detect GPU)
    _arithm_op->>_arithm_op: arg = 3 (scalar)
    
    Note over _arithm_op: Check: gpu and<br/>_implicitly_convertible(3)?
    _arithm_op->>as_tensor: as_tensor(3, device='gpu')
    as_tensor-->>_arithm_op: GPU Tensor
    
    Note over _arithm_op: BUG: Line 38<br/>ALWAYS executes
    _arithm_op->>as_tensor: as_tensor(GPU_Tensor)
    as_tensor-->>_arithm_op: GPU Tensor (device preserved)
    
    _arithm_op->>_arithm_op: Validate: GPU == GPU ✓
    _arithm_op->>Backend: _arithmetic_generic_op(tensor, scalar)
    Backend-->>_arithm_op: Result Tensor
    _arithm_op-->>Tensor: Result
    Tensor-->>User: GPU Tensor result
    
    rect rgb(255, 200, 200)
        Note over _arithm_op: Works by accident:<br/>as_tensor() preserves<br/>device for Tensor objects
    end


@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR successfully enables arithmetic operations between GPU tensors/batches and scalars, fixing a critical bug and improving code organization.

Key Changes:

  • Fixed critical bug in C++ backend where TensorListGPU.broadcast incorrectly used CPUBackend types
  • Extracted arithmetic operation logic into new _arithmetic.py module, eliminating code duplication between _tensor.py and _batch.py
  • Enhanced device handling to properly detect GPU context and convert scalars/lists/tuples to GPU tensors when operating with GPU data
  • Added comprehensive test coverage for scalar operations on both CPU and GPU, plus device compatibility validation

Improvements Over Previous Implementation:
The new _arithm_op function correctly handles the case where scalars appear before GPU tensors in operations (e.g., 5 + gpu_tensor). The old implementation would convert the scalar to CPU before detecting the GPU tensor, leading to device mismatch errors or incorrect behavior.
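For example (sketch; reuses the hypothetical ndd alias from the failure case quoted earlier in this thread):

gpu_tensor = ndd.tensor([1, 2, 3], device="gpu")
result = 5 + gpu_tensor  # reflected operator: the scalar now follows the GPU operand
print(result.cpu())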

Confidence Score: 5/5

  • This PR is safe to merge - it fixes critical bugs, improves code organization, and has comprehensive test coverage
  • The changes fix a clear copy-paste bug in the C++ backend, successfully refactor duplicated Python code into a shared module, and properly handle GPU/CPU device detection. The only minor issue is a small code structure inefficiency that doesn't affect correctness. Comprehensive tests validate the new functionality.
  • No files require special attention - all changes are well-implemented

Important Files Changed

File Analysis

Filename | Score | Overview
dali/python/backend_impl.cc | 5/5 | Fixed critical bug in TensorListGPU.broadcast that incorrectly used CPUBackend types instead of GPUBackend types
dali/python/nvidia/dali/experimental/dynamic/_arithmetic.py | 4/5 | New module extracting arithmetic logic; correctly handles GPU/CPU device detection and scalar-to-tensor conversion, with minor code structure inefficiency
dali/python/nvidia/dali/experimental/dynamic/_batch.py | 5/5 | Refactored to use centralized _arithm_op from _arithmetic module, removing duplicated code
dali/python/nvidia/dali/experimental/dynamic/_tensor.py | 5/5 | Refactored to use centralized _arithm_op from _arithmetic module, removing duplicated code
dali/test/python/experimental_mode/test_arithm_ops.py | 5/5 | Added comprehensive tests for scalar operations on GPU/CPU tensors and batches, plus device compatibility validation

Sequence Diagram

sequenceDiagram
    participant User
    participant Tensor/Batch
    participant _arithm_op
    participant as_tensor
    participant _arithmetic_generic_op
    participant Backend

    User->>Tensor/Batch: gpu_tensor + scalar
    Tensor/Batch->>_arithm_op: __add__(scalar)
    
    Note over _arithm_op: Detect GPU from args
    _arithm_op->>_arithm_op: Check if any arg is GPU tensor
    _arithm_op->>_arithm_op: gpu = True
    
    Note over _arithm_op: Convert scalar to tensor
    _arithm_op->>as_tensor: as_tensor(scalar, device="gpu")
    as_tensor->>Backend: TensorListGPU.broadcast(tensor, 1)
    Backend-->>as_tensor: GPU tensor
    as_tensor-->>_arithm_op: GPU scalar tensor
    
    _arithm_op->>_arithm_op: Validate devices match
    _arithm_op->>_arithmetic_generic_op: Execute operation
    _arithmetic_generic_op->>Backend: Run GPU arithmetic op
    Backend-->>_arithmetic_generic_op: Result tensor
    _arithmetic_generic_op-->>User: GPU result tensor

@rostan-t rostan-t force-pushed the ndd-gpu-arithmetic branch from fcf79a4 to bcefb0d on January 9, 2026 at 16:55

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR fixes a critical bug preventing arithmetic operations between GPU tensors/batches and scalars, and refactors the arithmetic operation logic for better maintainability.

Key Changes:

  • Fixed TensorListGPU.broadcast in C++ backend that incorrectly used CPUBackend types instead of GPUBackend types, which was the root cause preventing GPU tensor operations
  • Extracted arithmetic operation handling into new _arithmetic.py module, eliminating code duplication between _tensor.py and _batch.py
  • Implemented smart device detection: when any operand is a GPU tensor, implicitly convertible scalars (numbers, lists, tuples) are automatically converted to GPU tensors
  • Added validation to prevent mixing CPU and GPU operands, providing clear error messages
  • Added comprehensive test coverage for scalar operations on both devices and device compatibility checks

Issues Found:

  • One syntax error in _tensor.py line 71: missing f prefix for f-string formatting

Confidence Score: 4/5

  • This PR is safe to merge after fixing the f-string syntax error - it addresses a critical bug and improves code quality
  • Score of 4 reflects one syntax error that must be fixed before merging. The core logic is sound: the C++ backend fix is correct, the arithmetic refactoring properly handles device placement, and tests comprehensively cover the new functionality. The syntax error on line 71 is in an error path that likely wasn't hit by tests.
  • Fix the f-string syntax error in dali/python/nvidia/dali/experimental/dynamic/_tensor.py line 71 before merging

Important Files Changed

File Analysis

Filename | Score | Overview
dali/python/backend_impl.cc | 5/5 | Critical bug fix: corrected TensorListGPU.broadcast to use GPUBackend instead of CPUBackend; this was preventing GPU tensor broadcasting from working
dali/python/nvidia/dali/experimental/dynamic/_arithmetic.py | 5/5 | New module extracting arithmetic operation logic with proper GPU/CPU device handling and implicit scalar-to-tensor conversion
dali/python/nvidia/dali/experimental/dynamic/_tensor.py | 4/5 | Refactored to use shared _arithm_op from _arithmetic.py; contains one f-string syntax error on line 71
dali/python/nvidia/dali/experimental/dynamic/_batch.py | 5/5 | Refactored to use shared _arithm_op from _arithmetic.py, removing duplicate code and improving maintainability
dali/test/python/experimental_mode/test_arithm_ops.py | 5/5 | Added comprehensive tests for scalar operations with tensors/batches on both CPU and GPU, plus device compatibility validation tests

Sequence Diagram

sequenceDiagram
    participant User
    participant Tensor/Batch
    participant _arithm_op
    participant as_tensor
    participant _arithmetic_generic_op
    participant Backend

    User->>Tensor/Batch: gpu_tensor + scalar
    Tensor/Batch->>_arithm_op: __add__(scalar)
    
    Note over _arithm_op: Detect GPU device from<br/>existing Tensor/Batch args
    
    _arithm_op->>_arithm_op: Check if scalar is<br/>implicitly_convertible<br/>(Real, list, tuple)
    
    alt Scalar is implicitly convertible & GPU detected
        _arithm_op->>as_tensor: Convert scalar to GPU tensor
        as_tensor-->>_arithm_op: GPU Tensor
    else Otherwise
        _arithm_op->>as_tensor: Convert scalar to CPU tensor
        as_tensor-->>_arithm_op: CPU Tensor
    end
    
    _arithm_op->>_arithm_op: Validate all args<br/>are on same device
    
    alt Device mismatch
        _arithm_op-->>User: ValueError: Cannot mix GPU and CPU
    else All devices match
        _arithm_op->>_arithmetic_generic_op: Execute operation
        _arithmetic_generic_op->>Backend: TensorListGPU.broadcast<br/>(now uses GPUBackend)
        Backend-->>_arithmetic_generic_op: Result
        _arithmetic_generic_op-->>User: Tensor/Batch result
    end

    return arr.astype(np.int32), nvidia.dali.types.IMAGE_TYPE
else:
-   raise TypeError(f"Unexpected element type {type(item)}")
+   raise TypeError("Unexpected element type f{type(item)}")

Missing f prefix for f-string.

Suggested change
-   raise TypeError("Unexpected element type f{type(item)}")
+   raise TypeError(f"Unexpected element type {type(item)}")

Collaborator Author


This is not related to this PR and I fixed it in another PR already.

@rostan-t
Collaborator Author

rostan-t commented Jan 9, 2026

!build

@dali-automaton
Collaborator

CI MESSAGE: [41436981]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [41436981]: BUILD FAILED
