From 5b1970cf7a17899934e2aff31bd2cf685e34da27 Mon Sep 17 00:00:00 2001
From: "codeflash-ai[bot]"
 <148906541+codeflash-ai[bot]@users.noreply.github.com>
Date: Wed, 14 Jan 2026 09:25:58 +0000
Subject: [PATCH] Optimize RandomThinPlateSpline.generate_parameters

The optimized code achieves a **21% speedup** by eliminating redundant tensor creation in the hot path.

**Key Optimization:**
The source control points template (a fixed 5x2 tensor with values `[[-1,-1], [-1,1], [1,-1], [1,1], [0,0]]`) was previously created from scratch on every call to `generate_parameters()`. The optimization **pre-creates this tensor once** during `__init__` and stores it as `self._src_template`, then simply copies it to the target device/dtype on each call.

**Why This Is Faster:**
- **Reduced object creation overhead**: `torch.tensor()` involves parsing Python lists, allocating memory, and initializing data. By doing this once instead of per-call, we eliminate ~17-18% of the function's time (line profiler shows the original `torch.tensor()` call took 17.2% + 10.8% = 28% total time).
- **Simpler operation path**: The `.to()` method on an existing tensor is faster than constructing a new tensor from Python literals.
- **Memory efficiency**: Only one template tensor exists in memory instead of creating temporary tensors per call.

**Performance Characteristics:**
- The optimization is most effective for **workloads with frequent calls** to `generate_parameters()` - evident from test cases showing 32-74% speedup on repeated calls (e.g., `test_generate_parameters_repeatability_same_input` shows 42.4% faster on second call).
- **Batch size agnostic**: The speedup is consistent across different batch sizes since the template is expanded, not the creation overhead.
- **Minimal impact on edge cases**: Tests with batch_size=0 show slight slowdown (6.2%), but this is negligible compared to typical use cases.

**Impact on Workloads:**
Since `generate_parameters()` is called during augmentation pipelines, this optimization directly reduces latency in data preprocessing - particularly valuable in training loops where augmentations are applied per-batch. The 21% speedup translates to faster data loading without any change to augmentation quality or behavior.
---
 .../augmentation/_2d/geometric/thin_plate_spline.py   | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/kornia/augmentation/_2d/geometric/thin_plate_spline.py b/kornia/augmentation/_2d/geometric/thin_plate_spline.py
index d21ee534d9f..ce63e59cc45 100644
--- a/kornia/augmentation/_2d/geometric/thin_plate_spline.py
+++ b/kornia/augmentation/_2d/geometric/thin_plate_spline.py
@@ -70,6 +70,11 @@ def __init__(
             "padding_mode": SamplePadding.get(padding_mode),
         }
         self.dist = torch.distributions.Uniform(-scale, scale)
+        # Pre-create the source control points template
+        self._src_template = torch.tensor(
+            [[[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [0.0, 0.0]]],
+            dtype=torch.float32,
+        )
 
     def generate_parameters(self, shape: Tuple[int, ...]) -> Dict[str, torch.Tensor]:
         B, _, _, _ = shape
@@ -78,11 +83,7 @@ def generate_parameters(self, shape: Tuple[int, ...]) -> Dict[str, torch.Tensor]
         dtype = self.dtype
 
         # 5 TPS control points in normalized coordinates
-        src = torch.tensor(
-            [[[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [0.0, 0.0]]],
-            device=device,
-            dtype=dtype,
-        ).expand(B, 5, 2)
+        src = self._src_template.to(device=device, dtype=dtype).expand(B, 5, 2)
 
         if self.same_on_batch:
             noise = self.dist.rsample((1, 5, 2)).to(device=device, dtype=dtype)