enhancement: optimize outputs to objects metbo2xp #443
Conversation
The optimized code achieves an 87% speedup through several key optimizations:

**1. Eliminated redundant list conversions and element-wise operations**
- **Original**: `list(m.indices.detach().cpu().numpy())[0]` creates an intermediate list
- **Optimized**: Direct numpy array access `m.indices.detach().cpu().numpy()[0]`
- **Original**: List comprehension `[elem.tolist() for elem in rescale_bboxes(...)]` calls `.tolist()` on each bbox individually
- **Optimized**: Single `.tolist()` call after all tensor operations: `rescaled.tolist()`

**2. Vectorized padding adjustment**
- **Original**: Per-element subtraction `[float(elem) - shift_size for elem in bbox]` in a Python loop
- **Optimized**: Tensor-wide subtraction `rescaled = rescaled - pad` before conversion to list
- This leverages PyTorch's optimized C++ backend instead of Python loops

**3. Reduced function call overhead**
- **Original**: `objects.append()` performs an attribute lookup on each iteration
- **Optimized**: `append = objects.append` caches the method reference, eliminating repeated lookups

**4. GPU tensor optimization**
- Added `device=out_bbox.device` to the `torch.tensor()` call to avoid potential device transfer overhead

**Test case performance patterns:**
- **Small cases (single objects)**: 5-7% improvement from reduced overhead
- **Large cases (500-1000 objects)**: 160-200% improvement, since vectorized operations scale much better than element-wise Python loops
- **Mixed workloads**: Consistent improvements across all scenarios, with larger gains when more objects need processing

The optimization is particularly effective for table detection models, which typically process many bounding boxes simultaneously. (A combined sketch of these changes follows below.)
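For concreteness, here is a minimal sketch of what the combined before/after pattern could look like. The names `collect_objects`, `id2label`, `pad`, and `scale_size` are assumptions reconstructed from the description above, not taken from the actual diff:

```python
# Hedged sketch only: names and structure are assumptions based on the PR
# description, not the project's exact outputs_to_objects implementation.
import torch


def rescale_bboxes(out_bbox: torch.Tensor, size: tuple) -> torch.Tensor:
    """Convert (cx, cy, w, h) boxes to (x0, y0, x1, y1) pixel coordinates."""
    img_w, img_h = size
    x_c, y_c, w, h = out_bbox.unbind(-1)
    b = torch.stack([x_c - 0.5 * w, y_c - 0.5 * h,
                     x_c + 0.5 * w, y_c + 0.5 * h], dim=-1)
    # Optimization 4: build the scale tensor on the same device as the
    # input, avoiding an implicit cross-device transfer.
    scale = torch.tensor([img_w, img_h, img_w, img_h],
                         dtype=torch.float32, device=out_bbox.device)
    return b * scale


def collect_objects(pred_bboxes, pred_labels, pred_scores,
                    scale_size, pad, id2label):
    rescaled = rescale_bboxes(pred_bboxes, scale_size)
    # Optimization 2: one vectorized subtraction instead of a per-element
    # Python loop over every bbox coordinate.
    if pad != 0:
        rescaled = rescaled - pad
    # Optimization 1: a single .tolist() after all tensor math is done.
    bboxes = rescaled.tolist()

    objects = []
    append = objects.append  # Optimization 3: cache the bound method
    for label, score, bbox in zip(pred_labels, pred_scores, bboxes):
        append({"label": id2label[int(label)],
                "score": float(score),
                "bbox": bbox})
    return objects
```

Of the four changes, the vectorized subtraction dominates at large object counts; the cached `append` is a micro-optimization by comparison.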
@claude critique this PR please.
Claude finished @qued's task.
**Review Feedback**

This PR contains a solid optimization to the `outputs_to_objects` function.

**✅ Positive Changes**

**1. Reduced unnecessary conversions (lines 216-217)**

```python
# Before: Converting to list early
pred_labels = list(m.indices.detach().cpu().numpy())[0]
pred_scores = list(m.values.detach().cpu().numpy())[0]

# After: Keep as numpy arrays
pred_labels = m.indices.detach().cpu().numpy()[0]
pred_scores = m.values.detach().cpu().numpy()[0]
```

This eliminates redundant list conversions and keeps the data in the more efficient numpy format.

**2. Tensor vectorization for bounding box processing (lines 222-228)**

```python
# Before: List comprehension with individual tensor operations
pred_bboxes = [elem.tolist() for elem in rescale_bboxes(pred_bboxes, scale_size)]

# After: Vectorized tensor operations
rescaled = rescale_bboxes(pred_bboxes, scale_size)
if pad != 0:
    rescaled = rescaled - pad
pred_bboxes = rescaled.tolist()
```

This replaces a Python loop with vectorized tensor operations, which is more efficient.

**3. Device-aware tensor creation (line 284)**

```python
# Before: Creates tensor on default device
b = b * torch.tensor([img_w, img_h, img_w, img_h], dtype=torch.float32)

# After: Creates tensor on same device as input
b = b * torch.tensor([img_w, img_h, img_w, img_h], dtype=torch.float32, device=out_bbox.device)
```

This avoids potential device mismatches and cross-device operations.

**4. Simplified bbox adjustment logic (line 237)**

**🔍 Minor Observations**

1. Removed micro-optimization
2. Improved readability

**✅ Quality Assurance**
**📋 Version Management**

The changelog and version updates are appropriately minimal:
**✅ Overall Assessment**

This is a well-executed performance optimization that follows PyTorch best practices. The changes are focused, maintain correctness, and should provide measurable performance improvements, especially when processing on GPU. The optimization strategy of keeping data in tensor format longer is sound and aligns with the stated goal of "enabling vector operations and reducing conversions."

Recommendation: ✅ Approve
From codeflash: Optimizes table model pipeline by keeping the data as a tensor for longer, enabling vector operations and reducing conversions.
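To sanity-check the scaling claims, a rough micro-benchmark along these lines could compare the per-element shift with the vectorized one. This is an illustrative sketch, not the PR's actual test harness; the box count and pad value are arbitrary assumptions:

```python
# Rough benchmark sketch comparing the old per-element padding shift with
# the vectorized tensor subtraction. Assumption-based illustration only.
import timeit

import torch

pad = 10.0
bboxes = torch.rand(1000, 4) * 500  # 1000 boxes, mirroring the "large case"


def loop_shift():
    # Old approach: nested Python loop, one float() call per coordinate.
    return [[float(coord) - pad for coord in bbox] for bbox in bboxes]


def vectorized_shift():
    # New approach: one tensor op, then a single .tolist().
    return (bboxes - pad).tolist()


print("loop:      ", timeit.timeit(loop_shift, number=100))
print("vectorized:", timeit.timeit(vectorized_shift, number=100))
```

On typical hardware the vectorized version should pull further ahead as the number of boxes grows, consistent with the reported pattern of small gains for single objects and much larger gains for 500-1000 objects.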