
Conversation

qued (Contributor) commented Oct 7, 2025

From codeflash: Optimizes the table model pipeline by keeping the data as a tensor for longer, enabling vector operations and reducing conversions.

codeflash-ai bot and others added 2 commits August 27, 2025 01:54
The optimized code achieves an 87% speedup through several key optimizations:

**1. Eliminated redundant list conversions and element-wise operations**
- **Original**: `list(m.indices.detach().cpu().numpy())[0]` creates an intermediate list
- **Optimized**: Direct numpy array access `m.indices.detach().cpu().numpy()[0]`
- **Original**: List comprehension `[elem.tolist() for elem in rescale_bboxes(...)]` calls `.tolist()` on each bbox individually
- **Optimized**: Single `.tolist()` call after all tensor operations: `rescaled.tolist()`

**2. Vectorized padding adjustment**
- **Original**: Per-element subtraction `[float(elem) - shift_size for elem in bbox]` in Python loop
- **Optimized**: Tensor-wide subtraction `rescaled = rescaled - pad` before conversion to list
- This leverages PyTorch's optimized C++ backend instead of Python loops

**3. Reduced function call overhead**
- **Original**: `objects.append()` performs attribute lookup on each iteration
- **Optimized**: `append = objects.append` caches the method reference, eliminating repeated lookups
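
The cached-append pattern referenced here, shown in isolation as a minimal sketch (the `objects`/`detections` names are illustrative, not taken from the PR; in modern CPython the gain is usually marginal):

```python
# Cache the bound method once, outside the loop, so the attribute lookup
# objects.append is not repeated on every iteration.
objects = []
append = objects.append

detections = [{"label": i, "score": 0.9} for i in range(1000)]  # stand-in data
for det in detections:
    append(det)
```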

**4. GPU tensor optimization**  
- Added `device=out_bbox.device` parameter to `torch.tensor()` creation to avoid potential device transfer overhead

**Test case performance patterns:**
- **Small cases (single objects)**: 5-7% improvement from reduced overhead
- **Large cases (500-1000 objects)**: 160-200% improvement due to vectorized operations scaling much better than element-wise Python loops
- **Mixed workloads**: Consistent improvements across all scenarios, with larger gains when more objects need processing

The optimization is particularly effective for table detection models that typically process many bounding boxes simultaneously.

qued (Contributor, Author) commented Oct 7, 2025

@claude critique this PR please.


claude bot commented Oct 8, 2025

Claude finished @qued's task.


PR Review Progress ✅

Reviewing the optimization changes in this PR:

  • Read repository guidelines (CLAUDE.md)
  • Analyze the core optimization in tables.py
  • Review version bump and changelog updates
  • Check for potential issues or improvements
  • Provide comprehensive review feedback

Review Feedback

This PR contains a solid optimization to the outputs_to_objects function in unstructured_inference/models/tables.py. The changes focus on keeping tensor operations on GPU/device longer before converting to CPU, which should reduce unnecessary data transfers and conversions.

✅ Positive Changes

1. Reduced unnecessary conversions (lines 216-217)

```python
# Before: Converting to list early
pred_labels = list(m.indices.detach().cpu().numpy())[0]
pred_scores = list(m.values.detach().cpu().numpy())[0]

# After: Keep as numpy arrays
pred_labels = m.indices.detach().cpu().numpy()[0]
pred_scores = m.values.detach().cpu().numpy()[0]
```

This eliminates redundant list conversions and keeps data in more efficient numpy format.
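
To make the equivalence concrete, here is a small sketch (hypothetical array shape, not the repository's data) showing that both forms yield the same first row while the new one skips the intermediate Python list:

```python
import numpy as np

m_indices = np.arange(12).reshape(1, 12)  # stand-in for a (1, N) prediction array

old = list(m_indices)[0]  # builds a Python list of rows, then takes the first
new = m_indices[0]        # indexes the array directly, no intermediate list

assert np.array_equal(old, new)
```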

2. Tensor vectorization for bounding box processing (lines 222-228)

```python
# Before: List comprehension with individual tensor operations
pred_bboxes = [elem.tolist() for elem in rescale_bboxes(pred_bboxes, scale_size)]

# After: Vectorized tensor operations
rescaled = rescale_bboxes(pred_bboxes, scale_size)
if pad != 0:
    rescaled = rescaled - pad
pred_bboxes = rescaled.tolist()
```

This replaces a Python loop with vectorized tensor operations, which is more efficient.
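
A rough micro-benchmark along these lines (hypothetical box count and pad value, not taken from the PR) is one way to sanity-check that claim; absolute numbers will vary by machine:

```python
import timeit

import torch

boxes = torch.rand(1000, 4) * 500  # stand-in for rescaled bounding boxes
pad = 5.0

def per_element():
    # Roughly the old style: per-value Python-level subtraction
    return [[float(v) - pad for v in box] for box in boxes.tolist()]

def vectorized():
    # The new style: one tensor op, then a single .tolist()
    return (boxes - pad).tolist()

print("per-element:", timeit.timeit(per_element, number=100))
print("vectorized: ", timeit.timeit(vectorized, number=100))
```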

3. Device-aware tensor creation (line 284)

```python
# Before: Creates tensor on default device
b = b * torch.tensor([img_w, img_h, img_w, img_h], dtype=torch.float32)

# After: Creates tensor on same device as input
b = b * torch.tensor([img_w, img_h, img_w, img_h], dtype=torch.float32, device=out_bbox.device)
```

This avoids potential device mismatches and cross-device operations.
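
As a standalone illustration of why the device argument matters (hypothetical shapes and sizes, not the repository's code), building the scale factor on the boxes' device keeps the multiply on a single device with no implicit host-to-device copy:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
out_bbox = torch.rand(100, 4, device=device)  # stand-in for model output boxes
img_w, img_h = 640, 480

# Creating the scale tensor on out_bbox.device avoids a hidden transfer
# (or a device-mismatch error) when the boxes live on the GPU.
scale = torch.tensor([img_w, img_h, img_w, img_h], dtype=torch.float32,
                     device=out_bbox.device)
rescaled = out_bbox * scale
```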

4. Simplified bbox adjustment logic (line 237)
The padding adjustment is now handled in the vectorized rescaling step rather than during individual bbox processing.

🔍 Minor Observations

1. Removed micro-optimization
The removal of `append = objects.append` (line 230) is fine: this micro-optimization provides negligible benefit in modern Python and makes the code less readable.

2. Improved readability
The padding logic is now more explicit and easier to follow with the conditional check `if pad != 0:`.

✅ Quality Assurance

  • Correctness: All mathematical operations remain equivalent
  • Performance: Should see meaningful speedup, especially with GPU tensors
  • Memory: Reduced intermediate allocations and copies
  • Compatibility: Changes are internal implementation details that don't affect API
  • Error handling: Maintains existing error handling patterns

📋 Version Management

The changelog and version updates are appropriately minimal:

  • Version bumped to 1.0.8-dev0
  • Changelog entry correctly attributes the optimization

✅ Overall Assessment

This is a well-executed performance optimization that follows PyTorch best practices. The changes are focused, maintain correctness, and should provide measurable performance improvements, especially when processing on GPU. The optimization strategy of keeping data in tensor format longer is sound and aligns with the stated goal of "enabling vector operations and reducing conversions."

Recommendation: ✅ Approve


@qued qued requested a review from ajjimeno October 8, 2025 14:05
@qued qued enabled auto-merge (squash) October 8, 2025 14:06
@qued qued merged commit 0c9a78b into main Oct 8, 2025
13 checks passed
@qued qued deleted the codeflash/optimize-outputs_to_objects-metbo2xp branch October 8, 2025 18:02