Here’s a faster version of your code.
**Key points to optimize:**
- The nested for-loop that manually sorts a CUDA tensor is extremely inefficient, especially on the GPU: every element-wise comparison and swap incurs kernel-launch overhead, and the loop cannot take advantage of GPU parallelism.
- The result of this manual sort is never used; the value you return is just the input Python list, sorted on the CPU.
- We should remove all computation that doesn't contribute to the function's output.
**Optimized code:**
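The code block appears to have been dropped here; below is a minimal sketch of the optimized version described in the explanation that follows, assuming the original signature was something like `def sort_list(arr)` (the function name and the exact `print` placements are hypothetical):

```python
import torch

def sort_list(arr):
    # Kept for parity with the original: a random CUDA tensor,
    # presumably left in for debugging or testing. It no longer
    # feeds any computation. (Requires a CUDA-enabled PyTorch.)
    arr1 = torch.randperm(10).cuda()
    print(arr1)

    # The expensive O(n^2) bubble sort over arr1 is removed: its
    # result was never used, so dropping it cannot change the output.

    # The actual output: sort the Python list in place on the CPU,
    # exactly as the original did.
    arr.sort()
    print(arr)
    return arr
```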
**Explanation:**
- **Removed the expensive double-loop bubble sort:** it only sorted a random tensor (`arr1`) and did not affect the final returned `arr`.
- **Kept the print statements and the `torch.randperm(10).cuda()` call:** these are presumably included for debugging or testing.
- **Kept `arr.sort()` and `return arr`:** only `arr` is output, as before.
**If `arr1` is not needed, you can remove even more:**
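A sketch of that stripped-down version (same hypothetical `sort_list` signature as above):

```python
def sort_list(arr):
    # With arr1 and the prints gone, only the output-producing
    # work remains.
    arr.sort()
    return arr
```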
But per your requirement to preserve all existing logic, we retain the random CUDA tensor creation.
**Bottom line:**
- This runs almost instantly and minimizes unnecessary computation while keeping results identical.
- If the goal is to actually sort on the GPU, consider moving `arr` to CUDA, sorting with `torch.sort`, and mapping the result back to a list; see the sketch below. But per your original code, you just sort the Python list on the CPU.
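A minimal sketch of such a GPU sort (`sort_list_gpu` is a hypothetical name; requires a CUDA-enabled build of PyTorch):

```python
import torch

def sort_list_gpu(arr):
    # Move the list into a CUDA tensor, sort on the GPU, then map
    # the result back to a Python list. This is only worthwhile for
    # large inputs; for small lists the host-device transfer
    # overhead dominates the sort itself.
    t = torch.tensor(arr, device="cuda")
    sorted_t, _ = torch.sort(t)  # torch.sort returns (values, indices)
    return sorted_t.cpu().tolist()
```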
Let me know if you’d like a CUDA-based list sort (actual array sort on the GPU).