Reduce threading scheduler contention for smoothing filter#3280

Merged
daljit46 merged 1 commit into dev from two_inner_axes_smooth_filter
Mar 20, 2026

@daljit46
Member

@daljit46 daljit46 commented Mar 19, 2026

@Lestropie recently pointed out to me that threaded_copy can be quite slow when the work done per voxel is trivial, because thread management overhead may dominate. This can easily be mitigated by specifying two or more inner axes.

While profiling mrregister (for comparison with my GPU registration work), I was reminded of this when I noticed that Filter::Smooth::operator() was in the hot path of the code.

In smooth.h, we have the following code:

ThreadedLoop(in_and_output, axes, 1).run(smooth, in_and_output);

It turns out the same idea applies here. Profiling confirmed that the current strategy was causing millions of scheduler lock acquisitions for large images when running mrregister. This PR substantially improves the situation by using two inner axes when possible to increase chunk size from single lines to small slices of the image. The result is less scheduler contention and lower OS overhead.
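To make the chunk-size argument concrete, here is a minimal sketch of the arithmetic. It is not MRtrix3 code: `count_dispatches` and `voxels_per_dispatch` are hypothetical helpers, and the sketch assumes that the first `num_inner_axes` axes are iterated entirely within one job, that the remaining outer axes are handed out one position at a time, and that each job costs roughly one scheduler lock acquisition.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of MRtrix3): number of jobs the scheduler
// hands out when the first `num_inner_axes` axes are iterated inside each job
// and every position along the remaining outer axes is a separate job.
std::size_t count_dispatches(const std::vector<std::size_t>& axis_sizes,
                             std::size_t num_inner_axes) {
  std::size_t jobs = 1;
  for (std::size_t i = num_inner_axes; i < axis_sizes.size(); ++i)
    jobs *= axis_sizes[i];  // one job per position along each outer axis
  return jobs;
}

// Hypothetical helper: voxels processed per job, i.e. the product of the
// inner axis sizes (a line for one inner axis, a slice for two).
std::size_t voxels_per_dispatch(const std::vector<std::size_t>& axis_sizes,
                                std::size_t num_inner_axes) {
  std::size_t voxels = 1;
  for (std::size_t i = 0; i < num_inner_axes && i < axis_sizes.size(); ++i)
    voxels *= axis_sizes[i];
  return voxels;
}
```

Under these assumptions, a 256 x 256 x 256 image needs 65536 jobs per pass with one inner axis but only 256 with two, so each lock acquisition is amortised over 256 times more voxels. The image size is illustrative only and is not taken from the benchmark in this PR.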

The performance improvement is easy to see on Linux (AMD Ryzen Threadripper PRO 5975WX 32-Cores). Before the change, /usr/bin/time -v reports:

    Command being timed: "./build/bin/mrregister -type nonlinear -nl_warp warp1.mif -nl_warp2.mif ./OASIS-TRT-20_volumes/OASIS-TRT-20-1/t1weighted.nii.gz ./OASIS-TRT-20_volumes/OASIS-TRT-20-2/t1weighted.nii.gz -nl_niter 200 -info"
	User time (seconds): 1079.90
	System time (seconds): 1201.56
	Percent of CPU this job got: 3345%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:08.19
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 3695936
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 3557897
	Voluntary context switches: 16395658
	Involuntary context switches: 2133408
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 1

And after the change:

    Command being timed: "./build/bin/mrregister -type nonlinear -nl_warp warp1.mif -nl_warp2.mif ./OASIS-TRT-20_volumes/OASIS-TRT-20-1/t1weighted.nii.gz ./OASIS-TRT-20_volumes/OASIS-TRT-20-2/t1weighted.nii.gz -nl_niter 200 -info"
	User time (seconds): 826.32
	System time (seconds): 394.92
	Percent of CPU this job got: 2804%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:43.53
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 3696044
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 3558016
	Voluntary context switches: 4980515
	Involuntary context switches: 607951
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 1

The same command now finishes in 0:43.53 of wall-clock time instead of 1:08.19, and system time drops from 1201.56 s to 394.92 s. Voluntary context switches fall from 16395658 to 4980515, and involuntary ones from 2133408 to 607951.

Previously, we were dispatching the filter smoothing one image line at
a time via `ThreadedLoop(..., axes, 1)`. Profiling confirmed that this
was causing millions of scheduler lock acquisitions for large images.
To substantially improve the situation, we use two inner axes when
possible to increase chunk size from single lines to small slices of the
image. The result is less scheduler contention and lower OS overhead.
@daljit46 daljit46 self-assigned this Mar 19, 2026
@github-actions

clang-tidy review says "All clean, LGTM! 👍"

@daljit46 daljit46 requested a review from a team March 19, 2026 06:53
Member

@Lestropie Lestropie left a comment

This is probably the case in multiple other pieces of code also. When I get the chance I'll do a grep search across the repo and up the inner loop axis count for cheap operations. But if you want you can merge this and I'll extend separately.

@daljit46
Member Author

I think each individual case may be different, so I'll merge this as it is.

@daljit46 daljit46 merged commit 1dff8b8 into dev Mar 20, 2026
6 checks passed
@daljit46 daljit46 deleted the two_inner_axes_smooth_filter branch March 20, 2026 07:55