Reduce threading scheduler contention for smoothing filter by daljit46 · Pull Request #3280 · MRtrix3/mrtrix3

daljit46 · 2026-03-19T06:50:36Z

It was recently pointed out to me by @Lestropie that threaded_copy can be quite slow if the work done per voxel is trivial, because thread management overhead may dominate. This can easily be mitigated by specifying two or more inner axes.

While profiling mrregister (for comparison with my GPU registration work), I remembered this when I noticed that Filter::Smooth::operator() was on the hot path.

In smooth.h, we have the following code:

ThreadedLoop(in_and_output, axes, 1).run(smooth, in_and_output);

It turns out the same idea applies here. Profiling confirmed that the current strategy was causing millions of scheduler lock acquisitions for large images when running mrregister. This PR substantially improves the situation by using two inner axes when possible to increase chunk size from single lines to small slices of the image. The result is less scheduler contention and lower OS overhead.
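To make the effect of chunk size concrete, here is a minimal, self-contained sketch. It does not use the MRtrix3 API: `ChunkScheduler` and `count_lock_acquisitions` are hypothetical names, and the model assumes a scheduler in which each worker locks one shared mutex to claim its next chunk of image lines. It only counts how often that lock is taken for a given chunk size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a threaded-loop scheduler (not MRtrix3 code):
// workers lock a shared mutex each time they claim the next chunk of lines.
class ChunkScheduler {
public:
  ChunkScheduler(std::size_t total_lines, std::size_t lines_per_chunk)
      : total_lines_(total_lines), lines_per_chunk_(lines_per_chunk) {
    assert(lines_per_chunk_ > 0);
  }

  // Claims the half-open range [begin, end) of lines; returns false when no work is left.
  bool claim(std::size_t &begin, std::size_t &end) {
    std::lock_guard<std::mutex> lock(mutex_);
    ++lock_acquisitions_;
    if (next_line_ >= total_lines_)
      return false;
    begin = next_line_;
    end = std::min(total_lines_, begin + lines_per_chunk_);
    next_line_ = end;
    return true;
  }

  // Only safe to call once all workers have been joined.
  std::size_t lock_acquisitions() const { return lock_acquisitions_; }

private:
  std::size_t total_lines_;
  std::size_t lines_per_chunk_;
  std::size_t next_line_ = 0;
  std::size_t lock_acquisitions_ = 0;
  std::mutex mutex_;
};

// Runs num_threads workers over total_lines lines and returns how many times
// the scheduler lock was taken (one per chunk, plus one final failed claim per worker).
std::size_t count_lock_acquisitions(std::size_t total_lines,
                                    std::size_t lines_per_chunk,
                                    unsigned num_threads) {
  ChunkScheduler scheduler(total_lines, lines_per_chunk);
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < num_threads; ++t) {
    workers.emplace_back([&scheduler] {
      std::size_t begin = 0, end = 0;
      while (scheduler.claim(begin, end)) {
        // A real filter would process lines [begin, end) here.
      }
    });
  }
  for (auto &worker : workers)
    worker.join();
  return scheduler.lock_acquisitions();
}
```

In this model, a 256 x 256 x 256 image has 65536 lines along the first axis. With 8 workers, `count_lock_acquisitions(65536, 1, 8)` returns 65544, while grouping 256 lines per chunk (one full slice, chosen here purely for illustration) gives `count_lock_acquisitions(65536, 256, 8) == 264`. The real scheduler and chunk sizes in MRtrix3 may differ; the sketch only shows why larger chunks mean fewer trips through the lock.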

The performance improvement is easy to see on Linux (AMD Ryzen Threadripper PRO 5975WX 32-Cores). Before the change, /usr/bin/time -v reports:

    Command being timed: "./build/bin/mrregister -type nonlinear -nl_warp warp1.mif -nl_warp2.mif ./OASIS-TRT-20_volumes/OASIS-TRT-20-1/t1weighted.nii.gz ./OASIS-TRT-20_volumes/OASIS-TRT-20-2/t1weighted.nii.gz -nl_niter 200 -info"
	User time (seconds): 1079.90
	System time (seconds): 1201.56
	Percent of CPU this job got: 3345%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:08.19
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 3695936
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 3557897
	Voluntary context switches: 16395658
	Involuntary context switches: 2133408
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 1

And after the change:

    Command being timed: "./build/bin/mrregister -type nonlinear -nl_warp warp1.mif -nl_warp2.mif ./OASIS-TRT-20_volumes/OASIS-TRT-20-1/t1weighted.nii.gz ./OASIS-TRT-20_volumes/OASIS-TRT-20-2/t1weighted.nii.gz -nl_niter 200 -info"
	User time (seconds): 826.32
	System time (seconds): 394.92
	Percent of CPU this job got: 2804%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:43.53
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 3696044
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 3558016
	Voluntary context switches: 4980515
	Involuntary context switches: 607951
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 1

The same command now finishes in 0:43.53 of wall-clock time instead of 1:08.19 (about 36% less), and system time drops from 1201.56 s to 394.92 s. Voluntary and involuntary context switches both fall by more than a factor of three.

Previously, we were dispatching the filter smoothing one image line at a time via `ThreadedLoop(..., axes, 1)`. Profiling confirmed that this was causing millions of scheduler lock acquisitions for large images. To substantially improve the situation, we use two inner axes when possible to increase chunk size from single lines to small slices of the image. The result is less scheduler contention and lower OS overhead.

github-actions · 2026-03-19T06:53:09Z

clang-tidy review says "All clean, LGTM! 👍"

Lestropie

This is probably the case in multiple other pieces of code also. When I get the chance I'll do a grep search across the repo and up the inner loop axis count for cheap operations. But if you want you can merge this and I'll extend separately.

daljit46 · 2026-03-20T07:55:04Z

I think each individual case may be different, so I'll merge this as it is.

daljit46 self-assigned this Mar 19, 2026

daljit46 added the performance label Mar 19, 2026

daljit46 requested a review from a team March 19, 2026 06:53

Lestropie approved these changes Mar 20, 2026

daljit46 merged commit 1dff8b8 into dev Mar 20, 2026
6 checks passed

daljit46 deleted the two_inner_axes_smooth_filter branch March 20, 2026 07:55
