ignore: gracefully quit worker threads upon panic in ParallelVisitor #3010

cosmicexplorer · 2025-03-06T04:03:41Z

Fixes #3009.

Problem

WalkParallel::visit() will nondeterministically hang if the ParallelVisitor::visit() implementation panics. This also occurs when providing a closure to WalkParallel::run(). Minimal repro is provided in #3009:

        WalkBuilder::new(path)
            .build_parallel()
            .run(|| Box::new(|_| panic!("oops!")));

The above code will nondeterministically hang, because of an infinite loop when no new work is available:

ripgrep/crates/ignore/src/walk.rs

Lines 1695 to 1707 in de4baa1

    
        loop {
            if let Some(v) = self.recv() {
                self.activate_worker();
                value = Some(v);
                break;
            }
            // Our stack isn't blocking. Instead of burning the
            // CPU waiting, we let the thread sleep for a bit. In
            // general, this tends to only occur once the search is
            // approaching termination.
            let dur = std::time::Duration::from_millis(1);
            std::thread::sleep(dur);
        }

Solution

Check the quit_now flag in our wait loop.
Catch any panic in the run() method and set quit_now before propagating the panic.
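
The two steps above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code from this PR: `Shared`, `pending`, `next_item`, `run_worker` and `demo` are hypothetical names, and a plain `Mutex<Vec<u32>>` stands in for the crate's work-stealing stacks. Only the `quit_now` name is taken from the description.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the walker's shared state (not the `ignore` crate's types).
struct Shared {
    queue: Mutex<Vec<u32>>,
    /// Items that are queued or currently being visited.
    pending: AtomicUsize,
    quit_now: AtomicBool,
}

/// Waits for work, but gives up once `quit_now` is set or nothing is pending.
fn next_item(shared: &Shared) -> Option<u32> {
    loop {
        // Step 1: check the quit flag inside the wait loop.
        if shared.quit_now.load(Ordering::SeqCst) || shared.pending.load(Ordering::SeqCst) == 0 {
            return None;
        }
        if let Some(v) = shared.queue.lock().unwrap().pop() {
            return Some(v);
        }
        thread::sleep(Duration::from_millis(1));
    }
}

/// Runs `visit` on each item; on panic, sets `quit_now` and then re-raises the panic.
fn run_worker(shared: &Shared, visit: impl Fn(u32)) {
    while let Some(item) = next_item(shared) {
        // Step 2: catch the panic so the other workers can be told to stop.
        match panic::catch_unwind(AssertUnwindSafe(|| visit(item))) {
            Ok(()) => {
                shared.pending.fetch_sub(1, Ordering::SeqCst);
            }
            Err(payload) => {
                // Without this store, `pending` never reaches zero and idle workers spin forever.
                shared.quit_now.store(true, Ordering::SeqCst);
                panic::resume_unwind(payload);
            }
        }
    }
}

/// Spawns four workers over three items; returns true if any worker propagated a panic.
fn demo(should_panic: bool) -> bool {
    let shared = Arc::new(Shared {
        queue: Mutex::new(vec![1, 2, 3]),
        pending: AtomicUsize::new(3),
        quit_now: AtomicBool::new(false),
    });
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                run_worker(&shared, move |_| {
                    if should_panic {
                        panic!("oops!");
                    }
                })
            })
        })
        .collect();
    let results: Vec<_> = handles.into_iter().map(|h| h.join()).collect();
    results.iter().any(|r| r.is_err())
}
```

In this sketch, `demo(true)` returns once every worker has either panicked or observed `quit_now`, and `demo(false)` returns after the three items are visited. Removing the `quit_now` check from `next_item` reproduces the kind of hang described under Problem.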

Breaking change: In order to ensure soundness, we also enforce that the filter method provided to WalkBuilder#filter_entry() is UnwindSafe. Users can circumvent this by wrapping in AssertUnwindSafe as needed.
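
As a rough illustration of what the new bound would mean for callers, the sketch below uses a hypothetical `accept_filter` function in place of the real builder method, and `&str` in place of the crate's entry type. It assumes the caller's closure captures a boxed trait object, which is not `UnwindSafe` unless its type says so. Since `AssertUnwindSafe` does not itself implement `Fn` for closures that take arguments, the sketch wraps the captured state rather than the closure.

```rust
use std::panic::{AssertUnwindSafe, UnwindSafe};

/// A boxed predicate; this trait object type does not promise `UnwindSafe`.
type Predicate = Box<dyn Fn(&str) -> bool + Send + Sync>;

/// Hypothetical stand-in for a builder method carrying the proposed bound.
fn accept_filter<P>(filter: P) -> impl Fn(&str) -> bool
where
    P: Fn(&str) -> bool + UnwindSafe + Send + Sync + 'static,
{
    filter
}

/// Builds a filter that rejects hidden names.
fn build_filter() -> impl Fn(&str) -> bool {
    let is_hidden: Predicate = Box::new(|name: &str| name.starts_with('.'));

    // Passing `move |name| !is_hidden(name)` directly is expected to be rejected,
    // because the captured `Predicate` is not `UnwindSafe`.

    // The caller asserts that observing this state after a panic is acceptable.
    let is_hidden = AssertUnwindSafe(is_hidden);

    // Going through `Deref` (`*is_hidden`) makes the closure capture the whole
    // wrapper rather than the inner box.
    accept_filter(move |name| !(*is_hidden)(name))
}
```

Closures that capture only plain data (for example a `String` or a `Vec<String>`) already satisfy `UnwindSafe` and would need no wrapper under this assumption.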

Result

The added test panic_in_parallel() always succeeds instead of hanging.

BurntSushi · 2025-08-17T21:05:21Z

crates/ignore/src/walk.rs

+            + std::panic::UnwindSafe
+            + Send
+            + Sync
+            + 'static,


Unfortunately, I think this is plausibly too big of a breaking change for me to stomach in a semver compatible release. And I don't have the bandwidth to do a semver incompatible release right now. Which kind of makes this PR stuck.

One possible way to get this unstuck is to add a new API that avoids this problem and deprecate the old one. But I'd only want to go this route if it was a very small addition that doesn't significantly complicate the implementation. I'm not sure if such a solution exists.

Otherwise, I'm inclined to leave this bug open for now, and treat it as something to address in the next semver incompatible release.

BurntSushi · 2025-08-17T21:05:40Z

Closing, but keeping #3009 open.

BurntSushi · 2025-08-17T21:06:25Z

Also, thank you for your work on this! Even though I ended up not taking this PR, your work will be something worth building on for ignore 0.5.

cosmicexplorer · 2025-11-27T13:34:31Z

Wanted to confirm here that:

I always love working with you and your projects,
this is exactly the choice I would make as a maintainer,
and thanks so much for your careful and thorough reply!

And I am super glad to hear ignore 0.5 is on the horizon!

I would like you to know that as usual, it is extremely difficult to improve upon the kind of work you do. I have tried to argue with your comments in my head and largely failed. I have spent months investigating this, and I have had to introduce some immense complexity in order to make a pretty minor use case more robust. Great work.

I'm also looking at the API, and while I'm not done with mine yet, I would absolutely recommend considering the use of ops::ControlFlow, either internally or in the API. I think ControlFlow may carry strictly more information than the untagged stop token. Personally, I have also separated the worker threads by category, and given each category individually configurable thread counts. I don't know if that will help performance, but I do think that the definition of "done" differs based upon whether the current node is a directory or not, which I think may justify slightly different looping techniques.
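
To make the ControlFlow suggestion concrete, here is a small sketch over an in-memory tree. Everything in it is hypothetical (`Node`, `Next`, `walk`, `first_rust_file`, `sample_tree`) and it is not a proposal for the actual ignore API. It only shows how `Break` can carry a typed reason while `Continue` carries a per-entry decision such as whether to descend.

```rust
use std::ops::ControlFlow;

/// Hypothetical in-memory tree standing in for a directory walk.
enum Node {
    File(&'static str),
    Dir(&'static str, Vec<Node>),
}

/// Hypothetical "keep going" payload: descend into a directory or skip it.
enum Next {
    Descend,
    Skip,
}

/// Depth-first walk; the visitor's `Break(b)` stops everything and carries `b` out.
fn walk<B>(node: &Node, visit: &mut impl FnMut(&str) -> ControlFlow<B, Next>) -> ControlFlow<B> {
    match node {
        Node::File(name) => {
            // For files the `Next` value is irrelevant; only `Break` matters.
            visit(*name)?;
        }
        Node::Dir(name, children) => {
            if let Next::Descend = visit(*name)? {
                for child in children {
                    walk(child, visit)?;
                }
            }
        }
    }
    ControlFlow::Continue(())
}

/// Returns the first `.rs` file name found, skipping any directory named `target`.
fn first_rust_file(root: &Node) -> Option<String> {
    let mut visit = |name: &str| {
        if name.ends_with(".rs") {
            ControlFlow::Break(name.to_string())
        } else if name == "target" {
            ControlFlow::Continue(Next::Skip)
        } else {
            ControlFlow::Continue(Next::Descend)
        }
    };
    match walk(root, &mut visit) {
        ControlFlow::Break(found) => Some(found),
        ControlFlow::Continue(()) => None,
    }
}

/// Small fixture used to exercise the walk.
fn sample_tree() -> Node {
    Node::Dir(
        "repo",
        vec![
            Node::Dir("target", vec![Node::File("gen.rs")]),
            Node::Dir("src", vec![Node::File("README"), Node::File("main.rs")]),
        ],
    )
}
```

The `?` operator propagates `Break` through the recursion, so the early exit needs no separate flag in this single-threaded sketch.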

I also think the deadlock that occurred here is a symptom of a more general misalignment of the threading model to the input distribution. I think work stealing + recursive data generation (so each work item can always produce more) is necessarily prone to this deadlock. I do not have a proof of this, and I also do not have a more appropriate answer myself yet. I do understand that work stealing is a good approach to avoid having a single very large directory fill up its own queue. I also wonder whether there is any benefit to maximizing thread locality.

Finally, I was informed about a very widely available syscall, getdents(), which writes multiple directory entries into a provided buffer. It is provided in the Linux and BSD stdlibs, but not on macOS. There is even a posix_getdents() in POSIX 2024, and musl supports it, but I have been having a very difficult time getting it accepted into the rust libc crate: rust-lang/libc#4522.
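
For readers unfamiliar with the buffer format, the sketch below decodes records laid out like Linux's `struct linux_dirent64` (inode, offset, record length, type, then a NUL-terminated name). It makes no syscall and is unrelated to the linked implementation: `Entry`, `parse_dirents` and `encode_dirent` are hypothetical helpers, and `encode_dirent` exists only to build a synthetic buffer for illustration.

```rust
use std::convert::TryInto;

/// One record decoded from a getdents64-style buffer.
struct Entry {
    ino: u64,
    kind: u8,
    name: Vec<u8>,
}

/// Fixed-size prefix: d_ino (8 bytes) + d_off (8) + d_reclen (2) + d_type (1).
const HEADER: usize = 19;

/// Decodes every complete record in `buf`, stopping at the first malformed one.
fn parse_dirents(buf: &[u8]) -> Vec<Entry> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos + HEADER <= buf.len() {
        let rec = &buf[pos..];
        let ino = u64::from_ne_bytes(rec[0..8].try_into().unwrap());
        let reclen = u16::from_ne_bytes(rec[16..18].try_into().unwrap()) as usize;
        if reclen < HEADER || reclen > rec.len() {
            break;
        }
        let kind = rec[18];
        // The name is NUL-terminated and followed by padding up to `reclen`.
        let raw = &rec[HEADER..reclen];
        let len = raw.iter().position(|&b| b == 0).unwrap_or(raw.len());
        out.push(Entry { ino, kind, name: raw[..len].to_vec() });
        pos += reclen;
    }
    out
}

/// Builds one synthetic record, padded to an 8-byte multiple (illustration only).
fn encode_dirent(ino: u64, kind: u8, name: &str) -> Vec<u8> {
    let reclen = (HEADER + name.len() + 1 + 7) / 8 * 8;
    let mut rec = vec![0u8; reclen];
    rec[0..8].copy_from_slice(&ino.to_ne_bytes());
    rec[16..18].copy_from_slice(&(reclen as u16).to_ne_bytes());
    rec[18] = kind;
    rec[HEADER..HEADER + name.len()].copy_from_slice(name.as_bytes());
    rec
}
```

A single call can fill the buffer with many such records, which is why the buffer size becomes a tuning knob for pipelining.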

I am actually very confident that supporting getdents() will produce a drastic performance improvement for fs::read_dir(), so I'm going to try demonstrating that today. If I can demonstrate its utility with benchmarks, I will probably ping you again, and ask you to help support my quest to get this syscall into the stdlib.

ignore 0.5 may be able to just call upon fs::read_dir(), but for pipelining it can be nice to control the buffer size. Note that the alignment of the output is (1) very stable (2) ridiculous.
https://codeberg.org/cosmicexplorer/deep-link/src/commit/1ea3eba5d599d8c48ea56816b7103b12ee49d505/d-major/readdir-sys/src/getdents.rs#L101-L188

Again, very impressed by your engineering judgement, and I appreciate when you write a comment about decisions you didn't take too. Keep it up!!

cosmicexplorer added 4 commits March 5, 2025 21:32

test the problem

df8ad0c

solve the problem

6bf6029

[BREAKING] add comments on safety + API change enforcing UnwindSafe

09f5ef1

add further comments explaining which panic we end up propagating

105b5d9

cosmicexplorer changed the title ~~correctly quit out of busy loops if a ParallelVisitor panics to avoid a hang~~ ignore: correctly quit out of busy loops if a ParallelVisitor panics to avoid a hang Mar 6, 2025

cosmicexplorer changed the title ~~ignore: correctly quit out of busy loops if a ParallelVisitor panics to avoid a hang~~ ignore: gracefully quit worker threads upon panic in ParallelVisitor Mar 6, 2025

cosmicexplorer mentioned this pull request Mar 8, 2025

Allow testing ignore state of files that don't exist on disk yet teamtype/teamtype#239

Open

BurntSushi reviewed Aug 17, 2025

BurntSushi closed this Aug 17, 2025
