Add ThreadAwareScheduler #79
base: master
Conversation
I was hoping that #81 would magically fix my problems here, but alas... My current nemesis is this sample on the JVM target:

```haxe
import haxe.Timer;
import hxcoro.CoroRun;
import hxcoro.Coro.*;

function main() {
    for (numTasks in [1, 10, 100, 1_000, 10_000]) {
        final stamp = Timer.milliseconds();
        var racyInt = 0;
        CoroRun.runScoped(node -> {
            for (i in 0...numTasks) {
                node.async(node -> {
                    racyInt++;
                    var busy = 1_000;
                    var localInt = 0;
                    while (busy-- > 0) {
                        yield();
                    }
                });
            }
        });
        trace('numTasks: $numTasks, run-time: ${Timer.milliseconds() - stamp}ms, racyInt: $racyInt');
    }
}
```

This tells me that there are 3/10000 children who were never resumed. The most obvious explanation seems to be that the scheduler "loses" their
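Whatever it is that gets lost, one way to see the count directly from the program is a sketch like the one below: each child bumps an atomic counter when its loop completes, so the gap between the final count and `numTasks` is exactly the number of children that never got resumed. This is only a diagnostic sketch; it reuses the hxcoro calls exactly as in the sample above and assumes `haxe.atomic.AtomicInt` is available on the JVM target.

```haxe
import haxe.atomic.AtomicInt;
import hxcoro.CoroRun;
import hxcoro.Coro.*;

function main() {
    final numTasks = 10_000;
    // Bumped once per child that makes it all the way through its yield-loop.
    final finishedCount = new AtomicInt(0);
    CoroRun.runScoped(node -> {
        for (i in 0...numTasks) {
            node.async(node -> {
                var busy = 1_000;
                while (busy-- > 0) {
                    yield();
                }
                finishedCount.add(1);
            });
        }
    });
    // Only reached if the scope returns; while it hangs, the children that did
    // finish have already bumped the counter, so a debugger (or a periodic
    // print inside the children) reveals how many are stuck.
    Sys.println('finished: ${finishedCount.load()} / $numTasks');
}
```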
I have locally added a duplication check to onDispatch:

```haxe
var dispatchCounter = new AtomicInt(0);

public function onDispatch() {
    if (dispatchCounter.exchange(1) != 0) {
        Sys.println("Duplicate dispatch: " + (this : Dynamic));
        return;
    }
    // ... existing dispatch logic continues here ...
```

And the same for the resume side. There doesn't appear to be a strict relation between the two, i.e. sometimes I get one, sometimes the other, sometimes both. The dispatch one could suggest that the pool steals the same object twice (and misses one in its place). For the resume one I have no idea right now. There also doesn't appear to be a relation between the number of duplicate calls and the number of hanging children, which is a little surprising too.
Also, to be clear, while I cannot reproduce this behavior on master, there's a pretty high chance that we get bottlenecked by
I'm pasting here how to profile on HL so that I don't have to dig it up in Discord all the time:

```haxe
import haxe.Timer;
import hxcoro.CoroRun;
import hxcoro.Coro.*;

function doProf(args) {
    switch (args) {
        case "start":
            hl.Profile.event(-7, "" + 10000); // setup
            hl.Profile.event(-3); // clear data
            hl.Profile.event(-5); // resume all
        case "dump":
            hl.Profile.event(-6); // save dump
            hl.Profile.event(-4); // pause all
            hl.Profile.event(-3); // clear data
        default:
    }
}

function main() {
    doProf("start");
    final numTasks = 100;
    final stamp = Timer.milliseconds();
    var racyInt = 0;
    CoroRun.runScoped(node -> {
        for (i in 0...numTasks) {
            node.async(node -> {
                racyInt++;
                var busy = 1_000;
                while (busy-- > 0) {
                    yield();
                }
            });
        }
    });
    trace('numTasks: $numTasks, run-time: ${Timer.milliseconds() - stamp}ms, racyInt: $racyInt');
    doProf("dump");
}
```

This makes a dump file. And then it usually just shows that the GC can't handle many short-lived objects very well:
At least the threads do what I want them to do and are spending their time in the right place:
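As an aside, the start/dump pair can be wrapped so that only the interesting region gets sampled; just a small convenience sketch on top of the `doProf` helper above (`profiled` is a name made up here):

```haxe
// Wraps an arbitrary benchmark body between the "start" and "dump" profiler
// events, so setup and teardown noise stays out of the dump.
function profiled(label:String, body:() -> Void) {
    Sys.println('profiling: $label');
    doProf("start");
    body();
    doProf("dump");
}

function main() {
    profiled("runScoped benchmark", () -> {
        // ... the CoroRun.runScoped benchmark from above goes here ...
    });
}
```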
For some reason, everyone really hates the Issue37 tests. I thought it might be another cancellation problem, but there have also been failures for the non-cancelling version. They always seem to end up in the same state:

From the task ID we can infer that this is one of the reader tasks, because the 100 writer tasks are created first and would go to 721. That means that we never get out of this loop:

```haxe
while (channel.reader.waitForRead()) {
    delay(1);
    if (channel.reader.tryRead(o)) {
        aggregateValue.add(o.get());
        break;
    } else {
        continue;
    }
}
```

I know this is an awkward pattern with the delay-call between the wait and the try, but I do think it should work...
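For comparison, the shape I would expect to be unambiguous keeps the wait and the try adjacent and only delays after a miss. This fragment reuses the identifiers from the snippet above and assumes `waitForRead`/`tryRead` behave the way they appear to there (suspend until data may be available / non-blocking read attempt); it is not checked against the actual channel API:

```haxe
// Same loop reshuffled: attempt the read right after the wait, so nothing can
// drain the channel in between, and only delay when the read misses.
while (channel.reader.waitForRead()) {
    if (channel.reader.tryRead(o)) {
        aggregateValue.add(o.get());
        break;
    }
    delay(1);
}
```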
So it's not (only) about the delay calls, HL just segfaulted without them:

That last line points to the
And for my next trick, I make all your scheduler events disappear... This implements the scheduler side on top of #75. I'll write more about it later - the idea should be sound, but something is still not quite right with the implementation because I'm getting hangs in some samples. That could of course also again be the "more parallelism = more problems" situation, but we'll see.
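I have no idea whether this matches the actual design here, but purely as a back-of-the-envelope picture of what "thread-aware" dispatch with a duplicate-claim guard can look like (all names are made up for the illustration, and this is not the ThreadAwareScheduler from this PR):

```haxe
import haxe.atomic.AtomicInt;
import sys.thread.Deque;
import sys.thread.Thread;

// Toy illustration: worker threads drain a shared queue, and each task carries
// an atomic "claimed" flag so a task that somehow becomes visible to two
// workers still runs exactly once - the situation the duplicate-dispatch
// check above is trying to catch.
class ToyTask {
    final claimed = new AtomicInt(0);
    final work:() -> Void;

    public function new(work:() -> Void) {
        this.work = work;
    }

    public function run() {
        // Only the first claimant executes the body; a duplicate dispatch
        // degrades into a no-op instead of a double resume.
        if (claimed.exchange(1) == 0) {
            work();
        }
    }
}

function main() {
    final queue = new Deque<ToyTask>();
    for (_ in 0...4) {
        Thread.create(() -> {
            while (true) {
                queue.pop(true).run(); // block until a task is available
            }
        });
    }
    for (i in 0...10) {
        final n = i;
        queue.add(new ToyTask(() -> Sys.println('task $n')));
    }
    Sys.sleep(0.1); // crude: give the workers a moment to drain the queue
}
```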