Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stall/block in async heavy code #6605

Closed
p6rt opened this issue Oct 13, 2017 · 8 comments
Closed

stall/block in async heavy code #6605

p6rt opened this issue Oct 13, 2017 · 8 comments
Labels
conc regression Issue did not exist previously

Comments

@p6rt
Copy link

p6rt commented Oct 13, 2017

Migrated from rt.perl.org#132287 (status was 'resolved')

Searchable as RT132287$

@p6rt
Copy link
Author

p6rt commented Oct 13, 2017

From @jdv

Below is the best golfing I could manage. I ran it like so for
bisecting​:

while perl6 test.pl6; do :; done

The symptom is that sometimes (90%+ on my box and ungolfed) the program
seems to stall out on the qqx line. The last thing that is printed is
"before". But on earlier commits (that actually build) no such
stallage
happened for a couple minutes run time.

[jdv@​new-host-2]$ cat test.pl6
use v6.d.PREVIEW;

my %c = (​:0reactor_count);
my $s = supply {
  whenever IO​::Socket​::Async.listen('0.0.0.0', 5000) -> $c-conn {
  %c{$c-conn}<bar>++;

  emit $c-conn;

  %c{$c-conn}<msgs-supplier> = Supplier.new;

  $c-conn.Supply.lines.tap({
  %c{$c-conn}<msgs-supplier>.emit​: $_ });
  }
}

sub start_reactor ($msgs-in) {
  start {
  react whenever $msgs-in -> $msg-in {
  say "before";
  say qqx{echo foo};
  say "after";
  %c<reactor_count>++;
  }
  }
}

start {
  react whenever $s -> $c-conn {
  start {
  start_reactor(%c{$c-conn}<msgs-supplier>.Supply);
  }
  }
}

sleep 1;
my $c = await IO​::Socket​::Async.connect('0.0.0.0', 5000);
await $c.print("foo\n");
sleep 1;
exit %c<reactor_count> ?? 0 !! 1;
[jdv@​new-host-2]$

[jdv@​new-host-2]$ git bisect good
80b49320cf854ac68a17cdd216575ee26e380325 is the first bad commit
commit 80b49320cf854ac68a17cdd216575ee26e380325
Author​: Jonathan Worthington <jnthn@​jnthn.net>
Date​: Wed Sep 13 16​:08​:58 2017 +0200

  Initial re-implementation of thread pool scheduler

  Has separate general and timer queues with separate workers, and
also
  introduces affinity queues, which are intended for cases where
events
  will be fed into a `Supply`, and thus there's no point having lots
of
  threads competing over them only to immediately stumble over each
  other. The separate timer queue helps with timer events being
delayed,
  for example by a process producing a load of output.

  The new implementation also adds a supervisor, which is where we
will
  put the smarts on how many threads to have in the pool. For now, it
is
  already smart enough to start a lot less threads than the previous
  scheduler when they obviously aren't needed, which should help a
good
  bit with memory consumption, as well as to add more when needed to
  break deadlocks. This area will be further developed in upcoming
  commits. The default maximum number of threads has been raised to
64,
  now that the behavior of the scheduler is not to innevitably end up
  starting the maximum number of threads even when they don't have
any
  work to do.

  Finally, a new RAKUDO_SCHEDULER_DEBUG environment variable can be
set
  to see how many and what kinds of threads are being started.

:040000 040000 249e0613cf30b283f4fc3cbb0003474377f6019d
3bb01ba4471016067e4b542a2158ee3184cdc85c M src
:040000 040000 6a4e2506cb269875ff4d40054758a34e24c9024d
6a827bbc76dc09ba2cb3fda55083ebb3074ae552 M tools
[jdv@​new-host-2]$

@p6rt
Copy link
Author

p6rt commented Oct 13, 2017

From @ugexe

RAKUDO_SCHEDULER_DEBUG=1 perl6-gdb-m --ll-exception -e 'while 1 { use v6.d.PREVIEW; my %c = (​:0reactor_count); my $s = supply { whenever IO​::Socket​::Async.listen("0.0.0.0", 5000) -> $c-conn { %c{$c-conn}<bar>++; emit $c-conn; %c{$c-conn}<msgs-supplier> = Supplier.new; $c-conn.Supply.lines.tap({ %c{$c-conn}<msgs-supplier>.emit​: $_ }); } }; sub start_reactor ($msgs-in) { start { react whenever $msgs-in -> $msg-in { say "before"; say QX("echo foo"); say "after"; %c<reactor_count>++; } } }; start { react whenever $s -> $c-conn { start { start_reactor(%c{$c-conn}<msgs-supplier>.Supply); } } }; sleep 1; my $c = await IO​::Socket​::Async.connect("0.0.0.0", 5000); await $c.print("foo\n"); sleep 1; }'

Eventually prints out (for me)​:

Unhandled exception in code scheduled on thread 4
Too few positionals passed; expected 6 arguments but got 2
  in block at SETTING​::src/core/IO/Socket/Async.pm line 151
  in block at SETTING​::src/core/ThreadPoolScheduler.pm line 225
  in block at SETTING​::src/core/ThreadPoolScheduler.pm line 224
  in method run-one at SETTING​::src/core/ThreadPoolScheduler.pm line 224
  in method dispatch​:<!> at SETTING​::src/core/Mu.pm line 732
  in block at SETTING​::src/core/ThreadPoolScheduler.pm line 258

@p6rt
Copy link
Author

p6rt commented Oct 13, 2017

The RT System itself - Status changed from 'new' to 'open'

@p6rt
Copy link
Author

p6rt commented Oct 16, 2017

From @jnthn

On Fri, 13 Oct 2017 10​:13​:36 -0700, jdv79 wrote​:

The symptom is that sometimes (90%+ on my box and ungolfed) the program
seems to stall out on the qqx line. The last thing that is printed is
"before". But on earlier commits (that actually build) no such
stallage
happened for a couple minutes run time.

So if I follow correctly, the changes in the commit mentioned make an existing problem manifest sooner, rather than introduce the problem?

@p6rt
Copy link
Author

p6rt commented Oct 17, 2017

From @jdv

No, sorry. I meant that I only tested for a couple minutes at each
commit.
Commits before the one bisect found never manifested the issue.

On 2017.10.16 10​:52 am, jnthn@​jnthn.net via RT wrote​:

On Fri, 13 Oct 2017 10​:13​:36 -0700, jdv79 wrote​:

The symptom is that sometimes (90%+ on my box and ungolfed) the
program
seems to stall out on the qqx line. The last thing that is printed
is
"before". But on earlier commits (that actually build) no such
stallage
happened for a couple minutes run time.

So if I follow correctly, the changes in the commit mentioned make an
existing problem manifest sooner, rather than introduce the problem?

@p6rt
Copy link
Author

p6rt commented Oct 23, 2017

From @AlexDaniel

OK, first of all, thank you for the report. The issue is indeed there, and in my opinion it's a significant problem.

Secondly, ♥👍 for trying to bisect it manually. Nowadays I only do it with a bot unless I have to trisect rakudo+nqp+moar.

However, the result of bisection is wrong (as far as I can see), and my guess is that on some steps you used “git bisect bad” when you should have used “git bisect skip” (when bisecting automatically you should use exit code 125 in these situations). It just happens that on some commits you can't really build rakudo (e.g. some work was split into several commits so some of the intermediate versions are unbuildable). We should try to avoid these situations, but it happens. On these commits you don't really know if the bug is there or not because there is simply no way to test. Similar bug (but not the same one) was bisected with a bot, and you can take a look at the result​: https://gist.github.com/Whateverable/0f0acba0af518988ddddeb207fd52fae (the commit you are talking about is listed there as well).

In other words, the problem appeared in better-sched branch, and not exactly in the commit you mentioned.

Now, talking about the code you've used, it looks a bit weird to me. It seems that you're doing async access to %c which should probably be protected with a lock. In any case, I was able to reproduce the issue without most of that code, by using just qqx. See GH #​1202 ( rakudo/rakudo#1202 ).

As far as I can see (judging by GH #​1202), the issue is much *easier* to reproduce now with better-sched, but the bug was there for a long time.

So for the 2017.10 release​: SNAFU.

On 2017-10-16 18​:27​:57, jdv79 wrote​:

No, sorry. I meant that I only tested for a couple minutes at each
commit.
Commits before the one bisect found never manifested the issue.

On 2017.10.16 10​:52 am, jnthn@​jnthn.net via RT wrote​:

On Fri, 13 Oct 2017 10​:13​:36 -0700, jdv79 wrote​:

The symptom is that sometimes (90%+ on my box and ungolfed) the
program
seems to stall out on the qqx line. The last thing that is printed
is
"before". But on earlier commits (that actually build) no such
stallage
happened for a couple minutes run time.

So if I follow correctly, the changes in the commit mentioned make an
existing problem manifest sooner, rather than introduce the problem?

@p6rt
Copy link
Author

p6rt commented Nov 12, 2017

From @jdv

The golfed example contained in this ticket and my real world are both fixed after zoffix's fixes. See the other ticket rakudo/rakudo#1202 and commits by
zoffix around 2017-10-25 or so.

@p6rt
Copy link
Author

p6rt commented Nov 12, 2017

@jdv - Status changed from 'open' to 'resolved'

@p6rt p6rt closed this as completed Nov 12, 2017
@p6rt p6rt added conc regression Issue did not exist previously labels Jan 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
conc regression Issue did not exist previously
Projects
None yet
Development

No branches or pull requests

1 participant