Parallelizing code unexpectedly makes it slower #5712

Closed
p6rt opened this issue Oct 1, 2016 · 12 comments

p6rt commented Oct 1, 2016

Migrated from rt.perl.org#129779 (status was 'resolved')

Searchable as RT129779$


p6rt commented Oct 1, 2016

From steve.piner@gmail.com

This could be a stupid user problem, in which case I apologise for wasting your time.

Simple concurrency demonstrations seem to work; the following completes in just over a second:

perl6 -e 'await Promise.allof(start {sleep 1}, start {sleep 1}, start {sleep 1}); say now - INIT now'

However, if the started tasks are doing any CPU-intensive work, they seem to take much longer than if they had been run sequentially, without using promises at all.

To reproduce the issue, please run this script.


use v6;

my $r = 2_000_000;

say "Counting to $r";

my ($start, $setup, $stop);
$start = now;
{
  my ($i, $j);
  $j = $r;
  loop ($i = 0; $i < $j; $i++) { }
}
$stop = now;

say "Non-promise iteration: {$stop - $start}";

my @promises;
$start = now;
@promises.push(start {
  my ($i, $j);
  $j = $r;
  loop ($i = 0; $i < $j; $i++) { }
});
$setup = now;
await Promise.allof(@promises);
$stop = now;

say "One iteration: {$stop - $start} (setup: {$setup - $start})";

@promises = ();

$start = now;
for (1..16) {
  @promises.push(start {
    my ($i, $j);
    $j = $r;
    loop ($i = 0; $i < $j; $i++) { }
  });
}
$setup = now;
await Promise.allof(@promises);
$stop = now;

say "16 iterations: {$stop - $start} (setup: {$setup - $start})";


What I expected:

One iteration would take roughly the same amount of time as the non-promise iteration (the sleep example above would suggest that the threading overhead would be less than 0.001s).

16 iterations would take, at worst, about 16 times as long as the 'one iteration' line above it.

What happens:

On a Windows 10 PC with 6 cores (12 logical processors):

C:\Users\Steve>perl6 concurrency-wtf.pl
Counting to 2000000
Non-promise iteration: 0.6162481
One iteration: 2.23909643 (setup: 0.00300325)
16 iterations: 65.56993665 (setup: 0.00700654)

On an Ubuntu 14.04.5 LTS PC with 2 cores:

steve@prole:~/$ perl6 concurrency-wtf.pl
Counting to 2000000
Non-promise iteration: 0.6948201
One iteration: 1.9678549 (setup: 0.0024696)
16 iterations: 107.1483633 (setup: 0.0474610)

Also, CPU usage leaps to 100% (all CPUs at 100%) while the script runs, on both PCs. The PCs are otherwise mostly idle (Windows 10: about 3% utilization; Linux: 199% idle).

Perl 6 version:

C:\Users\Steve>perl6 -v
This is Rakudo version 2016.07.1 built on MoarVM version 2016.07 implementing Perl 6.c.

steve@prole:~/$ perl6 -v
This is Rakudo version 2016.07.1 built on MoarVM version 2016.07 implementing Perl 6.c.


p6rt commented Oct 1, 2016

From steve.piner+bitcard@gmail.com

Replicated on Linux with 2016.09.

steve@prole:~/$ perl6 --version
This is Rakudo version 2016.09 built on MoarVM version 2016.09 implementing Perl 6.c.



p6rt commented Oct 3, 2016

From @zoffixznet

Seems the issue has more to do with running an empty loop, rather than performing a real computation.

This is a run on a 4-core box. Attempting to parallelize an empty loop makes the execution 1 second slower:

  my &code = { for ^2_000_000 { } };

  my $start = now; (^4).map: &code; my $stop = now;
  say "Serial: {$stop - $start}";

  $start = now; await (^4).map: {start code}; $stop = now;
  say "Parallel: {$stop - $start}";

  # Serial: 1.5161628
  # Parallel: 2.56597794

But running actual real-life code makes it almost 4 times faster, as would be expected on a 4-core box:

  use Crypt::Bcrypt;
  my &code = { bcrypt-hash ~rand, :15rounds; };

  my $start = now; (^4).map: &code; my $stop = now;
  say "Serial: {$stop - $start}";

  $start = now; await (^4).map: {start code}; $stop = now;
  say "Parallel: {$stop - $start}";

  # Serial: 7.69045687
  # Parallel: 2.087147



p6rt commented Oct 3, 2016

The RT System itself - Status changed from 'new' to 'open'


p6rt commented Oct 3, 2016

From steve.piner@gmail.com

But Crypt::Bcrypt seems to be mostly native call stuff. While it is running, I wouldn't imagine that it has much to do with Perl 6 at all.

The original case I noticed the problem in was a recursive function; all Perl 6, no IO. Sure, it's not real work, but I'm not at that stage yet.

Try this: it's the Fibonacci function.

In the first section I call it 4 times, without using any concurrency.

In the second section I call it 4 times in a single worker.

In the third section I start 4 workers, each calling it once.

I would expect the first and second sections to have about the same run time. I would expect the third section to take less time.

The times below are for a 2-core Ubuntu 14.04.5 PC, running Rakudo version 2016.09 built on MoarVM version 2016.09.

use v6;

sub fib($n) {
  $n < 2 ?? $n !! fib($n - 1) + fib($n - 2);
}

constant FIB = 30;

my ($start, $stop);
$start = now;
(^4).map: {fib(FIB)};
$stop = now;
say "No thread: {$stop - $start}";

$start = now;
await start {(^4).map: {fib(FIB)}};
$stop = now;
say "Single: {$stop - $start}";

$start = now;
await (^4).map: {start fib(FIB)};
$stop = now;
say "Parallel: {$stop - $start}";

# No thread: 13.55591338
# Single: 16.11898954
# Parallel: 20.72943701



p6rt commented Oct 3, 2016

From @toolforger

On 03.10.2016 at 06:34, Zoffix Znet via RT wrote:

Seems the issue has more to do with running an empty loop, rather than performing a real computation.

This is a run on a 4-core box. Attempting to parallelize an empty loop makes the execution 1 second slower:

[...]

But running actual real-life code makes it almost 4 times faster, as would be expected on a 4-core box:

(Disclaimer: I have no idea of the internals, but I know a bit about concurrency.)

This might be four cores competing to get update access to the loop counter. Core-to-core synchronization of a memory cell with high-frequency updates is an extremely expensive operation, with dozens or hundreds of wait states to request exclusive cache-line access and to move the current state of the variable from one CPU's cache to the next.
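(A minimal sketch, not from the ticket, illustrating the kind of contention described above: four workers hammering one shared counter versus four workers each using their own lexical counter. The shared version is deliberately left unsynchronized, so besides being slower it also shows lost updates. Note that the original reproduction script already gives each start block its own lexical $i, so this theory would not apply to it directly.)

  # Hypothetical illustration only: one shared hot counter vs. per-thread counters.
  my $shared = 0;
  my $t0 = now;
  await (^4).map: { start { $shared++ for ^500_000 } };               # all threads update one cell
  say "Shared counter:      {now - $t0}  (final value $shared; usually < 2_000_000 due to lost updates)";

  $t0 = now;
  await (^4).map: { start { my $local = 0; $local++ for ^500_000 } }; # each thread owns its counter
  say "Per-thread counters: {now - $t0}";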


p6rt commented Oct 12, 2016

From steve.piner@gmail.com

On Tue, 04 Oct 2016 04:23:54 +1300, Joachim Durchholz via RT <perl6-bugs-followup@perl.org> wrote:


This might be four cores competing to get update access to the loop counter. Core-to-core synchronization of a memory cell with high-frequency updates is an extremely expensive operation, with dozens or hundreds of wait states to request exclusive cache-line access and to move the current state of the variable from one CPU's cache to the next.

For what it's worth, I've tried this with 4 separate functions, one per thread. That should - to my mind - hopefully avoid blocking on cache-line access.

It doesn't seem to make a noticeable difference to performance.

The comment below is the output of the following program, compared to the times listed for the Fibonacci program elsewhere in this issue. Replacing @fibs with @fibs = &fib1 xx 4 results in closer times.

# No thread: 12.6327946 (vs 13.0872902)
# Parallel: 20.6087147 (vs 21.01963220)
# Single: 15.4441904 (vs 15.47348977)
# No thread: 15.4208225

use v6;

sub fib1($n) {
  $n < 2 ?? $n !! fib1($n - 1) + fib1($n - 2);
}

sub fib2($n) {
  $n < 2 ?? $n !! fib2($n - 1) + fib2($n - 2);
}

sub fib3($n) {
  $n < 2 ?? $n !! fib3($n - 1) + fib3($n - 2);
}

sub fib4($n) {
  $n < 2 ?? $n !! fib4($n - 1) + fib4($n - 2);
}

my @fibs = &fib1, &fib2, &fib3, &fib4;

constant FIB = 30;

my ($start, $stop);

$start = now;
(^4).map: { @fibs[$^a](FIB) };
$stop = now;
say "No thread: {$stop - $start}";

$start = now;
await (^4).map: { start @fibs[$^a](FIB) };
$stop = now;
say "Parallel: {$stop - $start}";

$start = now;
await start { (^4).map: { @fibs[$^a](FIB) } };
$stop = now;
say "Single: {$stop - $start}";

$start = now;
(^4).map: { @fibs[$^a](FIB) };
$stop = now;
say "No thread: {$stop - $start}";


p6rt commented Oct 12, 2016

From @smls

I ran the Fibonacci benchmark from Steve Piner's preceding comment on a CPU with 4 cores (AMD Phenom II X4 965). With MoarVM's dynamic specializer (spesh) disabled, the parallel case is faster than the no-thread case:

  ➜ MVM_SPESH_DISABLE=1 perl6 fibonacci.p6
  No thread: 25.60150510
  Parallel: 21.46605027
  Single: 28.1509191
  No thread: 28.3140393

But with spesh, the no-thread case gets a lot faster, whereas the parallel case actually gets a bit slower:

  ➜ perl6 fibonacci.p6
  No thread: 15.69618545
  Parallel: 23.173925
  Single: 18.196797935
  No thread: 18.3020371

timotimo ran it on a different CPU (also 4 cores) and did *not* get such a significant slow-down for parallel vs no-thread, so it may be CPU/platform-dependent:

  No thread: 10.4069735
  Parallel: 10.92738216

He also had some thoughts on the matter (starting at https://irclog.perlgeek.de/perl6/2016-10-12#i_13383443):

  <timotimo> allocating frames (i.e. calling functions) currently contends
             a per-instance (rather than per-thread) memory allocator
  <timotimo> if the empty loop actually invokes the empty body each time
             'round the loop in one of those earlier examples, it will
             hardly do anything but allocate and free frames
  <timotimo> then the allocator becomes the bottleneck and would be
             responsible for most of the runtime
  <timotimo> recursive function calls are also bad :)

...and did some profiling with perf:

  <timotimo> the whole program spends 34% of its time inside
             MVM_fixed_size_alloc
  <timotimo> an additional 7.84% of its time in MVM_fixed_size_free
  <timotimo> and 7.44% in MVM_frame_invoke
  <jnthn> Yeah, that'll be what's holding it up. Also, I suspect,
          over-subscription
  <jnthn> Our scheduler just ain't all that smart yet
  <timotimo> mystery solved?
  <timotimo> i'm not exactly sure why my parallel vs no thread results are
             drastically better than what's in the ticket, though
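(A side note on the over-subscription suspicion above, not from the ticket: Rakudo of this era starts its default ThreadPoolScheduler with more worker threads than a 2- or 4-core box has cores, so one way to test jnthn's theory is to cap the pool at the core count, either via the RAKUDO_MAX_THREADS environment variable or with an explicit scheduler as in the sketch below. The 4-thread figure assumes a 4-core machine.)

  use v6;

  sub fib($n) { $n < 2 ?? $n !! fib($n - 1) + fib($n - 2) }

  # Cap the pool that start blocks are scheduled onto at an assumed 4 cores.
  my $*SCHEDULER = ThreadPoolScheduler.new(max_threads => 4);

  my $start = now;
  await (^4).map: { start fib(30) };
  say "Parallel, 4-thread pool: {now - $start}";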


p6rt commented Oct 13, 2016

From @toolforger

On 11.10.2016 at 13:33, Steve Piner wrote:


For what it's worth, I've tried this with 4 separate functions, one per thread. That should - to my mind - hopefully avoid blocking on cache-line access.

I thought that the 4 processes would be trying to update the for
^2_000_000 loop counter. On closer look, I guess that was a gross
misunderstanding; my apologies.

Four different functions should not make a difference, unless Perl is
updating a usage counter on the functions themselves or something like that.

I don't know enough about Perl to decide what the single-threaded code
is actually doing, so I'll leave that to people with more knowledge.

Sorry I can't help more, I'm just learning :-)

Regards,
Jo


p6rt commented Apr 16, 2017

From @timo

With jnthn's recent patch to give every thread its own free-list for the FSA (MoarVM's fixed-size allocator), the behavior is now much nicer.

Here's a paste from a 40-core machine: https://gist.github.com/anonymous/650ff6b00a0f8dc34b9e358992e572b4

Here's a 24-core box (a Google Compute cloud machine Zoffix rents):

Before the patch: https://gist.github.com/zoffixznet/4e0fd3e352fbf9115d63186d5dc47340

After the patch: https://gist.github.com/zoffixznet/783f0293d65300419549321f24e5ea6a

Actual code used: https://gist.github.com/timo/a3a405f977840e8336f50234715e9cd4

I'm closing this bug.


p6rt commented Apr 16, 2017

@timo - Status changed from 'open' to 'resolved'

p6rt closed this as completed Apr 16, 2017

p6rt commented Apr 16, 2017

From steve.piner+bitcard@gmail.com

Yes, I retested with the initial scripts that led me to raise this issue in the first place.

The multi-threaded versions are 40% faster than the single-threaded versions on a dual-core machine. There is also no apparent penalty for simply having threads, as there was before.

So yeah, I agree this is fixed.
