
performance bug: perl Thread::Queue is 20x slower than Unix pipe #13196

Open
p5pRT opened this issue Aug 24, 2013 · 15 comments

p5pRT commented Aug 24, 2013

Migrated from rt.perl.org#119445 (status was 'open')

Searchable as RT119445$


p5pRT commented Aug 24, 2013

From johnh@isi.edu

Created by johnh@isi.edu

This is a bug report for perl from johnh@isi.edu,
generated with the help of perlbug 1.39 running under perl 5.16.3.

-----------------------------------------------------------------

Why is Thread::Queue *so* slow?

I understand it has to do locking and be careful about data
structures, but it seems like it is about 20x slower than opening up a
Unix pipe, printing to that, reading it back and parsing the result.

Thread::Queue is correct, but I suggest that 20x slower is a performance bug.

One would think that IPC through memory would be at least as fast as a
pipe through the kernel, and ideally it should be faster.

Here's timing of a test program that sends 500k integers between two threads,
using Thread::Queue or pipe(2).

$ ./thread_ipc_perf.pl -m queue
benchmark took 14 wallclock secs (14.71 usr + 2.51 sys = 17.22 CPU) @ 0.06/s (n=1)

$ ./thread_ipc_perf.pl -m pipe
benchmark took 0 wallclock secs ( 0.59 usr + 0.00 sys = 0.59 CPU) @ 1.69/s (n=1)

Here's a larger run (1M integers) with the same kind of results.

$ ./thread_ipc_perf.pl -N 1000000 -m queue
benchmark took 30 wallclock secs (32.69 usr + 6.06 sys = 38.75 CPU) @ 0.03/s (n=1)

$ ./thread_ipc_perf.pl -N 1000000 -m pipe
benchmark took 1 wallclock secs ( 1.23 usr + 0.00 sys = 1.23 CPU) @ 0.81/s (n=1)

Source code for the above simple benchmark is at
http://www.isi.edu/~johnh/SOFTWARE/FSDB/thread_ipc_perf.pl.txt

We can quibble over the exact multiplier (maybe it's only 15x slower),
but it's *really* slow.

Any suggestions? I get similar results if I simplify Thread::Queue to
bare minimum code.
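
For reference, the bare-minimum queue version looks roughly like this (a
sketch, not the linked benchmark; the count and variable names are
illustrative):

    use threads;
    use Thread::Queue;

    my $N = 500_000;
    my $q = Thread::Queue->new();

    # Producer thread: enqueue $N integers, then an undef sentinel.
    my $producer = threads->create(sub {
        $q->enqueue($_) for 1 .. $N;
        $q->enqueue(undef);
    });

    # Consumer (main thread): dequeue until the sentinel arrives.
    my $sum = 0;
    while (defined(my $i = $q->dequeue())) {
        $sum += $i;
    }
    $producer->join();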

To speculate, I'm thinking the cost is in making all IPC data shared.
It would be great if data sent over Thread::Queue could be copied,
not shared.

Thanks for any suggestions,
  -John Heidemann

Perl Info

Flags:
    category=library
    severity=medium
    module=Thread::Queue

Site configuration information for perl 5.16.3:

Configured by Red Hat, Inc. at Tue Jun 18 09:17:09 UTC 2013.

Summary of my perl5 (revision 5 version 16 subversion 3) configuration:
   
  Platform:
    osname=linux, osvers=2.6.32-358.2.1.el6.x86_64, archname=x86_64-linux-thread-multi
    uname='linux buildvm-05.phx2.fedoraproject.org 2.6.32-358.2.1.el6.x86_64 #1 smp wed feb 20 12:17:37 est 2013 x86_64 x86_64 x86_64 gnulinux '
    config_args='-des -Doptimize=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Dccdlflags=-Wl,--enable-new-dtags -Dlddlflags=-shared -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Wl,-z,relro -DDEBUGGING=-g -Dversion=5.16.3 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl5 -Dsitearch=/usr/local/lib64/perl5 -Dprivlib=/usr/share/perl5 -Dvendorlib=/usr/share/perl5/vendor_perl -Darchlib=/usr/lib64/perl5 -Dvendorarch=/usr/lib64/perl5/vendor_perl -Darchname=x86_64-linux-thread-multi -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Duseshrplib -Dusethreads -Duseithreads -Dusedtrace=/usr/bin/dtrace -Duselargefiles -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_sethostent_r_proto -Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto -Ud_setservent_r_proto -Dscriptdir=/usr/bin -Dusesitecustomize'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.8.1 20130603 (Red Hat 4.8.1-1)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -fstack-protector'
    libpth=/usr/local/lib64 /lib64 /usr/lib64
    libs=-lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat
    perllibs=-lresolv -lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.17'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,--enable-new-dtags -Wl,-rpath,/usr/lib64/perl5/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Wl,-z,relro '

Locally applied patches:
    


@INC for perl 5.16.3:
    /usr/local/lib64/perl5
    /usr/local/share/perl5
    /usr/lib64/perl5/vendor_perl
    /usr/share/perl5/vendor_perl
    /usr/lib64/perl5
    /usr/share/perl5
    .


Environment for perl 5.16.3:
    HOME=/home/johnh
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/usr/local/lib
    LOGDIR (unset)
    PATH=/bin:/usr/bin:/usr/local/sbin:/etc:/sbin:/usr/sbin
    PERL_BADLANG (unset)
    SHELL=/bin/bash


p5pRT commented Aug 26, 2013

From @jkeenan

On Fri Aug 23 17​:28​:00 2013, johnh@​isi.edu wrote​:

This is a bug report for perl from johnh@​isi.edu,
generated with the help of perlbug 1.39 running under perl 5.16.3.

-----------------------------------------------------------------

Why is Thread​::Queue *so* slow?

I understand it has to do locking and be careful about data
structures, but it seems like it is about 20x slower than opening up a
Unix pipe, printing to that, reading it back and parsing the result.

Thread​::Queue is correct, but I suggest that 20x slower is a
performance bug.

One would think that IPC through memory would be at least as fast as a
pipe through the kernel, and ideally it should be faster.

Here's timing of a test program that sends 500k integers between two
threads,
using Thread​::Queue or pipe(2).

$ ./thread_ipc_perf.pl -m queue
benchmark took 14 wallclock secs (14.71 usr + 2.51 sys = 17.22 CPU) @​
0.06/s (n=1)

$ ./thread_ipc_perf.pl -m pipe
benchmark took 0 wallclock secs ( 0.59 usr + 0.00 sys = 0.59 CPU) @​
1.69/s (n=1)

Here's a larger run (1M integers) with the same kind of results.

$ ./thread_ipc_perf.pl -N 1000000 -m queue
benchmark took 30 wallclock secs (32.69 usr + 6.06 sys = 38.75 CPU) @​
0.03/s (n=1)

$ ./thread_ipc_perf.pl -N 1000000 -m pipe
benchmark took 1 wallclock secs ( 1.23 usr + 0.00 sys = 1.23 CPU) @​
0.81/s (n=1)

Source code for the above simple benchmark is at
http​://www.isi.edu/~johnh/SOFTWARE/FSDB/thread_ipc_perf.pl.txt

We can quibble over the exact multiplier (maybe it's only 15x slower),
but it's *really* slow.

Any suggestions? I get similar results if I simplify Thread​::Queue to
bare minimum code.

To speculate, I'm thinking the cost is in making all IPC data shared.
It would be great if one could have data that is sent over
Thread​::Queue that is copied, not shared.

Thanks for any suggestions,
-John Heidemann

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags​:
category=library
severity=medium
module=Thread​::Queue
---
Site configuration information for perl 5.16.3​:

Configured by Red Hat, Inc. at Tue Jun 18 09​:17​:09 UTC 2013.

Summary of my perl5 (revision 5 version 16 subversion 3)
configuration​:

Platform​:
osname=linux, osvers=2.6.32-358.2.1.el6.x86_64, archname=x86_64-
linux-thread-multi
uname='linux buildvm-05.phx2.fedoraproject.org 2.6.32-
358.2.1.el6.x86_64 #1 smp wed feb 20 12​:17​:37 est 2013 x86_64
x86_64 x86_64 gnulinux '
config_args='-des -Doptimize=-O2 -g -pipe -Wall
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
--param=ssp-buffer-size=4 -grecord-gcc-switches -m64
-mtune=generic -Dccdlflags=-Wl,--enable-new-dtags
-Dlddlflags=-shared -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4
-grecord-gcc-switches -m64 -mtune=generic -Wl,-z,relro
-DDEBUGGING=-g -Dversion=5.16.3 -Dmyhostname=localhost
-Dperladmin=root@​localhost -Dcc=gcc -Dcf_by=Red Hat, Inc.
-Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr/local
-Dsitelib=/usr/local/share/perl5 -Dsitearch=/usr/local/lib64/perl5
-Dprivlib=/usr/share/perl5 -Dvendorlib=/usr/share/perl5/vendor_perl
-Darchlib=/usr/lib64/perl5
-Dvendorarch=/usr/lib64/perl5/vendor_perl
-Darchname=x86_64-linux-thread-multi -Dlibpth=/usr/local/lib64
/lib64 /usr/lib64 -Duseshrplib -Dusethreads -Duseithreads
-Dusedtrace=/usr/bin/dtrace -Duselargefiles -Dd_semctl_semun -Di_db
-Ui_ndbm -Di_gdbm -Di_shadow -Di_sysl!
og -Dman3
ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005
-Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto
-Ud_endhostent_r_proto -Ud_sethostent_r_proto
-Ud_endprotoent_r_proto -Ud_setprotoent_r_proto
-Ud_endservent_r_proto -Ud_setservent_r_proto -Dscriptdir=/usr/bin
-Dusesitecustomize'

That's a lot of configuration options. While I don't doubt that you
have a reason for all of them, I also doubt that many people are going
to want to build a perl with all those options just for the purpose of
testing your claim.

Would it be possible for you to try this again with the absolute minimum
number of configuration options required to build a threaded perl which
manifests the problem?

Thank you very much.
Jim Keenan


p5pRT commented Aug 26, 2013

The RT System itself - Status changed from 'new' to 'open'


p5pRT commented Aug 26, 2013

From @iabyn

On Sun, Aug 25, 2013 at 05​:37​:39PM -0700, James E Keenan via RT wrote​:

On Fri Aug 23 17​:28​:00 2013, johnh@​isi.edu wrote​:

Why is Thread​::Queue *so* slow?

I understand it has to do locking and be careful about data
structures, but it seems like it is about 20x slower than opening up a
Unix pipe, printing to that, reading it back and parsing the result.

Because it is nothing like a UNIX pipe.

A UNIX pipe takes a stream of bytes, and reads and writes chunks of them
into a shared buffer.

A T::Q buffer takes a stream of perl "things", which might be objects or
other such complex structures, and ensures that they are accessible by
both the originating thread and any potential consumer thread. Migrating a
perl "thing" across a thread boundary is considerably more complex than
copying a byte across.

To speculate, I'm thinking the cost is in making all IPC data shared.
It would be great if one could have data that is sent over
Thread​::Queue that is copied, not shared.

But T::Q is built upon a shared array, and is designed to handle shared
data.

I think the performance you are seeing is the performance I would expect,
and that this is not a bug.
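
Roughly, each item that passes through a shared queue involves work along
these lines (a simplified sketch using threads::shared primitives, not
Thread::Queue's actual source):

    use threads;
    use threads::shared;

    my @buffer : shared;   # the queue's underlying shared array

    sub enqueue_one {
        my ($item) = @_;
        lock(@buffer);                       # take the queue lock
        push @buffer, shared_clone($item);   # deep-copy the item into shared storage
        cond_signal(@buffer);                # wake one waiting consumer
    }

    sub dequeue_one {
        lock(@buffer);
        cond_wait(@buffer) until @buffer;    # block until an item is available
        return shift @buffer;
    }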

--
In England there is a special word which means the last sunshine
of the summer. That word is "spring".


p5pRT commented Aug 26, 2013

From johnh@isi.edu

On Mon, 26 Aug 2013 08​:11​:12 -0700, "Dave Mitchell via RT" wrote​:

On Sun, Aug 25, 2013 at 05​:37​:39PM -0700, James E Keenan via RT wrote​:

On Fri Aug 23 17​:28​:00 2013, johnh@​isi.edu wrote​:

Why is Thread​::Queue *so* slow?

I understand it has to do locking and be careful about data
structures, but it seems like it is about 20x slower than opening up a
Unix pipe, printing to that, reading it back and parsing the result.

Because it is nothing like a UNIX pipe.

A UNIX pipe takes a stream of bytes, and read and writes chunks of them
into a shared buffer.

A T​::Q buffer takes a stream of perl "things", which might be objects or
other such complex structures, and ensures they they are accessible by
both the originating thread and any potential consumer thread. Migrating a
perl "thing" across a thread boundary is considerably more complex than
copying a byte across.

To speculate, I'm thinking the cost is in making all IPC data shared.
It would be great if one could have data that is sent over
Thread​::Queue that is copied, not shared.

But T​::Q is build upon a shared array, and is designed to handled shared
data.

I think the performance you are seeing is the performance I would expect,
and that this is not a bug.

I understand that Thread​::Queue and perl threads allow shared data, and that
that's much more than a pipe.

My concern is that Thread::Queue also *forces* shared data, even when
it's not required. If that sharing comes with a 20x performance hit,
that should be clear.

From perlthrtut, the "Pipeline" model

  The pipeline model divides up a task into a series of steps, and passes
  the results of one step on to the thread processing the next. Each
  thread does one thing to each piece of data and passes the results to
  the next thread in line.

For the pipeline model, one does not need repeated sharing, just a
one-time hand-off. Each queue is FIFO, with data touched by only one
thread at a time. That's exactly what my particular application needs
to do.
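
In Thread::Queue terms, that hand-off pattern looks roughly like this (a
sketch; the stage names and the doubling step are illustrative):

    use threads;
    use Thread::Queue;

    my $raw    = Thread::Queue->new();   # stage 1 -> stage 2
    my $cooked = Thread::Queue->new();   # stage 2 -> stage 3

    # Stage 2: take each item off $raw, do one thing to it, pass it on.
    my $worker = threads->create(sub {
        while (defined(my $item = $raw->dequeue())) {
            $cooked->enqueue($item * 2);
        }
        $cooked->enqueue(undef);          # propagate the end-of-stream sentinel
    });

    # Stage 1: feed the pipeline, then send the sentinel.
    $raw->enqueue($_) for 1 .. 10;
    $raw->enqueue(undef);

    # Stage 3: drain the results.
    while (defined(my $result = $cooked->dequeue())) {
        print "$result\n";
    }
    $worker->join();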

But one does not *want* sharing (for the pipeline model) there if it's a
20x performance hit.

If the statement is that queues should require shared data and the
corresponding performance hit, that's a design choice one could make.
Then I'd suggest the bug becomes: perlthrtut should say "don't use
Thread::Queue for the pipeline model if you expect high performance,
roll your own IPC".

Alternatively, I'd love some mechanism to share data between threads
that allows a one-time handoff (not repeated sharing) with pipe-like
performance. One would *think* that shared memory should be able to be
faster than round-tripping through a pipe (with perl parsing and kernel
IO). It seems like a shame that perl is forcing full-on sharing since
it's slow and not required (in this case).

  -John


p5pRT commented Aug 26, 2013

From johnh@isi.edu

On Sun, 25 Aug 2013 17​:37​:39 -0700, "James E Keenan via RT" wrote​:

On Fri Aug 23 17​:28​:00 2013, johnh@​isi.edu wrote​:

This is a bug report for perl from johnh@​isi.edu,
generated with the help of perlbug 1.39 running under perl 5.16.3.

-----------------------------------------------------------------

Why is Thread​::Queue *so* slow?
...

$ ./thread_ipc_perf.pl -m queue
benchmark took 14 wallclock secs (14.71 usr + 2.51 sys = 17.22 CPU) @​
0.06/s (n=1)

$ ./thread_ipc_perf.pl -m pipe
benchmark took 0 wallclock secs ( 0.59 usr + 0.00 sys = 0.59 CPU) @​
1.69/s (n=1)
...

Source code for the above simple benchmark is at
http​://www.isi.edu/~johnh/SOFTWARE/FSDB/thread_ipc_perf.pl.txt
...

Site configuration information for perl 5.16.3​:

Configured by Red Hat, Inc. at Tue Jun 18 09​:17​:09 UTC 2013.

Summary of my perl5 (revision 5 version 16 subversion 3)
configuration​:

Platform​:
osname=linux, osvers=2.6.32-358.2.1.el6.x86_64, archname=x86_64-
linux-thread-multi
uname='linux buildvm-05.phx2.fedoraproject.org 2.6.32-
358.2.1.el6.x86_64 #1 smp wed feb 20 12​:17​:37 est 2013 x86_64
x86_64 x86_64 gnulinux '
config_args='-des -Doptimize=-O2 -g -pipe -Wall
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
--param=ssp-buffer-size=4 -grecord-gcc-switches -m64
-mtune=generic -Dccdlflags=-Wl,--enable-new-dtags
-Dlddlflags=-shared -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
-fexceptions -fstack-protector --param=ssp-buffer-size=4
-grecord-gcc-switches -m64 -mtune=generic -Wl,-z,relro
-DDEBUGGING=-g -Dversion=5.16.3 -Dmyhostname=localhost
-Dperladmin=root@​localhost -Dcc=gcc -Dcf_by=Red Hat, Inc.
-Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr/local
-Dsitelib=/usr/local/share/perl5 -Dsitearch=/usr/local/lib64/perl5
-Dprivlib=/usr/share/perl5 -Dvendorlib=/usr/share/perl5/vendor_perl
-Darchlib=/usr/lib64/perl5
-Dvendorarch=/usr/lib64/perl5/vendor_perl
-Darchname=x86_64-linux-thread-multi -Dlibpth=/usr/local/lib64
/lib64 /usr/lib64 -Duseshrplib -Dusethreads -Duseithreads
-Dusedtrace=/usr/bin/dtrace -Duselargefiles -Dd_semctl_semun -Di_db
-Ui_ndbm -Di_gdbm -Di_shadow -Di_sysl!
og -Dman3
ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005
-Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto
-Ud_endhostent_r_proto -Ud_sethostent_r_proto
-Ud_endprotoent_r_proto -Ud_setprotoent_r_proto
-Ud_endservent_r_proto -Ud_setservent_r_proto -Dscriptdir=/usr/bin
-Dusesitecustomize'

That's a lot of configuration options. While I don't doubt that you
have a reason for all of them, I also doubt that many people are going
to want to build a perl with all those options just for the purpose of
testing your claim.

Would it be possible for you to try this again with the absolute minimum
number of configuration options required to build a threaded perl which
manifests the problem?

Thank you very much.
Jim Keenan

Thanks for the reply.

I don't build perl myself; those are the default configure options for
Fedora Linux. (Presumably RHEL and its derivatives use similar builds.)

I can build perl if you really want, but let me suggest an alternative
if you don't mind​:

I provided source code to my benchmark program at​:

  http​://www.isi.edu/~johnh/SOFTWARE/FSDB/thread_ipc_perf.pl.txt

and the two invocations that clearly show the difference on my platform​:

$ ./thread_ipc_perf.pl -m queue
benchmark took 14 wallclock secs (14.71 usr + 2.51 sys = 17.22 CPU) @​
0.06/s (n=1)

$ ./thread_ipc_perf.pl -m pipe
benchmark took 0 wallclock secs ( 0.59 usr + 0.00 sys = 0.59 CPU) @​
1.69/s (n=1)

The benchmark is 293 lines long, but it's mostly POD documentation and
boilerplate. Can I suggest you download the benchmark and try those two
invocations ("./thread_ipc_perf.pl -m queue" and "./thread_ipc_perf.pl
-m pipe") on whatever perl you prefer?

If some other platform or build has much different performance, I'll
take this up with my OS provider.

  -John


p5pRT commented Aug 26, 2013

From @ikegami

How does Thread::Queue::Any compare?

On Mon, Aug 26, 2013 at 11​:58 AM, John Heidemann <johnh@​isi.edu> wrote​:

On Mon, 26 Aug 2013 08​:11​:12 -0700, "Dave Mitchell via RT" wrote​:

On Sun, Aug 25, 2013 at 05​:37​:39PM -0700, James E Keenan via RT wrote​:

On Fri Aug 23 17​:28​:00 2013, johnh@​isi.edu wrote​:

Why is Thread​::Queue *so* slow?

I understand it has to do locking and be careful about data
structures, but it seems like it is about 20x slower than opening up a
Unix pipe, printing to that, reading it back and parsing the result.

Because it is nothing like a UNIX pipe.

A UNIX pipe takes a stream of bytes, and read and writes chunks of them
into a shared buffer.

A T​::Q buffer takes a stream of perl "things", which might be objects or
other such complex structures, and ensures they they are accessible by
both the originating thread and any potential consumer thread. Migrating a
perl "thing" across a thread boundary is considerably more complex than
copying a byte across.

To speculate, I'm thinking the cost is in making all IPC data shared.
It would be great if one could have data that is sent over
Thread​::Queue that is copied, not shared.

But T​::Q is build upon a shared array, and is designed to handled shared
data.

I think the performance you are seeing is the performance I would expect,
and that this is not a bug.

I understand that Thread​::Queue and perl threads allow shared data, and
that
that's much more than a pipe.

My concern is that Thread​::Queue also *forces* shared data, even when
it's not rqeuired. If that sharing comes with a 20x performance hit,
that should be clear.

From perlthrtut, the "Pipeline" model

   The pipeline model divides up a task into a series of steps, and passes
   the results of one step on to the thread processing the next. Each
   thread does one thing to each piece of data and passes the results to
   the next thread in line.

For the pipeline model, one does not need repeated sharing, just a
one-time hand-off. Each queue is FIFO with data touched by only one
thread at a time. That's exactly what my particular applications needs
to do.

But one does not *want* sharing (for the pipeline model) there if it's a
20x performance hit.

If the statement is that queues should require shared data and the
corresponding performance hit, that's a design choice one could make.
Then I'd suggest the bug becomes​: perlthrtut should say "don't use
Thread​::Queue for the pipeline model if you expect high performance,
roll your own IPC".

Alternatively, I'd love some mechanism to share data between threads
that allows a one-time handoff (not repeated sharing) with pipe-like
performance. One would *think* that shared memory should be able to be
faster than round-tripping through a pipe (with perl parsing and kernel
IO). It seems like a shame that perl is forcing full-on sharing since
it's slow and not required (in this case).

-John


p5pRT commented Aug 26, 2013

From @lizmat

On Aug 26, 2013, at 5​:58 PM, John Heidemann <johnh@​isi.edu> wrote​:

On Mon, 26 Aug 2013 08​:11​:12 -0700, "Dave Mitchell via RT" wrote​:

On Sun, Aug 25, 2013 at 05​:37​:39PM -0700, James E Keenan via RT wrote​:

On Fri Aug 23 17​:28​:00 2013, johnh@​isi.edu wrote​:

Why is Thread​::Queue *so* slow?

I understand it has to do locking and be careful about data
structures, but it seems like it is about 20x slower than opening up a
Unix pipe, printing to that, reading it back and parsing the result.
Because it is nothing like a UNIX pipe.

A UNIX pipe takes a stream of bytes, and read and writes chunks of them
into a shared buffer.

A T​::Q buffer takes a stream of perl "things", which might be objects or
other such complex structures, and ensures they they are accessible by
both the originating thread and any potential consumer thread. Migrating a
perl "thing" across a thread boundary is considerably more complex than
copying a byte across.

To speculate, I'm thinking the cost is in making all IPC data shared.
It would be great if one could have data that is sent over
Thread​::Queue that is copied, not shared.

But T​::Q is build upon a shared array, and is designed to handled shared
data.

I think the performance you are seeing is the performance I would expect,
and that this is not a bug.

I understand that Thread​::Queue and perl threads allow shared data, and that
that's much more than a pipe.

My concern is that Thread​::Queue also *forces* shared data, even when
it's not rqeuired. If that sharing comes with a 20x performance hit,
that should be clear.

You should realize that the perl ithreads implementation does *not* have any real shared variables at all. Each thread has its own *copy* of the world.

Variables with the :shared trait are simply tied() variables attached to some internal logic that will STORE values in yet another, hidden thread, and will FETCH them from that hidden thread again when needed. There is some locking involved there, I would assume. But I think the biggest bottleneck is really that the slow tie() interface is used for shared variables.
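
A small illustration of what storing into shared storage entails (a sketch;
the hash and its keys are arbitrary):

    use threads;
    use threads::shared;

    my %h : shared;
    $h{count} = 42;                       # plain scalars are copied into shared storage
    # $h{ref} = { a => 1 };               # would die: can't store an unshared reference
    $h{ref} = shared_clone({ a => 1 });   # deep-copies the structure into shared storage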

The forks module does not do this differently. However, instead of making a copy of the world each time a thread is started, the forks module just does a fork() and lets the OS take care of any Copy-On-Write needed. This makes starting a thread *much* faster, especially if you have something like Moose and its dependencies loaded. Reading and writing shared variables is done using pipes, Unix pipes if possible.

Thread::Queue::Any is simply a wrapper around Thread::Queue, and thus suffers from the same performance issues.

In other words: don't use Perl 5's ithreads for performance; use them only for asynchronous jobs, where the benefit is not having to wait for something slow.

Liz


p5pRT commented Aug 27, 2013

From @Leont

On Mon, Aug 26, 2013 at 5​:58 PM, John Heidemann <johnh@​isi.edu> wrote​:

I understand that Thread​::Queue and perl threads allow shared data, and
that
that's much more than a pipe.

My concern is that Thread​::Queue also *forces* shared data, even when
it's not rqeuired. If that sharing comes with a 20x performance hit,
that should be clear.

From perlthrtut, the "Pipeline" model

   The pipeline model divides up a task into a series of steps, and passes
   the results of one step on to the thread processing the next. Each
   thread does one thing to each piece of data and passes the results to
   the next thread in line.

For the pipeline model, one does not need repeated sharing, just a
one-time hand-off. Each queue is FIFO with data touched by only one
thread at a time. That's exactly what my particular applications needs
to do.

But one does not *want* sharing (for the pipeline model) there if it's a
20x performance hit.

If the statement is that queues should require shared data and the
corresponding performance hit, that's a design choice one could make.
Then I'd suggest the bug becomes​: perlthrtut should say "don't use
Thread​::Queue for the pipeline model if you expect high performance,
roll your own IPC".

Actually I did write a queue implementation for threads​::lite that should
be a lot faster for simple data structures, but I never released it as a
separate module that could be used with threads.pm.

Alternatively, I'd love some mechanism to share data between threads
that allows a one-time handoff (not repeated sharing) with pipe-like
performance. One would *think* that shared memory should be able to be
faster than round-tripping through a pipe (with perl parsing and kernel
IO). It seems like a shame that perl is forcing full-on sharing since
it's slow and not required (in this case).

I don't think that would be faster than a queue; given perl's memory model
(memory has to be owned by a thread, and shared memory has to be handled
manually), a copy or two is necessary anyway.

Leon


p5pRT commented Aug 27, 2013

From @nwc10

On Mon, Aug 26, 2013 at 08​:58​:14AM -0700, John Heidemann wrote​:

My concern is that Thread​::Queue also *forces* shared data, even when
it's not rqeuired. If that sharing comes with a 20x performance hit,
that should be clear.

Yes, I agree that that's a valid concern, and we could document that better.

As someone rather too close to the code, it's not easy to pull back far
enough to work out where someone reading the documentation for the first
time would have expected to have found such a warning.

Do you have a suggestion for where we should document this, such that you
would have read it had it been there? (Even better if you can suggest a
suitable change)

Alternatively, I'd love some mechanism to share data between threads
that allows a one-time handoff (not repeated sharing) with pipe-like
performance. One would *think* that shared memory should be able to be
faster than round-tripping through a pipe (with perl parsing and kernel
IO). It seems like a shame that perl is forcing full-on sharing since
it's slow and not required (in this case).

Agree, I'd love this too. It would permit a lot of effective higher level
concurrency designs to work*. But sadly I don't believe that Perl 5 will
ever be able to provide a performant hand-off mechanism. The internals
assume all over that it's safe for any logical read to actually be a write
behind the scenes (making it awkward to provide any sort of read-only view
of another thread's data), and all interpreter data structures are
implicitly tied to the interpreter that allocated them, which would take a
massive amount of refactoring to attempt to untangle.

I don't think that this is particularly a Perl problem. I'm not aware of any
comparable C-based dynamic language that has managed to retrofit true
concurrency. CPython still has a GIL (and Unladen Swallow failed to deliver
on its design to remove that), my understanding is that Ruby (MRI/YARV)
still single-threads its interpreter, and PHP doesn't even offer threading.
If we had a design to steal, we'd steal it. :-/

Nicholas Clark

* such as the rather nice constructions that Jonathan Worthington demonstrated
  for Perl 6: http://jnthn.net/papers/2013-yapceu-conc.pdf
  (Video not yet online)


p5pRT commented Aug 28, 2013

From johnh@isi.edu

On Tue, 27 Aug 2013 11​:18​:57 +0100, Nicholas Clark wrote​:

On Mon, Aug 26, 2013 at 08​:58​:14AM -0700, John Heidemann wrote​:

My concern is that Thread​::Queue also *forces* shared data, even when
it's not rqeuired. If that sharing comes with a 20x performance hit,
that should be clear.

Yes, I agree that that's a valid concern, and we could document that better.

As someone rather too close to the code, it's not easy to pull back far
enough to work out where someone reading the documentation for the first
time would have expected to have found such a warning.

Do you have a suggestion for where we should document this, such that you
would have read it had it been there? (Even better if you can suggest a
suitable change)

A proposed patch to perlthrtut is attached at the end of this message.

Alternatively, I'd love some mechanism to share data between threads
that allows a one-time handoff (not repeated sharing) with pipe-like
performance. One would *think* that shared memory should be able to be
faster than round-tripping through a pipe (with perl parsing and kernel
IO). It seems like a shame that perl is forcing full-on sharing since
it's slow and not required (in this case).

Agree, I'd love this too. It would permit a lot of effective higher level
concurrency designs to work*. But sadly I don't believe that Perl 5 will
ever be able to provide a performant hand-off mechanism. The internals
assume all over that it's safe for any logical read to actually be a write
behind the scenes (making it awkward to provide any sort of read-only view
of another thread's data), and all interpreter data structures are
implicitly tied to the interpreter that allocated them, which would take a
massive amount of refactoring to attempt to untangle.

I don't think that this is particularly a Perl problem. I'm not aware of any
comparable C-based dynamic language has managed to retrofit true
concurrency. CPython still has a GIL (and Unladen Swallow failed to deliver
on its design to remove that), and my understanding is that Ruby (MRI/YARV)
still single-threads its interpreter, and PHP doesn't even offer threading.
If we had a design to steal, we'd steal it. :-/

I don't know anything about C-level internals of perl.

I agree these are inherent in *shared* variables, independent of language.

It's too bad there's no way to move data between two threads without
making the data shared (other than the move). A one-time copy from
thread A to B. C-only programs have done this for ages (see for
example, "The Duality of Memory and Communication in the
Implementation of a Multiprocessor Operating
System" by Young et al, ACM SOSP 1987).

What I'll do for now is to get this effect by printing it to a pipe and
reading it back in through the other end, but boy, what a lot of work on
the perl side that could be hidden inside the C, both cleaner and
hopefully faster.
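
That workaround looks roughly like this (a sketch; it reads a known number
of lines so it does not depend on EOF semantics across threads, and the
count is illustrative):

    use threads;

    pipe(my $reader, my $writer) or die "pipe: $!";
    my $N = 500_000;

    # Producer thread: serialize each record as one line of text.
    my $producer = threads->create(sub {
        print {$writer} "$_\n" for 1 .. $N;
        close $writer;                 # flush the buffered output
    });

    # Consumer (main thread): read the lines back and re-parse them.
    my $sum = 0;
    for (1 .. $N) {
        my $line = <$reader>;
        chomp $line;
        $sum += $line;
    }
    $producer->join();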

  -John


Inline Patch
--- perlthrtut.pod-	2013-08-27 08:47:16.347167972 -0700
+++ perlthrtut.pod	2013-08-27 08:53:26.159772710 -0700
@@ -465,6 +465,13 @@
 data inconsistency and race conditions. Note that Perl will protect its
 internals from your race conditions, but it won't protect you from you.
 
+=head2 Thread Pitfalls: Performance
+
+Shared data is and locking expensive, slowing down access.
+As of perl 5.18, one should expect sharing data between threads
+with tools such as L<Thread::Queue> to be about 15-20x slower
+than copying the data through L<pipe(2)>.
+
 =head1 Synchronization and control
 
 Perl provides a number of mechanisms to coordinate the interactions


p5pRT commented Aug 29, 2013

From @tamias

On Tue, Aug 27, 2013 at 05​:15​:09PM -0700, John Heidemann wrote​:

--- perlthrtut.pod- 2013-08-27 08​:47​:16.347167972 -0700
+++ perlthrtut.pod 2013-08-27 08​:53​:26.159772710 -0700
@​@​ -465,6 +465,13 @​@​
data inconsistency and race conditions. Note that Perl will protect its
internals from your race conditions, but it won't protect you from you.

+=head2 Thread Pitfalls​: Performance
+
+Shared data is and locking expensive, slowing down access.

I think this sentence got a bit mixed up.

Ronald


p5pRT commented Aug 30, 2013

From @nwc10

On Tue, Aug 27, 2013 at 05​:15​:09PM -0700, John Heidemann wrote​:

On Tue, 27 Aug 2013 11​:18​:57 +0100, Nicholas Clark wrote​:

Do you have a suggestion for where we should document this, such that you
would have read it had it been there? (Even better if you can suggest a
suitable change)

A proposed patch to perlthrtut is attached at the end of this message.

Thanks

It's too bad there's no way to move data between two threads without
making the data shared (other than the move). A one-time copy from
thread A to B. C-only programs have done this for ages (see for
example, "The Duality of Memory and Communication in the
Implementation of a Multiprocessor Operating
System" by Young et al, ACM SOSP 1987).

Agree that it's frustrating.

That paper seems to predate Perl 1 by about 5 weeks, but I don't think that
the complexity trade off to facilitate concurrency became a concern of
mainstream development until some point after Perl 5 shipped in 1994.
By which time, of course, it's too late to add it in from the start.
(And the Perl 5 codebase is a rewrite of Perl 4, which traces history all the
way back to Perl 1, so really it needed to be in by December 1987 to be
helpful)

I feel that it's the same fundamental problem as attempting to retrofit
Unicode support. Bolting it on later will never work completely - it has to
be in the design from the start.

----------------------------------------------------------------------
--- perlthrtut.pod- 2013-08-27 08​:47​:16.347167972 -0700
+++ perlthrtut.pod 2013-08-27 08​:53​:26.159772710 -0700
@​@​ -465,6 +465,13 @​@​
data inconsistency and race conditions. Note that Perl will protect its
internals from your race conditions, but it won't protect you from you.

+=head2 Thread Pitfalls​: Performance
+
+Shared data is and locking expensive, slowing down access.
+As of perl 5.18, one should expect sharing data between threads
+with tools such as L<Thread​::Queue> to be about 15-20x slower
+than copying the data through L<pipe(2)>.
+
=head1 Synchronization and control

Perl provides a number of mechanisms to coordinate the interactions

On Wed, Aug 28, 2013 at 11​:30​:58PM -0400, Ronald J Kimball wrote​:

I think this sentence got a bit mixed up.

I think also that it should mention your insight about what's not obvious
about performance - lack of handoff. I don't think that the performance has
changed much historically, and I don't foresee a way to change it in the future,
so I think that having a version number in there isn't that useful.
So this instead?

  Shared data and locking are expensive, slowing down access.
  Perl 5 has no way of passing ownership of data between threads, so all
  thread operations involve data becoming shared. One should expect sharing
  data between threads with tools such as L<Thread​::Queue> to be about
  15-20x slower than copying the data through L<pipe(2)>.

If in the future someone does radically improve thread performance, then I'd
expect them to revisit the documentation to update it with new figures
(and publicise their success).

Nicholas Clark


p5pRT commented Aug 30, 2013

From @Leont

On Tue, Aug 27, 2013 at 12​:11 PM, Leon Timmermans <fawaka@​gmail.com> wrote​:

Actually I did write a queue implementation for threads​::lite that should
be a lot faster for simple data structures, but I never released it as a
separate module that could be used with threads.pm.

You can find it on GitHub at https://github.com/Leont/thread-channel; it
will probably be released to CPAN as soon as I've written tests for it.
I've created a benchmark based on your own; it's about 30% slower than
pipes for simple strings, but unlike pipes it can also handle complex
data structures.

Leon


p5pRT commented Aug 30, 2013

From johnh@isi.edu

On Fri, 30 Aug 2013 20​:27​:08 +0200, Leon Timmermans wrote​:

On Tue, Aug 27, 2013 at 12​:11 PM, Leon Timmermans <fawaka@​gmail.com> wrote​:

Actually I did write a queue implementation for threads​::lite that should
be a lot faster for simple data structures, but I never released it as a
separate module that could be used with threads.pm.

You can find it on github at https://github.com/Leont/thread-channel, it will
probably be released to cpan as soon as I've written tests for it. I've
created a benchmark based on your own, it's about 30% slower than pipes for
simple strings, but unlike strings can also handle complex datastructures.

Leon

That sounds great. Should it be Thread::Queue::Fast
or Thread::Queue::Nonshared?

  -John
