Thread creation time is hypersensitive #7298

Closed
p5pRT opened this issue May 16, 2004 · 16 comments

p5pRT commented May 16, 2004

Migrated from rt.perl.org#29637 (status was 'resolved')

Searchable as RT29637$


p5pRT commented May 16, 2004

From @jlokier

Created by @jlokier

Please take the time to read this report carefully, as it appears to
be a Perl ithreads bug which, if fixed, would probably improve thread
creation time substantially in all cases - and it's probably a simple
logic error in the threading code.

The time to create a new thread is highly sensitive to conditions that
shouldn't affect the time.

For example, adding a single short, semantically void comment line to a
large script can _quadruple_ the time to create new threads in that
script. Or it can have the opposite effect: adding a comment line can
reduce the time by a similar factor. Many other changes have a
similar effect.

The effect is repeatable: a script which starts threads slowly does so
each time the program is run. Similarly for scripts which start
threads quickly.

It is also sensitive to names: changing the name of some modules in
tests removes the effect - effectively changing the boundary
condition which highlights it.

The time variation is accompanied by variation in the amount of memory
allocated when creating a thread. "strace" shows that slow thread
creation is accompanied by more mmap() calls to allocate memory. In
fact most of the thread creation time appears to be spent allocating
memory in tiny mmap() calls.

There's strong evidence that this is a Perl bug, not a libc thread
creation bug. "strace" shows a very large number of 4k mmap() calls
to allocate memory on each thread creation. For one, this is silly
and slow​: pages should be allocated many at a time, and handed out
within the program, not using thousands of small mmap() calls. For
two, the enormous variation in thread creation times is matched with
variation in the number of single-page mmap() calls from Perl. This
implies that Perl's own allocation and copying logic for non-shared
data is not behaving as intended.

I have tried to produce examples of two Perl programs which differ
only by a few comment characters and exhibit a large difference in
thread creation time. I succeeded, with one taking ~40ms per thread
and the other taking ~150ms per thread, but the programs consist of
several large modules, derived from a real program when I first
noticed the problem, and trimming the code down is difficult because
the condition is so sensitive to the numbers of characters in each
file, the number of functions and variables created and their names etc.

That's why I haven't included example code. If there's a need for it,
I can supply a set of files privately, but I think there's a chance
the bug can be diagnosed from this description alone.

I noticed this problem when trying to speed up a script which uses
threads, and takes about 2 seconds to create each new thread on a
1.5GHz x86. I saw that the script uses modules which are slow to
load, "IO::Socket::INET" in particular. I guessed that loading big
modules would slow down thread creation because of the copying
(although 2 seconds is a remarkably long time for copying anything).

So I removed references to all IO::* modules, and it made the script
much faster. I removed more modules, it helped some more. I thought
I had understood the problem. Then imagine my surprise when I
commented _out_ a line saying "use Fcntl" and thread creation became
much slower again! With some more experimentation, I discovered
thread creation time is only partly to do with the modules loaded, and
much of the time, when it's slow, is caused by this hypersensitive
effect which is accompanied by large numbers of mmap() calls.
Unfortunately I haven't found a coding strategy to avoid the effect
and get consistently fast thread creation. The only way to get that
is to fix the Perl core - which is a bit too much for me.

Thanks,
-- Jamie

Perl Info

Flags:
    category=core
    severity=high

Site configuration information for perl v5.8.0:

Configured by bhcompile at Wed Aug 13 11:45:59 EDT 2003.

Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.4.21-1.1931.2.382.entsmp, archname=i386-linux-thread-multi
    uname='linux str'
    config_args='-des -Doptimize=-O2 -g -pipe -march=i386 -mcpu=i686 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Darchname=i386-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Dotherlibdirs=/usr/lib/perl5/5.8.0 -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-O2 -g -pipe -march=i386 -mcpu=i686',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='3.2.2 20030222 (Red Hat Linux 3.2.2-5)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/u'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lpthread -lc -lcrypt -lutil
    perllibs=
    libc=/lib/libc-2.3.2.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.3.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE'
    cccdlflags='-fPIC'

Locally applied patches:
    MAINT18379


@INC for perl v5.8.0:
    /usr/lib/perl5/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/5.8.0
    /usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.0
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.0
    /usr/lib/perl5/vendor_perl
    /usr/lib/perl5/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/5.8.0
    .


Environment for perl v5.8.0:
    HOME=/home/jamie
    LANG=en_GB.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/jamie/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash


p5pRT commented May 17, 2004

From @iabyn

On Sun, May 16, 2004 at 11:25:12PM -0000, Jamie Lokier wrote:

> The time to create a new thread is highly sensitive to conditions that
> shouldn't affect the time.

thanks for the report.

> There's strong evidence that this is a Perl bug, not a libc thread
> creation bug. "strace" shows a very large number of 4k mmap() calls
> to allocate memory on each thread creation. For one, this is silly
> and slow: pages should be allocated many at a time, and handed out
> within the program, not using thousands of small mmap() calls.

Perl is just using the system's malloc() library. If that does lots of
mmaps, then that's a design decision of libc. Note that in at least one
version of libc, there was a bug in the malloc() library that could cause
extreme slowdowns. Would it be possible to build a version of Perl
(preferably 5.8.4) with -Dusemymalloc, and see if the problem goes away?

Failing that ...

> I have tried to produce examples of two Perl programs which differ
> only by a few comment characters and exhibit a large difference in
> thread creation time. I succeeded, with one taking ~40ms per thread
> and the other taking ~150ms per thread, but the programs consist of
> several large modules, derived from a real program when I first
> noticed the problem, and trimming the code down is difficult because
> the condition is so sensitive to the numbers of characters in each
> file, the number of functions and variables created and their names etc.
>
> That's why I haven't included example code. If there's a need for it,
> I can supply a set of files privately, but I think there's a chance
> the bug can be diagnosed from this description alone.

... could you supply the files, thanks.

Dave.

--
But Pity stayed his hand. "It's a pity I've run out of bullets",
he thought. - "Bored of the Rings"


p5pRT commented May 17, 2004

The RT System itself - Status changed from 'new' to 'open'


p5pRT commented May 18, 2004

From @jlokier

Dave Mitchell via RT wrote​:

>> There's strong evidence that this is a Perl bug, not a libc thread
>> creation bug. "strace" shows a very large number of 4k mmap() calls
>> to allocate memory on each thread creation. For one, this is silly
>> and slow: pages should be allocated many at a time, and handed out
>> within the program, not using thousands of small mmap() calls.

> Perl is just using the system's malloc() library. If that does lots of
> mmaps, then that's a design decision of libc. Note that in at least one
> version of libc, there was a bug in the malloc() library that could cause
> extreme slowdowns.

My mistake - in fact it's doing hundreds of tiny mprotect() calls, not
mmap() calls. You are correct, they're from malloc(). Red Hat's
threaded malloc() is doing lots of mprotect() calls, although there
are no mprotect() calls until the first thread (non-main) is created.

The call chains for the bulk of mprotect() calls look like:

```
#0  0x420dbc60 in mprotect () from /lib/tls/libc.so.6
#1  0x420766db in sYSMALLOc () from /lib/tls/libc.so.6
#2  0x4207378d in malloc () from /lib/tls/libc.so.6
#3  0x40086fab in Perl_safesysmalloc (size=1) at util.c:69
#4  0x400a485c in S_more_sv (my_perl=0x40d00010) at sv.c:293
#5  0x400b65d1 in Perl_sv_dup (my_perl=0x8847250, sstr=0x40d00010,
    param=0xbffff630) at sv.c:9408
#6  0x400b5aad in Perl_sv_dup (my_perl=0x8847250, sstr=0x40d00010,
    param=0xbffff630) at sv.c:9569
#7  0x400b5aad in Perl_sv_dup (my_perl=0x8847250, sstr=0x40d00010,
    param=0xbffff630) at sv.c:9569
```

etc. - very deep Perl_sv_dup and similar calls

You were correct, it's due to malloc(), and these are clearly
occurring during a bulk data copy.

I compared the strace output (system calls) of the slow and fast test
scripts, and found only a little difference in the numbers of calls.
The slow script did 2439 mprotect calls, the fast did 2217. The run
times are 2.87s vs 0.82s, so clearly the mprotect() calls are not
a significant cause of the time difference.

I tried to use ltrace to count the number of malloc() calls, but
ltrace crashed in various ways.

This multitude of mprotect() calls is clearly a glibc implementation
quirk and can be ignored - so please disregard the part of this
bug report about small allocations.

The main complaint is about the huge variation in thread creation
time, sensitive to trivial changes in a Perl script.

> Would it be possible to build a version of Perl
> (preferably 5.8.4) with -Dusemymalloc, and see if the problem goes away?

It would be quite time consuming as I have 56k dialup, and also the
problem would take a while to verify with a different Perl version as
the boundary condition which distinguishes slow vs. fast scripts is
sensitive to lots of other conditions. I'd have to search for a new
test case, and conclude the problem has gone after spending lots of
time failing to find one.

I'd much prefer if someone else can reproduce the problem, although I
understand that it is not easy to reproduce. If someone else can
reproduce it then I'm willing to do more to help pin it down.

I just had a fine example of two scripts differing by a couple of
trivial lines - one was consistently much slower than the other (the
2.87s vs 0.82s mentioned above). Now when I try it with a changed
environment or even just using a new shell, the differences have
disappeared and they both take 1.25s. With an empty environment (env
-i) they both take about 1.77s. Alas, I have closed the old shell.

Oh, surprise... I have noticed that changing to a different directory
and back makes this case show the effect again. In fact, it's
sensitive to the exact contents of the environment (changing shells
and changing directory altered the OLDPWD environment variable by a
few characters).

In other words, it's tough to find a reproducible test case, and it's
sensitive to trivial environment variables among other things.

The method for finding a test case in your own environment is:

  1. Start with a large script that loads modules and defines lots
     of subroutines etc but doesn't do anything. The effect seems
     to be more pronounced and easier to create with a large script
     which includes several other modules: I have 6200 lines in 250k
     in 8 files in my current test case.

  2. Append this:

     ```perl
     for (0..9) {
         use threads;
         async { print STDERR "new thread\n"; };
     };
     $_->join for threads->list;
     ```

  3. Time it with "time perl script.pl". If the time between
     each new thread, and the overall time, seem slow for mere thread
     creation, we're getting somewhere.

  4. Remove chunks of the program, e.g. function definitions or comments
     or "use" calls or even single statements in functions, until
     the time suddenly jumps up or down. It'll be easier to remove
     things from the end of the program, as they are definitions
     which aren't used.

  5. Having found that removing some function or block of code makes
     a big difference, play around with how much you remove to
     converge on that boundary condition where a slight change makes
     a big change in the time. You will end up with two scripts
     which differ by a few trivial lines and are uniquely sensitive
     to your environment. If you're lucky, you'll see a time
     difference of a factor of four or so; you may find a smaller
     but still significant difference.

> Failing that ...

>> I have tried to produce examples of two Perl programs which differ
>> only by a few comment characters and exhibit a large difference in
>> thread creation time. I succeeded, with one taking ~40ms per thread
>> and the other taking ~150ms per thread, but the programs consist of
>> several large modules, derived from a real program when I first
>> noticed the problem, and trimming the code down is difficult because
>> the condition is so sensitive to the numbers of characters in each
>> file, the number of functions and variables created and their names etc.
>>
>> That's why I haven't included example code. If there's a need for it,
>> I can supply a set of files privately, but I think there's a chance
>> the bug can be diagnosed from this description alone.

> ... could you supply the files, thanks.

Ok. I don't expect they'll demonstrate the problem in your
environment - the recipe I give above is more reliable - but we can
give it a try. It depends on exact Perl source of parsed modules
(including the supplied ones), so to reproduce the problem you pretty
much need to be running the same version of Perl and probably it needs
to be Red Hat 9 as well.

I have 10 files, one of them is a shell script which uses "env" to
create the same environment variables from my machine.

I don't want to send this source publicly (there's a lot of
unreleased code I don't want to share just yet). So I need a private
email address to send it to.

Thanks,
-- Jamie


p5pRT commented May 18, 2004

From @iabyn

On Tue, May 18, 2004 at 12:34:38AM +0100, Jamie Lokier wrote:

> The call chains for the bulk of mprotect() calls look like:
>
> #0  0x420dbc60 in mprotect () from /lib/tls/libc.so.6
> #1  0x420766db in sYSMALLOc () from /lib/tls/libc.so.6
> #2  0x4207378d in malloc () from /lib/tls/libc.so.6
> #3  0x40086fab in Perl_safesysmalloc (size=1) at util.c:69

Unless this is an artifact of gdb running on an optimised build,
the size=1 is very wrong there - it should always be 1008.

> The main complaint is about the huge variation in thread creation
> time, sensitive to trivial changes in a Perl script.

Are these truly trivial changes, or do they change the number of modules
that are use'd or require'd?

The creation of a new thread mainly involves the recursive duplication
of nearly every data structure in the Perl interpreter, so anything
like a use which adds lots of new symbols to the symbol table etc,
means lots more things to be duplicated.

One thing you might want to do is use gdb to set a breakpoint on
Perl_sv_dup with a very large ignore count, then run the script to
completion and see how many times it's been called. Then see if that's
proportional to the thread creation time.

> The method for finding a test case in your own environment is:
>
>   1. Start with a large script that loads modules and defines lots
>      of subroutines etc but doesn't do anything. The effect seems
>      to be more pronounced and easier to create with a large script
>      which includes several other modules: I have 6200 lines in 250k
>      in 8 files in my current test case.

I'll try to find some time in the next few days to play with this on a RH9
box.

--
Technology is dominated by two types of people​: those who understand what
they do not manage, and those who manage what they do not understand.


p5pRT commented May 19, 2004

From mooring@acm.org

On Tue, May 18, 2004 at 12:34:38AM +0100, Jamie Lokier wrote:

> Dave Mitchell via RT wrote:
>
> Oh, surprise... I have noticed that changing to a different directory
> and back, and this case shows the effect again. In fact, it's
> sensitive to the exact contents of the environment (changing shells
> and changing directory altered the OLDPWD environment variable by a
> few characters).
>
> In other words, it's tough to find a reproducible test case, and it's
> sensitive to trivial environment variables among other things.

This smacks of an alignment issue. Try adding an environment
variable (of varying lengths) to see if the problem appears and
disappears. Something like:

    FOO="" <test>
    FOO="x" <test>
    FOO="xx" <test>
    <ad nauseam>

If it's an alignment problem, the slowdown will likely disappear and
reappear at regular intervals. The likely fix is at the libc/OS level,
whichever is responsible for setting up the initial environment for
the process. On Linux, I'm pretty sure it's in glibc.

--
Ed Mooring (mooring@​acm.org)


p5pRT commented May 19, 2004

From @jlokier

Dave Mitchell wrote​:

>> The call chains for the bulk of mprotect() calls look like:
>>
>> #0  0x420dbc60 in mprotect () from /lib/tls/libc.so.6
>> #1  0x420766db in sYSMALLOc () from /lib/tls/libc.so.6
>> #2  0x4207378d in malloc () from /lib/tls/libc.so.6
>> #3  0x40086fab in Perl_safesysmalloc (size=1) at util.c:69
>
> Unless this is an artifact of gdb running on an optimised build,
> the size=1 is very wrong there - it should always be 1008.

I suspect gdb too. This is a standard install with no debugging info.

>> The main complaint is about the huge variation in thread creation
>> time, sensitive to trivial changes in a Perl script.
>
> Are these truly trivial changes, or do they change the number of modules
> that are use'd or require'd?

Yes, they truly are. Sometimes a comment is enough. However, due to the
difficulty of producing a consistent test case (I didn't know until
recently to hard-code the environment), this is the current one. The
difference here is the name of a subroutine, which is never called in
the test, just defined:

$ diff -u fast.pl slow.pl

```diff
--- fast.pl     2004-05-19 12:58:53.000000000 +0100
+++ slow.pl     2004-05-19 12:59:03.000000000 +0100
@@ -1276,7 +1276,7 @@
 # from the local EBCDIC code page if there is one.  The caller is
 # responsible for doing this.
 
-sub http_server_100_continue(;$) {
+sub http_server_101_continue(;$) {
 }
 
 for (0..9) {
```

Testing fast.pl
new thread
new thread
new thread
new thread
new thread
new thread
new thread
new thread
new thread
new thread

real 0m0.816s
user 0m0.760s
sys 0m0.032s

Testing slow.pl
new thread
new thread
new thread
new thread
new thread
new thread
new thread
new thread
new thread
new thread

real 0m2.824s
user 0m2.748s
sys 0m0.052s

> The creation of a new thread mainly involves the recursive duplication
> of nearly every data structure in the Perl interpreter, so anything
> like a use which adds lots of new symbols to the symbol table etc,
> means lots more things to be duplicated.

Notice how in the above example, the only difference is in how a
single name is hashed - that's enough to trigger a big speed
difference. There should be zero difference in the amount of data
copied.

My first guess was that the recursive duplication was traversing some
data more times than necessary, depending on its layout in memory.
However (see below) that isn't confirmed by gdb counting Perl_sv_dup
calls :/

> One thing you might want to do is use gdb to set a breakpoint on
> Perl_sv_dup with a very large ignore count, then run the script to
> completion and see how many times it's been called. Then see if that's
> proportional to the thread creation time.

That's a great idea.

Testing fast.pl:
1 breakpoint keep y 0x400b54c0 in Perl_sv_dup at sv.c:9390
  breakpoint already hit 393104 times
Testing slow.pl:
1 breakpoint keep y 0x400b54c0 in Perl_sv_dup at sv.c:9390
  breakpoint already hit 393214 times

There's a difference (why?) but it's small.
It's not proportional to thread creation time.

(Btw, duplicating it under gdb is tricky because gdb calls a shell to
run the Perl command, and that disturbs the environment. I fixed that
by asking gdb to run /usr/bin/env with appropriate arguments - if you
do the same, remember to check that the thread creation time really is
different between the two scripts).

Something interesting from gdb, though​: Each time I interrupt slow.pl,
it's in Perl_ptr_table_fetch. In fast.pl it's in many different
functions.

That suggests Perl_ptr_table_fetch is taking much more time in
slow.pl. Let's check if it's called the same number of times:

Testing fast.pl:
1 breakpoint keep y 0x400b4fa8 in Perl_ptr_table_fetch at sv.c:9168
  breakpoint already hit 319827 times
Testing slow.pl:
1 breakpoint keep y 0x400b4fa8 in Perl_ptr_table_fetch at sv.c:9168
  breakpoint already hit 319907 times

Ok, that confirms Perl_ptr_table_fetch is running slower in slow.pl,
rather than being called many more times.

That looks a likely culprit​: the hash function in Perl_ptr_table_fetch
and Perl_ptr_table_store simply uses the low-order bits of the SV
address with no scrambling.

It would be sensitive to the memory layout. (But so sensitive as to
slow down the program enormously due to minor program changes? And
why would changing a single name in the program, which isn't referenced
anywhere, change the amount of memory allocated?)

Fixing that hash function looks likely to fix the problem. Let's
check: I've put a breakpoint on the loop inside Perl_ptr_table_fetch.
The code looks like:

```c
UV hash = PTR2UV(sv);
tblent = tbl->tbl_ary[hash & tbl->tbl_max];
for (; tblent; tblent = tblent->next) {
    /* [breakpoint set here] */
    if (tblent->oldval == sv)
        return tblent->newval;
}
```

(Actually that's the 5.8.3 source for illustration, and the breakpoint
is being set on Red Hat 9's Perl 5.8.0-88.3 without source.)

Testing fast.pl up to the first "new thread" write():
2 breakpoint keep y 0x400b4fc0 in Perl_ptr_table_fetch at sv.c:9171
  breakpoint already hit 103876 times
3 breakpoint keep y 0x401943d0 <write>
  breakpoint already hit 1 time

Testing fast.pl up to the second "new thread" write():
2 breakpoint keep y 0x400b4fc0 in Perl_ptr_table_fetch at sv.c:9171
  breakpoint already hit 207747 times
3 breakpoint keep y 0x401943d0 <write>
  breakpoint already hit 2 times

Testing slow.pl up to the first "new thread" write():
1 breakpoint keep y 0x400b4fc0 in Perl_ptr_table_fetch at sv.c:9171
  breakpoint already hit 2204712 times
2 breakpoint keep y 0x401943d0 <write>
  breakpoint already hit 1 time

Testing slow.pl up to the second "new thread" write():
1 breakpoint keep y 0x400b4fc0 in Perl_ptr_table_fetch at sv.c:9171
  breakpoint already hit 4409517 times
2 breakpoint keep y 0x401943d0 <write>
  breakpoint already hit 2 times

We found it: Perl_ptr_table_fetch is running very slowly, most likely
due to a poor hash function. The hash chains are approx. 22 times
longer on average for the slow script.

It's not clear to me why a trivial change to the name of a subroutine,
without changing its length, and without the name being referenced
anywhere, would change the number of memory allocations or the
addresses of anything, but evidently it does.

-- Jamie


p5pRT commented May 19, 2004

From @iabyn

On Wed, May 19, 2004 at 03:42:25PM +0100, Jamie Lokier wrote:

> Something interesting from gdb, though: Each time I interrupt slow.pl,
> it's in Perl_ptr_table_fetch. In fast.pl it's in many different
> functions.

Great catch!!! Thanks for spotting this vital clue :-)

It turns out that the code in ptr_table_store(), which stores old-to-new
pointer mappings during cloning, has a fatal logic flaw.
Basically it maintains an array of chains, and inserts the new entry into
the chain indexed by a hash value. If the chain is non-empty and the
total number of elements is greater than the array size, then the array
is supposed to be doubled in size. *Except* that the code checked for
an empty chain rather than a non-empty chain. So sometimes the array
would never grow, and the chains would just get longer.

I managed to get a reproducible case that went from 2 secs to 23 secs
to create 10 threads, just by adding an extra space to a comment!

The change below to bleedperl fixes the immediate problem, although
I plan to look into this a bit further as the hashing algorithm is highly
sub-optimal.

Dave.

--
In the 70's we wore flares because we didn't know any better.
What possible excuse does the current generation have?

Change 22830 by davem@davem-percy on 2004/05/19 20:17:55

    [perl #29637] Thread creation time is hypersensitive

    Due to a logic error, the dup ptr table sometimes wasn't being
    grown, leading to extremely slow cloning.

Affected files ...

... //depot/perl/sv.c#742 edit

Differences ...

==== //depot/perl/sv.c#742 (text) ====

```diff
@@ -10400,11 +10400,11 @@
      * hash values e.g. if they grow faster in the most significant
      * bits */
     UV hash = PTR2UV(oldv);
-    bool i = 1;
+    bool empty = 1;
 
     assert(tbl);
     otblent = &tbl->tbl_ary[hash & tbl->tbl_max];
-    for (tblent = *otblent; tblent; i=0, tblent = tblent->next) {
+    for (tblent = *otblent; tblent; empty=0, tblent = tblent->next) {
         if (tblent->oldval == oldv) {
             tblent->newval = newv;
             return;
@@ -10416,7 +10416,7 @@
     tblent->next = *otblent;
     *otblent = tblent;
     tbl->tbl_items++;
-    if (i && tbl->tbl_items > tbl->tbl_max)
+    if (!empty && tbl->tbl_items > tbl->tbl_max)
         ptr_table_split(tbl);
 }
```


p5pRT commented May 19, 2004

From @iabyn

The following patch to bleed (slightly) improves the hashing algorithm
used in cloning the pointer tables, which is currently just
(ptr & table_size_mask). Since the bottom few bits of a pointer are
usually zero, I now shift the pointer before hashing. This makes thread
cloning about 10% faster on my machine.

Dave.

--
You live and learn (although usually you just live).

Change 22831 by davem@davem-percy on 2004/05/19 21:02:04

    improve hashing algorithm for ptr tables in perl_clone:
    the bottom few bits of a pointer are usually zero

Affected files ...

... //depot/perl/sv.c#743 edit

Differences ...

==== //depot/perl/sv.c#743 (text) ====

```diff
@@ -10374,13 +10374,19 @@
     return tbl;
 }
 
+#if (PTRSIZE == 8)
+#  define PTR_TABLE_HASH(ptr) (PTR2UV(ptr) >> 3)
+#else
+#  define PTR_TABLE_HASH(ptr) (PTR2UV(ptr) >> 2)
+#endif
+
 /* map an existing pointer using a table */
 
 void *
 Perl_ptr_table_fetch(pTHX_ PTR_TBL_t *tbl, void *sv)
 {
     PTR_TBL_ENT_t *tblent;
-    UV hash = PTR2UV(sv);
+    UV hash = PTR_TABLE_HASH(sv);
     assert(tbl);
     tblent = tbl->tbl_ary[hash & tbl->tbl_max];
     for (; tblent; tblent = tblent->next) {
@@ -10399,7 +10405,7 @@
     /* XXX this may be pessimal on platforms where pointers aren't good
      * hash values e.g. if they grow faster in the most significant
      * bits */
-    UV hash = PTR2UV(oldv);
+    UV hash = PTR_TABLE_HASH(oldv);
     bool empty = 1;
 
     assert(tbl);
@@ -10440,7 +10446,7 @@
         continue;
     curentp = ary + oldsize;
     for (entp = ary, ent = *ary; ent; ent = *entp) {
-        if ((newsize & PTR2UV(ent->oldval)) != i) {
+        if ((newsize & PTR_TABLE_HASH(ent->oldval)) != i) {
             *entp = ent->next;
             ent->next = *curentp;
             *curentp = ent;
```


p5pRT commented May 19, 2004

From @jlokier

Dave Mitchell wrote​:

> The following patch to bleed (slightly) improves the hashing algorithm
> used in cloning the pointer tables, which is currently just
> (ptr & table_size_mask). Since the bottom few bits of a pointer are
> usually zero, I now shift the pointer before hashing. This makes thread
> cloning about 10% faster on my machine.

There are good hashes for power-of-two sized tables. The approximate
golden-ratio multiplicative hashes, and the Robert Jenkins and Thomas
Wang word hashes, come to mind. Doesn't Perl have a well-tested
integer hash somewhere already?

My runs with gdb in the "good" case (fast.pl) indicate an average of
approximately 3 search iterations per non-empty bucket. That suggests
an average non-empty bucket size of about 5, making wild assumptions
about the distribution. Considering that the table is supposed to
resize when the number of items is approximately equal to the array
size, that does not seem like a good distribution.

-- Jamie


p5pRT commented May 20, 2004

From @jlokier

Dave Mitchell wrote​:

> the code checked for an empty chain rather than a non-empty
> chain. So sometimes the array would never grow, and the chains would

I can confirm that your change appears to fix the problem.

I recompiled the exact Perl source package that Red Hat 9 uses, but
there were slight binary differences from the RH9 binary package, and
the recompiled one didn't exhibit the problem with my test case.

So I patched the binary. On Red Hat 9, with perl-5.8.0-88.3, changing
the byte in
/usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE/libperl.so at offset
647295 (0x9e07f) from 0 to 1 is equivalent to your patch.

This removed the time variation in creating new threads - now both
scripts are about the same speed as the fast one was before.

The real program I was having trouble with - a fairly complex
multi-threaded HTTP/1.1 server - now starts much faster. (For what
it's worth, I am still avoiding IO::Socket::INET as it still slows down
thread creation noticeably, although it is only just noticeable now.
Roll on copy-on-write. :)

I will continue to use the binary-patched version of Perl on my
development box, to see if the problem is consistently gone now and
doesn't have any other causes. Assuming that to be the case, it would
be a good fix to roll into stable Perl distributions, IMHO.

Thanks for noticing the real bug.

-- Jamie


p5pRT commented May 20, 2004

From @JohnPeacock

Jamie Lokier wrote:

I recompiled the exact Perl source package that Red Hat 9 uses, but
there were slight binary differences from the RH9 binary package, and
the recompiled one didn't exhibit the problem with my test case.

RedHat (and many other distros) make custom patches to their Perl source;
historically they often don't even send them back to us to apply to the core.

What I would recommend doing (instead of binary hacking) is to obtain the RedHat
source RPM, expand it out, apply the patch, then rebuild the binary RPM. No, I
can't help you with that part. ;~)

John

--
John Peacock
Director of Information Research and Technology
Rowman & Littlefield Publishing Group
4501 Forbes Boulevard
Suite H
Lanham, MD 20706
301-459-3366 x.5010
fax 301-429-5748


p5pRT commented May 20, 2004

From @jlokier

John Peacock wrote:

I recompiled the exact Perl source package that Red Hat 9 uses

RedHat (and many other distros) make custom patches to their Perl source;
historically they often don't even send them back to us to apply to the
core.

What I would recommend doing (instead of binary hacking) is to obtain the
RedHat source RPM, expand it out, apply the patch, then rebuild the binary
RPM. No, I can't help you with that part. ;~)

That's what I did. Hence "source package". I built from the source RPM.
The binary differences were very small, but enough to prevent the effect
from appearing with my test case. (It's very sensitive.)

Patching the existing binary did the trick. I'm sure the bug is
fixed. But to be sure that it hasn't just hidden the effect, in the
unlikely event that it has a different cause, I'm using the patched
binary while I continue working on the multi-threaded program. If it
has a different (another!) cause then it will show up again sooner or
later.

-- Jamie


p5pRT commented May 24, 2004

From @nwc10

On Thu, May 20, 2004 at 09:47:00PM +0100, Jamie Lokier wrote:

The real program I was having trouble with - a fairly complex
multi-threaded HTTP/1.1 server - now starts much faster. (For what
it's worth I am still avoiding IO::Socket::INET as it still slows down
thread creation noticeably, although it is only just noticeable now.
Roll on copy-on-write :).

Mmm. Yes. I think that I have a solution to Arthur's immediate problem with
the iCow (that &PL_sv_undef is used everywhere in source, and isn't an integer)
but it's all going to need a 5.10 (as well as those round tuity things
that always seem to be in such short supply).

I will continue to use the binary-patched version of Perl on my
development box, to see if the problem is consistently gone now and
doesn't have any other causes. Assuming that to be the case, it would
be a good fix to roll into stable Perl distributions, IMHO.

Yes, that's my opinion too :-)

Thanks for noticing the real bug.

Thanks for your help in tracking this down.

Nicholas Clark


p5pRT commented May 24, 2004

From @nwc10

On Wed, May 19, 2004 at 11:23:19PM +0100, Jamie Lokier wrote:

Dave Mitchell wrote:

The following patch to bleed (slightly) improves the hashing algorithm
used in cloning the pointer tables, which is currently just
(ptr & table_size_mask). Since the bottom few bits of a pointer are
usually zero, I now shift the pointer before hashing. This makes thread
cloning about 10% faster on my machine.

There are good hashes for power-of-two size tables. The approximate
golden-ratio multiplicative hashes, and the Robert Jenkins and Thomas Wang
word hashes, come to mind. Doesn't Perl have a well-tested integer
hash somewhere already?

Yes, but it's hashing byte by byte on variable-length character strings.
What's needed here is a hash that is fast and good for 32 (or 64 etc) bit
quantities, which can be fed into the algorithm as a single integer.

So yes, we would benefit from looking to find an external, good, algorithm
that suits these constraints.

My runs with gdb in the "good" case (fast.pl) indicate an average
number of search iterations per non-empty bucket of approximately 3.
That suggests an average non-empty bucket size of about 5, making wild
assumptions about the distribution. Considering that the table is
supposed to resize when the number of items is approximately equal to
the array size, that does not seem like a good distribution.

I agree. But I'm not well read on these things, so I don't know what
approach we should take to reduce this. (apart from "become well read")

Nicholas Clark


p5pRT commented May 30, 2004

@iabyn - Status changed from 'open' to 'resolved'
