Slow global pattern match on specific large utf8 input #13606

p5pRT · 2014-02-16T00:14:03Z

Migrated from rt.perl.org#121259 (status was 'resolved')

Searchable as RT121259$

p5pRT · 2014-02-16T00:14:03Z

From @hknutzen

This is a bug report for perl from heinz.knutzen@gmail.com,
generated with the help of perlbug 1.40 running under perl 5.19.9.

I still test my application with Perl 5.19.x to find out why it runs
slower than with previous stable versions.
My application parses large files with a recursive descent parser which
in turn uses global pattern match.

By accident I created one specific input file where perl-5.19.x runs
about 2000 times slower than perl-5.18.2.
I stripped down my parser to this simple program:

my $file = shift;
open(my $fh, '<', $file) or die "Can't open $file: $!";
local $/;
my $input = <$fh>;
close $fh;

my $line = 1;
while (1) {
last if $input =~ m/\G$/gc;
$input =~ m/\G[ \t]*/gc;
$input =~ m/\G\S+/gc;
$input =~ m/\G\n/gc and print STDERR $line++, ' ';
}

The input file has 43092 lines. I tried to find a shorter one. But if I
deleted any lines, the effect of slowness disappeared. So I anonymized
the file. I simply substituted every [a-z] by [a] and [0-9] by [1]. The
file contains some unicode characters (umlauts). I had to leave them
unchanged, otherwise the slowness disappeared.

Find the input file at
https://drive.google.com/file/d/0B9mPoYCd7pxXMXRNRmlaODVPQ1E/edit?usp=sharing

The program prints the number of each parsed line.
With perl-5.18.2 the runtime is about 0.5 seconds (printing 43092 line
numbers).
With perl-blead it prints only about 50 line numbers per second.
When testing with perl-5.19.5 and perl-5.19.6, I get the same bad
performance.

Note:
I had to run "perlbrew --force install blead" because "run/locale.t" failed.
I have already reported this an hour ago.

Flags:
category=core
severity=low

Site configuration information for perl 5.19.9:

Configured by hk at Sat Feb 15 23:59:02 CET 2014.

Summary of my perl5 (revision 5 version 19 subversion 9) configuration:
Snapshot of: 1122035
Platform:
osname=linux, osvers=3.8.0-36-generic, archname=i686-linux
uname='linux buero 3.8.0-36-generic #52~precise1-ubuntu smp mon feb
3 21:56:56 utc 2014 i686 athlon i386 gnulinux '
config_args='-de -Dprefix=/home/hk/perl5/perlbrew/perls/perl-blead
-Dusedevel'
hint=recommended, useposix=true, d_sigaction=define
useithreads=undef, usemultiplicity=undef
use64bitint=undef, use64bitall=undef, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector
-I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-O2',
cppflags='-fno-strict-aliasing -pipe -fstack-protector
-I/usr/local/include'
ccversion='', gccversion='4.6.3', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
libpth=/usr/local/lib /usr/lib/gcc/i686-linux-gnu/4.6/include-fixed
/usr/include/i386-linux-gnu /usr/lib /lib/i386-linux-gnu /lib/../lib
/usr/lib/i386-linux-gnu /usr/lib/../lib /lib
libs=-lnsl -ldl -lm -lcrypt -lutil -lc
perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
libc=, so=so, useshrplib=false, libperl=libperl.a
gnulibc_version='2.15'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib
-fstack-protector'

@INC for perl 5.19.9:
/home/hk/perl5/perlbrew/perls/perl-blead/lib/site_perl/5.19.9/i686-linux
/home/hk/perl5/perlbrew/perls/perl-blead/lib/site_perl/5.19.9
/home/hk/perl5/perlbrew/perls/perl-blead/lib/5.19.9/i686-linux
/home/hk/perl5/perlbrew/perls/perl-blead/lib/5.19.9
.

Environment for perl 5.19.9:
HOME=/home/hk
LANG=de_DE.UTF-8
LANGUAGE=de_DE:en
LD_LIBRARY_PATH (unset)
LOGDIR (unset)

PATH=/home/hk/perl5/perlbrew/bin:/home/hk/perl5/perlbrew/perls/perl-blead/bin:/home/hk/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PERLBREW_BASHRC_VERSION=0.43
PERLBREW_HOME=/home/hk/.perlbrew
PERLBREW_MANPATH=/home/hk/perl5/perlbrew/perls/perl-blead/man

PERLBREW_PATH=/home/hk/perl5/perlbrew/bin:/home/hk/perl5/perlbrew/perls/perl-blead/bin
PERLBREW_PERL=perl-blead
PERLBREW_ROOT=/home/hk/perl5/perlbrew
PERLBREW_VERSION=0.43
PERL_BADLANG (unset)
SHELL=/bin/bash

p5pRT · 2014-02-16T14:33:01Z

From @jkeenan

On Sat Feb 15 16:14:03 2014, heinz.knutzen@gmail.com wrote:

This is a bug report for perl from heinz.knutzen@gmail.com,
generated with the help of perlbug 1.40 running under perl 5.19.9.

-----------------------------------------------------------------
I still test my application with Perl 5.19.x to find out why it runs
slower than with previous stable versions.
My application parses large files with a recursive descent parser
which
in turn uses global pattern match.

By accident I created one specific input file where perl-5.19.x runs
about 2000 times slower than perl-5.18.2.
I stripped down my parser to this simple program:

my $file = shift;
open(my $fh, '<', $file) or die "Can't open $file: $!";
local $/;
my $input = <$fh>;
close $fh;

my $line = 1;
while (1) {
last if $input =~ m/\G$/gc;
$input =~ m/\G[ \t]*/gc;
$input =~ m/\G\S+/gc;
$input =~ m/\G\n/gc and print STDERR $line++, ' ';
}

The input file has 43092 lines. I tried to find a shorter one. But if
I
deleted any lines, the effect of slowness disappeared. So I anonymized
the file. I simply substituted every [a-z] by [a] and [0-9] by [1].
The
file contains some unicode characters (umlauts). I had to leave them
unchanged, otherwise the slowness disappeared.

Find the input file at
https://drive.google.com/file/d/0B9mPoYCd7pxXMXRNRmlaODVPQ1E/edit?usp=sharing

The program prints the number of each parsed line.
With perl-5.18.2 the runtime is about 0.5 seconds (printing 43092 line
numbers).
With perl-blead it prints only about 50 line numbers per second.
When testing with perl-5.19.5 and perl-5.19.6, I get the same bad
performance.

Though I can't figure out what the poster's program is supposed to do other than report pattern-matching performance, I can confirm his results. On the Dromedary server, I got this:

#####
system perl (5.12.3)

real 0m0.382s
user 0m0.293s
sys 0m0.086s
#####
blead perl
This is perl 5, version 19, subversion 9 (v5.19.9 (v5.19.8-477-g4a2e143)) built for x86_64-linux

real 3m17.745s
user 3m17.011s
sys 0m0.136s
#####

Thank you very much.
Jim Keenan

p5pRT · 2014-02-16T14:33:01Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2014-02-16T23:33:39Z

From @tonycoz

On Sat Feb 15 16:14:03 2014, heinz.knutzen@gmail.com wrote:

I still test my application with Perl 5.19.x to find out why it runs
slower than with previous stable versions.
My application parses large files with a recursive descent parser
which
in turn uses global pattern match.

Just a note here: while the source text might be utf-8, perl isn't processing it as a SvUTF8() string since there's no layers supplied to open() to do the conversion.

Tony

p5pRT · 2014-02-17T00:02:15Z

From @tonycoz

On Sat Feb 15 16:14:03 2014, heinz.knutzen@gmail.com wrote:

I still test my application with Perl 5.19.x to find out why it runs
slower than with previous stable versions.
My application parses large files with a recursive descent parser
which
in turn uses global pattern match.

Bisected with:

perl ../bisect.pl --start=v5.18.0 --timeout=10 -- ./perl -Ilib ../slow-utf8.pl ../slow-utf8-data.txt

to:

commit 13b0f67
Author: David Mitchell <davem@iabyn.com>
Date: Wed May 22 16:38:29 2013 +0100

re-enable Copy-on-Write by default.

COW was first introduced (and enabled by default) in 5.17.7.
It was disabled by default in 5.17.10, because it was though to have too
many rough edges for the 5.18.0 release.

By re-enabling it now, early in the 5.19.x release cycle, hopefully it
will be ready for production use by 5.20.

This commit mainly reverts 9f351b4 and e1fd413 (with modifications),
then updates perldelta.

Tony

p5pRT · 2014-02-17T04:44:08Z

From @hknutzen

I simplified the test case a lot.
- utf8 isn't needed at all
- only the file size matters
- slowdown is visible if file size is 256*x+255

Size must be large enough to see a slowdown.
A reasonable value is 51455=200*256+255.
Use shell commands "yes" and "head" to create the file.

yes a|head -c 51455 > gen.txt

Apply this simplified Perl program to the data file:

my $file = shift;
open(my $fh, '<', $file) or die "Can't open $file: $!";
local $/;
my $input = <$fh>;
close $fh;

while (1) {
last if $input =~ m/\G$/gc;
$input =~ m/\G[ \t]*/gc;
$input =~ m/\G\S+/gc;
$input =~ m/\G\n/gc;
}

This is perl 5, version 18, subversion 2 (v5.18.2) built for i686-linux
real 0m0.094s
user 0m0.088s
sys 0m0.008s

This is perl 5, version 19, subversion 9 (v5.19.9
(v5.19.8-467-g1122035)) built for i686-linux
real 0m1.400s
user 0m1.392s
sys 0m0.008s

p5pRT · 2014-02-17T06:14:34Z

From perl5-porters@perl.org

Heinz Knutzen wrote:

I simplified the test case a lot.
- utf8 isn't needed at all
- only the file size matters
- slowdown is visible if file size is 256*x+255

Size must be large enough to see a slowdown.
A reasonable value is 51455=200*256+255.

I see no slowdown on Darwin, where the allocated buffer size is 51495.

On Linux, I do see a slowdown. The buffer size is 51456, which does
not leave room for a copy-on-write reference count, so copy-on-write
does not kick in. That would explain the slowness.

Now, why is the buffer that small? I thought we were allocating an
extra byte for the sake of copy-on-write.

p5pRT · 2014-02-17T06:30:14Z

From perl5-porters@perl.org

I wrote:

Now, why is the buffer that small? I thought we were allocating an
extra byte for the sake of copy-on-write.

The extra byte is not added if we happen to have a nice large power
of two. Since sv_gets is allocating the string here (the assignment
is optimised away and the input is read straight into $input), we can
simply add a byte there, since we know we do not need a power of two.

Please try the sprout/121259 branch. It seems to work for me. (I
still need to write a real commit message and comment the code.)

p5pRT · 2014-02-17T14:01:03Z

From @jkeenan

On Sun Feb 16 22:30:14 2014, perl5-porters@perl.org wrote:
[snip]

Please try the sprout/121259 branch. It seems to work for me. (I
still need to write a real commit message and comment the code.)

sprout/121259 branch on the Dromedary server:

real 0m0.348s
user 0m0.247s
sys 0m0.098s

But your patch introduced an additional hard-tab into the code:

#####
^Iif (!PerlLIO_fstat(PerlIO_fileno(fp), &st) && S_ISREG(st.st_mode)) {$
^I const Off_t offset = PerlIO_tell(fp);$
^I if (offset != (Off_t) -1 && st.st_size + append > offset) {$
^I ^I(void) SvGROW(sv, (STRLEN)((st.st_size - offset) + append + 1$
#ifdef PERL_NEW_COPY_ON_WRITE$
#####

Can you replace the ^I just before (void)?

Otherwise, assuming other C-knowledgeable people don't object, +1.

Thank you very much.
Jim Keenan

p5pRT · 2014-02-20T22:36:44Z

From @tonycoz

On Sun Feb 16 22:30:14 2014, perl5-porters@perl.org wrote:

I wrote:

Now, why is the buffer that small? I thought we were allocating an
extra byte for the sake of copy-on-write.

The extra byte is not added if we happen to have a nice large power
of two. Since sv_gets is allocating the string here (the assignment
is optimised away and the input is read straight into $input), we can
simply add a byte there, since we know we do not need a power of two.

Please try the sprout/121259 branch. It seems to work for me. (I
still need to write a real commit message and comment the code.)

This fixed the performance for me, from:

real 3m58.732s
user 3m56.659s
sys 0m0.796s

to:

real 0m0.590s
user 0m0.236s
sys 0m0.176s

Tony

p5pRT · 2014-02-22T18:57:36Z

From @cpansprout

Fixed in dca36a0.

--

Father Chrysostomos

p5pRT · 2014-02-22T18:57:37Z

@cpansprout - Status changed from 'open' to 'resolved'

p5pRT closed this as completed Feb 22, 2014

p5pRT added the Severity Low label Oct 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slow global pattern match on specific large utf8 input #13606

Slow global pattern match on specific large utf8 input #13606

p5pRT commented Feb 16, 2014

p5pRT commented Feb 16, 2014

p5pRT commented Feb 16, 2014

p5pRT commented Feb 16, 2014

p5pRT commented Feb 16, 2014

p5pRT commented Feb 17, 2014

p5pRT commented Feb 17, 2014

p5pRT commented Feb 17, 2014

p5pRT commented Feb 17, 2014

p5pRT commented Feb 17, 2014

p5pRT commented Feb 20, 2014

p5pRT commented Feb 22, 2014

p5pRT commented Feb 22, 2014

Slow global pattern match on specific large utf8 input #13606

Slow global pattern match on specific large utf8 input #13606

Comments

p5pRT commented Feb 16, 2014

p5pRT commented Feb 16, 2014

From @hknutzen

p5pRT commented Feb 16, 2014

From @jkeenan

p5pRT commented Feb 16, 2014

p5pRT commented Feb 16, 2014

From @tonycoz

p5pRT commented Feb 17, 2014

From @tonycoz

p5pRT commented Feb 17, 2014

From @hknutzen

p5pRT commented Feb 17, 2014

From perl5-porters@perl.org

p5pRT commented Feb 17, 2014

From perl5-porters@perl.org

p5pRT commented Feb 17, 2014

From @jkeenan

p5pRT commented Feb 20, 2014

From @tonycoz

p5pRT commented Feb 22, 2014

From @cpansprout

p5pRT commented Feb 22, 2014