Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regex end of line match very slow #14539

Open
p5pRT opened this issue Feb 24, 2015 · 14 comments
Open

regex end of line match very slow #14539

p5pRT opened this issue Feb 24, 2015 · 14 comments

Comments

@p5pRT
Copy link

p5pRT commented Feb 24, 2015

Migrated from rt.perl.org#123918 (status was 'open')

Searchable as RT123918$

@p5pRT
Copy link
Author

p5pRT commented Feb 24, 2015

From @prsalazar

Created by @prsalazar

Reply-To​: paul.salazar@​testspectrum.com
From​: paul.salazar@​testspectrum.com
To​: perlbug@​perl.org
Subject​: regex end of line match very slow
Message-Id​: <5.20.1_6288_1424805345@​badger>

This is a bug report for perl from paul.salazar@​testspectrum.com,
generated with the help of perlbug 1.40 running under perl 5.20.1.

-----------------------------------------------------------------
Running the following regex matching code is very slow when matching
long strings for a positive match, but very fast for a negative match.

use Time​::HiRes qw(gettimeofday);

my $iters = 100000;
my $x; my $strt = 0;

my $strng = ''; my $dat = '123';
my $reg = "\\s\$";

print "Running match $reg on $iters iterations on concatenated string...\n";

$strt = gettimeofday();
for ($x = 0; $x < $iters; $x++) {
  $strng .= ($strng =~ m/$reg/o ? ' ' : '') . $dat . ' ';
}
print "Time​: ".(gettimeofday() - $strt)."\n";

$strng = ''; $dat = '123';
$reg = "\\S\$";

print "\nRunning match $reg on $iters iterations on concatenated string...\n";

$strt = gettimeofday();
for ($x = 0; $x < $iters; $x++) {
  $strng .= ($strng =~ m/$reg/o ? ' ' : '') . $dat . ' ';
}
print "Time​: ".(gettimeofday() - $strt)."\n";

The output is as follows​:

Running match \s$ on 100000 iterations on concatenated string...
Time​: 6.53281188011169

Running match \S$ on 100000 iterations on concatenated string...
Time​: 0.038999080657959

This problem does not exist in perl 5.12. I'm not sure when it was
introduced into core library. I discovered it when an app using
PDF​::API2 starting going really slow after upgrading to 5.20 strawberry.

Perl Info
---
Flags:
    category=core
    severity=high
---
Site configuration information for perl 5.20.1:

Configured by strawberry-perl at Mon Sep 15 18:27:03 2014.

Summary of my perl5 (revision 5 version 20 subversion 1) configuration:
   
  Platform:
    osname=MSWin32, osvers=6.3, archname=MSWin32-x86-multi-thread-64int
    uname='Win32 strawberry-perl 5.20.1.1 #1 Mon Sep 15 18:25:23 2014 i386'
    config_args='undef'
    hint=recommended, useposix=true, d_sigaction=undef
    useithreads=define, usemultiplicity=define
    use64bitint=define, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags =' -s -O2 -DWIN32  -DPERL_TEXTMODE_SCRIPTS -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -fwrapv -fno-strict-aliasing -mms-bitfields',
    optimize='-s -O2',
    cppflags='-DWIN32'
    ccversion='', gccversion='4.8.3', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='long long', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='g++', ldflags ='-s -L"C:\Perl32\perl\lib\CORE" -L"C:\Perl32\c\lib"'
    libpth=C:\Perl32\c\lib C:\Perl32\c\i686-w64-mingw32\lib C:\Perl32\c\lib\gcc\i686-w64-mingw32\4.8.3
    libs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
    perllibs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
    libc=, so=dll, useshrplib=true, libperl=libperl520.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs, dlext=xs.dll, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-mdll -s -L"C:\Perl32\perl\lib\CORE" -L"C:\Perl32\c\lib"'


---
@INC for perl 5.20.1:
    C:/Perl32/perl/site/lib
    C:/Perl32/perl/vendor/lib
    C:/Perl32/perl/lib
    .

---
Environment for perl 5.20.1:
    HOME (unset)
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=C:\Program Files\ActiveState Perl Dev Kit 9.4\bin\;C:\Program Files (x86)\ActiveState Perl Dev Kit 9.0.1\bin\;C:\Perl\gtk\bin;C:\Perl\site\bin;C:\Perl\bin;C:\Program Files (x86)\ActiveState Perl Dev Kit 9.1.1\bin\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\WIDCOMM\Bluetooth Software\;C:\Program Files\WIDCOMM\Bluetooth Software\syswow64;C:\Program Files (x86)\Common Files\Roxio Shared\DLLShared\;C:\Program Files (x86)\Common Files\Roxio Shared\OEM\DLLShared\;C:\Program Files (x86)\Common Files\Roxio Shared\OEM\DLLShared\;C:\Program Files (x86)\Common Files\Roxio Shared\OEM\12.0\DLLShared\;C:\Program Files (x86)\Roxio\OEM\AudioCore\;C:\Program Files (x86)\NTRU Cryptosystems\NTRU TCG Software Stack\bin\;C:\Program Files\NTRU Cryptosystems\NTRU TCG Software Stack\bin\;C:\Program Files\Dell\Dell Data Protection\Access\Advanced\Wave\Gemalto\Access Client\v5\;C:\Program Files (x86)\cvsnt;C:\Program Files (x86)\jZip;C:\D\dmd2\windows\bin;C:\D\dm\bin;C:\Program Files (x86)\OpenVPN\bin;C:\TDM-GCC-64\bin;C:\Perl32\c\bin;C:\Perl32\perl\site\bin;C:\Perl32\perl\bin;C:\Perl64\c\bin;C:\Perl64\perl\site\bin;C:\Perl64\perl\bin;C:\Program Files (x86)\CVSNT\;C:\Program Files (x86)\IDM Computer Solutions\UltraEdit\;C:\Program Files (x86)\IDM Computer Solutions\UltraCompare\;C:\Program Files (x86)\OpenVPN\bin;C:\Gtk+\bin
    PERL_BADLANG (unset)
    SHELL (unset)

@p5pRT
Copy link
Author

p5pRT commented Feb 24, 2015

From @prsalazar

paul_salazar.vcf

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2015

From @iabyn

On Tue, Feb 24, 2015 at 11​:26​:31AM -0800, Paul Salazar wrote​:

Running the following regex matching code is very slow when matching
long strings for a positive match, but very fast for a negative match.

use Time​::HiRes qw(gettimeofday);

my $iters = 100000;
my $x; my $strt = 0;

my $strng = ''; my $dat = '123';
my $reg = "\\s\$";

print "Running match $reg on $iters iterations on concatenated string...\n";

$strt = gettimeofday();
for ($x = 0; $x < $iters; $x++) {
$strng .= ($strng =~ m/$reg/o ? ' ' : '') . $dat . ' ';
}
print "Time​: ".(gettimeofday() - $strt)."\n";

$strng = ''; $dat = '123';
$reg = "\\S\$";

print "\nRunning match $reg on $iters iterations on concatenated string...\n";

$strt = gettimeofday();
for ($x = 0; $x < $iters; $x++) {
$strng .= ($strng =~ m/$reg/o ? ' ' : '') . $dat . ' ';
}
print "Time​: ".(gettimeofday() - $strt)."\n";

The output is as follows​:

Running match \s$ on 100000 iterations on concatenated string...
Time​: 6.53281188011169

Running match \S$ on 100000 iterations on concatenated string...
Time​: 0.038999080657959

This problem does not exist in perl 5.12. I'm not sure when it was
introduced into core library. I discovered it when an app using
PDF​::API2 starting going really slow after upgrading to 5.20 strawberry.

It's a side-effect of copy-on-write (COW) strings, which first appeared in
5.20.0. Traditionally in the presence of captures or $`/$&amp;/$', a
successful match would make a copy of the just-matched string. Without
their presence, the copy would be skipped, which unfortunately made things
like

  $s =~ /abc/;
  .... modify $s ...;
  $x = eval '$&amp;';

leave $x containing garbage.

With COW, a successfully matched string is marked as COW, with the regex
holding a second pointer to the string. If the string is subsequently
modified, the string is copied (hence the C in COW). Given that the
string is being repeatedly extended in small increments, physically coping
the whole string each time round the loop ends up going quadratic.

Conversely in the days before COW, something like

  $s = 'x' x 1_000_000;
  $&;
  1 while $s =~ /./g;

would go quadratic on length.

So pre-COW​:

  the regex engine *always* copied after a successful match in the presence
  of $& etc, going quadratic always, regardless of whether the string is
  subsequently modified or not; (it actually didn't copy in the just
  presence of captures, and just returned garbage in $1 etc if the
  string was modified; this was a tradeoff of performance over
  correctness).

Post-COW​:

  on a successful match the string is marked as COW, but no physical
  copy is done. $1, $&amp; etc will always be correct regardless of whether
  the string is subsequently modified, and regardless of whether $& had
  been seen prior to the regex being executed. However, if the string
  is subsequently modified, a full copy is done.

I can't yet thing of any elegant solution for this.

Some theoretical workarounds that don't actually work are​:

* 5.21.8 and later include the /n modifier, which doesn't set $1 etc,
but it appears to be buggy in that it seems to still make the string COW.

* Perform another successful match on a different string after the first
match but before modifying the string, e.g.

  $long_string =~ /.../;
  "a" =~ /./;
  modify $long_string;

In theory that changes the "current regex" (PL_curpm) which is used
to calculate the values of $1 etc on the fly. Unfortunately when it is
changed, the old regex still holds a pointer to the COWed string;
perhaps we should look into dropping a COW reference each time PL_curpm
is updated????

--
Dave's first rule of Opera​:
If something needs saying, say it​: don't warble it.

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2015

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2015

From @wolfsage

On Wed, Feb 25, 2015 at 9​:36 AM, Dave Mitchell <davem@​iabyn.com> wrote​:

* 5.21.8 and later include the /n modifier, which doesn't set $1 etc,
but it appears to be buggy in that it seems to still make the string COW.

Is this a bug of /n?

By design /n only prevents grouping parens from filling in $1, $2, etc...

Other things like $&, named captures, etc still work.

-- Matthew Horsfall (alh)

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2015

From @khwilliamson

On 02/25/2015 09​:07 AM, Matthew Horsfall (alh) wrote​:

On Wed, Feb 25, 2015 at 9​:36 AM, Dave Mitchell <davem@​iabyn.com> wrote​:

* 5.21.8 and later include the /n modifier, which doesn't set $1 etc,
but it appears to be buggy in that it seems to still make the string COW.

Is this a bug of /n?

By design /n only prevents grouping parens from filling in $1, $2, etc...

Other things like $&, named captures, etc still work.

-- Matthew Horsfall (alh)

And, FWIW I think this is the correct design

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2015

From @iabyn

On Wed, Feb 25, 2015 at 11​:07​:32AM -0500, Matthew Horsfall (alh) wrote​:

On Wed, Feb 25, 2015 at 9​:36 AM, Dave Mitchell <davem@​iabyn.com> wrote​:

* 5.21.8 and later include the /n modifier, which doesn't set $1 etc,
but it appears to be buggy in that it seems to still make the string COW.

Is this a bug of /n?

By design /n only prevents grouping parens from filling in $1, $2, etc...

Other things like $&, named captures, etc still work.

Ah, I misunderstood its purpose.

--
But Pity stayed his hand. "It's a pity I've run out of bullets",
he thought. -- "Bored of the Rings"

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2015

From @prsalazar

So pre-COW​:

 the regex engine \*always\* copied after a successful match in the presence
 of $& etc\, going quadratic always\, regardless of whether the string is
 subsequently modified or not; \(it actually didn't copy in the just
 presence of captures\, and just returned garbage in $1 etc if the
 string was modified; this was a tradeoff of performance over
 correctness\)\.

All makes sense except I'm not seeing the pre-COW going quadratic. Same
example ran on 5.8

  ~$ perl -v

This is perl, v5.8.8 built for msys-64int

Running match \s$ on 100000 iterations on concatenated string...
Time​: 0.0459511280059814

Running match \S$ on 100000 iterations on concatenated string...
Time​: 0.0440521240234375

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2015

From @wolfsage

On Wed, Feb 25, 2015 at 11​:32 AM, Paul Salazar
<paul.salazar@​testspectrum.com> wrote​:

So pre-COW​:

 the regex engine \*always\* copied after a successful match in the

presence
of $& etc, going quadratic always, regardless of whether the string
is
subsequently modified or not; (it actually didn't copy in the just
presence of captures, and just returned garbage in $1 etc if the
string was modified; this was a tradeoff of performance over
correctness).

All makes sense except I'm not seeing the pre-COW going quadratic. Same
example ran on 5.8

So, this was fine in 5.17.7 up until​:

1a904fc is the first bad commit
commit 1a904fc
Author​: Father Chrysostomos <sprout@​cpan.org>
Date​: Sun Nov 25 12​:57​:04 2012 -0800

  Disable PL_sawampersand

  PL_sawampersand actually causes bugs (e.g., perl #4289), because the
  behaviour changes. eval '$&' after a match will produce different
  results depending on whether $& was seen before the match.

  Using copy-on-write for the pre-match copy (preceding patches do that)
  alleviates the slowdown caused by mentioning $&. The copy doesn’t
  happen unless the string is modified after the match. It’s now a
  post- match copy. So we no longer need to do things differently
  depending on whether $& has been seen.

  PL_sawampersand is now #defined to be equal to what it would be if
  every program began with $',$&amp;,$`.

  I left the PL_sawampersand code in place, in case this commit proves
  immature. Running Configure with -Accflags=PERL_SAWAMPERSAND will
  reënable the PL_sawampersand mechanism.

Perhaps something else is going on here?

-- Matthew Horsfall (alh)

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2015

From @khwilliamson

On 02/25/2015 12​:55 PM, Matthew Horsfall (alh) wrote​:

On Wed, Feb 25, 2015 at 11​:32 AM, Paul Salazar
<paul.salazar@​testspectrum.com> wrote​:

So pre-COW​:

  the regex engine \*always\* copied after a successful match in the

presence
of $& etc, going quadratic always, regardless of whether the string
is
subsequently modified or not; (it actually didn't copy in the just
presence of captures, and just returned garbage in $1 etc if the
string was modified; this was a tradeoff of performance over
correctness).

All makes sense except I'm not seeing the pre-COW going quadratic. Same
example ran on 5.8

So, this was fine in 5.17.7 up until​:

1a904fc is the first bad commit
commit 1a904fc
Author​: Father Chrysostomos <sprout@​cpan.org>
Date​: Sun Nov 25 12​:57​:04 2012 -0800

 Disable PL\_sawampersand

 PL\_sawampersand actually causes bugs \(e\.g\.\, perl \#4289\)\, because the
 behaviour changes\.  eval '$&' after a match will produce different
 results depending on whether $& was seen before the match\.

 Using copy\-on\-write for the pre\-match copy \(preceding patches do that\)
 alleviates the slowdown caused by mentioning $&\.  The copy doesn’t
 happen unless the string is modified after the match\.  It’s now a
 post\- match copy\.  So we no longer need to do things differently
 depending on whether $& has been seen\.

 PL\_sawampersand is now \#defined to be equal to what it would be if
 every program began with $'\,$&\,$\`\.

 I left the PL\_sawampersand code in place\, in case this commit proves
 immature\.  Running Configure with \-Accflags=PERL\_SAWAMPERSAND will
 reënable the PL\_sawampersand mechanism\.

Perhaps something else is going on here?

-- Matthew Horsfall (alh)

As I vaguely recall, there is code in regexec that used to be able to
quit prematurely if there was no PL_sawampersand, but now has to go to
completion.

@p5pRT
Copy link
Author

p5pRT commented Feb 27, 2015

From @iabyn

On Wed, Feb 25, 2015 at 10​:32​:21AM -0600, Paul Salazar wrote​:

So pre-COW​:

the regex engine \*always\* copied after a successful match in the presence
of $& etc\, going quadratic always\, regardless of whether the string is
subsequently modified or not; \(it actually didn't copy in the just
presence of captures\, and just returned garbage in $1 etc if the
string was modified; this was a tradeoff of performance over
correctness\)\.

All makes sense except I'm not seeing the pre-COW going quadratic. Same
example ran on 5.8

~$ perl -v

This is perl, v5.8.8 built for msys-64int

Running match \s$ on 100000 iterations on concatenated string...
Time​: 0.0459511280059814

Running match \S$ on 100000 iterations on concatenated string...
Time​: 0.0440521240234375

Try adding $& to the code or changing the regex to "(\\s)\$"; you'll then
see the slowdown in 5.8.x too.

--
All wight. I will give you one more chance. This time, I want to hear
no Wubens. No Weginalds. No Wudolf the wed-nosed weindeers.
  -- Life of Brian

@demerphq
Copy link
Collaborator

@wolfsage you said something in this ticket I don't understand.

By design /n only prevents grouping parens from filling in $1, $2, etc...
Other things like $&, named captures, etc still work.

I don't quite get that. Named captures are $1, $2, etc but with a label. Was that a typo/thinko?

@wolfsage
Copy link
Contributor

@demerphq What I meant was this, does it help?

mhorsfall@tworivers:~$ perl -e '"hihey" =~ /(?<foo>hi)(hey)/; print "$1 $+{foo} $2\n"'
hi hi hey
mhorsfall@tworivers:~$ perl -e '"hihey" =~ /(?<foo>hi)(hey)/n; print "$1 $+{foo} $2\n"'
hi hi

/n turns () into (?:...), but (?<...>...) still works

@demerphq
Copy link
Collaborator

Yeah as I lay in bed trying to fall asleep i realized that you had meant that simple capturing parens were treated as non capturing. The language you used threw me off a touch, not that it was wrong just sufficiently different from the words I would have used that it didn't click at 2am. My bad. Thanks for the confirmation and sorry for the confusion. Rereading now with a more rested brain I see my mistake.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants