regex end of line match very slow #14539

p5pRT · 2015-02-24T19:26:31Z

Migrated from rt.perl.org#123918 (status was 'open')

Searchable as RT123918$

p5pRT · 2015-02-24T19:26:31Z

From @prsalazar

Created by @prsalazar

Reply-To: paul.salazar@testspectrum.com
From: paul.salazar@testspectrum.com
To: perlbug@perl.org
Subject: regex end of line match very slow
Message-Id: <5.20.1_6288_1424805345@badger>

This is a bug report for perl from paul.salazar@testspectrum.com,
generated with the help of perlbug 1.40 running under perl 5.20.1.

-----------------------------------------------------------------
Running the following regex matching code is very slow when matching
long strings for a positive match, but very fast for a negative match.

use Time::HiRes qw(gettimeofday);

my $iters = 100000;
my $x; my $strt = 0;

my $strng = ''; my $dat = '123';
my $reg = "\\s\$";

print "Running match $reg on $iters iterations on concatenated string...\n";

$strt = gettimeofday();
for ($x = 0; $x < $iters; $x++) {
$strng .= ($strng =~ m/$reg/o ? ' ' : '') . $dat . ' ';
}
print "Time: ".(gettimeofday() - $strt)."\n";

$strng = ''; $dat = '123';
$reg = "\\S\$";

print "\nRunning match $reg on $iters iterations on concatenated string...\n";

$strt = gettimeofday();
for ($x = 0; $x < $iters; $x++) {
$strng .= ($strng =~ m/$reg/o ? ' ' : '') . $dat . ' ';
}
print "Time: ".(gettimeofday() - $strt)."\n";

The output is as follows:

Running match \s$ on 100000 iterations on concatenated string...
Time: 6.53281188011169

Running match \S$ on 100000 iterations on concatenated string...
Time: 0.038999080657959

This problem does not exist in perl 5.12. I'm not sure when it was
introduced into core library. I discovered it when an app using
PDF::API2 starting going really slow after upgrading to 5.20 strawberry.

Perl Info

---
Flags:
    category=core
    severity=high
---
Site configuration information for perl 5.20.1:

Configured by strawberry-perl at Mon Sep 15 18:27:03 2014.

Summary of my perl5 (revision 5 version 20 subversion 1) configuration:
   
  Platform:
    osname=MSWin32, osvers=6.3, archname=MSWin32-x86-multi-thread-64int
    uname='Win32 strawberry-perl 5.20.1.1 #1 Mon Sep 15 18:25:23 2014 i386'
    config_args='undef'
    hint=recommended, useposix=true, d_sigaction=undef
    useithreads=define, usemultiplicity=define
    use64bitint=define, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags =' -s -O2 -DWIN32  -DPERL_TEXTMODE_SCRIPTS -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -fwrapv -fno-strict-aliasing -mms-bitfields',
    optimize='-s -O2',
    cppflags='-DWIN32'
    ccversion='', gccversion='4.8.3', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='long long', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='g++', ldflags ='-s -L"C:\Perl32\perl\lib\CORE" -L"C:\Perl32\c\lib"'
    libpth=C:\Perl32\c\lib C:\Perl32\c\i686-w64-mingw32\lib C:\Perl32\c\lib\gcc\i686-w64-mingw32\4.8.3
    libs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
    perllibs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
    libc=, so=dll, useshrplib=true, libperl=libperl520.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs, dlext=xs.dll, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-mdll -s -L"C:\Perl32\perl\lib\CORE" -L"C:\Perl32\c\lib"'


---
@INC for perl 5.20.1:
    C:/Perl32/perl/site/lib
    C:/Perl32/perl/vendor/lib
    C:/Perl32/perl/lib
    .

---
Environment for perl 5.20.1:
    HOME (unset)
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=C:\Program Files\ActiveState Perl Dev Kit 9.4\bin\;C:\Program Files (x86)\ActiveState Perl Dev Kit 9.0.1\bin\;C:\Perl\gtk\bin;C:\Perl\site\bin;C:\Perl\bin;C:\Program Files (x86)\ActiveState Perl Dev Kit 9.1.1\bin\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\WIDCOMM\Bluetooth Software\;C:\Program Files\WIDCOMM\Bluetooth Software\syswow64;C:\Program Files (x86)\Common Files\Roxio Shared\DLLShared\;C:\Program Files (x86)\Common Files\Roxio Shared\OEM\DLLShared\;C:\Program Files (x86)\Common Files\Roxio Shared\OEM\DLLShared\;C:\Program Files (x86)\Common Files\Roxio Shared\OEM\12.0\DLLShared\;C:\Program Files (x86)\Roxio\OEM\AudioCore\;C:\Program Files (x86)\NTRU Cryptosystems\NTRU TCG Software Stack\bin\;C:\Program Files\NTRU Cryptosystems\NTRU TCG Software Stack\bin\;C:\Program Files\Dell\Dell Data Protection\Access\Advanced\Wave\Gemalto\Access Client\v5\;C:\Program Files (x86)\cvsnt;C:\Program Files (x86)\jZip;C:\D\dmd2\windows\bin;C:\D\dm\bin;C:\Program Files (x86)\OpenVPN\bin;C:\TDM-GCC-64\bin;C:\Perl32\c\bin;C:\Perl32\perl\site\bin;C:\Perl32\perl\bin;C:\Perl64\c\bin;C:\Perl64\perl\site\bin;C:\Perl64\perl\bin;C:\Program Files (x86)\CVSNT\;C:\Program Files (x86)\IDM Computer Solutions\UltraEdit\;C:\Program Files (x86)\IDM Computer Solutions\UltraCompare\;C:\Program Files (x86)\OpenVPN\bin;C:\Gtk+\bin
    PERL_BADLANG (unset)
    SHELL (unset)

p5pRT · 2015-02-24T19:26:31Z

From @prsalazar

paul_salazar.vcf

p5pRT · 2015-02-25T14:37:23Z

From @iabyn

On Tue, Feb 24, 2015 at 11:26:31AM -0800, Paul Salazar wrote:

Running the following regex matching code is very slow when matching
long strings for a positive match, but very fast for a negative match.

use Time::HiRes qw(gettimeofday);

my $iters = 100000;
my $x; my $strt = 0;

my $strng = ''; my $dat = '123';
my $reg = "\\s\$";

print "Running match $reg on $iters iterations on concatenated string...\n";

$strt = gettimeofday();
for ($x = 0; $x < $iters; $x++) {
$strng .= ($strng =~ m/$reg/o ? ' ' : '') . $dat . ' ';
}
print "Time: ".(gettimeofday() - $strt)."\n";

$strng = ''; $dat = '123';
$reg = "\\S\$";

print "\nRunning match $reg on $iters iterations on concatenated string...\n";

$strt = gettimeofday();
for ($x = 0; $x < $iters; $x++) {
$strng .= ($strng =~ m/$reg/o ? ' ' : '') . $dat . ' ';
}
print "Time: ".(gettimeofday() - $strt)."\n";

The output is as follows:

Running match \s$ on 100000 iterations on concatenated string...
Time: 6.53281188011169

Running match \S$ on 100000 iterations on concatenated string...
Time: 0.038999080657959

This problem does not exist in perl 5.12. I'm not sure when it was
introduced into core library. I discovered it when an app using
PDF::API2 starting going really slow after upgrading to 5.20 strawberry.

It's a side-effect of copy-on-write (COW) strings, which first appeared in
5.20.0. Traditionally in the presence of captures or $`/$&/$', a
successful match would make a copy of the just-matched string. Without
their presence, the copy would be skipped, which unfortunately made things
like

$s =~ /abc/;
.... modify $s ...;
$x = eval '$&';

leave $x containing garbage.

With COW, a successfully matched string is marked as COW, with the regex
holding a second pointer to the string. If the string is subsequently
modified, the string is copied (hence the C in COW). Given that the
string is being repeatedly extended in small increments, physically coping
the whole string each time round the loop ends up going quadratic.

Conversely in the days before COW, something like

$s = 'x' x 1_000_000;
$&;
1 while $s =~ /./g;

would go quadratic on length.

So pre-COW:

the regex engine *always* copied after a successful match in the presence
of $& etc, going quadratic always, regardless of whether the string is
subsequently modified or not; (it actually didn't copy in the just
presence of captures, and just returned garbage in $1 etc if the
string was modified; this was a tradeoff of performance over
correctness).

Post-COW:

on a successful match the string is marked as COW, but no physical
copy is done. $1, $& etc will always be correct regardless of whether
the string is subsequently modified, and regardless of whether $& had
been seen prior to the regex being executed. However, if the string
is subsequently modified, a full copy is done.

I can't yet thing of any elegant solution for this.

Some theoretical workarounds that don't actually work are:

* 5.21.8 and later include the /n modifier, which doesn't set $1 etc,
but it appears to be buggy in that it seems to still make the string COW.

* Perform another successful match on a different string after the first
match but before modifying the string, e.g.

$long_string =~ /.../;
"a" =~ /./;
modify $long_string;

In theory that changes the "current regex" (PL_curpm) which is used
to calculate the values of $1 etc on the fly. Unfortunately when it is
changed, the old regex still holds a pointer to the COWed string;
perhaps we should look into dropping a COW reference each time PL_curpm
is updated????

--
Dave's first rule of Opera:
If something needs saying, say it: don't warble it.

p5pRT · 2015-02-25T14:37:24Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2015-02-25T16:07:51Z

From @wolfsage

On Wed, Feb 25, 2015 at 9:36 AM, Dave Mitchell <davem@iabyn.com> wrote:

* 5.21.8 and later include the /n modifier, which doesn't set $1 etc,
but it appears to be buggy in that it seems to still make the string COW.

Is this a bug of /n?

By design /n only prevents grouping parens from filling in $1, $2, etc...

Other things like $&, named captures, etc still work.

-- Matthew Horsfall (alh)

p5pRT · 2015-02-25T16:23:59Z

From @khwilliamson

On 02/25/2015 09:07 AM, Matthew Horsfall (alh) wrote:

On Wed, Feb 25, 2015 at 9:36 AM, Dave Mitchell <davem@iabyn.com> wrote:

* 5.21.8 and later include the /n modifier, which doesn't set $1 etc,
but it appears to be buggy in that it seems to still make the string COW.

Is this a bug of /n?

By design /n only prevents grouping parens from filling in $1, $2, etc...

Other things like $&, named captures, etc still work.

-- Matthew Horsfall (alh)

And, FWIW I think this is the correct design

p5pRT · 2015-02-25T16:56:33Z

From @iabyn

On Wed, Feb 25, 2015 at 11:07:32AM -0500, Matthew Horsfall (alh) wrote:

On Wed, Feb 25, 2015 at 9:36 AM, Dave Mitchell <davem@iabyn.com> wrote:

* 5.21.8 and later include the /n modifier, which doesn't set $1 etc,
but it appears to be buggy in that it seems to still make the string COW.

Is this a bug of /n?

By design /n only prevents grouping parens from filling in $1, $2, etc...

Other things like $&, named captures, etc still work.

Ah, I misunderstood its purpose.

--
But Pity stayed his hand. "It's a pity I've run out of bullets",
he thought. -- "Bored of the Rings"

p5pRT · 2015-02-25T19:19:37Z

From @prsalazar

So pre-COW:

 the regex engine \*always\* copied after a successful match in the presence
 of $& etc\, going quadratic always\, regardless of whether the string is
 subsequently modified or not; \(it actually didn't copy in the just
 presence of captures\, and just returned garbage in $1 etc if the
 string was modified; this was a tradeoff of performance over
 correctness\)\.

All makes sense except I'm not seeing the pre-COW going quadratic. Same
example ran on 5.8

~$ perl -v

This is perl, v5.8.8 built for msys-64int

Running match \s$ on 100000 iterations on concatenated string...
Time: 0.0459511280059814

Running match \S$ on 100000 iterations on concatenated string...
Time: 0.0440521240234375

p5pRT · 2015-02-25T19:56:23Z

From @wolfsage

On Wed, Feb 25, 2015 at 11:32 AM, Paul Salazar
<paul.salazar@testspectrum.com> wrote:

So pre-COW:
 the regex engine \*always\* copied after a successful match in the
presence
of $& etc, going quadratic always, regardless of whether the string
is
subsequently modified or not; (it actually didn't copy in the just
presence of captures, and just returned garbage in $1 etc if the
string was modified; this was a tradeoff of performance over
correctness).
All makes sense except I'm not seeing the pre-COW going quadratic. Same
example ran on 5.8

So, this was fine in 5.17.7 up until:

1a904fc is the first bad commit
commit 1a904fc
Author: Father Chrysostomos <sprout@cpan.org>
Date: Sun Nov 25 12:57:04 2012 -0800

Disable PL_sawampersand

PL_sawampersand actually causes bugs (e.g., perl #4289), because the
behaviour changes. eval '$&' after a match will produce different
results depending on whether $& was seen before the match.

Using copy-on-write for the pre-match copy (preceding patches do that)
alleviates the slowdown caused by mentioning $&. The copy doesn’t
happen unless the string is modified after the match. It’s now a
post- match copy. So we no longer need to do things differently
depending on whether $& has been seen.

PL_sawampersand is now #defined to be equal to what it would be if
every program began with $',$&,$`.

I left the PL_sawampersand code in place, in case this commit proves
immature. Running Configure with -Accflags=PERL_SAWAMPERSAND will
reënable the PL_sawampersand mechanism.

Perhaps something else is going on here?

-- Matthew Horsfall (alh)

p5pRT · 2015-02-25T20:08:26Z

From @khwilliamson

On 02/25/2015 12:55 PM, Matthew Horsfall (alh) wrote:

On Wed, Feb 25, 2015 at 11:32 AM, Paul Salazar
<paul.salazar@testspectrum.com> wrote:
So pre-COW:
  the regex engine \*always\* copied after a successful match in the
presence
of $& etc, going quadratic always, regardless of whether the string
is
subsequently modified or not; (it actually didn't copy in the just
presence of captures, and just returned garbage in $1 etc if the
string was modified; this was a tradeoff of performance over
correctness).
All makes sense except I'm not seeing the pre-COW going quadratic. Same
example ran on 5.8
So, this was fine in 5.17.7 up until:

1a904fc is the first bad commit
commit 1a904fc
Author: Father Chrysostomos <sprout@cpan.org>
Date: Sun Nov 25 12:57:04 2012 -0800
 Disable PL\_sawampersand

 PL\_sawampersand actually causes bugs $e\.g\.\, perl \#4289$\, because the
 behaviour changes\.  eval '$&' after a match will produce different
 results depending on whether $& was seen before the match\.

 Using copy\-on\-write for the pre\-match copy $preceding patches do that$
 alleviates the slowdown caused by mentioning $&\.  The copy doesn’t
 happen unless the string is modified after the match\.  It’s now a
 post\- match copy\.  So we no longer need to do things differently
 depending on whether $& has been seen\.

 PL\_sawampersand is now \#defined to be equal to what it would be if
 every program began with $'\,$&\,$\`\.

 I left the PL\_sawampersand code in place\, in case this commit proves
 immature\.  Running Configure with \-Accflags=PERL\_SAWAMPERSAND will
 reënable the PL\_sawampersand mechanism\.
Perhaps something else is going on here?

-- Matthew Horsfall (alh)

As I vaguely recall, there is code in regexec that used to be able to
quit prematurely if there was no PL_sawampersand, but now has to go to
completion.

p5pRT · 2015-02-27T09:32:40Z

From @iabyn

On Wed, Feb 25, 2015 at 10:32:21AM -0600, Paul Salazar wrote:

So pre-COW:
the regex engine \*always\* copied after a successful match in the presence
of $& etc\, going quadratic always\, regardless of whether the string is
subsequently modified or not; $it actually didn't copy in the just
presence of captures\, and just returned garbage in $1 etc if the
string was modified; this was a tradeoff of performance over
correctness$\.
All makes sense except I'm not seeing the pre-COW going quadratic. Same
example ran on 5.8

~$ perl -v

This is perl, v5.8.8 built for msys-64int

Running match \s$ on 100000 iterations on concatenated string...
Time: 0.0459511280059814

Running match \S$ on 100000 iterations on concatenated string...
Time: 0.0440521240234375

Try adding $& to the code or changing the regex to "(\\s)\$"; you'll then
see the slowdown in 5.8.x too.

--
All wight. I will give you one more chance. This time, I want to hear
no Wubens. No Weginalds. No Wudolf the wed-nosed weindeers.
-- Life of Brian

demerphq · 2023-01-21T17:40:51Z

@wolfsage you said something in this ticket I don't understand.

By design /n only prevents grouping parens from filling in $1, $2, etc...
Other things like $&, named captures, etc still work.

I don't quite get that. Named captures are $1, $2, etc but with a label. Was that a typo/thinko?

wolfsage · 2023-01-21T20:54:21Z

@demerphq What I meant was this, does it help?

mhorsfall@tworivers:~$ perl -e '"hihey" =~ /(?<foo>hi)(hey)/; print "$1 $+{foo} $2\n"'
hi hi hey
mhorsfall@tworivers:~$ perl -e '"hihey" =~ /(?<foo>hi)(hey)/n; print "$1 $+{foo} $2\n"'
hi hi

/n turns () into (?:...), but (?<...>...) still works

demerphq · 2023-01-22T01:42:46Z

Yeah as I lay in bed trying to fall asleep i realized that you had meant that simple capturing parens were treated as non capturing. The language you used threw me off a touch, not that it was wrong just sufficiently different from the words I would have used that it didn't click at 2am. My bad. Thanks for the confirmation and sorry for the confusion. Rereading now with a more rested brain I see my mistake.

p5pRT added the Severity Low label Oct 19, 2019

xenu removed the Severity Low label Dec 29, 2021

jkeenan added the type-regex label Jan 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

regex end of line match very slow #14539

regex end of line match very slow #14539

p5pRT commented Feb 24, 2015

p5pRT commented Feb 24, 2015

p5pRT commented Feb 24, 2015

p5pRT commented Feb 25, 2015

p5pRT commented Feb 25, 2015

p5pRT commented Feb 25, 2015

p5pRT commented Feb 25, 2015

p5pRT commented Feb 25, 2015

p5pRT commented Feb 25, 2015

p5pRT commented Feb 25, 2015

p5pRT commented Feb 25, 2015

p5pRT commented Feb 27, 2015

demerphq commented Jan 21, 2023

wolfsage commented Jan 21, 2023

demerphq commented Jan 22, 2023

regex end of line match very slow #14539

regex end of line match very slow #14539

Comments

p5pRT commented Feb 24, 2015

p5pRT commented Feb 24, 2015

From @prsalazar

Created by @prsalazar

p5pRT commented Feb 24, 2015

From @prsalazar

p5pRT commented Feb 25, 2015

From @iabyn

p5pRT commented Feb 25, 2015

p5pRT commented Feb 25, 2015

From @wolfsage

p5pRT commented Feb 25, 2015

From @khwilliamson

p5pRT commented Feb 25, 2015

From @iabyn

p5pRT commented Feb 25, 2015

From @prsalazar

p5pRT commented Feb 25, 2015

From @wolfsage

p5pRT commented Feb 25, 2015

From @khwilliamson

p5pRT commented Feb 27, 2015

From @iabyn

demerphq commented Jan 21, 2023

wolfsage commented Jan 21, 2023

demerphq commented Jan 22, 2023