regexp optimizer loses its hopes too soon #8440

p5pRT · 2006-05-07T09:07:08Z

Migrated from rt.perl.org#39096 (status was 'stalled')

Searchable as RT39096$

p5pRT · 2006-05-07T09:07:09Z

From rvtol+perlbug@isolution.nl

Created by rvtol+perlbug@isolution.nl

I compared

perl -le '("z"x55) =~ /^(.{0,10}){1,5}(?{$count++})$/m;
print "<$count>"'

to

perl -le '("z"x55) =~ /(?m-xis:^(.{0,10}){1,5}(?{$count++})$)/;
print "<$count>"'

and to

perl -le '("z"x55) =~ /(?m-xis)^(.{0,10}){1,5}(?{$count++})$/;
print "<$count>"'

and found that the first and the last variant return quickly (without
incrementing $count) and (only) the second variant takes more time.

Perl Info


Flags:
    category=core
    severity=high

Site configuration information for perl v5.8.6:

Configured by rj at Thu Feb  3 20:42:58 CET 2005.

Summary of my perl5 (revision 5 version 8 subversion 6) configuration:
  Platform:
    osname=freebsd, osvers=4.10-release-p2, archname=i386-freebsd-64int
    uname='freebsd xs0.xs4all.nl 4.10-release-p2 freebsd 4.10-release-p2 #1: thu aug 12 12:43:13 cest 2004 kai@xs0.xs4all.nl:usrobjusrsrcsysgeneric i386 '
    config_args='-sde -Dprefix=/usr/local -Darchlib=/usr/local/lib/perl5/5.8.6/mach -Dprivlib=/usr/local/lib/perl5/5.8.6 -Dman3dir=/usr/local/lib/perl5/5.8.6/perl/man/man3 -Dman1dir=/usr/local/man/man1 -Dsitearch=/usr/local/lib/perl5/site_perl/5.8.6/mach -Dsitelib=/usr/local/lib/perl5/site_perl/5.8.6 -Dscriptdir=/usr/local/bin -Dsiteman3dir=/usr/local/lib/perl5/5.8.6/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Ui_malloc -Ui_iconv -Uinstallusrbinperl -Dcc=cc -Doptimize=-O -pipe  -Duseshrplib -Dccflags=-DAPPLLIB_EXP="/usr/local/lib/perl5/5.8.6/BSDPAN" -Ud_dosuid -Ui_gdbm -Dusethreads=n -Dusemymalloc=y -Duse64bitint'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=define use64bitall=undef uselongdouble=undef
    usemymalloc=y, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-DAPPLLIB_EXP="/usr/local/lib/perl5/5.8.6/BSDPAN" -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing -pipe -I/usr/local/include',
    optimize='-O -pipe ',
    cppflags='-DAPPLLIB_EXP="/usr/local/lib/perl5/5.8.6/BSDPAN" -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='2.95.4 20020320 [FreeBSD]', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -Wl,-E -L/usr/local/lib'
    libpth=/usr/lib /usr/local/lib
    libs=-lgdbm -lm -lcrypt -lutil
    perllibs=-lm -lcrypt -lutil
    libc=, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='  -Wl,-R/usr/local/lib/perl5/5.8.6/mach/CORE'
    cccdlflags='-DPIC -fPIC', lddlflags='-shared  -L/usr/local/lib'

Locally applied patches:
    SUIDPERLIO0 - fix PERLIO_DEBUG local root exploit (CAN-2005-0155)
    SUIDPERLIO1 - fix PERLIO_DEBUG buffer overflow (CAN-2005-0156)


@INC for perl v5.8.6:
    /home/r/rvtol/.cpan/CPAN/i386-freebsd-64int
    /home/r/rvtol/.cpan/CPAN
    /usr/local/lib/perl5/site_perl/5.8.6/mach
    /usr/local/lib/perl5/site_perl/5.8.6
    /usr/local/lib/perl5/site_perl
    /usr/local/lib/perl5/5.8.6/BSDPAN
    /usr/local/lib/perl5/5.8.6/mach
    /usr/local/lib/perl5/5.8.6
    .


Environment for perl v5.8.6:
    HOME=/home/r/rvtol
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/home/r/rvtol/bin/rrd/lib
    LOGDIR (unset)
    PATH=/home/r/rvtol/bin:/usr/local/bin:/bin:/usr/bin:/usr/games:.
    PERL5LIB=:/home/r/rvtol/.cpan/CPAN
    PERL_BADLANG (unset)
    SHELL=/usr/local/bin/bash

p5pRT · 2006-05-07T16:11:57Z

From @druud62

'intuit' is not activated with /(?m:^...$)/ regexp-format?

[1] perl -Mre=debug -e '"zzz" =~ /^.{1,2}$/m'

Freeing REx: `","'
Compiling REx `^.{1,2}$'
size 6 Got 52 bytes for offset annotations.
first at 2
synthetic stclass `ANYOF[\0-\11\13-\377{unicode_all}]'.
1: MBOL(2)
2: CURLY {1,2}(5)
4: REG_ANY(0)
5: MEOL(6)
6: END(0)
floating `'$ at 1..2 (checking floating)
stclass `ANYOF[\0-\11\13-\377{unicode_all}]'
anchored(MBOL)
minlen 1
Offsets: [6]
1[1] 3[5] 0[0] 2[1] 8[1] 9[0]
Guessing start of match, REx `^.{1,2}$' against `zzz'...
Found floating substr `'$ at offset 3...
Did not find /^/m...
Match rejected by optimizer
Freeing REx: `"^.{1,2}$"'

[2] perl -Mre=debug -e '"zzz" =~ /(?m:^.{1,2}$)/'

Freeing REx: `","'
Compiling REx `(?m:^.{1,2}$)'
size 6 Got 52 bytes for offset annotations.
first at 2
synthetic stclass `ANYOF[\0-\11\13-\377{unicode_all}]'.
1: MBOL(2)
2: CURLY {1,2}(5)
4: REG_ANY(0)
5: MEOL(6)
6: END(0)
stclass `ANYOF[\0-\11\13-\377{unicode_all}]'
anchored(MBOL)
minlen 1
Offsets: [6]
5[1] 7[5] 0[0] 6[1] 12[1] 14[0]
Matching REx `(?m:^.{1,2}$)' against `zzz'
Setting an EVAL scope, savestack=3
0 <> <zzz> | 1: MBOL
0 <> <zzz> | 2: CURLY {1,2}
REG_ANY can match 2 times out of 2...
Setting an EVAL scope, savestack=3
2 <zz> <z> | 5: MEOL
failed...
1 <z> <zz> | 5: MEOL
failed...
failed...
Match failed
Freeing REx: `"(?m:^.{1,2}$)"'

p5pRT · 2006-05-07T16:12:00Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2006-05-11T16:37:58Z

From dland@landgren.net

In the two instances that match quickly, the optimser rejects the first
and third patterns, since it can determine that (.{0,10}){1,5} matches
at best 50 characters, and the target is 55 characters. In the second
instance, the (?m:) assertion rewires ^, $ and thus kills this
optimisation. The engine has to brute force all the possibilities before
being satisfied that the pattern cannot match. I am not sure whether
this can be fixed without breaking stuff.

An alternative is to use \a and \z assertions instead, which are immune
to the (?m:) assertion.

perl -le '("z"x55) =~ /(?m-xis:\a(.{0,10}){1,5}(?{$count++})\z)/;
print "<$count>"'

David

p5pRT · 2006-05-19T00:32:12Z

From @iabyn

On Sun, May 07, 2006 at 02:07:10AM -0700, rvtol+perlbug @ isolution. nl wrote:

[Please enter your report here]

I compared

perl -le '("z"x55) =~ /^(.{0,10}){1,5}(?{$count++})$/m;
print "<$count>"'

to

perl -le '("z"x55) =~ /(?m-xis:^(.{0,10}){1,5}(?{$count++})$)/;
print "<$count>"'

and to

perl -le '("z"x55) =~ /(?m-xis)^(.{0,10}){1,5}(?{$count++})$/;
print "<$count>"'

and found that the first and the last variant return quickly (without
incrementing $count) and (only) the second variant takes more time.

Thnaks for the report. The second case is in fact the expected behaviour;
nested {}'s in general have exponential search time. It so happens that in
cases 1 and 3 the perl optimiser is clever enough to realise that the
pattern will never match without actually needing to run it. In the second
case, the pattern is too complex for perl to notice this.

Ideally the optimiser should be extended to note this case too, although i
doubt that anyone will have the time to look at this right now.

--
"Strange women lying in ponds distributing swords is no basis for a system
of government. Supreme executive power derives from a mandate from the
masses, not from some farcical aquatic ceremony."
-- Dennis - Monty Python and the Holy Grail.

p5pRT · 2006-05-19T16:03:16Z

From @druud62

Dave Mitchell schreef:

rvtol+perlbug @ isolution.nl:

I compared
perl -le '("z"x55) =~ /^(.{0,10}){1,5}(?{$count++})$/m;
print "<$count>"'
to
perl -le '("z"x55) =~ /(?m-xis:^(.{0,10}){1,5}(?{$count++})$)/;
print "<$count>"'
and to
perl -le '("z"x55) =~ /(?m-xis)^(.{0,10}){1,5}(?{$count++})$/;
print "<$count>"'
and found that the first and the last variant return quickly (without
incrementing $count) and (only) the second variant takes more time.

Thnaks for the report. The second case is in fact the expected
behaviour; nested {}'s in general have exponential search time. It so
happens that in cases 1 and 3 the perl optimiser is clever enough to
realise that the
pattern will never match without actually needing to run it. In the
second case, the pattern is too complex for perl to notice this.

But AFAIK, a 1-to-1 mapping exists between these 3 regexps, so they mean
exactly the same.

Ideally the optimiser should be extended to note this case too,
although i doubt that anyone will have the time to look at this right
now.

Maybe a unifying step (for at least the modifiers) is feasible?

--
Affijn, Ruud

"Gewoon is een tijger."

p5pRT · 2006-05-19T16:45:35Z

From mjtg@cam.ac.uk

"Dr.Ruud" <rvtol+news@isolution.nl> wrote

rvtol+perlbug @ isolution.nl:

perl -le '("z"x55) =~ /^(.{0,10}){1,5}(?{$count++})$/m;
print "<$count>"'

Maybe a unifying step (for at least the modifiers) is feasible?

They can't be straightforwardly unified because of the capturing parens.

Mike Guy

p5pRT · 2006-05-19T23:33:49Z

From @druud62

Mike Guy schreef:

Dr.Ruud:

perl -le '("z"x55) =~ /^(.{0,10}){1,5}(?{$count++})$/m;
print "<$count>"'

Maybe a unifying step (for at least the modifiers) is feasible?

They can't be straightforwardly unified because of the capturing
parens.

Care to explain? I see the same thing in all three.

The difference stays the same when non-capturing:

("z"x65) =~ $_ for
qr/ ^ (?:.{0,10}) {1,6} (?{$count++}) $ /mx,
qr/(?mx-is: ^ (?:.{0,10}) {1,6} (?{$count++}) $) / ,
qr/(?mx-is) ^ (?:.{0,10}) {1,6} (?{$count++}) $ / ;
print "<$count>\n"

Again, the optimizer only gets the first and third one.

Amazingly:

print "$_\n" for
qr/ ^ (?:.{0,10}) {1,6} (?{$count++}) $ /mx,
qr/(?mx-is: ^ (?:.{0,10}) {1,6} (?{$count++}) $) / ,
qr/(?mx-is) ^ (?:.{0,10}) {1,6} (?{$count++}) $ /
'

(?mx-is: ^ (?:.{0,10}) {1,6} (?{$count++}) $ )
(?-xism:(?mx-is: ^ (?:.{0,10}) {1,6} (?{$count++}) $) ) <--
(?-xism:(?mx-is) ^ (?:.{0,10}) {1,6} (?{$count++}) $ )

That variant of the second also returns quickly. And that seems to only
depend on the 'pointless'
(?-xism: ... ) layer around it...

$ perl -Mre=debug -e '("z"x65) =~ /(?-xism:(?mx-is: ^ (?:.{0,10}) {1,6}
(?{$count++}) $) )/; print "<$count>"'

Compiling REx `(?-xism:(?mx-is: ^ (?:.{0,10}) {1,6} (?{$count++}) $) )'
size 14 Got 116 bytes for offset annotations.
first at 2
synthetic stclass `ANYOF[\0-\11\13-\377{unicode_all}]'.
1: MBOL(2)
2: CURLYX[0] {1,6}(8)
4: CURLY {0,10}(7)
6: REG_ANY(0)
7: WHILEM(0)
8: NOTHING(9)
9: EVAL(11)
11: MEOL(12)
12: EXACT < >(14)
14: END(0)
floating ` ' at 0..60 (checking floating) stclass
`ANYOF[\0-\11\13-\377{unicode_all}]' anchored(MBOL) minlen 1 with eval
Offsets: [14]
19[1] 32[6] 0[0] 24[6] 0[0] 23[1] 37[0] 37[0] 38[14] 0[0] 52[1]
54[1] 0[0] 56[0]
Guessing start of match, REx `(?-xism:(?mx-is: ^ (?:.{0,10}) {1,6}
(?{$count++}) $) )' against `zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
zzzzzzzzzzzzzzzzzzzzzzzz...'...
Did not find floating substr ` '...
Match rejected by optimizer

--
Affijn, Ruud

"Gewoon is een tijger."

p5pRT · 2006-12-02T19:59:53Z

From @demerphq

On Fri May 19 16:33:49 2006, rvtol+news@isolution.nl wrote:

Mike Guy schreef:

Dr.Ruud:

perl -le '("z"x55) =~ /^(.{0,10}){1,5}(?{$count++})$/m;
print "<$count>"'

Maybe a unifying step (for at least the modifiers) is feasible?

They can't be straightforwardly unified because of the capturing
parens.

Care to explain? I see the same thing in all three.

The difference stays the same when non-capturing:

("z"x65) =~ $_ for
qr/ ^ (?:.{0,10}) {1,6} (?{$count++}) $ /mx,
qr/(?mx-is: ^ (?:.{0,10}) {1,6} (?{$count++}) $) / ,
qr/(?mx-is) ^ (?:.{0,10}) {1,6} (?{$count++}) $ / ;
print "<$count>\n"

Be careful that should read

("z"x65) =~ $_ for
qr/ ^ (?:.{0,10}) {1,6} (?{$count++}) $ /mx,
qr/(?mx-is: ^ (?:.{0,10}) {1,6} (?{$count++}) $)/ ,
qr/(?mx-is) ^ (?:.{0,10}) {1,6} (?{$count++}) $ / ;
print "<$count>\n"

Otherwise the second one tries to look for a space.

Compiling REx `(?-xism:(?mx-is: ^ (?:.{0,10}) {1,6} (?{$count++}) $) )'
[....]
floating ` ' at 0..60 (checking floating) stclass

This is the important part.

`ANYOF[\0-\11\13-\377{unicode_all}]' anchored(MBOL) minlen 1 with eval
Offsets: [14]
19[1] 32[6] 0[0] 24[6] 0[0] 23[1] 37[0] 37[0] 38[14] 0[0] 52[1]
54[1] 0[0] 56[0]
Guessing start of match, REx `(?-xism:(?mx-is: ^ (?:.{0,10}) {1,6}
(?{$count++}) $) )' against `zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
zzzzzzzzzzzzzzzzzzzzzzzz...'...
Did not find floating substr ` '...
Match rejected by optimizer

What seems to be happening is that the optimiser is over-pessimising
because it needs to take into account weird stuff like

/foo+(?m:blah$)\nblah$/

At least thats how i understand it. So while /(?msix:...)/ affects what
regops are produced it has the sideeffect that the flags dont propagate
out of the pattern, and /m has effects at both compile time and run time.

I know what to change to make it so that

/(?m:....)/

is changed to be the same as

/(?m)/

but i dont know what do about other cases, such as

/(?m:....)(?m:.....)/

And I also have no idea why

("z"x65) =~ qr/^ (?:.{0,10}){1,6} (?{$count++}) $/mx

is efficient (ie the optimiser rejects the match outright)

But

('z'x65) =~ qr/^ (?:.{0,10}){1,6} (?{$count++}) $/x

is not. I would have thought the latter could be rejected outright, i
dont see why its not.

Ideas on how to procede welcome.

Cheers,
Yves

p5pRT · 2006-12-02T19:59:54Z

@demerphq - Status changed from 'open' to 'stalled'

p5pRT added Severity High distro-freebsd type-core labels Oct 18, 2019

xenu removed affects-5.8 labels Nov 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

regexp optimizer loses its hopes too soon #8440

regexp optimizer loses its hopes too soon #8440

p5pRT commented May 7, 2006

p5pRT commented May 7, 2006

p5pRT commented May 7, 2006

p5pRT commented May 7, 2006

p5pRT commented May 11, 2006

p5pRT commented May 19, 2006

p5pRT commented May 19, 2006

p5pRT commented May 19, 2006

p5pRT commented May 19, 2006

p5pRT commented Dec 2, 2006

p5pRT commented Dec 2, 2006

regexp optimizer loses its hopes too soon #8440

regexp optimizer loses its hopes too soon #8440

Comments

p5pRT commented May 7, 2006

p5pRT commented May 7, 2006

From rvtol+perlbug@isolution.nl

Created by rvtol+perlbug@isolution.nl

p5pRT commented May 7, 2006

From @druud62

p5pRT commented May 7, 2006

p5pRT commented May 11, 2006

From dland@landgren.net

p5pRT commented May 19, 2006

From @iabyn

p5pRT commented May 19, 2006

From @druud62

p5pRT commented May 19, 2006

From mjtg@cam.ac.uk

p5pRT commented May 19, 2006

From @druud62

p5pRT commented Dec 2, 2006

From @demerphq

p5pRT commented Dec 2, 2006