Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp optimizer loses its hopes too soon #8440

Open
p5pRT opened this issue May 7, 2006 · 10 comments
Open

regexp optimizer loses its hopes too soon #8440

p5pRT opened this issue May 7, 2006 · 10 comments

Comments

@p5pRT
Copy link

p5pRT commented May 7, 2006

Migrated from rt.perl.org#39096 (status was 'stalled')

Searchable as RT39096$

@p5pRT
Copy link
Author

p5pRT commented May 7, 2006

From rvtol+perlbug@isolution.nl

Created by rvtol+perlbug@isolution.nl

I compared

  perl -le '("z"x55) =~ /^(.{0,10}){1,5}(?{$count++})$/m;
  print "<$count>"'

to

  perl -le '("z"x55) =~ /(?m-xis​:^(.{0,10}){1,5}(?{$count++})$)/;
  print "<$count>"'

and to

  perl -le '("z"x55) =~ /(?m-xis)^(.{0,10}){1,5}(?{$count++})$/;
  print "<$count>"'

and found that the first and the last variant return quickly (without
incrementing $count) and (only) the second variant takes more time.

Perl Info

Flags:
    category=core
    severity=high

Site configuration information for perl v5.8.6:

Configured by rj at Thu Feb  3 20:42:58 CET 2005.

Summary of my perl5 (revision 5 version 8 subversion 6) configuration:
  Platform:
    osname=freebsd, osvers=4.10-release-p2, archname=i386-freebsd-64int
    uname='freebsd xs0.xs4all.nl 4.10-release-p2 freebsd 4.10-release-p2 #1: thu aug 12 12:43:13 cest 2004 kai@xs0.xs4all.nl:usrobjusrsrcsysgeneric i386 '
    config_args='-sde -Dprefix=/usr/local -Darchlib=/usr/local/lib/perl5/5.8.6/mach -Dprivlib=/usr/local/lib/perl5/5.8.6 -Dman3dir=/usr/local/lib/perl5/5.8.6/perl/man/man3 -Dman1dir=/usr/local/man/man1 -Dsitearch=/usr/local/lib/perl5/site_perl/5.8.6/mach -Dsitelib=/usr/local/lib/perl5/site_perl/5.8.6 -Dscriptdir=/usr/local/bin -Dsiteman3dir=/usr/local/lib/perl5/5.8.6/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Ui_malloc -Ui_iconv -Uinstallusrbinperl -Dcc=cc -Doptimize=-O -pipe  -Duseshrplib -Dccflags=-DAPPLLIB_EXP="/usr/local/lib/perl5/5.8.6/BSDPAN" -Ud_dosuid -Ui_gdbm -Dusethreads=n -Dusemymalloc=y -Duse64bitint'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=define use64bitall=undef uselongdouble=undef
    usemymalloc=y, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-DAPPLLIB_EXP="/usr/local/lib/perl5/5.8.6/BSDPAN" -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing -pipe -I/usr/local/include',
    optimize='-O -pipe ',
    cppflags='-DAPPLLIB_EXP="/usr/local/lib/perl5/5.8.6/BSDPAN" -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='2.95.4 20020320 [FreeBSD]', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -Wl,-E -L/usr/local/lib'
    libpth=/usr/lib /usr/local/lib
    libs=-lgdbm -lm -lcrypt -lutil
    perllibs=-lm -lcrypt -lutil
    libc=, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='  -Wl,-R/usr/local/lib/perl5/5.8.6/mach/CORE'
    cccdlflags='-DPIC -fPIC', lddlflags='-shared  -L/usr/local/lib'

Locally applied patches:
    SUIDPERLIO0 - fix PERLIO_DEBUG local root exploit (CAN-2005-0155)
    SUIDPERLIO1 - fix PERLIO_DEBUG buffer overflow (CAN-2005-0156)


@INC for perl v5.8.6:
    /home/r/rvtol/.cpan/CPAN/i386-freebsd-64int
    /home/r/rvtol/.cpan/CPAN
    /usr/local/lib/perl5/site_perl/5.8.6/mach
    /usr/local/lib/perl5/site_perl/5.8.6
    /usr/local/lib/perl5/site_perl
    /usr/local/lib/perl5/5.8.6/BSDPAN
    /usr/local/lib/perl5/5.8.6/mach
    /usr/local/lib/perl5/5.8.6
    .


Environment for perl v5.8.6:
    HOME=/home/r/rvtol
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/home/r/rvtol/bin/rrd/lib
    LOGDIR (unset)
    PATH=/home/r/rvtol/bin:/usr/local/bin:/bin:/usr/bin:/usr/games:.
    PERL5LIB=:/home/r/rvtol/.cpan/CPAN
    PERL_BADLANG (unset)
    SHELL=/usr/local/bin/bash

@p5pRT
Copy link
Author

p5pRT commented May 7, 2006

From @druud62

'intuit' is not activated with /(?m​:^...$)/ regexp-format?

[1] perl -Mre=debug -e '"zzz" =~ /^.{1,2}$/m'

Freeing REx​: `","'
Compiling REx `^.{1,2}$'
size 6 Got 52 bytes for offset annotations.
first at 2
synthetic stclass `ANYOF[\0-\11\13-\377{unicode_all}]'.
  1​: MBOL(2)
  2​: CURLY {1,2}(5)
  4​: REG_ANY(0)
  5​: MEOL(6)
  6​: END(0)
floating `'$ at 1..2 (checking floating)
  stclass `ANYOF[\0-\11\13-\377{unicode_all}]'
  anchored(MBOL)
  minlen 1
Offsets​: [6]
  1[1] 3[5] 0[0] 2[1] 8[1] 9[0]
Guessing start of match, REx `^.{1,2}$' against `zzz'...
Found floating substr `'$ at offset 3...
Did not find /^/m...
Match rejected by optimizer
Freeing REx​: `"^.{1,2}$"'

[2] perl -Mre=debug -e '"zzz" =~ /(?m​:^.{1,2}$)/'

Freeing REx​: `","'
Compiling REx `(?m​:^.{1,2}$)'
size 6 Got 52 bytes for offset annotations.
first at 2
synthetic stclass `ANYOF[\0-\11\13-\377{unicode_all}]'.
  1​: MBOL(2)
  2​: CURLY {1,2}(5)
  4​: REG_ANY(0)
  5​: MEOL(6)
  6​: END(0)
stclass `ANYOF[\0-\11\13-\377{unicode_all}]'
  anchored(MBOL)
  minlen 1
Offsets​: [6]
  5[1] 7[5] 0[0] 6[1] 12[1] 14[0]
Matching REx `(?m​:^.{1,2}$)' against `zzz'
  Setting an EVAL scope, savestack=3
  0 <> <zzz> | 1​: MBOL
  0 <> <zzz> | 2​: CURLY {1,2}
  REG_ANY can match 2 times out of 2...
  Setting an EVAL scope, savestack=3
  2 <zz> <z> | 5​: MEOL
  failed...
  1 <z> <zz> | 5​: MEOL
  failed...
  failed...
Match failed
Freeing REx​: `"(?m​:^.{1,2}$)"'

@p5pRT
Copy link
Author

p5pRT commented May 7, 2006

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented May 11, 2006

From dland@landgren.net

In the two instances that match quickly, the optimser rejects the first
and third patterns, since it can determine that (.{0,10}){1,5} matches
at best 50 characters, and the target is 55 characters. In the second
instance, the (?m​:) assertion rewires ^, $ and thus kills this
optimisation. The engine has to brute force all the possibilities before
being satisfied that the pattern cannot match. I am not sure whether
this can be fixed without breaking stuff.

An alternative is to use \a and \z assertions instead, which are immune
to the (?m​:) assertion.

  perl -le '("z"x55) =~ /(?m-xis​:\a(.{0,10}){1,5}(?{$count++})\z)/;
print "<$count>"'

David

@p5pRT
Copy link
Author

p5pRT commented May 19, 2006

From @iabyn

On Sun, May 07, 2006 at 02​:07​:10AM -0700, rvtol+perlbug @​ isolution. nl wrote​:

[Please enter your report here]

I compared

perl -le '("z"x55) =~ /^(.{0,10}){1,5}(?{$count++})$/m;
print "<$count>"'

to

perl -le '("z"x55) =~ /(?m-xis​:^(.{0,10}){1,5}(?{$count++})$)/;
print "<$count>"'

and to

perl -le '("z"x55) =~ /(?m-xis)^(.{0,10}){1,5}(?{$count++})$/;
print "<$count>"'

and found that the first and the last variant return quickly (without
incrementing $count) and (only) the second variant takes more time.

Thnaks for the report. The second case is in fact the expected behaviour;
nested {}'s in general have exponential search time. It so happens that in
cases 1 and 3 the perl optimiser is clever enough to realise that the
pattern will never match without actually needing to run it. In the second
case, the pattern is too complex for perl to notice this.

Ideally the optimiser should be extended to note this case too, although i
doubt that anyone will have the time to look at this right now.

--
"Strange women lying in ponds distributing swords is no basis for a system
of government. Supreme executive power derives from a mandate from the
masses, not from some farcical aquatic ceremony."
  -- Dennis - Monty Python and the Holy Grail.

@p5pRT
Copy link
Author

p5pRT commented May 19, 2006

From @druud62

Dave Mitchell schreef​:

rvtol+perlbug @​ isolution.nl​:

I compared
perl -le '("z"x55) =~ /^(.{0,10}){1,5}(?{$count++})$/m;
print "<$count>"'
to
perl -le '("z"x55) =~ /(?m-xis​:^(.{0,10}){1,5}(?{$count++})$)/;
print "<$count>"'
and to
perl -le '("z"x55) =~ /(?m-xis)^(.{0,10}){1,5}(?{$count++})$/;
print "<$count>"'
and found that the first and the last variant return quickly (without
incrementing $count) and (only) the second variant takes more time.

Thnaks for the report. The second case is in fact the expected
behaviour; nested {}'s in general have exponential search time. It so
happens that in cases 1 and 3 the perl optimiser is clever enough to
realise that the
pattern will never match without actually needing to run it. In the
second case, the pattern is too complex for perl to notice this.

But AFAIK, a 1-to-1 mapping exists between these 3 regexps, so they mean
exactly the same.

Ideally the optimiser should be extended to note this case too,
although i doubt that anyone will have the time to look at this right
now.

Maybe a unifying step (for at least the modifiers) is feasible?

--
Affijn, Ruud

"Gewoon is een tijger."

@p5pRT
Copy link
Author

p5pRT commented May 19, 2006

From mjtg@cam.ac.uk

"Dr.Ruud" <rvtol+news@​isolution.nl> wrote

rvtol+perlbug @​ isolution.nl​:

perl -le '("z"x55) =~ /^(.{0,10}){1,5}(?{$count++})$/m;
print "<$count>"'

Maybe a unifying step (for at least the modifiers) is feasible?

They can't be straightforwardly unified because of the capturing parens.

Mike Guy

@p5pRT
Copy link
Author

p5pRT commented May 19, 2006

From @druud62

Mike Guy schreef​:

Dr.Ruud​:

perl -le '("z"x55) =~ /^(.{0,10}){1,5}(?{$count++})$/m;
print "<$count>"'

Maybe a unifying step (for at least the modifiers) is feasible?

They can't be straightforwardly unified because of the capturing
parens.

Care to explain? I see the same thing in all three.

The difference stays the same when non-capturing​:

("z"x65) =~ $_ for
  qr/ ^ (?​:.{0,10}) {1,6} (?{$count++}) $ /mx,
  qr/(?mx-is​: ^ (?​:.{0,10}) {1,6} (?{$count++}) $) / ,
  qr/(?mx-is) ^ (?​:.{0,10}) {1,6} (?{$count++}) $ / ;
print "<$count>\n"

Again, the optimizer only gets the first and third one.

Amazingly​:

print "$_\n" for
  qr/ ^ (?​:.{0,10}) {1,6} (?{$count++}) $ /mx,
  qr/(?mx-is​: ^ (?​:.{0,10}) {1,6} (?{$count++}) $) / ,
  qr/(?mx-is) ^ (?​:.{0,10}) {1,6} (?{$count++}) $ /
'

(?mx-is​: ^ (?​:.{0,10}) {1,6} (?{$count++}) $ )
(?-xism​:(?mx-is​: ^ (?​:.{0,10}) {1,6} (?{$count++}) $) ) <--
(?-xism​:(?mx-is) ^ (?​:.{0,10}) {1,6} (?{$count++}) $ )

That variant of the second also returns quickly. And that seems to only
depend on the 'pointless'
(?-xism​: ... ) layer around it...

$ perl -Mre=debug -e '("z"x65) =~ /(?-xism​:(?mx-is​: ^ (?​:.{0,10}) {1,6}
(?{$count++}) $) )/; print "<$count>"'

Compiling REx `(?-xism​:(?mx-is​: ^ (?​:.{0,10}) {1,6} (?{$count++}) $) )'
size 14 Got 116 bytes for offset annotations.
first at 2
synthetic stclass `ANYOF[\0-\11\13-\377{unicode_all}]'.
  1​: MBOL(2)
  2​: CURLYX[0] {1,6}(8)
  4​: CURLY {0,10}(7)
  6​: REG_ANY(0)
  7​: WHILEM(0)
  8​: NOTHING(9)
  9​: EVAL(11)
  11​: MEOL(12)
  12​: EXACT < >(14)
  14​: END(0)
floating ` ' at 0..60 (checking floating) stclass
`ANYOF[\0-\11\13-\377{unicode_all}]' anchored(MBOL) minlen 1 with eval
Offsets​: [14]
  19[1] 32[6] 0[0] 24[6] 0[0] 23[1] 37[0] 37[0] 38[14] 0[0] 52[1]
54[1] 0[0] 56[0]
Guessing start of match, REx `(?-xism​:(?mx-is​: ^ (?​:.{0,10}) {1,6}
(?{$count++}) $) )' against `zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
zzzzzzzzzzzzzzzzzzzzzzzz...'...
Did not find floating substr ` '...
Match rejected by optimizer

--
Affijn, Ruud

"Gewoon is een tijger."

@p5pRT
Copy link
Author

p5pRT commented Dec 2, 2006

From @demerphq

On Fri May 19 16​:33​:49 2006, rvtol+news@​isolution.nl wrote​:

Mike Guy schreef​:

Dr.Ruud​:

perl -le '("z"x55) =~ /^(.{0,10}){1,5}(?{$count++})$/m;
print "<$count>"'

Maybe a unifying step (for at least the modifiers) is feasible?

They can't be straightforwardly unified because of the capturing
parens.

Care to explain? I see the same thing in all three.

The difference stays the same when non-capturing​:

("z"x65) =~ $_ for
qr/ ^ (?​:.{0,10}) {1,6} (?{$count++}) $ /mx,
qr/(?mx-is​: ^ (?​:.{0,10}) {1,6} (?{$count++}) $) / ,
qr/(?mx-is) ^ (?​:.{0,10}) {1,6} (?{$count++}) $ / ;
print "<$count>\n"

Be careful that should read

("z"x65) =~ $_ for
  qr/ ^ (?​:.{0,10}) {1,6} (?{$count++}) $ /mx,
  qr/(?mx-is​: ^ (?​:.{0,10}) {1,6} (?{$count++}) $)/ ,
  qr/(?mx-is) ^ (?​:.{0,10}) {1,6} (?{$count++}) $ / ;
print "<$count>\n"

Otherwise the second one tries to look for a space.

Compiling REx `(?-xism​:(?mx-is​: ^ (?​:.{0,10}) {1,6} (?{$count++}) $) )'
[....]
floating ` ' at 0..60 (checking floating) stclass

This is the important part.

`ANYOF[\0-\11\13-\377{unicode_all}]' anchored(MBOL) minlen 1 with eval
Offsets​: [14]
19[1] 32[6] 0[0] 24[6] 0[0] 23[1] 37[0] 37[0] 38[14] 0[0] 52[1]
54[1] 0[0] 56[0]
Guessing start of match, REx `(?-xism​:(?mx-is​: ^ (?​:.{0,10}) {1,6}
(?{$count++}) $) )' against `zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
zzzzzzzzzzzzzzzzzzzzzzzz...'...
Did not find floating substr ` '...
Match rejected by optimizer

What seems to be happening is that the optimiser is over-pessimising
because it needs to take into account weird stuff like

  /foo+(?m​:blah$)\nblah$/

At least thats how i understand it. So while /(?msix​:...)/ affects what
regops are produced it has the sideeffect that the flags dont propagate
out of the pattern, and /m has effects at both compile time and run time.

I know what to change to make it so that

  /(?m​:....)/

is changed to be the same as

  /(?m)/

but i dont know what do about other cases, such as

  /(?m​:....)(?m​:.....)/

And I also have no idea why

  ("z"x65) =~ qr/^ (?​:.{0,10}){1,6} (?{$count++}) $/mx

is efficient (ie the optimiser rejects the match outright)

But

  ('z'x65) =~ qr/^ (?​:.{0,10}){1,6} (?{$count++}) $/x

is not. I would have thought the latter could be rejected outright, i
dont see why its not.

Ideas on how to procede welcome.

Cheers,
Yves

@p5pRT
Copy link
Author

p5pRT commented Dec 2, 2006

@demerphq - Status changed from 'open' to 'stalled'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants