Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pathological performance problem with RegExp engine #12360

Open
p5pRT opened this issue Aug 28, 2012 · 7 comments
Open

Pathological performance problem with RegExp engine #12360

p5pRT opened this issue Aug 28, 2012 · 7 comments

Comments

@p5pRT
Copy link

p5pRT commented Aug 28, 2012

Migrated from rt.perl.org#114636 (status was 'open')

Searchable as RT114636$

@p5pRT
Copy link
Author

p5pRT commented Aug 28, 2012

From malch@malch.com

Created by malch@malch.com

This problem has also been reproduced on

* ActivePerl 5.14.2 on Windows 7 64-bit

* Perl 5.8.8 built for i686-linux

The following RegExp does not match and indeed it should not.
However, execution requires more than 120 seconds of CPU and
this increases exponentially as $string is lengthened.

$string=q~
/#20Char#20Char1/Span/#20Char#20Char2/Span/#20Char#20Char3/Span/#20Char#20Char4/Span/#20Char#20Char5/Span/#20Char#20Char6/Span/Table#20of#20Figures/P/InlineShape/Figure/Default/P/References#20Char/Span/HeadTOC/P/DropCap/Figure/CM66/P/CM67#20Char/Span/Outline/Span/Subscript/Span/Superscript/Span/TableTitle#20Char/Span/Figure#20Heading#20Char/Span/Albany2/P/CM17/P/Caption/P/TOA/TOC/Page#20Number/Span/TOF/TOC/Emphasis/Span/Strikeout/Span/Body#20Char/Span/TextBox/Div/Heading#201/H1/heading#201/P/Heading#202/H2/heading#202/P/Heading#203/H3/Normal/P/h
~;
$pdf='\s*(/\S+\s*){0,5}/xxx';
if ($string=~m/$pdf/is) {
  print "Hit\n";
} else {
  print "Miss\n";
}

Malcolm Hoar
malch@​malch.com

Perl Info

Flags:
     category=core
     severity=medium

Site configuration information for perl 5.12.3:

Configured by sshd_server at Wed Feb  9 14:19:11 2011.

Summary of my perl5 (revision 5 version 12 subversion 3) configuration:

   Platform:
     osname=MSWin32, osvers=5.2, archname=MSWin32-x64-multi-thread
     uname=''
     config_args='undef'
     hint=recommended, useposix=true, d_sigaction=undef
     useithreads=define, usemultiplicity=define
     useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
     use64bitint=define, use64bitall=undef, uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='cl', ccflags ='-nologo -GF -W3 -MD -Zi -DNDEBUG -Ox -GL 
-fp:precise -DWIN32 -D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DWIN64 
-DCONSERVATIVE -DUSE_SITECUSTOMIZE -DPERL_IMPLICIT_CONTEXT 
-DPERL_IMPLICIT_SYS -DUSE_PERLIO',
     optimize='-MD -Zi -DNDEBUG -Ox -GL -fp:precise',
     cppflags='-DWIN32'
     ccversion='14.00.40310.41', gccversion='', gccosandvers=''
     intsize=4, longsize=4, ptrsize=8, doublesize=8, byteorder=12345678
     d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=8
     ivtype='__int64', ivsize=8, nvtype='double', nvsize=8, 
Off_t='__int64', lseeksize=8
     alignbytes=8, prototype=define
   Linker and Libraries:
     ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf 
-ltcg  -libpath:"C:\Perl\lib\CORE"  -machine:AMD64'
     libpth=\lib
     libs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib 
  comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib 
netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib  version.lib 
odbc32.lib odbccp32.lib comctl32.lib bufferoverflowU.lib msvcrt.lib
     perllibs=  oldnames.lib kernel32.lib user32.lib gdi32.lib 
winspool.lib  comdlg32.lib advapi32.lib shell32.lib ole32.lib 
oleaut32.lib  netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib 
version.lib odbc32.lib odbccp32.lib comctl32.lib bufferoverflowU.lib 
msvcrt.lib
     libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl512.lib
     gnulibc_version=''
   Dynamic Linking:
     dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
     cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug 
-opt:ref,icf -ltcg  -libpath:"C:\Perl\lib\CORE"  -machine:AMD64'

Locally applied patches:
     ACTIVEPERL_LOCAL_PATCHES_ENTRY
     c6fbf28 [perl #71806] perldb does not setup %dbline with the 
shebang option -d
     1fd8fa4 Add Wolfram Humann to AUTHORS
     f120055 make string-append on win32 100 times faster
     a2a8d15 Define _USE_32BIT_TIME_T for VC6 and VC7
     007cfe1 Don't pretend to support really old VC++ compilers
     6d8f7c9 Get rid of obsolete PerlCRT.dll support
     d956618 Make Term::ReadLine::findConsole fall back to STDIN if 
/dev/tty can't be opened
     321e50c Escape patch strings before embedding them in patchlevel.h


@INC for perl 5.12.3:
     C:/Perl/site/lib
     C:/Perl/lib
     .


Environment for perl 5.12.3:
     HOME (unset)
     LANG (unset)
     LANGUAGE (unset)
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     PATH=C:\Program Files (x86)\NVIDIA 
Corporation\PhysX\Common;C:\Reskit\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;c:\bin;c:\wintools;C:\Perl\site\bin;C:\Perl\bin;C:\Photo\ImageMagick;C:\Program 
Files (x86)\Common 
Files\Intuit\QBPOSSDKRuntime;C:\DVD\MKVTool;C:\DVD\IsoBuster;C:\Wintools\Git\cmd;C:\DVD\QuickTime\QTSystem\
     PERL_BADLANG (unset)
     SHELL (unset)

-- 
|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
| Malcolm Hoar           "The more I practice, the luckier I get". |
| malch@malch.com                                     Gary Player. |
| http://www.malch.com/                                            |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@p5pRT
Copy link
Author

p5pRT commented Aug 28, 2012

From malch@malch.com

Created by malch@malch.com

This problem has also been reproduced on

* ActivePerl 5.14.2 on Windows 7 64-bit

* Perl 5.8.8 built for i686-linux

The following RegExp does not match and indeed it should not.
However, execution requires more than 120 seconds of CPU and
this increases exponentially as $string is lengthened.

$string=q~
/#20Char#20Char1/Span/#20Char#20Char2/Span/#20Char#20Char3/Span/#20Char#20Char4/Span/#20Char#20Char5/Span/#20Char#20Char6/Span/Table#20of#20Figures/P/InlineShape/Figure/Default/P/References#20Char/Span/HeadTOC/P/DropCap/Figure/CM66/P/CM67#20Char/Span/Outline/Span/Subscript/Span/Superscript/Span/TableTitle#20Char/Span/Figure#20Heading#20Char/Span/Albany2/P/CM17/P/Caption/P/TOA/TOC/Page#20Number/Span/TOF/TOC/Emphasis/Span/Strikeout/Span/Body#20Char/Span/TextBox/Div/Heading#201/H1/heading#201/P/Heading#202/H2/heading#202/P/Heading#203/H3/Normal/P/h
~;
$pdf='\s*(/\S+\s*){0,5}/xxx';
if ($string=~m/$pdf/is) {
  print "Hit\n";
} else {
  print "Miss\n";
}

Malcolm Hoar
malch@​malch.com

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl 5.12.3:

Configured by sshd_server at Wed Feb  9 14:19:11 2011.

Summary of my perl5 (revision 5 version 12 subversion 3) configuration:
   
  Platform:
    osname=MSWin32, osvers=5.2, archname=MSWin32-x64-multi-thread
    uname=''
    config_args='undef'
    hint=recommended, useposix=true, d_sigaction=undef
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cl', ccflags ='-nologo -GF -W3 -MD -Zi -DNDEBUG -Ox -GL -fp:precise -DWIN32 -D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DWIN64 -DCONSERVATIVE -DUSE_SITECUSTOMIZE -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO',
    optimize='-MD -Zi -DNDEBUG -Ox -GL -fp:precise',
    cppflags='-DWIN32'
    ccversion='14.00.40310.41', gccversion='', gccosandvers=''
    intsize=4, longsize=4, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=8
    ivtype='__int64', ivsize=8, nvtype='double', nvsize=8, Off_t='__int64', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf -ltcg  -libpath:"C:\Perl\lib\CORE"  -machine:AMD64'
    libpth=\lib
    libs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib  comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib  netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib  version.lib odbc32.lib odbccp32.lib comctl32.lib bufferoverflowU.lib msvcrt.lib
    perllibs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib  comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib  netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib  version.lib odbc32.lib odbccp32.lib comctl32.lib bufferoverflowU.lib msvcrt.lib
    libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl512.lib
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug -opt:ref,icf -ltcg  -libpath:"C:\Perl\lib\CORE"  -machine:AMD64'

Locally applied patches:
    ACTIVEPERL_LOCAL_PATCHES_ENTRY
    c6fbf28 [perl #71806] perldb does not setup %dbline with the shebang option -d
    1fd8fa4 Add Wolfram Humann to AUTHORS
    f120055 make string-append on win32 100 times faster
    a2a8d15 Define _USE_32BIT_TIME_T for VC6 and VC7
    007cfe1 Don't pretend to support really old VC++ compilers
    6d8f7c9 Get rid of obsolete PerlCRT.dll support
    d956618 Make Term::ReadLine::findConsole fall back to STDIN if /dev/tty can't be opened
    321e50c Escape patch strings before embedding them in patchlevel.h


@INC for perl 5.12.3:
    C:/Perl/site/lib
    C:/Perl/lib
    .


Environment for perl 5.12.3:
    HOME (unset)
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Reskit\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;c:\bin;c:\wintools;C:\Perl\site\bin;C:\Perl\bin;C:\Photo\ImageMagick;C:\Program Files (x86)\Common Files\Intuit\QBPOSSDKRuntime;C:\DVD\MKVTool;C:\DVD\IsoBuster;C:\Wintools\Git\cmd;C:\DVD\QuickTime\QTSystem\
    PERL_BADLANG (unset)
    SHELL (unset)

@p5pRT
Copy link
Author

p5pRT commented Aug 30, 2012

From @jkeenan

On Tue Aug 28 00​:52​:27 2012, malch@​malch.com wrote​:

This is a bug report for perl from malch@​malch.com,
generated with the help of perlbug 1.39 running under perl 5.12.3.

This problem has also been reproduced on

* ActivePerl 5.14.2 on Windows 7 64-bit

* Perl 5.8.8 built for i686-linux

The following RegExp does not match and indeed it should not.
However, execution requires more than 120 seconds of CPU and
this increases exponentially as $string is lengthened.

$string=q~

/#20Char#20Char1/Span/#20Char#20Char2/Span/#20Char#20Char3/Span/#20Char#20Char4/Span/#20Char#20Char5/Span/#20Char#20Char6/Span/Table#20of#20Figures/P/InlineShape/Figure/Default/P/References#20Char/Span/HeadTOC/P/DropCap/Figure/CM66/P/CM67#20Char/Span/Outline/Span/Subscript/Span/Superscript/Span/TableTitle#20Char/Span/Figure#20Heading#20Char/Span/Albany2/P/CM17/P/Caption/P/TOA/TOC/Page#20Number/Span/TOF/TOC/Emphasis/Span/Strikeout/Span/Body#20Char/Span/TextBox/Div/Heading#201/H1/heading#201/P/Heading#202/H2/heading#202/P/Heading#203/H3/Normal/P/h

~;
$pdf='\s*(/\S+\s*){0,5}/xxx';
if ($string=~m/$pdf/is) {
print "Hit\n";
} else {
print "Miss\n";
}

Confirmed on 5.16.0 on Linux/i386. Observed that the big difference is
between​:

$pdf='\s*(/\S+\s*){0,5}/xxx';

and either​:

$pdf='\s*(/\S+\s*){0,1}/xxx';

or​:

$pdf='\s*(/\S+\s*)?/xxx';

Thank you very much.
Jim Keenan

@p5pRT
Copy link
Author

p5pRT commented Aug 30, 2012

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Aug 30, 2012

From @demerphq

On 30 August 2012 02​:51, James E Keenan via RT
<perlbug-followup@​perl.org> wrote​:

On Tue Aug 28 00​:52​:27 2012, malch@​malch.com wrote​:

This is a bug report for perl from malch@​malch.com,
generated with the help of perlbug 1.39 running under perl 5.12.3.

This problem has also been reproduced on

* ActivePerl 5.14.2 on Windows 7 64-bit

* Perl 5.8.8 built for i686-linux

The following RegExp does not match and indeed it should not.
However, execution requires more than 120 seconds of CPU and
this increases exponentially as $string is lengthened.

$string=q~

/#20Char#20Char1/Span/#20Char#20Char2/Span/#20Char#20Char3/Span/#20Char#20Char4/Span/#20Char#20Char5/Span/#20Char#20Char6/Span/Table#20of#20Figures/P/InlineShape/Figure/Default/P/References#20Char/Span/HeadTOC/P/DropCap/Figure/CM66/P/CM67#20Char/Span/Outline/Span/Subscript/Span/Superscript/Span/TableTitle#20Char/Span/Figure#20Heading#20Char/Span/Albany2/P/CM17/P/Caption/P/TOA/TOC/Page#20Number/Span/TOF/TOC/Emphasis/Span/Strikeout/Span/Body#20Char/Span/TextBox/Div/Heading#201/H1/heading#201/P/Heading#202/H2/heading#202/P/Heading#203/H3/Normal/P/h

~;
$pdf='\s*(/\S+\s*){0,5}/xxx';
if ($string=~m/$pdf/is) {
print "Hit\n";
} else {
print "Miss\n";
}

Confirmed on 5.16.0 on Linux/i386. Observed that the big difference is
between​:

$pdf='\s*(/\S+\s*){0,5}/xxx';

and either​:

$pdf='\s*(/\S+\s*){0,1}/xxx';

or​:

$pdf='\s*(/\S+\s*)?/xxx';

I think this a bug in super-linear cache.

Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Aug 30, 2012

From @hvds

demerphq <demerphq@​gmail.com> wrote​:
:On 30 August 2012 02​:51, James E Keenan via RT wrote​:
:> On Tue Aug 28 00​:52​:27 2012, malch@​malch.com wrote​:
[...]
:>> The following RegExp does not match and indeed it should not.
:>> However, execution requires more than 120 seconds of CPU and
:>> this increases exponentially as $string is lengthened.
:>>
:>>
:>> $string=q~
:>>
:> /#20Char#20Char1/Span/#20Char#20Char2/Span/#20Char#20Char3/Span/#20Char#20Char4/Span/#20Char#20Char5/Span/#20Char#20Char6/Span/Table#20of#20Figures/P/InlineShape/Figure/Default/P/References#20Char/Span/HeadTOC/P/DropCap/Figure/CM66/P/CM67#20Char/Span/Outline/Span/Subscript/Span/Superscript/Span/TableTitle#20Char/Span/Figure#20Heading#20Char/Span/Albany2/P/CM17/P/Caption/P/TOA/TOC/Page#20Number/Span/TOF/TOC/Emphasis/Span/Strikeout/Span/Body#20Char/Span/TextBox/Div/Heading#201/H1/heading#201/P/Heading#202/H2/heading#202/P/Heading#203/H3/Normal/P/h
:>> ~;
:>> $pdf='\s*(/\S+\s*){0,5}/xxx';
:>> if ($string=~m/$pdf/is) {
:>> print "Hit\n";
:>> } else {
:>> print "Miss\n";
:>> }
:>>
:>
:> Confirmed on 5.16.0 on Linux/i386. Observed that the big difference is
:> between​:
:>
:> $pdf='\s*(/\S+\s*){0,5}/xxx';
:>
:> and either​:
:>
:> $pdf='\s*(/\S+\s*){0,1}/xxx';
:>
:> or​:
:>
:> $pdf='\s*(/\S+\s*)?/xxx';
:
:I think this a bug in super-linear cache.

Note that this is a pathological regexp; it may be that we can execute
it in non-pathological time by the use of clever optimizations such as
the superlinear cache, but as far as I remember we've progressively
turned off many parts of that optimization as we've discovered edge
cases in which it will do the wrong thing.

The issue here is that (without special optimizations) matches are fast,
but on failure the backtracking needs to cope with every possible way
to split the text as m{(/\S+\s*){0,5}}, which for the example string
requires of the order of the 5th power of the number of '/' characters.

I suggest therefore that the primary bug here is the regexp pattern
itself, and in particular the fact that because /\S/ can also match
the '/' character, the substring '/x/y' can match the central group
either once or twice.

I suspect the original poster might actually rather want​:
  $string =~ m{
  \s* ( / [^/]+ \s* ){0,5} /xxx
  }xis;

By writing the pattern so that '/x/y' no longer has the option to match
as either one group or two, the combinatorial explosion can be avoided.
It may also be the case that this would fix a bug in the original pattern,
but I don't know well enough what the OP is trying to achieve to be sure.

It may be we can find new optimizations or fix existing ones that would
allow the original pattern to match faster, but I am confident we will
not be able to make every pattern fast against every string.

(As to which optimization​: this particular test could be failed fastest
if we were looking for the floating substr '/xxx', I'm not sure why the
regexp engine is not using that.)

Hugo

@p5pRT
Copy link
Author

p5pRT commented Aug 30, 2012

From @demerphq

On 30 August 2012 09​:34, <hv@​crypt.org> wrote​:

demerphq <demerphq@​gmail.com> wrote​:
:On 30 August 2012 02​:51, James E Keenan via RT wrote​:
:> On Tue Aug 28 00​:52​:27 2012, malch@​malch.com wrote​:
[...]
:>> The following RegExp does not match and indeed it should not.
:>> However, execution requires more than 120 seconds of CPU and
:>> this increases exponentially as $string is lengthened.
:>>
:>>
:>> $string=q~
:>>
:> /#20Char#20Char1/Span/#20Char#20Char2/Span/#20Char#20Char3/Span/#20Char#20Char4/Span/#20Char#20Char5/Span/#20Char#20Char6/Span/Table#20of#20Figures/P/InlineShape/Figure/Default/P/References#20Char/Span/HeadTOC/P/DropCap/Figure/CM66/P/CM67#20Char/Span/Outline/Span/Subscript/Span/Superscript/Span/TableTitle#20Char/Span/Figure#20Heading#20Char/Span/Albany2/P/CM17/P/Caption/P/TOA/TOC/Page#20Number/Span/TOF/TOC/Emphasis/Span/Strikeout/Span/Body#20Char/Span/TextBox/Div/Heading#201/H1/heading#201/P/Heading#202/H2/heading#202/P/Heading#203/H3/Normal/P/h
:>> ~;
:>> $pdf='\s*(/\S+\s*){0,5}/xxx';
:>> if ($string=~m/$pdf/is) {
:>> print "Hit\n";
:>> } else {
:>> print "Miss\n";
:>> }
:>>
:>
:> Confirmed on 5.16.0 on Linux/i386. Observed that the big difference is
:> between​:
:>
:> $pdf='\s*(/\S+\s*){0,5}/xxx';
:>
:> and either​:
:>
:> $pdf='\s*(/\S+\s*){0,1}/xxx';
:>
:> or​:
:>
:> $pdf='\s*(/\S+\s*)?/xxx';
:
:I think this a bug in super-linear cache.

Note that this is a pathological regexp; it may be that we can execute
it in non-pathological time by the use of clever optimizations such as
the superlinear cache, but as far as I remember we've progressively
turned off many parts of that optimization as we've discovered edge
cases in which it will do the wrong thing.

Ah, I though it might have been an accidental thing.

The issue here is that (without special optimizations) matches are fast,
but on failure the backtracking needs to cope with every possible way
to split the text as m{(/\S+\s*){0,5}}, which for the example string
requires of the order of the 5th power of the number of '/' characters.

I suggest therefore that the primary bug here is the regexp pattern
itself, and in particular the fact that because /\S/ can also match
the '/' character, the substring '/x/y' can match the central group
either once or twice.

I suspect the original poster might actually rather want​:
$string =~ m{
\s* ( / [^/]+ \s* ){0,5} /xxx
}xis;

By writing the pattern so that '/x/y' no longer has the option to match
as either one group or two, the combinatorial explosion can be avoided.
It may also be the case that this would fix a bug in the original pattern,
but I don't know well enough what the OP is trying to achieve to be sure.

It may be we can find new optimizations or fix existing ones that would
allow the original pattern to match faster, but I am confident we will
not be able to make every pattern fast against every string.

Yeah true. This is a pattern that could probably be improved also by
using possessive matching.

(As to which optimization​: this particular test could be failed fastest
if we were looking for the floating substr '/xxx', I'm not sure why the
regexp engine is not using that.)

Hmm.

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants