Skip Menu |
Report information
Id: 123743
Status: resolved
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors: warren.hyde [at] amd.com
Cc:
AdminCc:

Operating System: Linux
PatchStatus: (no value)
Severity: medium
Type: core
Perl Version: 5.18.2
Fixed In:
  • 5.21.9
  • 5.22.0



Date: Thu, 5 Feb 2015 10:45:56 -0500
To: <perlbug [...] perl.org>
From: <warren.hyde [...] amd.com>
Subject: RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)
Download (untitled) / with headers
text/plain 13.7k
This is a bug report for perl from warren.hyde@amd.com, generated with the help of perlbug 1.39 running under perl 5.18.2. ----------------------------------------------------------------- [Please describe your issue here] Certain types of (poorly-written) regular expressions have become painfully slow since the refactoring of regexec.c at some point in the 5.17 versions. I believe this to be a bug in the behavior, as the RegEx engine is doing much more (wasted) work trying to match certain patterns that previous Perl versions (through 5.16.1) had no problems with. Specifically, consider this (good, previous behavior) example: 5.16.1> $ perl5.16.1 -Mre=debug -e '"00000000: 0c" =~ /.*:\s*ab/i' 5.16.1> Compiling REx ".*:\s*ab" 5.16.1> Final program: 5.16.1> 1: STAR (3) 5.16.1> 2: REG_ANY (0) 5.16.1> 3: EXACTF <:> (5) 5.16.1> 5: STAR (7) 5.16.1> 6: SPACE (0) 5.16.1> 7: EXACTF <ab> (9) 5.16.1> 9: END (0) 5.16.1> anchored(MBOL) implicit minlen 3 5.16.1> Matching REx ".*:\s*ab" against "00000000: 0c" 5.16.1> 0 <> <00000000: > | 1:STAR(3) 5.16.1> REG_ANY can match 12 times out of 2147483647... 5.16.1> 8 <00000000> <: 0c> | 3: EXACTF <:>(5) 5.16.1> 9 <00000000:> < 0c> | 5: STAR(7) 5.16.1> SPACE can match 1 times out of 2147483647... 5.16.1> failed... 5.16.1> failed... 5.16.1> Match failed 5.16.1> Freeing REx: ".*:\s*ab" (Using case-insensitive matching disallows the optimizer from looking for the fixed text "ab".) Now, compare that with the work done by Perl since 5.18: 5.18.2> $ perl5.18.2 -Mre=debug -e '"00000000: 0c" =~ /.*:\s*ab/i' 5.18.2> Compiling REx ".*:\s*ab" 5.18.2> Final program: 5.18.2> 1: STAR (3) 5.18.2> 2: REG_ANY (0) 5.18.2> 3: EXACT <:> (5) 5.18.2> 5: STAR (7) 5.18.2> 6: POSIXD[\s] (0) 5.18.2> 7: EXACTF <ab> (9) 5.18.2> 9: END (0) 5.18.2> floating ":" at 0..2147483647 (checking floating) anchored(MBOL) implicit minlen 3 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "00000000: 0c" 5.18.2> Found floating substr ":" at offset 8... 5.18.2> Position at offset 0 does not contradict /^/m... 5.18.2> Guessed: match at offset 0 5.18.2> Matching REx ".*:\s*ab" against "00000000: 0c" 5.18.2> 0 <> <00000000: > | 1:STAR(3) 5.18.2> REG_ANY can match 12 times out of 2147483647... 5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5) 5.18.2> 9 <00000000:> < 0c> | 5: STAR(7) 5.18.2> POSIXD[\s] can match 1 times out of 2147483647... 5.18.2> failed... 5.18.2> failed... 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "0000000: 0c" 5.18.2> Found floating substr ":" at offset 7... 5.18.2> Position at offset 0 does not contradict /^/m... 5.18.2> Guessed: match at offset 0 5.18.2> 1 <0> <0000000: 0> | 1:STAR(3) 5.18.2> REG_ANY can match 11 times out of 2147483647... 5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5) 5.18.2> 9 <00000000:> < 0c> | 5: STAR(7) 5.18.2> POSIXD[\s] can match 1 times out of 2147483647... 5.18.2> failed... 5.18.2> failed... 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "000000: 0c" 5.18.2> Found floating substr ":" at offset 6... 5.18.2> Position at offset 0 does not contradict /^/m... 5.18.2> Guessed: match at offset 0 5.18.2> 2 <00> <000000: 0c> | 1:STAR(3) 5.18.2> REG_ANY can match 10 times out of 2147483647... 5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5) 5.18.2> 9 <00000000:> < 0c> | 5: STAR(7) 5.18.2> POSIXD[\s] can match 1 times out of 2147483647... 5.18.2> failed... 5.18.2> failed... 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "00000: 0c" 5.18.2> Found floating substr ":" at offset 5... 5.18.2> Position at offset 0 does not contradict /^/m... 5.18.2> Guessed: match at offset 0 5.18.2> 3 <000> <00000: 0c> | 1:STAR(3) 5.18.2> REG_ANY can match 9 times out of 2147483647... 5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5) 5.18.2> 9 <00000000:> < 0c> | 5: STAR(7) 5.18.2> POSIXD[\s] can match 1 times out of 2147483647... 5.18.2> failed... 5.18.2> failed... 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "0000: 0c" 5.18.2> Found floating substr ":" at offset 4... 5.18.2> Position at offset 0 does not contradict /^/m... 5.18.2> Guessed: match at offset 0 5.18.2> 4 <0000> <0000: 0c> | 1:STAR(3) 5.18.2> REG_ANY can match 8 times out of 2147483647... 5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5) 5.18.2> 9 <00000000:> < 0c> | 5: STAR(7) 5.18.2> POSIXD[\s] can match 1 times out of 2147483647... 5.18.2> failed... 5.18.2> failed... 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "000: 0c" 5.18.2> Found floating substr ":" at offset 3... 5.18.2> Position at offset 0 does not contradict /^/m... 5.18.2> Guessed: match at offset 0 5.18.2> 5 <00000> <000: 0c> | 1:STAR(3) 5.18.2> REG_ANY can match 7 times out of 2147483647... 5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5) 5.18.2> 9 <00000000:> < 0c> | 5: STAR(7) 5.18.2> POSIXD[\s] can match 1 times out of 2147483647... 5.18.2> failed... 5.18.2> failed... 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "00: 0c" 5.18.2> Found floating substr ":" at offset 2... 5.18.2> Position at offset 0 does not contradict /^/m... 5.18.2> Guessed: match at offset 0 5.18.2> 6 <000000> <00: 0c> | 1:STAR(3) 5.18.2> REG_ANY can match 6 times out of 2147483647... 5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5) 5.18.2> 9 <00000000:> < 0c> | 5: STAR(7) 5.18.2> POSIXD[\s] can match 1 times out of 2147483647... 5.18.2> failed... 5.18.2> failed... 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "0: 0c" 5.18.2> Found floating substr ":" at offset 1... 5.18.2> Position at offset 0 does not contradict /^/m... 5.18.2> Guessed: match at offset 0 5.18.2> 7 <0000000> <0: 0c> | 1:STAR(3) 5.18.2> REG_ANY can match 5 times out of 2147483647... 5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5) 5.18.2> 9 <00000000:> < 0c> | 5: STAR(7) 5.18.2> POSIXD[\s] can match 1 times out of 2147483647... 5.18.2> failed... 5.18.2> failed... 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against ": 0c" 5.18.2> Found floating substr ":" at offset 0... 5.18.2> Position at offset 0 does not contradict /^/m... 5.18.2> Guessed: match at offset 0 5.18.2> 8 <00000000> <: 0c> | 1:STAR(3) 5.18.2> REG_ANY can match 4 times out of 2147483647... 5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5) 5.18.2> 9 <00000000:> < 0c> | 5: STAR(7) 5.18.2> POSIXD[\s] can match 1 times out of 2147483647... 5.18.2> failed... 5.18.2> failed... 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against " 0c" 5.18.2> Did not find floating substr ":"... 5.18.2> Match rejected by optimizer 5.18.2> Match failed 5.18.2> Freeing REx: ".*:\s*ab" Why does it walk forward through the string trying to guess the start of the match again when all prior Perls know to give up after the first fail? This new behavior is causing massive wasted processing for a certain class of regular expression, namely the naiive use of ".*", which (rather unfortulately) shows up very often in the real world. Since 5.18 (or perhaps 5.17.x), the performance of this regular expression degrades geometrically with the length of the line before the colon, which caused a big efficiency problem. I know it's not a good example of a regex, but this presumes the coder knows how the internals of regex matching works. You can see how badly this affects things with the following comparison: 5.16.1> $ seq -f "%01000.0f: 0c" 1000 | /usr/bin/time perl5.16.1 -ne '/.*:\s*ab/i' 5.16.1> 0.00user 0.00system 0:00.01elapsed 0%CPU (0avgtext+0avgdata 7488maxresident)k 5.16.1> 0inputs+0outputs (0major+515minor)pagefaults 0swaps 5.18.2> $ seq -f "%01000.0f: 0c" 1000 | /usr/bin/time perl5.18.2 -ne '/.*:\s*ab/i' 5.18.2> 4.31user 0.00system 0:04.41elapsed 97%CPU (0avgtext+0avgdata 7296maxresident)k 5.18.2> 0inputs+0outputs (0major+503minor)pagefaults 0swaps Spending almost 4.5 seconds matching this regular expression against 1000 lines of text is a big step back in the performance of the flagship text processing language. Something is amiss when backtracking ".*" now. [Please do not change anything below this line] ----------------------------------------------------------------- --- Flags: category=core severity=medium --- Site configuration information for perl 5.18.2: Configured by pandora at Tue May 6 14:41:42 EDT 2014. Summary of my perl5 (revision 5 version 18 subversion 2) configuration: Platform: osname=linux, osvers=2.6.18-308.8.2.el5, archname=x86_64-linux-thread-multi uname='linux atlvbuild-el5-64 2.6.18-308.8.2.el5 #1 smp tue may 29 11:54:17 edt 2012 x86_64 x86_64 x86_64 gnulinux ' config_args='-des -Dcc=gcc -Dccflags=-fPIC -Dcflags=-fPIC -Uinstallusrbinperl -Dperladmin=pandora-admin@mpdtxmail -Dprefix=/tool/pandora64/.package/perl-5.18.2-gcc481 -Dstartperl=#!/tool/pandora64/.package/perl-5.18.2-gcc481/bin/perl -Dlibpth=/tool/pandora64/lib /lib64 /usr/lib64 -Duseshrplib -Duseithreads -Dusethreads -Duselargefiles -Duse64bitint -Dd_semctl_semun -Di_db -Ui_ndbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Ubincompat5005 -Uversiononly -Dpager=/tool/pandora64/bin/less -isr' hint=recommended, useposix=true, d_sigaction=define useithreads=define, usemultiplicity=define useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=define, use64bitall=define, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fPIC -fno-strict-aliasing -pipe -fstack-protector -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-O2', cppflags='-D_REENTRANT -D_GNU_SOURCE -fPIC -fno-strict-aliasing -pipe -fstack-protector' ccversion='', gccversion='4.8.1', gccosandvers='' intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678 d_longlong=define, longlongsize=8, d_longdbl=undef, longdblsize= ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=8, prototype=define Linker and Libraries: ld='gcc', ldflags =' -fstack-protector' libpth=/tool/pandora64/lib /lib64 /usr/lib64 libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc libc=/lib/libc-2.5.so, so=so, useshrplib=true, libperl=libperl.so gnulibc_version='2.5' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/tool/pandora64/.package/perl-5.18.2-gcc481/lib/5.18.2/x86_64-linux-thread-multi/CORE' cccdlflags='-fPIC', lddlflags='-shared -O2 -fstack-protector' Locally applied patches: --- @INC for perl 5.18.2: /proj/verif_release_ro/p4w/current /proj/verif_release_ro/p4_mkwa/current /proj/verif_release_ro/siteconfig/current/lib/perl /proj/verif_release_ro/error_reporting/25/lib/perl /proj/verif_release_ro/cbwa_bootcore/current/lib/perl /tool/pandora64/.package/perl-5.18.2-gcc481/lib/site_perl/5.18.2/x86_64-linux-thread-multi /tool/pandora64/.package/perl-5.18.2-gcc481/lib/site_perl/5.18.2 /tool/pandora64/.package/perl-5.18.2-gcc481/lib/5.18.2/x86_64-linux-thread-multi /tool/pandora64/.package/perl-5.18.2-gcc481/lib/5.18.2 . --- Environment for perl 5.18.2: HOME=/home/whyde LANG=C LANGUAGE (unset) LC_COLLATE=C LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/tool/pandora64/.package/perl-5.18.2-gcc481/bin:/tool/pandora64/.package/perl-5.18.1/bin:/proj/verif_release_ro/p4w/current:/proj/verif_release_ro/p4_mkwa/current:/tool/pandora64/.package/subversion-1.6.3/bin:/proj/verif_release_ro/svn_ssh/31/bin:/tool/pandora64/.package/tcltk-8.4.6-a/bin:/tool/pandora64/.package/perl-5.8.8/bin:/proj/verif_release_ro/siteconfig/current/bin:/tool/pandora64/bin:/tool/pandora/bin:/proj/verif_release_ro/cbwa_bootcore/current/bin:/bin:/usr/bin:/usr/lib64/qt-3.3/bin:/usr/X11R6/bin:/usr/kerberos/bin:/sbin:/usr/sbin:/tool/lsf/bin:/home/whyde/bin:/usr/NX/bin:/usr/NX/bin:/proj/soc_infra/bin PERL5LIB=/proj/verif_release_ro/p4w/current:/proj/verif_release_ro/p4_mkwa/current:/proj/verif_release_ro/siteconfig/current/lib/perl:/proj/verif_release_ro/error_reporting/25/lib/perl:/proj/verif_release_ro/cbwa_bootcore/current/lib/perl PERL_BADLANG (unset) PERL_HOME=/tool/pandora64/.package/perl-5.18.2-gcc481 SHELL=/tool/pandora/binlatest/tcsh
Date: Mon, 9 Feb 2015 03:47:45 +0100
From: demerphq <demerphq [...] gmail.com>
CC: "bugs-bitbucket [...] rt.perl.org" <bugs-bitbucket [...] rt.perl.org>, Dave Mitchell <davem [...] iabyn.com>, Ricardo SIGNES <perl.p5p [...] rjbs.manxome.org>, karl williamson <public [...] khwilliamson.com>
Subject: Re: [perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)
To: Perl5 Porteros <perl5-porters [...] perl.org>
On 5 February 2015 at 16:46, via RT <perlbug-followup@perl.org> wrote: Show quoted text
> # New Ticket Created by > # Please include the string: [perl #123743] > # in the subject line of all future correspondence about this issue. > # <URL: https://rt.perl.org/Ticket/Display.html?id=123743 > > > > > This is a bug report for perl from warren.hyde@amd.com, > generated with the help of perlbug 1.39 running under perl 5.18.2. > > > ----------------------------------------------------------------- > [Please describe your issue here] > > Certain types of (poorly-written) regular expressions have become > painfully slow since the refactoring of regexec.c at some point in > the 5.17 versions. I believe this to be a bug in the behavior, as > the RegEx engine is doing much more (wasted) work trying to match > certain patterns that previous Perl versions (through 5.16.1) had > no problems with. >
Thank you for the report. FWIW, I consider this a very serious, release-blocker level regression. Show quoted text
> Specifically, consider this (good, previous behavior) example: > > 5.16.1> $ perl5.16.1 -Mre=debug -e '"00000000: 0c" =~ /.*:\s*ab/i' > 5.16.1> Compiling REx ".*:\s*ab" > 5.16.1> Final program: > 5.16.1> 1: STAR (3) > 5.16.1> 2: REG_ANY (0) > 5.16.1> 3: EXACTF <:> (5) > 5.16.1> 5: STAR (7) > 5.16.1> 6: SPACE (0) > 5.16.1> 7: EXACTF <ab> (9) > 5.16.1> 9: END (0) > 5.16.1> anchored(MBOL) implicit minlen 3 > 5.16.1> Matching REx ".*:\s*ab" against "00000000: 0c" > 5.16.1> 0 <> <00000000: > | 1:STAR(3) > 5.16.1> REG_ANY can match 12 times out of 2147483647... > 5.16.1> 8 <00000000> <: 0c> | 3: EXACTF <:>(5) > 5.16.1> 9 <00000000:> < 0c> | 5: STAR(7) > 5.16.1> SPACE can match 1 times out of 2147483647... > 5.16.1> failed... > 5.16.1> failed... > 5.16.1> Match failed > 5.16.1> Freeing REx: ".*:\s*ab" > > (Using case-insensitive matching disallows the optimizer from looking > for the fixed text "ab".) > > Now, compare that with the work done by Perl since 5.18: > > 5.18.2> $ perl5.18.2 -Mre=debug -e '"00000000: 0c" =~ /.*:\s*ab/i' > 5.18.2> Compiling REx ".*:\s*ab" > 5.18.2> Final program: > 5.18.2> 1: STAR (3) > 5.18.2> 2: REG_ANY (0) > 5.18.2> 3: EXACT <:> (5) > 5.18.2> 5: STAR (7) > 5.18.2> 6: POSIXD[\s] (0) > 5.18.2> 7: EXACTF <ab> (9) > 5.18.2> 9: END (0) > 5.18.2> floating ":" at 0..2147483647 (checking floating) anchored(MBOL) implicit minlen 3 > 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "00000000: 0c" > 5.18.2> Found floating substr ":" at offset 8... > 5.18.2> Position at offset 0 does not contradict /^/m... > 5.18.2> Guessed: match at offset 0 > 5.18.2> Matching REx ".*:\s*ab" against "00000000: 0c" > 5.18.2> 0 <> <00000000: > | 1:STAR(3) > 5.18.2> REG_ANY can match 12 times out of 2147483647... > 5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5) > 5.18.2> 9 <00000000:> < 0c> | 5: STAR(7) > 5.18.2> POSIXD[\s] can match 1 times out of 2147483647... > 5.18.2> failed... > 5.18.2> failed... > 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "0000000: 0c" > 5.18.2> Found floating substr ":" at offset 7... > 5.18.2> Position at offset 0 does not contradict /^/m... > 5.18.2> Guessed: match at offset 0 > 5.18.2> 1 <0> <0000000: 0> | 1:STAR(3) > 5.18.2> REG_ANY can match 11 times out of 2147483647... > 5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5) > 5.18.2> 9 <00000000:> < 0c> | 5: STAR(7) > 5.18.2> POSIXD[\s] can match 1 times out of 2147483647... > 5.18.2> failed... > 5.18.2> failed... > 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "000000: 0c" > 5.18.2> Found floating substr ":" at offset 6... > 5.18.2> Position at offset 0 does not contradict /^/m... > 5.18.2> Guessed: match at offset 0 > 5.18.2> 2 <00> <000000: 0c> | 1:STAR(3) > 5.18.2> REG_ANY can match 10 times out of 2147483647... > 5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5) > 5.18.2> 9 <00000000:> < 0c> | 5: STAR(7) > 5.18.2> POSIXD[\s] can match 1 times out of 2147483647... > 5.18.2> failed... > 5.18.2> failed... > 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "00000: 0c" > 5.18.2> Found floating substr ":" at offset 5... > 5.18.2> Position at offset 0 does not contradict /^/m... > 5.18.2> Guessed: match at offset 0 > 5.18.2> 3 <000> <00000: 0c> | 1:STAR(3) > 5.18.2> REG_ANY can match 9 times out of 2147483647... > 5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5) > 5.18.2> 9 <00000000:> < 0c> | 5: STAR(7) > 5.18.2> POSIXD[\s] can match 1 times out of 2147483647... > 5.18.2> failed... > 5.18.2> failed... > 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "0000: 0c" > 5.18.2> Found floating substr ":" at offset 4... > 5.18.2> Position at offset 0 does not contradict /^/m... > 5.18.2> Guessed: match at offset 0 > 5.18.2> 4 <0000> <0000: 0c> | 1:STAR(3) > 5.18.2> REG_ANY can match 8 times out of 2147483647... > 5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5) > 5.18.2> 9 <00000000:> < 0c> | 5: STAR(7) > 5.18.2> POSIXD[\s] can match 1 times out of 2147483647... > 5.18.2> failed... > 5.18.2> failed... > 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "000: 0c" > 5.18.2> Found floating substr ":" at offset 3... > 5.18.2> Position at offset 0 does not contradict /^/m... > 5.18.2> Guessed: match at offset 0 > 5.18.2> 5 <00000> <000: 0c> | 1:STAR(3) > 5.18.2> REG_ANY can match 7 times out of 2147483647... > 5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5) > 5.18.2> 9 <00000000:> < 0c> | 5: STAR(7) > 5.18.2> POSIXD[\s] can match 1 times out of 2147483647... > 5.18.2> failed... > 5.18.2> failed... > 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "00: 0c" > 5.18.2> Found floating substr ":" at offset 2... > 5.18.2> Position at offset 0 does not contradict /^/m... > 5.18.2> Guessed: match at offset 0 > 5.18.2> 6 <000000> <00: 0c> | 1:STAR(3) > 5.18.2> REG_ANY can match 6 times out of 2147483647... > 5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5) > 5.18.2> 9 <00000000:> < 0c> | 5: STAR(7) > 5.18.2> POSIXD[\s] can match 1 times out of 2147483647... > 5.18.2> failed... > 5.18.2> failed... > 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against "0: 0c" > 5.18.2> Found floating substr ":" at offset 1... > 5.18.2> Position at offset 0 does not contradict /^/m... > 5.18.2> Guessed: match at offset 0 > 5.18.2> 7 <0000000> <0: 0c> | 1:STAR(3) > 5.18.2> REG_ANY can match 5 times out of 2147483647... > 5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5) > 5.18.2> 9 <00000000:> < 0c> | 5: STAR(7) > 5.18.2> POSIXD[\s] can match 1 times out of 2147483647... > 5.18.2> failed... > 5.18.2> failed... > 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against ": 0c" > 5.18.2> Found floating substr ":" at offset 0... > 5.18.2> Position at offset 0 does not contradict /^/m... > 5.18.2> Guessed: match at offset 0 > 5.18.2> 8 <00000000> <: 0c> | 1:STAR(3) > 5.18.2> REG_ANY can match 4 times out of 2147483647... > 5.18.2> 8 <00000000> <: 0c> | 3: EXACT <:>(5) > 5.18.2> 9 <00000000:> < 0c> | 5: STAR(7) > 5.18.2> POSIXD[\s] can match 1 times out of 2147483647... > 5.18.2> failed... > 5.18.2> failed... > 5.18.2> Guessing start of match in sv for REx ".*:\s*ab" against " 0c" > 5.18.2> Did not find floating substr ":"... > 5.18.2> Match rejected by optimizer > 5.18.2> Match failed > 5.18.2> Freeing REx: ".*:\s*ab" > > Why does it walk forward through the string trying to guess the start of > the match again when all prior Perls know to give up after the first > fail? This new behavior is causing massive wasted processing for a > certain class of regular expression, namely the naiive use of ".*", > which (rather unfortulately) shows up very often in the real world. > > Since 5.18 (or perhaps 5.17.x), the performance of this regular expression > degrades geometrically with the length of the line before the colon, > which caused a big efficiency problem. I know it's not a good example > of a regex, but this presumes the coder knows how the internals of regex > matching works.
As you say, this pattern was previously fast, and is not now. I suspect there is an issue with handling the implicit SBOL/MBOL as you can see here: Show quoted text
> 5.18.2> Position at offset 0 does not contradict /^/m... > 5.18.2> Guessed: match at offset 0
The reason we see this from a pattern with no "^" in it like we do here is that we automatically prepend a SBOL or MBOL to patterns starting with .* (depending on whether it was /s) so it does not go quadratic due to advancing the start pointer needlessly. It appears that in 5.18 we broke this somehow. The idea of this optimization is to avoid quadratic backtracking for unanchored patterns starting with .*, which is exactly what is occurring here. (It also seems like an *anchored* pattern could now be quadratic but I have not checked). Depending on whether /s was used or not .* either matches everything, or it matches everything but newlines. This means that a pattern starting with .* can and is (albeit apparently with no impact in 5.18) implicitly anchored either at the beginning of the string, or at the beginning of a line, since advancing the start pointer cannot make the pattern match (unless it is advanced to after a newline on a non /s pattern). The reason it cannot is that if it could it would have matched without advancing the start pointer and we would not have needed to advance the start pointer at all. In the case you show (with no /s modifier) it should be at the beginning of each line in the string being matched, which in your case is equivalent to the beginning of the string. So we should see the start class logic fire twice, once to say "yes we are the start of the string" and once later to say "there are no other newlines in this pattern", after which we should stop matching. It appears that this last step no longer happens. Show quoted text
> > You can see how badly this affects things with the following comparison: > > 5.16.1> $ seq -f "%01000.0f: 0c" 1000 | /usr/bin/time perl5.16.1 -ne '/.*:\s*ab/i' > 5.16.1> 0.00user 0.00system 0:00.01elapsed 0%CPU (0avgtext+0avgdata 7488maxresident)k > 5.16.1> 0inputs+0outputs (0major+515minor)pagefaults 0swaps > > 5.18.2> $ seq -f "%01000.0f: 0c" 1000 | /usr/bin/time perl5.18.2 -ne '/.*:\s*ab/i' > 5.18.2> 4.31user 0.00system 0:04.41elapsed 97%CPU (0avgtext+0avgdata 7296maxresident)k > 5.18.2> 0inputs+0outputs (0major+503minor)pagefaults 0swaps > > Spending almost 4.5 seconds matching this regular expression against 1000 > lines of text is a big step back in the performance of the flagship text > processing language.
Indeed. That is a rather diplomatic way of putting it. In conversation I probably would use much harsher language. I imagine you did too. :-) Show quoted text
> Something is amiss when backtracking ".*" now.
We need to bisect to find this[1], but it would not surprise me if this was fallout from Dave M's work on cleaning up start class optimizations in the regex engine. Alternatively it might be my fault when I cleaned up logic related to SBOL/MBOL. My memory fails me as to exactly when these two changes were made so without more digging I can't be sure. A last possibility is that we have somehow broken the super-linear cache, but I discount that one as I don't think it fires in your case at all. IMO we must fix this before the next release of Perl. Patterns of this form, probably *because* of the optimization that appears to be broken, are all to common. We should have noticed this regression, and definitely should fix it ASAP. I will try to find some time to dig, but my schedule is tight and I encourage others to dig as well. Yves [1] And I am on a new laptop where it will take me a while to get setup for bisects and in general Perl dev.
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 561b
On Sun Feb 08 18:48:21 2015, demerphq wrote: Show quoted text
> We need to bisect to find this
$ ../perl.git/Porting/bisect.pl --target=miniperl --start=v5.16.0 --end=v5.18.0 -e '$_ = join "", map sprintf("%01000.0f: 0c", $_), 1..50; $t = time; /.*:\s*ab/i; die if time - $t > 1' ... 3465e1f03c6c748e8f8a6bf8bfdfaf1fc58a4810 is the first bad commit commit 3465e1f03c6c748e8f8a6bf8bfdfaf1fc58a4810 Author: Karl Williamson <public@khwilliamson.com> Date: Thu Oct 11 12:15:53 2012 -0600 regcomp.c: Optimize EXACTFish nodes without folds to EXACT -- Father Chrysostomos
CC: Perl5 Porteros <perl5-porters [...] perl.org>
From: demerphq <demerphq [...] gmail.com>
Date: Mon, 9 Feb 2015 05:54:18 +0100
To: Perl RT Bug Tracker <perlbug-followup [...] perl.org>
Subject: Re: [perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)
Download (untitled) / with headers
text/plain 633b
On 9 February 2015 at 05:16, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote: Show quoted text
> On Sun Feb 08 18:48:21 2015, demerphq wrote:
>> We need to bisect to find this
> > $ ../perl.git/Porting/bisect.pl --target=miniperl --start=v5.16.0 --end=v5.18.0 -e '$_ = join "", map sprintf("%01000.0f: 0c", $_), 1..50; $t = time; /.*:\s*ab/i; die if time - $t > 1' > ...
Thanks, but I am pretty sure this is a false positive. I think a better bisect would re 'debug' and count how many lines the OP's example outputs. If it is more than say 50 then you have found the bug. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 915b
On Sun Feb 08 20:54:52 2015, demerphq wrote: Show quoted text
> On 9 February 2015 at 05:16, Father Chrysostomos via RT > <perlbug-followup@perl.org> wrote:
> > On Sun Feb 08 18:48:21 2015, demerphq wrote:
> >> We need to bisect to find this
> > > > $ ../perl.git/Porting/bisect.pl --target=miniperl --start=v5.16.0 > > --end=v5.18.0 -e '$_ = join "", map sprintf("%01000.0f: 0c", $_), > > 1..50; $t = time; /.*:\s*ab/i; die if time - $t > 1' > > ...
> > Thanks, but I am pretty sure this is a false positive. I think a > better bisect would re 'debug' and count how many lines the OP's > example outputs. If it is more than say 50 then you have found the > bug.
Same result with: $ ../perl.git/Porting/bisect.pl --start=v5.16.0 --end=v5.18.0 -e '$_= readpipe(q|./perl -Ilib -Mre=debug -e '\''"%01000.0f: 0c" =~ /.*:\s*ab/i'\'' 2>&1|) ; warn $_ =~ tr/\n//; die if $_=~ tr/\n// > 50' 3465e1f03c again. -- Father Chrysostomos
Date: Mon, 9 Feb 2015 10:55:57 +0000
CC: Perl5 Porteros <perl5-porters [...] perl.org>, "bugs-bitbucket [...] rt.perl.org" <bugs-bitbucket [...] rt.perl.org>, Ricardo SIGNES <perl.p5p [...] rjbs.manxome.org>, karl williamson <public [...] khwilliamson.com>
From: Dave Mitchell <davem [...] iabyn.com>
Subject: Re: [perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)
To: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 295b
On Mon, Feb 09, 2015 at 03:47:45AM +0100, demerphq wrote: Show quoted text
> definitely should fix it ASAP. I will try to find some time to dig, > but my schedule is tight and I encourage others to dig as well.
I'm digging... -- "Emacs isn't a bad OS once you get used to it. It just lacks a decent editor."
From: Dave Mitchell <davem [...] iabyn.com>
CC: Perl5 Porteros <perl5-porters [...] perl.org>, "bugs-bitbucket [...] rt.perl.org" <bugs-bitbucket [...] rt.perl.org>, Ricardo SIGNES <perl.p5p [...] rjbs.manxome.org>, karl williamson <public [...] khwilliamson.com>
Date: Tue, 10 Feb 2015 14:50:32 +0000
To: demerphq <demerphq [...] gmail.com>
Subject: Re: [perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)
Download (untitled) / with headers
text/plain 3.4k
On Mon, Feb 09, 2015 at 10:55:57AM +0000, Dave Mitchell wrote: Show quoted text
> On Mon, Feb 09, 2015 at 03:47:45AM +0100, demerphq wrote:
> > definitely should fix it ASAP. I will try to find some time to dig, > > but my schedule is tight and I encourage others to dig as well.
> > I'm digging...
I've dug. It's actually a long-standing issue with /.*/ patterns which are intuit-able. Karls optimisation just made some /.*.../i patterns intuitable too. In the following, $s = ('0' x 200_000) . '::: 0c'; ok ($s !~ /.*:::\s*ab/, 'PREGf_IMPLICIT'); ok ($s !~ /.*:::\s*ab/i, 'PREGf_IMPLICIT/i'); ok ($s !~ /.*:::\s*ab/m, 'PREGf_IMPLICIT/m'); ok ($s !~ /.*:::\s*ab/mi, 'PREGf_IMPLICIT/mi'); ok ($s !~ /.*:::\s*ab/s, 'PREGf_IMPLICIT/s'); ok ($s !~ /.*:::\s*ab/si, 'PREGf_IMPLICIT/si'); ok ($s !~ /.*:::\s*ab/ms, 'PREGf_IMPLICIT/ms'); ok ($s !~ /.*:::\s*ab/msi,'PREGf_IMPLICIT/msi'); all the non-//i ones are quadratic in all perls since 5.8-ish. The //i ones additionally went quadratic in 5.18.0 and later. The following commit which I've just pushed makes all 8 of the above run in millisecond time again. commit 0fa70a06a98fc8fa9840d4dbaa31fc2d3b28b99b Author: David Mitchell <davem@iabyn.com> AuthorDate: Tue Feb 10 12:17:51 2015 +0000 Commit: David Mitchell <davem@iabyn.com> CommitDate: Tue Feb 10 12:37:10 2015 +0000 simpify and speed up /.*.../ handling See RT ##123743. A pattern that starts /.*/ has a fake MBOL or SBOL flag added, along with PREGf_IMPLICIT. The idea is that, with /.*.../s, if the NFA don't match when started at pos 0, then it's not going to match if started at any other position either; while /.*.../ won't match at any other start position up until the next \n. However, the branch in regexec() that implemented this was a bit a mess (like much in the perl core, it had gradually accreted), and caused intuit-enabled /.*.../ and /.*...patterns to go quadratic. The branch looked roughly like: if (anchored) { if (regtry(s)) goto success; if (can_intuit) { while (s < end) { s = intuit(s+1); if (!s) goto fail; if (regtry(s)) goto success; } } else { while (s < end) { s = skip_to_next_newline(s); if (regtry(s)) goto success; } } } The problem is that in the presence of a .* at the start of the pattern, intuit() will always return either NULL on failure, or the start position, rather than any later position. So the can_intuit branch above calls regtry() on every character position. This commit fixes this by changing the structure of the code to be like this, where it only tries things on newline boundaries: if (anchored) { if (regtry(s)) goto success; while (1) { s = skip_to_next_newline(s); if (can_intuit) { s = intuit(s+1); if (!s) goto fail; } if (regtry(s)) goto success; } } This makes the code a lot simpler, and mostly avoids quadratic behaviour (you can still get it with a string consisting mainly of newlines). -- Nothing ventured, nothing lost.
From: "Hyde, Warren" <Warren.Hyde [...] amd.com>
Date: Tue, 10 Feb 2015 15:11:54 +0000
To: "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>
Subject: RE: [perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)
Excellent diagnosis and response, as usual, from the Perl community. Much thanks to Dave and the others who took a look at this. Another uninformed question is how Perl's regex engine winds up in PCRE, and whether that would also be affected? Cheers, -- Warren Hyde Show quoted text
-----Original Message----- From: Dave Mitchell via RT [mailto:perlbug-followup@perl.org] Sent: Tuesday, February 10, 2015 8:52 AM To: Hyde, Warren Subject: Re: [perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?) On Mon, Feb 09, 2015 at 10:55:57AM +0000, Dave Mitchell wrote:
> On Mon, Feb 09, 2015 at 03:47:45AM +0100, demerphq wrote:
> > definitely should fix it ASAP. I will try to find some time to dig, > > but my schedule is tight and I encourage others to dig as well.
> > I'm digging...
I've dug. It's actually a long-standing issue with /.*/ patterns which are intuit-able. Karls optimisation just made some /.*.../i patterns intuitable too. In the following, $s = ('0' x 200_000) . '::: 0c'; ok ($s !~ /.*:::\s*ab/, 'PREGf_IMPLICIT'); ok ($s !~ /.*:::\s*ab/i, 'PREGf_IMPLICIT/i'); ok ($s !~ /.*:::\s*ab/m, 'PREGf_IMPLICIT/m'); ok ($s !~ /.*:::\s*ab/mi, 'PREGf_IMPLICIT/mi'); ok ($s !~ /.*:::\s*ab/s, 'PREGf_IMPLICIT/s'); ok ($s !~ /.*:::\s*ab/si, 'PREGf_IMPLICIT/si'); ok ($s !~ /.*:::\s*ab/ms, 'PREGf_IMPLICIT/ms'); ok ($s !~ /.*:::\s*ab/msi,'PREGf_IMPLICIT/msi'); all the non-//i ones are quadratic in all perls since 5.8-ish. The //i ones additionally went quadratic in 5.18.0 and later. The following commit which I've just pushed makes all 8 of the above run in millisecond time again. commit 0fa70a06a98fc8fa9840d4dbaa31fc2d3b28b99b Author: David Mitchell <davem@iabyn.com> AuthorDate: Tue Feb 10 12:17:51 2015 +0000 Commit: David Mitchell <davem@iabyn.com> CommitDate: Tue Feb 10 12:37:10 2015 +0000 simpify and speed up /.*.../ handling See RT ##123743. A pattern that starts /.*/ has a fake MBOL or SBOL flag added, along with PREGf_IMPLICIT. The idea is that, with /.*.../s, if the NFA don't match when started at pos 0, then it's not going to match if started at any other position either; while /.*.../ won't match at any other start position up until the next \n. However, the branch in regexec() that implemented this was a bit a mess (like much in the perl core, it had gradually accreted), and caused intuit-enabled /.*.../ and /.*...patterns to go quadratic. The branch looked roughly like: if (anchored) { if (regtry(s)) goto success; if (can_intuit) { while (s < end) { s = intuit(s+1); if (!s) goto fail; if (regtry(s)) goto success; } } else { while (s < end) { s = skip_to_next_newline(s); if (regtry(s)) goto success; } } } The problem is that in the presence of a .* at the start of the pattern, intuit() will always return either NULL on failure, or the start position, rather than any later position. So the can_intuit branch above calls regtry() on every character position. This commit fixes this by changing the structure of the code to be like this, where it only tries things on newline boundaries: if (anchored) { if (regtry(s)) goto success; while (1) { s = skip_to_next_newline(s); if (can_intuit) { s = intuit(s+1); if (!s) goto fail; } if (regtry(s)) goto success; } } This makes the code a lot simpler, and mostly avoids quadratic behaviour (you can still get it with a string consisting mainly of newlines). -- Nothing ventured, nothing lost.
From: Lukas Mai <plokinom [...] gmail.com>
Date: Tue, 10 Feb 2015 22:53:00 +0100
To: perl5-porters [...] perl.org
Subject: Re: [perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)
Download (untitled) / with headers
text/plain 390b
Am 10.02.2015 um 16:11 schrieb Hyde, Warren: Show quoted text
> Excellent diagnosis and response, as usual, from the Perl community. > Much thanks to Dave and the others who took a look at this. Another > uninformed question is how Perl's regex engine winds up in PCRE,
It doesn't. Show quoted text
> and whether that would also be affected?
AFAIK perl and PCRE don't share any code. -- Lukas Mai <plokinom@gmail.com>
To: "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>
Subject: RE: [perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)
From: "Hyde, Warren" <Warren.Hyde [...] amd.com>
Date: Wed, 11 Feb 2015 17:32:48 +0000
Download (untitled) / with headers
text/plain 4.1k
Dave, A pattern that starts /.*/ has a fake MBOL or SBOL flag added, along with PREGf_IMPLICIT. Thanks again for the fix (to appear in 5.22, I assume). Will this also cover the MINMOD case of a leading /.*?.../, which I also see is quadratic in 5.18.2? I didn't see any test code for that situation in your fix below. Patterns like these are extremely common in a "TWiki Formatted Search", for example. Cheers, -- Warren Hyde Show quoted text
-----Original Message----- From: Dave Mitchell via RT [mailto:perlbug-followup@perl.org] Sent: Tuesday, February 10, 2015 8:52 AM To: Hyde, Warren Subject: Re: [perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?) On Mon, Feb 09, 2015 at 10:55:57AM +0000, Dave Mitchell wrote:
> On Mon, Feb 09, 2015 at 03:47:45AM +0100, demerphq wrote:
> > definitely should fix it ASAP. I will try to find some time to dig, > > but my schedule is tight and I encourage others to dig as well.
> > I'm digging...
I've dug. It's actually a long-standing issue with /.*/ patterns which are intuit-able. Karls optimisation just made some /.*.../i patterns intuitable too. In the following, $s = ('0' x 200_000) . '::: 0c'; ok ($s !~ /.*:::\s*ab/, 'PREGf_IMPLICIT'); ok ($s !~ /.*:::\s*ab/i, 'PREGf_IMPLICIT/i'); ok ($s !~ /.*:::\s*ab/m, 'PREGf_IMPLICIT/m'); ok ($s !~ /.*:::\s*ab/mi, 'PREGf_IMPLICIT/mi'); ok ($s !~ /.*:::\s*ab/s, 'PREGf_IMPLICIT/s'); ok ($s !~ /.*:::\s*ab/si, 'PREGf_IMPLICIT/si'); ok ($s !~ /.*:::\s*ab/ms, 'PREGf_IMPLICIT/ms'); ok ($s !~ /.*:::\s*ab/msi,'PREGf_IMPLICIT/msi'); all the non-//i ones are quadratic in all perls since 5.8-ish. The //i ones additionally went quadratic in 5.18.0 and later. The following commit which I've just pushed makes all 8 of the above run in millisecond time again. commit 0fa70a06a98fc8fa9840d4dbaa31fc2d3b28b99b Author: David Mitchell <davem@iabyn.com> AuthorDate: Tue Feb 10 12:17:51 2015 +0000 Commit: David Mitchell <davem@iabyn.com> CommitDate: Tue Feb 10 12:37:10 2015 +0000 simpify and speed up /.*.../ handling See RT ##123743. A pattern that starts /.*/ has a fake MBOL or SBOL flag added, along with PREGf_IMPLICIT. The idea is that, with /.*.../s, if the NFA don't match when started at pos 0, then it's not going to match if started at any other position either; while /.*.../ won't match at any other start position up until the next \n. However, the branch in regexec() that implemented this was a bit a mess (like much in the perl core, it had gradually accreted), and caused intuit-enabled /.*.../ and /.*...patterns to go quadratic. The branch looked roughly like: if (anchored) { if (regtry(s)) goto success; if (can_intuit) { while (s < end) { s = intuit(s+1); if (!s) goto fail; if (regtry(s)) goto success; } } else { while (s < end) { s = skip_to_next_newline(s); if (regtry(s)) goto success; } } } The problem is that in the presence of a .* at the start of the pattern, intuit() will always return either NULL on failure, or the start position, rather than any later position. So the can_intuit branch above calls regtry() on every character position. This commit fixes this by changing the structure of the code to be like this, where it only tries things on newline boundaries: if (anchored) { if (regtry(s)) goto success; while (1) { s = skip_to_next_newline(s); if (can_intuit) { s = intuit(s+1); if (!s) goto fail; } if (regtry(s)) goto success; } } This makes the code a lot simpler, and mostly avoids quadratic behaviour (you can still get it with a string consisting mainly of newlines). -- Nothing ventured, nothing lost.
To: "Hyde, Warren" <Warren.Hyde [...] amd.com>
Subject: Re: [perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)
CC: "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>
From: demerphq <demerphq [...] gmail.com>
Date: Fri, 13 Feb 2015 00:20:12 +0800
Download (untitled) / with headers
text/plain 612b
On 12 February 2015 at 01:32, Hyde, Warren <Warren.Hyde@amd.com> wrote: Show quoted text
> Dave, > > A pattern that starts /.*/ has a fake MBOL or SBOL flag added, along > with PREGf_IMPLICIT. > > Thanks again for the fix (to appear in 5.22, I assume). Will this also cover the MINMOD case of a leading /.*?.../, which I also see is quadratic in 5.18.2? I didn't see any test code for that situation in your fix below. > > Patterns like these are extremely common in a "TWiki Formatted Search", for example.
Why are they common? Can you give us more context? Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
Date: Thu, 12 Feb 2015 16:38:01 +0000
From: "Hyde, Warren" <Warren.Hyde [...] amd.com>
Subject: RE: [perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)
To: "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>
Download (untitled) / with headers
text/plain 2.6k
Yves, TWiki documents the extraction of text from a topic using a regular expression here: http://twiki.org/cgi-bin/view/TWiki/FormattedSearch#Extract_some_text_from_a_topic_u The regex is defined as needing to match the entire document (or line) as follows: Show quoted text
> $pattern(reg-exp) A regular expression pattern to extract some text from a topic (does not search meta data; use $formfield instead). In case of a multiple="on" search, the pattern is applied to the line found in each search hit. > • Specify a RegularExpression that covers the whole text (topic or line), which typically starts with .*, and must end in .* > • Put text you want to keep in parenthesis, like $pattern(.*?(from here.*?to here).*) > • Example: $pattern(.*?\*.*?Email\:\s*([^\n\r]+).*) extracts the e-mail address from a bullet of format * Email: ... > • This example has non-greedy .*? patterns to scan for the first occurance of the Email bullet; use greedy .* patterns to scan for the last occurance > • Limitation: Do not use .*) inside the pattern, e.g. $pattern(.*foo(.*)bar.*) does not work, but $pattern(.*foo(.*?)bar.*) does > • Note: Make sure that the integrity of a web page is not compromised; for example, if you include an HTML table make sure to include everything including the table end tag
I realize this is not an ideal situation, but if Perl takes a long time backtracking through leading dot-star (even with minimal matching), this may cause a request to timeout in the browser. That's what I meant by "common", because lots of things document this as a specific use-case, and TWiki came to mind because I recalled having stumbled across this before. My question was whether the fix as implemented also covered leading .*? as well as leading .*, since I didn't think to include this case in the original perlbug submission. Cheers, -- Warren Hyde Show quoted text
-----Original Message----- From: yves orton via RT [mailto:perlbug-followup@perl.org] Sent: Thursday, February 12, 2015 10:21 AM To: Hyde, Warren Subject: Re: [perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?) On 12 February 2015 at 01:32, Hyde, Warren <Warren.Hyde@amd.com> wrote:
> Dave, > > A pattern that starts /.*/ has a fake MBOL or SBOL flag added, along > with PREGf_IMPLICIT. > > Thanks again for the fix (to appear in 5.22, I assume). Will this also cover the MINMOD case of a leading /.*?.../, which I also see is quadratic in 5.18.2? I didn't see any test code for that situation in your fix below. > > Patterns like these are extremely common in a "TWiki Formatted Search", for example.
Why are they common? Can you give us more context? Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
Subject: Re: [perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)
To: "Hyde, Warren" <Warren.Hyde [...] amd.com>
Date: Fri, 13 Feb 2015 14:53:20 +0800
CC: "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 2.1k
On 13 February 2015 at 00:38, Hyde, Warren <Warren.Hyde@amd.com> wrote: Show quoted text
> Yves, > > TWiki documents the extraction of text from a topic using a regular expression here: > > http://twiki.org/cgi-bin/view/TWiki/FormattedSearch#Extract_some_text_from_a_topic_u > > The regex is defined as needing to match the entire document (or line) as follows: >
>> $pattern(reg-exp) A regular expression pattern to extract some text from a topic (does not search meta data; use $formfield instead). In case of a multiple="on" search, the pattern is applied to the line found in each search hit. >> • Specify a RegularExpression that covers the whole text (topic or line), which typically starts with .*, and must end in .* >> • Put text you want to keep in parenthesis, like $pattern(.*?(from here.*?to here).*) >> • Example: $pattern(.*?\*.*?Email\:\s*([^\n\r]+).*) extracts the e-mail address from a bullet of format * Email: ... >> • This example has non-greedy .*? patterns to scan for the first occurance of the Email bullet; use greedy .* patterns to scan for the last occurance >> • Limitation: Do not use .*) inside the pattern, e.g. $pattern(.*foo(.*)bar.*) does not work, but $pattern(.*foo(.*?)bar.*) does >> • Note: Make sure that the integrity of a web page is not compromised; for example, if you include an HTML table make sure to include everything including the table end tag
> > I realize this is not an ideal situation, but if Perl takes a long time backtracking through leading dot-star (even with minimal matching), this may cause a request to timeout in the browser. > > That's what I meant by "common", because lots of things document this as a specific use-case, and TWiki came to mind because I recalled having stumbled across this before. > > My question was whether the fix as implemented also covered leading .*? as well as leading .*, since I didn't think to include this case in the original perlbug submission.
It sounds to me like Twiki is going to automatically turn /PAT/ into /^(?:PAT)$/m And if it doesn't it could. :-) Anyway, I dont see any reason not to enable the same optimisation for the minmod case. I was just curious what the issue was with Twiki. yves
CC: "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>
From: Dave Mitchell <davem [...] iabyn.com>
Date: Fri, 13 Feb 2015 12:12:11 +0000
To: "Hyde, Warren" <Warren.Hyde [...] amd.com>
Subject: Re: [perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)
Download (untitled) / with headers
text/plain 408b
On Thu, Feb 12, 2015 at 04:38:01PM +0000, Hyde, Warren wrote: Show quoted text
> My question was whether the fix as implemented also covered leading .*? > as well as leading .*, since I didn't think to include this case in the > original perlbug submission.
Yes, it also covered /.*?/. I've added some extra speed tests with 3e33f67a69740d3. -- This is a great day for France! -- Nixon at Charles De Gaulle's funeral
Subject: Your ticket against Perl 5 has been resolved
Download (untitled) / with headers
text/plain 263b
Thanks for submitting this ticket The issue should be resolved with the release today of Perl v5.22, available at http://www.perl.org/get.html If you find that the problem persists, feel free to reopen this ticket -- Karl Williamson for the Perl 5 porters team


This service is sponsored and maintained by Best Practical Solutions and runs on Perl.org infrastructure.

For issues related to this RT instance (aka "perlbug"), please contact perlbug-admin at perl.org