Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex: (*THEN) doesn't work as described #12737

Open
p5pRT opened this issue Jan 25, 2013 · 6 comments
Open

Regex: (*THEN) doesn't work as described #12737

p5pRT opened this issue Jan 25, 2013 · 6 comments

Comments

@p5pRT
Copy link

p5pRT commented Jan 25, 2013

Migrated from rt.perl.org#116537 (status was 'open')

Searchable as RT116537$

@p5pRT
Copy link
Author

p5pRT commented Jan 25, 2013

From ph10@hermes.cam.ac.uk

Hello,

My understanding of how (*THEN) works is that the test below should
match. The perlre page says "...this verb always matches, and when
backtracked into on failure, it causes the regex engine to try the next
alternation in the innermost enclosing group (capturing or otherwise)
that has alternations." Unless I am going mad, the examples below (one a
normal group, the other an assertion) fulfil the condition.

$ perl -e 'print (("ac" =~ /^(?=ab|ac)/)? "yes\n"​:"no\n")'
yes
$ perl -e 'print (("ac" =~ /^(?=a(*THEN)b|ac)/)? "yes\n"​:"no\n")'
no

$ perl -e 'print (("ac" =~ /^(ab|ac)/)? "yes\n"​:"no\n")'
yes
$ perl -e 'print (("ac" =~ /^(a(*THEN)b|ac)/)? "yes\n"​:"no\n")'
no

I discovered this because PCRE does match these patterns.

Regards,
Philip

--
Philip Hazel


Flags​:
  category=core
  severity=low


Site configuration information for perl 5.16.2​:

Configured by nobody at Tue Dec 11 22​:56​:31 CET 2012.

Summary of my perl5 (revision 5 version 16 subversion 2) configuration​:
 
  Platform​:
  osname=linux, osvers=3.6.9-1-arch, archname=i686-linux-thread-multi
  uname='linux flo 3.6.9-1-arch #1 smp preempt tue dec 4 08​:04​:10 cet 2012 i686 gnulinux '
  config_args='-des -Dusethreads -Duseshrplib -Doptimize=-march=i686 -mtune=generic -O2 -pipe -fstack-protector --param=ssp-buffer-size=4 -D_FORTIFY_SOURCE=2 -Dprefix=/usr -Dinstallprefix=/usr -Dvendorprefix=/usr -Dprivlib=/usr/share/perl5/core_perl -Darchlib=/usr/lib/perl5/core_perl -Dsitelib=/usr/share/perl5/site_perl -Dsitearch=/usr/lib/perl5/site_perl -Dvendorlib=/usr/share/perl5/vendor_perl -Dvendorarch=/usr/lib/perl5/vendor_perl -Dscriptdir=/usr/bin/core_perl -Dsitescript=/usr/bin/site_perl -Dvendorscript=/usr/bin/vendor_perl -Dinc_version_list=none -Dman1ext=1perl -Dman3ext=3perl -Dlddlflags=-shared -Wl,-O1,--sort-common,--as-needed,-z,relro -Dldflags=-Wl,-O1,--sort-common,--as-needed,-z,relro'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=define, usemultiplicity=define
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=undef, use64bitall=undef, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-march=i686 -mtune=generic -O2 -pipe -fstack-protector --param=ssp-buffer-size=4 -D_FORTIFY_SOURCE=2',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
  ccversion='', gccversion='4.7.2', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags ='-Wl,-O1,--sort-common,--as-needed,-z,relro -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /lib /usr/lib
  libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat
  perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
  libc=/lib/libc-2.16.so, so=so, useshrplib=true, libperl=libperl.so
  gnulibc_version='2.16'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/core_perl/CORE'
  cccdlflags='-fPIC', lddlflags='-shared -Wl,-O1,--sort-common,--as-needed,-z,relro -L/usr/local/lib -fstack-protector'

Locally applied patches​:
 


@​INC for perl 5.16.2​:
  /usr/lib/perl5/site_perl
  /usr/share/perl5/site_perl
  /usr/lib/perl5/vendor_perl
  /usr/share/perl5/vendor_perl
  /usr/lib/perl5/core_perl
  /usr/share/perl5/core_perl
  .


Environment for perl 5.16.2​:
  HOME=/home/ph10
  LANG=en_GB
  LANGUAGE (unset)
  LC_ALL=C
  LC_COLLATE=C
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/home/ph10/bin​:/usr/local/bin​:/usr/bin​:/bin​:/usr/local/sbin​:/usr/sbin​:/sbin​:/usr/bin/vendor_perl​:/usr/bin/core_perl​:.
  PERL_BADLANG (unset)
  SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Jan 25, 2013

From @tamias

On Fri, Jan 25, 2013 at 09​:23​:54AM -0800, Philip Hazel wrote​:

My understanding of how (*THEN) works is that the test below should
match. The perlre page says "...this verb always matches, and when
backtracked into on failure, it causes the regex engine to try the next
alternation in the innermost enclosing group (capturing or otherwise)
that has alternations." Unless I am going mad, the examples below (one a
normal group, the other an assertion) fulfil the condition.

$ perl -e 'print (("ac" =~ /^(?=ab|ac)/)? "yes\n"​:"no\n")'
yes
$ perl -e 'print (("ac" =~ /^(?=a(*THEN)b|ac)/)? "yes\n"​:"no\n")'
no

$ perl -e 'print (("ac" =~ /^(ab|ac)/)? "yes\n"​:"no\n")'
yes
$ perl -e 'print (("ac" =~ /^(a(*THEN)b|ac)/)? "yes\n"​:"no\n")'
no

These work in 5.10.1, but not in 5.14.1.

These are the only tests involving (*THEN) that expect a successful match,
from t/re/pat_advanced.t​:

  {
  #Mindnumbingly simple test of (*THEN)
  for ("ABC","BAX") {
  ok /A (*THEN) X | B (*THEN) C/x, "Simple (*THEN) test";
  }
  }

The key difference seems to be that in your tests, the two alternations
begin with the same character.

Ronald

@p5pRT
Copy link
Author

p5pRT commented Jan 25, 2013

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Jan 25, 2013

From @b2gills

On Fri, Jan 25, 2013 at 4​:17 PM, Ronald J Kimball <rjk@​tamias.net> wrote​:

On Fri, Jan 25, 2013 at 09​:23​:54AM -0800, Philip Hazel wrote​:

My understanding of how (*THEN) works is that the test below should
match. The perlre page says "...this verb always matches, and when
backtracked into on failure, it causes the regex engine to try the next
alternation in the innermost enclosing group (capturing or otherwise)
that has alternations." Unless I am going mad, the examples below (one a
normal group, the other an assertion) fulfil the condition.

$ perl -e 'print (("ac" =~ /^(?=ab|ac)/)? "yes\n"​:"no\n")'
yes
$ perl -e 'print (("ac" =~ /^(?=a(*THEN)b|ac)/)? "yes\n"​:"no\n")'
no

$ perl -e 'print (("ac" =~ /^(ab|ac)/)? "yes\n"​:"no\n")'
yes
$ perl -e 'print (("ac" =~ /^(a(*THEN)b|ac)/)? "yes\n"​:"no\n")'
no

These work in 5.10.1, but not in 5.14.1.

These are the only tests involving (*THEN) that expect a successful match,
from t/re/pat_advanced.t​:

\{
    \#Mindnumbingly simple test of \(\*THEN\)
    for \("ABC"\,"BAX"\) \{
        ok /A \(\*THEN\) X | B \(\*THEN\) C/x\, "Simple \(\*THEN\) test";
    \}
\}

The key difference seems to be that in your tests, the two alternations
begin with the same character.

This appears to be caused by the TRIE optimization (as far as I can tell)

  $ perl -Mre=debug -e'print (("ac" =~ /^(a(*THEN)b|ac)/)? "yes\n"​:"no\n")'

  Compiling REx "^(a(*THEN)b|ac)"
  Final program​:
  1​: BOL (2)
  2​: OPEN1 (4)
  4​: TRIE-EXACT[a] (14)
  <a> (7)
  7​: CUTGROUP (9)
  9​: EXACT <b> (14)
  <ac> (14)
  14​: CLOSE1 (16)
  16​: END (0)
  anchored(BOL) minlen 2
  Matching REx "^(a(*THEN)b|ac)" against "ac"
  0 <> <ac> | 1​:BOL(2)
  0 <> <ac> | 2​:OPEN1(4)
  0 <> <ac> | 4​:TRIE-EXACT[a](14)
  0 <> <ac> | State​: 1 Accepted​: N Charid​:
1 CP​: 61 After State​: 2
  1 <a> <c> | State​: 2 Accepted​: Y Charid​:
2 CP​: 63 After State​: 3
  2 <ac> <> | State​: 3 Accepted​: Y Charid​:
0 CP​: 0 After State​: 0
  got 2 possible matches
  TRIE matched word #1, continuing
  1 <a> <c> | 7​: CUTGROUP(9)
  1 <a> <c> | 9​: EXACT <b>(14)
  failed...
  failed...
  Match failed
  no
  Freeing REx​: "^(a(*THEN)b|ac)"

This fails in the exact same manner​:

  $ perl -Mre=debug -e'print (("ac" =~ /^((?​:a(*THEN)b)|ac)/)?
"yes\n"​:"no\n")'

This succeeds​:

  $ perl -Mre=debug -e'print (("ac" =~ /^((a(*THEN)b)|ac)/)? "yes\n"​:"no\n")'

  Compiling REx "^((a(*THEN)b)|ac)"
  Final program​:
  1​: BOL (2)
  2​: OPEN1 (4)
  4​: BRANCH (15)
  5​: OPEN2 (7)
  7​: EXACT <a> (9)
  9​: CUTGROUP (11)
  11​: EXACT <b> (13)
  13​: CLOSE2 (18)
  15​: BRANCH (FAIL)
  16​: EXACT <ac> (18)
  18​: CLOSE1 (20)
  20​: END (0)
  anchored(BOL) minlen 2
  Matching REx "^((a(*THEN)b)|ac)" against "ac"
  0 <> <ac> | 1​:BOL(2)
  0 <> <ac> | 2​:OPEN1(4)
  0 <> <ac> | 4​:BRANCH(15)
  0 <> <ac> | 5​: OPEN2(7)
  0 <> <ac> | 7​: EXACT <a>(9)
  1 <a> <c> | 9​: CUTGROUP(11)
  1 <a> <c> | 11​: EXACT <b>(13)
  failed...
  failed...
  0 <> <ac> | 15​:BRANCH(18)
  0 <> <ac> | 16​: EXACT <ac>(18)
  2 <ac> <> | 18​: CLOSE1(20)
  2 <ac> <> | 20​: END(0)
  Match successful!
  yes
  Freeing REx​: "^((a(*THEN)b)|ac)"

@p5pRT
Copy link
Author

p5pRT commented Jan 25, 2013

From @khwilliamson

On 01/25/2013 03​:17 PM, Ronald J Kimball wrote​:

On Fri, Jan 25, 2013 at 09​:23​:54AM -0800, Philip Hazel wrote​:

My understanding of how (*THEN) works is that the test below should
match. The perlre page says "...this verb always matches, and when
backtracked into on failure, it causes the regex engine to try the next
alternation in the innermost enclosing group (capturing or otherwise)
that has alternations." Unless I am going mad, the examples below (one a
normal group, the other an assertion) fulfil the condition.

$ perl -e 'print (("ac" =~ /^(?=ab|ac)/)? "yes\n"​:"no\n")'
yes
$ perl -e 'print (("ac" =~ /^(?=a(*THEN)b|ac)/)? "yes\n"​:"no\n")'
no

$ perl -e 'print (("ac" =~ /^(ab|ac)/)? "yes\n"​:"no\n")'
yes
$ perl -e 'print (("ac" =~ /^(a(*THEN)b|ac)/)? "yes\n"​:"no\n")'
no

These work in 5.10.1, but not in 5.14.1.

These are the only tests involving (*THEN) that expect a successful match,
from t/re/pat_advanced.t​:

 \{
     \#Mindnumbingly simple test of \(\*THEN\)
     for \("ABC"\,"BAX"\) \{
         ok /A \(\*THEN\) X | B \(\*THEN\) C/x\, "Simple \(\*THEN\) test";
     \}
 \}

The key difference seems to be that in your tests, the two alternations
begin with the same character.

Ronald

I bisected this problem. The offending commit is
commit 2e64971
Author​: David Mitchell <davem@​iabyn.com>
Date​: Mon May 3 13​:57​:58 2010 +0100

  tries​: don't allocate memory at runtime

  This is an indirect fix for
  [perl #74484] Regex causing exponential runtime+mem usage

  The trie runtime code was doing more SAVETMPS than FREETMPS and was
thus
  growing a large tmps stack on heavy backtracking. Rather than
fixing this
  directly, I rewrote part of the trie code so that it no longer needs to
  allocate memory in S_regmatch (it still does in find_byclass()).

  The basic issue is that multiple branches in the trie may trigger an
  accept state; for example​:

  "abcd" =~ /xyz/abcd.*X|ab.*Y|/

  here, words (branches) 2 and 3 are accept states. The original approach
  was, at run time, to create a list of accepted word numbers and the
  character positions of the end of each of those words. Then run the
rest
  of the pattern for each word in the list in turn (in word index order).
  This requires memory for the list to be allocated and freed.

  The new approach involves creating extra info at compile time; in
  particular, for each word, a pointer to the previous accepted word (if
  any) in the state tree. For example for the above pattern, part of the
  state tree may be

  q b c d
  1 -> 2 -> 3 -> 4 -> 5
  (#3) (#2)

  (e.g. at state 1, if the next char is 'a', we transition to state 2).
  Here, state 3 is an accept state with word #3, and 5 is an accept state
  with word #2. So we build a table indexed by word number, which has
  wordinfo[2] = 3, wordinfo[3] = 0, thus building the word chain 2->3->0.

  At run time we run the trie to completion, and remember the word
  associated with the longest accept state (word #2 above). Then by
following
  back the chain of .prev fields, we can produce a list of all accepting
  words. We then iteratively find the smallest-numbered (ie LH-most)
word in
  the chain, and run with it. On failure and backtrack, we find the
  next-smallest and so on.

  Since we are no longer recording the end-position of each word in the
  string, we have to recalculate this for each backtrack. We initially
  record the end-position of the shortest accepting word, and given
that we
  know the length of each word, we can calculate the new position
each time
  as an offset from that first word. Depending on unicode and
folding, that
  calculation can be cheap or expensive.

  This algorithm is optimised for the typical case where there are a
small
  number (<= 2) accepting states.

  This patch creates a new compile-time array, trie->wordinfo[],
indexed by
  word number, which contains relevant info about each word. This also
  supersedes the old trie->newword[] array, whose function of recording
  "overspills" of multiple words per accept state, is now handled as
part of
  the wordinfo[].prev chain.

@p5pRT
Copy link
Author

p5pRT commented Jan 28, 2013

From @iabyn

On Fri, Jan 25, 2013 at 04​:47​:48PM -0700, Karl Williamson wrote​:

I bisected this problem. The offending commit is
commit 2e64971
Author​: David Mitchell <davem@​iabyn.com>
Date​: Mon May 3 13​:57​:58 2010 +0100

tries&#8203;: don't allocate memory at runtime

This is an indirect fix for
    \[perl \#74484\] Regex causing exponential runtime\+mem usage

Oh joy. I'll add it to my list...

--
In the 70's we wore flares because we didn't know any better.
What possible excuse does the current generation have?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants