Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(?>) causes wrongness on long string #9539

Closed
p5pRT opened this issue Oct 21, 2008 · 27 comments
Closed

(?>) causes wrongness on long string #9539

p5pRT opened this issue Oct 21, 2008 · 27 comments

Comments

@p5pRT
Copy link

p5pRT commented Oct 21, 2008

Migrated from rt.perl.org#60034 (status was 'resolved')

Searchable as RT60034$

@p5pRT
Copy link
Author

p5pRT commented Oct 21, 2008

From zefram@fysh.org

Created by zefram@fysh.org

$ perl -lwe '$a="xyzt"x1000000; print $a =~ /\A(?>[a-z])*\z/ ? "yes" : "no"'
no
$

It gives the correct answer "yes" if the input is repeated only 1000
times instead of 1000000, or if the (?>) wrapper is changed to (?​:),
or if it is run under perl 5.8.8.

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl 5.10.0:

Configured by Debian Project at Sat Jun 21 21:25:02 UTC 2008.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.25.7, archname=i486-linux-gnu-thread-multi
    uname='linux ninsei 2.6.25.7 #1 smp preempt fri jun 20 14:17:13 pdt 2008 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.0 -Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.3.1', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.7.so, so=so, useshrplib=true, libperl=libperl.so.5.10.0
    gnulibc_version='2.7'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'

Locally applied patches:
    


@INC for perl 5.10.0:
    /etc/perl
    /usr/local/lib/perl/5.10.0
    /usr/local/share/perl/5.10.0
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    .


Environment for perl 5.10.0:
    HOME=/root
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/root/pub/i686-pc-linux-gnu/bin:/root/pub/common/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/sbin:/bin:/usr/local/sbin:/usr/local/bin:/usr/games
    PERL_BADLANG (unset)
    SHELL=/usr/bin/zsh

@p5pRT
Copy link
Author

p5pRT commented Oct 21, 2008

From @nwc10

Andreas, would you be able to work out which change in 5.8.x caused this?

On Tue, Oct 21, 2008 at 04​:18​:18AM -0700, Zefram wrote​:

$ perl -lwe '$a="xyzt"x1000000; print $a =~ /\A(?>[a-z])*\z/ ? "yes" : "no"'
no
$

It gives the correct answer "yes" if the input is repeated only 1000
times instead of 1000000, or if the (?>) wrapper is changed to (?​:),
or if it is run under perl 5.8.8.

Verified under blead, maint-5.10.x *and* maint-5.8.x

Gah. That's a regression.

And there was I thinking that I'd nailed them all.

$ ./perl -lwe '$a="xyzt"x8191; print $a =~ /\A(?>[a-z])*\z/ ? "yes" : "no"'
yes
$ ./perl -lwe '$a="xyzt"x8192; print $a =~ /\A(?>[a-z])*\z/ ? "yes" : "no"'
no

What's so special about 32768 here? (the length of the failing string)

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Oct 21, 2008

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Oct 21, 2008

From @Abigail

On Tue, Oct 21, 2008 at 01​:51​:09PM +0100, Nicholas Clark wrote​:

Andreas, would you be able to work out which change in 5.8.x caused this?

On Tue, Oct 21, 2008 at 04​:18​:18AM -0700, Zefram wrote​:

$ perl -lwe '$a="xyzt"x1000000; print $a =~ /\A(?>[a-z])*\z/ ? "yes" : "no"'
no
$

It gives the correct answer "yes" if the input is repeated only 1000
times instead of 1000000, or if the (?>) wrapper is changed to (?​:),
or if it is run under perl 5.8.8.

Verified under blead, maint-5.10.x *and* maint-5.8.x

Gah. That's a regression.

And there was I thinking that I'd nailed them all.

$ ./perl -lwe '$a="xyzt"x8191; print $a =~ /\A(?>[a-z])*\z/ ? "yes" : "no"'
yes
$ ./perl -lwe '$a="xyzt"x8192; print $a =~ /\A(?>[a-z])*\z/ ? "yes" : "no"'
no

What's so special about 32768 here? (the length of the failing string)

Hard coded constant from a way back, I think, when the regexp engine was
still recursive. It all has to do that * actually means 0 to 32767 times.
Didn't we have the same issue just a few months ago?

  $ perl -Mre=debug -cE '/\A(?>[a-z])*\z/'
  Compiling REx "\A(?>[a-z])*\z"
  Final program​:
  1​: SBOL (2)
  2​: CURLYM[0] {0,32767} (21)
  4​: SUSPEND (19)
  6​: ANYOF[a-z] (17)
  17​: SUCCEED (0)
  18​: TAIL (19)
  19​: SUCCEED (0)
  20​: NOTHING (21)
  21​: EOS (22)
  22​: END (0)
  floating ""$ at 0..2147483647 (checking floating) anchored(SBOL) minlen 0
  -e syntax OK
  Freeing REx​: "\A(?>[a-z])*\z"

Abigail

@p5pRT
Copy link
Author

p5pRT commented Oct 21, 2008

From @nwc10

On Tue, Oct 21, 2008 at 01​:51​:09PM +0100, Nicholas Clark wrote​:

Andreas, would you be able to work out which change in 5.8.x caused this?

Don't worry. Vincent said on IRC that it would probably be change 29916 and
he was right.

So, next question. What's the fix? It doesn't quite revert cleanly. Plus I
rather liked the elegance of it, and it should work, dammit, so what did I
(or Dave) miss?

Nicholas Clark

Change 29916 by nicholas@​nicholas-saigo on 2007/01/22 16​:26​:58

  Integrate​:
  [ 27526]
  reduce S_regrepeat_hard() callers from 3 to 1
 
  [ 27534]
  inline, then delete, S_regrepeat_hard()
 
  [ 27535]
  Restore a bit of change 27533 that change 27534 inadvertently unwound.
 
  [ 27569]
  remove idential code branch from regmatch()

Affected files ...

... //depot/maint-5.8/perl/embed.fnc#176 integrate
... //depot/maint-5.8/perl/embed.h#131 integrate
... //depot/maint-5.8/perl/proto.h#165 integrate
... //depot/maint-5.8/perl/regexec.c#68 integrate

Differences ...

==== //depot/maint-5.8/perl/embed.fnc#176 (text) ====

@​@​ -1261,7 +1261,6 @​@​
#if defined(PERL_IN_REGEXEC_C) || defined(PERL_DECL_PROT)
ERs |I32 |regmatch |NN regnode *prog
ERs |I32 |regrepeat |NN const regnode *p|I32 max
-ERs |I32 |regrepeat_hard |NN regnode *p|I32 max|NN I32 *lp
ERs |I32 |regtry |NN regexp *prog|NN char *startpos
ERs |bool |reginclass |NN const regnode *n|NN const U8 *p|NULLOK STRLEN *lenp\
  |bool do_utf8sv_is_utf8

==== //depot/maint-5.8/perl/embed.h#131 (text+w) ====

@​@​ -1291,7 +1291,6 @​@​
#if defined(PERL_CORE) || defined(PERL_EXT)
#define regmatch S_regmatch
#define regrepeat S_regrepeat
-#define regrepeat_hard S_regrepeat_hard
#define regtry S_regtry
#define reginclass S_reginclass
#define regcppush S_regcppush
@​@​ -3366,7 +3365,6 @​@​
#if defined(PERL_CORE) || defined(PERL_EXT)
#define regmatch(a) S_regmatch(aTHX_ a)
#define regrepeat(a,b) S_regrepeat(aTHX_ a,b)
-#define regrepeat_hard(a,b,c) S_regrepeat_hard(aTHX_ a,b,c)
#define regtry(a,b) S_regtry(aTHX_ a,b)
#define reginclass(a,b,c,d) S_reginclass(aTHX_ a,b,c,d)
#define regcppush(a) S_regcppush(aTHX_ a)

==== //depot/maint-5.8/perl/proto.h#165 (text+w) ====

@​@​ -1845,9 +1845,6 @​@​
STATIC I32 S_regrepeat(pTHX_ const regnode *p, I32 max)
  __attribute__warn_unused_result__;

-STATIC I32 S_regrepeat_hard(pTHX_ regnode *p, I32 max, I32 *lp)
- __attribute__warn_unused_result__;
-
STATIC I32 S_regtry(pTHX_ regexp *prog, char *startpos)
  __attribute__warn_unused_result__;

==== //depot/maint-5.8/perl/regexec.c#68 (text) ====

@​@​ -3399,7 +3399,9 @​@​
  case CURLYM​:
  {
  I32 l = 0;
+ I32 matches = 0;
  CHECKPOINT lastcp;
+ I32 maxwanted;
 
  /* We suppose that the next guy does not need
  backtracking​: in particular, it is of constant non-zero length,
@​@​ -3417,11 +3419,38 @​@​
  if (paren)
  scan += NEXT_OFF(scan); /* Skip former OPEN. */
  PL_reginput = locinput;
+ maxwanted = minmod ? ln : n;
+ if (maxwanted) {
+ while (PL_reginput < PL_regeol && matches < maxwanted) {
+ if (!regmatch(scan))
+ break;
+ /* on first match, determine length, l */
+ if (!matches++) {
+ if (PL_reg_match_utf8) {
+ char *s = locinput;
+ while (s < PL_reginput) {
+ l++;
+ s += UTF8SKIP(s);
+ }
+ }
+ else {
+ l = PL_reginput - locinput;
+ }
+ if (l == 0) {
+ matches = maxwanted;
+ break;
+ }
+ }
+ locinput = PL_reginput;
+ }
+ }
+
+ PL_reginput = locinput;
+
  if (minmod) {
  minmod = 0;
- if (ln && regrepeat_hard(scan, ln, &l) < ln)
+ if (ln && matches < ln)
  sayNO;
- locinput = PL_reginput;
  if (HAS_TEXT(next) || JUMPABLE(next)) {
  regnode *text_node = next;

@​@​ -3467,7 +3496,7 @​@​
  }
  /* Couldn't or didn't -- move forward. */
  PL_reginput = locinput;
- if (regrepeat_hard(scan, 1, &l)) {
+ if (regmatch(scan)) {
  ln++;
  locinput = PL_reginput;
  }
@​@​ -3476,15 +3505,13 @​@​
  }
  }
  else {
- n = regrepeat_hard(scan, n, &l);
- locinput = PL_reginput;
  DEBUG_r(
  PerlIO_printf(Perl_debug_log,
  "%*s matched %"IVdf" times, len=%"IVdf"...\n",
  (int)(REPORT_CODE_OFF+PL_regindent*2), "",
- (IV) n, (IV)l)
+ (IV) matches, (IV)l)
  );
- if (n >= ln) {
+ if (matches >= ln) {
  if (HAS_TEXT(next) || JUMPABLE(next)) {
  regnode *text_node = next;

@​@​ -3511,19 +3538,20 @​@​
  }
  assume_ok_REG​:
  REGCP_SET(lastcp);
- while (n >= ln) {
+ while (matches >= ln) {
  /* If it could work, try it. */
  if (c1 == -1000 ||
  UCHARAT(PL_reginput) == c1 ||
  UCHARAT(PL_reginput) == c2)
  {
  DEBUG_r(
- PerlIO_printf(Perl_debug_log,
- "%*s trying tail with n=%"IVdf"...\n",
- (int)(REPORT_CODE_OFF+PL_regindent*2), "", (IV)n)
+ PerlIO_printf(Perl_debug_log,
+ "%*s trying tail with matches=%"IVdf"...\n",
+ (int)(REPORT_CODE_OFF+PL_regindent*2),
+ "", (IV)matches)
  );
  if (paren) {
- if (n) {
+ if (matches) {
  PL_regstartp[paren] = HOPc(PL_reginput, -l) - PL_bostr;
  PL_regendp[paren] = PL_reginput - PL_bostr;
  }
@​@​ -3535,7 +3563,7 @​@​
  REGCP_UNWIND(lastcp);
  }
  /* Couldn't or didn't -- back up. */
- n--;
+ matches--;
  locinput = HOPc(locinput, -l);
  PL_reginput = locinput;
  }
@​@​ -3770,7 +3798,7 @​@​
  ln--;
  }
  REGCP_SET(lastcp);
- if (paren) {
+ {
  UV c = 0;
  while (n >= ln) {
  if (c1 != -1000) {
@​@​ -3792,28 +3820,6 @​@​
  PL_reginput = locinput = HOPc(locinput, -1);
  }
  }
- else {
- UV c = 0;
- while (n >= ln) {
- if (c1 != -1000) {
- if (do_utf8)
- c = utf8n_to_uvchr((U8*)PL_reginput,
- UTF8_MAXBYTES, 0,
- uniflags);
- else
- c = UCHARAT(PL_reginput);
- }
- /* If it could work, try it. */
- if (c1 == -1000 || c == (UV)c1 || c == (UV)c2)
- {
- TRYPAREN(paren, n, PL_reginput);
- REGCP_UNWIND(lastcp);
- }
- /* Couldn't or didn't -- back up. */
- n--;
- PL_reginput = locinput = HOPc(locinput, -1);
- }
- }
  }
  sayNO;
  break;
@​@​ -4274,57 +4280,6 @​@​
  return(c);
}

-/*
- - regrepeat_hard - repeatedly match something, report total lenth and length
- *
- * The repeater is supposed to have constant non-zero length.
- */
-
-STATIC I32
-S_regrepeat_hard(pTHX_ regnode *p, I32 max, I32 *lp)
-{
- register char *scan = NULL;
- register char *start;
- register char *loceol = PL_regeol;
- I32 l = 0;
- I32 count = 0, res = 1;
-
- if (!max)
- return 0;
-
- start = PL_reginput;
- if (PL_reg_match_utf8) {
- while (PL_reginput < loceol && (scan = PL_reginput, res = regmatch(p))) {
- if (!count++) {
- l = 0;
- while (start < PL_reginput) {
- l++;
- start += UTF8SKIP(start);
- }
- *lp = l;
- if (l == 0)
- return max;
- }
- if (count == max)
- return count;
- }
- }
- else {
- while (PL_reginput < loceol && (scan = PL_reginput, res = regmatch(p))) {
- if (!count++) {
- *lp = l = PL_reginput - start;
- if (max != REG_INFTY && l*max < loceol - scan)
- loceol = scan + l*max;
- if (l == 0)
- return max;
- }
- }
- }
- if (!res)
- PL_reginput = scan;
-
- return count;
-}

/*
- regclass_swash - prepare the utf8 swash

@p5pRT
Copy link
Author

p5pRT commented Oct 21, 2008

From perl@profvince.com

+ maxwanted = minmod ? ln : n;
+ if (maxwanted) {
+ while (PL_reginput < PL_regeol && matches < maxwanted) {
+ if (!regmatch(scan))
+ break;
+ /* on first match, determine length, l */
+ if (!matches++) {
+ if (PL_reg_match_utf8) {
+ char *s = locinput;
+ while (s < PL_reginput) {
+ l++;
+ s += UTF8SKIP(s);
+ }
+ }
+ else {
+ l = PL_reginput - locinput;
+ }
+ if (l == 0) {
+ matches = maxwanted;
+ break;
+ }
+ }
+ locinput = PL_reginput;
+ }
+ }
+
+ PL_reginput = locinput;
+

From how I understand the problem, maxwanted is REG_INFTY (i.e. 32767)
when * is used as the quantifier, but the new loop ignores this special
case, and eventually matches becomes greater or equal than 32767, and
the match fails. This can be fixed in several ways :
1a°) Increasing REG_INFTY, but if the engine is still recursive it may
not be a good choice.
1b°) Setting maxwanted to I32_MAX if it's REG_INFTY, since matches and
maxwanted are I32s.
(In those both cases, the problem gets translated to about 2^31. Is that
enough? I can imagine matching 2Gb long strings. Well, not *me*, but
there has to be someone that'll do it).
2°) Test for maxwanted == REG_INFTY in the while conditional, but it
will slow down a little all those repetitions.
I tested that 2 solves the issue.

Vincent.

@p5pRT
Copy link
Author

p5pRT commented Oct 22, 2008

From @iabyn

On Tue, Oct 21, 2008 at 05​:09​:28PM +0200, Vincent Pit wrote​:

+ maxwanted = minmod ? ln : n;
+ if (maxwanted) {
+ while (PL_reginput < PL_regeol && matches < maxwanted) {
+ if (!regmatch(scan))
+ break;
+ /* on first match, determine length, l */
+ if (!matches++) {
+ if (PL_reg_match_utf8) {
+ char *s = locinput;
+ while (s < PL_reginput) {
+ l++;
+ s += UTF8SKIP(s);
+ }
+ }
+ else {
+ l = PL_reginput - locinput;
+ }
+ if (l == 0) {
+ matches = maxwanted;
+ break;
+ }
+ }
+ locinput = PL_reginput;
+ }
+ }
+
+ PL_reginput = locinput;
+

From how I understand the problem, maxwanted is REG_INFTY (i.e. 32767)
when * is used as the quantifier, but the new loop ignores this special
case, and eventually matches becomes greater or equal than 32767, and
the match fails. This can be fixed in several ways :
1a°) Increasing REG_INFTY, but if the engine is still recursive it may
not be a good choice.
1b°) Setting maxwanted to I32_MAX if it's REG_INFTY, since matches and
maxwanted are I32s.
(In those both cases, the problem gets translated to about 2^31. Is that
enough? I can imagine matching 2Gb long strings. Well, not *me*, but
there has to be someone that'll do it).
2°) Test for maxwanted == REG_INFTY in the while conditional, but it
will slow down a little all those repetitions.
I tested that 2 solves the issue.

Without looking too closely, I presume that the "maxwanted == REG_INFTY in
the while conditional" was already there in the regrepeat_hard() function
and my original patch just accidently missed it out when I unrolled the
function? In which case adding it back in for the 5.8.x branch shouldn't
affect performance relative to 5.8.8, right? In which case that gets my
vote for 5.8.x.

I guess things might be a bit more complex for bleed/5.10.x

--
Wesley Crusher gets beaten up by his classmates for being a smarmy git,
and consequently has a go at making some friends of his own age for a
change.
  -- Things That Never Happen in "Star Trek" #18

@p5pRT
Copy link
Author

p5pRT commented Oct 23, 2008

From perl@profvince.com

Without looking too closely, I presume that the "maxwanted == REG_INFTY in
the while conditional" was already there in the regrepeat_hard() function
and my original patch just accidently missed it out when I unrolled the
function? In which case adding it back in for the 5.8.x branch shouldn't
affect performance relative to 5.8.8, right? In which case that gets my
vote for 5.8.x.

Yes, this makes sense. Patch attached, ok with maint-5.8@​34560.

@p5pRT
Copy link
Author

p5pRT commented Oct 23, 2008

From perl@profvince.com

perl-re-long-fix.patch
--- regexec.c	2008-09-22 15:53:35.000000000 +0200
+++ regexec.c	2008-10-21 16:43:43.000000000 +0200
@@ -3192,7 +3192,8 @@
 	    PL_reginput = locinput;
 	    maxwanted = minmod ? ln : n;
 	    if (maxwanted) {
-		while (PL_reginput < PL_regeol && matches < maxwanted) {
+		while (PL_reginput < PL_regeol && (maxwanted == REG_INFTY
+						   || matches < maxwanted)) {
 		    if (!regmatch(scan))
 			break;
 		    /* on first match, determine length, l */
--- t/op/pat.t	2008-09-22 15:53:35.000000000 +0200
+++ t/op/pat.t	2008-10-21 16:49:52.000000000 +0200
@@ -3845,6 +3845,12 @@
 	'IsPunct agrees with [:punct:] with explicit Latin1');
 } 
 
+# [perl #60034]
+{
+ my $a = "xyzt" x 8192;
+ ok($a =~ /\A(?>[a-z])*\z/, '(?>) does not cause wrongness on long string');
+}
+
 # Test counter is at bottom of file. Put new tests above here.
 #-------------------------------------------------------------------
 # Keep the following tests last -- they may crash perl
@@ -3860,4 +3866,4 @@
 
 # Put new tests above the dotted line about a page above this comment
 
-BEGIN{print "1..1273\n"};
+BEGIN{print "1..1274\n"};

@p5pRT
Copy link
Author

p5pRT commented Oct 23, 2008

From zefram@fysh.org

I wrote​:

$ perl -lwe '$a="xyzt"x1000000; print $a =~ /\A(?>[a-z])*\z/ ? "yes" : "no"'
...
It gives the correct answer "yes" if the input is repeated only 1000
times instead of 1000000, or if the (?>) wrapper is changed to (?​:),
or if it is run under perl 5.8.8.

I discovered a variant on this bug. If the string is represented in
UTF-8 then both 5.10.0 and 5.8.8 fail.

$ perl -lwe '$a="xyzt"x1000000; utf8​::upgrade($a); print $a =~ /\A(?>[a-z])*\z/ ? "yes" : "no"'

-zefram

@p5pRT
Copy link
Author

p5pRT commented Oct 23, 2008

From @nwc10

On Thu, Oct 23, 2008 at 12​:40​:36AM +0200, Vincent Pit wrote​:

Without looking too closely, I presume that the "maxwanted == REG_INFTY in
the while conditional" was already there in the regrepeat_hard() function
and my original patch just accidently missed it out when I unrolled the
function? In which case adding it back in for the 5.8.x branch shouldn't
affect performance relative to 5.8.8, right? In which case that gets my
vote for 5.8.x.

Yes, this makes sense. Patch attached, ok with maint-5.8@​34560.

--- regexec.c 2008-09-22 15​:53​:35.000000000 +0200
+++ regexec.c 2008-10-21 16​:43​:43.000000000 +0200
@​@​ -3192,7 +3192,8 @​@​
PL_reginput = locinput;
maxwanted = minmod ? ln : n;
if (maxwanted) {
- while (PL_reginput < PL_regeol && matches < maxwanted) {
+ while (PL_reginput < PL_regeol && (maxwanted == REG_INFTY
+ || matches < maxwanted)) {

Is adding that ^^^^^^^^^^^^^^^^^^^^^^
(and the performance impact thereof) your only reservation about adding it
to blead and 5.10.x?

Given that it seems correct for 5.8.x, isn't it correct for all?

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Oct 23, 2008

From perl@profvince.com

Given that it seems correct for 5.8.x, isn't it correct for all?

In the spirit, it should be. However, the code has changed since 5.10
and the direct backport doesn't seem to do well - so it'll most likely
need more work.

@p5pRT
Copy link
Author

p5pRT commented Oct 25, 2008

From @nwc10

On Thu, Oct 23, 2008 at 12​:40​:36AM +0200, Vincent Pit wrote​:

Without looking too closely, I presume that the "maxwanted == REG_INFTY in
the while conditional" was already there in the regrepeat_hard() function
and my original patch just accidently missed it out when I unrolled the
function? In which case adding it back in for the 5.8.x branch shouldn't
affect performance relative to 5.8.8, right? In which case that gets my
vote for 5.8.x.

Yes, this makes sense. Patch attached, ok with maint-5.8@​34560.

Thanks, applied as change 34580.
I tweaked the indenting on the regression case (one space indenting! - did you
go to the same school as Nick Ing-Simmons?) and added a test for the same
string when UTF-8 encoding, which (as Zefram and I independently found) was
always failing on 5.8.8

On Thu, Oct 23, 2008 at 07​:07​:36PM +0200, Vincent Pit wrote​:

Given that it seems correct for 5.8.x, isn't it correct for all?

In the spirit, it should be. However, the code has changed since 5.10
and the direct backport doesn't seem to do well - so it'll most likely
need more work.

Gosh yes. It's not the same code now, is it?

I can merge the tests as TODOs into blead.

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Oct 25, 2008

From perl@profvince.com

I tweaked the indenting on the regression case (one space indenting! - did you
go to the same school as Nick Ing-Simmons?)
Heh, yes, sorry about that. I try very hard to refrain from letting it
slip into the patches, but sometimes my inner nature is just too strong. :)

Gosh yes. It's not the same code now, is it?
Not really. I suspect that the maxwanted logic has been shattered into
different places. Which makes it harder to fix.

I can merge the tests as TODOs into blead.
Please do. I'll try to have another look at it.

@p5pRT
Copy link
Author

p5pRT commented Oct 25, 2008

From @nwc10

On Sat, Oct 25, 2008 at 10​:49​:49AM +0200, Vincent Pit wrote​:

Gosh yes. It's not the same code now, is it?
Not really. I suspect that the maxwanted logic has been shattered into
different places. Which makes it harder to fix.

I can merge the tests as TODOs into blead.
Please do. I'll try to have another look at it.

Done as change 34581.

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Nov 10, 2008

@smpeters - Status changed from 'open' to 'resolved'

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2009

From zefram@fysh.org

Steve Peters via RT wrote​:

According to our records, your request regarding
"(?>) causes wrongness on long string"
has been resolved.

This has not been resolved. A patch went into 5.8.9 for one form of this
bug, so it's resolved for 5.8. 5.10.0 suffers that form and another,
but 5.10 has not been patched at all for this. Blead is also unpatched.

-zefram

@p5pRT
Copy link
Author

p5pRT commented Mar 22, 2009

From @iabyn

On Sat, Oct 25, 2008 at 10​:13​:14AM +0100, Nicholas Clark wrote​:

On Sat, Oct 25, 2008 at 10​:49​:49AM +0200, Vincent Pit wrote​:

Gosh yes. It's not the same code now, is it?
Not really. I suspect that the maxwanted logic has been shattered into
different places. Which makes it harder to fix.

I can merge the tests as TODOs into blead.
Please do. I'll try to have another look at it.

Done as change 34581.

Now fixed in bleed, and tests unTODOed,
by commit c966426

--
The Enterprise is captured by a vastly superior alien intelligence which
does not put them on trial.
  -- Things That Never Happen in "Star Trek" #10

@p5pRT
Copy link
Author

p5pRT commented Mar 23, 2009

From @demerphq

2008/10/21 via RT Zefram <perlbug-followup@​perl.org>​:

$ perl -lwe '$a="xyzt"x1000000; print $a =~ /\A(?>[a-z])*\z/ ? "yes" : "no"'

Should we be concerned that this pattern makes no sense?

I mean it should be

/\A(?>[a-z]*)\z/

and in fact, we should be able to statically optimize this so that it
doesnt backtrack at all.

I think this is an interesting class of pattern, one that could be
analytically determined to be resolvable with a pure DFA engine, which
we could use instead of the RNFA engine we currently use.

If it was part of a larger pattern that required the RNFA engine to
operate then we would return to the counting problem, as we would need
to keep track of how many times it matched and how big each match was.
This would require only one stack frame of storage hypothetically, but
would still have some sort of upper length limit because of the
"number of times matched" in the stack frame.

So ultimately there will ALWAYS be the requirement for some sort of
upper limit, so IMO we should decide on one, make it incredibly large,
and do the required adjustments so that we tweak it as memory
capabilities increase in the future. Maybe one day it will be
reasonable to want to do a regex on a terrabyte string. :-)

IMO 32 bits is not enough, it needs to be a 64 bit value. Im not
familiar enough with portability implications of such a choice, would
it be impossible to arrange that this counter is a 64 bit value? Or at
the very least the largest true integer the platform we build on can
deal with?

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Mar 23, 2009

From @nwc10

On Mon, Mar 23, 2009 at 10​:01​:11AM +0100, demerphq wrote​:

IMO 32 bits is not enough, it needs to be a 64 bit value. Im not
familiar enough with portability implications of such a choice, would
it be impossible to arrange that this counter is a 64 bit value? Or at

Not if the platform doesn't have a 64 bit type

the very least the largest true integer the platform we build on can
deal with?

Yes.

And, I think I need a signature, or at least a block comment, roughly

"This seems like a good idea. Note I've not thought through all the
implications. Don't let me stop you doing it; ask the list for help, and if
I know the answer I will reply. But please don't wait, hoping that I will do
it, or that I will organise someone else to do it, because it's not my
itch."

(Following on from an unrelated private conversation with Yves)

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Mar 23, 2009

From @iabyn

On Mon, Mar 23, 2009 at 10​:01​:11AM +0100, demerphq wrote​:

2008/10/21 via RT Zefram <perlbug-followup@​perl.org>​:

$ perl -lwe '$a="xyzt"x1000000; print $a =~ /\A(?>[a-z])*\z/ ? "yes" : "no"'

Should we be concerned that this pattern makes no sense?

I mean it should be

/\A(?>[a-z]*)\z/

I presume the regexp above just happens to excercise the bug, so it
doesn't really matter that it makes no sense. The following
much simpler pattern also demonstrates the issue​:

  $ perl5100 -le 'print "ok" if ("ab" x 32767) =~ /^(ab)*$/;'
  ok
  $ perl5100 -le 'print "ok" if ("ab" x 32768) =~ /^(ab)*$/;'
  $

I'm not really in a position to comment on the rest of your post,
due to lack of time and clue!

and in fact, we should be able to statically optimize this so that it
doesnt backtrack at all.

I think this is an interesting class of pattern, one that could be
analytically determined to be resolvable with a pure DFA engine, which
we could use instead of the RNFA engine we currently use.

If it was part of a larger pattern that required the RNFA engine to
operate then we would return to the counting problem, as we would need
to keep track of how many times it matched and how big each match was.
This would require only one stack frame of storage hypothetically, but
would still have some sort of upper length limit because of the
"number of times matched" in the stack frame.

So ultimately there will ALWAYS be the requirement for some sort of
upper limit, so IMO we should decide on one, make it incredibly large,
and do the required adjustments so that we tweak it as memory
capabilities increase in the future. Maybe one day it will be
reasonable to want to do a regex on a terrabyte string. :-)

IMO 32 bits is not enough, it needs to be a 64 bit value. Im not
familiar enough with portability implications of such a choice, would
it be impossible to arrange that this counter is a 64 bit value? Or at
the very least the largest true integer the platform we build on can
deal with?

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

--
Art is anything that has a label (especially if the label is "untitled 1")

@p5pRT
Copy link
Author

p5pRT commented Mar 23, 2009

From @demerphq

2009/3/23 Nicholas Clark <nick@​ccl4.org>​:

On Mon, Mar 23, 2009 at 10​:01​:11AM +0100, demerphq wrote​:

IMO 32 bits is not enough, it needs to be a 64 bit value. Im not
familiar enough with portability implications of such a choice, would
it be impossible to arrange that this counter is a 64 bit value? Or at

Not if the platform doesn't have a 64 bit type

the very least the largest true integer the platform we build on can
deal with?

Yes.

So how do we decide what that is?

And, I think I need a signature, or at least a block comment, roughly

"This seems like a good idea. Note I've not thought through all the
implications. Don't let me stop you doing it; ask the list for help, and if
I know the answer I will reply. But please don't wait, hoping that I will do
it, or that I will organise someone else to do it, because it's not my
itch."

(Following on from an unrelated private conversation with Yves)

Ohh, did i annoy you with that conversation? I didnt mean to.

And no, its a regex thing, ill take care of it if i can, not expecting
you to at all....

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Mar 24, 2009

From @nwc10

On Mon, Mar 23, 2009 at 10​:26​:20PM +0100, demerphq wrote​:

2009/3/23 Nicholas Clark <nick@​ccl4.org>​:

On Mon, Mar 23, 2009 at 10​:01​:11AM +0100, demerphq wrote​:

IMO 32 bits is not enough, it needs to be a 64 bit value. Im not
familiar enough with portability implications of such a choice, would
it be impossible to arrange that this counter is a 64 bit value? Or at

Not if the platform doesn't have a 64 bit type

the very least the largest true integer the platform we build on can
deal with?

Yes.

So how do we decide what that is?

Well, it's axiomatic in C89 that it's called "long".
But, then C99 came along, and, er, cough*

I suspect that size_t will do the job perfectly.

And, I think I need a signature, or at least a block comment, roughly

"This seems like a good idea. Note I've not thought through all the
implications. Don't let me stop you doing it; ask the list for help, and if
I know the answer I will reply. But please don't wait, hoping that I will do
it, or that I will organise someone else to do it, because it's not my
itch."

(Following on from an unrelated private conversation with Yves)

Ohh, did i annoy you with that conversation? I didnt mean to.

No, you didn't. It was more that you'd know where I was coming from if I
started adding a standard disclaimer, but everyone else wouldn't.

And no, its a regex thing, ill take care of it if i can, not expecting
you to at all....

Rah!

Nicholas Clark

* and also cough, atof(), cough

@p5pRT
Copy link
Author

p5pRT commented Jun 30, 2009

From p5p@spam.wizbit.be

setting status to open again (it looks like there are unresolved issues)

@p5pRT
Copy link
Author

p5pRT commented Jun 30, 2009

p5p@spam.wizbit.be - Status changed from 'resolved' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Jul 19, 2009

From @iabyn

I'm closing this again, as the original bug is now fixed on 5.8.9,
5.10.1, 5.11.0, and the remaining "issues" are really just discussions
about future enhancements to the RE engine

@p5pRT
Copy link
Author

p5pRT commented Jul 19, 2009

@iabyn - Status changed from 'open' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant