Regex fails when string is too long #9729

Closed
p5pRT opened this issue May 4, 2009 · 9 comments

p5pRT commented May 4, 2009

Migrated from rt.perl.org#65372 (status was 'resolved')

Searchable as RT65372$

p5pRT commented May 4, 2009

From @perlpunk

Created by @perlpunk

This is a bug report for perl from tina@cure.localdomain,
generated with the help of perlbug 1.36 running under perl 5.10.0.

-----------------------------------------------------------------
Hi,

I found that in 5.10 and newer a regex begins to fail if a matching string
is longer than 32767.

The shortest example I could produce:

  use strict;
  use warnings;

  parse("x" x 32767);
  parse("x" x 32768);

  sub parse {
      my $xml = shift;

      $xml = "<html>${xml}</html>";

      my $tag = "html";

      if ( $xml =~ m{<$tag>(.|\n)*?</$tag>}i ) {
          print "matched\n";
      }
      else {
          print "didn't match\n";
      }
  }

It prints:
  matched
  didn't match

If I change the regex to:
  m{<$tag>(.)*?</$tag>}is

it works.

regards,
tina

Perl Info

Flags:
     category=core
     severity=low

Site configuration information for perl 5.10.0:

Configured by tina at Thu Sep 11 21:29:01 CEST 2008.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
   Platform:
     osname=linux, osvers=2.6.26, archname=i686-linux
     uname='linux cure 2.6.26 #8 smp tue aug 12 14:04:43 cest 2008 i686 gnulinux '
     config_args='-de -Dprefix=/opt/local/perl/5.10.0'
     hint=recommended, useposix=true, d_sigaction=define
     useithreads=undef, usemultiplicity=undef
     useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
     use64bitint=undef, use64bitall=undef, uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='cc', ccflags ='-fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
     optimize='-O2',
     cppflags='-fno-strict-aliasing -pipe -I/usr/local/include'
     ccversion='', gccversion='4.1.2 20061115 (prerelease) (Debian 4.1.1-21)', gccosandvers=''
     intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
     ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
     alignbytes=4, prototype=define
   Linker and Libraries:
     ld='cc', ldflags =' -L/usr/local/lib'
     libpth=/usr/local/lib /lib /usr/lib /usr/lib64
     libs=-lnsl -lgdbm -ldl -lm -lcrypt -lutil -lc
     perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
     libc=/lib/libc-2.3.6.so, so=so, useshrplib=false, libperl=libperl.a
     gnulibc_version='2.3.6'
   Dynamic Linking:
     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
     cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib'

Locally applied patches:



@INC for perl 5.10.0:
     /opt/local/perl/5.10.0/lib/5.10.0/i686-linux
     /opt/local/perl/5.10.0/lib/5.10.0
     /opt/local/perl/5.10.0/lib/site_perl/5.10.0/i686-linux
     /opt/local/perl/5.10.0/lib/site_perl/5.10.0
     .


Environment for perl 5.10.0:
     HOME=/home/tina
     LANG=de_DE.UTF-8
     LANGUAGE (unset)
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
     PERL5LIB=
     PERL_BADLANG (unset)
     SHELL=/bin/bash

-- 
http://darkdance.net/
http://perlpunks.de/
http://www.trashcave.de/

p5pRT commented May 6, 2009

From @schwern

Tina (via RT) wrote:

> I found that in 5.10 and newer a regex begins to fail if a matching string
> is longer than 32767.

Thanks for the report. Confirmed on 5.10.0 on OS X.

No bug in 5.8.8, 5.8.9, 5.6.2 or 5.5.5.

--
Just call me 'Moron Sugar'.
  http://www.somethingpositive.net/sp05182002.shtml

p5pRT commented May 6, 2009

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented May 8, 2009

From @andk

On Tue, 05 May 2009 22:20:17 -0700, Michael G Schwern <schwern@pobox.com> said:

  > Tina (via RT) wrote:

  >> I found that in 5.10 and newer a regex begins to fail if a matching string
  >> is longer than 32767.

  > Thanks for the report. Confirmed on 5.10.0 on OS X.

  > No bug in 5.8.8, 5.8.9, 5.6.2 or 5.5.5.

I think it's a known issue but Dave will correct me. Bisect points at

commit 40a8244
Author: Dave Mitchell <davem@fdisolutions.com>
Date: Fri Jun 16 23:25:51 2006 +0000

  start turning regmatch() main loop into a FSM
  also make BRANCH use the state stack rather than its own unwind struct

  p4raw-id: //depot/perl@28398

--
andreas

p5pRT commented May 9, 2009

From @iabyn

On Fri, May 08, 2009 at 08:00:17AM +0200, Andreas J. Koenig wrote:

> On Tue, 05 May 2009 22:20:17 -0700, Michael G Schwern <schwern@pobox.com> said:
>
> >> Tina (via RT) wrote:
>
> >>> I found that in 5.10 and newer a regex begins to fail if a matching string
> >>> is longer than 32767.
>
> >> Thanks for the report. Confirmed on 5.10.0 on OS X.
>
> >> No bug in 5.8.8, 5.8.9, 5.6.2 or 5.5.5.
>
> I think it's a known issue but Dave will correct me. Bisect points at
>
> commit 40a8244
> Author: Dave Mitchell <davem@fdisolutions.com>
> Date: Fri Jun 16 23:25:51 2006 +0000
>
>   start turning regmatch() main loop into a FSM
>   also make BRANCH use the state stack rather than its own unwind struct

Ah, mea culpa :-(

I'm aware of at least one other >32767 issue which this may be the same as.
Anyway, it's on my "big list of things to fix for 5.10.1".

--
"Foul and greedy Dwarf - you have eaten the last candle."
  -- "Hordes of the Things", BBC Radio.

p5pRT commented May 28, 2009

From @nwc10

Dave notes:

looks to be another 5.10.0 regression involving regexes longer than 32767

p5pRT commented Jul 6, 2009

From @hvds

This looks to be a simple oversight. All tests pass here.

Hugo

Inline Patch
--- regexec.c.old	2009-03-22 16:02:09.000000000 +0000
+++ regexec.c	2009-07-06 15:18:27.000000000 +0100
@@ -4411,7 +4411,7 @@
 	case CURLYM:	/* /A{m,n}B/ where A is fixed-length */
 
 	    /* This is an optimisation of CURLYX that enables us to push
-	     * only a single backtracking state, no matter now many matches
+	     * only a single backtracking state, no matter how many matches
 	     * there are in {m,n}. It relies on the pattern being constant
 	     * length, with no parens to influence future backrefs
 	     */
@@ -4574,7 +4574,8 @@
 	case CURLYM_B_fail: /* just failed to match a B */
 	    REGCP_UNWIND(ST.cp);
 	    if (ST.minmod) {
-		if (ST.count == ARG2(ST.me) /* max */)
+		I32 max = ARG2(ST.me);
+		if (max != REG_INFTY && ST.count == max)
 		    sayNO;
 		goto curlym_do_A; /* try to match a further A */
 	    }
--- t/op/pat.t.old	2009-06-06 13:51:10.000000000 +0100
+++ t/op/pat.t	2009-07-06 15:31:32.000000000 +0100
@@ -13,7 +13,7 @@
 
 $| = 1;
 
-my $EXPECTED_TESTS = 4061;  # Update this when adding/deleting tests.
+my $EXPECTED_TESTS = 4065;  # Update this when adding/deleting tests.
 
 BEGIN {
     chdir 't' if -d 't';
@@ -4346,6 +4346,21 @@
             iseq($str, "\$1 = undef, \$2 = undef, \$3 = undef, \$4 = undef, \$5 = undef, \$^R = undef");
        }
     }
+
+    {
+	local $BugId = 65372;	# minimal CURLYM limited to 32767 matches
+	my @pat = (
+	    qr{a(x|y)*b},	# CURLYM
+	    qr{a(x|y)*?b},	# .. with minmod
+	    qr{a([wx]|[yz])*b},	# .. and without tries
+	    qr{a([wx]|[yz])*?b},
+	);
+	my $len = 32768;
+	my $s = join '', 'a', 'x' x $len, 'b';
+	for my $pat (@pat) {
+	    ok($s =~ $pat, $pat);
+	}
+    }
     #
     # This should be the last test.
     #
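
The regexec.c hunk above can be illustrated in isolation. The following is only a rough sketch, not code from perl itself: `SKETCH_INFTY` stands in for `REG_INFTY` and is assumed to be 32767 (which fits the limit reported in this ticket), and `old_must_fail`, `new_must_fail` and `minimal_match` are hypothetical names made up for the illustration. It shows how treating the "no upper bound" sentinel as a real maximum makes a minimal match give up after 32767 iterations, and how the extra `max != REG_INFTY` test avoids that.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: stand-in for perl's REG_INFTY, the value stored as the
 * maximum of an open-ended quantifier such as "*". */
#define SKETCH_INFTY 32767

/* Pre-patch check: stop as soon as count equals max, even when max is
 * only the "unbounded" sentinel. */
static bool old_must_fail(int count, int max)
{
    return count == max;
}

/* Patched check: a sentinel max means there is no upper bound. */
static bool new_must_fail(int count, int max)
{
    return max != SKETCH_INFTY && count == max;
}

/* Hypothetical driver for a minimal (non-greedy) match: B can only
 * match after `needed` copies of A have been consumed. */
static bool minimal_match(int needed, int max, bool (*must_fail)(int, int))
{
    int count = 0;
    while (count < needed) {        /* B just failed to match here */
        if (must_fail(count, max))
            return false;           /* corresponds to sayNO */
        count++;                    /* match one further A */
    }
    return true;                    /* B matched */
}
```

With the old check, `minimal_match(32767, SKETCH_INFTY, old_must_fail)` succeeds and `minimal_match(32768, SKETCH_INFTY, old_must_fail)` fails, mirroring the "matched" / "didn't match" output in the original report. With the new check, both succeed, while a genuine bound such as `{0,3}` is still enforced.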

p5pRT commented Jul 6, 2009

From @Tux

On Mon, 06 Jul 2009 15:45:12 +0100, hv@crypt.org wrote:

> This looks to be a simple oversight. All tests pass here.

I trust you on this :)
Applying: Regex fails when string is too long
Thanks, patch successfully applied as 84d2fa1

--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using & porting perl 5.6.2, 5.8.x, 5.10.x, 5.11.x on HP-UX 10.20, 11.00,
11.11, 11.23, and 11.31, OpenSuSE 10.3, 11.0, and 11.1, AIX 5.2 and 5.3.
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented Jul 8, 2009

bitcard@profvince.com - Status changed from 'open' to 'resolved'
