Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(a|b)* cant specify a string match larger than 2**16-1 chars long. #6063

Closed
p5pRT opened this issue Nov 8, 2002 · 34 comments
Closed

(a|b)* cant specify a string match larger than 2**16-1 chars long. #6063

p5pRT opened this issue Nov 8, 2002 · 34 comments

Comments

@p5pRT
Copy link

p5pRT commented Nov 8, 2002

Migrated from rt.perl.org#18268 (status was 'open')

Searchable as RT18268$

@p5pRT
Copy link
Author

p5pRT commented Nov 8, 2002

From edi@agharta.de

Created by edi@agharta.de

edi@​bird​:~ > perl -le '$_="x" . ("a" x (2 ** 15 - 1)) . "y"; print (/x(bc|a)*y/ ? "yes" : "no");'
yes
edi@​bird​:~ > perl -le '$_="x" . ("a" x (2 ** 15)) . "y"; print (/x(bc|a)*y/ ? "yes" : "no");'
no

I've seen this with Perl 5.005, 5.6.0, 5.6.1, and 5.8.0 on various
Linux 2.4.x machines and on FreeBSD 4.5. With Linux 2.2.x and Solaris
5.8 (both Perl 5.6.1) I got core dumps instead. (I think I like the
core dumps a little bit more. At least I'm not getting a wrong
result... :)

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl v5.6.1:

Configured by root at Tue Aug  6 13:27:19 CEST 2002.

Summary of my perl5 (revision 5.0 version 6 subversion 1) configuration:
  Platform:
    osname=linux, osvers=2.4.19-gentoo-r5, archname=i686-linux
    uname='linux bird.agharta.de 2.4.19-gentoo-r5 #3 tue aug 6 10:40:19 cest 2002 i686 genuineintel '
    config_args='-des -Dprefix=/usr -Darchname=i686-linux -Duselargefiles -Dd_dosuid -Dlocincpth=  -Dd_semctl_semun -Di_gdbm -Ui_db -Ui_ndbm'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=undef d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-march=i686 -O3 -pipe',
    cppflags='-fno-strict-aliasing'
    ccversion='', gccversion='2.95.3 20010315 (release)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lndbm -lgdbm -ldb -ldl -lm -lc -lcrypt -lutil
    perllibs=-lnsl -ldl -lm -lc -lcrypt -lutil
    libc=/lib/libc-2.2.5.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    


@INC for perl v5.6.1:
    /usr/lib/site_perl/5.6.1/i686-linux
    /usr/lib/site_perl/5.6.1
    /usr/lib/perl5/5.6.1/i686-linux
    /usr/lib/perl5/5.6.1
    /usr/lib/perl5/site_perl/5.6.1/i686-linux
    /usr/lib/perl5/site_perl/5.6.1
    /usr/lib/perl5/site_perl
    .


Environment for perl v5.6.1:
    HOME=/home/edi
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/usr/local/lib:
    LOGDIR (unset)
    PATH=/usr/local/bin:/home/edi/.bin:/opt/opera/bin:/opt/scl/bin:/usr/kde/3/bin:/bin:/usr/bin:/usr/local/bin:/opt/Acrobat5:/opt/opera/bin:/opt/RealPlayer8:/usr/X11R6/bin:/opt/sun-jdk-1.4.0/bin:/opt/sun-jdk-1.4.0/jre/bin:/usr/qt/3/bin:/usr/kde/3/bin
    PERL5LIB=/usr/lib/site_perl/5.6.1
    PERL_BADLANG (unset)
    SHELL=/bin/bash


@p5pRT
Copy link
Author

p5pRT commented Nov 11, 2002

From @eserte

"edi@​agharta.de (via RT)" <perlbug@​perl.org> writes​:

# New Ticket Created by edi@​agharta.de
# Please include the string​: [perl #18268]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt2/Ticket/Display.html?id=18268 >

This is a bug report for perl from edi@​agharta.de,
generated with the help of perlbug 1.33 running under perl v5.6.1.

-----------------------------------------------------------------
[Please enter your report here]

edi@​bird​:~ > perl -le '$_="x" . ("a" x (2 ** 15 - 1)) . "y"; print (/x(bc|a)*y/ ? "yes" : "no");'
yes
edi@​bird​:~ > perl -le '$_="x" . ("a" x (2 ** 15)) . "y"; print (/x(bc|a)*y/ ? "yes" : "no");'
no

I've seen this with Perl 5.005, 5.6.0, 5.6.1, and 5.8.0 on various
Linux 2.4.x machines and on FreeBSD 4.5. With Linux 2.2.x and Solaris
5.8 (both Perl 5.6.1) I got core dumps instead. (I think I like the
core dumps a little bit more. At least I'm not getting a wrong
result... :)

Well, you can trigger the core dump on all systems by running with

  limits -s 1M perl ...

Nevertheless, running the samples with -Dr suggests that (bc|a)* is
translated into something like (bc|a){0,32767}, and thus causing the
wrong answer. You can workaround by using ((bc|a)*)*, but the enormous
stack usage is still there. I think this is a job for Hugo and his
planned optimizations on the rx engine :-)

Regards,
  slaven

--
Slaven Rezic - slaven.rezic@​berlin.de

  tktimex - project time manager
  http​://sourceforge.net/projects/ptktools/

@p5pRT
Copy link
Author

p5pRT commented Nov 12, 2002

From edi@agharta.de

Slaven Rezic (via RT) <perlbug@​perl.org> writes​:

"edi@​agharta.de (via RT)" <perlbug@​perl.org> writes​:

# New Ticket Created by edi@​agharta.de
# Please include the string​: [perl #18268]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt2/Ticket/Display.html?id=18268 >

This is a bug report for perl from edi@​agharta.de,
generated with the help of perlbug 1.33 running under perl v5.6.1.

-----------------------------------------------------------------
[Please enter your report here]

edi@​bird​:~ > perl -le '$_="x" . ("a" x (2 ** 15 - 1)) . "y"; print (/x(bc|a)*y/ ? "yes" : "no");'
yes
edi@​bird​:~ > perl -le '$_="x" . ("a" x (2 ** 15)) . "y"; print (/x(bc|a)*y/ ? "yes" : "no");'
no

I've seen this with Perl 5.005, 5.6.0, 5.6.1, and 5.8.0 on various
Linux 2.4.x machines and on FreeBSD 4.5. With Linux 2.2.x and Solaris
5.8 (both Perl 5.6.1) I got core dumps instead. (I think I like the
core dumps a little bit more. At least I'm not getting a wrong
result... :)

Well, you can trigger the core dump on all systems by running with

limits \-s 1M perl \.\.\.

Nevertheless, running the samples with -Dr suggests that (bc|a)* is
translated into something like (bc|a){0,32767}, and thus causing the
wrong answer. You can workaround by using ((bc|a)*)*, but the enormous
stack usage is still there. I think this is a job for Hugo and his
planned optimizations on the rx engine :-)

Regards,
slaven

Thanks!

Edi.

@p5pRT
Copy link
Author

p5pRT commented Nov 17, 2006

From @demerphq

Hi,

Your bugreport is actually two different bugs.

The stack overflow/segv has been resolved already.

The other part, that of the star quantifier not matching strings longer
than 2**15-1 is still open, but I dont see what can be done about it so
I'm marking the ticket as stalled for the time being.

D​:\dev\perl\ver\myblead\win32>..\perl -le "$_='x' . ('a' x (2 ** 15)) .
'y'; print (/x(bc|a)*y/ ? 'yes' : 'no');"
no

D​:\dev\perl\ver\myblead\win32>..\perl -le "$_='x' . ('a' x (2 ** 15-1))
. 'y'; print (/x(bc|a)*y/ ? 'yes' : 'no');"
yes

Thanks for the report.

Cheers
Yves

@p5pRT
Copy link
Author

p5pRT commented Nov 17, 2006

@demerphq - Status changed from 'open' to 'stalled'

@p5pRT
Copy link
Author

p5pRT commented May 22, 2010

From terada@ice.uec.ac.jp

Created by terada@ice.uec.ac.jp

Here is a test case​:

#! /usr/bin/perl

$n = 32768;

$str = "{";
for($i=0; $i<$n; $i++){
  $str = $str . "X";
}
$str = $str . "}";

if($str =~ /^{((E.|[^}E])*)}/){
  print "$2\n";
} else {
  print "MATCH FAILED\n";
}

1;

I tried perl 5.8.8 and 5.13.1.

If $n <= 32767, the result is correct for both versions.
(The match succeeds.)

For larger $n (>= 32768), the bug appears​:
  On 5.8.8, Segmentation fault occurs.
  On 5.13.1, the match fails.

Perl Info

Flags:
    category=core
    severity=high

This perlbug was built using Perl v5.8.8 in the Red Hat build system.
It is being executed now by Perl v5.8.8 - Thu Aug 28 09:56:48 EDT 2008.

Site configuration information for perl v5.8.8:

Configured by Red Hat, Inc. at Thu Aug 28 09:56:48 EDT 2008.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.18-92.1.10.el5xen, archname=i386-linux-thread-multi
    uname='linux xenbuilder4.fedora.phx.redhat.com 2.6.18-92.1.10.el5xen #1 smp wed jul 23 04:11:52 edt 2008 i686 i686 i386 gnulinux '
    config_args='-des -Doptimize=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic -fasynchronous-unwind-tables -Dversion=5.8.8 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Darchname=i386-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_sethostent_r_proto -Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto -Ud_setservent_r_proto -Dinc_version_list=5.8.7 5.8.6 5.8.5 -Dscriptdir=/usr/bin'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic -fasynchronous-unwind-tables',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='4.1.2 20070925 (Red Hat 4.1.2-33)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
    perllibs=-lresolv -lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=/lib/libc-2.7.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.7'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic -fasynchronous-unwind-tables -L/usr/local/lib'

Locally applied patches:
    


@INC for perl v5.8.8:
    /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.7/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.6/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.5/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.8
    /usr/lib/perl5/site_perl/5.8.7
    /usr/lib/perl5/site_perl/5.8.6
    /usr/lib/perl5/site_perl/5.8.5
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.8/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.7/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.6/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.5/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.8
    /usr/lib/perl5/vendor_perl/5.8.7
    /usr/lib/perl5/vendor_perl/5.8.6
    /usr/lib/perl5/vendor_perl/5.8.5
    /usr/lib/perl5/vendor_perl
    /usr/lib/perl5/5.8.8/i386-linux-thread-multi
    /usr/lib/perl5/5.8.8
    .


Environment for perl v5.8.8:
    HOME=/home/terada
    LANG=ja_JP.UTF-8
    LANGUAGE (unset)
    LC_TIME=C
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/terada/FS/trace/work.logging/bin:/usr/java/default/bin:/usr/local/bin:/usr/bin:/bin:/sbin:/usr/sbin:/usr/X11R6/bin:/home/terada/FS/script:/home/terada/FS/binlinux
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Jun 15, 2010

From @Abigail

On Sat, May 22, 2010 at 01​:24​:10AM -0700, Terada Minoru wrote​:

# New Ticket Created by Terada Minoru
# Please include the string​: [perl #75258]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=75258 >

To​: perlbug@​perl.org
Subject​: regex match fails for long input (> 32768? chars)
Reply-To​: terada@​ice.uec.ac.jp
Message-Id​: <5.8.8_14971_1274515855@​kirin.pr.ice.uec.ac.jp>

This is a bug report for perl from terada@​ice.uec.ac.jp,
generated with the help of perlbug 1.35 running under perl v5.8.8.

-----------------------------------------------------------------
[Please enter your report here]

Here is a test case​:

#! /usr/bin/perl

$n = 32768;

$str = "{";
for($i=0; $i<$n; $i++){
$str = $str . "X";
}
$str = $str . "}";

if($str =~ /^{((E.|[^}E])*)}/){
print "$2\n";
} else {
print "MATCH FAILED\n";
}

1;

I tried perl 5.8.8 and 5.13.1.

If $n <= 32767, the result is correct for both versions.
(The match succeeds.)

For larger $n (>= 32768), the bug appears​:
On 5.8.8, Segmentation fault occurs.
On 5.13.1, the match fails.

Thank you for your report.

This is a known bug. Patterns of the form /(A|B)*/ are really
/(A|B){0,32767}/. One might argue that not giving a segfault is
an improvement, but I'm not so sure if silently giving the wrong
answer is better.

$ perl -Mre=debug -wce '/(A|B)*/'
Compiling REx "(A|B)*"
Final program​:
  1​: CURLYM[1] {0,32767} (15)
  5​: TRIE-EXACT[AB] (13)
  <A>
  <B>
  13​: SUCCEED (0)
  14​: NOTHING (15)
  15​: END (0)
minlen 0
-e syntax OK
Freeing REx​: "(A|B)*"
$

Abigail

@p5pRT
Copy link
Author

p5pRT commented Jun 15, 2010

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Oct 13, 2012

From @jkeenan

On Fri Nov 17 10​:31​:22 2006, demerphq wrote​:

Your bugreport is actually two different bugs.

The stack overflow/segv has been resolved already.

The other part, that of the star quantifier not matching strings longer
than 2**15-1 is still open, but I dont see what can be done about it so
I'm marking the ticket as stalled for the time being.

D​:\dev\perl\ver\myblead\win32>..\perl -le "$_='x' . ('a' x (2 ** 15)) .
'y'; print (/x(bc|a)*y/ ? 'yes' : 'no');"
no

D​:\dev\perl\ver\myblead\win32>..\perl -le "$_='x' . ('a' x (2 ** 15-1))
. 'y'; print (/x(bc|a)*y/ ? 'yes' : 'no');"
yes

Confirmed as still present in Perl 5.16.

#####
[p5p] 504 $ perl -le '$_= q{x} . (q{a} x (2 ** 15-1)) . q{y}; print
(/x(bc|a)*y/ ?q{yes} : q{no});'
yes
[p5p] 505 $ perl -le '$_= q{x} . (q{a} x (2 ** 15)) . q{y}; print
(/x(bc|a)*y/ ?q{yes} : q{no});'
no
#####

Hence, ticket remains stalled.

@p5pRT
Copy link
Author

p5pRT commented Oct 13, 2012

The RT System itself - Status changed from 'stalled' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Oct 14, 2012

From @jkeenan

On Fri Oct 12 17​:10​:17 2012, jkeenan wrote​:

On Fri Nov 17 10​:31​:22 2006, demerphq wrote​:

The other part, that of the star quantifier not matching strings longer
than 2**15-1 is still open, but I dont see what can be done about it so
I'm marking the ticket as stalled for the time being.

D​:\dev\perl\ver\myblead\win32>..\perl -le "$_='x' . ('a' x (2 ** 15)) .
'y'; print (/x(bc|a)*y/ ? 'yes' : 'no');"
no

D​:\dev\perl\ver\myblead\win32>..\perl -le "$_='x' . ('a' x (2 ** 15-1))
. 'y'; print (/x(bc|a)*y/ ? 'yes' : 'no');"
yes

Confirmed as still present in Perl 5.16.

#####
[p5p] 504 $ perl -le '$_= q{x} . (q{a} x (2 ** 15-1)) . q{y}; print
(/x(bc|a)*y/ ?q{yes} : q{no});'
yes
[p5p] 505 $ perl -le '$_= q{x} . (q{a} x (2 ** 15)) . q{y}; print
(/x(bc|a)*y/ ?q{yes} : q{no});'
no
#####

Hence, ticket remains stalled.

(Well, RT changed its status back to open.)

Is this something that is potentially fixable? If not, should we just
document this as a limitation of the regex engine and move on?

Thank you very much.
Jim Keenan

@p5pRT
Copy link
Author

p5pRT commented Oct 14, 2012

From @hvds

"James E Keenan via RT" <perlbug-followup@​perl.org> wrote​:
:On Fri Oct 12 17​:10​:17 2012, jkeenan wrote​:
:> On Fri Nov 17 10​:31​:22 2006, demerphq wrote​:
:>
:> >
:> > The other part, that of the star quantifier not matching strings longer
:> > than 2**15-1 is still open, but I dont see what can be done about it so
:> > I'm marking the ticket as stalled for the time being.
:> >
:> > D​:\dev\perl\ver\myblead\win32>..\perl -le "$_='x' . ('a' x (2 ** 15)) .
:> > 'y'; print (/x(bc|a)*y/ ? 'yes' : 'no');"
:> > no
:> >
:> > D​:\dev\perl\ver\myblead\win32>..\perl -le "$_='x' . ('a' x (2 ** 15-1))
:> > . 'y'; print (/x(bc|a)*y/ ? 'yes' : 'no');"
:> > yes
:> >
:>
:> Confirmed as still present in Perl 5.16.
:>
:> #####
:> [p5p] 504 $ perl -le '$_= q{x} . (q{a} x (2 ** 15-1)) . q{y}; print
:> (/x(bc|a)*y/ ?q{yes} : q{no});'
:> yes
:> [p5p] 505 $ perl -le '$_= q{x} . (q{a} x (2 ** 15)) . q{y}; print
:> (/x(bc|a)*y/ ?q{yes} : q{no});'
:> no
:> #####
:>
:> Hence, ticket remains stalled.
:
:(Well, RT changed its status back to open.)
:
:Is this something that is potentially fixable? If not, should we just
:document this as a limitation of the regex engine and move on?

It is fixable, and I believe eventually it'll have to be fixed - as
computational power and size of typical strings increases, it is a
constraint that will seem increasingly restrictive.

I just don't know (quite) how, or when.

I believe that in principle it requires little more than changing the
definition of REG_INFTY. Unfortunately the "little" may involve a bit
of rearrangement of the regexp node structure.

I have long felt that the compilation process for regexps should construct
and optimize an AST before creating the final packed form (on the assumption
we'd still need that for its benefits of locality). One of the benefits of
that would be to reduce the engineering cost and the risk of rearchitecting
the packed form (as I suspect would be needed to extend the REG_INFTY limit),
by reducing the amount of code that needs to deal with it.

(Others include​: much more readable and maintainable code for compiling
and optimizing regexps; and an opportunity to get more insight into and
control over the optimizations applied.)

Hugo

@p5pRT
Copy link
Author

p5pRT commented Oct 15, 2012

From @demerphq

On 15 October 2012 00​:16, <hv@​crypt.org> wrote​:

"James E Keenan via RT" <perlbug-followup@​perl.org> wrote​:
:On Fri Oct 12 17​:10​:17 2012, jkeenan wrote​:
:> On Fri Nov 17 10​:31​:22 2006, demerphq wrote​:
:>
:> >
:> > The other part, that of the star quantifier not matching strings longer
:> > than 2**15-1 is still open, but I dont see what can be done about it so
:> > I'm marking the ticket as stalled for the time being.
:> >
:> > D​:\dev\perl\ver\myblead\win32>..\perl -le "$_='x' . ('a' x (2 ** 15)) .
:> > 'y'; print (/x(bc|a)*y/ ? 'yes' : 'no');"
:> > no
:> >
:> > D​:\dev\perl\ver\myblead\win32>..\perl -le "$_='x' . ('a' x (2 ** 15-1))
:> > . 'y'; print (/x(bc|a)*y/ ? 'yes' : 'no');"
:> > yes
:> >
:>
:> Confirmed as still present in Perl 5.16.
:>
:> #####
:> [p5p] 504 $ perl -le '$_= q{x} . (q{a} x (2 ** 15-1)) . q{y}; print
:> (/x(bc|a)*y/ ?q{yes} : q{no});'
:> yes
:> [p5p] 505 $ perl -le '$_= q{x} . (q{a} x (2 ** 15)) . q{y}; print
:> (/x(bc|a)*y/ ?q{yes} : q{no});'
:> no
:> #####
:>
:> Hence, ticket remains stalled.
:
:(Well, RT changed its status back to open.)
:
:Is this something that is potentially fixable? If not, should we just
:document this as a limitation of the regex engine and move on?

It is fixable, and I believe eventually it'll have to be fixed - as
computational power and size of typical strings increases, it is a
constraint that will seem increasingly restrictive.

I just don't know (quite) how, or when.

I believe that in principle it requires little more than changing the
definition of REG_INFTY. Unfortunately the "little" may involve a bit
of rearrangement of the regexp node structure.

I have long felt that the compilation process for regexps should construct
and optimize an AST before creating the final packed form (on the assumption
we'd still need that for its benefits of locality).

I agree. Ive even toyed with an implementation. But its a bear of a
task. At least at my C skill level.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Jan 14, 2013

From ina.cpan@gmail.com

Created by ina@cpan.org

The following regular expression(No.1) doesn't match.
(Is this a known problem?)

# a32768c_gabbc.pl
print "1..5\n";
$_ = ('A'x 32768).'C'; print /\G(A|BB)*C/ ? "ok - 1\n" : "not ok - 1\n";
$_ = ('A'x 32767).'C'; print /\G(A|BB)*C/ ? "ok - 2\n" : "not ok - 2\n";
$_ = ('A'x 32768).'C'; print /(A|BB)*C/ ? "ok - 3\n" : "not ok - 3\n";
$_ = ('A'x 32768).'C'; print /\G(A|B)*C/ ? "ok - 4\n" : "not ok - 4\n";
$_ = ('A'x 32768).'C'; print /\G(A|BB)*/ ? "ok - 5\n" : "not ok - 5\n";
__END__

$ ./perl a32768c_gabbc.pl
1..5
not ok - 1
ok - 2
ok - 3
ok - 4
ok - 5

Thanks,
INABA Hitoshi

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl 5.16.2:

Configured by hinaba at Tue Nov  6 06:22:20 JST 2012.

Summary of my perl5 (revision 5 version 16 subversion 2) configuration:

  Platform:
    osname=linux, osvers=2.6.9-5.el, archname=x86_64-linux
    uname='linux greentea 2.6.9-5.el #1 wed jan 5 19:21:57 est 2005
x86_64 x86_64 x86_64 gnulinux '
    config_args='-des -Dprefix=/home/hinaba/perl501602'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -I/usr/local/include
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='3.4.3 20041212 (Red Hat 3.4.3-9.EL4)',
gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64 /usr/local/lib64
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.3.4.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.3.4'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib'

Locally applied patches:



@INC for perl 5.16.2:
    /home/hinaba/perl501602/lib/site_perl/5.16.2/x86_64-linux
    /home/hinaba/perl501602/lib/site_perl/5.16.2
    /home/hinaba/perl501602/lib/5.16.2/x86_64-linux
    /home/hinaba/perl501602/lib/5.16.2
    .


Environment for perl 5.16.2:
    HOME=/home/hinaba
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/kerberos/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/hinaba/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Jan 14, 2013

From @Tux

On Mon, 14 Jan 2013 05​:27​:20 -0800, ina cpan (via RT)
<perlbug-followup@​perl.org> wrote​:

The following regular expression(No.1) doesn't match.
(Is this a known problem?)

If you'd have used -w or 'use warnings;', you might have seen

Complex regular subexpression recursion limit (32766) exceeded at a32768c_gabbc.pl line 2.

which imho indicates that this is a known problem

# a32768c_gabbc.pl
print "1..5\n";
$_ = ('A'x 32768).'C'; print /\G(A|BB)*C/ ? "ok - 1\n" : "not ok - 1\n";
$_ = ('A'x 32767).'C'; print /\G(A|BB)*C/ ? "ok - 2\n" : "not ok - 2\n";
$_ = ('A'x 32768).'C'; print /(A|BB)*C/ ? "ok - 3\n" : "not ok - 3\n";
$_ = ('A'x 32768).'C'; print /\G(A|B)*C/ ? "ok - 4\n" : "not ok - 4\n";
$_ = ('A'x 32768).'C'; print /\G(A|BB)*/ ? "ok - 5\n" : "not ok - 5\n";
__END__

$ ./perl a32768c_gabbc.pl
1..5
not ok - 1
ok - 2
ok - 3
ok - 4
ok - 5

Thanks,
INABA Hitoshi

--
H.Merijn Brand http​://tux.nl Perl Monger http​://amsterdam.pm.org/
using perl5.00307 .. 5.17 porting perl5 on HP-UX, AIX, and openSUSE
http​://mirrors.develooper.com/hpux/ http​://www.test-smoke.org/
http​://qa.perl.org http​://www.goldmark.org/jeff/stupid-disclaimers/

@p5pRT
Copy link
Author

p5pRT commented Jan 14, 2013

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Jan 14, 2013

From @khwilliamson

On 01/14/2013 07​:02 AM, H.Merijn Brand wrote​:

On Mon, 14 Jan 2013 05​:27​:20 -0800, ina cpan (via RT)
<perlbug-followup@​perl.org> wrote​:

The following regular expression(No.1) doesn't match.
(Is this a known problem?)

If you'd have used -w or 'use warnings;', you might have seen

Complex regular subexpression recursion limit (32766) exceeded at a32768c_gabbc.pl line 2.

which imho indicates that this is a known problem

# a32768c_gabbc.pl
print "1..5\n";
$_ = ('A'x 32768).'C'; print /\G(A|BB)*C/ ? "ok - 1\n" : "not ok - 1\n";
$_ = ('A'x 32767).'C'; print /\G(A|BB)*C/ ? "ok - 2\n" : "not ok - 2\n";
$_ = ('A'x 32768).'C'; print /(A|BB)*C/ ? "ok - 3\n" : "not ok - 3\n";
$_ = ('A'x 32768).'C'; print /\G(A|B)*C/ ? "ok - 4\n" : "not ok - 4\n";
$_ = ('A'x 32768).'C'; print /\G(A|BB)*/ ? "ok - 5\n" : "not ok - 5\n";
__END__

$ ./perl a32768c_gabbc.pl
1..5
not ok - 1
ok - 2
ok - 3
ok - 4
ok - 5

Thanks,
INABA Hitoshi

I believe this covers it
https://rt-archive.perl.org/perl5/Public/Bug/Display.html?id=18268
but am not sure

@p5pRT
Copy link
Author

p5pRT commented Jan 14, 2013

From @iabyn

On Mon, Jan 14, 2013 at 03​:02​:01PM +0100, H.Merijn Brand wrote​:

On Mon, 14 Jan 2013 05​:27​:20 -0800, ina cpan (via RT)
<perlbug-followup@​perl.org> wrote​:

The following regular expression(No.1) doesn't match.
(Is this a known problem?)

If you'd have used -w or 'use warnings;', you might have seen

Complex regular subexpression recursion limit (32766) exceeded at a32768c_gabbc.pl line 2.

which imho indicates that this is a known problem

Yeah. Since making the regex engine non-recursive, its silly that we still
have "infinity" for some reg op types set to 32767.

--
I've often wanted to drown my troubles, but I can't get my wife to go
swimming.

@p5pRT
Copy link
Author

p5pRT commented Jan 14, 2013

From @demerphq

On 14 January 2013 17​:51, Dave Mitchell <davem@​iabyn.com> wrote​:

On Mon, Jan 14, 2013 at 03​:02​:01PM +0100, H.Merijn Brand wrote​:

On Mon, 14 Jan 2013 05​:27​:20 -0800, ina cpan (via RT)
<perlbug-followup@​perl.org> wrote​:

The following regular expression(No.1) doesn't match.
(Is this a known problem?)

If you'd have used -w or 'use warnings;', you might have seen

Complex regular subexpression recursion limit (32766) exceeded at a32768c_gabbc.pl line 2.

which imho indicates that this is a known problem

Yeah. Since making the regex engine non-recursive, its silly that we still
have "infinity" for some reg op types set to 32767.

Agreed on the silliness, but is fixing this practical?

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Jan 14, 2013

From @iabyn

On Mon, Jan 14, 2013 at 06​:35​:33PM +0100, demerphq wrote​:

On 14 January 2013 17​:51, Dave Mitchell <davem@​iabyn.com> wrote​:

On Mon, Jan 14, 2013 at 03​:02​:01PM +0100, H.Merijn Brand wrote​:

On Mon, 14 Jan 2013 05​:27​:20 -0800, ina cpan (via RT)
<perlbug-followup@​perl.org> wrote​:

The following regular expression(No.1) doesn't match.
(Is this a known problem?)

If you'd have used -w or 'use warnings;', you might have seen

Complex regular subexpression recursion limit (32766) exceeded at a32768c_gabbc.pl line 2.

which imho indicates that this is a known problem

Yeah. Since making the regex engine non-recursive, its silly that we still
have "infinity" for some reg op types set to 32767.

Agreed on the silliness, but is fixing this practical?

I haven't looked closely at it, but I'm assuming its "just" a case of
storing 32-bit int's rather than 16-bit ones for {m,n}. Then the
"infinity" of '*', a.k.a {0,Inf}, is less silly. Then fix up regexec.c
for anywhere where it assumes inf == 32768.

However, I've never done much work on the compilation side of regexes, and
I don't know how easy it is to make a node have 32-bit ranges.

--
"Foul and greedy Dwarf - you have eaten the last candle."
  -- "Hordes of the Things", BBC Radio.

@p5pRT
Copy link
Author

p5pRT commented Jan 23, 2013

From @iabyn

On Wed, Jan 16, 2013 at 09​:54​:33PM +0900, ina cpan wrote​:

If you'd have used -w or 'use warnings;', you might have seen

Complex regular subexpression recursion limit (32766) exceeded at
a32768c_gabbc.pl line 2.

which imho indicates that this is a known problem

Yeah. Since making the regex engine non-recursive, its silly that we
still have "infinity" for some reg op types set to 32767.

Agreed on the silliness, but is fixing this practical?

Yves

This idiom is used when performing Multiple-Byte Anchoring processing.
(Appendix W​: Perl Code Examples of CJKV Information Processing, O'REILLY)
For example, in Shift_JIS encoding, it is as follows.

if (m/\G([\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]|[\x00-\xFF])*?Foo/) {
...
}
__END__

The length of the data which is the target of processing often exceeds
32767 octets.

Note that the bug is not actually related to \G, but rather to X* matching
more than 32767 characters.

This is ok even though the \G matches more than 32K into the string​:

  $_ = ('A'x 40000).'C';
  pos = 33000;
  print /\G(A|BB)*C/ ? "ok - 1\n" : "not ok - 1\n";

while this fails, even with no \G​:

$_ = ('A'x 32768).'C'; print /^(A|BB)*C/ ? "ok - 1\n" : "not ok - 1\n";

--
I thought I was wrong once, but I was mistaken.

@p5pRT
Copy link
Author

p5pRT commented Jan 24, 2013

From ina.cpan@gmail.com

2013/1/24, Dave Mitchell <davem@​iabyn.com>​:

Note that the bug is not actually related to \G, but rather to X* matching
more than 32767 characters.

This is ok even though the \G matches more than 32K into the string​:

$\_ = \('A'x 40000\)\.'C';
pos = 33000;
print /\\G\(A|BB\)\*C/ ? "ok \- 1\\n" : "not ok \- 1\\n";

while this fails, even with no \G​:

$_ = ('A'x 32768).'C'; print /^(A|BB)*C/ ? "ok - 1\n" : "not ok - 1\n";

--
I thought I was wrong once, but I was mistaken.

Isn't the bug(limitation) related to anchoring not only /\G/ ?

# a32768c_gabbc-2.pl
print "1..6\n";

$_ = ('A'x (33000+32767)).'C';
pos = 33000;
print /\G(A|BB)*C/ ? "ok - 1\n" : "not ok - 1\n";

$_ = ('A'x (33000+32768)).'C';
pos = 33000;
print /\G(A|BB)*C/ ? "ok - 2\n" : "not ok - 2\n";

$_ = ('A'x 32767).'C'; print /^(A|BB)*C/ ? "ok - 3\n" : "not ok - 3\n";
$_ = ('A'x 32768).'C'; print /^(A|BB)*C/ ? "ok - 4\n" : "not ok - 4\n";
$_ = ('A'x 32767).'C'; print /(A|BB)*C/ ? "ok - 5\n" : "not ok - 5\n";
$_ = ('A'x 32768).'C'; print /(A|BB)*C/ ? "ok - 6\n" : "not ok - 6\n";

__END__
- I thought I was wrong once, but I was wrong twice.

@p5pRT
Copy link
Author

p5pRT commented Jan 24, 2013

From @iabyn

On Thu, Jan 24, 2013 at 09​:43​:40AM +0900, ina cpan wrote​:

Isn't the bug(limitation) related to anchoring not only /\G/ ?

No, the bug is exactly that '*' is being treated as equivalent to
{0,32767} rather than {0,infinity}. The example you showed without
anchoring simply meant that after the match against the beginning of the
string failed (because the string was > 32767 chars), it backtracked,
tried the match at position 1, and succeeded.

So the bug is completed unrelated to anchoring (with ^ or \G etc), except
that the anchoring may trigger a mismatch due to the unachievable
requirement of 32768 chars needing to be matched.

If you run your code with warnings enabled, you'll see this output, showing
that the * bug is being hit each time​:

1..6
Complex regular subexpression recursion limit (32766) exceeded at /tmp/p line 7.
ok - 1
Complex regular subexpression recursion limit (32766) exceeded at /tmp/p line 11.
not ok - 2
Complex regular subexpression recursion limit (32766) exceeded at /tmp/p line 13.
ok - 3
Complex regular subexpression recursion limit (32766) exceeded at /tmp/p line 14.
not ok - 4
Complex regular subexpression recursion limit (32766) exceeded at /tmp/p line 15.
ok - 5
Complex regular subexpression recursion limit (32766) exceeded at /tmp/p line 16.
ok - 6

--
Standards (n). Battle insignia or tribal totems.

@p5pRT
Copy link
Author

p5pRT commented Jul 27, 2013

From @cpansprout

On Sun Oct 14 16​:06​:33 2012, hv wrote​:

It is fixable, and I believe eventually it'll have to be fixed - as
computational power and size of typical strings increases, it is a
constraint that will seem increasingly restrictive.

I just don't know (quite) how, or when.

Hopefully soon.

I’m trying to write tests for perl’s anchored string optimisation with
large strings. But I’m having to write things like this​:

(?​:(?​:.{32766}){32766}){2}(?​:.{32766}){8}.{8}

This is a bit annoying. :-)

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Feb 27, 2015

From @jkeenan

On Sat Jul 27 00​:03​:28 2013, sprout wrote​:

On Sun Oct 14 16​:06​:33 2012, hv wrote​:

It is fixable, and I believe eventually it'll have to be fixed - as
computational power and size of typical strings increases, it is a
constraint that will seem increasingly restrictive.

I just don't know (quite) how, or when.

Hopefully soon.

I’m trying to write tests for perl’s anchored string optimisation with
large strings. But I’m having to write things like this​:

(?​:(?​:.{32766}){32766}){2}(?​:.{32766}){8}.{8}

This is a bit annoying. :-)

Has anyone been able to make any progress on this issue?

In 5.20.1​:

#####
$ perl -wle '$_="x" . ("a" x (2 ** 15 - 1)) . "y"; print (/x(bc|a)*y/ ? "yes" : "no");'
Complex regular subexpression recursion limit (32766) exceeded at -e line 1.
yes

$ perl -wle '$_="x" . ("a" x (2 ** 15 )) . "y"; print (/x(bc|a)*y/ ? "yes" : "no");'
Complex regular subexpression recursion limit (32766) exceeded at -e line 1.
#####
--
James E Keenan (jkeenan@​cpan.org)

@p5pRT
Copy link
Author

p5pRT commented Feb 28, 2015

From @shlomif

On Fri Feb 27 12​:08​:26 2015, jkeenan wrote​:

On Sat Jul 27 00​:03​:28 2013, sprout wrote​:

On Sun Oct 14 16​:06​:33 2012, hv wrote​:

It is fixable, and I believe eventually it'll have to be fixed - as
computational power and size of typical strings increases, it is a
constraint that will seem increasingly restrictive.

I just don't know (quite) how, or when.

Hopefully soon.

I’m trying to write tests for perl’s anchored string optimisation
with
large strings. But I’m having to write things like this​:

(?​:(?​:.{32766}){32766}){2}(?​:.{32766}){8}.{8}

This is a bit annoying. :-)

Has anyone been able to make any progress on this issue?

In 5.20.1​:

#####
$ perl -wle '$_="x" . ("a" x (2 ** 15 - 1)) . "y"; print (/x(bc|a)*y/
? "yes" : "no");'
Complex regular subexpression recursion limit (32766) exceeded at -e
line 1.
yes

$ perl -wle '$_="x" . ("a" x (2 ** 15 )) . "y"; print (/x(bc|a)*y/
? "yes" : "no");'
Complex regular subexpression recursion limit (32766) exceeded at -e
line 1.
#####

I should note that it also happens with a non-capturing group​:

<SHELL>

shlomif@​telaviv1​:~$ perl -wle '$_="x" . ("a" x (2 ** 15 -1 )) . "y"; print (/x(?​:bc|a)*y/ ? "yes" : "no");'
Complex regular subexpression recursion limit (32766) exceeded at -e line 1.
yes
shlomif@​telaviv1​:~$ perl -wle '$_="x" . ("a" x (2 ** 15 )) . "y"; print (/x(?​:bc|a)*y/ ? "yes" : "no");'
Complex regular subexpression recursion limit (32766) exceeded at -e line 1.
no
shlomif@​telaviv1​:~$

</SHELL>

Regards,

-- Shlomi Fish

@p5pRT
Copy link
Author

p5pRT commented Mar 26, 2019

From @khwilliamson

The current state of perl is that the limit in 5.30 has been doubled so you don't hit the recursion limit until 2**16. Also, some parts of this ticket (and those merged to it) have been fixed, so that if you give an unbounded quantifier, it actually uses the machine platform limit. If you do specify an upper bound, it must a max of 2**16-1.

--
Karl Williamson

@p5pRT p5pRT added the khw label Oct 25, 2019
@toddr toddr removed the khw label Oct 25, 2019
@khwilliamson khwilliamson changed the title (a|b)* cant match a string larger than 2**15-1 chars long. (a|b)* cant specify a string match larger than 2**16-1 chars long. May 31, 2021
@khwilliamson
Copy link
Contributor

I looked at this, and it's not that hard to fix. There are several options.

One is to press into service the currently unused FLAGS field in the CURLY regnodes if the high bit in the 16 bit argument is set (That field may be used when compiling the pattern; I would need to look further.) . That would allow us 23 bits. Or we could create new regnodes which parallel the existing ones and are used only if the max quantifier is above 16 bits. That would give us 24 bits.

Or we could instead create parallel regnodes that have 32 bit arguments if the input called for that. With the FLAGS field that would give us 39 bits.

But how important is it fo fix this 20 year old bug? We should decide. close it or fix it. (I'm willing to do the work, if we deem it worthy of doing

@khwilliamson
Copy link
Contributor

For that matter, the simplest fix is just to make each of the regnodes that hold these to use 32 bits. That would increase the size of any pattern that used {m,n} quantifiers.

Something else that could be done is to instead of storing {m,n} in the regnode, store {m, n-m} so that the delta has to be <65K

@demerphq
Copy link
Collaborator

demerphq commented Apr 25, 2022 via email

@demerphq
Copy link
Collaborator

demerphq commented Oct 11, 2022 via email

@khwilliamson
Copy link
Contributor

@DMQ I believe this is now fixed.

@khwilliamson
Copy link
Contributor

Whoops, @demerphq Is this closable?

@demerphq
Copy link
Collaborator

Fixed in 0678333

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants