Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can exceed regex complex recursion limit #8369

Closed
p5pRT opened this issue Mar 12, 2006 · 22 comments · Fixed by #19636
Closed

Can exceed regex complex recursion limit #8369

p5pRT opened this issue Mar 12, 2006 · 22 comments · Fixed by #19636

Comments

@p5pRT
Copy link

p5pRT commented Mar 12, 2006

Migrated from rt.perl.org#38717 (status was 'stalled')

Searchable as RT38717$

@p5pRT
Copy link
Author

p5pRT commented Mar 12, 2006

From @shlomif

Created by @shlomif

The following program segfaults after printing "A" and before printing "B"​:

<<<<<<<<

use strict;
use warnings;
my $string = qq{'} . "hello" x 10_000;
print "A\n";
$string =~ s{a|'(?​:\\.|[^'\\])*'}{};
print "B\n";

This code was adapted from a more complex example that was discovered by
Yossi Itzkovich​:

http​://perl.org.il/pipermail/perl/2006-March/007695.html

So it will be a good idea to see that the segfault there does not happen
either.

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl v5.8.7:

Configured by Mandriva at Tue Dec  6 10:19:22 MST 2005.

Summary of my perl5 (revision 5 version 8 subversion 7) configuration:
  Platform:
    osname=linux, osvers=2.6.3-28mdk-i686-up-4gb, archname=i386-linux
    uname='linux mercury.mandriva.com 2.6.3-28mdk-i686-up-4gb #1 thu sep 29 11:26:42 mdt 2005 i686 intel(r) pentium(r) 4 cpu 3.00ghz unknown gnulinux '
    config_args='-des -Dinc_version_list=5.8.6 5.8.6/i386-linux 5.8.5 5.8.4 5.8.3 5.8.2 5.8.1 5.8.0 5.6.1 5.6.0 -Darchname=i386-linux -Dcc=gcc -Doptimize=-O2  -pipe -Wp,-D_FORTIFY_SOURCE=2 -fomit-frame-pointer -march=i586 -mtune=pentiumpro -Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr -Dsitebin=/usr/local/bin -Dsiteman1dir=/usr/local/share/man/man1 -Dsiteman3dir=/usr/local/share/man/man3 -Dman3ext=3pm -Dcf_by=Mandriva -Dmyhostname=localhost -Dperladmin=root@localhost -Dcf_email=root@localhost -Dd_dosuid -Ud_csh -Duseshrplib -Accflags=-DPERL_DISABLE_PMC'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-DPERL_DISABLE_PMC -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-O2 -pipe -Wp,-D_FORTIFY_SOURCE=2 -fomit-frame-pointer -march=i586 -mtune=pentiumpro',
    cppflags='-DPERL_DISABLE_PMC -fno-strict-aliasing -pipe -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='4.0.1 (4.0.1-5mdk for Mandriva Linux release 2006.0)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lndbm -lgdbm -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.3.5.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.3.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.8.7/i386-linux/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    Mandriva Linux patches (cf the source RPM)


@INC for perl v5.8.7:
    /usr/lib/perl5/5.8.7/i386-linux
    /usr/lib/perl5/5.8.7
    /usr/lib/perl5/site_perl/5.8.7/i386-linux
    /usr/lib/perl5/site_perl/5.8.7
    /usr/lib/perl5/site_perl/5.8.6
    /usr/lib/perl5/site_perl/5.8.6/i386-linux
    /usr/lib/perl5/site_perl/5.8.5
    /usr/lib/perl5/site_perl/5.8.3
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.7/i386-linux
    /usr/lib/perl5/vendor_perl/5.8.7
    /usr/lib/perl5/vendor_perl/5.8.6
    /usr/lib/perl5/vendor_perl/5.8.6/i386-linux
    /usr/lib/perl5/vendor_perl/5.8.5
    /usr/lib/perl5/vendor_perl/5.8.4
    /usr/lib/perl5/vendor_perl/5.8.3
    /usr/lib/perl5/vendor_perl/5.8.2
    /usr/lib/perl5/vendor_perl/5.8.1
    /usr/lib/perl5/vendor_perl/5.8.0
    /usr/lib/perl5/vendor_perl
    .


Environment for perl v5.8.7:
    HOME=/home/shlomi
    LANG=en_US.UTF-8
    LANGUAGE=en_US:en
    LC_ADDRESS=en_US.UTF-8
    LC_COLLATE=en_US.UTF-8
    LC_CTYPE=en_US.UTF-8
    LC_IDENTIFICATION=en_US.UTF-8
    LC_MEASUREMENT=en_US.UTF-8
    LC_MESSAGES=en_US.UTF-8
    LC_MONETARY=en_US.UTF-8
    LC_NAME=en_US.UTF-8
    LC_NUMERIC=en_US.UTF-8
    LC_PAPER=en_US.UTF-8
    LC_SOURCED=1
    LC_TELEPHONE=en_US.UTF-8
    LC_TIME=en_US.UTF-8
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/shlomi/apps/latemp/bin:/home/shlomi/apps/file/gringotts/bin:/home/shlomi/apps/gimageview/bin:/usr/local/apps/gimp-1.3.x/bin:/home/shlomi/apps/test/quadpres/bin:/usr/local/apps/svn-repos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/games:/usr/share/unsermake:/home/shlomi/bin:/usr/share/unsermake
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Mar 12, 2006

From @demerphq

On 3/12/06, via RT Shlomi Fish <perlbug-followup@​perl.org> wrote​:

# New Ticket Created by Shlomi Fish
# Please include the string​: [perl #38717]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=38717 >

This is a bug report for perl from shlomif@​iglu.org.il,
generated with the help of perlbug 1.35 running under perl v5.8.7.

-----------------------------------------------------------------
[Please enter your report here]

The following program segfaults after printing "A" and before printing "B"​:

<<<<<<<<

use strict;
use warnings;
my $string = qq{'} . "hello" x 10_000;
print "A\n";
$string =~ s{a|'(?​:\\.|[^'\\])*'}{};
print "B\n";

This code was adapted from a more complex example that was discovered by
Yossi Itzkovich​:

http​://perl.org.il/pipermail/perl/2006-March/007695.html

So it will be a good idea to see that the segfault there does not happen
either.

This is a well known issue with the regex engine. If you rewrite the
regex as the semantically equivelent

  $string =~ s{a|'(?​:[^'\\]+|\\.)*'}{};

then odds are that you wont see the segfault occur. When alternating
always try to get your alternatives to match as much as they possibly
can and if possible always make sure the option that will most likely
match the longest thing is first. Thus [^'\\] should change to [^'\\]+
and come first. Also I guess

  $string =~ s{a|'(?​:[^'\\]+|(?​:\\.)+)*'}{};

continues on in this theme, but given the general rarity of escaped
chars Id guess not much, or rather only in exceptional circumstances
like strings full of \0 or whatever.

In theory the above patterns are identical (they should produce the
same DFA) perls regex engine doesnt see them as the same and won't
optimise the one form into the other.


Maybe an optimisation hack would be if an alternation can't be trie'd
and is in a group that has a * or + quantifier wrap each element of
the alternation in a (?​:)+ unless it is already so wrapped (or matches
the empty string?)

yves

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Mar 12, 2006

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Mar 12, 2006

From @demerphq

On 3/12/06, demerphq <demerphq@​gmail.com> wrote​:

On 3/12/06, via RT Shlomi Fish <perlbug-followup@​perl.org> wrote​:

# New Ticket Created by Shlomi Fish
# Please include the string​: [perl #38717]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=38717 >

This is a bug report for perl from shlomif@​iglu.org.il,
generated with the help of perlbug 1.35 running under perl v5.8.7.

-----------------------------------------------------------------
[Please enter your report here]

The following program segfaults after printing "A" and before printing "B"​:

<<<<<<<<

use strict;
use warnings;
my $string = qq{'} . "hello" x 10_000;
print "A\n";
$string =~ s{a|'(?​:\\.|[^'\\])*'}{};
print "B\n";

This code was adapted from a more complex example that was discovered by
Yossi Itzkovich​:

http​://perl.org.il/pipermail/perl/2006-March/007695.html

So it will be a good idea to see that the segfault there does not happen
either.

This is a well known issue with the regex engine. If you rewrite the
regex as the semantically equivelent

$string =~ s{a|'(?​:[^'\\]+|\\.)*'}{};

then odds are that you wont see the segfault occur. When alternating
always try to get your alternatives to match as much as they possibly
can and if possible always make sure the option that will most likely
match the longest thing is first. Thus [^'\\] should change to [^'\\]+
and come first. Also I guess

$string =~ s{a|'(?​:[^'\\]+|(?​:\\.)+)*'}{};

continues on in this theme, but given the general rarity of escaped
chars Id guess not much, or rather only in exceptional circumstances
like strings full of \0 or whatever.

In theory the above patterns are identical (they should produce the
same DFA) perls regex engine doesnt see them as the same and won't
optimise the one form into the other.
---
Maybe an optimisation hack would be if an alternation can't be trie'd
and is in a group that has a * or + quantifier wrap each element of
the alternation in a (?​:)+ unless it is already so wrapped (or matches
the empty string?)

Although on second thought maybe the behaviour of later perls on your
chosen expression is actually better​:

Complex regular subexpression recursion limit (32766) exceeded at
c​:\tmp\p5p_re.pl line 5.
A
B

Wheras with the "optimisation" I mentioned above the re takes a
looooooong time to finish (iow i didnt bother letting it finish)
because it has to try both paths (the a or the quoted value) at every
position in the string, maybe immediate death due to segfault/stack
guards is preferable to almost never finish behaviour.

Also changing the pattern to two makes it finish instantly​:

s{'(?​:[^'\\]+|\\.)*'}{} || s{a}{}
  for $string;

As does using a zero width assertion​:

$string =~ s{(?=[a''])(?​:a|'(?​:[^''\\]+|(?​:\\.)+)*')}{};

seems to work nicely. :-)

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Mar 29, 2006

From @smpeters

[shlomif@​iglu.org.il - Sun Mar 12 05​:22​:22 2006]​:

This is a bug report for perl from shlomif@​iglu.org.il,
generated with the help of perlbug 1.35 running under perl v5.8.7.

-----------------------------------------------------------------
[Please enter your report here]

The following program segfaults after printing "A" and before printing
"B"​:

<<<<<<<<

use strict;
use warnings;
my $string = qq{'} . "hello" x 10_000;
print "A\n";
$string =~ s{a|'(?​:\\.|[^'\\])*'}{};
print "B\n";

This code was adapted from a more complex example that was discovered
by
Yossi Itzkovich​:

http​://perl.org.il/pipermail/perl/2006-March/007695.html

So it will be a good idea to see that the segfault there does not
happen
either.

With change #27598, the core dump has gone away, but a panic is still
generated.

steve@​kirk​:~/smoke/perl-current$ ./perl -Ilib rt_38717.pl
A
Complex regular subexpression recursion limit (32766) exceeded at
rt_38717.pl line 7.
B
steve@​kirk​:~/smoke/perl-current$ cat rt_38717.pl
#!perl

use strict;
use warnings;
my $string = qq{'} . "hello" x 10_000;
print "A\n";
$string =~ s{a|'(?​:\\.|[^'\\])*'}{};
print "B\n";

@p5pRT
Copy link
Author

p5pRT commented Jul 9, 2006

From roland@roland-illig.de

Created by roland@baccf5ee.roland-illig.de

The function S_regmatch calls itself recursively, resulting in a stack
overflow for long strings combined with certain regular expressions.

For example​:
("aaaaaaaaaa " x 8000) =~ qr"^((?​:\\#|[^#])*?)(?​:\s*(#.*))?$";

Note​: The regular expression splits a line from a Makefile into two
parts​: $1 will contain everything before the first "#" character, where
"\#" is interpreted as a "#" sign. $2 will contain the comment from that
line, including all leading white-space.

Perl Info

Flags:
    category=core
    severity=critical

Site configuration information for perl v5.8.8:

Configured by roland at Sun Jul  9 21:21:11 CEST 2006.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=netbsd, osvers=3.0_stable, archname=i386-netbsd-thread-multi
    uname='netbsd baccf5ee.roland-illig.de 3.0_stable netbsd 3.0_stable (generic) #0: sat jun 17 13:16:17 cest 2006 build@baccf5ee.roland-illig.de:homebuild32006-06i386worksysarchi386compilegeneric i386 '
    config_args='-sde -Darchname=i386-netbsd -Dcc=cc -Doptimize=-O2 -pthread -DDEBUGGING -Duseshrplib -Ui_malloc -Uusemymalloc -Uinstallusrbinperl -Dinstallstyle=lib/perl5 -Dprefix=/home/roland/pkg -Dsiteprefix=/home/roland/pkg -Dvendorprefix=/home/roland/pkg -Dscriptdir=/home/roland/pkg/lib/perl5/bin -Dsitescript=/home/roland/pkg/lib/perl5/site_perl/bin -Dvendorscript=/home/roland/pkg/lib/perl5/vendor_perl/bin -Dsitebin=/home/roland/pkg/lib/perl5/site_perl/bin -Dvendorbin=/home/roland/pkg/lib/perl5/vendor_perl/bin -Dprivlib=/home/roland/pkg/lib/perl5/5.8.0 -Dsitelib=/home/roland/pkg/lib/perl5/site_perl/5.8.0 -Dvendorlib=/home/roland/pkg/lib/perl5/vendor_perl/5.8.0 -Dsitelib_stem=/home/roland/pkg/lib/perl5/site_perl -Dvendorlib_stem=/home/roland/pkg/lib/perl5/vendor_perl -Dman1ext=1 -Dman1dir=/home/roland/pkg/lib/perl5/man/man1 -Dsiteman1dir=/home/roland/pkg/lib/perl5/site_perl/man/man1 -Dvendorman1dir=/home/roland/pkg/lib/perl5/vendor_perl/man/man1 -Dman3ext=3 -Dman3dir=/hom
 e/roland/pkg/lib/perl5/man/man3 -Dsiteman3dir=/home/roland/pkg/lib/perl5/site_perl/man/man3 -Dvendorman3dir=/home/roland/pkg/lib/perl5/vendor_perl/man/man3 -Daphostname=/bin/hostname -Dln=/bin/ln -Dsed=/usr/bin/sed -Dsh=/bin/sh -Dissymlink=test -h -Duseithreads -Dlibswanted=m crypt '
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -I/home/roland/pkg/include',
    optimize='-O2 -pthread -DDEBUGGING',
    cppflags='-fno-strict-aliasing -pipe -I/home/roland/pkg/include'
    ccversion='', gccversion='3.3.3 (NetBSD nb3 20040520)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags ='-Wl,-R/home/roland/pkg/lib  -L/home/roland/pkg/lib'
    libpth=/usr/lib /home/roland/pkg/lib
    libs=-lm -lcrypt -lpthread
    perllibs=-lm -lcrypt -lpthread
    libc=/lib/libc.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E  -Wl,-R/home/roland/pkg/lib/perl5/5.8.0/i386-netbsd-thread-multi/CORE'
    cccdlflags='-DPIC -fPIC ', lddlflags='-Wl,-R/home/roland/pkg/lib --whole-archive -shared  -L/home/roland/pkg/lib'

Locally applied patches:
    


@INC for perl v5.8.8:
    /home/roland/pkg/lib/perl5/site_perl/5.8.0/i386-netbsd-thread-multi
    /home/roland/pkg/lib/perl5/site_perl/5.8.0
    /home/roland/pkg/lib/perl5/site_perl
    /home/roland/pkg/lib/perl5/vendor_perl/5.8.0/i386-netbsd-thread-multi
    /home/roland/pkg/lib/perl5/vendor_perl/5.8.0
    /home/roland/pkg/lib/perl5/vendor_perl
    /home/roland/pkg/lib/perl5/5.8.0/i386-netbsd-thread-multi
    /home/roland/pkg/lib/perl5/5.8.0
    .


Environment for perl v5.8.8:
    HOME=/home/roland
    LANG (unset)
    LANGUAGE (unset)
    LC_CTYPE=de_DE.ISO8859-1
    LD_LIBRARY_PATH=/usr/pkg/lib:/usr/X11R6/lib
    LOGDIR (unset)
    PATH=/home/roland/bin:/home/roland/usr.local/bin:/home/roland/usr.local/sbin:/home/roland/pkg/bin:/home/roland/pkg/sbin:/home/roland/pkg/java/blackdown-1.3.1/bin:/usr/local/bin:/usr/local/bin:/usr/pkg/bin:/usr/pkg/sbin:/usr/bin:/usr/sbin:/bin:/sbin:/usr/games:/usr/X11R6/bin:/usr/bin:/bin:/usr/pkg/bin:/usr/local/bin
    PERL_BADLANG (unset)
    SHELL=/usr/pkg/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Jul 10, 2006

From @smpeters

On Sun, Jul 09, 2006 at 01​:17​:30PM -0700, roland@​roland-illig.de wrote​:

# New Ticket Created by roland@​roland-illig.de
# Please include the string​: [perl #39774]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=39774 >

This is a bug report for perl from roland@​baccf5ee.roland-illig.de,
generated with the help of perlbug 1.35 running under perl v5.8.8.

-----------------------------------------------------------------
[Please enter your report here]

The function S_regmatch calls itself recursively, resulting in a stack
overflow for long strings combined with certain regular expressions.

For example​:
("aaaaaaaaaa " x 8000) =~ qr"^((?​:\\#|[^#])*?)(?​:\s*(#.*))?$";

Note​: The regular expression splits a line from a Makefile into two
parts​: $1 will contain everything before the first "#" character, where
"\#" is interpreted as a "#" sign. $2 will contain the comment from that
line, including all leading white-space.

In a recent development version of Perl, the core dump goes away, but it
still fails.

steve@​kirk​:~/perl-current$ ./perl -wle'("aaaaaaaaaa " x 8000) =~
qr"^((?​:\\#|[^#])*?)(?​:\s*(#.*))?$";'
Complex regular subexpression recursion limit (32766) exceeded at -e line 1.

Steve Peters
steve@​fisharerojo.org

@p5pRT
Copy link
Author

p5pRT commented Jul 10, 2006

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Jul 10, 2006

From @demerphq

On 7/9/06, via RT roland @​ roland-illig. de <perlbug-followup@​perl.org> wrote​:

# New Ticket Created by roland@​roland-illig.de
# Please include the string​: [perl #39774]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=39774 >

This is a bug report for perl from roland@​baccf5ee.roland-illig.de,
generated with the help of perlbug 1.35 running under perl v5.8.8.

-----------------------------------------------------------------
[Please enter your report here]

The function S_regmatch calls itself recursively, resulting in a stack
overflow for long strings combined with certain regular expressions.

For example​:
("aaaaaaaaaa " x 8000) =~ qr"^((?​:\\#|[^#])*?)(?​:\s*(#.*))?$";

Note​: The regular expression splits a line from a Makefile into two
parts​: $1 will contain everything before the first "#" character, where
"\#" is interpreted as a "#" sign. $2 will contain the comment from that
line, including all leading white-space.

Expressions of this form​:

(?​:\\#|[^#])*

Are always going to cause perls regex engine (indeed any NFA regex
engine) trouble. Every single char that pattern matches requires a
backtrack point being stored. The proper way to do things like this
is​:

(?​:[^#\\]+|(?​:\\#?)+)*

This is because A) the most likely to match possibility is first in
the alternation, B) each possibility consumes as much as possible,
meaning that less backtrack points are stored.

It would be nice if perl knew this and did this optimisation
transparently and automatically but currently it doesnt, and its
probably not that easy to do.

IMO we should probably add something to the docs about this type of
pattern, if it isnt already there (i havent checked).

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Jul 10, 2006

From @jimc

demerphq wrote​:

On 7/9/06, via RT roland @​ roland-illig. de
<perlbug-followup@​perl.org> wrote​:

# New Ticket Created by roland@​roland-illig.de
# Please include the string​: [perl #39774]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=39774 >

This is a bug report for perl from roland@​baccf5ee.roland-illig.de,
generated with the help of perlbug 1.35 running under perl v5.8.8.

-----------------------------------------------------------------
[Please enter your report here]

The function S_regmatch calls itself recursively, resulting in a stack
overflow for long strings combined with certain regular expressions.

For example​:
("aaaaaaaaaa " x 8000) =~ qr"^((?​:\\#|[^#])*?)(?​:\s*(#.*))?$";

Note​: The regular expression splits a line from a Makefile into two
parts​: $1 will contain everything before the first "#" character, where
"\#" is interpreted as a "#" sign. $2 will contain the comment from that
line, including all leading white-space.

Expressions of this form​:

(?​:\\#|[^#])*

Are always going to cause perls regex engine (indeed any NFA regex
engine) trouble. Every single char that pattern matches requires a
backtrack point being stored. The proper way to do things like this
is​:

(?​:[^#\\]+|(?​:\\#?)+)*
See "Mastering Regular Expressions", ch 5, esp section "A Global Veiw of
Backtracking".
  (Owl book)

This is because A) the most likely to match possibility is first in
the alternation, B) each possibility consumes as much as possible,
meaning that less backtrack points are stored.

It would be nice if perl knew this and did this optimisation
transparently and automatically but currently it doesnt, and its
probably not that easy to do.

<mulling out loud>

When I read the Owl, I found myself wondering if a *non-transparent*
way would be practical. While there are obvious downsides, the upsides
are pretty good too​:

given​: sub friedl_unroll ($) ...
which returns a qr// with the proper regex munging and repeating,
the qr// can be interpolated into other regexs.

- simplify the regex - make readable
- give programmer a higher-level construct.
- doesnt need regex core changes
- CPAN-able (it might be there already ?)
- doesnt preclude transparent approach, might help clarify it.

IMO we should probably add something to the docs about this type of
pattern, if it isnt already there (i havent checked).

Yves

-jimc

@p5pRT
Copy link
Author

p5pRT commented May 16, 2008

p5p@spam.wizbit.be - Status changed from 'open' to 'stalled'

@khwilliamson
Copy link
Contributor

The limit as been doubled to 65K

@khwilliamson khwilliamson changed the title Perl Segfault in Regex Match Can exceed regex complex recursion limit Apr 14, 2022
@khwilliamson
Copy link
Contributor

This ticket now comes down to that we do have recursion limits. I think that's a given in Perl, and we should just close this ticket.

@khwilliamson khwilliamson added the Closable? We might be able to close this ticket, but we need to check with the reporter label Apr 14, 2022
@hvds
Copy link
Contributor

hvds commented Apr 14, 2022

I do find it sad that we have a built-in hard limit: with perl-level recursion, IIRC we take care to store the bookkeeping on the heap rather than the C stack, and we limit it to a warning so that if you have the memory you can recurse more deeply. 2^16 as a hard limit regardless of your available memory seems rather restrictive for modern hardware.

That said, Perl no longer seems to be the language of choice for bioinformatics, and I don't know who else has a real-world need for this - I don't remember it ever being an issue for my mathematical work.

@iabyn
Copy link
Contributor

iabyn commented Apr 14, 2022 via email

@khwilliamson khwilliamson removed the Closable? We might be able to close this ticket, but we need to check with the reporter label Apr 16, 2022
@khwilliamson
Copy link
Contributor

perldiag says this

=item Complex regular subexpression recursion limit (%d) exceeded
(W regexp) The regular expression engine uses recursion in complex
situations where back-tracking is required. Recursion depth is limited
to 32766, or perhaps less in architectures where the stack cannot grow
arbitrarily. ("Simple" and "medium" situations are handled without
recursion and are not subject to a limit.) Try shortening the string
under examination; looping in Perl code (e.g. with C) rather than
in the regular expression engine; or rewriting the regular expression so
that it is simpler or backtracks less. (See L for information
on I.)

So is it no longer true that recursion is used sometimes?

@demerphq
Copy link
Collaborator

demerphq commented Apr 17, 2022 via email

@khwilliamson
Copy link
Contributor

So is this message obsolete?

I commented out the two places in regexec.c that stop looking and lead to this message, and ran the reduction program in this ticket. Everything worked. I propose removing those in early 5.37 and see what happens

@shlomif, I can't find the original program you created the reduction from.

@demerphq
Copy link
Collaborator

Can you show me a patch of what you tried @khwilliamson please?

khwilliamson added a commit to khwilliamson/perl5 that referenced this issue Apr 17, 2022
@khwilliamson
Copy link
Contributor

@demerphq
Copy link
Collaborator

demerphq commented Apr 17, 2022

I see. I personally would be fine with removing those blocks right away:

$ grep S_regmatch regexec.c
S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
    bool result = 0;	    /* return value of S_regmatch */
                 * on first entry to S_regmatch rather than at some random

shouldn't break anything, the execution engine manages its own stack using heap memory.

$ git grep -P "\bregmatch\b" regexec.c
regexec.c:     *    since we have to run regmatch from position 0 - decrement the
regexec.c:     * leave it to regmatch itself) */
regexec.c:   in regmatch. /grrr */
regexec.c:     * slot N+3: ready for use by regmatch()
regexec.c:    result = regmatch(reginfo, *startposp, progi->program + 1);
regexec.c:/* Macros for regmatch(), using its internal variables */
regexec.c:regmatch() - main matching routine
regexec.c:        if (regmatch(A)) {
regexec.c:PL_regmatch_state.  The first time regmatch() is called, the first slab is
regexec.c:regmatch(), slabs allocated since entry are freed.
regexec.c:            Perl_re_printf( aTHX_ "regmatch start\n" );

Note that regmatch(A) is part of a comment, it is not an actual call, regmatch() has way more than one argument.

khwilliamson added a commit to khwilliamson/perl5 that referenced this issue Apr 17, 2022
There is no recursion to exceed limits of.
khwilliamson added a commit to khwilliamson/perl5 that referenced this issue Apr 17, 2022
There is no recursion to exceed limits of.

This fixes Perl#8369
@khwilliamson
Copy link
Contributor

I have created a proper PRhttps://github.com/Perl/perl5/pull/19636

khwilliamson added a commit that referenced this issue May 28, 2022
There is no recursion to exceed limits of.

This fixes #8369
khwilliamson added a commit that referenced this issue Jul 7, 2022
scottchiefbaker pushed a commit to scottchiefbaker/perl5 that referenced this issue Nov 3, 2022
There is no recursion to exceed limits of.

This fixes Perl#8369
scottchiefbaker pushed a commit to scottchiefbaker/perl5 that referenced this issue Nov 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants