Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIGSEGV compiling regexp in 5.10.0 #9512

Closed
p5pRT opened this issue Oct 6, 2008 · 13 comments
Closed

SIGSEGV compiling regexp in 5.10.0 #9512

p5pRT opened this issue Oct 6, 2008 · 13 comments

Comments

@p5pRT
Copy link

p5pRT commented Oct 6, 2008

Migrated from rt.perl.org#59654 (status was 'open')

Searchable as RT59654$

@p5pRT
Copy link
Author

p5pRT commented Oct 6, 2008

From @salva

Created by salva@ubuntu.int.qindel.com

Compiling/parsing complex regular expressions cause perl to SIGSEGV.

For instance​:

my $len = 6000;
my $re = join('|', (("a(?​:b") x $len), "a") . (")" x $len);
$re = qr/$re/; # ==> Segmentation fault

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.10.0:

Configured by salva at Mon Oct  6 16:31:48 CEST 2008.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.24-19-server, archname=i686-linux
    uname='linux ubuntu 2.6.24-19-server #1 smp sat jul 12 00:40:01 utc 2008 i686 gnulinux '
    config_args=''
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-g -O0',
    cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.2.3 (Ubuntu 4.2.3-2ubuntu7)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.7.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.7'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -g -O0 -L/usr/local/lib'

Locally applied patches:



@INC for perl 5.10.0:
    /usr/local/perl/5.10.0-debug/lib/5.10.0/i686-linux
    /usr/local/perl/5.10.0-debug/lib/5.10.0
    /usr/local/perl/5.10.0-debug/lib/site_perl/5.10.0/i686-linux
    /usr/local/perl/5.10.0-debug/lib/site_perl/5.10.0
    .


Environment for perl 5.10.0:
    HOME=/home/salva
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/salva/bin:/usr/local/*/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
    PERL_BADLANG (unset)
    SHELL=/bin/bash



      

@p5pRT
Copy link
Author

p5pRT commented Oct 6, 2008

From @salva

Compiling/parsing complex regular expressions cause perl to SIGSEGV.

For instance​:

my $len = 6000;
my $re = join('|', (("a(?​:b") x $len), "a") . (")" x $len);
$re = qr/$re/; # ==> Segmentation fault

The minimum length that causes the SIGSEGV depends on how perl is
compiled. 6000 crashed a debugging perl (-g -O0 -DDEBUGGING).

Crashing a perl compiled with the default configuration requires $len =
15000;

- Salva

@p5pRT
Copy link
Author

p5pRT commented Oct 6, 2008

@salva - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented May 28, 2009

From @nwc10

Dave notes​:

Appears to be a 5.10.0 regression. Still present in maint,blead

@p5pRT
Copy link
Author

p5pRT commented May 28, 2009

From p5p@spam.wizbit.be

On Mon Oct 06 09​:52​:20 2008, salva wrote​:

Compiling/parsing complex regular expressions cause perl to
SIGSEGV.

For instance​:

my $len = 6000;
my $re = join('|', (("a(?​:b") x $len), "a") . (")" x $len);
$re = qr/$re/; # ==> Segmentation fault

The minimum length that causes the SIGSEGV depends on how perl is
compiled. 6000 crashed a debugging perl (-g -O0 -DDEBUGGING).

Crashing a perl compiled with the default configuration requires
$len =
15000;

- Salva

What appears changed between the different perl version is the length
that is required to trigger the segmentation fault;

If I use length 17000 then perl-5.9.4 works and perl-5.9.5 produces a
segmentation fault.
A binary search with length 17000 points to Change 28785

If I use length 25000 then perl-5.6.0 and above all crash.
A binary search with length 25000 points to Change 4669

If I use length 35000 then perl-5.005_62@​4668 segfaults as well.
A binary search with length 37000 points to Change 3860

If I use length 45000 then perl-p-5.005_58@​3859 segfaults as well.
Binary searching that reveals that Change 1461 introduced qr//.
So changing the test script and using m// instead of qr// in the test
script

And finally the Change that caused the segmentation fault​:

----Program----
my $len = 45000;
my $re = join('|', (("a(?​:b") x $len), "a") . (")" x $len);
"" =~ m/$re/;

----Output of ...p5eAlRt/perl-5.004_54@​266/bin/perl----
/a(?​:b|a(?​:b|a(?​:b|a(?​:b|a(?​:b|a(?​:b|a(?​:b|a(?​:b|a(?​:b|a(?​:b|a(?​:b|
a(?​:b|a(?​:b|a(?​:b|a(?​:b|a(?​:b|a(?​:b|a(?​:b|a(?​:b|a(?​:b|a(?​:b|a/​: regexp
too big at /tmp/rt-59654.pl line 3.

----EOF ($?='512')----
----Output of ...peqbxr8/perl-5.004_54@​267/bin/perl----

----EOF ($?='11')----
Need a perl between 266 and 267
No patch available between 266 and 267

http​://public.activestate.com/cgi-bin/perlbrowse/p/267
Change 267 by mbeattie@​localhost on 1997/11/19 11​:04​:15

  Jumbo regexp patch applied (with minor fix-up tweaks)​:
  Subject​: Version 7 of Jumbo RE patch available
  Date​: Sun, 16 Nov 1997 00​:29​:39 -0500 (EST)
  From​: Ilya Zakharevich <ilya@​math.ohio-state.edu>

Before the change the regex returned the error message 'regexp too big'.
After the change it just segfaults.

In perl@​266/perl@​267​:
http​://public.activestate.com/cgi-bin/perlbrowse/f/regcomp.h@​267

regcomp.h contains​:
#define REGALIGN
#ifdef REGALIGN
# define REGALIGN_STRUCT
#endif

regcomp.c contains​:
http​://public.activestate.com/cgi-bin/perlbrowse/f/regcomp.c@​267

#ifndef REGALIGN_STRUCT
  if (regsize >= 0x10000L && extralen)
  FAIL("regexp too big");
#else
  if (regsize >= 0x10000L && extralen)
  regsize += extralen;
  else
  extralen = 0;
#endif

Which means that the 'regexp too big' was never triggered.
It was then removed with Change 1120 when REGALIGN was removed​:

The error 'regexp too big' was removed when 'REG_ALIGN' was removed​:
http​://public.activestate.com/cgi-bin/perlbrowse/p/1120
Change 1120 by gsar@​aatma on 1998/06/12 06​:51​:08

  applied patch, with indentation tweaks
  From​: Ilya Zakharevich <ilya@​math.ohio-state.edu>
  Message-Id​: <199806110803.EAA09149@​monk.mps.ohio-
state.edu>
  Subject​: [PATCH 5.004_66] Remove REG_ALIGN junk
  Date​: Thu, 11 Jun 1998 04​:03​:58 -0400 (EDT)

Currently the code in blead​:
  /* Small enough for pointer-storage convention?
  If extralen==0, this means that we will not need long jumps. */
  if (RExC_size >= 0x10000L && RExC_extralen)
  RExC_size += RExC_extralen;
  else
  RExC_extralen = 0;

'regexp too big' is mentioned in the section 'Obsolete Diagnostics' in
perl561delta​:

=item regexp too big

(F) The current implementation of regular expressions uses shorts as
address offsets within a string. Unfortunately this means that if
the regular expression compiles to longer than 32767, it'll blow up.
Usually when you want a regular expression this big, there is a better
way to do it with multiple statements. See L<perlre>.

So the real questions​: does this limit still apply? Should there be a
check on the length? ...?

Best regards,

Bram

@p5pRT
Copy link
Author

p5pRT commented Jun 18, 2009

From p5p@spam.wizbit.be

Asked dmq to take a quick look at this​:

[19​:06] <dmq> there should not be a regex limit at all.
[19​:06] <dmq> i find it quite odd actually.
[19​:08] <dmq> i think i can imagine where
[19​:09] <dmq> we do two passes (could be more but for the sake of this
subject imagine its just 2) the first is to calculate the length
required.
[19​:09] <dmq> the second is to actually construct the regex.
[19​:10] <dmq> sometimes we do a third, if we have to do recoding or
optimisations that we couldnt know about until we actually did the
compilaton.
[19​:10] <dmq> so somewhere there is storage for the actual size, and or
for the extra len or whatever.
[19​:10] <dmq> no doubt this is too small, and we are rolling over.

@p5pRT
Copy link
Author

p5pRT commented Jun 25, 2009

From @hvds

On Mon Oct 06 09​:06​:13 2008, salva wrote​:
[...]

Compiling/parsing complex regular expressions cause perl to SIGSEGV.

For instance​:

my $len = 6000;
my $re = join('|', (("a(?​:b") x $len), "a") . (")" x $len);
$re = qr/$re/; # ==> Segmentation fault

Just took a quick look at this​: it segvs in today's blead due to a
recursive call in S_study_chunk(), which is blowing the stack​:
  /* we suppose the run is continuous, last=next...*/
  minnext = study_chunk(pRExC_state, &scan, minlenp, &deltanext,
  next, &data_fake,
  stopparen, recursed, NULL, f,depth+1);

I have not tried to work out what would be involved in fixing it, and
the class of things made possible by fixing it doesn't seem to include
much that's useful. I may come back to this later.

Hugo

@p5pRT
Copy link
Author

p5pRT commented Dec 16, 2009

From @obra

As of today's blead, bumping the constant up to 20000, it passes, while
5.8.8 explodes.

While it should never fail, it doesn't currently appear to be a regression.
I'm going to remove this as a 5.12 blocker

@p5pRT
Copy link
Author

p5pRT commented Dec 16, 2009

From @salva

On Tue Dec 15 19​:04​:05 2009, jesse wrote​:

As of today's blead, bumping the constant up to 20000, it passes, while
5.8.8 explodes.

While it should never fail, it doesn't currently appear to be a
regression.
I'm going to remove this as a 5.12 blocker

Can this limit be OS/architecture/perl config dependant?

Can you add the following test so it gets smoked?

$ cat op/reg_69654.t
#!perl

# See http​://rt.perl.org/rt3/Ticket/Display.html?id=59654

BEGIN {
  chdir 't' if -d 't';
  push @​INC, '../lib';
}

use Test​::More;

my @​sizes = (1000, 2000, 5000, 10000, 20000);

plan tests => scalar @​sizes;

for my $len (@​sizes) {
  my $re = join('|', (("a(?​:b") x $len), "a") . (")" x $len);
  ok(qr/$re/);
}

@khwilliamson
Copy link
Contributor

khwilliamson commented Apr 14, 2022

This now dies with the error "Too many nested open parens in regex". I think that is a reasonable outcome. We're always going to have limits, and we now properly handle exceeding this limit. So I think this is closable

@khwilliamson khwilliamson added the Closable? We might be able to close this ticket, but we need to check with the reporter label Apr 14, 2022
@hvds
Copy link
Contributor

hvds commented Apr 14, 2022

I think I'd say it's a bug that we hit the C stack for this. While I don't think there are that many real-world cases where #8369 really restricts what you can do, this one seems much more likely to stop useful stuff getting done by automated construction of regexps for complex grammars.

I might feel differently if it were only capturing parens that triggered this: I think it's less likely that you'd need 1000s of nested capturing parens to express a useful pattern. But non-capturing parens feel like they should be barely more costly than a no-op, and it is really useful to sprinkle them around liberally when constructing a regexp recursively.

@demerphq
Copy link
Collaborator

Regarding @hvds last comment, I guess we hit the C stack while calling reg over and over. This is a parser bug, not an optimizer bug I think.

@khwilliamson khwilliamson removed the Closable? We might be able to close this ticket, but we need to check with the reporter label Apr 17, 2022
@demerphq
Copy link
Collaborator

I am closing this ticket as fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants