Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

/(.(?2))((?<=(?=(?1)).))/ hangs and eats all available RAM #14935

Closed
p5pRT opened this issue Sep 25, 2015 · 24 comments
Closed

/(.(?2))((?<=(?=(?1)).))/ hangs and eats all available RAM #14935

p5pRT opened this issue Sep 25, 2015 · 24 comments

Comments

@p5pRT
Copy link

p5pRT commented Sep 25, 2015

Migrated from rt.perl.org#126182 (status was 'resolved')

Searchable as RT126182$

@p5pRT
Copy link
Author

p5pRT commented Sep 25, 2015

From victor@drawall.cc

Created by @Grimy

How to reproduce
----------------

perl5.23.4 -e '/(.(?2))((?<=(?=(?1)).))/'

Expected behavior
-----------------

Perl should immediately die with the following diagnostic​:

Infinite recursion in regex at -e line 1.

This is the current behavior of almost everything that would cause a regex to
never complete, creating the expectation that the regex engine is safe from
this kind of problem.

Actual behavior
---------------

Perl becomes unresponsive, allocating more and more RAM

Perl Info

Flags:
    category=core
    severity=wishlist

Site configuration information for perl 5.23.4:

Configured by grimy at Tue Sep 22 21:18:14 CEST 2015.

Summary of my perl5 (revision 5 version 23 subversion 4) configuration:
  Commit id: 2d9b5f101563ac9fee41e6ca496f79db6222d2e3
  Platform:
    osname=linux, osvers=4.0.7-2-arch, archname=x86_64-linux
    uname='linux localhost 4.0.7-2-arch #1 smp preempt tue jun 30
07:50:21 utc 2015 x86_64 gnulinux '
    config_args='-ds -e -Dusedevel'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fwrapv -fno-strict-aliasing -pipe
-fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fwrapv -fno-strict-aliasing -pipe
-fstack-protector-strong -I/usr/local/include'
    ccversion='', gccversion='5.1.0', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8,
byteorder=12345678, doublekind=3
    d_longlong=define, longlongsize=8, d_longdbl=define,
longdblsize=16, longdblkind=3
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib
/usr/lib/gcc/x86_64-unknown-linux-gnu/5.1.0/include-fixed /usr/lib
/lib/../lib /usr/lib/../lib /lib /lib64 /usr/lib64
    libs=-lpthread -lnsl -lnm -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
-lgdbm_compat
    perllibs=-lpthread -lnsl -lnm -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.21.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.21'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib
-fstack-protector-strong'



@INC for perl 5.23.4:
    /usr/local/lib/perl5/site_perl/5.23.4/x86_64-linux
    /usr/local/lib/perl5/site_perl/5.23.4
    /usr/local/lib/perl5/5.23.4/x86_64-linux
    /usr/local/lib/perl5/5.23.4
    /usr/local/lib/perl5/site_perl
    .


Environment for perl 5.23.4:
    HOME=/home/grimy
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/grimy/bin:/home/grimy/.nvim/scripts:/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/opt/plan9/bin
    PERL_BADLANG (unset)
    SHELL (unset)

@p5pRT
Copy link
Author

p5pRT commented Oct 13, 2015

From @khwilliamson

I am unable to reproduce this. I've tried on both -DEBUGGING builds, and not, and in 5.22.2

The regex compiles in all builds I tried it as

Final program​:
  1​: OPEN1 (3)
  3​: REG_ANY (4)
  4​: GOSUB2[+5] (7)
  7​: CLOSE1 (9)
  9​: OPEN2 (11)
  11​: IFMATCH[-1] (23)
  13​: IFMATCH[0] (20)
  15​: GOSUB1[-14] (18)
  18​: SUCCEED (0)
  19​: TAIL (20)
  20​: REG_ANY (21)
  21​: SUCCEED (0)
  22​: TAIL (23)
  23​: CLOSE2 (25)
  25​: END (0)
minlen 1
r->extflags​: NO_INPLACE_SUBST
r->intflags​: [none-set]

--
Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Oct 13, 2015

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Oct 14, 2015

From victor@drawall.cc

My bad, my “How to reproduce section” is wrong. When matched against
an empty string, the regex will fail immediately, as expected. The
problems only start when it is applied to a non-empty string, as in :

perl -e '"a"=~/(.(?2))((?<=(?=(?1)).))/'

I’ve just reproduced it on the latest blead
(f83e001).

I’d recommend doing a `ulimit -Sv 1048576` or some such before running
the command, to prevent it from getting out of control.

@p5pRT
Copy link
Author

p5pRT commented Oct 14, 2015

From @demerphq

On 14 October 2015 at 08​:05, Victor ADAM <victor@​drawall.cc> wrote​:

My bad, my “How to reproduce section” is wrong. When matched against
an empty string, the regex will fail immediately, as expected. The
problems only start when it is applied to a non-empty string, as in :

perl -e '"a"=~/(.(?2))((?<=(?=(?1)).))/'

I’ve just reproduced it on the latest blead
(f83e001).

I’d recommend doing a `ulimit -Sv 1048576` or some such before running
the command, to prevent it from getting out of control.

This should trigger an infinite recursion error.

yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Oct 14, 2015

From victor@drawall.cc

Yes, it *should* trigger an infinite recursion error, but it actually doesn’t.

@p5pRT
Copy link
Author

p5pRT commented Oct 14, 2015

From @khwilliamson

On 10/14/2015 03​:11 AM, Victor ADAM wrote​:

Yes, it *should* trigger an infinite recursion error, but it actually doesn’t.

Thanks for the ulimit hint.

This is an area of the engine that I have never looked at, so it would
take me a while to get up to speed to figure out how to fix it. Here's
the first bit of the output of -Dr execution. I see why the current
check doesn't work in this situation, but am unsure of how one might fix
it without breaking other things​:

Matching REx "(.(?2))((?<=(?=(?1)).))" against "a"
  0 <> <a> | 1​:OPEN1(3)
  0 <> <a> | 3​:REG_ANY(4)
  1 <a> <> | 4​:GOSUB2[+5](7)
  1 <a> <> | 9​: OPEN2(11)
  1 <a> <> | 11​: IFMATCH[-1](23)
  0 <> <a> | 13​: IFMATCH[0](20)
  0 <> <a> | 15​: GOSUB1[-14](18)
  0 <> <a> | 1​: OPEN1(3)
  0 <> <a> | 3​: REG_ANY(4)
  1 <a> <> | 4​: GOSUB2[+5](7)
  1 <a> <> | 9​: OPEN2(11)
  1 <a> <> | 11​: IFMATCH[-1](23)
  0 <> <a> | 13​: IFMATCH[0](20)
  0 <> <a> | 15​: GOSUB1[-14](18)
  0 <> <a> | 1​: OPEN1(3)
  0 <> <a> | 3​: REG_ANY(4)
  1 <a> <> | 4​: GOSUB2[+5](7)
  1 <a> <> | 9​: OPEN2(11)
  1 <a> <> | 11​: IFMATCH[-1](23)
  0 <> <a> | 13​: IFMATCH[0](20)
  0 <> <a> | 15​: GOSUB1[-14](18)
  0 <> <a> | 1​: OPEN1(3)
  0 <> <a> | 3​: REG_ANY(4)
  1 <a> <> | 4​: GOSUB2[+5](7)
  1 <a> <> | 9​: OPEN2(11)
  1 <a> <> | 11​:
IFMATCH[-1](23)
  0 <> <a> | 13​:
IFMATCH[0](20)
  0 <> <a> | 15​:
GOSUB1[-14](18)

@p5pRT
Copy link
Author

p5pRT commented Oct 14, 2015

From @demerphq

On 14 October 2015 at 18​:01, Karl Williamson <public@​khwilliamson.com> wrote​:

On 10/14/2015 03​:11 AM, Victor ADAM wrote​:

Yes, it *should* trigger an infinite recursion error, but it actually
doesn’t.

Thanks for the ulimit hint.

This is an area of the engine that I have never looked at, so it would take
me a while to get up to speed to figure out how to fix it.

I have a fix. (Bit vector of visited nodes).

Here's the first
bit of the output of -Dr execution. I see why the current check doesn't
work in this situation, but am unsure of how one might fix it without
breaking other things

Exactly. I am still investigating. Might take me a few days, but ill get there.

Yves

@p5pRT
Copy link
Author

p5pRT commented Oct 14, 2015

From @khwilliamson

On 10/14/2015 11​:09 AM, demerphq wrote​:

On 14 October 2015 at 18​:01, Karl Williamson <public@​khwilliamson.com> wrote​:

On 10/14/2015 03​:11 AM, Victor ADAM wrote​:

Yes, it *should* trigger an infinite recursion error, but it actually
doesn’t.

Thanks for the ulimit hint.

This is an area of the engine that I have never looked at, so it would take
me a while to get up to speed to figure out how to fix it.

I have a fix. (Bit vector of visited nodes).

Here's the first
bit of the output of -Dr execution. I see why the current check doesn't
work in this situation, but am unsure of how one might fix it without
breaking other things

Exactly. I am still investigating. Might take me a few days, but ill get there.

Yves

Yves++

@p5pRT
Copy link
Author

p5pRT commented Dec 26, 2015

From @khwilliamson

On 10/14/2015 11​:09 AM, demerphq wrote​:

On 14 October 2015 at 18​:01, Karl Williamson <public@​khwilliamson.com> wrote​:

On 10/14/2015 03​:11 AM, Victor ADAM wrote​:

Yes, it *should* trigger an infinite recursion error, but it actually
doesn’t.

Thanks for the ulimit hint.

This is an area of the engine that I have never looked at, so it would take
me a while to get up to speed to figure out how to fix it.

I have a fix. (Bit vector of visited nodes).

Here's the first
bit of the output of -Dr execution. I see why the current check doesn't
work in this situation, but am unsure of how one might fix it without
breaking other things

Exactly. I am still investigating. Might take me a few days, but ill get there.

Yves

How is this coming?

@p5pRT
Copy link
Author

p5pRT commented Dec 30, 2015

From @demerphq

I'm really sorry, a series of personal issues, family in hospital type
things, have kept me from finishing this. I hope to find time at some
point, but I cant say when.

If it would help I will try to find time to upload what
work-in-progress I have. Otherwise I will do my best to get to it when
time permits.

Yves

On 26 December 2015 at 06​:24, Karl Williamson <public@​khwilliamson.com> wrote​:

On 10/14/2015 11​:09 AM, demerphq wrote​:

On 14 October 2015 at 18​:01, Karl Williamson <public@​khwilliamson.com>
wrote​:

On 10/14/2015 03​:11 AM, Victor ADAM wrote​:

Yes, it *should* trigger an infinite recursion error, but it actually
doesn’t.

Thanks for the ulimit hint.

This is an area of the engine that I have never looked at, so it would
take
me a while to get up to speed to figure out how to fix it.

I have a fix. (Bit vector of visited nodes).

Here's the first
bit of the output of -Dr execution. I see why the current check doesn't
work in this situation, but am unsure of how one might fix it without
breaking other things

Exactly. I am still investigating. Might take me a few days, but ill get
there.

Yves

How is this coming?

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Mar 1, 2016

From @khwilliamson

I've added this to the 5.24 blockers list.

Yves, if you don't have the tuits for this, could you post what you got so far?
--
Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Mar 1, 2016

From @jkeenan

On Tue Oct 13 23​:05​:46 2015, victor@​drawall.cc wrote​:

My bad, my “How to reproduce section” is wrong. When matched against
an empty string, the regex will fail immediately, as expected. The
problems only start when it is applied to a non-empty string, as in :

perl -e '"a"=~/(.(?2))((?<=(?=(?1)).))/'

I’ve just reproduced it on the latest blead
(f83e001).

I’d recommend doing a `ulimit -Sv 1048576` or some such before running
the command, to prevent it from getting out of control.

And here it is at today's blead (on Linux x86_64)​:
#####
$ git show | head -1
commit f24480b

$ ulimit -Sv 1048576 && ./perl -Ilib -e '"a"=~/(.(?2))((?<=(?=(?1)).))/'
Out of memory!
#####

--
James E Keenan (jkeenan@​cpan.org)

@p5pRT
Copy link
Author

p5pRT commented Mar 2, 2016

From @demerphq

On 2 March 2016 at 00​:58, James E Keenan via RT
<perlbug-followup@​perl.org> wrote​:

On Tue Oct 13 23​:05​:46 2015, victor@​drawall.cc wrote​:

My bad, my “How to reproduce section” is wrong. When matched against
an empty string, the regex will fail immediately, as expected. The
problems only start when it is applied to a non-empty string, as in :

perl -e '"a"=~/(.(?2))((?<=(?=(?1)).))/'

I’ve just reproduced it on the latest blead
(f83e001).

I’d recommend doing a `ulimit -Sv 1048576` or some such before running
the command, to prevent it from getting out of control.

And here it is at today's blead (on Linux x86_64)​:
#####
$ git show | head -1
commit f24480b

$ ulimit -Sv 1048576 && ./perl -Ilib -e '"a"=~/(.(?2))((?<=(?=(?1)).))/'
Out of memory!

Yeah, I know. This is still on my todo list. All I can say is that I
am now "back" hacking after being away for a while dealing with the
passing of my father and hope to get to it sooner than later.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Mar 6, 2016

From @demerphq

On 2 March 2016 at 11​:01, demerphq <demerphq@​gmail.com> wrote​:

On 2 March 2016 at 00​:58, James E Keenan via RT
<perlbug-followup@​perl.org> wrote​:

And here it is at today's blead (on Linux x86_64)​:
#####
$ git show | head -1
commit f24480b

$ ulimit -Sv 1048576 && ./perl -Ilib -e '"a"=~/(.(?2))((?<=(?=(?1)).))/'
Out of memory!

hope to get to it sooner than later.

And this should be fixed now with​:

ba6840f

This might need a doc fix. It is no longer illegal to do "infinite
recursion", instead the match "fails" to match at the given position.

so for instance this​:

/a(?R)?b/

now matches "ab", "aabb", etc.

I am not sure this is the best wording, but now a pattern sub-routine
will fail to match if entered recursively from the same position. Note
this does not apply to EVAL (??{ ... }) style recursion, only GOSUB
style recursion (?&foo).

Eg​:

"foobar"=~/(?&y)(?(DEFINE)(?<x>(?&y)?foo)(?<y>(?&x)?bar))/

matches. However​:

"foobar"=~/(?&y)(?(DEFINE)(?<x>(?&y)foo)(?<y>(?&x)bar))/

does not, and produces output like this​:

floating "foobar" at 0..9223372036854775807 (checking floating) minlen 6
Matching REx "(?&y)(?(DEFINE)(?<x>(?&y)foo)(?<y>(?&x)bar))" against "foobar"
Intuit​: trying to determine minimum start position...
  doing 'check' fbm scan, [0..6] gave 0
  Found floating substr "foobar" at offset 0 (rx_origin now 0)...
  (multiline anchor test skipped)
Intuit​: Successfully guessed​: match at offset 0
  0 <> <foobar> | 1​:GOSUB2[+16​:17] 'y'(4)
  0 <> <foobar> | 17​: OPEN2 'y'(19)
  0 <> <foobar> | 19​: GOSUB1[-11​:8] 'x'(22)
  0 <> <foobar> | 8​: OPEN1 'x'(10)
  0 <> <foobar> | 10​: GOSUB2[+7​:17] 'y'(13)
  pattern left-recursion without
consuming input always fails...
  failed...
Match failed
Freeing REx​: "(?&y)(?(DEFINE)(?<x>(?&y)foo)(?<y>(?&x)bar))"

So I think this ticket can be closed.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Mar 8, 2016

From zefram@fysh.org

demerphq wrote​:

                      It is no longer illegal to do "infinite

recursion", instead the match "fails" to match at the given position.

This is broken behaviour, in multiple respects. Though we already have
some similar to it.

If a recursion occurs without moving through the string, this does
not necessarily mean that the regexp execution would fail to terminate
if left to its own devices. The arbitrary code that we can embed in a
regexp means that it's easy to have part of a regexp behave differently on
different iterations. But even excluding that possibility, non-embedding
regexp features are quite sufficient to make a regexp that must iterate
a couple of times without moving in order to match. For example,

  "ab" =~ /a(?(1)z|(?(2)()|()))*\1b/

should match, with exactly two iterations of the * loop, each consuming
no characters. The first iteration captures $2 and can't match any
other way, the second captures $1 and can't match any other way, after
that it can't iterate more, and \1 must be matched to complete the match.
In fact, trying this out on perl 5.22.0, perl says that it doesn't match.
This is not due to any error in the structure of the regexp​: perl will
happily perform an equivalent match if some character is introduced such
that each iteration of the loop advances through the string​:

  "axxb" =~ /a(?​:x(?(1)z|(?(2)()|())))*\1b/

So perl has a bug here in failing to perform the earlier match.

Now let's consider cases that actually could recurse infinitely.
The simplest case is something like

  "ab" =~ /a(?​:)*b/

Note that in actual *regular* expressions, in the sense of matching a
regular language, this kind of infinite iteration is no problem at all.
If this kind of pattern is compiled down to a DFA, then the DFA just runs
along the string, in constant time per character, and tells you whether
or not it matches. It doesn't iterate for the iteration that appears in
the expression, and so executes the infinite loop in finite time. Srsly.
Of course, you can't then ask it how many iterations of the empty loop
were included in its match, because it doesn't generate that sort of
information.

perl does have a problem with infinite iteration, because it uses an NFA
with backtracking. On the above simple case it does correctly find the
match, but emits a warning while doing so. And here we *can* ask how
many times it went round the empty loop. In principle, in this simple
case, any number of times round the loop, zero or more, could produce a
valid match. The rub is that larger numbers of iterations are preferred,
so with no largest valid number of iterations there is no most-preferred
match. perl actually took exactly one iteration, so although it was
correct to say that it matched, it was actually incorrect in which way
it matched. That doesn't make any difference in the simplest case,
but it's detectable with a bit of extra work.

Incidentally, if we change the * to a *?, preferring lower numbers
of iterations, then perl correctly produces a zero-iterations match.
Though it still emits the warning.

We can muck about with the loop structure to get worse misbehaviour
from perl. For example,

  "ab" =~ /a(?​:()|())*\1\2b/

should match, and in fact has infinitely many ways to match. It requires
at least two iterations of the loop, at least one taking each branch.
But perl declares no match, this time emitting the warning as it
does so. The way in which this fails is a lot like the finite example
that I started with, but that first one doesn't generate the warning.
This infinite case shares with an earlier case the problem of having
no most-preferred match, but changing to the closest variant that has
a most-preferred match

  "ab" =~ /a(?(2)()|())*?\1\2b/

doesn't make perl recognise the match.

So there are many cases here where the extra logic put in to avoid hanging
is resulting in failing to find a match, or in finding the wrong match.
We do define regexps to try the options in a specific order, and to yield
the most-preferred match, so the latter is not an insignificant matter.

Detecting something that would otherwise hang is nice, but not essential.
We should never compromise correctness to get it. If the programmer
has instructed us to infinitely recurse, then hanging is a reasonable
response, and so is exhausting memory. We should not be afraid of doing
what the programmer said.

-zefram

@p5pRT
Copy link
Author

p5pRT commented Mar 8, 2016

From @demerphq

On 8 March 2016 at 01​:36, Zefram <zefram@​fysh.org> wrote​:

demerphq wrote​:

                      It is no longer illegal to do "infinite

recursion", instead the match "fails" to match at the given position.

This is broken behaviour, in multiple respects. Though we already have
some similar to it.

I understand your argument, but my reading of the docs says that it is
broken by design.

Specifically​:

"To break the loop, the following
match after a zero-length match is prohibited to have a length of zero.
This prohibition interacts with backtracking (see L<"Backtracking">),
and so the I<second best> match is chosen if the I<best> match is of
zero length.
"

If a recursion occurs without moving through the string, this does
not necessarily mean that the regexp execution would fail to terminate
if left to its own devices. The arbitrary code that we can embed in a
regexp means that it's easy to have part of a regexp behave differently on
different iterations. But even excluding that possibility, non-embedding
regexp features are quite sufficient to make a regexp that must iterate
a couple of times without moving in order to match. For example,

    "ab" =~ /a\(?\(1\)z|\(?\(2\)\(\)|\(\)\)\)\*\\1b/

should match, with exactly two iterations of the * loop, each consuming
no characters. The first iteration captures $2 and can't match any
other way, the second captures $1 and can't match any other way, after
that it can't iterate more, and \1 must be matched to complete the match.
In fact, trying this out on perl 5.22.0, perl says that it doesn't match.
This is not due to any error in the structure of the regexp​: perl will
happily perform an equivalent match if some character is introduced such
that each iteration of the loop advances through the string​:

    "axxb" =~ /a\(?&#8203;:x\(?\(1\)z|\(?\(2\)\(\)|\(\)\)\)\)\*\\1b/

So perl has a bug here in failing to perform the earlier match.

Now let's consider cases that actually could recurse infinitely.
The simplest case is something like

    "ab" =~ /a\(?&#8203;:\)\*b/

Note that in actual *regular* expressions, in the sense of matching a
regular language, this kind of infinite iteration is no problem at all.
If this kind of pattern is compiled down to a DFA, then the DFA just runs
along the string, in constant time per character, and tells you whether
or not it matches. It doesn't iterate for the iteration that appears in
the expression, and so executes the infinite loop in finite time. Srsly.
Of course, you can't then ask it how many iterations of the empty loop
were included in its match, because it doesn't generate that sort of
information.

perl does have a problem with infinite iteration, because it uses an NFA
with backtracking. On the above simple case it does correctly find the
match, but emits a warning while doing so. And here we *can* ask how
many times it went round the empty loop. In principle, in this simple
case, any number of times round the loop, zero or more, could produce a
valid match. The rub is that larger numbers of iterations are preferred,
so with no largest valid number of iterations there is no most-preferred
match. perl actually took exactly one iteration, so although it was
correct to say that it matched, it was actually incorrect in which way
it matched. That doesn't make any difference in the simplest case,
but it's detectable with a bit of extra work.

Incidentally, if we change the * to a *?, preferring lower numbers
of iterations, then perl correctly produces a zero-iterations match.
Though it still emits the warning.

We can muck about with the loop structure to get worse misbehaviour
from perl. For example,

    "ab" =~ /a\(?&#8203;:\(\)|\(\)\)\*\\1\\2b/

should match, and in fact has infinitely many ways to match. It requires
at least two iterations of the loop, at least one taking each branch.
But perl declares no match, this time emitting the warning as it
does so. The way in which this fails is a lot like the finite example
that I started with, but that first one doesn't generate the warning.
This infinite case shares with an earlier case the problem of having
no most-preferred match, but changing to the closest variant that has
a most-preferred match

    "ab" =~ /a\(?\(2\)\(\)|\(\)\)\*?\\1\\2b/

doesn't make perl recognise the match.

So there are many cases here where the extra logic put in to avoid hanging
is resulting in failing to find a match, or in finding the wrong match.
We do define regexps to try the options in a specific order, and to yield
the most-preferred match, so the latter is not an insignificant matter.

Except see the documentation I quote above.

Detecting something that would otherwise hang is nice, but not essential.
We should never compromise correctness to get it. If the programmer
has instructed us to infinitely recurse, then hanging is a reasonable
response, and so is exhausting memory. We should not be afraid of doing
what the programmer said.

I am sympathetic to what you say here.

But there is a subtle difference, Patterns are often supplied by the
user, and until regex recursion was introduced, it was safe(ish) to
allow them to do so.

So, to me, making every Perl program that allows user supplied regexes
be vulnerable to user supplied infinite loops is not a good thing.

Anyway, much of the dialog that you posted here IMO is unrelated to
regex infinite recursion, it relates to how some parts of the regex
handle matching the empty string.

Can you please reframe your dialog to be purely about regex recursion?
Can you show a pattern where regex recursion does not do the right
thing?

FWIW, I am open to restoring the old behavior, and making it a fatal
exception to enter an infinite loop recursively from the same position
due to left recursion in a pattern.

But I do think that /(?<foo>(?&foo)foo)/ infinite looping and then
eventually exiting due to memory exhaustion is NOT the right thing for
perl to do in this case.

Note that

/(?<foo>a(?&foo)?b)/

works just fine. The only problem are things like /(?<foo>(?&foo))/.

Thanks for your feedback.

I will say that independently of this, we *CAN* implement some form of
"maximum work" mechanism for matching, which would allow people to
address the "insecure pattern" problem in a saner way. Which would to
some degree ameliorate the concern I have here.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Mar 8, 2016

From zefram@fysh.org

demerphq wrote​:

Anyway, much of the dialog that you posted here IMO is unrelated to
regex infinite recursion, it relates to how some parts of the regex
handle matching the empty string.

It's almost precisely equivalent. The problems you're addressing in (?R)
type recursion are the same ones that occur with * iteration, because the
iteration really operates in a recursive fashion. In general, /(x*)/
is equivalent to /((?​:x(?1))?)/, and for the purposes of which ways it
can match those are equivalent to /((?​:(?1)x)?)/. The (documented)
behaviour that captures by the recursively-called subpattern are not
available to the calling pattern breaks some of my constructions, but
generally the same problems apply.

Can you show a pattern where regex recursion does not do the right
thing?

Just using left recursion, not emulating a zero-length iteration, consider

  "abbc" =~ /a((?​:(?1)b)?)c/

Mathematically this does match, in exactly one way. But the naive
execution of the pattern would infinitely recurse, attempting to try out
infinitely many paths that would not match before it'll consider the
way that actually does. So acceptable results are to yield a match,
hang, exhaust memory, or signal an exception to forestall hanging.
5.22 signals "Infinite recursion in regex", which is fine. blead says
it does not match, which is the wrong answer.

The left recursion can be rescued by changing the ? to ??, so preferring
less recursion. Of course this changes the equivalent iterative construct
from * to *?. With that change, blead correctly shows a match, but 5.22
still signals "Infinite recursion".

Looking at some zero-length-iteration cases, I found some curious
behaviour on

  "abc" =~ /a((?1)??)c/

5.22 signals "Infinite recursion" for this, not surprisingly, but
blead hangs, slowly eating memory. Hanging is acceptable here, as I've
said, but this seems to be a regression from the point of view of your
objective of avoiding hanging. I'd expect your fail-on-recursion logic
to (correctly) say that this does not match, in finite time. There are
closely related cases, such as the same thing with ? in place of ??,
for which blead produces the right answer and 5.22 signals "Infinite
recursion".

The only problem are things like /(?<foo>(?&foo))/.

With unconditional recursion, that has no way to match using a finite
amount of structure, and mathematically it's tautological, so I'm fine
about it being shortcutted. I'm concerned about things that in principle
can actually match.

-zefram

@p5pRT
Copy link
Author

p5pRT commented Mar 8, 2016

From @demerphq

On 8 March 2016 at 18​:20, Zefram <zefram@​fysh.org> wrote​:

demerphq wrote​:

Anyway, much of the dialog that you posted here IMO is unrelated to
regex infinite recursion, it relates to how some parts of the regex
handle matching the empty string.

It's almost precisely equivalent. The problems you're addressing in (?R)
type recursion are the same ones that occur with * iteration, because the
iteration really operates in a recursive fashion. In general, /(x*)/
is equivalent to /((?​:x(?1))?)/, and for the purposes of which ways it
can match those are equivalent to /((?​:(?1)x)?)/. The (documented)
behaviour that captures by the recursively-called subpattern are not
available to the calling pattern breaks some of my constructions, but
generally the same problems apply.

Can you show a pattern where regex recursion does not do the right
thing?

Just using left recursion, not emulating a zero-length iteration, consider

    "abbc" =~ /a\(\(?&#8203;:\(?1\)b\)?\)c/

Mathematically this does match, in exactly one way. But the naive
execution of the pattern would infinitely recurse, attempting to try out
infinitely many paths that would not match before it'll consider the
way that actually does. So acceptable results are to yield a match,
hang, exhaust memory, or signal an exception to forestall hanging.
5.22 signals "Infinite recursion in regex", which is fine. blead says
it does not match, which is the wrong answer.

Ok, then I will restore the "Infinite recursion in regex" so we do not
do the wrong thing.

The left recursion can be rescued by changing the ? to ??, so preferring
less recursion. Of course this changes the equivalent iterative construct
from * to *?. With that change, blead correctly shows a match, but 5.22
still signals "Infinite recursion".

Hmm.

Looking at some zero-length-iteration cases, I found some curious
behaviour on

    "abc" =~ /a\(\(?1\)??\)c/

5.22 signals "Infinite recursion" for this, not surprisingly, but
blead hangs, slowly eating memory. Hanging is acceptable here, as I've
said, but this seems to be a regression from the point of view of your
objective of avoiding hanging. I'd expect your fail-on-recursion logic
to (correctly) say that this does not match, in finite time.

Yes, me too. I am looking into why it doesn't.

There are
closely related cases, such as the same thing with ? in place of ??,
for which blead produces the right answer and 5.22 signals "Infinite
recursion".

Could you help me with this and post some test you think should pass?

I am digging into the example you gave to figure out why it doesnt get
caught by the new algorithm.

The only problem are things like /(?<foo>(?&foo))/.

With unconditional recursion, that has no way to match using a finite
amount of structure, and mathematically it's tautological, so I'm fine
about it being shortcutted. I'm concerned about things that in principle
can actually match.

I appreciate that concern and I share it too.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Mar 10, 2016

From @khwilliamson

In http​://nntp.perl.org/group/perl.perl5.porters/235003, which got attached to the wrong ticket, it was proposed to possibly revert the patch. I do not believe that it is acceptable to eat up all the machine's available memory. which would be one consequence of that reversion.

@p5pRT
Copy link
Author

p5pRT commented Apr 4, 2016

From @demerphq

On Thu Mar 10 10​:33​:39 2016, khw wrote​:

In http​://nntp.perl.org/group/perl.perl5.porters/235003, which got
attached to the wrong ticket, it was proposed to possibly revert the
patch. I do not believe that it is acceptable to eat up all the
machine's available memory. which would be one consequence of that
reversion.

I believe that this ticket can be closed now. Outside of an issue with Solaris as far as I know I fixed the issues raised in this thread, including the ones from zefram.

ce12e25
595de76
d1c49ad

and especially

401a802
ba6840f

If the OP concurs then I think this is done.

@p5pRT
Copy link
Author

p5pRT commented Aug 6, 2016

From @khwilliamson

On Mon Apr 04 03​:35​:12 2016, demerphq wrote​:

On Thu Mar 10 10​:33​:39 2016, khw wrote​:

In http​://nntp.perl.org/group/perl.perl5.porters/235003, which got
attached to the wrong ticket, it was proposed to possibly revert the
patch. I do not believe that it is acceptable to eat up all the
machine's available memory. which would be one consequence of that
reversion.

I believe that this ticket can be closed now. Outside of an issue with
Solaris as far as I know I fixed the issues raised in this thread,
including the ones from zefram.

ce12e25
595de76
d1c49ad

and especially

401a802
ba6840f

If the OP concurs then I think this is done.

The OP has not responded one way or the other. I am taking this ticket, and if the OP continues to not respond, after at least 7 days, I will close the ticket

--
Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Sep 1, 2016

From @khwilliamson

Resolved, as I threatened earlier
--
Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Sep 1, 2016

@khwilliamson - Status changed from 'open' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant