Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ancient Regex Regression #16616

Open
p5pRT opened this issue Jul 9, 2018 · 62 comments
Open

Ancient Regex Regression #16616

p5pRT opened this issue Jul 9, 2018 · 62 comments

Comments

@p5pRT
Copy link

p5pRT commented Jul 9, 2018

Migrated from rt.perl.org#133352 (status was 'open')

Searchable as RT133352$

@p5pRT
Copy link
Author

p5pRT commented Jul 9, 2018

From @deven

Created by @deven

I discovered a bug in Perl's regular expression engine a few
months ago. I showed it to many people at The Perl Conference
in Salt Lake City a couple weeks ago, and everyone agreed that
this was a bug in the regex engine in Perl itself, including
Abigail, Tom Christiansen, Karl Williamson and Larry Wall.

I even ended up doing a lightning talk about the bug​:

  https://www.youtube.com/watch?v=U-JhPIECkPY

This was my test case, which works with or without anchors​:

  "afoobar" =~ /((.)foo|bar)*/
  "afoobar" =~ /^((.)foo|bar)*$/

Or, as a standalone command​:

  perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

This prints "b", even though "bfoo" never appears in "afoobar"!

I understand why this is happening -- the inner group does match
against "b" in "bar" on the second iteration, but this branch of
the alternation fails. The capture is still being used, despite
the fact that it came from a failed branch of the alternation.

The correct answer seems to be "a", since that's the last match
of the inner group and the overall match is successful. Perl 1.0
can't handle this regex (Larry said it was the regex engine from
Gosling Emacs), but Perl 2.0 through Perl 5.0 alpha 9 all print
"a" for the command above. Other regex implementations, such as
PCRE, RE2, GNU and others, also return "a" for the inner group.

Perl 5.000 (from 1994) is the first commit in the git repository
(commit a0d0e21) which exhibits
the bug, printing "b" instead of "a". I just built blead again
today and confirmed that the bug is still there, despite passing
the full test suite. (Tom Christiansen pointed out that this bug
is technically a regression, since it used to work correctly.)

Even though I have never worked on the Perl core, and I've been
warned that the regex engine is particularly difficult, I would
still like to attempt to develop a patch for this bug myself.

I've already managed to create a working patch that fixes this
bug without breaking any of the regular expression tests in the
test suite, so I think I'm on the right track, but I think there
may be a few edge cases to consider, so I'm not ready to submit
the patch just yet.

Yves, SawyerX thought you might be willing to mentor me on this?
If so, that would be great!

My solution involves saving the captures with regcppush() on
BRANCH and TRIE nodes and restoring them with regcp_restore() at
BRANCH_next_fail and TRIE_next_fail. Does that sound like the
right approach, give or take?

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.29.1:

Configured by deven at Sat Jul  7 22:07:54 EDT 2018.

Summary of my perl5 (revision 5 version 29 subversion 1) configuration:
  Commit id: 71525f77826ad33944c007b06b68a1f14a085e7a
  Platform:
    osname=linux
    osvers=3.19.8-100.fc20.x86_64
    archname=x86_64-linux-thread-multi
    uname='linux twist.ties.org 3.19.8-100.fc20.x86_64 #1 smp tue may 12
17:08:50 utc 2015 x86_64 x86_64 x86_64 gnulinux '
    config_args=''
    hint=previous
    useposix=true
    d_sigaction=define
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=define
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
    bincompat5005=undef
  Compiler:
    cc='gcc'
    ccflags ='-DDEBUGGING -D_REENTRANT -D_GNU_SOURCE -fwrapv
-fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    optimize='-g'
    cppflags='-DDEBUGGING -D_REENTRANT -D_GNU_SOURCE -fwrapv
-fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='4.8.3 20140911 (Red Hat 4.8.3-7)'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='gcc'
    ldflags =' -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib /lib/../lib64 /usr/lib/../lib64 /lib
/lib64 /usr/lib64 /usr/local/lib64 /usr/local/lib /usr/lib
    libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
-lgdbm_compat
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.18.so
    so=so
    useshrplib=false
    libperl=libperl.a
    gnulibc_version='2.18'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E'
    cccdlflags='-fPIC'
    lddlflags='-shared -g -L/usr/local/lib -fstack-protector-strong'



@INC for perl 5.29.1:
    lib
    /usr/local/lib/perl5/site_perl/5.29.1/x86_64-linux-thread-multi
    /usr/local/lib/perl5/site_perl/5.29.1
    /usr/local/lib/perl5/5.29.1/x86_64-linux-thread-multi
    /usr/local/lib/perl5/5.29.1


Environment for perl 5.29.1:
    HOME=/home/deven
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)

PATH=/home/deven/bin:/home/deven/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/etc:/sbin:/home/deven/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Jul 10, 2018

From @jkeenan

On Mon, 09 Jul 2018 04​:04​:32 GMT, dcorzine wrote​:

This is a bug report for perl from deven@​ties.org,
generated with the help of perlbug 1.41 running under perl 5.29.1.

-----------------------------------------------------------------
[Please describe your issue here]

I discovered a bug in Perl's regular expression engine a few
months ago. I showed it to many people at The Perl Conference
in Salt Lake City a couple weeks ago, and everyone agreed that
this was a bug in the regex engine in Perl itself, including
Abigail, Tom Christiansen, Karl Williamson and Larry Wall.

I even ended up doing a lightning talk about the bug​:

https://www.youtube.com/watch?v=U-JhPIECkPY

This was my test case, which works with or without anchors​:

"afoobar" =~ /((.)foo|bar)*/
"afoobar" =~ /^((.)foo|bar)*$/

Or, as a standalone command​:

perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

This prints "b", even though "bfoo" never appears in "afoobar"!

I understand why this is happening -- the inner group does match
against "b" in "bar" on the second iteration, but this branch of
the alternation fails. The capture is still being used, despite
the fact that it came from a failed branch of the alternation.

The correct answer seems to be "a", since that's the last match
of the inner group and the overall match is successful. Perl 1.0
can't handle this regex (Larry said it was the regex engine from
Gosling Emacs), but Perl 2.0 through Perl 5.0 alpha 9 all print
"a" for the command above. Other regex implementations, such as
PCRE, RE2, GNU and others, also return "a" for the inner group.

Perl 5.000 (from 1994) is the first commit in the git repository
(commit a0d0e21) which exhibits
the bug, printing "b" instead of "a". I just built blead again
today and confirmed that the bug is still there, despite passing
the full test suite. (Tom Christiansen pointed out that this bug
is technically a regression, since it used to work correctly.)

Even though I have never worked on the Perl core, and I've been
warned that the regex engine is particularly difficult, I would
still like to attempt to develop a patch for this bug myself.

I've already managed to create a working patch that fixes this
bug without breaking any of the regular expression tests in the
test suite, so I think I'm on the right track, but I think there
may be a few edge cases to consider, so I'm not ready to submit
the patch just yet.

Please submit the patch as an attachment. Even if it's not complete, it gives us something to run through the test suite and a starting point for discussion.

Yves, SawyerX thought you might be willing to mentor me on this?
If so, that would be great!

My solution involves saving the captures with regcppush() on
BRANCH and TRIE nodes and restoring them with regcp_restore() at
BRANCH_next_fail and TRIE_next_fail. Does that sound like the
right approach, give or take?

Thank you very much.

--
James E Keenan (jkeenan@​cpan.org)

@p5pRT
Copy link
Author

p5pRT commented Jul 10, 2018

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Jul 10, 2018

From @deven

On Tue, Jul 10, 2018 at 1​:09 PM, James E Keenan via RT <
perlbug-followup@​perl.org> wrote​:

On Mon, 09 Jul 2018 04​:04​:32 GMT, dcorzine wrote​:

This is a bug report for perl from deven@​ties.org,
generated with the help of perlbug 1.41 running under perl 5.29.1.

[...]

Please submit the patch as an attachment. Even if it's not complete, it
gives us something to run through the test suite and a starting point for
discussion.

Fair enough. I'll do that now.

Deven

1 similar comment
@p5pRT
Copy link
Author

p5pRT commented Jul 10, 2018

From @deven

On Tue, Jul 10, 2018 at 1​:09 PM, James E Keenan via RT <
perlbug-followup@​perl.org> wrote​:

On Mon, 09 Jul 2018 04​:04​:32 GMT, dcorzine wrote​:

This is a bug report for perl from deven@​ties.org,
generated with the help of perlbug 1.41 running under perl 5.29.1.

[...]

Please submit the patch as an attachment. Even if it's not complete, it
gives us something to run through the test suite and a starting point for
discussion.

Fair enough. I'll do that now.

Deven

@p5pRT
Copy link
Author

p5pRT commented Jul 10, 2018

From @hvds

On Sun, 08 Jul 2018 21​:04​:32 -0700, dcorzine wrote​:

This was my test case, which works with or without anchors​:

"afoobar" =~ /((.)foo|bar)*/
"afoobar" =~ /^((.)foo|bar)*$/

Or, as a standalone command​:

perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

This prints "b", even though "bfoo" never appears in "afoobar"!
[...]
The correct answer seems to be "a", since that's the last match
of the inner group and the overall match is successful.

It's by no means clear to me that that must be the correct answer; the other candidate worth considering is that, since in /(...)*/ we return the match for the last iteration of the group, you'd expect further captures embedded within there also to deliver the version from the last iteration of the group.

Under that interpretation, the correct answer would be undef, and the earlier releases that returned 'a' were merely more subtly wrong than the current release.

The fact that other examples don't match such an interpretation argues against it​:
% perl -wle 'use Data​::Dumper; print Dumper([ "foobar" =~ /((foo)|(bar))*/ ])'
$VAR1 = [
  'bar',
  'foo',
  'bar'
  ];
%
.. but I don't recall seeing docs to justify such results. I'll have to have another wade through them.

Hugo

@p5pRT
Copy link
Author

p5pRT commented Jul 10, 2018

From @deven

This is a first pass at a patch to fix RT #133352, to save and restore
all captures during alternations (BRANCH and TRIE) to avoid incorrectly
returning a capture from a failed branch, as happens with the following
test case, which has been broken ever since Perl 5.000 in 1994​:

  perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command, while
Perl 5.000 through today's blead all incorrectly print "b" instead.

This patch appears to work (making the test case above print "a" again),
and I am submitting it as a preliminary patch for discussion purposes;
I don't consider it ready to apply yet, for a number of reasons​:

* I haven't finished analyzing the code to check for edge cases that
  I may have overlooked with this patch. Since TRIE_next_fail included
  a test for ST.jump which would call UNWIND_PAREN(), I want to better
  understand how that was working and determine whether ST.jump being
  set might have any implications for this patch.

* I haven't developed any new regression tests to exercise this bug and
  the patch to fix it, and I know that both BRANCH and TRIE variations
  need to be tested.

* This patch is currently saving ALL captures, including captures that
  can't possibly change inside the alternation. This is an easier fix,
  but I believe the performance impact should be mitigated, since this
  would impact many everyday regular expressions which aren't affected
  by this bug in the first place.

* I believe that setting a paren floor (which CURLYX does) would likely
  mitigate most (but not all) of the performance impact of this fix,
  but I haven't yet found a clean way to implement this.

* Another possibility might be to create new regnode types to use only
  for alternations containing capture groups, maybe only for ones which
  also have a quantifier applied, since I believe both are required to
  trigger this bug. (This might have to be treated as an optimization
  in regcomp.c to use the old logic for alternations that can't trigger
  the bug, but it might be hard to tell if a quantifier is applied for
  qr// cases.)

* Regardless, I would like feedback from an expert on this part of the
  regular expression engine (maybe Yves?) about this patch and whether
  the approach I've taken is on the right track.


regexec.c | 28 ++++++++++++++++------------
regexp.h | 3 +++
2 files changed, 19 insertions(+), 12 deletions(-)

@p5pRT
Copy link
Author

p5pRT commented Jul 10, 2018

From @deven

perl-133136-test1.patch
From 1a53f4dc4ab5032ea9cd0bf254adc0902695daa4 Mon Sep 17 00:00:00 2001
From: "Deven T. Corzine" <deven@ties.org>
Date: Sat, 23 Jun 2018 23:17:03 -0400
Subject: [PATCH] Save and restore captures for BRANCH/TRIE failures

This is a first pass at a patch to fix RT #133352, to save and restore
all captures during alternations (BRANCH and TRIE) to avoid incorrectly
returning a capture from a failed branch, as happens with the following
test case, which has been broken ever since Perl 5.000 in 1994:

     perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command, while
Perl 5.000 through today's blead all incorrectly print "b" instead.

This patch appears to work (making the test case above print "a" again),
and I am submitting it as a preliminary patch for discussion purposes;
I don't consider it ready to apply yet, for a number of reasons:

* I haven't finished analyzing the code to check for edge cases that
  I may have overlooked with this patch.  Since TRIE_next_fail included
  a test for ST.jump which would call UNWIND_PAREN(), I want to better
  understand how that was working and determine whether ST.jump being
  set might have any implications for this patch.

* I haven't developed any new regression tests to exercise this bug and
  the patch to fix it, and I know that both BRANCH and TRIE variations
  need to be tested.

* This patch is currently saving ALL captures, including captures that
  can't possibly change inside the alternation.  This is an easier fix,
  but I believe the performance impact should be mitigated, since this
  would impact many everyday regular expressions which aren't affected
  by this bug in the first place.

* I believe that setting a paren floor (which CURLYX does) would likely
  mitigate most (but not all) of the performance impact of this fix,
  but I haven't yet found a clean way to implement this.

* Another possibility might be to create new regnode types to use only
  for alternations containing capture groups, maybe only for ones which
  also have a quantifier applied, since I believe both are required to
  trigger this bug.  (This might have to be treated as an optimization
  in regcomp.c to use the old logic for alternations that can't trigger
  the bug, but it might be hard to tell if a quantifier is applied for
  qr// cases.)

* Regardless, I would like feedback from an expert on this part of the
  regular expression engine (maybe Yves?) about this patch and whether
  the approach I've taken is on the right track.
---
 regexec.c | 28 ++++++++++++++++------------
 regexp.h  |  3 +++
 2 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/regexec.c b/regexec.c
index ba52ae9..1d42b0d 100644
--- a/regexec.c
+++ b/regexec.c
@@ -5920,6 +5920,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
 		HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG( scan ) + 1 ]);
                 U32 state = trie->startstate;
 
+		/* save state of captures at branch start */
+		ST.cp = regcppush(rex, 0, maxopenparen);
+		REGCP_SET(ST.lastcp);
+
                 if (scan->flags == EXACTL || scan->flags == EXACTFLU8) {
                     _CHECK_AND_WARN_PROBLEMATIC_LOCALE;
                     if (utf8_target
@@ -6067,15 +6071,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
 	case TRIE_next_fail: /* we failed - try next alternative */
         {
             U8 *uc;
-            if ( ST.jump ) {
-                /* undo any captures done in the tail part of a branch,
-                 * e.g.
-                 *    /(?:X(.)(.)|Y(.)).../
-                 * where the trie just matches X then calls out to do the
-                 * rest of the branch */
-                REGCP_UNWIND(ST.cp);
-                UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
-	    }
+
+	    /* restore state of captures from branch start */
+	    regcp_restore(rex, ST.lastcp, &maxopenparen);
+
 	    if (!--ST.accepted) {
 	        DEBUG_EXECUTE_r({
                     Perl_re_exec_indentf( aTHX_  "%sTRIE failed...%s\n",
@@ -8043,7 +8042,10 @@ NULL
 	    ST.lastparen = rex->lastparen;
 	    ST.lastcloseparen = rex->lastcloseparen;
 	    ST.next_branch = next;
-	    REGCP_SET(ST.cp);
+
+	    /* save state of captures at branch start */
+	    ST.cp = regcppush(rex, 0, maxopenparen);
+	    REGCP_SET(ST.lastcp);
 
 	    /* Now go into the branch */
 	    if (has_cutgroup) {
@@ -8077,8 +8079,10 @@ NULL
 	        do_cutgroup = 0;
 	        no_final = 0;
 	    }
-	    REGCP_UNWIND(ST.cp);
-            UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
+
+	    /* restore state of captures from branch start */
+	    regcp_restore(rex, ST.lastcp, &maxopenparen);
+
 	    scan = ST.next_branch;
 	    /* no more branches? */
 	    if (!scan || (OP(scan) != BRANCH && OP(scan) != BRANCHJ)) {
diff --git a/regexp.h b/regexp.h
index 44409f0..aa61c97 100644
--- a/regexp.h
+++ b/regexp.h
@@ -754,6 +754,7 @@ typedef struct regmatch_state {
 	    U32 lastparen;
 	    U32 lastcloseparen;
 	    CHECKPOINT cp;
+	    CHECKPOINT lastcp;
 	    
         } branchlike;
         	    
@@ -763,6 +764,7 @@ typedef struct regmatch_state {
 	    U32 lastparen;
 	    U32 lastcloseparen;
 	    CHECKPOINT cp;
+	    CHECKPOINT lastcp;
 	    
 	    regnode *next_branch; /* next branch node */
 	} branch;
@@ -773,6 +775,7 @@ typedef struct regmatch_state {
 	    U32 lastparen;
 	    U32 lastcloseparen;
 	    CHECKPOINT cp;
+	    CHECKPOINT lastcp;
 
 	    U32		accepted; /* how many accepting states left */
 	    bool	longfold;/* saw a fold with a 1->n char mapping */
-- 
2.7.4

@p5pRT
Copy link
Author

p5pRT commented Jul 10, 2018

From @deven

On Tue, 10 Jul 2018 12​:19​:30 -0700, hv wrote​:

On Sun, 08 Jul 2018 21​:04​:32 -0700, dcorzine wrote​:

This was my test case, which works with or without anchors​:

"afoobar" =~ /((.)foo|bar)*/
"afoobar" =~ /^((.)foo|bar)*$/

Or, as a standalone command​:

perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

This prints "b", even though "bfoo" never appears in "afoobar"!
[...]
The correct answer seems to be "a", since that's the last match
of the inner group and the overall match is successful.

It's by no means clear to me that that must be the correct answer; the
other candidate worth considering is that, since in /(...)*/ we return
the match for the last iteration of the group, you'd expect further
captures embedded within there also to deliver the version from the
last iteration of the group.

Under that interpretation, the correct answer would be undef, and the
earlier releases that returned 'a' were merely more subtly wrong than
the current release.

Yeah, that's why I said the correct "seems" to be "a". There's a decent argument for returning undef, and it's certainly counter-intuitive to some degree to have $1="bar" from the second iteration and $2="a" from the first iteration, but the inner group does successfully match during the first iteration only, so "a" is indeed the last successful match.

I see two different viewpoints here, and both are quite reasonable. One viewpoint is that these are nested groups and both should be returning results from the same iteration, and therefore $2 should be undef because the nested match doesn't match anything on that iteration. Intuitively, this feels right. The other viewpoint is that each group can match multiple times and we only get to keep one capture per group, so $2 should be "a" since that's the last successful match for that group, despite the confusion of matching $1 and $2 from different iterations.

Personally, the $2="a" viewpoint seems like a stronger argument to me, but I could be in the minority in thinking that. I like the $2=undef viewpoint too, but we can't have both. In terms of how regular expressions are defined and documented, I'm having trouble rationalizing $2=undef even though it sounds good.

Is there anything definitive in the documentation that would resolve the question without ambiguity? I haven't found it yet, if there is.

For what it's worth, other regular expression engines like PCRE, RE2, GNU and others all return "a", which seems to suggest there may be some sort of consensus that "a" is the correct answer, but maybe it's just a gray area with no definite right answer?

The fact that other examples don't match such an interpretation argues
against it​:
% perl -wle 'use Data​::Dumper; print Dumper([ "foobar" =~
/((foo)|(bar))*/ ])'
$VAR1 = [
'bar',
'foo',
'bar'
];
%
.. but I don't recall seeing docs to justify such results. I'll have
to have another wade through them.

Hugo

That example is certainly returning $2 and $3 from different loop iterations, and it's less clear in this case that returning $2=undef would somehow be preferable or more intuitive for this example. Is that what you were getting at?

Deven

@p5pRT
Copy link
Author

p5pRT commented Jul 11, 2018

From @deven

[I originally entered this via RT, but I didn't see it come though
p5p, so I'm resending directly, hopefully it's not a duplicate!]

On Tue, 10 Jul 2018 12​:19​:30 -0700, hv wrote​:

On Sun, 08 Jul 2018 21​:04​:32 -0700, dcorzine wrote​:

This was my test case, which works with or without anchors​:

"afoobar" =~ /((.)foo|bar)*/
"afoobar" =~ /^((.)foo|bar)*$/

Or, as a standalone command​:

perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

This prints "b", even though "bfoo" never appears in "afoobar"!
[...]
The correct answer seems to be "a", since that's the last match
of the inner group and the overall match is successful.

It's by no means clear to me that that must be the correct answer; the
other candidate worth considering is that, since in /(...)*/ we return
the match for the last iteration of the group, you'd expect further
captures embedded within there also to deliver the version from the
last iteration of the group.

Under that interpretation, the correct answer would be undef, and the
earlier releases that returned 'a' were merely more subtly wrong than
the current release.

Yeah, that's why I said the correct "seems" to be "a". There's a
decent argument for returning undef, and it's certainly
counter-intuitive to some degree to have $1="bar" from the second
iteration and $2="a" from the first iteration, but the inner group
does successfully match during the first iteration only, so "a" is
indeed the last successful match.

I see two different viewpoints here, and both are quite reasonable.
One viewpoint is that these are nested groups and both should be
returning results from the same iteration, and therefore $2 should be
undef because the nested match doesn't match anything on that
iteration. Intuitively, this feels right. The other viewpoint is
that each group can match multiple times and we only get to keep one
capture per group, so $2 should be "a" since that's the last
successful match for that group, despite the confusion of matching $1
and $2 from different iterations.

Personally, the $2="a" viewpoint seems like a stronger argument to me,
but I could be in the minority in thinking that. I like the $2=undef
viewpoint too, but we can't have both. In terms of how regular
expressions are defined and documented, I'm having trouble
rationalizing $2=undef even though it sounds good.

Is there anything definitive in the documentation that would resolve
the question without ambiguity? I haven't found it yet, if there is.

For what it's worth, other regular expression engines like PCRE, RE2,
GNU and others all return "a", which seems to suggest there may be
some sort of consensus that "a" is the correct answer, but maybe it's
just a gray area with no definite right answer?

The fact that other examples don't match such an interpretation argues
against it​:
% perl -wle 'use Data​::Dumper; print Dumper([ "foobar" =~
/((foo)|(bar))*/ ])'
$VAR1 = [
'bar',
'foo',
'bar'
];
%
.. but I don't recall seeing docs to justify such results. I'll have
to have another wade through them.

Hugo

That example is certainly returning $2 and $3 from different loop
iterations, and it's less clear in this case that returning $2=undef
would somehow be preferable or more intuitive for this example. Is
that what you were getting at?

Deven

@p5pRT
Copy link
Author

p5pRT commented Jul 15, 2018

From @deven

Has anyone taken a look at this patch yet? I'd like to hear what
others think so far, thanks!

Deven

On Tue, Jul 10, 2018 at 3​:19 PM, Deven T. Corzine via RT
<perlbug-followup@​perl.org> wrote​:

This is a first pass at a patch to fix RT #133352, to save and restore
all captures during alternations (BRANCH and TRIE) to avoid incorrectly
returning a capture from a failed branch, as happens with the following
test case, which has been broken ever since Perl 5.000 in 1994​:

 perl \-e 'print "$2\\n" if "afoobar" =~ /^\(\(\.\)foo|bar\)\*$/;'

Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command, while
Perl 5.000 through today's blead all incorrectly print "b" instead.

This patch appears to work (making the test case above print "a" again),
and I am submitting it as a preliminary patch for discussion purposes;
I don't consider it ready to apply yet, for a number of reasons​:

* I haven't finished analyzing the code to check for edge cases that
I may have overlooked with this patch. Since TRIE_next_fail included
a test for ST.jump which would call UNWIND_PAREN(), I want to better
understand how that was working and determine whether ST.jump being
set might have any implications for this patch.

* I haven't developed any new regression tests to exercise this bug and
the patch to fix it, and I know that both BRANCH and TRIE variations
need to be tested.

* This patch is currently saving ALL captures, including captures that
can't possibly change inside the alternation. This is an easier fix,
but I believe the performance impact should be mitigated, since this
would impact many everyday regular expressions which aren't affected
by this bug in the first place.

* I believe that setting a paren floor (which CURLYX does) would likely
mitigate most (but not all) of the performance impact of this fix,
but I haven't yet found a clean way to implement this.

* Another possibility might be to create new regnode types to use only
for alternations containing capture groups, maybe only for ones which
also have a quantifier applied, since I believe both are required to
trigger this bug. (This might have to be treated as an optimization
in regcomp.c to use the old logic for alternations that can't trigger
the bug, but it might be hard to tell if a quantifier is applied for
qr// cases.)

* Regardless, I would like feedback from an expert on this part of the
regular expression engine (maybe Yves?) about this patch and whether
the approach I've taken is on the right track.
---
regexec.c | 28 ++++++++++++++++------------
regexp.h | 3 +++
2 files changed, 19 insertions(+), 12 deletions(-)

---
via perlbug​: queue​: perl5 status​: open
https://rt-archive.perl.org/perl5/Ticket/Display.html?id=133352

From 1a53f4dc4ab5032ea9cd0bf254adc0902695daa4 Mon Sep 17 00​:00​:00 2001
From​: "Deven T. Corzine" <deven@​ties.org>
Date​: Sat, 23 Jun 2018 23​:17​:03 -0400
Subject​: [PATCH] Save and restore captures for BRANCH/TRIE failures

This is a first pass at a patch to fix RT #133352, to save and restore
all captures during alternations (BRANCH and TRIE) to avoid incorrectly
returning a capture from a failed branch, as happens with the following
test case, which has been broken ever since Perl 5.000 in 1994​:

 perl \-e 'print "$2\\n" if "afoobar" =~ /^\(\(\.\)foo|bar\)\*$/;'

Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command, while
Perl 5.000 through today's blead all incorrectly print "b" instead.

This patch appears to work (making the test case above print "a" again),
and I am submitting it as a preliminary patch for discussion purposes;
I don't consider it ready to apply yet, for a number of reasons​:

* I haven't finished analyzing the code to check for edge cases that
I may have overlooked with this patch. Since TRIE_next_fail included
a test for ST.jump which would call UNWIND_PAREN(), I want to better
understand how that was working and determine whether ST.jump being
set might have any implications for this patch.

* I haven't developed any new regression tests to exercise this bug and
the patch to fix it, and I know that both BRANCH and TRIE variations
need to be tested.

* This patch is currently saving ALL captures, including captures that
can't possibly change inside the alternation. This is an easier fix,
but I believe the performance impact should be mitigated, since this
would impact many everyday regular expressions which aren't affected
by this bug in the first place.

* I believe that setting a paren floor (which CURLYX does) would likely
mitigate most (but not all) of the performance impact of this fix,
but I haven't yet found a clean way to implement this.

* Another possibility might be to create new regnode types to use only
for alternations containing capture groups, maybe only for ones which
also have a quantifier applied, since I believe both are required to
trigger this bug. (This might have to be treated as an optimization
in regcomp.c to use the old logic for alternations that can't trigger
the bug, but it might be hard to tell if a quantifier is applied for
qr// cases.)

* Regardless, I would like feedback from an expert on this part of the
regular expression engine (maybe Yves?) about this patch and whether
the approach I've taken is on the right track.
---
regexec.c | 28 ++++++++++++++++------------
regexp.h | 3 +++
2 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/regexec.c b/regexec.c
index ba52ae9..1d42b0d 100644
--- a/regexec.c
+++ b/regexec.c
@​@​ -5920,6 +5920,10 @​@​ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG( scan ) + 1 ]);
U32 state = trie->startstate;

+ /* save state of captures at branch start */
+ ST.cp = regcppush(rex, 0, maxopenparen);
+ REGCP_SET(ST.lastcp);
+
if (scan->flags == EXACTL || scan->flags == EXACTFLU8) {
_CHECK_AND_WARN_PROBLEMATIC_LOCALE;
if (utf8_target
@​@​ -6067,15 +6071,10 @​@​ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
case TRIE_next_fail​: /* we failed - try next alternative */
{
U8 *uc;
- if ( ST.jump ) {
- /* undo any captures done in the tail part of a branch,
- * e.g.
- * /(?​:X(.)(.)|Y(.)).../
- * where the trie just matches X then calls out to do the
- * rest of the branch */
- REGCP_UNWIND(ST.cp);
- UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
- }
+
+ /* restore state of captures from branch start */
+ regcp_restore(rex, ST.lastcp, &maxopenparen);
+
if (!--ST.accepted) {
DEBUG_EXECUTE_r({
Perl_re_exec_indentf( aTHX_ "%sTRIE failed...%s\n",
@​@​ -8043,7 +8042,10 @​@​ NULL
ST.lastparen = rex->lastparen;
ST.lastcloseparen = rex->lastcloseparen;
ST.next_branch = next;
- REGCP_SET(ST.cp);
+
+ /* save state of captures at branch start */
+ ST.cp = regcppush(rex, 0, maxopenparen);
+ REGCP_SET(ST.lastcp);

        /\* Now go into the branch \*/
        if \(has\_cutgroup\) \{

@​@​ -8077,8 +8079,10 @​@​ NULL
do_cutgroup = 0;
no_final = 0;
}
- REGCP_UNWIND(ST.cp);
- UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
+
+ /* restore state of captures from branch start */
+ regcp_restore(rex, ST.lastcp, &maxopenparen);
+
scan = ST.next_branch;
/* no more branches? */
if (!scan || (OP(scan) != BRANCH && OP(scan) != BRANCHJ)) {
diff --git a/regexp.h b/regexp.h
index 44409f0..aa61c97 100644
--- a/regexp.h
+++ b/regexp.h
@​@​ -754,6 +754,7 @​@​ typedef struct regmatch_state {
U32 lastparen;
U32 lastcloseparen;
CHECKPOINT cp;
+ CHECKPOINT lastcp;

     \} branchlike;

@​@​ -763,6 +764,7 @​@​ typedef struct regmatch_state {
U32 lastparen;
U32 lastcloseparen;
CHECKPOINT cp;
+ CHECKPOINT lastcp;

        regnode \*next\_branch; /\* next branch node \*/
    \} branch;

@​@​ -773,6 +775,7 @​@​ typedef struct regmatch_state {
U32 lastparen;
U32 lastcloseparen;
CHECKPOINT cp;
+ CHECKPOINT lastcp;

        U32         accepted; /\* how many accepting states left \*/
        bool        longfold;/\* saw a fold with a 1\->n char mapping \*/

--
2.7.4

1 similar comment
@p5pRT
Copy link
Author

p5pRT commented Jul 15, 2018

From @deven

Has anyone taken a look at this patch yet? I'd like to hear what
others think so far, thanks!

Deven

On Tue, Jul 10, 2018 at 3​:19 PM, Deven T. Corzine via RT
<perlbug-followup@​perl.org> wrote​:

This is a first pass at a patch to fix RT #133352, to save and restore
all captures during alternations (BRANCH and TRIE) to avoid incorrectly
returning a capture from a failed branch, as happens with the following
test case, which has been broken ever since Perl 5.000 in 1994​:

 perl \-e 'print "$2\\n" if "afoobar" =~ /^\(\(\.\)foo|bar\)\*$/;'

Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command, while
Perl 5.000 through today's blead all incorrectly print "b" instead.

This patch appears to work (making the test case above print "a" again),
and I am submitting it as a preliminary patch for discussion purposes;
I don't consider it ready to apply yet, for a number of reasons​:

* I haven't finished analyzing the code to check for edge cases that
I may have overlooked with this patch. Since TRIE_next_fail included
a test for ST.jump which would call UNWIND_PAREN(), I want to better
understand how that was working and determine whether ST.jump being
set might have any implications for this patch.

* I haven't developed any new regression tests to exercise this bug and
the patch to fix it, and I know that both BRANCH and TRIE variations
need to be tested.

* This patch is currently saving ALL captures, including captures that
can't possibly change inside the alternation. This is an easier fix,
but I believe the performance impact should be mitigated, since this
would impact many everyday regular expressions which aren't affected
by this bug in the first place.

* I believe that setting a paren floor (which CURLYX does) would likely
mitigate most (but not all) of the performance impact of this fix,
but I haven't yet found a clean way to implement this.

* Another possibility might be to create new regnode types to use only
for alternations containing capture groups, maybe only for ones which
also have a quantifier applied, since I believe both are required to
trigger this bug. (This might have to be treated as an optimization
in regcomp.c to use the old logic for alternations that can't trigger
the bug, but it might be hard to tell if a quantifier is applied for
qr// cases.)

* Regardless, I would like feedback from an expert on this part of the
regular expression engine (maybe Yves?) about this patch and whether
the approach I've taken is on the right track.
---
regexec.c | 28 ++++++++++++++++------------
regexp.h | 3 +++
2 files changed, 19 insertions(+), 12 deletions(-)

---
via perlbug​: queue​: perl5 status​: open
https://rt-archive.perl.org/perl5/Ticket/Display.html?id=133352

From 1a53f4dc4ab5032ea9cd0bf254adc0902695daa4 Mon Sep 17 00​:00​:00 2001
From​: "Deven T. Corzine" <deven@​ties.org>
Date​: Sat, 23 Jun 2018 23​:17​:03 -0400
Subject​: [PATCH] Save and restore captures for BRANCH/TRIE failures

This is a first pass at a patch to fix RT #133352, to save and restore
all captures during alternations (BRANCH and TRIE) to avoid incorrectly
returning a capture from a failed branch, as happens with the following
test case, which has been broken ever since Perl 5.000 in 1994​:

 perl \-e 'print "$2\\n" if "afoobar" =~ /^\(\(\.\)foo|bar\)\*$/;'

Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command, while
Perl 5.000 through today's blead all incorrectly print "b" instead.

This patch appears to work (making the test case above print "a" again),
and I am submitting it as a preliminary patch for discussion purposes;
I don't consider it ready to apply yet, for a number of reasons​:

* I haven't finished analyzing the code to check for edge cases that
I may have overlooked with this patch. Since TRIE_next_fail included
a test for ST.jump which would call UNWIND_PAREN(), I want to better
understand how that was working and determine whether ST.jump being
set might have any implications for this patch.

* I haven't developed any new regression tests to exercise this bug and
the patch to fix it, and I know that both BRANCH and TRIE variations
need to be tested.

* This patch is currently saving ALL captures, including captures that
can't possibly change inside the alternation. This is an easier fix,
but I believe the performance impact should be mitigated, since this
would impact many everyday regular expressions which aren't affected
by this bug in the first place.

* I believe that setting a paren floor (which CURLYX does) would likely
mitigate most (but not all) of the performance impact of this fix,
but I haven't yet found a clean way to implement this.

* Another possibility might be to create new regnode types to use only
for alternations containing capture groups, maybe only for ones which
also have a quantifier applied, since I believe both are required to
trigger this bug. (This might have to be treated as an optimization
in regcomp.c to use the old logic for alternations that can't trigger
the bug, but it might be hard to tell if a quantifier is applied for
qr// cases.)

* Regardless, I would like feedback from an expert on this part of the
regular expression engine (maybe Yves?) about this patch and whether
the approach I've taken is on the right track.
---
regexec.c | 28 ++++++++++++++++------------
regexp.h | 3 +++
2 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/regexec.c b/regexec.c
index ba52ae9..1d42b0d 100644
--- a/regexec.c
+++ b/regexec.c
@​@​ -5920,6 +5920,10 @​@​ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG( scan ) + 1 ]);
U32 state = trie->startstate;

+ /* save state of captures at branch start */
+ ST.cp = regcppush(rex, 0, maxopenparen);
+ REGCP_SET(ST.lastcp);
+
if (scan->flags == EXACTL || scan->flags == EXACTFLU8) {
_CHECK_AND_WARN_PROBLEMATIC_LOCALE;
if (utf8_target
@​@​ -6067,15 +6071,10 @​@​ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
case TRIE_next_fail​: /* we failed - try next alternative */
{
U8 *uc;
- if ( ST.jump ) {
- /* undo any captures done in the tail part of a branch,
- * e.g.
- * /(?​:X(.)(.)|Y(.)).../
- * where the trie just matches X then calls out to do the
- * rest of the branch */
- REGCP_UNWIND(ST.cp);
- UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
- }
+
+ /* restore state of captures from branch start */
+ regcp_restore(rex, ST.lastcp, &maxopenparen);
+
if (!--ST.accepted) {
DEBUG_EXECUTE_r({
Perl_re_exec_indentf( aTHX_ "%sTRIE failed...%s\n",
@​@​ -8043,7 +8042,10 @​@​ NULL
ST.lastparen = rex->lastparen;
ST.lastcloseparen = rex->lastcloseparen;
ST.next_branch = next;
- REGCP_SET(ST.cp);
+
+ /* save state of captures at branch start */
+ ST.cp = regcppush(rex, 0, maxopenparen);
+ REGCP_SET(ST.lastcp);

        /\* Now go into the branch \*/
        if \(has\_cutgroup\) \{

@​@​ -8077,8 +8079,10 @​@​ NULL
do_cutgroup = 0;
no_final = 0;
}
- REGCP_UNWIND(ST.cp);
- UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
+
+ /* restore state of captures from branch start */
+ regcp_restore(rex, ST.lastcp, &maxopenparen);
+
scan = ST.next_branch;
/* no more branches? */
if (!scan || (OP(scan) != BRANCH && OP(scan) != BRANCHJ)) {
diff --git a/regexp.h b/regexp.h
index 44409f0..aa61c97 100644
--- a/regexp.h
+++ b/regexp.h
@​@​ -754,6 +754,7 @​@​ typedef struct regmatch_state {
U32 lastparen;
U32 lastcloseparen;
CHECKPOINT cp;
+ CHECKPOINT lastcp;

     \} branchlike;

@​@​ -763,6 +764,7 @​@​ typedef struct regmatch_state {
U32 lastparen;
U32 lastcloseparen;
CHECKPOINT cp;
+ CHECKPOINT lastcp;

        regnode \*next\_branch; /\* next branch node \*/
    \} branch;

@​@​ -773,6 +775,7 @​@​ typedef struct regmatch_state {
U32 lastparen;
U32 lastcloseparen;
CHECKPOINT cp;
+ CHECKPOINT lastcp;

        U32         accepted; /\* how many accepting states left \*/
        bool        longfold;/\* saw a fold with a 1\->n char mapping \*/

--
2.7.4

@p5pRT
Copy link
Author

p5pRT commented Jul 16, 2018

From @iabyn

On Sun, Jul 15, 2018 at 03​:52​:30PM -0400, Deven T. Corzine wrote​:

Has anyone taken a look at this patch yet? I'd like to hear what
others think so far, thanks!

I plan to look at it at some point, but I hate and fear the capture
save and restore code in regmatch(), which has kept me away from looking
at your patch yet. Maybe in a few days.

--
Art is anything that has a label (especially if the label is "untitled 1")

1 similar comment
@p5pRT
Copy link
Author

p5pRT commented Jul 16, 2018

From @iabyn

On Sun, Jul 15, 2018 at 03​:52​:30PM -0400, Deven T. Corzine wrote​:

Has anyone taken a look at this patch yet? I'd like to hear what
others think so far, thanks!

I plan to look at it at some point, but I hate and fear the capture
save and restore code in regmatch(), which has kept me away from looking
at your patch yet. Maybe in a few days.

--
Art is anything that has a label (especially if the label is "untitled 1")

@p5pRT
Copy link
Author

p5pRT commented Jul 16, 2018

From @deven

On Mon, Jul 16, 2018 at 4​:13 AM, Dave Mitchell <davem@​iabyn.com> wrote​:

I plan to look at it at some point, but I hate and fear the capture
save and restore code in regmatch(), which has kept me away from looking
at your patch yet. Maybe in a few days.

I look forward to hearing your feedback, thanks Dave! That's
definitely complex code! It took quite a few hours of study to
understand it well enough to make that patch, though it's only a few
lines long.

The patch still needs work -- right now it's saving and restoring ALL
the captures, which is overkill. There's no need to save and restore
captures that are outside the alternation, and that could affect the
performance of many regexes which aren't affected by this bug in the
first place.

My initial inclination was try setting a paren floor (which CURLYX
does), but I haven't figured out a clean way to do that yet.

When a regmatch_state is allocated, is the memory cleared or uninitialized?

Deven

1 similar comment
@p5pRT
Copy link
Author

p5pRT commented Jul 16, 2018

From @deven

On Mon, Jul 16, 2018 at 4​:13 AM, Dave Mitchell <davem@​iabyn.com> wrote​:

I plan to look at it at some point, but I hate and fear the capture
save and restore code in regmatch(), which has kept me away from looking
at your patch yet. Maybe in a few days.

I look forward to hearing your feedback, thanks Dave! That's
definitely complex code! It took quite a few hours of study to
understand it well enough to make that patch, though it's only a few
lines long.

The patch still needs work -- right now it's saving and restoring ALL
the captures, which is overkill. There's no need to save and restore
captures that are outside the alternation, and that could affect the
performance of many regexes which aren't affected by this bug in the
first place.

My initial inclination was try setting a paren floor (which CURLYX
does), but I haven't figured out a clean way to do that yet.

When a regmatch_state is allocated, is the memory cleared or uninitialized?

Deven

@p5pRT
Copy link
Author

p5pRT commented Jul 17, 2018

From @iabyn

On Tue, Jul 10, 2018 at 02​:14​:03PM -0700, Deven T. Corzine via RT wrote​:

Yeah, that's why I said the correct "seems" to be "a". There's a decent
argument for returning undef

My take on it is based on two (possibly conflicting) ideals.

The first is that when starting the (N+1)th iteration of a '*' expression
or similar, the captures from the Nth iteration are still valid, until
over-written by the (N+1)th iteration. This allows patterns of this form
to work​:

  print "1=[$1]\n" if "AA-ABB-" =~ /^ (?​: \1? (\w) \1 - )* $/x

which outputs 'B'.

On the first iteration​:
  the first \1 fails to match anything, and is skipped via the '?';
  the second \1 matches 'A'

On the second iteration​:
  the first \1 matches 'A' - the value from the last iteration
  the second \1 matches 'B' - the value from the current iteration

The second principle is that following an alternation, the captures
from branches that failed or weren't tried, are invalid.

Neither of these match​:

  print "matched\n" if "B" =~ /^ (?​: A(.) | B ) \1 $/x;
  print "matched\n" if "B" =~ /^ (?​: A | B | (.) ) \1 $/x;

So the real question is, at the end of an alternation, should any
'unused' captures within the alternation be flagged as invalid,
or should they preserved - retaining the values they had at the start of
the alternation (which may be real values if this is a second+ iteration
of an enclosing '*' etc).

I think you could argue it either way. However, since this bug has been
around since forever, with no-one apparently noticing it before, I think
can fix it how we like - so we should pick whichever is easiest to
implement.

Invalidating captures set by a failing branch involves just knowing the
max index at the start and end of the branch execution, and invalidating
everything in between; restoring previous values involves saving a whole
set of capture indices and restoring them on failure (which is what I
think your patch does). The latter sounds a whole lot more expensive, and
would potentially slow down all alterations.

--
The Enterprise is involved in a bizarre time-warp experience which is in
some way unconnected with the Late 20th Century.
  -- Things That Never Happen in "Star Trek" #14

1 similar comment
@p5pRT
Copy link
Author

p5pRT commented Jul 17, 2018

From @iabyn

On Tue, Jul 10, 2018 at 02​:14​:03PM -0700, Deven T. Corzine via RT wrote​:

Yeah, that's why I said the correct "seems" to be "a". There's a decent
argument for returning undef

My take on it is based on two (possibly conflicting) ideals.

The first is that when starting the (N+1)th iteration of a '*' expression
or similar, the captures from the Nth iteration are still valid, until
over-written by the (N+1)th iteration. This allows patterns of this form
to work​:

  print "1=[$1]\n" if "AA-ABB-" =~ /^ (?​: \1? (\w) \1 - )* $/x

which outputs 'B'.

On the first iteration​:
  the first \1 fails to match anything, and is skipped via the '?';
  the second \1 matches 'A'

On the second iteration​:
  the first \1 matches 'A' - the value from the last iteration
  the second \1 matches 'B' - the value from the current iteration

The second principle is that following an alternation, the captures
from branches that failed or weren't tried, are invalid.

Neither of these match​:

  print "matched\n" if "B" =~ /^ (?​: A(.) | B ) \1 $/x;
  print "matched\n" if "B" =~ /^ (?​: A | B | (.) ) \1 $/x;

So the real question is, at the end of an alternation, should any
'unused' captures within the alternation be flagged as invalid,
or should they preserved - retaining the values they had at the start of
the alternation (which may be real values if this is a second+ iteration
of an enclosing '*' etc).

I think you could argue it either way. However, since this bug has been
around since forever, with no-one apparently noticing it before, I think
can fix it how we like - so we should pick whichever is easiest to
implement.

Invalidating captures set by a failing branch involves just knowing the
max index at the start and end of the branch execution, and invalidating
everything in between; restoring previous values involves saving a whole
set of capture indices and restoring them on failure (which is what I
think your patch does). The latter sounds a whole lot more expensive, and
would potentially slow down all alterations.

--
The Enterprise is involved in a bizarre time-warp experience which is in
some way unconnected with the Late 20th Century.
  -- Things That Never Happen in "Star Trek" #14

@p5pRT
Copy link
Author

p5pRT commented Jul 17, 2018

From @deven

On Tue, Jul 17, 2018 at 11​:02 AM, Dave Mitchell <davem@​iabyn.com> wrote​:

On Tue, Jul 10, 2018 at 02​:14​:03PM -0700, Deven T. Corzine via RT wrote​:

Yeah, that's why I said the correct "seems" to be "a". There's a decent
argument for returning undef

My take on it is based on two (possibly conflicting) ideals.

The first is that when starting the (N+1)th iteration of a '*' expression
or similar, the captures from the Nth iteration are still valid, until
over-written by the (N+1)th iteration. This allows patterns of this form
to work​:

print "1=\[$1\]\\n" if "AA\-ABB\-" =~ /^ \(?&#8203;: \\1? \(\\w\) \\1 \- \)\* $/x

which outputs 'B'.

On the first iteration​:
the first \1 fails to match anything, and is skipped via the '?';
the second \1 matches 'A'

On the second iteration​:
the first \1 matches 'A' - the value from the last iteration
the second \1 matches 'B' - the value from the current iteration

Interesting. It's not a forward reference, but it's sort of an
inverted backreference. I don't think I've ever seen that done
before, but it seems to me that this ought to be valid, unusual though
it is.

The second principle is that following an alternation, the captures
from branches that failed or weren't tried, are invalid.

Neither of these match​:

print "matched\\n" if "B" =~ /^ \(?&#8203;: A\(\.\) | B \) \\1 $/x;
print "matched\\n" if "B" =~ /^ \(?&#8203;: A | B | \(\.\) \) \\1 $/x;

My patch doesn't change the behavior of those examples.

On the other hand, my patch does allow this example to match​:

  print "matched\n" if "ABCDA" =~ /^ (?​: (.)B | CD )* \1 $/x;

Without my patch, this matches instead​:

  print "matched\n" if "ABCDC" =~ /^ (?​: (.)B | CD )* \1 $/x;

So the real question is, at the end of an alternation, should any
'unused' captures within the alternation be flagged as invalid,
or should they preserved - retaining the values they had at the start of
the alternation (which may be real values if this is a second+ iteration
of an enclosing '*' etc).

Indeed, and there is certainly room for debate here, as both arguments
are reasonable.

Personally, invalidating a successful capture that was part of a
successful match feels wrong to me. Consider this example​:

  "foobar" =~ /^ (?​: (foo) | (bar) )* $/x;

Both with and without my patch, this leaves $1="foo" and $2="bar". If
the "unused" capture were to be invalidated, this would leave $1
undefined instead. Would this be desirable?

I think you could argue it either way. However, since this bug has been
around since forever, with no-one apparently noticing it before, I think
can fix it how we like - so we should pick whichever is easiest to
implement.

That's a good point. Still, it's worth considering that restoring the
original captures is what Perl used to do in Perl 2.0 through Perl 5.0
alpha 9, and it's also what PCRE, RE2 and other regex implementations
do also. Javascript does seem to invalidate the capture instead, but
that's the only counter-example I've seen so far.

Strangely, I can't find a single word in the documentation about what
happens with captures when quantifiers are applied. We're used to the
capture being replaced with the new one, yet I can't find anything
saying that it does that! All of it seems to be unspecified behavior.
Is it somewhere I'm missing?

Invalidating captures set by a failing branch involves just knowing the
max index at the start and end of the branch execution, and invalidating
everything in between; restoring previous values involves saving a whole
set of capture indices and restoring them on failure (which is what I
think your patch does). The latter sounds a whole lot more expensive, and
would potentially slow down all alterations.

Yes, that's what my current patch does, and there IS a performance
issue here, since I'm currently saving and restoring ALL captures,
whether or not they're in the alternation. This is unnecessary, but I
haven't figured out yet how to set the paren floor correctly to only
save the necessary ones. I think that would mitigate most of the
performance impact, though not all of it.

Would it be better to remember the captures some other way? Perl 6
returns ALL the captures; should Perl 5 have that capability too?

Deven

1 similar comment
@p5pRT
Copy link
Author

p5pRT commented Jul 17, 2018

From @deven

On Tue, Jul 17, 2018 at 11​:02 AM, Dave Mitchell <davem@​iabyn.com> wrote​:

On Tue, Jul 10, 2018 at 02​:14​:03PM -0700, Deven T. Corzine via RT wrote​:

Yeah, that's why I said the correct "seems" to be "a". There's a decent
argument for returning undef

My take on it is based on two (possibly conflicting) ideals.

The first is that when starting the (N+1)th iteration of a '*' expression
or similar, the captures from the Nth iteration are still valid, until
over-written by the (N+1)th iteration. This allows patterns of this form
to work​:

print "1=\[$1\]\\n" if "AA\-ABB\-" =~ /^ \(?&#8203;: \\1? \(\\w\) \\1 \- \)\* $/x

which outputs 'B'.

On the first iteration​:
the first \1 fails to match anything, and is skipped via the '?';
the second \1 matches 'A'

On the second iteration​:
the first \1 matches 'A' - the value from the last iteration
the second \1 matches 'B' - the value from the current iteration

Interesting. It's not a forward reference, but it's sort of an
inverted backreference. I don't think I've ever seen that done
before, but it seems to me that this ought to be valid, unusual though
it is.

The second principle is that following an alternation, the captures
from branches that failed or weren't tried, are invalid.

Neither of these match​:

print "matched\\n" if "B" =~ /^ \(?&#8203;: A\(\.\) | B \) \\1 $/x;
print "matched\\n" if "B" =~ /^ \(?&#8203;: A | B | \(\.\) \) \\1 $/x;

My patch doesn't change the behavior of those examples.

On the other hand, my patch does allow this example to match​:

  print "matched\n" if "ABCDA" =~ /^ (?​: (.)B | CD )* \1 $/x;

Without my patch, this matches instead​:

  print "matched\n" if "ABCDC" =~ /^ (?​: (.)B | CD )* \1 $/x;

So the real question is, at the end of an alternation, should any
'unused' captures within the alternation be flagged as invalid,
or should they preserved - retaining the values they had at the start of
the alternation (which may be real values if this is a second+ iteration
of an enclosing '*' etc).

Indeed, and there is certainly room for debate here, as both arguments
are reasonable.

Personally, invalidating a successful capture that was part of a
successful match feels wrong to me. Consider this example​:

  "foobar" =~ /^ (?​: (foo) | (bar) )* $/x;

Both with and without my patch, this leaves $1="foo" and $2="bar". If
the "unused" capture were to be invalidated, this would leave $1
undefined instead. Would this be desirable?

I think you could argue it either way. However, since this bug has been
around since forever, with no-one apparently noticing it before, I think
can fix it how we like - so we should pick whichever is easiest to
implement.

That's a good point. Still, it's worth considering that restoring the
original captures is what Perl used to do in Perl 2.0 through Perl 5.0
alpha 9, and it's also what PCRE, RE2 and other regex implementations
do also. Javascript does seem to invalidate the capture instead, but
that's the only counter-example I've seen so far.

Strangely, I can't find a single word in the documentation about what
happens with captures when quantifiers are applied. We're used to the
capture being replaced with the new one, yet I can't find anything
saying that it does that! All of it seems to be unspecified behavior.
Is it somewhere I'm missing?

Invalidating captures set by a failing branch involves just knowing the
max index at the start and end of the branch execution, and invalidating
everything in between; restoring previous values involves saving a whole
set of capture indices and restoring them on failure (which is what I
think your patch does). The latter sounds a whole lot more expensive, and
would potentially slow down all alterations.

Yes, that's what my current patch does, and there IS a performance
issue here, since I'm currently saving and restoring ALL captures,
whether or not they're in the alternation. This is unnecessary, but I
haven't figured out yet how to set the paren floor correctly to only
save the necessary ones. I think that would mitigate most of the
performance impact, though not all of it.

Would it be better to remember the captures some other way? Perl 6
returns ALL the captures; should Perl 5 have that capability too?

Deven

@p5pRT
Copy link
Author

p5pRT commented Jul 17, 2018

From @davidnicol

This is a good test case for the bug​:

On the other hand, my patch does allow this example to match​:

  print "matched\n" if "ABCDA" =~ /^ (?​: (.)B | CD )* \1 $/x;

Without my patch, this matches instead​:

  print "matched\n" if "ABCDC" =~ /^ (?​: (.)B | CD )* \1 $/x;

The bug appears to be the result of an optimization of describing the
capture buffer with an offset -- essentially a dynamic substring expression
-- rather than copying the captured string into it.
Were the capture buffer to be copied into, it would get 'A' (the character
before the B) rather than 'C ' (the first character in the match) and the
behavior would be the same as what the other regex engines do.

Is that what the patch changes?

--
"At this point, given the limited available data, certainty about only a
very small number of things can be achieved." -- Plato, and others

1 similar comment
@p5pRT
Copy link
Author

p5pRT commented Jul 17, 2018

From @davidnicol

This is a good test case for the bug​:

On the other hand, my patch does allow this example to match​:

  print "matched\n" if "ABCDA" =~ /^ (?​: (.)B | CD )* \1 $/x;

Without my patch, this matches instead​:

  print "matched\n" if "ABCDC" =~ /^ (?​: (.)B | CD )* \1 $/x;

The bug appears to be the result of an optimization of describing the
capture buffer with an offset -- essentially a dynamic substring expression
-- rather than copying the captured string into it.
Were the capture buffer to be copied into, it would get 'A' (the character
before the B) rather than 'C ' (the first character in the match) and the
behavior would be the same as what the other regex engines do.

Is that what the patch changes?

--
"At this point, given the limited available data, certainty about only a
very small number of things can be achieved." -- Plato, and others

@p5pRT
Copy link
Author

p5pRT commented Jul 17, 2018

From @deven

On Tue, Jul 17, 2018 at 12​:58 PM David Nicol <davidnicol@​gmail.com> wrote​:

This is a good test case for the bug​:

On the other hand, my patch does allow this example to match​:

 print "matched\\n" if "ABCDA" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

Without my patch, this matches instead​:

 print "matched\\n" if "ABCDC" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

The bug appears to be the result of an optimization of describing the
capture buffer with an offset -- essentially a dynamic substring expression
-- rather than copying the captured string into it.
Were the capture buffer to be copied into, it would get 'A' (the
character before the B) rather than 'C ' (the first character in the match)
and the behavior would be the same as what the other regex engines do.

Is that what the patch changes?

That optimization doesn't cause the bug, it's the attempt to match the (.)
again against "CD" that causes it -- the (.) matches, but the "D" doesn't,
and it doesn't restore the original capture.

Deven

1 similar comment
@p5pRT
Copy link
Author

p5pRT commented Jul 17, 2018

From @deven

On Tue, Jul 17, 2018 at 12​:58 PM David Nicol <davidnicol@​gmail.com> wrote​:

This is a good test case for the bug​:

On the other hand, my patch does allow this example to match​:

 print "matched\\n" if "ABCDA" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

Without my patch, this matches instead​:

 print "matched\\n" if "ABCDC" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

The bug appears to be the result of an optimization of describing the
capture buffer with an offset -- essentially a dynamic substring expression
-- rather than copying the captured string into it.
Were the capture buffer to be copied into, it would get 'A' (the
character before the B) rather than 'C ' (the first character in the match)
and the behavior would be the same as what the other regex engines do.

Is that what the patch changes?

That optimization doesn't cause the bug, it's the attempt to match the (.)
again against "CD" that causes it -- the (.) matches, but the "D" doesn't,
and it doesn't restore the original capture.

Deven

@p5pRT
Copy link
Author

p5pRT commented Jul 17, 2018

From @davidnicol

On Tue, Jul 17, 2018 at 12​:56 PM Deven T. Corzine <deven@​ties.org> wrote​:

On the other hand, my patch does allow this example to match​:

 print "matched\\n" if "ABCDA" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

Without my patch, this matches instead​:

 print "matched\\n" if "ABCDC" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

That optimization doesn't cause the bug, it's the attempt to match the
(.) again against "CD" that causes it -- the (.) matches, but the "D"
doesn't, and it doesn't restore the original capture.

Deven

Thank you, I misunderstood. So in the original demonstration, the "b" got
into $2 before the branch failed because the b was not followed by "foo",
not due to $2 being internally tracked as an offset, and as that branch had
succeeded, the capture was assignable.

As the current documentation (the section on "Capture Groups" in
https://perldoc.perl.org/perlre.html, accessed just now) states "If a group
did not match, the associated backreference won't match either. (This can
happen if the group is optional, or in a different branch of an
alternation.) " there is clearly a bug somewhere. On the other hand, as
CDCDC fails to match the test while CBCDC (unsurprisingly, but for
surprising reason) does, so there is some kind of "did this match"
knowledge happening, otherwise CDCDC would set \1 to C before failing to
match the B, and the implementation could be interpreted as conformant with
the documentation's "if a group did not match" but it takes a lot of
squinting.

The current documentation (that section) contains no guidance concerning
capture groups within repeating constructs. Honestly, before today I
expected regex constructions like

  "abcdef" =~ /(?​:(.))+/

to magically create $1 through $6 and load them all. This was erroneous!
That's not how it works! The documentation is silent on the matter.

As an opinionated person, I'm in favor of fixing the regression and
including

  # we don't clobber capture groups with data from failed alternate
branches (although we used to)
* ( **"ABCD" =~ /^(?​:(.)B|CD)*$/ and $1 eq **( $] ge '5.027' ? 'A' : 'C' ))*

into the test suite and documenting how captures into buffers in
alternations that passed in earlier iterations but not the most recent one
used to work, in perldoc perlre.

...
After looking at
https://rt-archive.perl.org/perl5/Ticket/Attachment/1566563/824618/perl-133136-test1.patch
I wonder if it might be possible to defer the assignment into the capture
buffers until after branches have succeeded, rather than resetting them.
This approach might require making a set of provisional capture buffers at
every juncture that could become a descent into an iterating subregex
containing captures, but wouldn't be vulnerable to only operating correctly
at the first level. But maybe the engine already stacks these things so
with the patch

$ perl -e ' "ABCDAFCDAD" =~ /(?​:(?​:(.)B|CD)+|(?​:(.)D|A(.))*)+/ and print
"$1 > $2 > $3"'
C > A > F

will do the right thing, whatever that is.

Thank you

--
"At this point, given the limited available data, certainty about only a
very small number of things can be achieved." -- Plato, and others

1 similar comment
@p5pRT
Copy link
Author

p5pRT commented Jul 17, 2018

From @davidnicol

On Tue, Jul 17, 2018 at 12​:56 PM Deven T. Corzine <deven@​ties.org> wrote​:

On the other hand, my patch does allow this example to match​:

 print "matched\\n" if "ABCDA" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

Without my patch, this matches instead​:

 print "matched\\n" if "ABCDC" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

That optimization doesn't cause the bug, it's the attempt to match the
(.) again against "CD" that causes it -- the (.) matches, but the "D"
doesn't, and it doesn't restore the original capture.

Deven

Thank you, I misunderstood. So in the original demonstration, the "b" got
into $2 before the branch failed because the b was not followed by "foo",
not due to $2 being internally tracked as an offset, and as that branch had
succeeded, the capture was assignable.

As the current documentation (the section on "Capture Groups" in
https://perldoc.perl.org/perlre.html, accessed just now) states "If a group
did not match, the associated backreference won't match either. (This can
happen if the group is optional, or in a different branch of an
alternation.) " there is clearly a bug somewhere. On the other hand, as
CDCDC fails to match the test while CBCDC (unsurprisingly, but for
surprising reason) does, so there is some kind of "did this match"
knowledge happening, otherwise CDCDC would set \1 to C before failing to
match the B, and the implementation could be interpreted as conformant with
the documentation's "if a group did not match" but it takes a lot of
squinting.

The current documentation (that section) contains no guidance concerning
capture groups within repeating constructs. Honestly, before today I
expected regex constructions like

  "abcdef" =~ /(?​:(.))+/

to magically create $1 through $6 and load them all. This was erroneous!
That's not how it works! The documentation is silent on the matter.

As an opinionated person, I'm in favor of fixing the regression and
including

  # we don't clobber capture groups with data from failed alternate
branches (although we used to)
* ( **"ABCD" =~ /^(?​:(.)B|CD)*$/ and $1 eq **( $] ge '5.027' ? 'A' : 'C' ))*

into the test suite and documenting how captures into buffers in
alternations that passed in earlier iterations but not the most recent one
used to work, in perldoc perlre.

...
After looking at
https://rt-archive.perl.org/perl5/Ticket/Attachment/1566563/824618/perl-133136-test1.patch
I wonder if it might be possible to defer the assignment into the capture
buffers until after branches have succeeded, rather than resetting them.
This approach might require making a set of provisional capture buffers at
every juncture that could become a descent into an iterating subregex
containing captures, but wouldn't be vulnerable to only operating correctly
at the first level. But maybe the engine already stacks these things so
with the patch

$ perl -e ' "ABCDAFCDAD" =~ /(?​:(?​:(.)B|CD)+|(?​:(.)D|A(.))*)+/ and print
"$1 > $2 > $3"'
C > A > F

will do the right thing, whatever that is.

Thank you

--
"At this point, given the limited available data, certainty about only a
very small number of things can be achieved." -- Plato, and others

@p5pRT
Copy link
Author

p5pRT commented Jul 18, 2018

From @iabyn

On Tue, Jul 17, 2018 at 11​:42​:11AM -0400, Deven T. Corzine wrote​:

On Tue, Jul 17, 2018 at 11​:02 AM, Dave Mitchell <davem@​iabyn.com> wrote​:

So the real question is, at the end of an alternation, should any
'unused' captures within the alternation be flagged as invalid,
or should they preserved - retaining the values they had at the start of
the alternation (which may be real values if this is a second+ iteration
of an enclosing '*' etc).

Indeed, and there is certainly room for debate here, as both arguments
are reasonable.

Personally, invalidating a successful capture that was part of a
successful match feels wrong to me. Consider this example​:

 "foobar" =~ /^ \(?&#8203;: \(foo\) | \(bar\) \)\* $/x;

Both with and without my patch, this leaves $1="foo" and $2="bar". If
the "unused" capture were to be invalidated, this would leave $1
undefined instead. Would this be desirable?

It would match the (admittedly ambiguous) bit in perlre that David Nicol
quoted​:

  If a group did not match, the associated backreference won't match
  either. (This can happen if the group is optional, or in a different
  branch of an alternation.)

Yes, that's what my current patch does, and there IS a performance
issue here, since I'm currently saving and restoring ALL captures,
whether or not they're in the alternation. This is unnecessary, but I
haven't figured out yet how to set the paren floor correctly to only
save the necessary ones. I think that would mitigate most of the
performance impact, though not all of it.

There is considerable performance overhead in saving even just one set of
capture indices - the marginal cost of saving more than one is less. So
saving fewer is good, saving none is a *lot* better.

The only ones needing saving or invalidating are the ones with indices
lying between lastopen+1 .. maxopenparen (I think, based on a quick look).

It might be worth writing an alternative patch which does just the
invalidation, rather than saving/restoring, and see what, if any, tests fail.
Those failures may give more insight into original intent, and whether
saving is worth it.

I suspect that writing the invalidation code might be quite tricky (in the
sense that the eventual code will be simple, but working out what that
code should be exactly may be hard).

Would it be better to remember the captures some other way? Perl 6
returns ALL the captures; should Perl 5 have that capability too?

What do you mean exactly?

--
Little fly, thy summer's play my thoughtless hand
has terminated with extreme prejudice.
  (with apologies to William Blake)

1 similar comment
@p5pRT
Copy link
Author

p5pRT commented Jul 18, 2018

From @iabyn

On Tue, Jul 17, 2018 at 11​:42​:11AM -0400, Deven T. Corzine wrote​:

On Tue, Jul 17, 2018 at 11​:02 AM, Dave Mitchell <davem@​iabyn.com> wrote​:

So the real question is, at the end of an alternation, should any
'unused' captures within the alternation be flagged as invalid,
or should they preserved - retaining the values they had at the start of
the alternation (which may be real values if this is a second+ iteration
of an enclosing '*' etc).

Indeed, and there is certainly room for debate here, as both arguments
are reasonable.

Personally, invalidating a successful capture that was part of a
successful match feels wrong to me. Consider this example​:

 "foobar" =~ /^ \(?&#8203;: \(foo\) | \(bar\) \)\* $/x;

Both with and without my patch, this leaves $1="foo" and $2="bar". If
the "unused" capture were to be invalidated, this would leave $1
undefined instead. Would this be desirable?

It would match the (admittedly ambiguous) bit in perlre that David Nicol
quoted​:

  If a group did not match, the associated backreference won't match
  either. (This can happen if the group is optional, or in a different
  branch of an alternation.)

Yes, that's what my current patch does, and there IS a performance
issue here, since I'm currently saving and restoring ALL captures,
whether or not they're in the alternation. This is unnecessary, but I
haven't figured out yet how to set the paren floor correctly to only
save the necessary ones. I think that would mitigate most of the
performance impact, though not all of it.

There is considerable performance overhead in saving even just one set of
capture indices - the marginal cost of saving more than one is less. So
saving fewer is good, saving none is a *lot* better.

The only ones needing saving or invalidating are the ones with indices
lying between lastopen+1 .. maxopenparen (I think, based on a quick look).

It might be worth writing an alternative patch which does just the
invalidation, rather than saving/restoring, and see what, if any, tests fail.
Those failures may give more insight into original intent, and whether
saving is worth it.

I suspect that writing the invalidation code might be quite tricky (in the
sense that the eventual code will be simple, but working out what that
code should be exactly may be hard).

Would it be better to remember the captures some other way? Perl 6
returns ALL the captures; should Perl 5 have that capability too?

What do you mean exactly?

--
Little fly, thy summer's play my thoughtless hand
has terminated with extreme prejudice.
  (with apologies to William Blake)

@p5pRT
Copy link
Author

p5pRT commented Jul 18, 2018

From @davidnicol

Personally, invalidating a successful capture that was part of a
successful match feels wrong to me. Consider this example​:

 "foobar" =~ /^ \(?&#8203;: \(foo\) | \(bar\) \)\* $/x;

Both with and without my patch, this leaves $1="foo" and $2="bar". If
the "unused" capture were to be invalidated, this would leave $1
undefined instead. Would this be desirable?

It would match the (admittedly ambiguous) bit in perlre that David Nicol
quoted​:

If a group did not match\, the associated backreference won't match
either\. \(This can happen if the group is optional\, or in a different
branch of an alternation\.\)

the expression in the capture matching isn't enough; the capture has to
match and then the outer expression has to match too.
I'd like to think that this can be fixed by rearranging the order of things
so the assignment to the capture group is deferred until
the branch the capture group is in matches.

"go" matches $1 but does not clobber $1 because "gox" is neither [fg]oo nor
bar
$ perl -le '"foobargox" =~ /^ (?​: ([fg]o)(o) | (bar) )* /x; print
"1​:$1;2​:$2;3​:$3"'
1​:fo;2​:o;3​:bar

"go" clobbers $1 even though "gox" is not [fg]oo, as a later alternative
within this iteration matches
$ perl -le '"foobargox" =~ /^ (?​: ([fg]o)(o) | gox | (bar) )* /x; print
"1​:$1;2​:$2;3​:$3"'
1​:go;2​:o;3​:bar

When exactly does the assignment happen?

--
"At this point, given the limited available data, certainty about only a
very small number of things can be achieved." -- Plato, and others

1 similar comment
@p5pRT
Copy link
Author

p5pRT commented Jul 18, 2018

From @davidnicol

Personally, invalidating a successful capture that was part of a
successful match feels wrong to me. Consider this example​:

 "foobar" =~ /^ \(?&#8203;: \(foo\) | \(bar\) \)\* $/x;

Both with and without my patch, this leaves $1="foo" and $2="bar". If
the "unused" capture were to be invalidated, this would leave $1
undefined instead. Would this be desirable?

It would match the (admittedly ambiguous) bit in perlre that David Nicol
quoted​:

If a group did not match\, the associated backreference won't match
either\. \(This can happen if the group is optional\, or in a different
branch of an alternation\.\)

the expression in the capture matching isn't enough; the capture has to
match and then the outer expression has to match too.
I'd like to think that this can be fixed by rearranging the order of things
so the assignment to the capture group is deferred until
the branch the capture group is in matches.

"go" matches $1 but does not clobber $1 because "gox" is neither [fg]oo nor
bar
$ perl -le '"foobargox" =~ /^ (?​: ([fg]o)(o) | (bar) )* /x; print
"1​:$1;2​:$2;3​:$3"'
1​:fo;2​:o;3​:bar

"go" clobbers $1 even though "gox" is not [fg]oo, as a later alternative
within this iteration matches
$ perl -le '"foobargox" =~ /^ (?​: ([fg]o)(o) | gox | (bar) )* /x; print
"1​:$1;2​:$2;3​:$3"'
1​:go;2​:o;3​:bar

When exactly does the assignment happen?

--
"At this point, given the limited available data, certainty about only a
very small number of things can be achieved." -- Plato, and others

@p5pRT
Copy link
Author

p5pRT commented Aug 28, 2018

From @iabyn

On Sun, Aug 12, 2018 at 09​:39​:16PM -0400, Deven T. Corzine wrote​:

It's been quiet for a couple weeks. Has anyone had a chance to take a
look at the patch yet?

(Yves, I'e CCed you as there's a question for you further down this
email.)

Sorry for the late reply. I took the opportunity to try to properly
understand this issue (you may remember me mentioning at the beginning how
"I hate and fear the capture save and restore code in regmatch()").

I think I now have a much better handle on this. I've just pushed a branch
which contains a proposed fix, along with lots of tests and benchmark
entries. It avoids saving and restoring capture indices unless the
alternation is wrapped in a repeat, and only saves indices back to the
start of the repeat. This is less inefficient than your original
suggestion, but it still makes alternations with captures within a repeat
run at about 50% of their former speed.

I know a way to make it much faster, but it involves, for every
BRANCH/BRANCHJ/TRIE/TRIEC node, knowing the current capture index - i.e.
in the same way that OPEN nodes have 'n = ARG(scan);' but I don't know how
this can stored.

Yves, is there type of node - or a way to extend the current nodes - such
that a 32-bit capture index can stored as part of each of these nodes
at compile time? Without breaking everything?

Anyway, I've pushed my current proposed fix as smoke-me/davem/captures,
and here's the commit message​:

commit d0f9ad5
Author​: David Mitchell <davem@​iabyn.com>
AuthorDate​: Tue Aug 28 09​:15​:04 2018 +0100
Commit​: David Mitchell <davem@​iabyn.com>
CommitDate​: Tue Aug 28 15​:32​:46 2018 +0100

  regex​: restore failed branch capture within repeat
 
  RT #133352
 
  There are two competing behaviours for what to do with any captures
  within a failed alternation branch.
 
  Normally the regex engine invalidates any captures created during
  execution of the failed branch. For example in something like
 
  /(A) (?​: (B)(C) | (D)(E) ) /x
 
  perl remembers the current highest capture index at the start of each
  branch attempt (1 in this example), and if the branch fails it
  invalidates all captures above this level (i.e. 2..3 or 2..5 depending
  on which branch failed) before trying the next branch (if any).
 
  However for repeats, such as (...)*, on each new iteration any captures
  inside the repeat initially maintain their value from the previous
  iteration, until updated within the current iteration. For example in
 
  "ABCD" =~ /(?​: (.)(.) )*/x
 
  At the start of the first iteration, \1 and \2 are invalid.
  At the start of the second iteration, \1 and \2 are 'A', 'B'.
  During the course of the second iteration, \1 changes to 'C', then
  \2 changes to 'D'.
 
  This causes a problem when there's an alternation within a repeat.
  Taking the first example above and adding a repeat​:
 
  /(A) (?​: (B)(C) | (D)(E) )* /x
 
  Now on the second iteration, on entry to the branch the current highest
  capture index is no longer 1 (i.e. A), but is instead 3 or 5 depending
  on what was captured on the previous iteration. Thus on branch failure,
  the 'disable everything above the previous floor' technique no longer
  works​: rather than disabling (1+1)..5 say, it disables (5+1)..5 - i.e.
  failed captures are left as-is. This is the essence of the bug in this
  ticket.
 
  It is not clear what the correct behaviour should be - arguably failed
  branch captures could be either invalidated or restored to their value
  at the start of the branch. Invalidating is probably cheaper. However,
  there are a couple of tests in t/te/re_tests added many years ago by
  Ilya which assume the latter behaviour (they pass because the pattern of
  branches and backtracking happen to leave $1 with the correct value;
  a slight alteration to the tests and they would have the wrong value).
  But in any case the tests aren't expecting $1 to be undefined.
 
  This commit fixes the bug by, at the start of any branch, saving any
  captures beyond the start of the innermost enclosing repeat, but only if
  it *is* within a repeat, and if there are any valid captures beyond the
  start of that repeat. These are restored on branch failure. Otherwise
  it just invalidates back to the highest capture index at the start of
  the branch, as before.
 
  So for example this is unchanged, just invalidating indices 2+ on each
  branch failure​:
 
  /(A) (?​: (B)(C) | (D)(E) ) /x
 
  While this​:
 
  "ABCD" =~ /(?​: ([AC]) ([BD]) )*/x
 
  does invalidation on the first iteration, and full saving/restore of
  capture indices 2+ on subsequent iterations, while this​:
 
  "-AB-CD" =~ /(?​: (-) ([AC]) ([BD]) )*/x
 
  saves and restores on *every* iteration, since the (-) is above the
  floor of the curlyx, so at the start of each branch, lastparen is always
  above cur_curlyx->lastparen.
 
  In terms of performance, the benchmarks included with this commit show
  that a plain branch is unaffected; a plain branch with captures is a few
  % slower, and a branch within a repeat is now about half the speed.
 
  This fix could be made a lot more efficient if the current paren index
  was stored in every BRANCH/BRANCHJ/TRIE/TRIEC node, and if a pointer to
  the current WHILEM node was maintained. Since every iteration of a
  CURLYX/WHILEM saves all paren indices above the curlyx floor anyway, it
  wouldn't be necessary to save them again for each branch; just on branch
  failure, restore all saved indices from that last WHILEM which are above
  the branch floor,

--
The warp engines start playing up a bit, but seem to sort themselves out
after a while without any intervention from boy genius Wesley Crusher.
  -- Things That Never Happen in "Star Trek" #17

@p5pRT
Copy link
Author

p5pRT commented Oct 16, 2018

From @deven

On Tue, Aug 28, 2018 at 10​:47 AM Dave Mitchell <davem@​iabyn.com> wrote​:

On Sun, Aug 12, 2018 at 09​:39​:16PM -0400, Deven T. Corzine wrote​:

It's been quiet for a couple weeks. Has anyone had a chance to take a
look at the patch yet?

(Yves, I'e CCed you as there's a question for you further down this
email.)

Sorry for the late reply. I took the opportunity to try to properly
understand this issue (you may remember me mentioning at the beginning how
"I hate and fear the capture save and restore code in regmatch()").

Sorry, now I'm the one being slow to reply! Yes, I remember that comment,
which certainly illustrates why so many people told me I was crazy to
attempt a patch myself!

I think I now have a much better handle on this. I've just pushed a branch
which contains a proposed fix, along with lots of tests and benchmark
entries. It avoids saving and restoring capture indices unless the
alternation is wrapped in a repeat, and only saves indices back to the
start of the repeat. This is less inefficient than your original
suggestion, but it still makes alternations with captures within a repeat
run at about 50% of their former speed.

I submitted my initial patch at James Keenan's request, solely as a
starting point for discussion. I stated in the accompanying notes that I
didn't consider the patch ready to apply yet, for reasons including the
unacceptable performance impact on regexes that aren't even affected by the
bug in the first place. I was hoping to find a clean way to set a paren
floor (as you've done with your patch), to mitigate the performance impact
that would otherwise affect every regex with captures inside branches. I'd
hate to have a 50% performance impact on common regexes just to fix a bug
that's so rarely encountered that it went unnoticed for decades!

My initial patch was only focused on correctness, not performance.
Stipulating that the performance was unacceptably slow, could you tell me
if my original patch was clean and valid from a correctness perspective, or
was I doing something wrong/ugly/unvalid in the patch?

I know a way to make it much faster, but it involves, for every

BRANCH/BRANCHJ/TRIE/TRIEC node, knowing the current capture index - i.e.
in the same way that OPEN nodes have 'n = ARG(scan);' but I don't know how
this can stored.

Yves, is there type of node - or a way to extend the current nodes - such
that a 32-bit capture index can stored as part of each of these nodes
at compile time? Without breaking everything?

I know there are only 8 bits to distinguish them, but perhaps this is a
situation where adding new regnode types would be worthwhile? Am I right
in thinking that a 32-bit capture index and the WHILEM pointer could be
stored that way?

Alternatively, if "ARG(scan)" or another mechanism could be used to save
even an 8-bit or 16-bit index value, it would probably cover most use
cases, and the max value (255 or 65535) could be used by default if the
index doesn't fit. I'm sure the vast majority of regexes have 255 or fewer
captures in the first place, and in the rare case where a regex has
thousands of captures and the proper index won't fit, this would only cause
an extra unnecessary performance for those rare cases, and the correctness
of the result should not be affected.

Anyway, I've pushed my current proposed fix as smoke-me/davem/captures,

and here's the commit message​:

I'll definitely check out your proposed fix; hopefully I'll get to it
sooner than later!

Thanks for looking into this!

Deven

@p5pRT
Copy link
Author

p5pRT commented Oct 17, 2018

From @iabyn

On Tue, Oct 16, 2018 at 03​:18​:07PM -0400, Deven T. Corzine wrote​:

My initial patch was only focused on correctness, not performance.
Stipulating that the performance was unacceptably slow, could you tell me
if my original patch was clean and valid from a correctness perspective, or
was I doing something wrong/ugly/unvalid in the patch?

Since nearly two months have passed since I worked on this, the details
have ebbed from my mind so I can't remember whether your approach was
merely slow or had issues.

I know a way to make it much faster, but it involves, for every

BRANCH/BRANCHJ/TRIE/TRIEC node, knowing the current capture index - i.e.
in the same way that OPEN nodes have 'n = ARG(scan);' but I don't know how
this can stored.

Yves, is there type of node - or a way to extend the current nodes - such
that a 32-bit capture index can stored as part of each of these nodes
at compile time? Without breaking everything?

I know there are only 8 bits to distinguish them, but perhaps this is a
situation where adding new regnode types would be worthwhile? Am I right
in thinking that a 32-bit capture index and the WHILEM pointer could be
stored that way?

The node just needs to store the capture index, while the runtime code
in S_regmatch needs to keep track of the last seen WHILEM node, presumably
in a local var (that may need saving and restoring).

Alternatively, if "ARG(scan)" or another mechanism could be used to save
even an 8-bit or 16-bit index value, it would probably cover most use
cases, and the max value (255 or 65535) could be used by default if the
index doesn't fit. I'm sure the vast majority of regexes have 255 or fewer
captures in the first place, and in the rare case where a regex has
thousands of captures and the proper index won't fit, this would only cause
an extra unnecessary performance for those rare cases, and the correctness
of the result should not be affected.

With the current node types used by the various branch ops, there is
no spare space to store even an 8-bit index for the TRIE/TRIEC nodes,
since they make use of the 8-but flags field. (That field appears to be
unused in BRANCH/BRANCHJ).

Which is why I asked Yves about the best way to extend such nodes.
The naive way would be to use regnode_1 for BRANCH which
adds an extra 32-bit field, and 'upgrade' BRANCHJ from regnode_1 to
regnode_2L; But the TRIE nodes are more complex than that and I don't
think I could safely add an extra 32-bit field without breaking things.

--
"Strange women lying in ponds distributing swords is no basis for a system
of government. Supreme executive power derives from a mandate from the
masses, not from some farcical aquatic ceremony."
  -- Dennis, "Monty Python and the Holy Grail"

@p5pRT
Copy link
Author

p5pRT commented Oct 17, 2018

From @deven

On Wed, Oct 17, 2018 at 7​:02 AM Dave Mitchell <davem@​iabyn.com> wrote​:

On Tue, Oct 16, 2018 at 03​:18​:07PM -0400, Deven T. Corzine wrote​:

My initial patch was only focused on correctness, not performance.
Stipulating that the performance was unacceptably slow, could you tell me
if my original patch was clean and valid from a correctness perspective,
or
was I doing something wrong/ugly/unvalid in the patch?

Since nearly two months have passed since I worked on this, the details
have ebbed from my mind so I can't remember whether your approach was
merely slow or had issues.

I had already spent probably 20 hours or more tracking down this bug while
I was at YAPC. After leaving the conference, I spent at least 4-6 more
hours very carefully studying and tracing the code, including running test
cases with debugging output many times, adding extra debugging output of my
own to trace key variables through all the state changes during a regex
match of my test case.

After all that research and testing, I felt confident that I finally
understood regcppush(), REGCP_SET(), REGCP_UNWIND(), UNWIND_PARENT() and
regcp_restore() pretty well, so I started developing this patch. I spent
about an hour writing the patch and 2-3 hours debugging it, because I
initially overlooked something simple that should have been obvious. I
forget what it was, but it was something dumb that I should have noticed
right away when the regression tests failed -- when I finally figured out
what I did wrong, I fixed the code to be what I actually _thought_ that I
had typed in the first place! (That's why it took me hours to notice it.)

Ultimately, the changes were fairly straightforward, considering. I'll
describe the changes I made, in chronological order. (The hunks in the
patch are in a different order.)

First, I added "CHECKPOINT lastcp;" to the "regmatch_state,u" union in the
"branchlike", "branch" and "trie" structs, just after "CHECKPOINT cp;".

Next, I used regcppush() to save the captures near the beginning of "case
BRANCH"​:

@​@​ -8043,7 +8042,10 @​@​ NULL
  ST.lastparen = rex->lastparen;
  ST.lastcloseparen = rex->lastcloseparen;
  ST.next_branch = next;
- REGCP_SET(ST.cp);
+
+ /* save state of captures at branch start */
+ ST.cp = regcppush(rex, 0, maxopenparen);
+ REGCP_SET(ST.lastcp);

  /* Now go into the branch */
  if (has_cutgroup) {

From looking at the existing code, this appeared the right way to save all
the captures. I realized that saving ALL the captures was overkill and
very slow, of course. I was initially only attempting to get the correct
captures back from my test case without breaking any regression tests, and
this patch did succeed at doing that. (I was still studying the code to
try to figure out a clean way to set a paren floor like CURLYX does.) I
didn't see any point in keeping the original REGCP_SET() call here.

I did consider whether "case BRANCHJ" might need any separate changes, but
"case BRANCHJ" falls through to "case BRANCH", so this change seems to
cover it.

Next, I used regcp_restore() to restore the saved captures in "case
BRANCH_next_fail"​:

@​@​ -8077,8 +8079,10 @​@​ NULL
  do_cutgroup = 0;
  no_final = 0;
  }
- REGCP_UNWIND(ST.cp);
- UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
+
+ /* restore state of captures from branch start */
+ regcp_restore(rex, ST.lastcp, &maxopenparen);
+
  scan = ST.next_branch;
  /* no more branches? */
  if (!scan || (OP(scan) != BRANCH && OP(scan) != BRANCHJ)) {

From studying the existing code, regcp_restore() appeared to be the right
way to restore the saved captures without discarding them. The original
REGCP_UNWIND() and UNWIND_PAREN() calls were clearly intended to undo the
captures from the failed alternation, but that approach seems to be
unworkable for these edge cases involving captures inside repeated
alternations. I suspect UNWIND_PAREN() might always have such edge cases,
but I haven't investigated that hunch.

Finally, I mirrored both of these changes in "case TRIE" and "case
TRIE_next_fail", for trie-optimized alternations​:

@​@​ -5920,6 +5920,10 @​@​ S_regmatch(pTHX_ regmatch_info *reginfo, char
*startpos, regnode *prog)
  HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG( scan ) + 1 ]);
  U32 state = trie->startstate;

+ /* save state of captures at branch start */
+ ST.cp = regcppush(rex, 0, maxopenparen);
+ REGCP_SET(ST.lastcp);
+
  if (scan->flags == EXACTL || scan->flags == EXACTFLU8) {
  _CHECK_AND_WARN_PROBLEMATIC_LOCALE;
  if (utf8_target
@​@​ -6067,15 +6071,10 @​@​ S_regmatch(pTHX_ regmatch_info *reginfo, char
*startpos, regnode *prog)
  case TRIE_next_fail​: /* we failed - try next alternative */
  {
  U8 *uc;
- if ( ST.jump ) {
- /* undo any captures done in the tail part of a branch,
- * e.g.
- * /(?​:X(.)(.)|Y(.)).../
- * where the trie just matches X then calls out to do the
- * rest of the branch */
- REGCP_UNWIND(ST.cp);
- UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
- }
+
+ /* restore state of captures from branch start */
+ regcp_restore(rex, ST.lastcp, &maxopenparen);
+
  if (!--ST.accepted) {
  DEBUG_EXECUTE_r({
  Perl_re_exec_indentf( aTHX_ "%sTRIE failed...%s\n",

I didn't analyze ST.jump as fully as regcppush() and friends, so I was a
bit unsure about removing this ST.jump code, but the regcp_restore() seemed
to make it redundant. Since all the regression tests passed, I included
this change in the patch. Was it correct to remove that code?

Come to think of it, perhaps I should have also removed this other ST.jump
block near the beginning of "case trie_first_try"?

  if ( ST.jump ) {
  ST.lastparen = rex->lastparen;
  ST.lastcloseparen = rex->lastcloseparen;
  REGCP_SET(ST.cp);
  }

Looking at it again, this seems redundant too, but I didn't think to try
removing it when I developed the initial patch.

I'm not sure why the TRIE case was only dealing with the captures when
ST.jump is set. Is it possible that captures might need to be restored
when ST.jump isn't set?

The node just needs to store the capture index, while the runtime code

in S_regmatch needs to keep track of the last seen WHILEM node, presumably
in a local var (that may need saving and restoring).

That makes sense. Mind if I take a stab at that, if I get a chance?

With the current node types used by the various branch ops, there is
no spare space to store even an 8-bit index for the TRIE/TRIEC nodes,
since they make use of the 8-but flags field. (That field appears to be
unused in BRANCH/BRANCHJ).

Yeah, I came to the same conclusions about the flags too.

Which is why I asked Yves about the best way to extend such nodes.
The naive way would be to use regnode_1 for BRANCH which
adds an extra 32-bit field, and 'upgrade' BRANCHJ from regnode_1 to
regnode_2L; But the TRIE nodes are more complex than that and I don't
think I could safely add an extra 32-bit field without breaking things.

This sounds like a viable option. Why do you call it naive?

The TRIE nodes are more complex, but is there a particular reason you're
wary of breaking things by adding a field to it?

If the flags can't be used, I'm thinking that new regnode types make sense
here. That way, the regex compiler could trigger the save/restore
functionality only when needed, provide the correct paren floor and avoid
increasing the size of compiled regexes that don't need to restore captures
like this. It would take more work to implement, of course, and use up a
few of the unused 8-bit regnode slots, but it seems like it might be the
most effective way to optimize this. What do you and Yves think about this
idea?

Thanks again for taking the time to work on this bug. (I hope I'm not
wasting too much of your time on the conversation!)

Deven

@p5pRT
Copy link
Author

p5pRT commented Oct 20, 2018

From @demerphq

I'm sorry I have missed some of this discussion over the past while...

On Wed, 17 Oct 2018, 16​:29 Deven T. Corzine, <deven@​ties.org> wrote​:

On Wed, Oct 17, 2018 at 7​:02 AM Dave Mitchell <davem@​iabyn.com> wrote​:

On Tue, Oct 16, 2018 at 03​:18​:07PM -0400, Deven T. Corzine wrote​:

My initial patch was only focused on correctness, not performance.
Stipulating that the performance was unacceptably slow, could you tell
me
if my original patch was clean and valid from a correctness
perspective, or
was I doing something wrong/ugly/unvalid in the patch?

Since nearly two months have passed since I worked on this, the details
have ebbed from my mind so I can't remember whether your approach was
merely slow or had issues.

I had already spent probably 20 hours or more tracking down this bug while
I was at YAPC. After leaving the conference, I spent at least 4-6 more
hours very carefully studying and tracing the code, including running test
cases with debugging output many times, adding extra debugging output of my
own to trace key variables through all the state changes during a regex
match of my test case.

After all that research and testing, I felt confident that I finally
understood regcppush(), REGCP_SET(), REGCP_UNWIND(), UNWIND_PARENT() and
regcp_restore() pretty well, so I started developing this patch. I spent
about an hour writing the patch and 2-3 hours debugging it, because I
initially overlooked something simple that should have been obvious. I
forget what it was, but it was something dumb that I should have noticed
right away when the regression tests failed -- when I finally figured out
what I did wrong, I fixed the code to be what I actually _thought_ that I
had typed in the first place! (That's why it took me hours to notice it.)

Ultimately, the changes were fairly straightforward, considering. I'll
describe the changes I made, in chronological order. (The hunks in the
patch are in a different order.)

First, I added "CHECKPOINT lastcp;" to the "regmatch_state,u" union in the
"branchlike", "branch" and "trie" structs, just after "CHECKPOINT cp;".

Next, I used regcppush() to save the captures near the beginning of "case
BRANCH"​:

@​@​ -8043,7 +8042,10 @​@​ NULL
ST.lastparen = rex->lastparen;
ST.lastcloseparen = rex->lastcloseparen;
ST.next_branch = next;
- REGCP_SET(ST.cp);
+
+ /* save state of captures at branch start */
+ ST.cp = regcppush(rex, 0, maxopenparen);
+ REGCP_SET(ST.lastcp);

  /\* Now go into the branch \*/
  if \(has\_cutgroup\) \{

From looking at the existing code, this appeared the right way to save all
the captures. I realized that saving ALL the captures was overkill and
very slow, of course. I was initially only attempting to get the correct
captures back from my test case without breaking any regression tests, and
this patch did succeed at doing that. (I was still studying the code to
try to figure out a clean way to set a paren floor like CURLYX does.) I
didn't see any point in keeping the original REGCP_SET() call here.

I did consider whether "case BRANCHJ" might need any separate changes, but
"case BRANCHJ" falls through to "case BRANCH", so this change seems to
cover it.

Next, I used regcp_restore() to restore the saved captures in "case
BRANCH_next_fail"​:

@​@​ -8077,8 +8079,10 @​@​ NULL
do_cutgroup = 0;
no_final = 0;
}
- REGCP_UNWIND(ST.cp);
- UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
+
+ /* restore state of captures from branch start */
+ regcp_restore(rex, ST.lastcp, &maxopenparen);
+
scan = ST.next_branch;
/* no more branches? */
if (!scan || (OP(scan) != BRANCH && OP(scan) != BRANCHJ)) {

From studying the existing code, regcp_restore() appeared to be the right
way to restore the saved captures without discarding them. The original
REGCP_UNWIND() and UNWIND_PAREN() calls were clearly intended to undo the
captures from the failed alternation, but that approach seems to be
unworkable for these edge cases involving captures inside repeated
alternations. I suspect UNWIND_PAREN() might always have such edge cases,
but I haven't investigated that hunch.

Finally, I mirrored both of these changes in "case TRIE" and "case
TRIE_next_fail", for trie-optimized alternations​:

@​@​ -5920,6 +5920,10 @​@​ S_regmatch(pTHX_ regmatch_info *reginfo, char
*startpos, regnode *prog)
HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG( scan ) + 1 ]);
U32 state = trie->startstate;

+ /* save state of captures at branch start */
+ ST.cp = regcppush(rex, 0, maxopenparen);
+ REGCP_SET(ST.lastcp);
+
if (scan->flags == EXACTL || scan->flags == EXACTFLU8) {
_CHECK_AND_WARN_PROBLEMATIC_LOCALE;
if (utf8_target
@​@​ -6067,15 +6071,10 @​@​ S_regmatch(pTHX_ regmatch_info *reginfo, char
*startpos, regnode *prog)
case TRIE_next_fail​: /* we failed - try next alternative */
{
U8 *uc;
- if ( ST.jump ) {
- /* undo any captures done in the tail part of a branch,
- * e.g.
- * /(?​:X(.)(.)|Y(.)).../
- * where the trie just matches X then calls out to do the
- * rest of the branch */
- REGCP_UNWIND(ST.cp);
- UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
- }
+
+ /* restore state of captures from branch start */
+ regcp_restore(rex, ST.lastcp, &maxopenparen);
+
if (!--ST.accepted) {
DEBUG_EXECUTE_r({
Perl_re_exec_indentf( aTHX_ "%sTRIE failed...%s\n",

I didn't analyze ST.jump as fully as regcppush() and friends, so I was a
bit unsure about removing this ST.jump code, but the regcp_restore() seemed
to make it redundant. Since all the regression tests passed, I included
this change in the patch. Was it correct to remove that code?

Come to think of it, perhaps I should have also removed this other ST.jump
block near the beginning of "case trie_first_try"?

        if \( ST\.jump \) \{
            ST\.lastparen = rex\->lastparen;
            ST\.lastcloseparen = rex\->lastcloseparen;
            REGCP\_SET\(ST\.cp\);
        \}

Looking at it again, this seems redundant too, but I didn't think to try
removing it when I developed the initial patch.

I'm not sure why the TRIE case was only dealing with the captures when
ST.jump is set. Is it possible that captures might need to be restored
when ST.jump isn't set?

A non jump trie node is a Boolean test, meaning it shouldn't touch any
capture buffers (except maybe if it's successful?) Whereas a jump trie is a
Boolean test followed by arbitrary pattern, so it might change the capture
buffer state.

Consider

/(Foo(baz)|wh(oops))(zop)/

Versus

/(Foobaz|whoops)(zop)/

The first should be a jump trie, the second a non-jump true.

The node just needs to store the capture index, while the runtime code

in S_regmatch needs to keep track of the last seen WHILEM node, presumably
in a local var (that may need saving and restoring).

That makes sense. Mind if I take a stab at that, if I get a chance?

With the current node types used by the various branch ops, there is
no spare space to store even an 8-bit index for the TRIE/TRIEC nodes,
since they make use of the 8-but flags field. (That field appears to be
unused in BRANCH/BRANCHJ).

Yeah, I came to the same conclusions about the flags too.

Which is why I asked Yves about the best way to extend such nodes.
The naive way would be to use regnode_1 for BRANCH which
adds an extra 32-bit field, and 'upgrade' BRANCHJ from regnode_1 to
regnode_2L; But the TRIE nodes are more complex than that and I don't
think I could safely add an extra 32-bit field without breaking things.

This sounds like a viable option. Why do you call it naive?

The TRIE nodes are more complex, but is there a particular reason you're
wary of breaking things by adding a field to it?

I imagine because it is a product of peephole optimisation and thus is
sometimes written directly over the data it replaces. Trie nodes actually
have several forms and imagine changing the size could be troublesome but
it should be possible.

If the flags can't be used, I'm thinking that new regnode types make sense
here. That way, the regex compiler could trigger the save/restore
functionality only when needed, provide the correct paren floor and avoid
increasing the size of compiled regexes that don't need to restore captures
like this. It would take more work to implement, of course, and use up a
few of the unused 8-bit regnode slots, but it seems like it might be the
most effective way to optimize this. What do you and Yves think about this
idea?

I will try to find time to think about this but make no promises.

Cheers
Yves

Thanks again for taking the time to work on this bug. (I hope I'm not
wasting too much of your time on the conversation!)

Deven

Its not a waste of time at all,, it's just hard to find time to complement
your efforts

@p5pRT
Copy link
Author

p5pRT commented Oct 21, 2018

From @khwilliamson

On 10/20/18 8​:04 AM, demerphq wrote​:

If the flags can't be used\, I'm thinking that new regnode types make
sense here\.  That way\, the regex compiler could trigger the
save/restore functionality only when needed\, provide the correct
paren floor and avoid increasing the size of compiled regexes that
don't need to restore captures like this\.  It would take more work
to implement\, of course\, and use up a few of the unused 8\-bit
regnode slots\, but it seems like it might be the most effective way
to optimize this\.  What do you and Yves think about this idea?

I don't know much about this particular code area, but I can say that we
have enough regnode slots available that we don't need to worry about
conserving them.

I will try to find time to think about this but make no promises.

Cheers
Yves

Thanks again for taking the time to work on this bug\.  \(I hope I'm
not wasting too much of your time on the conversation\!\)

Deven

Its not a waste of time at all,, it's just hard to find time to
complement your efforts

And, it's good that you've learned about this code. Perhaps you'll
consider tackling something else at some point. And maybe you could
contribute some of the knowledge you've gain to commenting the source to
make it easier for others who come along later.

@deven
Copy link

deven commented Sep 13, 2021

I haven't worked on this in a long time, but I stumbled across the following archive:

http://mirrors.develooper.com/perl/really-ancient-perls/oldperl/dist/fun.tar.gz

Inside this archive, there are two nested archives containing ancient versions of Perl:

-rw-r--r-- jhi/ftp     1000058 1994-08-26 13:21 src/5.0/5.000alpha/perl5.000a12h.tar.gz
-rwxr-xr-x jhi/ftp      924479 1994-10-07 17:58 src/5.0/5.000beta/perl5.000b3h.tar.gz

These particular versions are significant because the main perl5 repository unfortunately had to jump straight from perl-5.000alpha9 to perl-5.000 in a massive single commit, skipping the 31 intervening versions of Perl listed on the perlhist page between 5.000alpha9 and the final 5.000 release (which were unavailable at the time), and the two versions above are both in that range of missing versions.

I've added commits to my clone of the perl5 repository for both of these two versions:

I also ran the following command on my local repository, to graft a replacement commit for the original perl-5.000 commit, but it appears to be impossible to actually make a pull request for replacement graft commits like this.

git replace --graft a0d0e21ea6ea90a22318550944fe6cb09ae10cda 8be46f3624cc1174b59b786aea4676c88c443dec

Since this issue began somewhere in this range of 31 missing versions, I built perl-5.000alpha9 and reconfirmed that it does not exhibit the issue:

$ perl-5.000alpha9/perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'
a

After this, I built perl-5.000a12h and learned that it does exhibit the issue:

$ perl-5.000alpha12h/perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'
b

This narrows the range where the issue began down to one of the 15 versions from 5.000alpha10 through 5.000alpha12h...

@deven
Copy link

deven commented Sep 13, 2021

There were a lot of changes to the regex engine between 5.000alpha9 and 5.000alpha12h, including:

  • Multi-line and single-line modifiers /m and /s
  • Non-greedy matching quantifier ?
  • Global matching modifier /g
  • Positive and negative lookahead assertions (?=...) and (?!...)
  • Refactored handling of simple vs. complex regular expressions

@deven
Copy link

deven commented Sep 13, 2021

Overall, between regexp.h, regcomp.h, regcomp.c and regexec.c, there were 408 lines deleted and 647 lines added.

@demerphq
Copy link
Collaborator

I guess this thread got stalled?

Dave Mitchel asked:

"I know a way to make it much faster, but it involves, for every
BRANCH/BRANCHJ/TRIE/TRIEC node, knowing the current capture index - i.e.
in the same way that OPEN nodes have 'n = ARG(scan);' but I don't know how
this can stored.

Yves, is there type of node - or a way to extend the current nodes - such
that a 32-bit capture index can stored as part of each of these nodes
at compile time? Without breaking everything?"

Not really unfortunately. In theory you can change the structures that are used to implement BRANCH, BRANCHJ, TRIE, TRIEC and run make regen. In practice this can break a lot, unfortunately a bunch of code in regex compiler makes unhealthy assumptions about the size of regops. I just tried a simple fix of changing BRANCH to be regnode_1 and BRANCHJ to be regnode_2L, and it broke a ton of stuff. TRIE/TRIEC are easier to deal with likely.

I have put this on my todo list, my time availability will be limited in the near future however.

Yves

@demerphq
Copy link
Collaborator

demerphq commented Apr 2, 2022

Hi, wanted to give an update. I have been working on a patch that makes it much easier to resize regnodes, and I am implementing the change requested in this ticket. I will follow up with a pr next week.

@demerphq
Copy link
Collaborator

demerphq commented Apr 7, 2022

Hi @iabyn

I now have a branch which stores the number of capture buffers were opened before it, but reading your request again I am not sure I understood what you wanted properly. The branch is yves/regnode_typedefs which is not an ideal name, but there is a bunch of stuff in the branch and some of it relates to this. The branch is messy as heck and needs some rebase -i reorganization, squashing, and etc, but it does work, and it adds support for detecting when changing the size of a regnode would break things. It changes the type of a BRANCH regop from struct regnode to struct regnode_1, and the BRANCHJ regop from struct regnode_1 to struct regnode_2L. I could not change the size of the TRIE regops as it would mean that we would also have to make all of the EXACT regops larger. Instead I have added a new member to the struct _reg_trie_data, which is stored in the data-array and accessible that way, and because of this it doesn't make sense to make TRIEC nodes larger, they can just use the same infra as the TRIE ones would.

In the below output you can see the new data as (buf: 2) and similar.

$ ./perl -Ilib -Mre=debug -e'/()()(?:[fF]oo()|[bB]ar()|[bB]az())([xX]|[yY])/'
Compiling REx "()()(?:[fF]oo()|[bB]ar()|[bB]az())([xX]|[yY])"
Final program:
   1: OPEN1 (4)
   3:   NOTHING (4)
   4: CLOSE1 (6)
   6: OPEN2 (9)
   8:   NOTHING (9)
   9: CLOSE2 (11)
  11: BRANCH (buf:2) (22)
  13:   ANYOFM[Ff] (15)
  15:   EXACT <oo> (17)
  17:   OPEN3 (20)
  19:     NOTHING (20)
  20:   CLOSE3 (45)
  22: BRANCH (buf:3) (33)
  24:   ANYOFM[Bb] (26)
  26:   EXACT <ar> (28)
  28:   OPEN4 (31)
  30:     NOTHING (31)
  31:   CLOSE4 (45)
  33: BRANCH (buf:4) (FAIL)
  35:   ANYOFM[Bb] (37)
  37:   EXACT <az> (39)
  39:   OPEN5 (42)
  41:     NOTHING (42)
  42:   CLOSE5 (45)
  44: TAIL (45)
  45: OPEN6 (47)
  47:   BRANCH (buf:6) (51)
  49:     ANYOFM[Xx] (55)
  51:   BRANCH (buf:6) (FAIL)
  53:     ANYOFM[Yy] (55)
  55: CLOSE6 (57)
  57: END (0)
minlen 4 
String shorter than min possible regex match (0 < 4)
Freeing REx: "()()(?:[fF]oo()|[bB]ar()|[bB]az())([xX]|[yY])"

and also here:

./perl -Ilib -Mre=debug -e'/()()(?:foo()|bar()|baz())(x|y)/'
Compiling REx "()()(?:foo()|bar()|baz())(x|y)"
Final program:
   1: OPEN1 (4)
   3:   NOTHING (4)
   4: CLOSE1 (6)
   6: OPEN2 (9)
   8:   NOTHING (9)
   9: CLOSE2 (11)
  11: TRIE-EXACT[bf] (buf:2) (39)
      <foo> (15)
  15:   OPEN3 (18)
  17:     NOTHING (18)
  18:   CLOSE3 (39)
      <bar> (24)
  24:   OPEN4 (27)
  26:     NOTHING (27)
  27:   CLOSE4 (39)
      <baz> (33)
  33:   OPEN5 (36)
  35:     NOTHING (36)
  36:   CLOSE5 (39)
  39: OPEN6 (41)
  41:   TRIE-EXACT[xy] (buf:6) (49)
        <x> 
        <y> 
  49: CLOSE6 (51)
  51: END (0)
minlen 4 
String shorter than min possible regex match (0 < 4)
Freeing REx: "()()(?:foo()|bar()|baz())(x|y)"

As I said I am not entirely sure I am storing what you expect. For instance what should /(foo|bar()|baz()()|bop)/ show for the branches? MY current implementation shows this:

$ ./perl -Ilib -Mre=debug -e'/([fF]oo|[bB]ar()|[Bb]az()()|[bB]op)/'
Compiling REx "([fF]oo|[bB]ar()|[Bb]az()()|[bB]op)"
Final program:
   1: OPEN1 (3)
   3:   BRANCH (buf:1) (9)
   5:     ANYOFM[Ff] (7)
   7:     EXACT <oo> (42)
   9:   BRANCH (buf:1) (20)
  11:     ANYOFM[Bb] (13)
  13:     EXACT <ar> (15)
  15:     OPEN2 (18)
  17:       NOTHING (18)
  18:     CLOSE2 (42)
  20:   BRANCH (buf:2) (36)
  22:     ANYOFM[Bb] (24)
  24:     EXACT <az> (26)
  26:     OPEN3 (29)
  28:       NOTHING (29)
  29:     CLOSE3 (31)
  31:     OPEN4 (34)
  33:       NOTHING (34)
  34:     CLOSE4 (42)
  36:   BRANCH (buf:4) (FAIL)
  38:     ANYOFM[Bb] (40)
  40:     EXACT <op> (42)
  42: CLOSE1 (44)
  44: END (0)
minlen 3 
String shorter than min possible regex match (0 < 3)
Freeing REx: "([fF]oo|[bB]ar()|[Bb]az()()|[bB]op)"

FWIW, I am highly tempted to change the debug output so that the "next regop" is shown as -> \d and not (\d) as it would be nice to use the parens for something more useful.

@khwilliamson and @deven you may find this interesting also.

@demerphq
Copy link
Collaborator

demerphq commented Apr 7, 2022

BTW, @iabyn the smoke-me/davem/captures branch was deleted, I have rebased it and I will repush it to the repo to save you from doing it yourself.

@deven
Copy link

deven commented Apr 7, 2022

@demerphq For whatever my opinion may be worth, I think that's an excellent idea, to display the "next regop" with -> instead of parens. Apart from freeing up the parens for other purposes, it would be much more intuitive in the first place!

@deven
Copy link

deven commented Apr 7, 2022

I had hoped to take a stab at implementing a production-ready fix for this bug, which I knew my example patch was not, even though it did fix the bug. But after the work you guys have done, which I haven't even absorbed yet, I'm not sure there's much left for me to help with on the implementation now?

@demerphq
Copy link
Collaborator

demerphq commented Apr 8, 2022

@iabyn I have repushed your branch as smoke-me/davem/captures_rebased, it is rebased on top of blead from a few minutes ago. I will start cleaning up my branch to enable tracking which parens we are in.

@demerphq
Copy link
Collaborator

demerphq commented Apr 8, 2022

@deven, I mostly wanted you to know your work on this was not for nothing and that it was back in play. One way or another I plan to get either your or daves or daves improved patch in for 5.38, we are too close to 5.36 to risk it, but I do want to get this fixed! Also I figured you might have something to add to the discussion as you did all that work understanding how things work.

@iabyn
Copy link
Contributor

iabyn commented Apr 12, 2022 via email

@demerphq
Copy link
Collaborator

I will start looking into this soon again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants