New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ancient Regex Regression #16616
Comments
From @devenCreated by @devenI discovered a bug in Perl's regular expression engine a few I even ended up doing a lightning talk about the bug: https://www.youtube.com/watch?v=U-JhPIECkPY This was my test case, which works with or without anchors: "afoobar" =~ /((.)foo|bar)*/ Or, as a standalone command: perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;' This prints "b", even though "bfoo" never appears in "afoobar"! I understand why this is happening -- the inner group does match The correct answer seems to be "a", since that's the last match Perl 5.000 (from 1994) is the first commit in the git repository Even though I have never worked on the Perl core, and I've been I've already managed to create a working patch that fixes this Yves, SawyerX thought you might be willing to mentor me on this? My solution involves saving the captures with regcppush() on Perl Info
|
From @jkeenanOn Mon, 09 Jul 2018 04:04:32 GMT, dcorzine wrote:
Please submit the patch as an attachment. Even if it's not complete, it gives us something to run through the test suite and a starting point for discussion.
Thank you very much. -- |
The RT System itself - Status changed from 'new' to 'open' |
From @devenOn Tue, Jul 10, 2018 at 1:09 PM, James E Keenan via RT <
[...]
Fair enough. I'll do that now. Deven |
1 similar comment
From @devenOn Tue, Jul 10, 2018 at 1:09 PM, James E Keenan via RT <
[...]
Fair enough. I'll do that now. Deven |
From @hvdsOn Sun, 08 Jul 2018 21:04:32 -0700, dcorzine wrote:
It's by no means clear to me that that must be the correct answer; the other candidate worth considering is that, since in /(...)*/ we return the match for the last iteration of the group, you'd expect further captures embedded within there also to deliver the version from the last iteration of the group. Under that interpretation, the correct answer would be undef, and the earlier releases that returned 'a' were merely more subtly wrong than the current release. The fact that other examples don't match such an interpretation argues against it: Hugo |
From @devenThis is a first pass at a patch to fix RT #133352, to save and restore perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;' Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command, while This patch appears to work (making the test case above print "a" again), * I haven't finished analyzing the code to check for edge cases that * I haven't developed any new regression tests to exercise this bug and * This patch is currently saving ALL captures, including captures that * I believe that setting a paren floor (which CURLYX does) would likely * Another possibility might be to create new regnode types to use only * Regardless, I would like feedback from an expert on this part of the regexec.c | 28 ++++++++++++++++------------ |
From @devenperl-133136-test1.patchFrom 1a53f4dc4ab5032ea9cd0bf254adc0902695daa4 Mon Sep 17 00:00:00 2001
From: "Deven T. Corzine" <deven@ties.org>
Date: Sat, 23 Jun 2018 23:17:03 -0400
Subject: [PATCH] Save and restore captures for BRANCH/TRIE failures
This is a first pass at a patch to fix RT #133352, to save and restore
all captures during alternations (BRANCH and TRIE) to avoid incorrectly
returning a capture from a failed branch, as happens with the following
test case, which has been broken ever since Perl 5.000 in 1994:
perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'
Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command, while
Perl 5.000 through today's blead all incorrectly print "b" instead.
This patch appears to work (making the test case above print "a" again),
and I am submitting it as a preliminary patch for discussion purposes;
I don't consider it ready to apply yet, for a number of reasons:
* I haven't finished analyzing the code to check for edge cases that
I may have overlooked with this patch. Since TRIE_next_fail included
a test for ST.jump which would call UNWIND_PAREN(), I want to better
understand how that was working and determine whether ST.jump being
set might have any implications for this patch.
* I haven't developed any new regression tests to exercise this bug and
the patch to fix it, and I know that both BRANCH and TRIE variations
need to be tested.
* This patch is currently saving ALL captures, including captures that
can't possibly change inside the alternation. This is an easier fix,
but I believe the performance impact should be mitigated, since this
would impact many everyday regular expressions which aren't affected
by this bug in the first place.
* I believe that setting a paren floor (which CURLYX does) would likely
mitigate most (but not all) of the performance impact of this fix,
but I haven't yet found a clean way to implement this.
* Another possibility might be to create new regnode types to use only
for alternations containing capture groups, maybe only for ones which
also have a quantifier applied, since I believe both are required to
trigger this bug. (This might have to be treated as an optimization
in regcomp.c to use the old logic for alternations that can't trigger
the bug, but it might be hard to tell if a quantifier is applied for
qr// cases.)
* Regardless, I would like feedback from an expert on this part of the
regular expression engine (maybe Yves?) about this patch and whether
the approach I've taken is on the right track.
---
regexec.c | 28 ++++++++++++++++------------
regexp.h | 3 +++
2 files changed, 19 insertions(+), 12 deletions(-)
diff --git a/regexec.c b/regexec.c
index ba52ae9..1d42b0d 100644
--- a/regexec.c
+++ b/regexec.c
@@ -5920,6 +5920,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG( scan ) + 1 ]);
U32 state = trie->startstate;
+ /* save state of captures at branch start */
+ ST.cp = regcppush(rex, 0, maxopenparen);
+ REGCP_SET(ST.lastcp);
+
if (scan->flags == EXACTL || scan->flags == EXACTFLU8) {
_CHECK_AND_WARN_PROBLEMATIC_LOCALE;
if (utf8_target
@@ -6067,15 +6071,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
case TRIE_next_fail: /* we failed - try next alternative */
{
U8 *uc;
- if ( ST.jump ) {
- /* undo any captures done in the tail part of a branch,
- * e.g.
- * /(?:X(.)(.)|Y(.)).../
- * where the trie just matches X then calls out to do the
- * rest of the branch */
- REGCP_UNWIND(ST.cp);
- UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
- }
+
+ /* restore state of captures from branch start */
+ regcp_restore(rex, ST.lastcp, &maxopenparen);
+
if (!--ST.accepted) {
DEBUG_EXECUTE_r({
Perl_re_exec_indentf( aTHX_ "%sTRIE failed...%s\n",
@@ -8043,7 +8042,10 @@ NULL
ST.lastparen = rex->lastparen;
ST.lastcloseparen = rex->lastcloseparen;
ST.next_branch = next;
- REGCP_SET(ST.cp);
+
+ /* save state of captures at branch start */
+ ST.cp = regcppush(rex, 0, maxopenparen);
+ REGCP_SET(ST.lastcp);
/* Now go into the branch */
if (has_cutgroup) {
@@ -8077,8 +8079,10 @@ NULL
do_cutgroup = 0;
no_final = 0;
}
- REGCP_UNWIND(ST.cp);
- UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
+
+ /* restore state of captures from branch start */
+ regcp_restore(rex, ST.lastcp, &maxopenparen);
+
scan = ST.next_branch;
/* no more branches? */
if (!scan || (OP(scan) != BRANCH && OP(scan) != BRANCHJ)) {
diff --git a/regexp.h b/regexp.h
index 44409f0..aa61c97 100644
--- a/regexp.h
+++ b/regexp.h
@@ -754,6 +754,7 @@ typedef struct regmatch_state {
U32 lastparen;
U32 lastcloseparen;
CHECKPOINT cp;
+ CHECKPOINT lastcp;
} branchlike;
@@ -763,6 +764,7 @@ typedef struct regmatch_state {
U32 lastparen;
U32 lastcloseparen;
CHECKPOINT cp;
+ CHECKPOINT lastcp;
regnode *next_branch; /* next branch node */
} branch;
@@ -773,6 +775,7 @@ typedef struct regmatch_state {
U32 lastparen;
U32 lastcloseparen;
CHECKPOINT cp;
+ CHECKPOINT lastcp;
U32 accepted; /* how many accepting states left */
bool longfold;/* saw a fold with a 1->n char mapping */
--
2.7.4
|
From @devenOn Tue, 10 Jul 2018 12:19:30 -0700, hv wrote:
Yeah, that's why I said the correct "seems" to be "a". There's a decent argument for returning undef, and it's certainly counter-intuitive to some degree to have $1="bar" from the second iteration and $2="a" from the first iteration, but the inner group does successfully match during the first iteration only, so "a" is indeed the last successful match. I see two different viewpoints here, and both are quite reasonable. One viewpoint is that these are nested groups and both should be returning results from the same iteration, and therefore $2 should be undef because the nested match doesn't match anything on that iteration. Intuitively, this feels right. The other viewpoint is that each group can match multiple times and we only get to keep one capture per group, so $2 should be "a" since that's the last successful match for that group, despite the confusion of matching $1 and $2 from different iterations. Personally, the $2="a" viewpoint seems like a stronger argument to me, but I could be in the minority in thinking that. I like the $2=undef viewpoint too, but we can't have both. In terms of how regular expressions are defined and documented, I'm having trouble rationalizing $2=undef even though it sounds good. Is there anything definitive in the documentation that would resolve the question without ambiguity? I haven't found it yet, if there is. For what it's worth, other regular expression engines like PCRE, RE2, GNU and others all return "a", which seems to suggest there may be some sort of consensus that "a" is the correct answer, but maybe it's just a gray area with no definite right answer?
That example is certainly returning $2 and $3 from different loop iterations, and it's less clear in this case that returning $2=undef would somehow be preferable or more intuitive for this example. Is that what you were getting at? Deven |
From @deven[I originally entered this via RT, but I didn't see it come though On Tue, 10 Jul 2018 12:19:30 -0700, hv wrote:
Yeah, that's why I said the correct "seems" to be "a". There's a I see two different viewpoints here, and both are quite reasonable. Personally, the $2="a" viewpoint seems like a stronger argument to me, Is there anything definitive in the documentation that would resolve For what it's worth, other regular expression engines like PCRE, RE2,
That example is certainly returning $2 and $3 from different loop Deven |
From @devenHas anyone taken a look at this patch yet? I'd like to hear what Deven On Tue, Jul 10, 2018 at 3:19 PM, Deven T. Corzine via RT
|
1 similar comment
From @devenHas anyone taken a look at this patch yet? I'd like to hear what Deven On Tue, Jul 10, 2018 at 3:19 PM, Deven T. Corzine via RT
|
From @iabynOn Sun, Jul 15, 2018 at 03:52:30PM -0400, Deven T. Corzine wrote:
I plan to look at it at some point, but I hate and fear the capture -- |
1 similar comment
From @iabynOn Sun, Jul 15, 2018 at 03:52:30PM -0400, Deven T. Corzine wrote:
I plan to look at it at some point, but I hate and fear the capture -- |
From @devenOn Mon, Jul 16, 2018 at 4:13 AM, Dave Mitchell <davem@iabyn.com> wrote:
I look forward to hearing your feedback, thanks Dave! That's The patch still needs work -- right now it's saving and restoring ALL My initial inclination was try setting a paren floor (which CURLYX When a regmatch_state is allocated, is the memory cleared or uninitialized? Deven |
1 similar comment
From @devenOn Mon, Jul 16, 2018 at 4:13 AM, Dave Mitchell <davem@iabyn.com> wrote:
I look forward to hearing your feedback, thanks Dave! That's The patch still needs work -- right now it's saving and restoring ALL My initial inclination was try setting a paren floor (which CURLYX When a regmatch_state is allocated, is the memory cleared or uninitialized? Deven |
From @iabynOn Tue, Jul 10, 2018 at 02:14:03PM -0700, Deven T. Corzine via RT wrote:
My take on it is based on two (possibly conflicting) ideals. The first is that when starting the (N+1)th iteration of a '*' expression print "1=[$1]\n" if "AA-ABB-" =~ /^ (?: \1? (\w) \1 - )* $/x which outputs 'B'. On the first iteration: On the second iteration: The second principle is that following an alternation, the captures Neither of these match: print "matched\n" if "B" =~ /^ (?: A(.) | B ) \1 $/x; So the real question is, at the end of an alternation, should any I think you could argue it either way. However, since this bug has been Invalidating captures set by a failing branch involves just knowing the -- |
1 similar comment
From @iabynOn Tue, Jul 10, 2018 at 02:14:03PM -0700, Deven T. Corzine via RT wrote:
My take on it is based on two (possibly conflicting) ideals. The first is that when starting the (N+1)th iteration of a '*' expression print "1=[$1]\n" if "AA-ABB-" =~ /^ (?: \1? (\w) \1 - )* $/x which outputs 'B'. On the first iteration: On the second iteration: The second principle is that following an alternation, the captures Neither of these match: print "matched\n" if "B" =~ /^ (?: A(.) | B ) \1 $/x; So the real question is, at the end of an alternation, should any I think you could argue it either way. However, since this bug has been Invalidating captures set by a failing branch involves just knowing the -- |
From @devenOn Tue, Jul 17, 2018 at 11:02 AM, Dave Mitchell <davem@iabyn.com> wrote:
Interesting. It's not a forward reference, but it's sort of an
My patch doesn't change the behavior of those examples. On the other hand, my patch does allow this example to match: print "matched\n" if "ABCDA" =~ /^ (?: (.)B | CD )* \1 $/x; Without my patch, this matches instead: print "matched\n" if "ABCDC" =~ /^ (?: (.)B | CD )* \1 $/x;
Indeed, and there is certainly room for debate here, as both arguments Personally, invalidating a successful capture that was part of a "foobar" =~ /^ (?: (foo) | (bar) )* $/x; Both with and without my patch, this leaves $1="foo" and $2="bar". If
That's a good point. Still, it's worth considering that restoring the Strangely, I can't find a single word in the documentation about what
Yes, that's what my current patch does, and there IS a performance Would it be better to remember the captures some other way? Perl 6 Deven |
1 similar comment
From @devenOn Tue, Jul 17, 2018 at 11:02 AM, Dave Mitchell <davem@iabyn.com> wrote:
Interesting. It's not a forward reference, but it's sort of an
My patch doesn't change the behavior of those examples. On the other hand, my patch does allow this example to match: print "matched\n" if "ABCDA" =~ /^ (?: (.)B | CD )* \1 $/x; Without my patch, this matches instead: print "matched\n" if "ABCDC" =~ /^ (?: (.)B | CD )* \1 $/x;
Indeed, and there is certainly room for debate here, as both arguments Personally, invalidating a successful capture that was part of a "foobar" =~ /^ (?: (foo) | (bar) )* $/x; Both with and without my patch, this leaves $1="foo" and $2="bar". If
That's a good point. Still, it's worth considering that restoring the Strangely, I can't find a single word in the documentation about what
Yes, that's what my current patch does, and there IS a performance Would it be better to remember the captures some other way? Perl 6 Deven |
From @davidnicolThis is a good test case for the bug: On the other hand, my patch does allow this example to match: print "matched\n" if "ABCDA" =~ /^ (?: (.)B | CD )* \1 $/x; Without my patch, this matches instead: print "matched\n" if "ABCDC" =~ /^ (?: (.)B | CD )* \1 $/x; The bug appears to be the result of an optimization of describing the Is that what the patch changes? -- |
1 similar comment
From @davidnicolThis is a good test case for the bug: On the other hand, my patch does allow this example to match: print "matched\n" if "ABCDA" =~ /^ (?: (.)B | CD )* \1 $/x; Without my patch, this matches instead: print "matched\n" if "ABCDC" =~ /^ (?: (.)B | CD )* \1 $/x; The bug appears to be the result of an optimization of describing the Is that what the patch changes? -- |
From @devenOn Tue, Jul 17, 2018 at 12:58 PM David Nicol <davidnicol@gmail.com> wrote:
That optimization doesn't cause the bug, it's the attempt to match the (.) Deven |
1 similar comment
From @devenOn Tue, Jul 17, 2018 at 12:58 PM David Nicol <davidnicol@gmail.com> wrote:
That optimization doesn't cause the bug, it's the attempt to match the (.) Deven |
From @davidnicolOn Tue, Jul 17, 2018 at 12:56 PM Deven T. Corzine <deven@ties.org> wrote:
Thank you, I misunderstood. So in the original demonstration, the "b" got As the current documentation (the section on "Capture Groups" in The current documentation (that section) contains no guidance concerning "abcdef" =~ /(?:(.))+/ to magically create $1 through $6 and load them all. This was erroneous! As an opinionated person, I'm in favor of fixing the regression and # we don't clobber capture groups with data from failed alternate into the test suite and documenting how captures into buffers in ... $ perl -e ' "ABCDAFCDAD" =~ /(?:(?:(.)B|CD)+|(?:(.)D|A(.))*)+/ and print will do the right thing, whatever that is. Thank you -- |
1 similar comment
From @davidnicolOn Tue, Jul 17, 2018 at 12:56 PM Deven T. Corzine <deven@ties.org> wrote:
Thank you, I misunderstood. So in the original demonstration, the "b" got As the current documentation (the section on "Capture Groups" in The current documentation (that section) contains no guidance concerning "abcdef" =~ /(?:(.))+/ to magically create $1 through $6 and load them all. This was erroneous! As an opinionated person, I'm in favor of fixing the regression and # we don't clobber capture groups with data from failed alternate into the test suite and documenting how captures into buffers in ... $ perl -e ' "ABCDAFCDAD" =~ /(?:(?:(.)B|CD)+|(?:(.)D|A(.))*)+/ and print will do the right thing, whatever that is. Thank you -- |
From @iabynOn Tue, Jul 17, 2018 at 11:42:11AM -0400, Deven T. Corzine wrote:
It would match the (admittedly ambiguous) bit in perlre that David Nicol If a group did not match, the associated backreference won't match
There is considerable performance overhead in saving even just one set of The only ones needing saving or invalidating are the ones with indices It might be worth writing an alternative patch which does just the I suspect that writing the invalidation code might be quite tricky (in the
What do you mean exactly? -- |
1 similar comment
From @iabynOn Tue, Jul 17, 2018 at 11:42:11AM -0400, Deven T. Corzine wrote:
It would match the (admittedly ambiguous) bit in perlre that David Nicol If a group did not match, the associated backreference won't match
There is considerable performance overhead in saving even just one set of The only ones needing saving or invalidating are the ones with indices It might be worth writing an alternative patch which does just the I suspect that writing the invalidation code might be quite tricky (in the
What do you mean exactly? -- |
From @davidnicol
the expression in the capture matching isn't enough; the capture has to "go" matches $1 but does not clobber $1 because "gox" is neither [fg]oo nor "go" clobbers $1 even though "gox" is not [fg]oo, as a later alternative When exactly does the assignment happen? -- |
1 similar comment
From @davidnicol
the expression in the capture matching isn't enough; the capture has to "go" matches $1 but does not clobber $1 because "gox" is neither [fg]oo nor "go" clobbers $1 even though "gox" is not [fg]oo, as a later alternative When exactly does the assignment happen? -- |
From @iabynOn Sun, Aug 12, 2018 at 09:39:16PM -0400, Deven T. Corzine wrote:
(Yves, I'e CCed you as there's a question for you further down this Sorry for the late reply. I took the opportunity to try to properly I think I now have a much better handle on this. I've just pushed a branch I know a way to make it much faster, but it involves, for every Yves, is there type of node - or a way to extend the current nodes - such Anyway, I've pushed my current proposed fix as smoke-me/davem/captures, commit d0f9ad5 regex: restore failed branch capture within repeat -- |
From @devenOn Tue, Aug 28, 2018 at 10:47 AM Dave Mitchell <davem@iabyn.com> wrote:
Sorry, now I'm the one being slow to reply! Yes, I remember that comment,
I submitted my initial patch at James Keenan's request, solely as a My initial patch was only focused on correctness, not performance. I know a way to make it much faster, but it involves, for every
I know there are only 8 bits to distinguish them, but perhaps this is a Alternatively, if "ARG(scan)" or another mechanism could be used to save Anyway, I've pushed my current proposed fix as smoke-me/davem/captures,
I'll definitely check out your proposed fix; hopefully I'll get to it Thanks for looking into this! Deven |
From @iabynOn Tue, Oct 16, 2018 at 03:18:07PM -0400, Deven T. Corzine wrote:
Since nearly two months have passed since I worked on this, the details
The node just needs to store the capture index, while the runtime code
With the current node types used by the various branch ops, there is Which is why I asked Yves about the best way to extend such nodes. -- |
From @devenOn Wed, Oct 17, 2018 at 7:02 AM Dave Mitchell <davem@iabyn.com> wrote:
I had already spent probably 20 hours or more tracking down this bug while After all that research and testing, I felt confident that I finally Ultimately, the changes were fairly straightforward, considering. I'll First, I added "CHECKPOINT lastcp;" to the "regmatch_state,u" union in the Next, I used regcppush() to save the captures near the beginning of "case @@ -8043,7 +8042,10 @@ NULL /* Now go into the branch */ From looking at the existing code, this appeared the right way to save all I did consider whether "case BRANCHJ" might need any separate changes, but Next, I used regcp_restore() to restore the saved captures in "case @@ -8077,8 +8079,10 @@ NULL From studying the existing code, regcp_restore() appeared to be the right Finally, I mirrored both of these changes in "case TRIE" and "case @@ -5920,6 +5920,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char + /* save state of captures at branch start */ I didn't analyze ST.jump as fully as regcppush() and friends, so I was a Come to think of it, perhaps I should have also removed this other ST.jump if ( ST.jump ) { Looking at it again, this seems redundant too, but I didn't think to try I'm not sure why the TRIE case was only dealing with the captures when The node just needs to store the capture index, while the runtime code
That makes sense. Mind if I take a stab at that, if I get a chance?
Yeah, I came to the same conclusions about the flags too.
This sounds like a viable option. Why do you call it naive? The TRIE nodes are more complex, but is there a particular reason you're If the flags can't be used, I'm thinking that new regnode types make sense Thanks again for taking the time to work on this bug. (I hope I'm not Deven |
From @demerphqI'm sorry I have missed some of this discussion over the past while... On Wed, 17 Oct 2018, 16:29 Deven T. Corzine, <deven@ties.org> wrote:
A non jump trie node is a Boolean test, meaning it shouldn't touch any Consider /(Foo(baz)|wh(oops))(zop)/ Versus /(Foobaz|whoops)(zop)/ The first should be a jump trie, the second a non-jump true.
I imagine because it is a product of peephole optimisation and thus is
I will try to find time to think about this but make no promises. Cheers
Its not a waste of time at all,, it's just hard to find time to complement |
From @khwilliamsonOn 10/20/18 8:04 AM, demerphq wrote:
I don't know much about this particular code area, but I can say that we
And, it's good that you've learned about this code. Perhaps you'll |
I haven't worked on this in a long time, but I stumbled across the following archive: http://mirrors.develooper.com/perl/really-ancient-perls/oldperl/dist/fun.tar.gz Inside this archive, there are two nested archives containing ancient versions of Perl:
These particular versions are significant because the main perl5 repository unfortunately had to jump straight from perl-5.000alpha9 to perl-5.000 in a massive single commit, skipping the 31 intervening versions of Perl listed on the perlhist page between 5.000alpha9 and the final 5.000 release (which were unavailable at the time), and the two versions above are both in that range of missing versions. I've added commits to my clone of the perl5 repository for both of these two versions: I also ran the following command on my local repository, to graft a replacement commit for the original perl-5.000 commit, but it appears to be impossible to actually make a pull request for replacement graft commits like this.
Since this issue began somewhere in this range of 31 missing versions, I built perl-5.000alpha9 and reconfirmed that it does not exhibit the issue:
After this, I built perl-5.000a12h and learned that it does exhibit the issue:
This narrows the range where the issue began down to one of the 15 versions from 5.000alpha10 through 5.000alpha12h... |
There were a lot of changes to the regex engine between 5.000alpha9 and 5.000alpha12h, including:
|
Overall, between |
I guess this thread got stalled? Dave Mitchel asked: "I know a way to make it much faster, but it involves, for every Yves, is there type of node - or a way to extend the current nodes - such Not really unfortunately. In theory you can change the structures that are used to implement BRANCH, BRANCHJ, TRIE, TRIEC and run make regen. In practice this can break a lot, unfortunately a bunch of code in regex compiler makes unhealthy assumptions about the size of regops. I just tried a simple fix of changing BRANCH to be regnode_1 and BRANCHJ to be regnode_2L, and it broke a ton of stuff. TRIE/TRIEC are easier to deal with likely. I have put this on my todo list, my time availability will be limited in the near future however. Yves |
Hi, wanted to give an update. I have been working on a patch that makes it much easier to resize regnodes, and I am implementing the change requested in this ticket. I will follow up with a pr next week. |
Hi @iabyn I now have a branch which stores the number of capture buffers were opened before it, but reading your request again I am not sure I understood what you wanted properly. The branch is yves/regnode_typedefs which is not an ideal name, but there is a bunch of stuff in the branch and some of it relates to this. The branch is messy as heck and needs some rebase -i reorganization, squashing, and etc, but it does work, and it adds support for detecting when changing the size of a regnode would break things. It changes the type of a BRANCH regop from struct regnode to struct regnode_1, and the BRANCHJ regop from struct regnode_1 to struct regnode_2L. I could not change the size of the TRIE regops as it would mean that we would also have to make all of the EXACT regops larger. Instead I have added a new member to the struct _reg_trie_data, which is stored in the data-array and accessible that way, and because of this it doesn't make sense to make TRIEC nodes larger, they can just use the same infra as the TRIE ones would. In the below output you can see the new data as (buf: 2) and similar.
and also here:
As I said I am not entirely sure I am storing what you expect. For instance what should /(foo|bar()|baz()()|bop)/ show for the branches? MY current implementation shows this:
FWIW, I am highly tempted to change the debug output so that the "next regop" is shown as -> \d and not (\d) as it would be nice to use the parens for something more useful. @khwilliamson and @deven you may find this interesting also. |
BTW, @iabyn the smoke-me/davem/captures branch was deleted, I have rebased it and I will repush it to the repo to save you from doing it yourself. |
@demerphq For whatever my opinion may be worth, I think that's an excellent idea, to display the "next regop" with |
I had hoped to take a stab at implementing a production-ready fix for this bug, which I knew my example patch was not, even though it did fix the bug. But after the work you guys have done, which I haven't even absorbed yet, I'm not sure there's much left for me to help with on the implementation now? |
@iabyn I have repushed your branch as smoke-me/davem/captures_rebased, it is rebased on top of blead from a few minutes ago. I will start cleaning up my branch to enable tracking which parens we are in. |
@deven, I mostly wanted you to know your work on this was not for nothing and that it was back in play. One way or another I plan to get either your or daves or daves improved patch in for 5.38, we are too close to 5.36 to risk it, but I do want to get this fixed! Also I figured you might have something to add to the discussion as you did all that work understanding how things work. |
On Thu, Apr 07, 2022 at 07:15:31AM -0700, Yves Orton wrote:
I now have a branch which stores the number of capture buffers were
opened before it, but reading your request again I am not sure I
understood what you wanted properly.
The original ticket was from so long ago that I've forgotten most details,
but from what I recall:
1) the issue involves the saving and restoring of capture info on repeats /
alternations in a way that is hard to understand, does my head in, and
will take a while for me to get my head back around.
2) I had a potential fix for the bug, but it made some things slower, so
wasn't ideal.
3) I knew a better way to fix things but it involved needing access to the
current capture index in some places where that info wasn't currently
held. Hopefully your branch solves that issue, which means that my better
method may now work.
4) However, I now have no idea what my better method was. See (1) about
taking a while to get my head round it again.
So I'll add this to my list of things to look at, but it won't be any time
very soon.
…--
Monto Blanco... scorchio!
|
I will start looking into this soon again. |
Migrated from rt.perl.org#133352 (status was 'open')
Searchable as RT133352$
The text was updated successfully, but these errors were encountered: