Regex bug with optional capturing group (<...>)? and backreference #14848
From hdthanh@tma.com.vn
Hi, Please see the report in the attachment. PATH variable is redacted, since it
To clarify: The bug is present in Perl 5.20.1, which I tested via ideone.
From @khwilliamson
On 08/12/2015 09:27 PM, Thanh Hong Dai (via RT) wrote:
I bisected the change of behavior to this:
7016d6e is the first bad commit
stop regex engine reading beyond end of string
Historically the regex engine has assumed that any string passed to it
The engine currently relies on there being a null char in the following
First, when at the end of string, the main loop of regmatch() still
Second, the matching algorithm often required the trailing
Thirdly, some match ops require the trailing char to be null to operate
Also, some utf8 ops will try to extract the code point at the end,
The main fix is in S_regmatch, where the 'read next char' code has been
Lots of other random bits in the regex engine needed to be fixed up
To track these down, I temporarily hacked regexec_flags() to make a
The RT System itself - Status changed from 'new' to 'open'
From @iabyn
On Wed, Aug 12, 2015 at 08:27:20PM -0700, Thanh Hong Dai wrote:
Arguably both 5.14.4 and bleadperl are wrong.
It all depends on what we expect \1 to match when the capture is
Consider the following:
"aa" =~ /^(a?)(a)(\2\1)/ or die;
which gives:
$ perl5144 ~/tmp/p; perl5220 ~/tmp/p
Move that '?' outside:
"aa" =~ /^(a)?(a)(\2\1)/ or die;
and we get this:
$ perl5144 ~/tmp/p; perl5220 ~/tmp/p
Clearly for /(a?)/, \1 should match a zero-length string;
Does anyone have any opinions what the correct semantics should be?
Note that perl distinguishes between zero length and undefined $1:
$ perl -le'"aa" =~ /^(x?)/; print "def" if defined $1;'
def
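The zero-length versus undefined distinction mentioned above can be checked with a small script. This is only a sketch: `describe_group1` is a hypothetical helper introduced here for illustration, not something from the thread or from Perl itself.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): report what group 1 holds
# after matching $re against $string.
sub describe_group1 {
    my ($string, $re) = @_;
    return "no match" unless $string =~ $re;
    return "undef"    unless defined $1;
    return $1 eq "" ? "empty" : "'$1'";
}

# The '?' inside the group: the group participates and captures "".
print "(x?): ", describe_group1("aa", qr/^(x?)/), "\n";   # empty

# The '?' outside the group: the group is skipped, so $1 stays undefined.
print "(x)?: ", describe_group1("aa", qr/^(x)?/), "\n";   # undef
```

Both patterns match "aa"; only the state of the capture differs, which is what the question about \1 turns on.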
From @Abigail
On Mon, Aug 17, 2015 at 01:15:20PM +0100, Dave Mitchell wrote:
Perl isn't very consistent in how it treats /(x)?/ vs /(x?)/:
$ perl -wE '"aa" =~ /^(x?)/; say scalar @-, " ", scalar @+'
So, a (x)? will not cause an entry in @- if the match is empty, but only
IMO, "aa" =~ /^(x?)/ should set $1 (and \1) to the empty string, and
I do think scalar @- should always be equal to scalar @+; and if not,
Abigail
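To compare the two array lengths on a given Perl, something like the following sketch could be used. `offset_counts` is a hypothetical helper name, and no expected output is shown because the length of @- for the /(x)?/ case is exactly the behavior under discussion and may depend on the Perl version.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the element counts of
# @- and @+ after a successful match, or an empty list on failure.
sub offset_counts {
    my ($string, $re) = @_;
    return unless $string =~ $re;
    return (scalar(@-), scalar(@+));
}

for my $re (qr/^(x?)/, qr/^(x)?/) {
    my ($starts, $ends) = offset_counts("aa", $re);
    print "$re: scalar \@- = $starts, scalar \@+ = $ends\n";
}
```

Running it under several perl binaries side by side, as done elsewhere in the thread, makes any difference between versions visible.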
From @iabyn
On Mon, Aug 17, 2015 at 03:01:59PM +0200, Abigail wrote:
The implementation of @+/@- is (somewhat) orthogonal to this issue.
Internally perl maintains an array of start and end indexes (which @+/@-
Initially the (start,tmp_start;end) pairs are set to (-1,-1; -1). An
It also maintains a high-water mark of the highest close paren seen.
On failure backtrack, it sets the end index of any backtracked close
The issue with the \<N> code is that it checks for start being -1,
The issue with @+/@- seems to be not handling the high water mark
$ perl -wE 'sub f { say join ":", map $_//"u", @_ } "aa" =~ /^(x?)/; f(@+);f(@-)'
For the second one above, I think they should both be '0:u', or possibly '0'.
Agreed.
Agreed.
From @hvds
Dave Mitchell <davem@iabyn.com> wrote:
I believe this was discussed on the list before now, possibly even to the
(It's just possible that the discussion started shortly after implementation
My current preference is that /(x)?...\1/ should not match if the (x) was
I think it'll be clear enough in the source if it is making some sort of
Hugo
From @iabyn
On Mon, Aug 17, 2015 at 07:30:24PM +0100, hv@crypt.org wrote:
Changing the src so that this line:
if (rex->lastparen < n || ln == -1)
instead pretends that \N is the empty string, causes the following tests
(a)|\1 x n - Reference to group in different branch
So I think it's clear that backrefs to captures that didn't capture are
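The test entry quoted above, (a)|\1 against "x" with an expected result of "n" (no match), can be reproduced directly. A minimal sketch, where `matches` is a hypothetical helper and the empty-string case is added here for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): 1 if $re matches $string, else 0.
sub matches {
    my ($string, $re) = @_;
    return $string =~ $re ? 1 : 0;
}

my $re = qr/(a)|\1/;

print matches("a", $re), "\n";   # 1: the first branch captures "a"
print matches("x", $re), "\n";   # 0: group 1 never matched, so \1 fails
print matches("",  $re), "\n";   # 0: an unset \1 is not treated as ""
```

The last case is the one that would change if \1 were made to behave like the empty string for an unset group.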
From @ap
* Dave Mitchell <davem@iabyn.com> [2015-08-17 14:20]:
It seems logical that “match the same thing again” cannot possibly match
(?x) (delimiter)? (actual value pattern) (?> \1 | terminator)
where the failure of a backref to a failed group is a useful semantic.
So I think even if the existing indicators didn’t already point that
And of course by way of (?(1)...) it is possible to get either semantic
Regards,
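As a rough illustration of the (?(1)...) remark, the sketch below contrasts a plain backreference with one guarded by a conditional. The patterns and the names `$strict`, `$lenient` and `matches` are made up for this example and are not taken from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): 1 if $re matches $string, else 0.
sub matches {
    my ($string, $re) = @_;
    return $string =~ $re ? 1 : 0;
}

# Plain backreference: fails whenever group 1 did not participate.
my $strict  = qr/^(a)?b\1$/;

# Conditional: require \1 only if group 1 matched, otherwise match nothing.
my $lenient = qr/^(a)?b(?(1)\1)$/;

print matches("aba", $strict), matches("aba", $lenient), "\n";   # 11
print matches("b",   $strict), matches("b",   $lenient), "\n";   # 01
print matches("ab",  $strict), matches("ab",  $lenient), "\n";   # 00
```

With the conditional available, the "unset group behaves as empty" reading can be written explicitly where it is wanted, while the plain \1 keeps the failing semantic.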
From @demerphq
On 19 August 2015 at 12:26, Dave Mitchell <davem@iabyn.com> wrote:
IMO we definitely want to differentiate between (X)? and (X?).
You can see why if you define X? as the alternation of X and the empty
/(?:(X)|)/ and /((?:X|))/
In other words I would expect the following to behave as it does:
$ perl -le'print "d"=~/((?:x|))\1d/ ? 1 : 0'
$ perl -le'print "d"=~/(?:(x)|)\1d/ ? 1 : 0'
I find this subject confusing because () does two jobs, grouping and
Yves
From @demerphq
On 21 August 2015 at 20:25, demerphq <demerphq@gmail.com> wrote:
I worded that poorly. Let me try again.
It is easy to get confused how
/X*/ is actually /(?:X)*/
which means that
/(X)*/ is actually /(?:(X))*/
and if you define /X?/ as /(?:X|)/ then it is also obvious that /(X)?/ becomes /(?:(X)|)/ which is obviously different from /((?:X|))/.
cheers.
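The rewriting described above can be checked by running each short form next to its expanded form. This is a sketch only; `matches` and the `@cases` table are illustrative names, and the subject string "d" follows the one-liners used earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): 1 if $re matches $string, else 0.
sub matches {
    my ($string, $re) = @_;
    return $string =~ $re ? 1 : 0;
}

# Each short form is followed by its expansion with X? written as (?:X|).
my @cases = (
    [ '(x?)\1d',     qr/(x?)\1d/     ],   # quantifier inside the capture
    [ '((?:x|))\1d', qr/((?:x|))\1d/ ],   # expansion of the line above
    [ '(x)?\1d',     qr/(x)?\1d/     ],   # quantifier outside the capture
    [ '(?:(x)|)\1d', qr/(?:(x)|)\1d/ ],   # expansion of the line above
);

for my $case (@cases) {
    my ($label, $re) = @$case;
    printf "%-12s %d\n", $label, matches("d", $re);
}
# Prints 1, 1, 0, 0: in the first pair the group captures "", so \1
# matches; in the second pair the group is unset, so \1 fails.
```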
I noticed a comment in this thread from @iabyn about @+ and @-.
I am closing this ticket. I think it was resolved long ago.
Migrated from rt.perl.org#125798 (status was 'open')
Searchable as RT125798$