Regex bug with optional capturing group (<...>)? and backreference #14848

p5pRT · 2015-08-13T03:27:19Z

Migrated from rt.perl.org#125798 (status was 'open')

Searchable as RT125798$

p5pRT · 2015-08-13T03:27:20Z

From hdthanh@tma.com.vn

Hi,

Please see the report in the attachment. PATH variable is redacted, since it
is non-crucial to the bug report and it contains sensitive information.

To clarify: The bug is present in Perl 5.20.1, which I tested via ideone.
The version I use to generate the report is Perl 5.14.4, which doesn't have
the problem.

p5pRT · 2015-08-13T03:27:20Z

From hdthanh@tma.com.vn

recapturebug.rep

p5pRT · 2015-08-15T20:19:57Z

From @khwilliamson

On 08/12/2015 09:27 PM, Thanh Hong Dai (via RT) wrote:

# New Ticket Created by Thanh Hong Dai
# Please include the string: [perl #125798]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=125798 >

Hi,

Please see the report in the attachment. PATH variable is redacted, since it
is non-crucial to the bug report and it contains sensitive information.

To clarify: The bug is present in Perl 5.20.1, which I tested via ideone.
The version I use to generate the report is Perl 5.14.4, which doesn't have
the problem.

I bisected the change of behavior to this:

7016d6e is the first bad commit
commit 7016d6e
Author: David Mitchell <davem@iabyn.com>
Date: Fri Sep 21 10:29:04 2012 +0100

stop regex engine reading beyond end of string

Historically the regex engine has assumed that any string passed to it
will have a trailing null char. This isn't normally an issue in
perl code,
since perl strings *are* null terminated; but it could cause
problems with
strings returned by XS code, or with someone calling the regex engine
directly from XS, with strend not pointing at a null char.

The engine currently relies on there being a null char in the following
ways.

First, when at the end of string, the main loop of regmatch() still
reads
in the 'next' character (i.e. the character following the end of
string)
even if it doesn't make any use of it. This precludes using memory
mapped
files as strings for example, since the read off the end would SEGV.

Second, the matching algorithm often required the trailing
character to be
\0 to work correctly: the test for 'EOF' was "if next char is null
*and*
locinput >= PL_regeol, then stop". So a random non-null trailing char
could cause an overshoot.

Thirdly, some match ops require the trailing char to be null to operate
correctly; for example, \b applied at the end of the string only
happens
to work because the trailing char (\0) happens to match \W.

Also, some utf8 ops will try to extract the code point at the end,
which
can result in multiple bytes past the end of string being read, and
possible problems if they don't correspond to well-formed utf8.

The main fix is in S_regmatch, where the 'read next char' code has been
updated to set it to a special value, NEXTCHR_EOS instead, if we
would be
reading past the end of the string.

Lots of other random bits in the regex engine needed to be fixed up
too.

To track these down, I temporarily hacked regexec_flags() to make a
copy
of the string but without trailing \0, then ran all the t/re/*.t tests
under valgrind to flush out all buffer overruns. So I think I've
removed
most of the bad code, but by no means all of it. The code within the
various functions in regexec.c is far too complex to be able to
visually
audit the code with any confidence.

p5pRT · 2015-08-15T20:19:58Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2015-08-17T12:15:53Z

From @iabyn

On Wed, Aug 12, 2015 at 08:27:20PM -0700, Thanh Hong Dai wrote:

Given this test program:

while ("aa" =~ /\b(\w)?(\w)(\w?)(\2\1)/g) {
print "[$&] [$1] [$2] [$3] [$4]\n";
}

while ("aba" =~ /\b(\w)?(\w)(\w?)(\2\1)/g) {
print "[$&] [$1] [$2] [$3] [$4]\n";
}

In version 5.14.14, the program output nothing, which is the
expected behavior.

However, in version 5.20.1, the program outputs:

[a] [] [a] [] []
[ab] [] [a] [b] []

Arguably both 5.14.4 and bleadperl are wrong.

It all depends on what we expect \1 to match when the capture is
zero-length, and in particular whether it's of the form /(x)?/ or /(x?)/

Consider the following:

"aa" =~ /^(a?)(a)(\2\1)/ or die;
print "&=[$&] 1=[$1] 2=[$2] 3=[$3]\n";

which gives:

$ perl5144 ~/tmp/p; perl5220 ~/tmp/p
&=[aa] 1=[] 2=[a] 3=[a]
&=[aa] 1=[] 2=[a] 3=[a]

Move that '?' outside:

"aa" =~ /^(a)?(a)(\2\1)/ or die;
print "&=[$&] 1=[$1] 2=[$2] 3=[$3]\n";

and we get this:

$ perl5144 ~/tmp/p; perl5220 ~/tmp/p
Died at /home/davem/tmp/p line 3.
&=[a] 1=[] 2=[a] 3=[]

Clearly for /(a?)/, \1 should match a zero-length string;
for /(a)?/ its not so obvious what should happen. It could be regarded as
a zero-length string, or it could be regarded as a non-existent capture
and so \1 always fails to match.

Does anyone have any opinions what the correct semantics should be?
Then I'll know the correct way to fix it.

Note that perl distinguishes between zero length and undefined $1:

$ perl -le'"aa" =~ /^(x?)/; print "def" if defined $1;'def
def
$ perl -le'"aa" =~ /^(x)?/; print "def" if defined $1;'
$

--
"Do not dabble in paradox, Edward, it puts you in danger of fortuitous wit."
-- Lady Croom, "Arcadia"

p5pRT · 2015-08-17T13:02:39Z

From @Abigail

On Mon, Aug 17, 2015 at 01:15:20PM +0100, Dave Mitchell wrote:

On Wed, Aug 12, 2015 at 08:27:20PM -0700, Thanh Hong Dai wrote:

Given this test program:

while ("aa" =~ /\b(\w)?(\w)(\w?)(\2\1)/g) {
print "[$&] [$1] [$2] [$3] [$4]\n";
}

while ("aba" =~ /\b(\w)?(\w)(\w?)(\2\1)/g) {
print "[$&] [$1] [$2] [$3] [$4]\n";
}

In version 5.14.14, the program output nothing, which is the
expected behavior.

However, in version 5.20.1, the program outputs:

[a] [] [a] [] []
[ab] [] [a] [b] []

Arguably both 5.14.4 and bleadperl are wrong.

It all depends on what we expect \1 to match when the capture is
zero-length, and in particular whether it's of the form /(x)?/ or /(x?)/

Consider the following:
"aa" =~ /^$a?$$a$$\\2\\1$/ or die;
print "&=\[$&\] 1=\[$1\] 2=\[$2\] 3=\[$3\]\\n";
which gives:
$ perl5144 ~/tmp/p; perl5220 ~/tmp/p
&=\[aa\] 1=\[\] 2=\[a\] 3=\[a\]
&=\[aa\] 1=\[\] 2=\[a\] 3=\[a\]
Move that '?' outside:
"aa" =~ /^$a$?$a$$\\2\\1$/ or die;
print "&=\[$&\] 1=\[$1\] 2=\[$2\] 3=\[$3\]\\n";
and we get this:
$ perl5144 ~/tmp/p; perl5220 ~/tmp/p
Died at /home/davem/tmp/p line 3\.
&=\[a\] 1=\[\] 2=\[a\] 3=\[\]
Clearly for /(a?)/, \1 should match a zero-length string;
for /(a)?/ its not so obvious what should happen. It could be regarded as
a zero-length string, or it could be regarded as a non-existent capture
and so \1 always fails to match.

Does anyone have any opinions what the correct semantics should be?
Then I'll know the correct way to fix it.

Note that perl distinguishes between zero length and undefined $1:
$ perl \-le'"aa" =~ /^$x?$/; print "def" if defined $1;'def
def
$ perl \-le'"aa" =~ /^$x$?/; print "def" if defined $1;'
$

Perl isn't very consistent in how it treats /(x)?/ vs /(x?)/:

$ perl -wE '"aa" =~ /^(x?)/; say scalar @-, " ", scalar @+'
2 2
$ perl -wE '"aa" =~ /^(x)?/; say scalar @-, " ", scalar @+'
1 2
$ perl -wE '"aa" =~ /^(x?)(a)/; say scalar @-, " ", scalar @+'
3 3
$ perl -wE '"aa" =~ /^(x)?(a)/; say scalar @-, " ", scalar @+'
3 3

So, a (x)? will not cause an entry in @- if the match is empty, but only
if it's not followed by a non-empty match; it does get a match in @+
either way.

IMO, "aa" =~ /^(x?)/ should set $1 (and \1) to the empty string, and
"aa" =~ /^(x)?/ should set $1 (and \1) to undef. Interpolating undef
should result in an empty string. And since it currently doesn't
warn, we should not change that. (And in general, Perl does this).

I do think scalar @- should always equal to scalar @+; and if not,
it should be doumented in which cases it doesn't, preferably with
an explanation why this is such an awesome idea (as opposed to it
being an ordinary bug).

Abigail

p5pRT · 2015-08-17T14:21:02Z

From @iabyn

On Mon, Aug 17, 2015 at 03:01:59PM +0200, Abigail wrote:

Perl isn't very consistent in how it treats /(x)?/ vs /(x?)/:

$ perl \-wE '"aa" =~ /^\(x?\)/; say scalar @&#8203;\-\, " "\, scalar @&#8203;\+'
2 2
$ perl \-wE '"aa" =~ /^\(x\)?/; say scalar @&#8203;\-\, " "\, scalar @&#8203;\+'
1 2
$ perl \-wE '"aa" =~ /^\(x?\)\(a\)/; say scalar @&#8203;\-\, " "\, scalar @&#8203;\+'
3 3
$ perl \-wE '"aa" =~ /^\(x\)?\(a\)/; say scalar @&#8203;\-\, " "\, scalar @&#8203;\+'
3 3

The implementation of @+/@- is (somewhat) orthogonal to this issue.

Internally perl maintains an array of start and end indexes (which @+/@-
exposes), but uses an index of -1 to indicate an invalid value.
It also has a 3rd value, the tmp_start, which is used while in the middle
of a capture.

Initially the (start,tmp_start;end) pairs are set to (-1,-1; -1). An
open parenthesis sets the tmp_start. A close parenthesis sets start to
tmp_start and sets end to the current position.
(splitting start into start and tmp_start allows the previous capture
position to be preserved when failing the nth iteration of something.
e.g. /(foo){3,4}/.)

It also maintains a high-water mark of the highest close paren seen.

On failure backtrack, it sets the end index of any backtracked close
parentheses to -1, and also restores the previous high-water mark.
The starts are left untouched.

The issue with the \<N> code is that it checks for for start being -1,
but doesn't handle end being -1 properly (and so lots of weird off-by-one
issues present themselves).

So, a (x)? will not cause an entry in @- if the match is empty, but only
if it's not followed by a non-empty match; it does get a match in @+
either way.

The issue with @+/@- seems to be not handling the high water mark
correctly:

$ perl -wE 'sub f { say join ":", map $_//"u", @_ } "aa" =~ /^(x?)/; f(@+);f(@-)'
0:0
0:0
$ perl -wE 'sub f { say join ":", map $_//"u", @_ } "aa" =~ /^(x)?/; f(@+);f(@-)'
0:u
0
$ perl -wE 'sub f { say join ":", map $_//"u", @_ } "aa" =~ /^(x?)(a)/; f(@+);f(@-)'
1:0:1
0:0:0
$ perl -wE 'sub f { say join ":", map $_//"u", @_ } "aa" =~ /^(x)?(a)/; f(@+);f(@-)'
1:u:1
0:u:0

For the second one above, I think they should both be '0:u', or possibly '0'.

IMO, "aa" =~ /^(x?)/ should set $1 (and \1) to the empty string, and
"aa" =~ /^(x)?/ should set $1 (and \1) to undef. Interpolating undef
should result in an empty string. And since it currently doesn't
warn, we should not change that. (And in general, Perl does this).

Agreed.

I do think scalar @- should always equal to scalar @+;

Agreed.

--
But Pity stayed his hand. "It's a pity I've run out of bullets",
he thought. -- "Bored of the Rings"

p5pRT · 2015-08-17T18:59:04Z

From @hvds

Dave Mitchell <davem@iabyn.com> wrote:
:On Wed, Aug 12, 2015 at 08:27:20PM -0700, Thanh Hong Dai wrote:
:> Given this test program:
:>
:> while ("aa" =~ /\b(\w)?(\w)(\w?)(\2\1)/g) {
:> print "[$&] [$1] [$2] [$3] [$4]\n";
:> }
:>
:> while ("aba" =~ /\b(\w)?(\w)(\w?)(\2\1)/g) {
:> print "[$&] [$1] [$2] [$3] [$4]\n";
:> }
:>
:> In version 5.14.14, the program output nothing, which is the
:> expected behavior.
:>
:> However, in version 5.20.1, the program outputs:
:>
:> [a] [] [a] [] []
:> [ab] [] [a] [b] []
:
:Arguably both 5.14.4 and bleadperl are wrong.
:
:It all depends on what we expect \1 to match when the capture is
:zero-length, and in particular whether it's of the form /(x)?/ or /(x?)/

I believe this was discussed on the list before now, possibly even to the
point of deciding something definitive; however that would've been several
years ago, and is likely to be difficult to track down.

(It's just possible that the discussion started shortly after implementation
of @- and @+ landed to highlight existing shortcomings, but I wouldn't bet
on it.)

My current preference is that /(x)?...\1/ should not match if the (x) was
not matched, but I don't recall enough of the previous discussion to know
whether this is because of or despite that discussion.

I think it'll be clear enough in the source if it is making some sort of
attempt to make that happen - ie to distinguish between (x)? and (x?), and
attempt to make the former look like an unmatchable capture: if it is, I'd
take that as evidence this is the intended behaviour. Sorry I don't have
time to research it myself right now.

Hugo

p5pRT · 2015-08-19T10:26:38Z

From @iabyn

On Mon, Aug 17, 2015 at 07:30:24PM +0100, hv@crypt.org wrote:

I believe this was discussed on the list before now, possibly even to the
point of deciding something definitive; however that would've been several
years ago, and is likely to be difficult to track down.

(It's just possible that the discussion started shortly after implementation
of @- and @+ landed to highlight existing shortcomings, but I wouldn't bet
on it.)

My current preference is that /(x)?...\1/ should not match if the (x) was
not matched, but I don't recall enough of the previous discussion to know
whether this is because of or despite that discussion.

I think it'll be clear enough in the source if it is making some sort of
attempt to make that happen - ie to distinguish between (x)? and (x?), and
attempt to make the former look like an unmatchable capture: if it is, I'd
take that as evidence this is the intended behaviour. Sorry I don't have
time to research it myself right now.

Changing the src so that this line:

if (rex->lastparen < n || ln == -1)
sayNO; /* Do not match unless seen CLOSEn. */

instead pretends that \N is the empty string, causes the following tests
in t/re/re_tests to fail:

(a)|\1 x n - Reference to group in different branch
(?:(b)?a)\1 a n - Reference to group that did not match
((\3|b)\2(a)x)+ aaxabxbaxbbx n - -
((\3|b)\2(a)x)+ aaaxabaxbaaxbbax y $&-$1-$2-$3 bbax-bbax-b-a
((\3|b)\2(a)){2,} bbaababbabaaaaabbaaaabba y $&-$1-$2-$3 bbaaaabba-bba-b-a

So I think its clear that backrefs to captures that didn't capture are
supposed to fail.

--
The crew of the Enterprise encounter an alien life form which is
surprisingly neither humanoid nor made from pure energy.
-- Things That Never Happen in "Star Trek" #22

p5pRT · 2015-08-21T16:35:49Z

From @ap

* Dave Mitchell <davem@iabyn.com> [2015-08-17 14:20]:

Clearly for /(a?)/, \1 should match a zero-length string;
for /(a)?/ its not so obvious what should happen. It could be regarded
as a zero-length string, or it could be regarded as a non-existent
capture and so \1 always fails to match.

It seems logical that “match the same thing again” cannot possibly match
anything if it refers to a thing that didn’t match. This can be employed
in something like

(?x) (delimiter)? (actual value pattern) (?> \1 | terminator)

where the failure of a backref to a failed group is a useful semantic.
And if the backref is not supposed to fail, there are various options
(such as putting a ? quantifier on it) to protect against that – which
have the useful property that they require the user to specify how the
failure should be dealt with (instead of a one-size-fits-all “matches
the empty string” for everyone).

So I think even if the existing indicators didn’t already point that
way, it can be argued well that it is the more self-consistent choice.

And of course by way of (?(1)...) it is possible to get either semantic
regardless of the default, so this is ultimately only a question of what
the better default is.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT · 2015-08-21T18:26:17Z

From @demerphq

On 19 August 2015 at 12:26, Dave Mitchell <davem@iabyn.com> wrote:

On Mon, Aug 17, 2015 at 07:30:24PM +0100, hv@crypt.org wrote:

I believe this was discussed on the list before now, possibly even to the
point of deciding something definitive; however that would've been several
years ago, and is likely to be difficult to track down.

(It's just possible that the discussion started shortly after implementation
of @- and @+ landed to highlight existing shortcomings, but I wouldn't bet
on it.)

My current preference is that /(x)?...\1/ should not match if the (x) was
not matched, but I don't recall enough of the previous discussion to know
whether this is because of or despite that discussion.

I think it'll be clear enough in the source if it is making some sort of
attempt to make that happen - ie to distinguish between (x)? and (x?), and
attempt to make the former look like an unmatchable capture: if it is, I'd
take that as evidence this is the intended behaviour. Sorry I don't have
time to research it myself right now.

IMO we definitely want to differentiate between (X)? and (X?).

You can see why if you define X? as the alternation of X and the empty
string. Once you do that there is a clear difference between:

/(?:(X)|)/

and

/((?:X|))/

In other words I would expect the following to behave as it does:

'print "d"=~/((?:x|))\1d/ ? 1 : 0'
1

$ perl -le'print "d"=~/(?:(x)|)\1d/ ? 1 : 0'
0

I find this subject confusing because () does two jobs, grouping and
capturing, at the same time, with /(X)*/ being the same as:
/(?:(X))*/, which isn't always obvious to people.

Yves

p5pRT · 2015-08-21T19:12:34Z

From @demerphq

On 21 August 2015 at 20:25, demerphq <demerphq@gmail.com> wrote:

On 19 August 2015 at 12:26, Dave Mitchell <davem@iabyn.com> wrote:

On Mon, Aug 17, 2015 at 07:30:24PM +0100, hv@crypt.org wrote:

I believe this was discussed on the list before now, possibly even to the
point of deciding something definitive; however that would've been several
years ago, and is likely to be difficult to track down.

(It's just possible that the discussion started shortly after implementation
of @- and @+ landed to highlight existing shortcomings, but I wouldn't bet
on it.)

My current preference is that /(x)?...\1/ should not match if the (x) was
not matched, but I don't recall enough of the previous discussion to know
whether this is because of or despite that discussion.

I think it'll be clear enough in the source if it is making some sort of
attempt to make that happen - ie to distinguish between (x)? and (x?), and
attempt to make the former look like an unmatchable capture: if it is, I'd
take that as evidence this is the intended behaviour. Sorry I don't have
time to research it myself right now.

IMO we definitely want to differentiate between (X)? and (X?).

You can see why if you define X? as the alternation of X and the empty
string. Once you do that there is a clear difference between:

/(?:(X)|)/

and

/((?:X|))/

In other words I would expect the following to behave as it does:

'print "d"=~/((?:x|))\1d/ ? 1 : 0'
1

$ perl -le'print "d"=~/(?:(x)|)\1d/ ? 1 : 0'
0

I find this subject confusing because () does two jobs, grouping and
capturing, at the same time, with /(X)*/ being the same as:
/(?:(X))*/, which isn't always obvious to people.

I worded that poorly. Let me try again. It is easy to get confused how
capturing buffers and quantifiers work together. Some of this is due
to the dual nature of capturing parens, with grouping and capturing
semantics. It is also because it is not always obvious that
quantifiers essentially always operate on an implied grouping parens.
With this understanding:

/X*/ is actually /(?:X)*/

which means that /(X)*/ is actually /(?:(X))*/

and if you define /X?/ as /(?:X|)/ then it is also obvious that

/(X)?/ becomes /(?:(X)|)/ which is obviously different from /((?:X|))/.

cheers.
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

demerphq · 2023-01-07T21:32:29Z

I noticed a comment in this thread from @iabyn about @+ and @-.

The issue with @+/@- seems to be not handling the high water mark
correctly:

$ perl -wE 'sub f { say join ":", map $_//"u", @_ } "aa" =~ /^(x?)/; f(@+);f(@-)'
0:0
0:0
$ perl -wE 'sub f { say join ":", map $_//"u", @_ } "aa" =~ /^(x)?/; f(@+);f(@-)'
0:u
0
$ perl -wE 'sub f { say join ":", map $_//"u", @_ } "aa" =~ /^(x?)(a)/; f(@+);f(@-)'
1:0:1
0:0:0
$ perl -wE 'sub f { say join ":", map $_//"u", @_ } "aa" =~ /^(x)?(a)/; f(@+);f(@-)'
1:u:1
0:u:0

For the second one above, I think they should both be '0:u', or possibly '0'.

@+ and @- are documented to have a different number potentially. @+ is populated with one entry per capture buffer in the pattern, with undefs for capture buffers that dont match. @- is populated only up to the last capture buffer that matched. This is from perlvar @-. The docs for @+ are not that clear.

            One can use $#- to
            find the last matched subgroup in the last successful match.
            Contrast with $#+, the number of subgroups in the regular
            expression.

demerphq · 2023-01-07T21:38:03Z

I am closing this ticket. I think it was resolved long ago.

p5pRT added Severity Low type-regex labels Oct 19, 2019

toddr removed the khw label Oct 25, 2019

xenu removed the Severity Low label Dec 29, 2021

This was referenced Jan 1, 2023

MetaBug: Capture Buffer, Quantifiers, and Backtracking Behavior and Optimization #20661

Open

fixup many regexp bugs - not to be merged yet #20677

Draft

demerphq closed this as completed Jan 7, 2023

demerphq mentioned this issue Mar 9, 2023

Regex bug fixes, refactoring (macroization) and code improvments #20918

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regex bug with optional capturing group (<...>)? and backreference #14848

Regex bug with optional capturing group (<...>)? and backreference #14848

p5pRT commented Aug 13, 2015

p5pRT commented Aug 13, 2015

p5pRT commented Aug 13, 2015

p5pRT commented Aug 15, 2015

p5pRT commented Aug 15, 2015

p5pRT commented Aug 17, 2015

p5pRT commented Aug 17, 2015

p5pRT commented Aug 17, 2015

p5pRT commented Aug 17, 2015

p5pRT commented Aug 19, 2015

p5pRT commented Aug 21, 2015

p5pRT commented Aug 21, 2015

p5pRT commented Aug 21, 2015

demerphq commented Jan 7, 2023

demerphq commented Jan 7, 2023

Regex bug with optional capturing group (<...>)? and backreference #14848

Regex bug with optional capturing group (<...>)? and backreference #14848

Comments

p5pRT commented Aug 13, 2015

p5pRT commented Aug 13, 2015

From hdthanh@tma.com.vn

p5pRT commented Aug 13, 2015

From hdthanh@tma.com.vn

p5pRT commented Aug 15, 2015

From @khwilliamson

p5pRT commented Aug 15, 2015

p5pRT commented Aug 17, 2015

From @iabyn

p5pRT commented Aug 17, 2015

From @Abigail

p5pRT commented Aug 17, 2015

From @iabyn

p5pRT commented Aug 17, 2015

From @hvds

p5pRT commented Aug 19, 2015

From @iabyn

p5pRT commented Aug 21, 2015

From @ap

p5pRT commented Aug 21, 2015

From @demerphq

p5pRT commented Aug 21, 2015

From @demerphq

demerphq commented Jan 7, 2023

demerphq commented Jan 7, 2023