Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent behavior with backreferences nested inside subpattern references #13619

Open
p5pRT opened this issue Feb 22, 2014 · 25 comments · May be fixed by #20677
Open

Inconsistent behavior with backreferences nested inside subpattern references #13619

p5pRT opened this issue Feb 22, 2014 · 25 comments · May be fixed by #20677

Comments

@p5pRT
Copy link

p5pRT commented Feb 22, 2014

Migrated from rt.perl.org#121299 (status was 'open')

Searchable as RT121299$

@p5pRT
Copy link
Author

p5pRT commented Feb 22, 2014

From nbtrap@nbtrap.com

This bug report corresponds to the discussion at​:
http​://www.nntp.perl.org/group/perl.perl5.porters/2014/02/msg212894.html

Perl (5.16.3 and 5.18.1 at least) doesn't always handle backreferences
inside subpattern calls correctly. For example,

  say 'fofof' =~ /(.)((o)\1)(?2)/ ? 'true' : 'false'
  ==> true

  say 'fffffff' =~ /(.)(?2)((\1)(?4)(\1))/ ? 'true' : 'false'
  ==> false

  say 'foffoff' =~ /(.)(?2)((.)(?4)(\1))/ ? 'true' : 'false'
  ==> false

"use re 'debug'" shows that the second and third examples fail at the
first '\1' encounter, as though the first capture were being reset upon
entering the sub call. However, in the first example, the capture is
cearly retained in the '(?2)' call.

A similar (same?) problem occurs with forward or self-referential
backreferences​:

  print 'abcb' =~ /^(.\2?)(.)(?1)$/ ? "true\n" : "false\n"
  => false

  print 'abcb' =~ /^(.(?{ printf "%d, $2\n", pos()})\2?)(.)(?1)$/ ? "true\n" : "false\n"
  => 1,
  => 3, b
  => false

  print 'aba' =~ /^(.\1?)(?1)$/ ? "true\n" : "false\n"
  => false

  print 'aba' =~ /^(.(?{printf "%d, $1\n", pos()})\1?)(?1)$/ ? "true\n" : "false\n"
  => 1,
  => 2, a
  => false

In second example immediately above, the diagnostics say that we're
about to try to match 'b' at pos 3, which should succeed. Ditto for the
fourth example.

@p5pRT
Copy link
Author

p5pRT commented Feb 22, 2014

From nbtrap@nbtrap.com

Nathan Trapuzzano <nbtrap@​nbtrap.com> writes​:

There are still a couple bizarre cases I can think of, but it usually
takes some intense concentration to write them. Once they're written,
I'll be sure to test with Perl.

I tried to come up with a test that used backreferences of capture
groups at different levels of the subcall stack, and I found a rather
pathological case where Perl and Lisp give different results. I think
Lisp is correct. (I'm using your branch to test this.)

  say 'abbcdcabbda' =~ /^ (a|\3(?1)\2|(?2)) ((b|c)(?4)?) (?1) (d(?1)) $/x ? 'true' : 'false'
  ==> false

Pretty sure this should match. I had to use paper and pencil to walk
through and verify it.

Here's the short and sweet of it. After matching the 'b' at pos 2, we
descend into sub 1 (this is the first instance (visually from left to
right) of '(?1)'). Upon return from this sub, pos is at 8, and we have
to match '\2'. Now, Lisp says '\2' at this point is 'b', but Perl
thinks it's 'cdcab'. But the latter can't be correct, because 'cdcab'
is matched within the '(?1)' call itself and so should have been removed
from the capture group stack upon return therefrom.

Test patch attached

@p5pRT
Copy link
Author

p5pRT commented Feb 22, 2014

From nbtrap@nbtrap.com

perl.patch
From 1a55ddecc2867b27fcb0ce64b06263a6f4a4fb7b Mon Sep 17 00:00:00 2001
From: Nathan Trapuzzano <nbtrap@nbtrap.com>
Date: Sat, 22 Feb 2014 11:13:19 -0500
Subject: [PATCH] Add pathological test case for RT #121299.

---
 t/re/re_tests | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/t/re/re_tests b/t/re/re_tests
index c55e3d3..d066330 100644
--- a/t/re/re_tests
+++ b/t/re/re_tests
@@ -1848,10 +1848,12 @@ A+(*PRUNE)BC(?{})	AAABC	y	$&	AAABC
 /^\S+=/d	\x{3a3}=\x{3a0}	y	$&	\x{3a3}=
 /^\S+=/u	\x{3a3}=\x{3a0}	y	$&	\x{3a3}=
 
+# This group is from RT #121299
 /(.)((o)\1)(?2)/	fofof	y	$2	of
 /(.)(?2)((\1)(?4)(\1))/	fffffff	y	$1	f
 /(.)(?2)((.)(?4)(\1))/	foffoff	y	$2	off
 /^(.\2?)(.)(?1)$/	abcb	y	$2	b
 /^(.\1?)(?1)$/	aba	y	$1	a
+/^(a|\3(?1)\2|(?2))((b|c)(?4)?)(?1)(d(?1))$/	abbcdcabbda	y	$1	a
 # Keep these lines at the end of the file
 # vim: softtabstop=0 noexpandtab
-- 
1.9.0

@p5pRT
Copy link
Author

p5pRT commented Feb 22, 2014

From nbtrap@nbtrap.com

Nathan Trapuzzano <nbtrap@​nbtrap.com> writes​:

But the latter can't be correct, because 'cdcab' is matched within the
'(?1)' call itself and so should have been removed from the capture
group stack upon return therefrom.

Let me give a simpler example of what I mean​:

  say 'aaba' =~ /^ (?3)? ((.)) (\2(?1)\2) $/x ? 'true' : 'false'
  ==> true

See how the second occurrence of '\2' matches 'a' and not 'b'?

Now this bug becomes evident when we switch the ordering​:

  say 'aaba' =~ /^ (\3(?2)\3)? ((.)) (?1) $/x ? 'true' : 'false'
  ==> false

In stable Perl, this fails because '\3' isn't saved when we enter the
sub call. But a build of your branch fails here because the second '\3'
tries to match 'b' where it ought to try to match 'a'.

@p5pRT
Copy link
Author

p5pRT commented Feb 23, 2014

From @demerphq

On 22 February 2014 19​:15, Nathan Trapuzzano <nbtrap@​nbtrap.com> wrote​:

Nathan Trapuzzano <nbtrap@​nbtrap.com> writes​:

But the latter can't be correct, because 'cdcab' is matched within the
'(?1)' call itself and so should have been removed from the capture
group stack upon return therefrom.

Let me give a simpler example of what I mean​:

say 'aaba' =~ /^ (?3)? ((.)) (\2(?1)\2) $/x ? 'true' : 'false'
==> true

See how the second occurrence of '\2' matches 'a' and not 'b'?

I think this should match, and that the the second occurrence of \2
should match "a". The results from the GOSUB shouldn't be visible in
the calling context. I would expect the (?3)? to first match a single
character, and then match nothing, at which point the total pattern
would match with \1='a' and \2='a'. The second call to (?1) would
match 'b', but it wouldnt change its previous setting of 'a'.

Now this bug becomes evident when we switch the ordering​:

say 'aaba' =~ /^ (\3(?2)\3)? ((.)) (?1) $/x ? 'true' : 'false'
==> false

In stable Perl, this fails because '\3' isn't saved when we enter the
sub call. But a build of your branch fails here because the second '\3'
tries to match 'b' where it ought to try to match 'a'.

Hmm. I think this should also match. First we should match a single
char, \3 would be '' in both cases, then rest of the match should
fail, then we would match nothing, and go on to match similar to
above, with \2 and \3 == 'a', and \1 == undef/''.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Feb 23, 2014

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Feb 23, 2014

From nbtrap@nbtrap.com

demerphq <demerphq@​gmail.com> writes​:

On 22 February 2014 19​:15, Nathan Trapuzzano <nbtrap@​nbtrap.com> wrote​:

Nathan Trapuzzano <nbtrap@​nbtrap.com> writes​:

But the latter can't be correct, because 'cdcab' is matched within the
'(?1)' call itself and so should have been removed from the capture
group stack upon return therefrom.

Let me give a simpler example of what I mean​:

say 'aaba' =~ /^ (?3)? ((.)) (\2(?1)\2) $/x ? 'true' : 'false'
==> true

See how the second occurrence of '\2' matches 'a' and not 'b'?

I think this should match,

Agreed. I was not clear on that point.

The results from the GOSUB shouldn't be visible in the calling
context.

Yes.

I would expect the (?3)? to first match a single character, and then
match nothing, at which point the total pattern would match with
\1='a' and \2='a'.

($1==a and $2==a) should become true at pos 1, after the initial 'a' is
matched by top-leve '((.))'. (I think we agree).

The second call to (?1) would match 'b', but it wouldnt change its
previous setting of 'a'.

Yes, sine the capture group bindings are dynamic.,

Now this bug becomes evident when we switch the ordering​:

say 'aaba' =~ /^ (\3(?2)\3)? ((.)) (?1) $/x ? 'true' : 'false'
==> false

In stable Perl, this fails because '\3' isn't saved when we enter the
sub call. But a build of your branch fails here because the second '\3'
tries to match 'b' where it ought to try to match 'a'.

Hmm. I think this should also match.

Agreed, if it wasn't clear. Here we see the inconsistency of the
current implementation.

First we should match a single char, \3 would be '' in both cases,
then rest of the match should fail

Not quite. I believe \3 would fail outright, because undefined capture
variables behave differently from variables that matched ''. References
to undefined capture variables always fail. This should be easy to
verify with a test​:

  say '' =~ /\1()/ ? 'true' : 'false'
  => false
  say '' =~ /()\1/ ? 'true' : 'false'
  => true

then we would match nothing, and go on to match similar to above,
with \2 and \3 == 'a', and \1 == undef/''.

See above.

@p5pRT
Copy link
Author

p5pRT commented Feb 23, 2014

From @demerphq

On 23 February 2014 16​:47, Nathan Trapuzzano <nbtrap@​nbtrap.com> wrote​:

demerphq <demerphq@​gmail.com> writes​:

On 22 February 2014 19​:15, Nathan Trapuzzano <nbtrap@​nbtrap.com> wrote​:

Nathan Trapuzzano <nbtrap@​nbtrap.com> writes​:

But the latter can't be correct, because 'cdcab' is matched within the
'(?1)' call itself and so should have been removed from the capture
group stack upon return therefrom.

Let me give a simpler example of what I mean​:

say 'aaba' =~ /^ (?3)? ((.)) (\2(?1)\2) $/x ? 'true' : 'false'
==> true

See how the second occurrence of '\2' matches 'a' and not 'b'?

I think this should match,

Agreed. I was not clear on that point.

The results from the GOSUB shouldn't be visible in the calling
context.

Yes.

I would expect the (?3)? to first match a single character, and then
match nothing, at which point the total pattern would match with
\1='a' and \2='a'.

($1==a and $2==a) should become true at pos 1, after the initial 'a' is
matched by top-leve '((.))'. (I think we agree).

Me too. :-)

The second call to (?1) would match 'b', but it wouldnt change its
previous setting of 'a'.

Yes, sine the capture group bindings are dynamic.,

Now this bug becomes evident when we switch the ordering​:

say 'aaba' =~ /^ (\3(?2)\3)? ((.)) (?1) $/x ? 'true' : 'false'
==> false

In stable Perl, this fails because '\3' isn't saved when we enter the
sub call. But a build of your branch fails here because the second '\3'
tries to match 'b' where it ought to try to match 'a'.

Hmm. I think this should also match.

Agreed, if it wasn't clear. Here we see the inconsistency of the
current implementation.

First we should match a single char, \3 would be '' in both cases,
then rest of the match should fail

Not quite. I believe \3 would fail outright,

Yep, you are correct. I was mistaken there.

because undefined capture
variables behave differently from variables that matched ''. References
to undefined capture variables always fail. This should be easy to
verify with a test​:

say '' =~ /\1()/ ? 'true' : 'false'
=> false
say '' =~ /()\1/ ? 'true' : 'false'
=> true

then we would match nothing, and go on to match similar to above,
with \2 and \3 == 'a', and \1 == undef/''.

See above.

Well, i think we are on the same page. :-)

I think I have a patch for this, let me see how it goes.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Feb 24, 2014

From @demerphq

On 23 February 2014 17​:05, demerphq <demerphq@​gmail.com> wrote​:

On 23 February 2014 16​:47, Nathan Trapuzzano <nbtrap@​nbtrap.com> wrote​:

demerphq <demerphq@​gmail.com> writes​:

On 22 February 2014 19​:15, Nathan Trapuzzano <nbtrap@​nbtrap.com> wrote​:

Nathan Trapuzzano <nbtrap@​nbtrap.com> writes​:

But the latter can't be correct, because 'cdcab' is matched within the
'(?1)' call itself and so should have been removed from the capture
group stack upon return therefrom.

Let me give a simpler example of what I mean​:

say 'aaba' =~ /^ (?3)? ((.)) (\2(?1)\2) $/x ? 'true' : 'false'
==> true

See how the second occurrence of '\2' matches 'a' and not 'b'?

I think this should match,

Agreed. I was not clear on that point.

The results from the GOSUB shouldn't be visible in the calling
context.

Yes.

I would expect the (?3)? to first match a single character, and then
match nothing, at which point the total pattern would match with
\1='a' and \2='a'.

($1==a and $2==a) should become true at pos 1, after the initial 'a' is
matched by top-leve '((.))'. (I think we agree).

Me too. :-)

The second call to (?1) would match 'b', but it wouldnt change its
previous setting of 'a'.

Yes, sine the capture group bindings are dynamic.,

Now this bug becomes evident when we switch the ordering​:

say 'aaba' =~ /^ (\3(?2)\3)? ((.)) (?1) $/x ? 'true' : 'false'
==> false

In stable Perl, this fails because '\3' isn't saved when we enter the
sub call. But a build of your branch fails here because the second '\3'
tries to match 'b' where it ought to try to match 'a'.

Hmm. I think this should also match.

Agreed, if it wasn't clear. Here we see the inconsistency of the
current implementation.

First we should match a single char, \3 would be '' in both cases,
then rest of the match should fail

Not quite. I believe \3 would fail outright,

Yep, you are correct. I was mistaken there.

because undefined capture
variables behave differently from variables that matched ''. References
to undefined capture variables always fail. This should be easy to
verify with a test​:

say '' =~ /\1()/ ? 'true' : 'false'
=> false
say '' =~ /()\1/ ? 'true' : 'false'
=> true

then we would match nothing, and go on to match similar to above,
with \2 and \3 == 'a', and \1 == undef/''.

See above.

Well, i think we are on the same page. :-)

I think I have a patch for this, let me see how it goes.

I pushed the following​:

d1b2014 fix RT #121299 - Inconsistent
behavior with backreferences nested inside subpattern reference

Which I think fixes the issues you have raised. Please verify with the
latest Perl.

It is a rework of my earlier patch to take aboard Dave's comments in
the other thread. I think it makes a lot more sense now.

Also, in order to make it easier to debug what REF was doing during a
'debug' match, I added support for showing the string it matches​:

2395827 Improve how regprop dumps
REF-like nodes during execution

Assuming this fixes the issues you noted, I think this ticket can be closed.

cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Feb 24, 2014

@p5pRT
Copy link
Author

p5pRT commented Feb 24, 2014

From nbtrap@nbtrap.com

"Zsban Ambrus via RT" <perlbug-followup@​perl.org> writes​:

See also the related ticket https://rt-archive.perl.org/perl5/Ticket/Display.html?id=76622 .

Interesting. So this issue was (kind of) raised before but dismissed
because Perl was behaving as documented. The citation from perlre is no
longer in the docs, for what it's worth.

I guess it just raises the issue again​: is this really a bug? Perhaps
it's not, but I still think the behavior should change according to how
we've discussed it. At the very least, some of the examples I provided
manifested inconsistent behavior.

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2014

From nbtrap@nbtrap.com

demerphq <demerphq@​gmail.com> writes​:

Also, in order to make it easier to debug what REF was doing during a
'debug' match, I added support for showing the string it matches​:

That's helpful, thanks.

Assuming this fixes the issues you noted, I think this ticket can be closed.

I found a couple of problems so far. Most immediately​:

  say 'match' if "abcdefghijk\12S" =~ /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\12\123/

causes Perl to lock up. It doesn't dump core, but it doesn't do
anything else either.

But relating to this issue specifically, the engine still behaves
incorrectly with your patch in at least one respect. (Though this is
more debatable than previous examples I brought up.) Consider​:

  say 'match' if 'aaba' =~ /^ ((?​:(a)|(b))\2) (?1) $/x
  ==> match

This matches in stable perl, and it still matches with your patch. I
don't think it should.

It comes down to the issue of "when do the capture variables receive new
bindings?", which I tried to bring up in another post. Imagine you
wrote that regex without the '(?1)' at the end. In that case, the '\2'
means "did I *just* match 'a', and am I now looking at another 'a'?".
But when you add the '(?1)', the '\2' takes on (under the current
implementation) a different meaning depending on whether we're in the
subcall. When we're in the subcall, it means "have I captured 'a' in $2
anywhere in the dynamic subcall stack?"

This is problematic. The semantics of backreferences referring to local
capture groups should be knowable without having recourse to non-local
parts. In other words, if I'm building the regex piecemeal, capture
group by capture group, the meaning of references (whether subcalls or
backreferences) to other parts of the same piece should be determinate
without having to look at any of the other pieces. With respect to the
example above, the '\2' should always match 'a' or undef (fail)
depending on whether we matched 'a' _just now_.

It may help illustrate what I have in mind if I explain how I
implemented subpattern references in Lisp. At compile time, when we see
a capture group, we keep track of which other capture groups are
lexically contained within it. Then, during the matching phase, when we
enter a subcall to, say, (?5), we immediately create new dynamic
bindings for $5 and for the capture variables of the groups contained
within group 5 (and only for those capture variables)--to put it another
way, we make new bindings for the _local_ capture groups only. On
return from the subcall, the new bindings are destroyed, and the capture
variables assume the values they had immediately prior to the subcall.
So for instance, in the above example, $1, $2, and $3 are all
dynamically bound to undef upon entrance into the '(?1)' subcall, since
those three groups are local to group 1. That regex should match no
string except 'aaaa'.

So far as I know, this is a concise and full definition of subpattern
calls, which Perl is lacking at the moment. For what it's worth, it's
also the only good definition I've come across as I worked to implement
this feature over the last couple of months.

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2014

From nbtrap@nbtrap.com

Nathan Trapuzzano <nbtrap@​nbtrap.com> writes​:

Then, during the matching phase, when we enter a subcall to, say,
(?5), we immediately create new dynamic bindings for $5 and for the
capture variables of the groups contained within group 5

Yves, please take note that I provided you with an incorrect example
earlier, namely​:

print 'aba' =~ /^(.\1?)(?1)$/ ? "true\n" : "false\n"
=> false

Disregard what I said earlier about this one. I believe it should fail.
The '\1' should fail unconditionally, since $1 should get a new binding
upon entrance to '(?1)'.

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2014

From nbtrap@nbtrap.com

Nathan Trapuzzano <nbtrap@​nbtrap.com> writes​:

It may help illustrate what I have in mind if I explain how I
implemented subpattern references in Lisp...

Just realized I've been describing the "referential tranparency" of
local references​:

http​://en.wikipedia.org/wiki/Referential_transparency_%28computer_science%29

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2014

From @demerphq

On 25 February 2014 01​:54, Nathan Trapuzzano <nbtrap@​nbtrap.com> wrote​:

demerphq <demerphq@​gmail.com> writes​:

Also, in order to make it easier to debug what REF was doing during a
'debug' match, I added support for showing the string it matches​:

That's helpful, thanks.

Assuming this fixes the issues you noted, I think this ticket can be closed.

I found a couple of problems so far. Most immediately​:

say 'match' if "abcdefghijk\12S" =~ /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\12\123/

causes Perl to lock up. It doesn't dump core, but it doesn't do
anything else either.

Well, that isn't related to my patch sequence. And is a Perl 5.20 showstopper.

The rest of your comments l will investigate later.

cheers,
Yves

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2014

From @demerphq

On 25 February 2014 01​:54, Nathan Trapuzzano <nbtrap@​nbtrap.com> wrote​:

demerphq <demerphq@​gmail.com> writes​:

Also, in order to make it easier to debug what REF was doing during a
'debug' match, I added support for showing the string it matches​:

That's helpful, thanks.

Assuming this fixes the issues you noted, I think this ticket can be closed.

I found a couple of problems so far. Most immediately​:

say 'match' if "abcdefghijk\12S" =~ /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\12\123/

causes Perl to lock up. It doesn't dump core, but it doesn't do
anything else either.

Thanks, this is fixed as part of RT #121321 by commit​:

commit 845ab12
Author​: Yves Orton <demerphq@​gmail.com>
Date​: Tue Feb 25 12​:07​:18 2014 +0100

  Fix RT #121321 - Fencepost error causes infinite loop in regex compilation

  Due to a fencepost error if a pattern had more than 9 capture buffers
  and after the last capture buffer there was an octal style escape which
  when interpreted as decimal evaluated to one more than the number of
  defined buffers then the regex compiler would go into an infinite loop.

  This fixes the fencepost error, adds tests, and adds some comments to
  explain what is going on.

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2014

From @demerphq

On 25 February 2014 01​:54, Nathan Trapuzzano <nbtrap@​nbtrap.com> wrote​:

demerphq <demerphq@​gmail.com> writes​:
But relating to this issue specifically, the engine still behaves
incorrectly with your patch in at least one respect. (Though this is
more debatable than previous examples I brought up.) Consider​:

say 'match' if 'aaba' =~ /^ ((?​:(a)|(b))\2) (?1) $/x
==> match

This matches in stable perl, and it still matches with your patch. I
don't think it should.

It comes down to the issue of "when do the capture variables receive new
bindings?", which I tried to bring up in another post. Imagine you
wrote that regex without the '(?1)' at the end. In that case, the '\2'
means "did I *just* match 'a', and am I now looking at another 'a'?".
But when you add the '(?1)', the '\2' takes on (under the current
implementation) a different meaning depending on whether we're in the
subcall. When we're in the subcall, it means "have I captured 'a' in $2
anywhere in the dynamic subcall stack?"

This is problematic. The semantics of backreferences referring to local
capture groups should be knowable without having recourse to non-local
parts. In other words, if I'm building the regex piecemeal, capture
group by capture group, the meaning of references (whether subcalls or
backreferences) to other parts of the same piece should be determinate
without having to look at any of the other pieces. With respect to the
example above, the '\2' should always match 'a' or undef (fail)
depending on whether we matched 'a' _just now_.

It may help illustrate what I have in mind if I explain how I
implemented subpattern references in Lisp. At compile time, when we see
a capture group, we keep track of which other capture groups are
lexically contained within it. Then, during the matching phase, when we
enter a subcall to, say, (?5), we immediately create new dynamic
bindings for $5 and for the capture variables of the groups contained
within group 5 (and only for those capture variables)--to put it another
way, we make new bindings for the _local_ capture groups only. On
return from the subcall, the new bindings are destroyed, and the capture
variables assume the values they had immediately prior to the subcall.
So for instance, in the above example, $1, $2, and $3 are all
dynamically bound to undef upon entrance into the '(?1)' subcall, since
those three groups are local to group 1. That regex should match no
string except 'aaaa'.

So far as I know, this is a concise and full definition of subpattern
calls, which Perl is lacking at the moment. For what it's worth, it's
also the only good definition I've come across as I worked to implement
this feature over the last couple of months.

I need to think about this a bit. My first take is that I expect your
example to match as it does. Since the (a) is matched in the first
pass, and we do not execute the close paren (because the 'a' doesnt
match) the second time, what $1 captured first remains the same.

I need to think about that. One could make the argument that after
unsuccessfully matching the 'a' that the capture buffer for 'a' in
that context should be undefined.

However this gets onto poorly defined ground as it conflicts with the
behavior of (a)*. Consider the match

"aaaa" =~ /(a)*/

we expect $1 to contain the last successful 'a'. However this would
conflict with the case you describe. (I think. :-).

It is very possible that Perl has done a poor job at *defining* the
expected semantics and then implementing them as defined, and has
instead instead of relied on specifying the expected semantics based
on the current implementation. Your efforts at reimplementation are of
course bringing all these issues to the fore as would be expected from
such a situation. I tend towards thinking your expectations are right
for this reason alone, but we need more thinking time for them.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2014

From @demerphq

On 25 February 2014 02​:09, Nathan Trapuzzano <nbtrap@​nbtrap.com> wrote​:

Nathan Trapuzzano <nbtrap@​nbtrap.com> writes​:

Then, during the matching phase, when we enter a subcall to, say,
(?5), we immediately create new dynamic bindings for $5 and for the
capture variables of the groups contained within group 5

Yves, please take note that I provided you with an incorrect example
earlier, namely​:

print 'aba' =~ /^(.\1?)(?1)$/ ? "true\n" : "false\n"
=> false

Disregard what I said earlier about this one. I believe it should fail.
The '\1' should fail unconditionally, since $1 should get a new binding
upon entrance to '(?1)'.

I am not sure about this. You could argue it doesn't get a new binding
until it hits the close paren.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2014

From @druud62

On 2014-02-25 12​:37, demerphq wrote​:

However this gets onto poorly defined ground as it conflicts with the
behavior of (a)*. Consider the match

"aaaa" =~ /(a)*/

we expect $1 to contain the last successful 'a'. However this would
conflict with the case you describe. (I think. :-).

Illustration​:

perl -wE'
  say for "aA" =~ /(a)*/i;
'
A

--
Ruud

@p5pRT
Copy link
Author

p5pRT commented Feb 26, 2014

From nbtrap@nbtrap.com

demerphq <demerphq@​gmail.com> writes​:

I need to think about that. One could make the argument that after
unsuccessfully matching the 'a' that the capture buffer for 'a' in
that context should be undefined.

What I'm arguing is for is something slightly stronger​: the 'a' capture
variable and other local capture group variables (but only the local
ones) would get new bindings to undef upon entrance to the subcall.

However this gets onto poorly defined ground as it conflicts with the
behavior of (a)*. Consider the match

"aaaa" =~ /(a)*/

we expect $1 to contain the last successful 'a'. However this would
conflict with the case you describe. (I think. :-).

I don't think that follows. If it did follow, I agree it would be
incorrect.

Look at it like this​: if subpattern calls are like function calls, then
quantifiers are like interations. When you have an assignment in a
loop, the value of the variable assigned to when you come out of the
loop is the value of the last assignment. I think you've mistakenly
taken the repetition to be like a recursive subroutine call, rather than
a loop/goto.

It is very possible that Perl has done a poor job at *defining* the
expected semantics and then implementing them as defined, and has
instead instead of relied on specifying the expected semantics based
on the current implementation. Your efforts at reimplementation are of
course bringing all these issues to the fore as would be expected from
such a situation. I tend towards thinking your expectations are right
for this reason alone, but we need more thinking time for them.

At this point, the precise aim of this ticket has become obscured, since
the correct behavior isn't agreed on fully. What's the appropriate
thing to do next (aside from me finishing testing your patch)?

@p5pRT
Copy link
Author

p5pRT commented Feb 26, 2014

From nbtrap@nbtrap.com

demerphq <demerphq@​gmail.com> writes​:

On 25 February 2014 02​:09, Nathan Trapuzzano <nbtrap@​nbtrap.com> wrote​:

Nathan Trapuzzano <nbtrap@​nbtrap.com> writes​:

Then, during the matching phase, when we enter a subcall to, say,
(?5), we immediately create new dynamic bindings for $5 and for the
capture variables of the groups contained within group 5

Yves, please take note that I provided you with an incorrect example
earlier, namely​:

print 'aba' =~ /^(.\1?)(?1)$/ ? "true\n" : "false\n"
=> false

Disregard what I said earlier about this one. I believe it should fail.
The '\1' should fail unconditionally, since $1 should get a new binding
upon entrance to '(?1)'.

I am not sure about this. You could argue it doesn't get a new binding
until it hits the close paren.

You could argue it doesn't get its new _value_ until it hits close paren
(which it doesn't), but I don't think you could argue that about the
bindings.

There may be a language barrier here, since Lisp and Perl have different
vocabularies to some extent. Here is an illustration of the words as
I've been using them.

  use strict;

  print $x; # compile-time error​: $x is _unbound_

  my $x; # $x gets a binding, but is still undefined
  $x = 5; # $x is bound to (or defined as) 5

  {
  my $x; # $x gets a new binding, which shadows the old binding; $x
  # is undefined

  print $x; # prints nothing since $x is undef

  } # newer binding destroyed, old binding unshadowed; $x is 5 again

If $1 didn't get its bindings until close paren of its group, then
references to '\1' before the end of the group would be errors.

@p5pRT
Copy link
Author

p5pRT commented Feb 26, 2014

From @iabyn

On Tue, Feb 25, 2014 at 08​:13​:49PM -0500, Nathan Trapuzzano wrote​:

You could argue it doesn't get its new _value_ until it hits close paren
(which it doesn't), but I don't think you could argue that about the
bindings.

There may be a language barrier here, since Lisp and Perl have different
vocabularies to some extent. Here is an illustration of the words as
I've been using them.

use strict;

print $x; # compile-time error​: $x is _unbound_

my $x; # $x gets a binding, but is still undefined
$x = 5; # $x is bound to (or defined as) 5

{
my $x; # $x gets a new binding, which shadows the old binding; $x
# is undefined

print $x;  \# prints nothing since $x is undef

} # newer binding destroyed, old binding unshadowed; $x is 5 again

If $1 didn't get its bindings until close paren of its group, then
references to '\1' before the end of the group would be errors.

Just as a data point, sometimes while executing a regex in perl, the value of
$1 can still be defined, and equal to its previous value, *after* the
'(' has been processed, but before ')'. e.g.

  "abcd" =~ /( . (?{ print "[$1]\n" }) )+/x or die;

which outputs​:

[]
[a]
[b]
[c]

So this​:

  "aababcabcdabcde" =~ /((?​:\b|\1)\w)+/ or die;
  print "[$1]\n";

matches and prints "[abcde]"​:

--
Dave's first rule of Opera​:
If something needs saying, say it​: don't warble it.

@p5pRT
Copy link
Author

p5pRT commented Feb 26, 2014

From nbtrap@nbtrap.com

Dave Mitchell <davem@​iabyn.com> writes​:

Just as a data point, sometimes while executing a regex in perl, the value of
$1 can still be defined, and equal to its previous value, *after* the
'(' has been processed, but before ')'. e.g.

"abcd" =~ /\( \. \(?\{ print "\[$1\]\\n" \}\) \)\+/x or die;

which outputs​:

[]
[a]
[b]
[c]

So this​:

"aababcabcdabcde" =~ /\(\(?&#8203;:\\b|\\1\)\\w\)\+/ or die;
print "\[$1\]\\n";

matches and prints "[abcde]"​:

Interesting point. For the quantifier/iteration analogy to hold, this
makes sense. However, to be clear, I don't think it bears upon what
I've said about backreferences vis-a-vis subpattern calls. The two
issues can be defined independently.

@p5pRT
Copy link
Author

p5pRT commented Feb 26, 2014

From @demerphq

On 26 February 2014 02​:13, Nathan Trapuzzano <nbtrap@​nbtrap.com> wrote​:

demerphq <demerphq@​gmail.com> writes​:

On 25 February 2014 02​:09, Nathan Trapuzzano <nbtrap@​nbtrap.com> wrote​:

Nathan Trapuzzano <nbtrap@​nbtrap.com> writes​:

Then, during the matching phase, when we enter a subcall to, say,
(?5), we immediately create new dynamic bindings for $5 and for the
capture variables of the groups contained within group 5

Yves, please take note that I provided you with an incorrect example
earlier, namely​:

print 'aba' =~ /^(.\1?)(?1)$/ ? "true\n" : "false\n"
=> false

Disregard what I said earlier about this one. I believe it should fail.
The '\1' should fail unconditionally, since $1 should get a new binding
upon entrance to '(?1)'.

I am not sure about this. You could argue it doesn't get a new binding
until it hits the close paren.

You could argue it doesn't get its new _value_ until it hits close paren
(which it doesn't), but I don't think you could argue that about the
bindings.

There may be a language barrier here, since Lisp and Perl have different
vocabularies to some extent. Here is an illustration of the words as
I've been using them.

use strict;

print $x; # compile-time error​: $x is _unbound_

my $x; # $x gets a binding, but is still undefined
$x = 5; # $x is bound to (or defined as) 5

{
my $x; # $x gets a new binding, which shadows the old binding; $x
# is undefined

print $x;  \# prints nothing since $x is undef

} # newer binding destroyed, old binding unshadowed; $x is 5 again

If $1 didn't get its bindings until close paren of its group, then
references to '\1' before the end of the group would be errors.

I have asked Damian Conway to share his opinions on this. I would like
to hear his thoughts about your argument before I proceed with any
futher changes.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@demerphq
Copy link
Collaborator

demerphq commented Jan 7, 2023

We have cases like this

"aba"=~/^([ba]\1?)+$/ and print "$1=$&"; # outputs "ba=aba"

where we expect that on the second iteration of the quantifier that \1 refers to the value of the capture buffer from the previous iteration. This also applies to GOSUB. So when you jump into a paren group it does not reset the capture buffer it is associated with.

We also have cases like this:

"acbcdbc"=~/^(([ab]\3?)([cd]\2?))+$/ and print "$1-$2-$3=$&";  # outputs bcdbc-bc-dbc=acbcdbc

That is we have forward references. It doesn't make sense until you think of loops.

We did have inconsistencies in our back reference logic, but i think we have fixed everything from this ticket.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants