Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recursive back references do not always work #10073

Closed
p5pRT opened this issue Jan 11, 2010 · 10 comments · Fixed by #20918 · May be fixed by #20677
Closed

Recursive back references do not always work #10073

p5pRT opened this issue Jan 11, 2010 · 10 comments · Fixed by #20918 · May be fixed by #20677

Comments

@p5pRT
Copy link

p5pRT commented Jan 11, 2010

Migrated from rt.perl.org#72020 (status was 'open')

Searchable as RT72020$

@p5pRT
Copy link
Author

p5pRT commented Jan 11, 2010

From ph10@hermes.cam.ac.uk

Consider the pattern ^(xa|=?\1a){2}$ when matched against the string "xa=xaaa".
The first time the group matches "xa", so $1 is set to that value, and the
remaining subject string is "=xaaa". The next time round the loop, the first
alternative fails, but =?\1a matches the next three characters "=xaa", so the
group ends, $1 is set to "=xaa", but there is still one character "a" in the
subject string. Now we get to $ and that fails because we are not at the end of
the string.

Perl (apparently) backtracks into the second iteration of the group, to the
position where "=" is optional, and tries without matching the "=" character.
However, it does NOT (it seems) reset the value of $1, so it still set to
"=xaa". This does match the next three characters in the string, and then the
final "a" matches, and we are now at the end of the string. I have tested this
with Perl 5.8.8, 5.10.1 and Perl 5.11.3.

So, the whole match succeeds with $1 set to "=xaaa". However, it should fail.

One way of thinking about this is to consider the pattern without the question
mark. ^(xa|=\1a){2}$ does fail when matched against "xa=xaaa". Since the =
character is actually present in the subject, making it optional should not
affect the outcome.

I discovered this bug because somebody reported exactly the same effect in
PCRE. It seems to have been there for a very long time. In PCRE, the only way I
have been able to fix the bug is to force any group that contains a reference
to itself to be an atomic group. PCRE has treated recursive and "subroutine"
group calls as atomic for some time, because of a similar issue - the inability
to exactly replicate the environment when backtracking into them.

Philip

--
Philip Hazel

@p5pRT
Copy link
Author

p5pRT commented Feb 24, 2013

From @jkeenan

On Mon Jan 11 07​:56​:52 2010, ph10@​hermes.cam.ac.uk wrote​:

Consider the pattern ^(xa|=?\1a){2}$ when matched against the string
"xa=xaaa".
The first time the group matches "xa", so $1 is set to that value, and
the
remaining subject string is "=xaaa". The next time round the loop, the
first
alternative fails, but =?\1a matches the next three characters "=xaa",
so the
group ends, $1 is set to "=xaa", but there is still one character "a"
in the
subject string. Now we get to $ and that fails because we are not at
the end of
the string.

Perl (apparently) backtracks into the second iteration of the group,
to the
position where "=" is optional, and tries without matching the "="
character.
However, it does NOT (it seems) reset the value of $1, so it still set
to
"=xaa". This does match the next three characters in the string, and
then the
final "a" matches, and we are now at the end of the string. I have
tested this
with Perl 5.8.8, 5.10.1 and Perl 5.11.3.

So, the whole match succeeds with $1 set to "=xaaa". However, it
should fail.

One way of thinking about this is to consider the pattern without the
question
mark. ^(xa|=\1a){2}$ does fail when matched against "xa=xaaa". Since
the =
character is actually present in the subject, making it optional
should not
affect the outcome.

I discovered this bug because somebody reported exactly the same
effect in
PCRE. It seems to have been there for a very long time. In PCRE, the
only way I
have been able to fix the bug is to force any group that contains a
reference
to itself to be an atomic group. PCRE has treated recursive and
"subroutine"
group calls as atomic for some time, because of a similar issue - the
inability
to exactly replicate the environment when backtracking into them.

Philip

1. Do you get the same problems if you use the '\g1' or '\g{1}' notation
introduced in Perl 5.10 (see 'perlre')?

2. Have we ever claimed (in our documentation) that "recursive
backreferences" will, in some sense, "work"? (I.e., are we dealing with
a bug or a feature request?)

Thank you very much.
Jim Keenan

@p5pRT
Copy link
Author

p5pRT commented Feb 24, 2013

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Feb 24, 2013

From @hvds

"James E Keenan via RT" <perlbug-followup@​perl.org> wrote​:
:On Mon Jan 11 07​:56​:52 2010, ph10@​hermes.cam.ac.uk wrote​:
:> Consider the pattern ^(xa|=?\1a){2}$ when matched against the string
:> "xa=xaaa".
[...]
:2. Have we ever claimed (in our documentation) that "recursive
:backreferences" will, in some sense, "work"? (I.e., are we dealing with
:a bug or a feature request?)

If the string were "xa=xaa" then it would make sense to look to the docs
to decide if we were talking about a bug or a feature request; but given
the string is "xa=xaaa" I'd say it's a clear bug.

% perl -wle 'print "not ok" if "xa=xaaa" =~ /^(xa|=?\1a){2}$/'
not ok
%

Hugo

@p5pRT
Copy link
Author

p5pRT commented Feb 24, 2013

From @druud62

On 2013-02-24 08​:37, hv@​crypt.org wrote​:

"James E Keenan via RT" <perlbug-followup@​perl.org> wrote​:
:On Mon Jan 11 07​:56​:52 2010, ph10@​hermes.cam.ac.uk wrote​:

:> Consider the pattern ^(xa|=?\1a){2}$ when matched against the string
:> "xa=xaaa".
[...]
:2. Have we ever claimed (in our documentation) that "recursive
:backreferences" will, in some sense, "work"? (I.e., are we dealing with
:a bug or a feature request?)

If the string were "xa=xaa" then it would make sense to look to the docs
to decide if we were talking about a bug or a feature request; but given
the string is "xa=xaaa" I'd say it's a clear bug.

% perl -wle 'print "not ok" if "xa=xaaa" =~ /^(xa|=?\1a){2}$/'
not ok
%

I think this should compile \1 as \x{01},
because there isn't a first caption yet.

--
Ruud

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2013

From ph10@hermes.cam.ac.uk

On Sat, 23 Feb 2013, James E Keenan via RT wrote​:

1. Do you get the same problems if you use the '\g1' or '\g{1}' notation
introduced in Perl 5.10 (see 'perlre')?

Yes in both cases, with Perl 5.016002. Exactly the same behaviour in all
three cases.

2. Have we ever claimed (in our documentation) that "recursive
backreferences" will, in some sense, "work"? (I.e., are we dealing with
a bug or a feature request?)

I have no idea on your documentation question, I'm afraid.

Regards,
Philip

--
Philip Hazel

@p5pRT
Copy link
Author

p5pRT commented Feb 25, 2013

From @demerphq

On 24 February 2013 08​:37, <hv@​crypt.org> wrote​:

"James E Keenan via RT" <perlbug-followup@​perl.org> wrote​:
:On Mon Jan 11 07​:56​:52 2010, ph10@​hermes.cam.ac.uk wrote​:
:> Consider the pattern ^(xa|=?\1a){2}$ when matched against the string
:> "xa=xaaa".
[...]
:2. Have we ever claimed (in our documentation) that "recursive
:backreferences" will, in some sense, "work"? (I.e., are we dealing with
:a bug or a feature request?)

If the string were "xa=xaa" then it would make sense to look to the docs
to decide if we were talking about a bug or a feature request; but given
the string is "xa=xaaa" I'd say it's a clear bug.

% perl -wle 'print "not ok" if "xa=xaaa" =~ /^(xa|=?\1a){2}$/'
not ok
%

Agreed. I consider this a bug, and time permitting will look into it.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@khwilliamson
Copy link
Contributor

@demerphq ping

@demerphq
Copy link
Collaborator

This seems like the bug about not resetting capture buffers after a failed alternation: #16616

@demerphq
Copy link
Collaborator

I just checked into this again, what is happening is that it is not restoring the value of the capture buffer as it renters the while loop after a failed tail (B pattern). I will dig more. This might need some guidance from @iabyn

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants