Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regexp bugs. #10011

Closed
p5pRT opened this issue Dec 8, 2009 · 13 comments
Closed

Regexp bugs. #10011

p5pRT opened this issue Dec 8, 2009 · 13 comments

Comments

@p5pRT
Copy link

p5pRT commented Dec 8, 2009

Migrated from rt.perl.org#71136 (status was 'open')

Searchable as RT71136$

@p5pRT
Copy link
Author

p5pRT commented Dec 8, 2009

From @Abigail

This is a bug report for perl from abigail@​abigail.be,
generated with the help of perlbug 1.39 running under perl 5.10.1.


[Please describe your issue here]

While trying to work around bug #59792, I found some different problems​:

  "a" =~ m {(?|(?<A>a)|(?<B>b))} and say $+ {B};

The code about prints 'a'. This shouldn't happen, as the (?<B>b) alternation
isn't taken.

  "a b" =~ m {(?|(?​:(?<A>(?<B>a)(?<C>[^ab]*)(?<D>b))) |
  (?​:(?<B>c)(?<C>[^cd]*)(?<D>d)))}x;

Afterwards, %- is equal to​:

  (A => ["a b"],
  B => ["a", "a b"],
  C => [" ", "a"],
  D => ["b", " "])

Instead of the expected​:

  (A => ["a b"],
  B => ["a"],
  C => [" "],
  D => ["b"])

Replacing (?|) with (?​:) solves it partially (at the expense of having
many undefined number variables), but there are still oddities. For
instance, 'keys %+' and 'keys %-' don't always return the same set of
keys​:

  "0" =~ m {(?​:(?<A>0)|(?<B>1))};
  say for keys %+; # Prints 'A'
  say for keys %-; # Prints 'A', 'B'

All these issues are still present in 5.11.2.

@p5pRT
Copy link
Author

p5pRT commented Dec 9, 2009

From ben@morrow.me.uk

Quoth perl5-porters@​perl.org​:

Replacing (?|) with (?​:) solves it partially (at the expense of having
many undefined number variables), but there are still oddities. For
instance, 'keys %+' and 'keys %-' don't always return the same set of
keys​:

"0" =~ m \{\(?&#8203;:\(?\<A>0\)|\(?\<B>1\)\)\};
say for keys %\+;    \# Prints 'A'
say for keys %\-;    \# Prints 'A'\, 'B'

I can't speak to the rest, but that at least is as documented. 'keys %+'
gives the names that actually matched, whereas 'keys %-' gives all the
names that were present in the pattern.

Ben

@p5pRT
Copy link
Author

p5pRT commented Dec 9, 2009

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Dec 9, 2009

From @davidnicol

perldoc perlre​:
(?<NAME>pattern)
A named capture buffer. Identical in every respect to normal capturing
parentheses () but for the additional fact that %+ or %- may be used
after a successful match to refer to a named buffer.

Inside (?|pattern) the "branch reset" pattern, the capture buffers are
numbered from the same starting point in each alternation branch.

Apparently the branch reset pattern is implemented by aliasing the
capture buffers in the alternate branches to each other, and this
aliasing also happens with the named capture buffers, causing buffers
which would be separate inside (?​:pat) to be aliases for the same
buffer inside (?|pat).

possible paths forward​:

1​: document that using named capture buffers within the branch reset
pattern is inadvisable, possibly including issuing an optional runtime
warning

2​: modify the branch reset pattern to respect the uniqueness of named
capture buffers, by introducing yet another additional level of
indirection, deferring the assignment of captured values into named
and numbered buffers until afterwards, in a new "renumbering step",
instead of aliasing the buffers in advance (which must be what
currently happens.)

3​: make use of a named capture buffer within a branch reset pattern a
fatal compile-time error

4​: ?

--
intake, compression, power, exhaust, repeat.

@p5pRT
Copy link
Author

p5pRT commented Dec 9, 2009

From @Abigail

On Wed, Dec 09, 2009 at 12​:01​:41PM -0600, David Nicol wrote​:

perldoc perlre​:
(?<NAME>pattern)
A named capture buffer. Identical in every respect to normal capturing
parentheses () but for the additional fact that %+ or %- may be used
after a successful match to refer to a named buffer.

Inside (?|pattern) the "branch reset" pattern, the capture buffers are
numbered from the same starting point in each alternation branch.

Apparently the branch reset pattern is implemented by aliasing the
capture buffers in the alternate branches to each other, and this
aliasing also happens with the named capture buffers, causing buffers
which would be separate inside (?​:pat) to be aliases for the same
buffer inside (?|pat).

possible paths forward​:

1​: document that using named capture buffers within the branch reset
pattern is inadvisable, possibly including issuing an optional runtime
warning

2​: modify the branch reset pattern to respect the uniqueness of named
capture buffers, by introducing yet another additional level of
indirection, deferring the assignment of captured values into named
and numbered buffers until afterwards, in a new "renumbering step",
instead of aliasing the buffers in advance (which must be what
currently happens.)

3​: make use of a named capture buffer within a branch reset pattern a
fatal compile-time error

4​: ?

I would hate for 1) or 3) to happen.

The regexp I had that made this problem surface had the form​:

  /(?| (?<Open>PAT1)(?<Body>PAT2) (?<Close>PAT3)
  |(?<Outer>(?<Open>PAT3)(?<Body>PAT4(?&Outer)?)(?<Close>PAT6)))/x

If I make both alternations have the same number and order of named
captures, it works as expected (your explaination made me find the
work around)​:

  /(?|(?<Outer> (?<Open>PAT3)(?<Body>PAT4(?&Outer)?)(?<Close>PAT6)))
  |(?<Outer>(*FAIL))?(?<Open>PAT1)(?<Body>PAT2) (?<Close>PAT3)/x

Not optimal (note the necessary switch of the alternations), but now
at least I get a usuable %-.

Abigail

@p5pRT
Copy link
Author

p5pRT commented Dec 10, 2009

From @demerphq

2009/12/9 David Nicol <davidnicol@​gmail.com>​:

perldoc perlre​:
(?<NAME>pattern)
A named capture buffer. Identical in every respect to normal capturing
parentheses () but for the additional fact that %+ or %- may be used
after a successful match to refer to a named buffer.

Inside (?|pattern) the "branch reset" pattern, the capture buffers are
numbered from the same starting point in each alternation branch.

Apparently the branch reset pattern is implemented by aliasing the
capture buffers in the alternate branches to each other, and this
aliasing also happens with the named capture buffers, causing buffers
which would be separate inside (?​:pat) to be aliases for the same
buffer inside (?|pat).

possible paths forward​:

1​: document that using named capture buffers within the branch reset
pattern is inadvisable, possibly including issuing an optional runtime
warning

no

2​: modify the branch reset pattern to respect the uniqueness of named
capture buffers, by introducing yet another additional level of
indirection, deferring the assignment of captured values into named
and numbered buffers until afterwards, in a new "renumbering step",
instead of aliasing the buffers in advance (which must be what
currently happens.)

Well there is the rub​: we dont "assign" capture capture buffers into
anything at all.

We maintain two arrays integer offsets into the strings with n+1 items
in each array, with the n being the number of capture buffers in a
pattern. These are exposed to perl as​: @​- and @​+.

Each OPEN regop and CLOSE regop is hard bound to a particular index in
these arrays and when the regop is executed the "current position" is
written into the hard bound slot.

The named capture buffer hack basically just adds a HoA to this, where
the keys in the hash are the names and the array contains the index
into the @​-/@​+ arrays that that name is associated to. This structure
facilitates mapping one name to multiple buffers, but not the other
way round.

When you access $1 you are using a tied interface that uses the
original string, and the contents of @​- and @​+ to derive *on fetch*
the string you speak of being "assigned".

The (?|...) hack merely plays around with the internal buffer counter
so that the OPEN and CLOSE regop's are allowed to bind to previously
assigned "index" in the @​- and @​+ arrays. There really was no
consideration of how this impacted on named capture buffers as the
original conception was that people would use one or the other as they
have overlapping domains. I mean you dont really need to use the
branch reset pattern if you are using named capture buffers.

Anyway the point here is that untangling or changing ANY of these is
non-trivial.

3​: make use of a named capture buffer within a branch reset pattern a
fatal compile-time error

no.

Number 4 could just be "document the weird way it works, and offer
some workaround patterns for those that Really Want To Go There".
Abigail would be my choice to write this.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Dec 10, 2009

From @Abigail

On Thu, Dec 10, 2009 at 11​:47​:18AM +0100, demerphq wrote​:

2009/12/9 David Nicol <davidnicol@​gmail.com>​:

perldoc perlre​:
(?<NAME>pattern)
A named capture buffer. Identical in every respect to normal capturing
parentheses () but for the additional fact that %+ or %- may be used
after a successful match to refer to a named buffer.

Inside (?|pattern) the "branch reset" pattern, the capture buffers are
numbered from the same starting point in each alternation branch.

Apparently the branch reset pattern is implemented by aliasing the
capture buffers in the alternate branches to each other, and this
aliasing also happens with the named capture buffers, causing buffers
which would be separate inside (?​:pat) to be aliases for the same
buffer inside (?|pat).

possible paths forward​:

1​: document that using named capture buffers within the branch reset
pattern is inadvisable, possibly including issuing an optional runtime
warning

no

2​: modify the branch reset pattern to respect the uniqueness of named
capture buffers, by introducing yet another additional level of
indirection, deferring the assignment of captured values into named
and numbered buffers until afterwards, in a new "renumbering step",
instead of aliasing the buffers in advance (which must be what
currently happens.)

Well there is the rub​: we dont "assign" capture capture buffers into
anything at all.

We maintain two arrays integer offsets into the strings with n+1 items
in each array, with the n being the number of capture buffers in a
pattern. These are exposed to perl as​: @​- and @​+.

Each OPEN regop and CLOSE regop is hard bound to a particular index in
these arrays and when the regop is executed the "current position" is
written into the hard bound slot.

The named capture buffer hack basically just adds a HoA to this, where
the keys in the hash are the names and the array contains the index
into the @​-/@​+ arrays that that name is associated to. This structure
facilitates mapping one name to multiple buffers, but not the other
way round.

When you access $1 you are using a tied interface that uses the
original string, and the contents of @​- and @​+ to derive *on fetch*
the string you speak of being "assigned".

The (?|...) hack merely plays around with the internal buffer counter
so that the OPEN and CLOSE regop's are allowed to bind to previously
assigned "index" in the @​- and @​+ arrays. There really was no
consideration of how this impacted on named capture buffers as the
original conception was that people would use one or the other as they
have overlapping domains. I mean you dont really need to use the
branch reset pattern if you are using named capture buffers.

I disagree. Consider (sub) patterns (you don't contol the code that generates
these pattern - they could come from Regexp​::Common510 for instance)​:

  my $pat1 = "(?​:(?<A>a)1|(?<A>b)2|(?<A>c)3)";
vs
  my $pat2 = "(?|(?<A>a)1|(?<A>b)2|(?<A>c)3)";

and you create larger patterns $PAT1 = "$pat1 $pat1"; $PAT2 = "$pat2 $pat2";

Suppose you use this larger pattern to match something, and you want to see
what's matched by the second set of 'A's. Then the options are​:

  if ($str =~ /$PAT1/) {
  $A_matched = $- {A} [3] // $- {A} [4] // $- {A} [5];
  }

or

  if ($str =~ /$PAT2/) {
  $A_matched = $- {A} [1];
  }

Not only is $pat2 easier to deal with, it's also a lot easier to document.
Also, we have to change the code if $pat1 is changed to
"(?​:(?<A>a)1|(?<A>b)2|(?<A>c)3|(?<A>d)4)", while changing $pat2 to
"(?|(?<A>a)1|(?<A>b)2|(?<A>c)3|(?<A>d)4)" doesn't require a futher
change.

Furthermore, despite the captures being named, you just may want to
capture the return values and process them (for instance, the processing
code might have been written in the 5.8 era).

  my @​a1 = $str =~ /$PAT1/; # 6 elements, only 2 defined.
  my @​a2 = $str =~ /$PAT2/; # 2 elements, both defined.

IMO, (?|) in combination with named captures is still very useful.

Anyway the point here is that untangling or changing ANY of these is
non-trivial.

And hence unlikely (if at all) to happen before 5.12. I'll understand.

3​: make use of a named capture buffer within a branch reset pattern a
fatal compile-time error

no.

Number 4 could just be "document the weird way it works, and offer
some workaround patterns for those that Really Want To Go There".
Abigail would be my choice to write this.

I will accept that challenge.

Abigail

@p5pRT
Copy link
Author

p5pRT commented Dec 10, 2009

From @demerphq

2009/12/10 Abigail <abigail@​abigail.be>​:

On Thu, Dec 10, 2009 at 11​:47​:18AM +0100, demerphq wrote​:

2009/12/9 David Nicol <davidnicol@​gmail.com>​:

perldoc perlre​:
(?<NAME>pattern)
A named capture buffer. Identical in every respect to normal capturing
parentheses () but for the additional fact that %+ or %- may be used
after a successful match to refer to a named buffer.

Inside (?|pattern) the "branch reset" pattern, the capture buffers are
numbered from the same starting point in each alternation branch.

Apparently the branch reset pattern is implemented by aliasing the
capture buffers in the alternate branches to each other, and this
aliasing also happens with the named capture buffers, causing buffers
which would be separate inside (?​:pat) to be aliases for the same
buffer inside (?|pat).

possible paths forward​:

1​: document that using named capture buffers within the branch reset
pattern is inadvisable, possibly including issuing an optional runtime
warning

no

2​: modify the branch reset pattern to respect the uniqueness of named
capture buffers, by introducing yet another additional level of
indirection, deferring the assignment of captured values into named
and numbered buffers until afterwards, in a new "renumbering step",
instead of aliasing the buffers in advance (which must be what
currently happens.)

Well there is the rub​: we dont "assign" capture capture buffers into
anything at all.

We maintain two arrays integer offsets into the strings with n+1 items
in each array, with the n being the number of capture buffers in a
pattern. These are exposed to perl as​: @​- and @​+.

Each OPEN regop and CLOSE regop is hard bound to a particular index in
these arrays and when the regop is executed the "current position" is
written into the hard bound slot.

The named capture buffer hack basically just adds a HoA to this, where
the keys in the hash are the names and the array contains the index
into the @​-/@​+ arrays that that name is associated to. This structure
facilitates mapping one name to multiple buffers, but not the other
way round.

When you access $1 you are using a tied interface that uses the
original string, and the contents of @​- and @​+ to derive *on fetch*
the string you speak of being "assigned".

The (?|...) hack merely plays around with the internal buffer counter
so that the OPEN and CLOSE regop's are allowed to bind to previously
assigned "index" in the @​- and @​+ arrays. There really was no
consideration of how this impacted on named capture buffers as the
original conception was that people would use one or the other as they
have overlapping domains. I mean you dont really need to use the
branch reset pattern if you are using named capture buffers.

I disagree. Consider (sub) patterns (you don't contol the code that generates
these pattern - they could come from Regexp​::Common510 for instance)​:

   my $pat1 = "(?​:(?<A>a)1|(?<A>b)2|(?<A>c)3)";
vs
   my $pat2 = "(?|(?<A>a)1|(?<A>b)2|(?<A>c)3)";

and you create larger patterns $PAT1 = "$pat1 $pat1"; $PAT2 = "$pat2 $pat2";

Suppose you use this larger pattern to match something, and you want to see
what's matched by the second set of 'A's. Then the options are​:

   if ($str =~ /$PAT1/) {
       $A_matched = $- {A} [3] // $- {A} [4] // $- {A} [5];
   }

or

   if ($str =~ /$PAT2/) {
       $A_matched = $- {A} [1];
   }

Not only is $pat2 easier to deal with, it's also a lot easier to document.
Also, we have to change the code if $pat1 is changed to
"(?​:(?<A>a)1|(?<A>b)2|(?<A>c)3|(?<A>d)4)", while changing $pat2 to
"(?|(?<A>a)1|(?<A>b)2|(?<A>c)3|(?<A>d)4)" doesn't require a futher
change.

Furthermore, despite the captures being named, you just may want to
capture the return values and process them (for instance, the processing
code might have been written in the 5.8 era).

   my @​a1 = $str =~ /$PAT1/;   # 6 elements, only 2 defined.
   my @​a2 = $str =~ /$PAT2/;   # 2 elements, both defined.

IMO, (?|) in combination with named captures is still very useful.

Anyway the point here is that untangling or changing ANY of these is
non-trivial.

And hence unlikely (if at all) to happen before 5.12. I'll understand.

Well i dont have any time for it for sure. :-)

In the new year i will probably.

3​: make use of a named capture buffer within a branch reset pattern a
fatal compile-time error

no.

Number 4 could just be "document the weird way it works, and offer
some workaround patterns for those that Really Want To Go There".
Abigail would be my choice to write this.

I will accept that challenge.

Cool. However if you can figure out a simple description of how it
should work then id like to see it.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Dec 10, 2009

From @davidnicol

On Thu, Dec 10, 2009 at 6​:02 AM, demerphq <demerphq@​gmail.com> wrote​:

Cool. However if you can figure out a simple description of how it
should work then id like to see it.

cheers,
Yves

currently %- and %+ are named aliases for slots in @​- and @​+, while @​-
and @​+ store integer offsets into a string buffer.

What if %- and %+ held the offsets directly? Ideally the part that
does that would be ifdeffed out and the code included twice, to keep
from having to do run-time tests all the time.

--
intake, compression, power, exhaust, repeat.

@p5pRT
Copy link
Author

p5pRT commented Dec 10, 2009

From @demerphq

2009/12/10 David Nicol <davidnicol@​gmail.com>​:

On Thu, Dec 10, 2009 at 6​:02 AM, demerphq <demerphq@​gmail.com> wrote​:

Cool. However if you can figure out a simple description of how it
should work then id like to see it.

cheers,
Yves

currently %- and %+ are named aliases for slots in @​- and @​+, while @​-
and @​+ store integer offsets into a string buffer.

What if %- and %+ held the offsets directly? Ideally the part that
does that would be ifdeffed out and the code included twice, to keep
from having to do run-time tests all the time.

That would mean that all users would pay a price for named captures
for ever more.

The way we do it is because we could update a slot in these array
thousands of time in a match, or hundreds of thousands of times. If we
add a hash lookup every time....

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Dec 16, 2009

From @Abigail

On Thu, Dec 10, 2009 at 12​:50​:41PM +0100, Abigail wrote​:

On Thu, Dec 10, 2009 at 11​:47​:18AM +0100, demerphq wrote​:

Number 4 could just be "document the weird way it works, and offer
some workaround patterns for those that Really Want To Go There".
Abigail would be my choice to write this.

I will accept that challenge.

Done so as commit ab10618.

Abigail

@p5pRT
Copy link
Author

p5pRT commented Jul 6, 2012

From @doy

Is there anything more we want to do here?

-doy

@khwilliamson
Copy link
Contributor

No response in 10 years; the docs were updated. Closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants