Mixed longest (ltm) and lexical alternation makes wrong match - "42" ~~ / [ "0" || "42" ] | "4" / #6003

Closed
p6rt opened this issue Jan 15, 2017 · 9 comments

Comments


p6rt commented Jan 15, 2017

Migrated from rt.perl.org#130562 (status was 'rejected')

Searchable as RT130562$


p6rt commented Jan 15, 2017

From @ronaldxs

So "42" ~~ / [ "0" || "42" ] | "4" / matches 4, but if you stick to just
one kind of alternation:

"42" ~~ / [ "0" | "42" ] | "4" /

"42" ~~ / [ "0" || "42" ] || "4" /

the match correctly comes back with 42. It looks like in the mixed case
the "|" longest alternation is not picking the longest match which is a
bug.

A test might look like:

is ~("42" ~~ / [ "0" || "42" ] | "4" /), "42", "mixed longest with grouped lexical alternation picks longest match";


p6rt commented Jan 15, 2017

From @jnthn

On Sun, 15 Jan 2017 05:06:21 -0800, ronaldxs wrote:

> So "42" ~~ / [ "0" || "42" ] | "4" / matches 4 but if you stick to just
> one kind of alternation:
>
> "42" ~~ / [ "0" | "42" ] | "4" /
>
> "42" ~~ / [ "0" || "42" ] || "4" /
>
> the match correctly comes back with 42. It looks like in the mixed case
> the "|" longest alternation is not picking the longest match which is a
> bug.

This is not a bug. `||` is by design not declarative, and so only the first branch of it is significant for LTM purposes. This behavior can be confirmed by re-ordering the imperative alternation:

"42" ~~ / [ "0" || "42" ] | "4" /
「4」
"42" ~~ / [ "42" || "0" ] | "4" /
「42」

Therefore, the longest declarative match is 4, and so that is the branch that LTM selects. Since there's no anchoring, there's furthermore no reason for it to backtrack into the second branch of the `|` to try the other `||` branch.

I'm pretty sure this behavior is relied upon in the Perl 6 grammar itself; off the top of my head, it shows up in a bunch of places for the sake of error handling.

So, working as designed.

Thanks,

/jnthn
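
As a hedged illustration of the explanation above, the four variants discussed so far can be collected into one short Raku script. This is only a sketch: the variable names are illustrative, and the expected results noted in the comments are the ones reported in this thread, not independently specified behavior.

```raku
# Input is always "42"; each line stringifies the resulting match.

# Mixed: only "0", the first branch of ||, is visible to LTM for the
# outer |, so the longest declarative candidate is "4".
my $mixed = ~("42" ~~ / [ "0" || "42" ] | "4" /);      # reported: 4

# Re-ordered: "42" is now the first || branch, so LTM can see it.
my $reordered = ~("42" ~~ / [ "42" || "0" ] | "4" /);  # reported: 42

# Only | : every branch takes part in longest-token matching.
my $all-ltm = ~("42" ~~ / [ "0" | "42" ] | "4" /);     # reported: 42

# Only || : branches are tried strictly left to right.
my $all-seq = ~("42" ~~ / [ "0" || "42" ] || "4" /);   # reported: 42

say "mixed:     ", $mixed;
say "reordered: ", $reordered;
say "all |:     ", $all-ltm;
say "all ||:    ", $all-seq;
```

The only difference between the first two patterns is the order of the branches inside the `||` group, which is what makes the first branch's role in LTM visible.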


p6rt commented Jan 15, 2017

The RT System itself - Status changed from 'new' to 'open'

@p6rt p6rt closed this as completed Jan 15, 2017

p6rt commented Jan 15, 2017

@jnthn - Status changed from 'open' to 'rejected'


p6rt commented Jan 16, 2017

From 1parrota@gmail.com

Any time there's a bug report based on a serious misunderstanding of
language behaviour by someone otherwise apparently competent, should
it be considered an LTA for the relevant part of the documentation?



p6rt commented Jan 16, 2017

From @AlexDaniel

Correct. See Raku/doc#1141



p6rt commented Jan 16, 2017

From @ronaldxs

I contacted jnthn privately today and he was very helpful.

The behavior seems to be specced in S05, in the last paragraph of section https://design.perl6.org/S05.html#Longest-token_matching starting "The || form has the old short-circuit semantics,". The paragraph goes on to say "The first || in a regex makes the token patterns on its left available to the outer longest-token matcher, but hides any subsequent tests from longest-token matching."

I also came across another relevant RT #125608 https://rt.perl.org/Ticket/Display.html?id=125608 . There are tests for the RT in roast - https://github.com/perl6/roast/blob/master/S05-metasyntax/longest-alternative.t#L449 . I don't see a test that exactly matches this ticket and am considering adding one together with the longest-alternative.t RT125608 tests.

jnthn and I discussed documentation of the concern in traps. I mentioned that the split into | ltm and || sequential regex alternatives was new to Perl 6 and was not in Perl 5 regexes. jnthn replied: 'So maybe the trap is "using | in regexes without understanding its Perl 6 semantics"'. jnthn also said he thought the concept of "declarative prefixes" might be important and, when asked a bit more, agreed that the "term may have originated with Perl 6". The docs do have some help for alternatives in the context of 5 to 6 migration here: https://docs.perl6.org/language/5to6-nutshell#Longest_token_matching_(LTM)_displaces_alternation . I read through the section and think it could use more work including (at the least) a pointer to https://docs.perl6.org/language/regexes#Alternation:_|| .

I first hit the problem when working on a grammar that used '|' while inheriting from another grammar that used '||' exclusively. With people using grammars to convert BNF/ABNF, and with grammar composition, I think this could come up again. While translating Perl 5 '|' to '||' might be good for regexes that don't worry about reuse and the Perl 6 grammar, I suggest we consider documenting somewhere that, for grammars which expect to share regexes by composition/inheritance, '|' may be the safer choice.


p6rt commented Jan 17, 2017

From @coke

If you'd like to turn this into a doc-u-bug, that's fine; please open
a ticket at

https://github.com/perl6/doc/issues


--
Will "Coke" Coleda


p6rt commented Jan 17, 2017

From @ronaldxs

On 2017-01-17 21:35, Will Coleda wrote:

> If you'd like to turn this into a doc-u-bug, that's fine; please open
> a ticket at
>
> https://github.com/perl6/doc/issues

Already done, as described in the earlier comment on this ticket timestamped
"Mon, 16 Jan 2017 11:12:59 -0800" by AlexDaniel++

Raku/doc#1141

Thanks
