Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LTM favors some character class atoms over others, instead of using text order as a tie breaker #6024

Open
p6rt opened this issue Jan 21, 2017 · 5 comments
Labels
regex Regular expressions, pattern matching, user-defined grammars, tokens and rules

Comments

@p6rt
Copy link

p6rt commented Jan 21, 2017

Migrated from rt.perl.org#130612 (status was 'open')

Searchable as RT130612$

@p6rt
Copy link
Author

p6rt commented Jan 21, 2017

From @ronaldxs

# https://design.perl6.org/S05.html#Longest-token_matching says​:
# For matches of same length ... If the alternatives are in the same
# grammar file, the textually earlier alternative takes precedence.

use Test;

grammar text-order {
  token alt-na_1 { <n> | <na_1> };
  token alt-na_2 { <n> | <na_2> };
  token n { <digit>+ }
  token na_1 { <+alpha +digit> + }
  token na_2 { <+alpha +[0..9]> + }
}

is text-order.parse('1', :rule<alt-na_1>).keys[0], 'n',
  'Match first textual rule OK';
is text-order.parse('1', :rule<alt-na_2>).keys[0], 'n',
'Match second textual rule fail';

result​:

ok 1 - Match first textual rule OK
not ok 2 - Match second textual rule fail
# Failed test 'Match second textual rule fail'
# at <tmp> line 17
# expected​: 'n
# got​: 'na_2'

Posted on IRC, https://irclog.perlgeek.de/perl6/2017-01-20#i_13960886

The IRC discussion notes that the tests should be expected to pass and
the S05 rule is intended for implementation in current Perl 6 / Rakudo.

See also roast issue​: Raku/roast#224

@p6rt
Copy link
Author

p6rt commented Sep 11, 2017

From @smls

This bug is still present in

  Rakudo version 2017.08-104-g76f1d8970
  built on MoarVM version 2017.08.1-148-g1059eed1
  implementing Perl 6.c.


However, there's also a design question to be answered here​:

On Sat, 21 Jan 2017 10​:32​:39 -0800, ronaldxs wrote​:

https://design.perl6.org/S05.html#Longest-token_matching says​:
# For matches of same length ... If the alternatives are in the same
# grammar file, the textually earlier alternative takes precedence.

This S05 paragraph was clearly written with LTM of `proto` regexes in mind.

What does it mean for LTM of `|` alternations?

I.e., should the test-case above match 'n'...

a) because `token n { ... }` is declared above `token na_2 { ... }`?
b) because in `<n> | <na_1>`, `<n> ` is to the left of ` <na_1>`?

The test-case at hand does not distinguish between those two interpretations (Rakudo is currently wrong either way), but it's an important distinction that needs to be clarified if the bug is to be fixed.

http://design.perl6.org/S05.html#Overview has a sentence on this tie-breaker which actually mentions alternations​:

Within a given compilation unit, the earlier alternative or multi wins

I'd interpret that as option (b), but am not 100% sure.

@p6rt
Copy link
Author

p6rt commented Sep 11, 2017

The RT System itself - Status changed from 'new' to 'open'

@p6rt
Copy link
Author

p6rt commented Sep 11, 2017

From @smls

Actually...

Rakudo *does* generally follow interpretation (b)​:

  ➜ 'x' ~~ / .* { say '*' } | .? { say '?' } /; # *
  ➜ 'x' ~~ / .? { say '?' } | .* { say '*' } /; # ?

The observed bug is specifically with character classes​:
 
  ➜ '1' ~~ /<digit> { say 'digit' } | <[0..9]> { say '0..9' } /; # 0..9
  ➜ '1' ~~ /<[0..9]> { say '0..9' } | <digit> { say 'digit' } /; # 0..9

Following some more experimentation, here are various atoms for matching the digit '1', sorted into three categories based on how much LTM favors them in current Rakudo​:

  tier 1​: '1'
  tier 2​: . \d \w <[0..9]>
  tier 3​: <digits> <alnum> <​:Number> <​:Decimal> etc.

That the literal `1` is preferred over everything else by LTM is to be expected ("longest literal prefix" tie-breaker).

However, that the character classes are split into two tiers - with the syntactic ones being preferred over the named and uniprop ones - seems strange.
At least I don't see anything in S05 to back that up.

http://design.perl6.org/S05.html#Overview says that LTM is transitive through subrules, so even if the named character classes are treated as subrule calls they shouldn't be disfavored, right?

@p6rt
Copy link
Author

p6rt commented Sep 12, 2017

From @smls

Confirmed on IRC that this is a bug.

Relevant log¹ (edited for clarity)​:

  <smls> Is it intended that LTM prefers \d and <[0..9]>
  over <digit> and <​:Number>, as a tie-breaker?
  (re. RT #​130612)

  <TimToady> no, that wasn't intended

  <jnthn> m​: say '1' ~~ / \d | <digit> { say 'digit' }/
  <camelia> rakudo-moar 760530​: OUTPUT​: «「1」␤»
  <jnthn> m​: say '1' ~~ / <digit> { say 'digit' } | \d /
  <camelia rakudo-moar 760530​: OUTPUT​: «「1」␤»

  <jnthn> Interesting
  Though I can guess why

  <jnthn> m​: say '1' ~~ / \d | <​:N> { say 'digit' }/
  <camelia> rakudo-moar 760530​: OUTPUT​: «「1」␤»
  <jnthn> m​: say '1' ~~ /<​:N> { say 'digit' } | \d/
  <camelia> rakudo-moar 760530​: OUTPUT​: «「1」␤»

  <jnthn> That one I don't understand


[1] https://irclog.perlgeek.de/perl6/2017-09-11#i_15149008

@p6rt p6rt added the regex Regular expressions, pattern matching, user-defined grammars, tokens and rules label Jan 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
regex Regular expressions, pattern matching, user-defined grammars, tokens and rules
Projects
None yet
Development

No branches or pull requests

1 participant