Grammar.parse DWIMs only if your TOP rule has $ on the end #5796

Closed
p6rt opened this issue Nov 13, 2016 · 7 comments


p6rt commented Nov 13, 2016

Migrated from rt.perl.org#130081 (status was 'resolved')

Searchable as RT130081$


p6rt commented Nov 13, 2016

From @AlexDaniel

To demonstrate how it works, let's say we have this grammar:

grammar G { regex TOP { ‘a’ || ‘abc’ } };

You might think that it will match either “a” or “abc”, but no, “abc”
will never work.

*Code:*
grammar G { regex TOP { ‘a’ || ‘abc’ } };
say G.parse(‘abc’)

*Result:*
Nil

Why? Well, see this:
https://github.com/rakudo/rakudo/blob/cb8f783eeb8ab25a5090fdc4e5cc318c36ee1afa/src/core/Grammar.pm#L23

Basically, TOP first matches ‘a’ and, since that is a good enough
result, bails out without trying anything else. Then the .parse method
sees that TOP did not match the whole string, so it fails.

I am not sure how this behavior could be useful, or how anyone could
expect it. To me it feels like the top rule should always have an
implicit $ on the end so that it knows it should keep trying to find a
better solution.

But yes, we can turn a blind eye to this issue and add yet another trap to
the documentation.

Another closely related issue is the .subparse method. It seems that subparse
is supposed to return partial results (.partial-parse ?), but for some reason
it returns a failed Match instead of Nil, unlike .parse.
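
For reference, the workaround named in the title can be sketched as follows. This is only an illustration: the grammar name `Anchored` is made up here, and the brackets are an addition, needed because `||` binds more loosely than concatenation, so a bare trailing `$` would attach only to the ‘abc’ branch.

```raku
# Hypothetical grammar name; same alternation as in the report, but grouped
# and followed by an explicit end-of-string anchor.
grammar Anchored {
    regex TOP { [ 'a' || 'abc' ] $ }
}

# 'a' matches first, the anchor fails at position 1, and the regex
# backtracks into the alternation to try 'abc'.
say Anchored.parse('abc');   # 「abc」
say Anchored.parse('a');     # 「a」
```

With the anchor inside TOP, backtracking happens within the regex itself, so the result does not depend on what .parse does afterwards.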


p6rt commented Nov 14, 2016

From @zoffixznet

Additional comments from TimToady: https://irclog.perlgeek.de/perl6/2016-11-13#i_13560371


p6rt commented Nov 14, 2016

From @zoffixznet

My personal two cents, as someone who knows very little about the workings of grammars and regexes: this is quite a bit of a WAT. `regex` is meant to backtrack, and it does, for example here:

  <ZoffixW> m: say grammar { regex TOP { <foo> 'foo' }; regex foo { [ ‘a’ || ‘abc’ ] } }.parse: 'abcfoo'
  <camelia> rakudo-moar dfb58d: OUTPUT«「abcfoo」␤ foo => 「abc」␤»
  <ZoffixW> m: say grammar { regex TOP { <foo> 'foo' }; token foo { [ ‘a’ || ‘abc’ ] } }.parse: 'abcfoo'
  <camelia> rakudo-moar dfb58d: OUTPUT«Nil␤»

However, if the `regex` is the top-level rule, we add an extra requirement: to enable backtracking, the user also needs an anchor to the end of the string.

It's a special exception to the rule and would need to be documented in Traps, but why do we have it?


p6rt commented Nov 14, 2016

The RT System itself - Status changed from 'new' to 'open'


p6rt commented Nov 15, 2016

From @AlexDaniel

After a really long discussion today, here is one finding: the current behavior appeared in 2014.03 after this commit: rakudo/rakudo@4d8734d

(from https://irclog.perlgeek.de/perl6/2016-11-15#i_13573639 )


p6rt commented Nov 16, 2016

From @jnthn

On Sat, 12 Nov 2016 18:36:15 -0800, alex.jakimenko@gmail.com wrote:

> Basically, TOP will first match ‘a’, and given that it is a good enough
> result it will bail out without trying anything else.

This is how regexes work. When they produce a successful match, they return the Cursor indicating the match that they produced. Backtracking into a regex works by asking that Cursor to try again (if it has anything else to try).

> Then, .parse method will see that it did not parse the whole string, so it fails.
>
> I am not sure how this behavior could be useful or even how could anyone
> expect this, to me it feels like the top rule should always have an
> implicit $ on the end so that it knows that it should keep trying to find a
> better solution.

It's more that the parse method should ask the Cursor it gets back to try again. I've implemented that in Rakudo 4ccb2f3, and added a test in S05-grammar/parse_and_parsefile.t, so now it does what you'd expect.

I'm sympathetic to the arguments that building a grammar out of regexes is, in general, not wise. However, that doesn't really justify the parse method being uncooperative towards folks who choose to do so, especially given that it's an easy fix and adds no cost to grammars that don't need it. And in reality, if we don't change parse the way I have, folks won't go and re-think their grammar; they'll just add the $. The memory and performance hit when they get to parsing something big with a grammar full of regexes will be a much better incentive for them to learn how to write grammars better. :-)

/jnthn
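
A quick way to check the behavior described above, assuming a Rakudo build that includes the fix mentioned in the previous comment. This is a sketch for trying it out rather than the test that was added to the spec suite; the grammar is the one from the original report, written with ASCII quotes.

```raku
# Same grammar as in the original report: no end-of-string anchor in TOP.
grammar G { regex TOP { 'a' || 'abc' } }

# With the fix, .parse asks the returned Cursor to try again when the first
# successful match ('a') stops short of the end of the string, so this is
# expected to give a Match covering 'abc' rather than Nil.
say G.parse('abc');

# .subparse does not require the match to reach the end of the string.
say G.subparse('abc');
```

On a Rakudo older than the fix, the first call should still give Nil, as shown in the original report.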

@p6rt p6rt closed this as completed Nov 16, 2016

p6rt commented Nov 16, 2016

@jnthn - Status changed from 'open' to 'resolved'

@p6rt p6rt added the at_larry label Jan 5, 2020