Backtracking into a parameterized subrule like <meh(42)> tries to call it without arguments. #6120

Open
p6rt opened this issue Mar 4, 2017 · 6 comments
Labels
regex Regular expressions, pattern matching, user-defined grammars, tokens and rules

Comments

p6rt commented Mar 4, 2017

Migrated from rt.perl.org#130910 (status was 'open')

Searchable as RT130910$

p6rt commented Mar 4, 2017

From @AlexDaniel

Here are 3 examples that work as expected:

Code:
my regex meh($t) { xy }; say 'xy' ~~ /^ <meh(42)> $/

Result:
「xy」
meh => 「xy」

Code:
my regex meh($t) { xy }; say 'ab' ~~ /^ <meh(42)> $/

Result:
Nil

Code:
my regex meh($t) { .. }; say 'xy' ~~ /^ <meh(42)> $/

Result:
「xy」
meh => 「xy」

And here is one that doesn't:

Code:
my regex meh($t) { .. }; say 'xyz' ~~ /^ <meh(42)> $/

Result:
Too few positionals passed; expected 2 arguments but got 1
  in regex meh at <tmp> line 1
  in block <unit> at <tmp> line 1

Why? What second argument is it expecting?
Nil is the right answer here.

p6rt commented Mar 4, 2017

From @MasterDuke17

Some more examples

[23:10] <MasterDuke> m: my regex meh($t, $s) { .. }; say 'xyz' ~~ /^ <meh(1)> $/
[23:10] <evalable6> MasterDuke, rakudo-moar 11ee2fe17: OUTPUT: «(exit code 1) Too few positionals passed; expected 3 arguments but got 2␤ in regex meh at /tmp/cCSpiqVwyt line 1␤ in block <unit> at /tmp/cCSpiqVwyt line 1␤»
[23:11] <MasterDuke> m: my regex meh($t, $s) { .. }; say 'xyz' ~~ /^ <meh(1, 2)> $/
[23:11] <evalable6> MasterDuke, rakudo-moar 11ee2fe17: OUTPUT: «(exit code 1) Too few positionals passed; expected 3 arguments but got 1␤ in regex meh at /tmp/FVoRYxVyfG line 1␤ in block <unit> at /tmp/FVoRYxVyfG line 1␤»
[23:11] <MasterDuke> m: my regex meh($t, $s) { .. }; say 'xyz' ~~ /^ <meh(1, 2, 3)> $/
[23:11] <evalable6> MasterDuke, rakudo-moar 11ee2fe17: OUTPUT: «(exit code 1) Too many positionals passed; expected 3 arguments but got 4␤ in regex meh at /tmp/RumqR3rpLh line 1␤ in block <unit> at /tmp/RumqR3rpLh line 1␤»
[23:11] <MasterDuke> very weird
[23:11] <MasterDuke> m: my regex meh($t, $s) { .. }; say 'xyz' ~~ /^ <meh(1, 2, 3, 4)> $/
[23:11] <evalable6> MasterDuke, rakudo-moar 11ee2fe17: OUTPUT: «(exit code 1) Too many positionals passed; expected 3 arguments but got 5␤ in regex meh at /tmp/Mtn6BJJxmP line 1␤ in block <unit> at /tmp/Mtn6BJJxmP line 1␤»

p6rt commented Mar 4, 2017

The RT System itself - Status changed from 'new' to 'open'

p6rt commented Aug 22, 2017

From @smls

It has to do with backtracking, because:

1) The problem disappears when `:ratchet` mode is enabled in the top-level regex:

  ➜ my regex meh ($t) { . };
  ➜ say "ab" ~~ /^ :ratchet <meh(1)> $ /;
  Nil

2) The problem disappears when the named regex is made a `token`:

  ➜ my token meh ($t) { . };
  ➜ say "ab" ~~ /^ <meh(1)> $ /;
  Nil

Of course, the regex engine could avoid backtracking entirely in that example, but maybe it's just not optimized enough to know that.
Here's a different example in which backtracking is actually necessary:

  my regex meh ($t) {
      { say "meh start" }
      .+?
      { say "meh end" }
  }

  say "abcde" ~~ /
      ^
      <meh(42)> { say '$<meh> = ', $<meh> }
      $
  /;

It outputs:

  meh start
  meh end
  $<meh> = 「abcde」
  Too few positionals passed; expected 2 arguments but got 1
  in regex meh at [...]

Note how the error message appears after having reached the end of the regex for the first time, just before it would have backtracked into `meh` for the first time.

In comparison, when the parameterization of `meh` is removed, the example prints the following (note how it backtracks into `meh` four times, as it should):

  meh start
  meh end
  $<meh> = 「a」
  meh end
  $<meh> = 「ab」
  meh end
  $<meh> = 「abc」
  meh end
  $<meh> = 「abcd」
  meh end
  $<meh> = 「abcde」

In summary, what *appears* to be happening is this:

- If a named subrule is called with parameters...
- And it matched...
- But then the regex engine wants to backtrack into it...
- Then it "calls" the subrule again, but fails to pass the parameters again.
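
The suspected sequence can be mimicked outside Rakudo with a toy model. The Python sketch below is only an analogy: the `Cursor` class, its `regexsub` attribute and its `restart` method are hypothetical stand-ins made up for illustration, not Rakudo's or NQP's real internals. It shows how re-entering a subrule with nothing but the cursor yields a "one argument short" error, matching the "expected 2 arguments but got 1" message (invocant plus `$t`).

```python
class Cursor:
    """Hypothetical stand-in for a match cursor (not the real Rakudo Cursor)."""

    def __init__(self, regexsub):
        # Code to re-enter when the engine backtracks into the subrule.
        self.regexsub = regexsub

    def restart(self):
        # Models the suspected bug: only the cursor is passed, the
        # original call arguments are not passed again.
        return self.regexsub(self)


def meh(cursor, t):
    """Models `regex meh($t)`: the cursor (invocant) plus one positional."""
    return f"meh matched with t={t}"


# First call: the argument is supplied, so the subrule matches.
cursor = Cursor(meh)
result = meh(cursor, 42)

# Backtracking: the subrule is re-entered without its argument.
try:
    cursor.restart()
    error = None
except TypeError as exc:
    error = str(exc)
```

In this model the first call succeeds and the restart fails with a missing-argument error, which is the same shape as the failure reported above.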

p6rt commented Aug 22, 2017

From @smls

Sorry, there was a copy-paste error in the second-to-last output listing (the one for the parameterized `meh`). It should read:

  meh start
  meh end
  $<meh> = 「a」
  Too few positionals passed; expected 2 arguments but got 1
  in regex meh at [...]

p6rt commented May 8, 2018

From @skids

This is also an issue in nqp.

$ nqp -e 'grammar f { regex TOP { ^ <foo(42)> $ }; regex foo($i) { .. } }; nqp::say(f.parse("aaa"));'
Too few positionals passed; expected 2 arguments but got 1

Fixing it in nqp first is probably the best first step. To
that end I did some investigating, and it looks like this will require
some fairly tricky modifications.

Currently, a Cursor will fill in its $!regexsub parameter by getting the
callercode of the rule that called a .cursor_start_* method. This code
has the param checking instructions at the top. Then when the cursor
is matched it copies this code reference into $!restart in .cursor_pass.
Then the regex node code (made by .regex_mast which is called by .as_mast
which simply inserts the .regex_mast instructions inline with the rest
of the code .as_mast generates) will call cursor_next when backtracking.

If it finds code in $!restart, .cursor_next invokes it with no arguments.
When invoked with a function pointer in $!restart, the .as_mast code
skips the .regex_mast code, so it only unwinds the cursor stack
(based on the backtrack stack). However, the code that checks the
parameter count sits before the .as_mast code in the frame, so it
runs first and fails before the skip is ever reached. You can see
this behavior by making the positional optional:

$ nqp -e 'grammar f { regex TOP { ^ <foo(42)> $ }; regex foo($i?) { {nqp::say($i)} .. } }; f.parse("aaa");'
42

...noting that the 42 is only said once on the first call where the
match occurs, not on the second call during the backtrack.
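
The ordering described above (parameter check first, restart dispatch second) can be illustrated with a small Python analogy. Everything here is hypothetical: the dict used as a cursor, the "restart" key and the function `foo` are made up for the sketch and are not NQP's generated code. Python checks a call's arity before any of the function body runs, which plays the role of the parameter check at the top of the frame.

```python
log = []  # records each time the rule body actually runs


def foo(cursor, i=None):
    # The arity check has already happened by the time this line runs.
    # Because `i` is optional, a call with only the cursor passes it.
    if cursor.get("restart") is not None:
        # Restart path: skip the body and just resume backtracking.
        return "backtrack"
    log.append(i)            # body runs only on the first call
    cursor["restart"] = foo  # remember the code to re-enter later
    return "match"


cursor = {}
first = foo(cursor, 42)             # initial call, argument supplied
second = cursor["restart"](cursor)  # re-entry with no arguments
```

With the optional parameter the re-entry gets past the arity check and takes the restart path, so the body's side effect is recorded once, in line with the single `42` in the NQP output above.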

There is also a cursor_more in NQP, apparently unused within NQP itself,
which will call $!regexsub with nothing but a new cursor as a parameter.

In rakudo, cursor_next and cursor_more are replicated under different
names, along with an additional one used for exhaustive/overlapping, and then
renamed pointers to those functions are thrown into a grist mill of
code where it is hard to enumerate the number of places in which they
are called.

Long story short, it does not look like passing args along down
the call chain is practical. A fix would require either some way to
move the param checks for everything but the invocant down into the
regex_mast instructions, or taking a curry closure around the params
and putting that in regexsub instead.
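
The second option, a closure curried over the call arguments, might look roughly like the following Python sketch. This is an illustration of the idea only, under the assumption that the stored code is re-entered with just the cursor; `Cursor`, `restart` and `call_subrule` are hypothetical names, not NQP or Rakudo API.

```python
class Cursor:
    """Hypothetical stand-in for a match cursor."""

    def __init__(self):
        self.regexsub = None  # code to re-enter on backtracking

    def restart(self):
        # The engine still passes nothing but the cursor.
        return self.regexsub(self)


def meh(cursor, t):
    """Models `regex meh($t)`."""
    return f"meh entered with t={t}"


def call_subrule(rule, *args):
    """Call `rule` and store a closure that remembers `args`."""
    cursor = Cursor()
    # Curry the original arguments so a later restart can supply them.
    cursor.regexsub = lambda c: rule(c, *args)
    rule(cursor, *args)  # the initial call
    return cursor


cursor = call_subrule(meh, 42)
again = cursor.restart()  # re-entry now sees the original argument
```

The restart path itself is unchanged here; only what gets stored in `regexsub` differs, which is why this approach avoids threading arguments through every caller.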

As a side note, it has been expressed before that having
a way to fire a phaser (or code somehow otherwise attached) when a
block in a regex is backtracked over would be useful in building some
interesting constructs. It is speculated in S04/"Definition of Success"
that a block that gets backtracked over should fire UNDO
(which implies that KEEP would not be fired until the whole match succeeds.)
I would guess we would only want to keep half-finished frames around
to do that when there actually were user-defined phasers to fire,
for performance reasons. Also any block where the return value is
used for interpolation or assertion would obviously not be compatible
with this premise.

p6rt added the regex label Jan 5, 2020