Unquoted literals in separators in quantifiers in regex matches produce the wrong result in Rakudo #1482

Closed
p6rt opened this issue Feb 2, 2010 · 7 comments

p6rt commented Feb 2, 2010

Migrated from rt.perl.org#72440 (status was 'resolved')

Searchable as RT72440$

p6rt commented Feb 2, 2010

From @masak

This be Rakudo a609d7 on Parrot r43600.

$ perl6 -e 'say "1ab2ab3c" ~~ /^ \d ** abc $/ ?? "OH NOES" !! "oh phew"'
OH NOES

This is a PGE bug. Here follows a brief explanation.

S05 states that unquoted literals like C<abc> are actually three
distinct atoms, each of which can be quantified separately. Thus,
C<abc*> means C<ab[c]*>, not C<[abc]*>. With that reasoning, C<\d **
abc> means C<\d ** [a] bc>.
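
A rough way to see the S05 reading, sketched with Python's `re` module rather than Perl 6 regexes. The translation of C<X ** sep> into `X(?:sepX)*` is an assumption made for illustration, and the name `s05_reading` is hypothetical.

```python
import re

# Each unquoted character is its own atom, so a trailing quantifier
# binds only to the last character: abc* behaves like ab(?:c)*.
abc_star = re.compile(r"abc*")

# What C<[abc]*> would mean in Perl 6: the whole group is repeated.
grouped = re.compile(r"(?:abc)*")

def s05_reading(subject):
    """Model of C<\\d ** [a] bc>: digits separated by 'a', then literal 'bc'."""
    return re.fullmatch(r"\d(?:a\d)*bc", subject) is not None

print(abc_star.fullmatch("abccc") is not None)
print(grouped.fullmatch("abcabc") is not None)
print(s05_reading("1a2a3bc"), s05_reading("1ab2ab3c"))
```

Under this reading the string from the report, "1ab2ab3c", should not match.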

However (though S05, to my knowledge, does not mention it), one might
perhaps temporarily lift the rule about each unquoted alphanumeric
character being its own atom in "** separator context". In that case,
C<\d ** abc> could be made to mean C<\d ** [abc]>. (I'm not saying
this exception would be a good idea, language-wise.)

In PGE, as we see above, C<\d ** abc> currently means C<\d ** [ab] c>.
This is due to an internal optimization that's usually invisible to
the user. When parsing C<abc>, PGE reads it as C<'ab' c>; more
generally, it reads every character of an unquoted literal except the
last as a single string. This optimization makes a lot of sense if
it turns out that C<c> has a quantifier on it. If it doesn't, later
steps in the regex compilation merge C<ab> and C<c> into one literal
string.

In the case of the separator in C<**>, this optimization produces the
wrong results. At the time C<ab> and C<c> would be merged, C<ab> has
already been bound as the separator of the C<**> operator.
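
The three candidate readings can be compared side by side. The sketch below is a toy model in Python, not PGE's actual code: `pge_split_literal` and `separated` are hypothetical helpers, and mapping C<**> with a separator onto `atom(?:sep atom)*` is an assumption.

```python
import re

def pge_split_literal(literal):
    """Toy model of the optimization: all but the last character, then the last."""
    return literal[:-1], literal[-1]

def separated(atom, sep, rest=""):
    """Anchored Python regex for `atom ** sep` followed by the literal `rest`."""
    body = atom + "(?:" + re.escape(sep) + atom + ")*" + re.escape(rest)
    return re.compile("^" + body + "$")

head, tail = pge_split_literal("abc")   # 'ab' and 'c'

pge_reading = separated(r"\d", head, tail)   # \d ** [ab] c  (what PGE does)
s05_reading = separated(r"\d", "a", "bc")    # \d ** [a] bc  (per-atom rule)
whole_reading = separated(r"\d", "abc")      # \d ** [abc]   (possible exception)

subject = "1ab2ab3c"
for name, rx in [("pge", pge_reading), ("s05", s05_reading), ("whole", whole_reading)]:
    print(name, bool(rx.match(subject)))
```

Only the split produced by the optimization accepts "1ab2ab3c", which is consistent with the "OH NOES" output above.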

I probably wouldn't submit this as a rakudobug, were it not for the
fact that, according to my reading of
<http://github.com/perl6/nqp-rx/blob/eb9c75a9b6bf144808ca6d24f31b606e9e8adba8/src/Regex/P6Regex/Grammar.pm>
(lines 47 and 67), this problem persists in nqp-rx, and thus in the ng
branch of Rakudo, once it supports regex matching.

For what it's worth, I suggest that /\d ** abc/ actually be
interpreted as /\d ** [a] bc/, but that a (suppressible) warning be
emitted whenever an atom follows a quantifier separator with no
whitespace in between.

// Carl

p6rt commented Feb 2, 2010

From @masak

masak (>):

> $ perl6 -e 'say "1ab2ab3c" ~~ /^ \d ** abc $/ ?? "OH NOES" !! "oh phew"'
> OH NOES
>
> This is a PGE bug.

As pmichaud pointed out, the same bug also manifests itself in goals,
for the same reasons:

$ perl6 -e 'say "(foo)" ~~ /^ \( ~ \) foo $/'
Unable to parse , couldn't find final \)
[...]

$ perl6 -e 'say "(fo)o" ~~ /^ \( ~ \) foo $/'
(fo)o
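
The goal case can be modeled the same way. This is only a sketch in Python, assuming that C<\( ~ \) X> behaves like C<\( X \)>; the two patterns are hand-written translations, not anything PGE produces, and the sketch takes no position on which reading is the intended one.

```python
import re

# Whole literal taken as the goal body: \( foo \)
whole_goal = re.compile(r"^\(foo\)$")

# Split left by the optimization: 'fo' is the body, 'o' follows the closer,
# i.e. \( fo \) o
split_goal = re.compile(r"^\(fo\)o$")

for subject in ["(foo)", "(fo)o"]:
    print(subject, bool(whole_goal.match(subject)), bool(split_goal.match(subject)))
```

The split pattern rejects "(foo)" and accepts "(fo)o", which lines up with the two results shown above.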

p6rt commented Feb 2, 2010

@masak - Status changed from 'new' to 'open'

p6rt commented Nov 15, 2012

From @bbkr

2012.10 - still broken

$ perl6 -e 'say "(foo)" ~~ /^ \( ~ \) foo $/'
Unable to parse expression in ; couldn't find final \)
  in any FAILGOAL at src/stage2/QRegex.nqp:871
  in regex at -e:1
  in method ACCEPTS at src/gen/CORE.setting:10180
  in block at -e:1

p6rt commented Jun 30, 2015

From @jnthn

On Tue Feb 02 01:21:55 2010, masak wrote:

> masak (>):
>
> > $ perl6 -e 'say "1ab2ab3c" ~~ /^ \d ** abc $/ ?? "OH NOES" !! "oh phew"'
> > OH NOES
> >
> > This is a PGE bug.
>
> As pmichaud pointed out, the same bug also manifests itself in goals,
> for the same reasons:
>
> $ perl6 -e 'say "(foo)" ~~ /^ \( ~ \) foo $/'
> Unable to parse , couldn't find final \)
> [...]
>
> $ perl6 -e 'say "(fo)o" ~~ /^ \( ~ \) foo $/'
> (fo)o

Fixed, and added tests for the two cases to S05-metachars/tilde.t and S05-metasyntax/repeat.t.

p6rt commented Jun 30, 2015

@jnthn - Status changed from 'open' to 'resolved'

p6rt closed this as completed Jun 30, 2015
p6rt added the Bug label Jan 5, 2020