Grammar bug <alnum> vs <alpha> #6688

Open
p6rt opened this issue Sep 25, 2018 · 13 comments
Labels
regex Regular expressions, pattern matching, user-defined grammars, tokens and rules

Comments

p6rt commented Sep 25, 2018

Migrated from rt.perl.org#133541 (status was 'open')

Searchable as RT133541$

p6rt commented Sep 25, 2018

From jvs@dyumnin.com

In the attached code, the only difference between the grammars G0 and G1
is the definition of token 'type': it is defined as <alpha>+ in one case
and as <alnum>+ in the other.

Since the part of the string being matched is 'sc_in', both the alpha and
alnum tokens should have captured it. But we see the following result on
execution:

=========== <alnum> Example==============
Nil
=========== <alpha> Example==============
「sc_in<foo> bar」
ruport => 「sc_in」
type => 「sc_in」
alpha => 「s」
alpha => 「c」
alpha => 「_」
alpha => 「i」
alpha => 「n」

Perl Version is

This is Rakudo Star version 2018.06 built on MoarVM version 2018.06
implementing Perl 6.c.

--
Vijayvithal
Dyumnin Semiconductors

p6rt commented Sep 25, 2018

From jvs@dyumnin.com

grammar G0 {
  token TOP {<rport>|<ruport>.*}
  regex rport { <type>}
  rule ruport { <type>}
  #token type {<alpha>+}
  token type {<alnum>+}
}

grammar G1 {
  token TOP {<rport>|<ruport>.*}
  regex rport { <type>}
  rule ruport { <type>}
  token type {<alpha>+}
  #token type {<alnum>+}
}
my $str="sc_in<foo> bar";
say "=========== <alnum> Example==============";
say G0.parse($str);
say "=========== <alpha> Example==============";
say G1.parse($str);

p6rt commented Sep 28, 2018

From @geekosaur

"_" is not an alphabetic character. It's allowed in "alnum" because that is
by intent what is \w in other regex implementations, which includes "_".

--
brandon s allbery kf8nh
allbery.b@​gmail.com

p6rt commented Sep 28, 2018

The RT System itself - Status changed from 'new' to 'open'

p6rt commented Sep 28, 2018

From @labster

Are you sure about that? Underscore has been part of the specs (synopses)
for <alpha> for at least 10 years, probably longer.

"_" ~~ /<alpha>/
「_」
alpha => 「_」
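
To separate the character-class question from the grammar question, the two classes can be checked in isolation. This is only a minimal sketch: the helper names `is-alpha` and `is-alnum` are made up for illustration, and the four sample characters are arbitrary. Going by the documented definitions, one would expect `_` to match both classes and a digit to match only `<alnum>`.

```raku
# Hypothetical helpers (not part of Raku): does a one-character string
# match the built-in <alpha> / <alnum> class on its own?
sub is-alpha(Str $ch --> Bool) { so $ch ~~ /^ <alpha> $/ }
sub is-alnum(Str $ch --> Bool) { so $ch ~~ /^ <alnum> $/ }

# Print one line per sample character.
for 'a', '_', '5', '~' -> $ch {
    say "char {$ch}  alpha: {is-alpha($ch)}  alnum: {is-alnum($ch)}";
}
```

If both helpers accept `_`, the difference seen in the grammars is unlikely to come from the classes themselves.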

p6rt commented Sep 28, 2018

From @pmichaud

The issue doesn't seem to be the underscore, because I get the same result even when converting the underscore into a letter ('b')​:

$ cat gentb.p6
grammar G0 {
  token TOP {<rport>|<ruport>.*}
  regex rport { <type>}
  rule ruport { <type>}
  #token type {<alpha>+}
  token type {<alnum>+}
}

grammar G1 {
  token TOP {<rport>|<ruport>.*}
  regex rport { <type>}
  rule ruport { <type>}
  token type {<alpha>+}
  #token type {<alnum>+}
}
my $str="scbin<foo> bar";
say "=========== <alnum> Example==============";
say G0.parse($str);
say "=========== <alpha> Example==============";
say G1.parse($str);

$ perl6 gentb.p6
=========== <alnum> Example==============
Nil
=========== <alpha> Example==============
「scbin<foo> bar」
ruport => 「scbin」
  type => 「scbin」
  alpha => 「s」
  alpha => 「c」
  alpha => 「b」
  alpha => 「i」
  alpha => 「n」
$

p6rt commented Sep 28, 2018

From @labster

This golfs down to the snippet below; only the first grammar (Alnum1) returns Nil.

grammar Alnum1 {
  token TOP {<alnum>|<alnum>.*}
}
grammar AlnumReversed {
  token TOP {<alnum>.*|<alnum>}
}
grammar Alpha1 {
  token TOP {<alpha>|<alpha>.*}
}
my $rx = rx/^ [<alnum>|<alnum>.*] $/;

my $str="n~";

.say for "=========== <alnum> ==============",
Alnum1.parse($str),
"=========== <alnum> (reversed) ===",
AlnumReversed.parse($str),
"=========== <alpha> ==============",
Alpha1.parse($str),
"=========== Regex ==============",
$str ~~ $rx;

p6rt commented Oct 2, 2018

From jvs@dyumnin.com

This conflicts with the documentation at https://docs.perl6.org/language/regexes, which states:

<alpha>: Alphabetic characters including _

and

<alnum>: \w, i.e. <alpha> plus <digit>

In my example, '_' matches the alpha regex.

As per the specification, everything that matches alpha should also match alnum,
which in the given example it does not.

p6rt commented Oct 2, 2018

From jvs@dyumnin.com

This issue surfaces because of the token TOP line. If only <ruport> is used
instead of <rport>|<ruport>, the test case works in both cases. So it is quite
possible that the bug is elsewhere but shows up as a difference between
alpha and alnum.

Regards
Vijay
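
The observation above can be sketched as follows. This is one reading of the description, not the exact code that was run: it assumes the alternation in TOP is replaced by `<ruport> .*`, and the grammar names `OnlyRuportAlnum` and `OnlyRuportAlpha` are made up for illustration.

```raku
# TOP without the alternation: only <ruport> followed by the rest of the input.
grammar OnlyRuportAlnum {
    token TOP    { <ruport> .* }
    rule  ruport { <type>}
    token type   { <alnum>+ }
}

# Same grammar, but 'type' is built from <alpha> instead of <alnum>.
grammar OnlyRuportAlpha {
    token TOP    { <ruport> .* }
    rule  ruport { <type>}
    token type   { <alpha>+ }
}

my $str = "sc_in<foo> bar";
say OnlyRuportAlnum.parse($str);
say OnlyRuportAlpha.parse($str);
```

If both calls return a match, the failure depends on the alternation in TOP rather than on `<alnum>` alone.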

--
Vijayvithal
Dyumnin Semiconductors

p6rt commented Oct 2, 2018

From @labster

Actually, if you change it to <ruport>.*|<rport> -- this will work as you
expect. It's a bug that your version doesn't work, of course. It does
seem to involve <alpha> tangentially, but it is unrelated to underscore.
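
A sketch of that workaround applied to the original G0 is shown below. The grammar name `G0Reordered` is made up for illustration, and the only intended change from G0 is the order of the two alternatives in TOP.

```raku
# G0 from the report, with the alternatives in TOP swapped so that
# the longer branch (<ruport>.*) is listed first.
grammar G0Reordered {
    token TOP    { <ruport>.* | <rport> }
    regex rport  { <type>}
    rule  ruport { <type>}
    token type   { <alnum>+ }
}

my $str = "sc_in<foo> bar";
say G0Reordered.parse($str);
```

This only sidesteps the problem. With `|`, the longest alternative is supposed to be chosen regardless of the order in which the branches are written, so the original ordering should have parsed too.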

JJ commented May 30, 2020

Guess this has not been fixed yet. Although it's not really clear what the issue is.

JJ commented Jul 27, 2020

Looking at the definition of the rules:

https://github.com/Raku/nqp/blob/db0c1088fc34e3518052ee441943fdfa7b3dcbb7/src/QRegex/Cursor.nqp#L973-L989

The only difference seems to be the class of characters it accepts. In the second case it's WORD, which I guess includes _. So looking at the last example, I would say it's a problem with ratcheting in alternations.

JJ changed the title from "Grammer bug <alnum> vs <alpha>" to "Grammar bug <alnum> vs <alpha>" on Jul 27, 2020
JJ added the docs and regex labels and removed the docs label on Jul 27, 2020
JJ commented Jul 27, 2020

This might be a version of #6570 or related to that in some way, since it involves ratcheting and alternation. As a matter of fact, it does not seem to happen for alpha, so it might go a bit deeper than that.
