<[a..z]> ranges break grapheme awareness #5425

Closed
p6rt opened this issue Jul 5, 2016 · 6 comments
@p6rt

p6rt commented Jul 5, 2016

Migrated from rt.perl.org#128550 (status was 'resolved')

Searchable as RT128550$

@p6rt

p6rt commented Jul 5, 2016

From zefram@fysh.org

Built-in character classes such as <lower> consistently accept any
diacritics on a matching base character, matching the whole grapheme:

/^<lower>$/.ACCEPTS("u\x[308]").Bool
True
/^<lower>$/.ACCEPTS("n\x[308]").Bool
True

Matching against a literal character or a <[abc]>-type enumerated
character class consistently rejects any diacritics on a matching base
character:

/^<[nu]>$/.ACCEPTS("u\x[308]").Bool
False
/^<[nu]>$/.ACCEPTS("n\x[308]").Bool
False

But a <[a..z]>-type range-based character class has inconsistent
behaviour:

/^<[a..z]>$/.ACCEPTS("u\x[308]").Bool
False
/^<[a..z]>$/.ACCEPTS("n\x[308]").Bool
True

The behaviour seems to be that if in NFC the first character of the
grapheme is the unadorned base character then it accepts, but if it's a
combined character then it rejects. This dependence on the representation
breaks the grapheme view of the string, and so is presumably a bug.
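
The representation dependence described above can be seen outside Raku. The following is a minimal sketch in Python (not Raku, and not part of the original report) that only inspects NFC normalization: "u" plus U+0308 composes to the single code point U+00FC, while "n" plus U+0308 has no precomposed form and stays as two code points. The helper name `nfc_codepoints` is made up for this illustration.

```python
import unicodedata


def nfc_codepoints(text):
    """Return the code points of `text` after NFC normalization."""
    return [ord(ch) for ch in unicodedata.normalize("NFC", text)]


# "u" + COMBINING DIAERESIS composes into one precomposed code point.
u_umlaut = nfc_codepoints("u\u0308")
# "n" + COMBINING DIAERESIS has no precomposed form, so "n" stays first.
n_umlaut = nfc_codepoints("n\u0308")

print([hex(cp) for cp in u_umlaut])  # ['0xfc']
print([hex(cp) for cp in n_umlaut])  # ['0x6e', '0x308']
```

A range check that looks only at the first NFC code point would therefore see U+00FC (outside a..z) in the first case and U+006E (inside a..z) in the second, which matches the False/True results reported above.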

I think a <[a..z]>-type range should, with respect to diacritics, behave
either like <lower> or like <[abc]>. I am unable to discern which
is really intended; none of the documentation that I've seen addresses
grapheme semantics. I note that matching a specified base character with
arbitrary diacritics is a meaningful facility, and given that <lower>
et al have that behaviour it should probably be available somewhere.
The character range feature is almost providing it, but it's obviously
not been designed to, because a single-character range such as <[n..n]>
is rejected.

-zefram

@p6rt

p6rt commented Aug 4, 2016

From @smls

> I note that matching a specified base character with
> arbitrary diacritics is a meaningful facility, and given that <lower>
> et al have that behaviour it should probably be available somewhere.

This facility is already available in the form of the :ignoremark flag (or :m for short).

I'm pretty sure that <[a..c]> should match exactly the same thing as <[abc]>, and that ("b\x[308]" ~~ /<[a..c]>/) matching is a bug.

@p6rt

p6rt commented Aug 4, 2016

The RT System itself - Status changed from 'new' to 'open'

@p6rt

p6rt commented Aug 13, 2016

From @TimToady

The code generator in nqp for char ranges was incorrectly using ordat and ordfirst to find the character to compare, which throw away information on synthetic characters. We now use the getcp_s instruction instead, which leaves synthetics negative, so that they drop out of the character range correctly (but only when :m is not specified, of course).

nqp fix in 2df0a0656e4b20a72bd73e6d2b6214b584d095ac

rakudo bump in fe90be01c6546e1dbb2ee7ff794e8b6ea1491268

Tests updated in 172f6945653ddbce944af1d7cae9ad956f3a70b9
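
The effect of the change described above can be modelled roughly as follows. This is a Python sketch under stated assumptions, not the nqp or MoarVM code: `first_code_only` stands in for the old behaviour of comparing only the leading code point, and `grapheme_code` stands in for a lookup that reports a negative value for any grapheme needing more than one NFC code point. Both function names are hypothetical, and a fixed `-1` is used here only as a placeholder negative value.

```python
import unicodedata


def first_code_only(grapheme):
    """Old-style lookup (assumed model): leading NFC code point only."""
    return ord(unicodedata.normalize("NFC", grapheme)[0])


def grapheme_code(grapheme):
    """New-style lookup (assumed model): a grapheme that NFC cannot reduce
    to one code point is treated as a synthetic and reported as negative."""
    nfc = unicodedata.normalize("NFC", grapheme)
    return ord(nfc) if len(nfc) == 1 else -1


def in_range(code, lo="a", hi="z"):
    """Plain numeric range test, as a character range would perform."""
    return ord(lo) <= code <= ord(hi)


for g in ("u\u0308", "n\u0308", "n"):
    print(repr(g), in_range(first_code_only(g)), in_range(grapheme_code(g)))
```

In this model the old lookup accepts "n" plus U+0308 but rejects "u" plus U+0308, while the new lookup rejects both and still accepts a bare "n", so the range behaves like the enumerated class.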


@p6rt

p6rt commented Aug 13, 2016

@TimToady - Status changed from 'open' to 'resolved'

@p6rt p6rt closed this as completed Aug 13, 2016
@p6rt p6rt added the at_larry label Jan 5, 2020