<[a..z]> ranges break grapheme awareness #5425

p6rt · 2016-07-05T21:11:19Z

Migrated from rt.perl.org#128550 (status was 'resolved')

Searchable as RT128550$

p6rt · 2016-07-05T21:11:19Z

From zefram@fysh.org

Built-in character classes such as <lower> consistently accept any
diacritics on a matching base character, matching the whole grapheme:

/^<lower>$/.ACCEPTS("u\x[308]").Bool
True
/^<lower>$/.ACCEPTS("n\x[308]").Bool
True

Matching against a literal character or a <[abc]>-type enumerated
character class consistently rejects any diacritics on a matching base
character:

/^<[nu]>$/.ACCEPTS("u\x[308]").Bool
False
/^<[nu]>$/.ACCEPTS("n\x[308]").Bool
False

But a <[a..z]>-type range-based character class has inconsistent
behaviour:

/^<[a..z]>$/.ACCEPTS("u\x[308]").Bool
False
/^<[a..z]>$/.ACCEPTS("n\x[308]").Bool
True

The behaviour seems to be that if in NFC the first character of the
grapheme is the unadorned base character then it accepts, but if it's a
combined character then it rejects. This dependence on the representation
breaks the grapheme view of the string, and so is presumably a bug.
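
For illustration, the representation dependence described above can be reproduced outside Raku. The following is a minimal Python sketch, not the actual matcher: the helper `first_nfc_codepoint_in_range` is a hypothetical name, and it assumes the buggy range check looked only at the first code point of the NFC form. Under that assumption it gives the same accept/reject pattern as the report, because "u" plus U+0308 composes to the single code point U+00FC while "n" plus U+0308 has no precomposed form.

```python
import unicodedata


def first_nfc_codepoint_in_range(grapheme, lo="a", hi="z"):
    """Hypothetical model of the reported behaviour:
    test only the first code point of the NFC form against the range."""
    nfc = unicodedata.normalize("NFC", grapheme)
    return lo <= nfc[0] <= hi


# "u" + COMBINING DIAERESIS composes to one code point (U+00FC).
u_umlaut = unicodedata.normalize("NFC", "u\u0308")
# "n" + COMBINING DIAERESIS has no precomposed form, so it stays two code points.
n_umlaut = unicodedata.normalize("NFC", "n\u0308")

print([hex(ord(c)) for c in u_umlaut])          # ['0xfc']
print([hex(ord(c)) for c in n_umlaut])          # ['0x6e', '0x308']
print(first_nfc_codepoint_in_range("u\u0308"))  # False
print(first_nfc_codepoint_in_range("n\u0308"))  # True
```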

I think a <[a..z]>-type range should, with respect to diacritics, behave
either like <lower> or like <[abc]>. I am unable to discern which
is really intended; none of the documentation that I've seen addresses
grapheme semantics. I note that matching a specified base character with
arbitrary diacritics is a meaningful facility, and given that <lower>
et al have that behaviour it should probably be available somewhere.
The character range feature is almost providing it, but it's obviously
not been designed to, because a single-character range such as <[n..n]>
is rejected.

-zefram

p6rt · 2016-08-04T12:23:13Z

From @smls

> I note that matching a specified base character with
> arbitrary diacritics is a meaningful facility, and given that <lower>
> et al have that behaviour it should probably be available somewhere.

This facility is already available in the form of the :ignoremark flag (or :m for short).

I'm pretty sure that <[a..c]> should match exactly the same thing as <[abc]>, and that ("b\x[308]" ~~ /<[a..c]>/) matching is a bug.
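
As a rough model of what "match the base character and ignore any diacritics" means, here is a short Python sketch. It is only an approximation of the idea behind :ignoremark, not Raku's implementation: the helper names `base_char` and `in_range_ignoring_marks` are hypothetical, and it assumes that taking the first code point of the NFD form is a good enough stand-in for "the base character".

```python
import unicodedata


def base_char(grapheme):
    """Return the first code point of the NFD form,
    used here as a stand-in for the base character without its marks."""
    return unicodedata.normalize("NFD", grapheme)[0]


def in_range_ignoring_marks(grapheme, lo="a", hi="z"):
    """Range test on the base character only, so diacritics are ignored."""
    return lo <= base_char(grapheme) <= hi


print(in_range_ignoring_marks("u\u0308"))  # True
print(in_range_ignoring_marks("\u00fc"))   # True
print(in_range_ignoring_marks("n\u0308"))  # True
print(in_range_ignoring_marks("1"))        # False
```

With this model the precomposed and decomposed spellings of the same grapheme give the same answer, which is the property the original report asks for.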

p6rt · 2016-08-04T12:23:13Z

The RT System itself - Status changed from 'new' to 'open'

p6rt · 2016-08-13T19:29:30Z

From @TimToady

The code generator in nqp for char ranges was incorrectly using ordat and ordfirst to find the character to compare, which throw away information on synthetic characters. We now use the getcp_s instruction instead, which leaves synthetics negative, so that they drop out of the character range correctly (but only when :m is not specified, of course).

nqp fix in 2df0a0656e4b20a72bd73e6d2b6214b584d095ac

rakudo bump in fe90be01c6546e1dbb2ee7ff794e8b6ea1491268

Tests updated in 172f6945653ddbce944af1d7cae9ad956f3a70b9
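
The effect of the fix can be sketched in Python as follows. This is a model under stated assumptions, not the nqp or MoarVM code: `SYNTHETIC`, `grapheme_code` and `in_range` are hypothetical names, the input is assumed to be a single grapheme, and any grapheme that does not fit in one NFC code point is mapped to a negative value in the way getcp_s is described as doing for synthetics. A negative value can never fall inside a character range, so such graphemes are rejected regardless of their base character.

```python
import unicodedata

SYNTHETIC = -1  # hypothetical stand-in for the negative value of a synthetic


def grapheme_code(grapheme):
    """Return the code point of a single-code-point grapheme,
    or a negative value if the grapheme needs more than one code point in NFC."""
    nfc = unicodedata.normalize("NFC", grapheme)
    return ord(nfc) if len(nfc) == 1 else SYNTHETIC


def in_range(grapheme, lo="a", hi="z"):
    """Range test on the whole grapheme; synthetics drop out because they are negative."""
    return ord(lo) <= grapheme_code(grapheme) <= ord(hi)


print(in_range("n"))        # True
print(in_range("n\u0308"))  # False
print(in_range("u\u0308"))  # False
```

Both decorated graphemes are now rejected, matching the behaviour of an enumerated class such as <[nu]>.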

p6rt · 2016-08-13T19:29:30Z

@TimToady - Status changed from 'open' to 'resolved'

p6rt closed this as completed Aug 13, 2016

p6rt added the at_larry label Jan 5, 2020
