Rakudo doesn't do Unicode Special Casing (uc, tc, uclc, tclc for ffl ligature,Turkish i, etc.) #4381

p6rt · 2015-07-06T00:56:45Z

Migrated from rt.perl.org#125556 (status was 'resolved')

Searchable as RT125556$

p6rt · 2015-07-06T00:56:45Z

From @raiph

What I did:

say 'ﬄ'.uc; # say the uppercased version of an ffl ligature

What I got with camelia (rakudo-moar 01edd3):

ﬄ

"What I expected":

FFL

"What I expected" is based on http://unicode.org/Public/UNIDATA/SpecialCasing.txt which defines a bunch of special casing rules:

"The data in this file, combined with the simple case mappings in UnicodeData.txt, defines the full case mappings Lowercase_Mapping (lc), Titlecase_Mapping (tc), and Uppercase_Mapping (uc)."

The entry for ﬄ approximates to:

<code>; <lower>; <title>; <upper>; # <comment>
FB04; FB04; 0046 0066 006C; 0046 0046 004C; # LATIN SMALL LIGATURE FFL

(Note difference between title case and upper case.)
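As a cross-check outside Rakudo, here is a minimal Python sketch of the three mappings in that entry. It assumes Python 3's built-in `str.upper`, `str.title`, and `str.lower`, which apply full case mappings; it illustrates the expected results only and says nothing about how Rakudo or MoarVM implement them.

```python
# U+FB04 LATIN SMALL LIGATURE FFL, written as an escape to avoid font issues.
ffl = "\ufb04"

upper = ffl.upper()  # full uppercase mapping (0046 0046 004C)
title = ffl.title()  # full titlecase mapping (0046 0066 006C)
lower = ffl.lower()  # lowercase mapping (FB04, i.e. unchanged)

print(upper, title, lower == ffl)
```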

A quick search of MoarVM's source code for SpecialCasing reveals this comment:

# XXX SpecialCasing.txt # haven't decided how to do it

(in the ucd2c.pl tool)

I'm surmising that Rakudo (MoarVM) does none of this special casing yet.

Digging a little bit more in to What I did and What I got:

say 'ﬄ'.uniname

LATIN SMALL LIGATURE FFL

say 'ﬄ'.NFD

NFD:0x<fb04>

This precomposed codepoint has no canonical decomposition, which is why NFD leaves it unchanged. Its compatibility decomposition is to the individual 'f', 'f', and 'l' characters of which the ligature is composed, i.e. three codepoints:

say 'ﬄ'.NFKD, 'ﬄ'.NFKD.Str

NFKD:0x<0066 0066 006c>, ffl
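The same two normalization forms can be checked independently of Rakudo. This is a small sketch using Python's standard `unicodedata` module; the variable names are only illustrative.

```python
import unicodedata

ffl = "\ufb04"  # LATIN SMALL LIGATURE FFL

nfd = unicodedata.normalize("NFD", ffl)    # canonical decomposition
nfkd = unicodedata.normalize("NFKD", ffl)  # compatibility decomposition

# Print the code points of each form as hex.
print([f"{ord(c):04x}" for c in nfd])
print([f"{ord(c):04x}" for c in nfkd])
```

NFD returns the ligature untouched, while NFKD yields the three separate letters.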

p6rt · 2015-09-20T23:04:46Z

From @ShimmerFairy

For the ﬄ ligature, it should be noted that the "Simple_Uppercase_Mapping" property is ﬄ (in other words, itself), and the "Uppercase_Mapping" property is "FFL". I think the best course of action would be for P6 to provide distinct methods for both kinds, in the event that you want either the full mappings (.uc) or just the simple mappings (.suc, hypothetically).

The simple mapping offers a guarantee that the length of the string will not change, which may be useful in certain situations (working with text that goes into/out of binary files, or working with C strings through NativeCall come to mind). We should definitely make clear that .uc, .lc, .tc, and .fc are for the full version of their case mappings, and I think it wouldn't do harm to make the simple case mappings (which the aforementioned functions are doing currently, AFAICT) available too.
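To make the length concern concrete, here is a rough Python sketch. It assumes Python's `str.upper`, which uses full mappings; `expands_on_upper` is a hypothetical helper written for this illustration, not an existing API in Python or P6.

```python
def expands_on_upper(text):
    """Return the characters whose full uppercase mapping is longer than one code point."""
    return [ch for ch in text if len(ch.upper()) > 1]

sample = "stra\u00dfe \ufb04"  # "straße" followed by the ffl ligature

# The full mapping changes the length of the string in code points.
print(len(sample), len(sample.upper()))
print(expands_on_upper(sample))
```

Any code that sizes a buffer from the length of the original string would be caught out by the two characters this reports.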

To be clear, the .uc, .lc, and .tc methods would take their "Full {$case}case Mapping" property values from SpecialCasing.txt, ignoring records in there with a <condition_list>. With no specified mapping, the full mappings take on their respective simple mappings. (This is my understanding of the comments in SpecialCasing.txt, at least.)
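A minimal sketch of that reading of SpecialCasing.txt, in Python. The four sample records are copied from the data file; the function and variable names are hypothetical, and the fallback to simple mappings for unlisted codepoints is left out.

```python
SPECIAL_CASING_SAMPLE = """\
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
FB04; FB04; 0046 0066 006C; 0046 0046 004C; # LATIN SMALL LIGATURE FFL
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
0049; 0131; 0049; 0049; tr; # LATIN CAPITAL LETTER I
"""

def to_str(field):
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())

def unconditional_mappings(text):
    """Parse SpecialCasing-style records, skipping any with a condition list."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        code, lower, title, upper = fields[:4]
        condition = fields[4] if len(fields) > 4 else ""
        if condition:
            continue  # e.g. Final_Sigma or tr: ignored here
        table[to_str(code)] = {
            "lc": to_str(lower),
            "tc": to_str(title),
            "uc": to_str(upper),
        }
    return table

special = unconditional_mappings(SPECIAL_CASING_SAMPLE)
print(sorted(f"{ord(ch):04X}" for ch in special))
print(special["\ufb04"])
```

Only the sharp s and the ligature survive the filter; the sigma and Turkish I records carry a condition and are dropped.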

For .fc (case folding), those would be taken from CaseFolding.txt, with full mapping getting F and C type mappings in that file, simple mapping getting S and C, type T being ignored, and unlisted codepoints having an implied C mapping to itself.
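The status selection can be sketched the same way. The sample records below are copied from CaseFolding.txt; `build_folds` and `fold` are hypothetical names, and unlisted codepoints fall through to themselves as described.

```python
CASE_FOLDING_SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
FB04; F; 0066 0066 006C; # LATIN SMALL LIGATURE FFL
"""

def build_folds(text):
    """Return (full, simple) folding tables from CaseFolding-style records."""
    full, simple = {}, {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code, status, mapping = [f.strip() for f in data.split(";")][:3]
        source = chr(int(code, 16))
        target = "".join(chr(int(cp, 16)) for cp in mapping.split())
        if status in ("C", "F"):
            full[source] = target
        if status in ("C", "S"):
            simple[source] = target
        # Status T (Turkic) is deliberately ignored.
    return full, simple

def fold(text, table):
    """Fold each character, mapping unlisted characters to themselves."""
    return "".join(table.get(ch, ch) for ch in text)

full_fold, simple_fold = build_folds(CASE_FOLDING_SAMPLE)
print(fold("\u1e9e", full_fold), fold("\u1e9e", simple_fold))
print(fold("\ufb04", full_fold), fold("\ufb04", simple_fold) == "\ufb04")
```

The capital sharp s shows the difference most clearly: it folds to "ss" under the full table and to a single ß under the simple one.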

(.tclc as a result of this would imply full case mappings according to its name, though I'm not sure if we need the .stclc, .tcslc, and .stcslc variants that could then be suggested)

The conditional mappings in SpecialCasing.txt and CaseFolding.txt we'll most likely want to ignore for the time being (this includes Turkish i/İ), until we can better figure out how Unicode tailoring figures into P6 (likely a post-6.0.0 thing). I imagine CLDR would be involved in that as well, though I'm not yet familiar with CLDR.

Those are my thoughts on the subject; I've been thinking about how to improve our stringy stuff lately, so this happens to be something on my mind. I think we can fix at least a few problems in this area by, at the very least, clarifying that our existing casing methods perform full, unconditional case mappings :)

p6rt · 2015-09-20T23:04:47Z

The RT System itself - Status changed from 'new' to 'open'

p6rt · 2015-10-09T09:53:51Z

From @jnthn

On Sun Jul 05 17:56:45 2015, raiph wrote:

What I did:

say 'ﬄ'.uc; # say the uppercased version of an ffl ligature

What I got with camelia (rakudo-moar 01edd3):

ﬄ

"What I expected":

FFL

----

"What I expected" is based on
http://unicode.org/Public/UNIDATA/SpecialCasing.txt which defines a
bunch of special casing rules:

"The data in this file, combined with the simple case mappings in
UnicodeData.txt, defines the full case mappings Lowercase_Mapping
(lc), Titlecase_Mapping (tc), and Uppercase_Mapping (uc)."

The entry for ﬄ approximates to:

<code>; <lower>; <title>; <upper>; # <comment>
FB04; FB04; 0046 0066 006C; 0046 0046 004C; # LATIN SMALL LIGATURE FFL

(Note difference between title case and upper case.)

----

A quick search of MoarVM's source code for SpecialCasing reveals this
comment:

# XXX SpecialCasing.txt # haven't decided how to do it

(in the ucd2c.pl tool)

I'm surmising that Rakudo (MoarVM) does none of this special casing
yet.

----

We handle SpecialCasing in MoarVM now. I've added and unfudged various spectests covering that. The Greek final sigma is also handled properly, and the various cases are well tested.

The Turkish i is not something a generic Unicode implementation should handle; it's marked with a regional condition in SpecialCasing.txt. Handling of those conditional mappings will be left to module space for the time being.
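For comparison, the same split between contextual and regional rules can be seen in another implementation. This sketch assumes Python 3.3 or later, whose `str.lower` applies the Final_Sigma rule but no locale-specific tailoring; it is not a description of the MoarVM code.

```python
word = "\u03a3\u0391\u03a3"  # GREEK CAPITAL SIGMA, ALPHA, SIGMA

# Word-initial sigma becomes U+03C3; word-final sigma becomes U+03C2.
lowered = word.lower()
print([f"{ord(c):04x}" for c in lowered])

# The tr/az records are not applied: I and i map to each other as in English.
turkish_rule_applied = "I".lower() == "\u0131"  # U+0131 is dotless i
print(turkish_rule_applied)
```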

p6rt · 2015-10-09T09:53:52Z

@jnthn - Status changed from 'open' to 'resolved'

p6rt closed this as completed Oct 9, 2015
