Rakudo doesn't do Unicode Special Casing (uc, tc, uclc, tclc for ffl ligature, Turkish i, etc.) #4381

Closed
p6rt opened this issue Jul 6, 2015 · 5 comments

p6rt commented Jul 6, 2015

Migrated from rt.perl.org#125556 (status was 'resolved')

Searchable as RT125556$

p6rt commented Jul 6, 2015

From @raiph

What I did:

say 'ﬄ'.uc; # say the uppercased version of the ffl ligature (U+FB04)

What I got with camelia (rakudo-moar 01edd3):

  ﬄ

"What I expected":

  FFL


"What I expected" is based on http://unicode.org/Public/UNIDATA/SpecialCasing.txt which defines a bunch of special casing rules:

"The data in this file, combined with the simple case mappings in UnicodeData.txt, defines the full case mappings Lowercase_Mapping (lc), Titlecase_Mapping (tc), and Uppercase_Mapping (uc)."

The entry for ﬄ approximates to:

<code>; <lower>; <title>; <upper>; # <comment>
FB04; FB04; 0046 0066 006C; 0046 0046 004C; # LATIN SMALL LIGATURE FFL

(Note difference between title case and upper case.)
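
To make the field layout concrete, here is a minimal sketch that decodes the entry quoted above. It is written in Python purely for illustration and says nothing about how Rakudo or MoarVM read the file; the helper name `parse_special_casing_line` is hypothetical, and the sketch assumes an unconditional record (no condition list).

```python
def parse_special_casing_line(line):
    """Decode one unconditional SpecialCasing.txt record into strings."""
    data = line.split("#", 1)[0]                 # drop the trailing comment
    fields = [f.strip() for f in data.split(";")]
    code, lower, title, upper = fields[:4]

    def to_str(hex_seq):
        # Each field is a space-separated list of hex codepoints.
        return "".join(chr(int(h, 16)) for h in hex_seq.split())

    return {
        "char": to_str(code),
        "lc": to_str(lower),
        "tc": to_str(title),
        "uc": to_str(upper),
    }


entry = parse_special_casing_line(
    "FB04; FB04; 0046 0066 006C; 0046 0046 004C; # LATIN SMALL LIGATURE FFL"
)
print(entry["tc"], entry["uc"])
```

The lowercase field maps the ligature to itself, while the title and upper fields each expand to three codepoints, which is why a one-codepoint-to-one-codepoint table cannot express them.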


A quick search of MoarVM's source code for SpecialCasing reveals this comment:

  # XXX SpecialCasing.txt # haven't decided how to do it

(in the ucd2c.pl tool)

I'm surmising that Rakudo (MoarVM) does none of this special casing yet.


Digging a little bit more into What I did and What I got:

say 'ﬄ'.uniname

  LATIN SMALL LIGATURE FFL

say 'ﬄ'.NFD

  NFD:0x<fb04>

This precomposed codepoint has no canonical decomposition, so NFD leaves it unchanged. Its compatibility decomposition is to the individual 'f' and 'l' characters of which the ligature is composed, i.e. three codepoints:

say 'ﬄ'.NFKD, 'ﬄ'.NFKD.Str

  NFKD:0x<0066 0066 006c>, ffl
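
The same observation can be cross-checked outside Rakudo. The following sketch uses Python's standard `unicodedata` module (an illustration only, not part of the original report) to show that the decomposition of U+FB04 is tagged as a compatibility mapping, so NFD leaves it alone and only NFKD expands it.

```python
import unicodedata

ffl = "\ufb04"  # LATIN SMALL LIGATURE FFL

name = unicodedata.name(ffl)
decomp = unicodedata.decomposition(ffl)   # carries a <compat> tag, so it is not canonical
nfd = unicodedata.normalize("NFD", ffl)   # canonical decomposition: unchanged
nfkd = unicodedata.normalize("NFKD", ffl) # compatibility decomposition: f, f, l

print(name)
print(decomp)
print([hex(ord(c)) for c in nfd])
print([hex(ord(c)) for c in nfkd])
```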

p6rt commented Sep 20, 2015

From @ShimmerFairy

For the ﬄ ligature, it should be noted that the "Simple_Uppercase_Mapping" property is ﬄ (in other words, itself), and the "Uppercase_Mapping" property is "FFL". I think the best course of action would be for P6 to provide clearly distinguished methods for both kinds, in case you want either the full mappings (.uc) or just the simple mappings (.suc, hypothetically).

The simple mapping offers a guarantee that the length of the string in codepoints will not change, which may be useful in certain situations (working with text that goes into/out of binary files, or working with C strings through NativeCall come to mind). We should definitely make clear that .uc, .lc, .tc, and .fc are for the full version of their case mappings, and I think it wouldn't do harm to make the simple case mappings (which the aforementioned functions are doing currently, AFAICT) available too.
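
As a rough illustration of the length guarantee, here is a Python sketch (not Rakudo code). Python's `str.upper()` applies the full mapping and has no built-in simple mapping, so the hypothetical `simple_upper` below only approximates Simple_Uppercase_Mapping: it keeps any character whose full uppercase is more than one codepoint. That is an assumption that happens to hold for ﬄ and ß but is wrong for characters such as U+1F80, whose simple mapping is a different single codepoint.

```python
def simple_upper(text):
    """Approximate Simple_Uppercase_Mapping: one codepoint in, one codepoint out."""
    out = []
    for ch in text:
        up = ch.upper()                 # full mapping, may expand to several codepoints
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


word = "\ufb04oor"                      # "ﬄoor": 4 codepoints
full = word.upper()                     # full mapping expands the ligature
simple = simple_upper(word)             # approximation keeps the ligature as-is

print(len(word), len(full), len(simple))
```

The guarantee concerns the number of codepoints. The encoded byte length can still differ: for example, U+0131 (ı) takes two bytes in UTF-8, while its simple uppercase, I, takes one.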

To be clear, the .uc, .lc, and .tc methods would take their "Full {$case}case Mapping" property values from SpecialCasing.txt, ignoring records in there with a <condition_list>. With no specified mapping, the full mappings take on their respective simple mappings. (This is my understanding of the comments in SpecialCasing.txt, at least.)
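
A minimal Python sketch of that rule, for illustration only: load the records that have no condition list, and fall back to another mapping for everything else. The sample lines are real SpecialCasing.txt entries; the helper names are hypothetical, and `str.upper()` stands in for the simple mapping of unlisted codepoints, which is an assumption of this sketch rather than anything MoarVM does.

```python
SPECIAL_CASING_SAMPLE = """\
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
FB04; FB04; 0046 0066 006C; 0046 0046 004C; # LATIN SMALL LIGATURE FFL
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""


def hex_to_str(seq):
    return "".join(chr(int(h, 16)) for h in seq.split())


def load_unconditional(text):
    """Return {char: {"lc", "tc", "uc"}} for records without a condition list."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        # Layout: code, lower, title, upper, [condition_list], "" after the last ';'
        conditions = fields[4] if len(fields) > 4 else ""
        if conditions:
            continue                    # skip Final_Sigma, tr, az, lt, ...
        code, lower, title, upper = fields[:4]
        table[hex_to_str(code)] = {
            "lc": hex_to_str(lower),
            "tc": hex_to_str(title),
            "uc": hex_to_str(upper),
        }
    return table


def full_uc(text, table):
    # Stand-in fallback: str.upper() plays the role of the simple mapping here.
    return "".join(table[ch]["uc"] if ch in table else ch.upper() for ch in text)


table = load_unconditional(SPECIAL_CASING_SAMPLE)
print(sorted(table), full_uc("\ufb04o\u00df", table))
```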

For .fc (case folding), those would be taken from CaseFolding.txt, with full mapping getting F and C type mappings in that file, simple mapping getting S and C, type T being ignored, and unlisted codepoints having an implied C mapping to itself.
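
The status selection described here could look roughly like the following Python sketch (illustrative only; `load_folds` and `fold` are hypothetical names). The sample lines are real CaseFolding.txt entries. Full folding keeps statuses C and F, simple folding keeps C and S, T is never selected, and unlisted codepoints fold to themselves.

```python
CASE_FOLDING_SAMPLE = """\
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
FB04; F; 0066 0066 006C; # LATIN SMALL LIGATURE FFL
"""


def load_folds(text, statuses):
    """Build {char: folded} from the records whose status is in `statuses`."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        code, status, mapping = [f.strip() for f in data.split(";")][:3]
        if status in statuses:
            folded = "".join(chr(int(h, 16)) for h in mapping.split())
            table[chr(int(code, 16))] = folded
    return table


FULL_FOLDS = load_folds(CASE_FOLDING_SAMPLE, {"C", "F"})
SIMPLE_FOLDS = load_folds(CASE_FOLDING_SAMPLE, {"C", "S"})


def fold(text, table):
    # Unlisted codepoints have an implied mapping to themselves.
    return "".join(table.get(ch, ch) for ch in text)


print(fold("\u1e9e", FULL_FOLDS), fold("\u1e9e", SIMPLE_FOLDS))
```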

(.tclc as a result of this would imply full case mappings according to its name, though I'm not sure if we need the .stclc, .tcslc, and .stcslc variants that could then be suggested)

The conditional mappings in SpecialCasing.txt and CaseFolding.txt we'll most likely want to ignore for the time being (this includes Turkish i/İ), until we can better figure out how Unicode tailoring figures into P6 (likely a post-6.0.0 thing). I imagine CLDR would be involved in that as well, though I'm not yet familiar with CLDR.

Those are my thoughts on the subject; I've been thinking about how to improve our stringy stuff lately, so this happens to be on my mind. I think we can fix at least a few problems in this area by, at the very least, clarifying that our existing casing methods are meant to perform full, unconditional case mappings :)

p6rt commented Sep 20, 2015

The RT System itself - Status changed from 'new' to 'open'

p6rt commented Oct 9, 2015

From @jnthn

On Sun Jul 05 17:56:45 2015, raiph wrote:

What I did:

say 'ﬄ'.uc; # say the uppercased version of an ffl ligature

What I got with camelia (rakudo-moar 01edd3):

ﬄ

"What I expected":

FFL

----

"What I expected" is based on http://unicode.org/Public/UNIDATA/SpecialCasing.txt which defines a bunch of special casing rules:

"The data in this file, combined with the simple case mappings in UnicodeData.txt, defines the full case mappings Lowercase_Mapping (lc), Titlecase_Mapping (tc), and Uppercase_Mapping (uc)."

The entry for ﬄ approximates to:

<code>; <lower>; <title>; <upper>; # <comment>
FB04; FB04; 0046 0066 006C; 0046 0046 004C; # LATIN SMALL LIGATURE FFL

(Note difference between title case and upper case.)

----

A quick search of MoarVM's source code for SpecialCasing reveals this comment:

# XXX SpecialCasing.txt # haven't decided how to do it

(in the ucd2c.pl tool)

I'm surmising that Rakudo (MoarVM) does none of this special casing yet.

----

We handle SpecialCasing in MoarVM now. I've added and unfudged various spectests covering that. The Greek final sigma is also properly handled, and the various cases are well tested.
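
For readers unfamiliar with the final sigma rule: a capital sigma lowercases to ς at the end of a word and to σ elsewhere. The sketch below shows the expected results using Python's `str.lower()`, which, to my knowledge, applies the Final_Sigma condition; it is only an illustration and is not taken from the MoarVM spectests.

```python
words = ["ΟΔΟΣ", "ΣΟΦΙΑ", "Σ"]

# Word-final sigma after a cased letter becomes ς; elsewhere it becomes σ.
lowered = {w: w.lower() for w in words}

for original, result in lowered.items():
    print(original, "->", result)
```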

The Turkish i is not something a generic Unicode implementation should handle by default; it's marked with a language condition (tr and az) in SpecialCasing.txt. Handling of such conditional mappings will be left to module space for the time being.
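
As a rough idea of what such a module-space tailoring might do, here is a hedged Python sketch. The function names `turkish_upper` and `turkish_lower` are hypothetical and do not correspond to any existing module; the sketch only swaps the dotted and dotless pairs before applying the default mapping, and it ignores the combining-dot conditions that SpecialCasing.txt also lists for tr and az.

```python
def turkish_upper(text):
    # Tailoring: i -> İ (U+0130) and ı (U+0131) -> I, then the default mapping.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


def turkish_lower(text):
    # Tailoring: İ (U+0130) -> i and I -> ı (U+0131), then the default mapping.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


default_result = "istanbul".upper()
tailored_result = turkish_upper("istanbul")
print(default_result, tailored_result, turkish_lower("ISPARTA"))
```

Doing the swap before the default mapping keeps the untailored path untouched, which matches the idea of leaving the core methods generic and layering language rules on top.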

p6rt commented Oct 9, 2015

@jnthn - Status changed from 'open' to 'resolved'

p6rt closed this as completed Oct 9, 2015