नि -- a grapheme cluster boundary algo problem (CCC = 0 can be a valid combiner?) #4492

Closed
p6rt opened this issue Aug 27, 2015 · 6 comments

Comments

@p6rt

p6rt commented Aug 27, 2015

Migrated from rt.perl.org#125927 (status was 'resolved')

Searchable as RT125927$

@p6rt
Author

p6rt commented Aug 27, 2015

From @raiph

jnthn++ and others are busy with work that is far more important and urgent than dealing with this right now. I'm filing this bug now because there are reasons to consider addressing it before Christmas, as explained below.

What I did

say "नि".chars

What I expected

1

What I got

2


Some reasons why I think it's appropriate to classify नि as a single grapheme:

1. It's the last of 4 sample single graphemes in the "Extended Grapheme Clusters" section of the Unicode Standard Annex #29 on Text Segmentation: http://www.unicode.org/reports/tr29/tr29-27.html#Table_Sample_Grapheme_Clusters

(The Unicode standard suggests aiming at Extended Grapheme Clusters at a minimum if an implementation wishes to deal with grapheme clusters.)

2. It's the first example in S15: https://raw.githubusercontent.com/perl6/specs/master/S15-unicode.pod

3. It behaves as a single unit for selection in my browser. (You too?)


The bug I'm reporting in this RT was discussed briefly today on IRC:

jnthn m: say "नि".NFC.list.say
camelia OUTPUT«2344 2367␤True␤»
jnthn m: say uniprop(2367, 'Canonical_Combining_Class')
camelia OUTPUT«0␤»
jnthn ... combiners are identified in the NFG algo by having a CCC > 0


So, presumably, the CCC > 0 condition is insufficient for identifying combiners if the goal is to match Unicode's default extended grapheme cluster definition. It misses code point 2367 (U+093F), which has a CCC of 0 yet is part of a sample grapheme that the Unicode standard saw fit to highlight. It's this latter point -- that it's become a go-to example on the net -- that's one of the main reasons I'm filing this bug; I don't otherwise use Devanagari!
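The property values quoted above can be checked independently of Rakudo. What follows is a minimal sketch in Python rather than Perl 6, chosen only because its standard `unicodedata` module exposes both properties. The helper name `describe` is hypothetical. It reports the Canonical_Combining_Class and General_Category of each code point in नि.

```python
import unicodedata

def describe(ch):
    """Return (code point, name, CCC, General_Category) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.combining(ch),   # Canonical_Combining_Class as an int
        unicodedata.category(ch),    # General_Category, e.g. "Lo" or "Mc"
    )

# "नि" written with escapes: U+0928 (decimal 2344) followed by U+093F (decimal 2367)
for ch in "\u0928\u093f":
    print(describe(ch))
```

The second code point has a combining class of 0 but a General_Category of Mc (spacing combining mark), which is the kind of character a CCC > 0 test cannot see.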

@p6rt
Author

p6rt commented Oct 26, 2015

From @FROGGS

Here are more examples (by novapatch++ via irc):

m: say «நி กำ षि "\r\n"»».chars
rakudo-moar cd7766: OUTPUT«(2 2 2 2)␤»

@p6rt
Author

p6rt commented Oct 26, 2015

The RT System itself - Status changed from 'new' to 'open'

@p6rt
Author

p6rt commented Nov 2, 2015

From @coke

On Mon Oct 26 11:44:45 2015, FROGGS.de wrote:

Here are more examples (by novapatch++ via irc):

m: say «நி กำ षि "\r\n"»».chars
rakudo-moar cd7766: OUTPUT«(2 2 2 2)␤»

These now all report as 1, with the exception of \r\n.
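
For context on why `\r\n` is expected to count as 1 too: UAX #29 rule GB3 forbids a break between CR and LF, while GB9 and GB9a forbid a break before Extend and SpacingMark characters. Below is a minimal Python sketch of just those rules. It is not Rakudo's NFG implementation, and the `GCB` table is a hand-written, hypothetical subset covering only code points from this thread; a real implementation would load GraphemeBreakProperty.txt and apply the full rule set.

```python
# Hand-picked Grapheme_Cluster_Break values for code points discussed in
# this thread only; every other code point is treated as "Other".
GCB = {
    0x000D: "CR",
    0x000A: "LF",
    0x093F: "SpacingMark",  # DEVANAGARI VOWEL SIGN I
    0x0BBF: "SpacingMark",  # TAMIL VOWEL SIGN I
    0x0E33: "SpacingMark",  # THAI CHARACTER SARA AM
}

def is_break(prev, cur):
    """Decide whether a cluster boundary falls between two code points."""
    a = GCB.get(prev, "Other")
    b = GCB.get(cur, "Other")
    if a == "CR" and b == "LF":
        return False   # GB3: no break between CR and LF
    if a in ("CR", "LF") or b in ("CR", "LF"):
        return True    # GB4, GB5 (Control is omitted from this subset)
    if b in ("Extend", "SpacingMark"):
        return False   # GB9, GB9a: no break before a combiner
    return True        # GB999: break everywhere else

def clusters(text):
    """Split text into grapheme clusters using only the rules above."""
    out = []
    for ch in text:
        if out and not is_break(ord(out[-1][-1]), ord(ch)):
            out[-1] += ch
        else:
            out.append(ch)
    return out

print(len(clusters("\u0928\u093f")), len(clusters("\r\n")))
```

Note that THAI CHARACTER SARA AM is listed as SpacingMark even though its General_Category is Lo, so neither CCC nor General_Category alone is enough to reproduce the property.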

--
Will "Coke" Coleda

@p6rt
Author

p6rt commented Nov 3, 2015

From @jnthn

On Thu Aug 27 16​:37​:21 2015, raiph wrote​:

jnthn++ and others are busy with work that is far more important and
urgent than dealing with this right now. I'm filing this bug now
because there are reasons to consider addressing it before christmas
as explained below.

What I did

say "नि".chars

What I expected

1

What I got

2

-----------------

Some reasons why I think it's appropriate to classify नि as a single
grapheme:

1. It's the last of 4 sample single graphemes in the "Extended
Grapheme Clusters" section of the Unicode Standard Annex #29 on Text
Segmentation:
http://www.unicode.org/reports/tr29/tr29-27.html#Table_Sample_Grapheme_Clusters

(The Unicode standard suggests aiming at Extended Grapheme Clusters at
a minimum if an implementation wishes to deal with grapheme clusters.)

2. It's the first example in S15:
https://raw.githubusercontent.com/perl6/specs/master/S15-unicode.pod

3. It behaves as a single unit for selection in my browser. (You too?)

--------

Our NFG algorithm has now been aligned with the definition of graphemes provided in Unicode Standard Annex #29. The Unicode database provides a test suite, which has been incorporated into the spectests in S15-nfg/grapheme-break.t (over 400 tests, all passing).
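
The test data referred to here is GraphemeBreakTest.txt from the Unicode Character Database. Each line lists hex code points separated by ÷ (break) or × (no break), followed by a `#` comment. Below is a minimal Python sketch of turning one such line into the expected clusters. It is not the code in S15-nfg/grapheme-break.t; `parse_break_test` is a hypothetical name, and the sample line is hand-written in that format rather than copied from the file.

```python
def parse_break_test(line):
    """Parse one GraphemeBreakTest.txt-style line into a list of clusters."""
    tokens = line.split("#", 1)[0].split()   # drop the trailing comment
    clusters, current = [], ""
    for token in tokens:
        if token == "÷":                     # break: close the current cluster
            if current:
                clusters.append(current)
                current = ""
        elif token == "×":                   # no break: keep accumulating
            continue
        else:
            current += chr(int(token, 16))   # hex code point
    if current:
        clusters.append(current)
    return clusters

sample = "÷ 0928 × 093F ÷ 000D × 000A ÷\t# hand-written example"
expected = parse_break_test(sample)
print(len(expected), [len(c) for c in expected])
```

A test harness can then compare the number of clusters in `expected` with what the implementation under test reports for the concatenated string.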

Thanks,

/jnthn

@p6rt p6rt closed this as completed Nov 3, 2015
@p6rt
Author

p6rt commented Nov 3, 2015

@jnthn - Status changed from 'open' to 'resolved'
