नि -- a grapheme cluster boundary algo problem (CCC = 0 can be a valid combiner?) #4492
Comments
From @raiph

jnthn++ and others are busy with work that is far more important and urgent than dealing with this right now. I'm filing this bug now because there are reasons to consider addressing it before Christmas, as explained below.

What I did: say "नि".chars
What I expected: 1
What I got: 2

Some reasons why I think it's appropriate to classify नि as a single grapheme:

1. It's the last of 4 sample single graphemes in the "Extended Grapheme Clusters" section of the Unicode Standard Annex #29 on Text Segmentation: http://www.unicode.org/reports/tr29/tr29-27.html#Table_Sample_Grapheme_Clusters (The Unicode standard suggests aiming at Extended Grapheme Clusters at a minimum if an implementation wishes to deal with grapheme clusters.)
2. It's the first example in S15: https://raw.githubusercontent.com/perl6/specs/master/S15-unicode.pod
3. It behaves as a single unit for selection in my browser. (You too?)

The bug I'm reporting in this RT was discussed briefly today on IRC:

jnthn m: say "नि".NFC.list.say

So, presumably, to match Unicode's default extended grapheme cluster definition, the CCC > 0 condition is insufficient for identifying combiners, including one that's part of a sample grapheme that the Unicode standard saw fit to highlight. It's this latter point (that it's become a go-to example on the net) that's one of the main reasons I'm filing this bug; I don't otherwise use Devanagari!
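For anyone who wants to check the CCC point outside Rakudo, here is a minimal sketch in Python (not Perl 6), using only the standard `unicodedata` module. The function names `describe`, `naive_cluster_count` and `mark_aware_cluster_count` are made up for illustration. The "mark-aware" count is only a rough approximation, not the UAX #29 algorithm, and nothing here is meant to reflect how Rakudo's NFG is implemented.

```python
import unicodedata

# "नि" written with escapes: DEVANAGARI LETTER NA + DEVANAGARI VOWEL SIGN I
SAMPLE = "\u0928\u093F"


def describe(text):
    """Return (code point, name, general category, CCC) for each code point."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.name(ch),
            unicodedata.category(ch),
            unicodedata.combining(ch),  # canonical combining class
        )
        for ch in text
    ]


def naive_cluster_count(text):
    """Count clusters assuming only CCC > 0 code points are combiners.

    This is the rule the report argues is insufficient.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


def mark_aware_cluster_count(text):
    """Rough approximation: any Mark (Mn, Mc, Me) joins the previous cluster.

    Hypothetical helper for illustration only; it is NOT the full
    extended grapheme cluster algorithm from UAX #29.
    """
    count = 0
    for i, ch in enumerate(text):
        if i == 0 or not unicodedata.category(ch).startswith("M"):
            count += 1
    return count


if __name__ == "__main__":
    for row in describe(SAMPLE):
        print(row)
    print("CCC-only count:  ", naive_cluster_count(SAMPLE))
    print("mark-aware count:", mark_aware_cluster_count(SAMPLE))
```

The vowel sign U+093F is a spacing combining mark (category Mc) whose CCC is 0, so a CCC-only test treats it as a starter and counts two clusters, while a test that also looks at the general category counts one.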
From @FROGGS

Here are more examples (by novapatch++ via irc):

m: say «நி �ำ षि "\r\n"»».chars
The RT System itself - Status changed from 'new' to 'open'
From @coke

On Mon Oct 26 11:44:45 2015, FROGGS.de wrote:

These now all report as 1, with the exception of \r\n.
From @jnthn

On Thu Aug 27 16:37:21 2015, raiph wrote:

Our NFG algorithm has now been aligned with the definition of graphemes provided in Unicode Standard Annex #29. The Unicode database provides a test suite, which has been incorporated into the spectests in S15-nfg/grapheme-break.t (over 400 tests, all passing).

Thanks,
/jnthn
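To give a feel for what such a test suite checks, here is a hedged sketch in Python (not the Perl 6 used in S15-nfg/grapheme-break.t). It assumes the line format of the Unicode `GraphemeBreakTest.txt` data file as I understand it: hex code points separated by `÷` (break allowed) or `×` (no break), with an optional `#` comment. The function name `parse_break_test_line` and the sample lines are illustrative, not copied from the real file or from the spectests.

```python
def parse_break_test_line(line):
    """Split one break-test line into its expected grapheme clusters.

    Assumed format: '÷ 0928 × 093F ÷  # comment'
    '÷' marks a boundary, '×' marks "no boundary here".
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    clusters = []
    current = ""
    for token in data.split():
        if token == "÷":
            # boundary: close the cluster collected so far, if any
            if current:
                clusters.append(current)
                current = ""
        elif token == "×":
            # no boundary: keep extending the current cluster
            continue
        else:
            current += chr(int(token, 16))
    if current:
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    # Illustrative lines written in the assumed format
    samples = [
        "÷ 0928 × 093F ÷\t# NA + VOWEL SIGN I",
        "÷ 000D × 000A ÷\t# CR LF",
        "÷ 0020 ÷ 0020 ÷\t# two spaces",
    ]
    for line in samples:
        expected = parse_break_test_line(line)
        print(len(expected), [ascii(c) for c in expected])
```

A harness would then compare `len(expected)` with the cluster count the implementation reports for the joined string, which is roughly what `.chars` returns in the examples above.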
@jnthn - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#125927 (status was 'resolved')
Searchable as RT125927$