Report information
Id: 130384
Status: rejected
Priority: 0/
Queue: perl6

Owner: samantham [at] posteo.net
Requestors: samantham [at] posteo.net
Cc:
AdminCc:

Severity: (no value)
Tag: (no value)
Platform: (no value)
Patch Status: (no value)
VM: Moar



Subject: Mo or Mn Unicode characters incorrectly combine with any other character
From: Samantha McVey <samantham [...] posteo.net>
Date: Wed, 21 Dec 2016 11:59:27 -0800
To: rakudobug [...] perl.org
say "ୈ"; # U+0B48 ORIYA VOWEL SIGN AI

Bogus statement
at /home/samantha/git/roast/EVAL_0:1
------> <BOL>⏏'ୈ'
    expecting any of:
        prefix
        term

I discovered this while trying to add a test to roast to cover the Indic_Positional_Category Unicode property.

The most telling part of the bug is:

    say Q<ୈtest<ୈ
    OUTPUT: test

It seems these combining characters are combining with characters they should not combine with. If I try Q-style quoting normally:

    Q<ୈ>
    ===SORRY!=== Error while compiling:
    Couldn't find terminator <ୈ (corresponding <ୈ was at line 1)
    at line 2

It seems this is also true for other Mn or Mo characters.
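
When triaging a report like this, it can help to confirm which Unicode general category the character belongs to. The following is a minimal sketch using Python's standard `unicodedata` module as a backend-neutral reference; it is not Rakudo code, and the variable names are only illustrative.

```python
import unicodedata

mark = "\u0b48"  # the character from the report
base = "<"       # the delimiter it ends up attached to

# Look up the official character name and the general categories.
mark_name = unicodedata.name(mark)
mark_category = unicodedata.category(mark)
base_category = unicodedata.category(base)

print(mark_name, mark_category, base_category)
```

The category reported for the mark depends on the Unicode database bundled with the Python build, so treat the result as a cross-check rather than as a statement about what MoarVM uses internally.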
From: Samantha McVey <samantham [...] posteo.net>
Date: Fri, 23 Dec 2016 15:52:32 -0800
To: perl6-bugs-followup [...] perl.org
Subject: Re: [perl #130384] AutoReply: Mo or Mn Unicode characters incorrectly combine with any other character
According to the Unicode grapheme cluster spec, it looks like ‘degenerates’ do not have to be accounted for in order to support the spec.
> Ignore degenerates. No special provisions are made to get marginally better
> behavior for degenerate cases that never occur in practice, such as an A
> followed by an Indic combining mark.
So we don't *have* to support this case, but the spec makes it very clear that grapheme separation rules are allowed to cover more cases than those covered by the default grapheme cluster table (http://unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table). These degenerate cases are also not tested for in any of the grapheme tests that Unicode provides, so we are free to be smarter here if we wish.
Looks like JVM handles these degenerates nicely:

JVM:
    say Q<ୈtest> #> <ୈtest
    say "<ୈ".chars #> 2
Moar:
    say "<ୈ".chars #> 1
On Wed, 28 Dec 2016 16:19:48 -0800, samantham@posteo.net wrote:
> Looks like JVM handles these degenerates nicely:
>
> JVM:
> say Q<ୈtest> #> <ୈtest
Looks like the JVM backend doesn't implement character counting except by codepoint. So it is not aware of grapheme clusters at all, only of codepoints.
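
The difference between the two counts above can be illustrated with a rough sketch in Python. It assumes that the general categories Mn, Me and Mc can stand in for the Grapheme_Cluster_Break values Extend and SpacingMark; UAX #29 lists exceptions, so this is an approximation of the "no break before a mark" rules and not how either backend is actually implemented. The helper name `split_clusters` is hypothetical.

```python
import unicodedata

# Assumption: Mn, Me and Mc approximate the Extend and SpacingMark
# classes of UAX #29. The real property has exceptions.
MARK_CATEGORIES = {"Mn", "Me", "Mc"}


def split_clusters(text):
    """Attach every mark to the preceding code point, whatever it is."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in MARK_CATEGORIES:
            # No break before a mark: extend the current cluster.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


sample = "<\u0b48"  # "<" followed by U+0B48 ORIYA VOWEL SIGN AI

codepoint_count = len(sample)                # counting by code point
cluster_count = len(split_clusters(sample))  # counting by approximate cluster

print(codepoint_count, cluster_count)
```

Under this sketch the code point count is 2 and the cluster count is 1, mirroring the JVM and Moar results quoted above. Because the rule never looks at what the preceding character is, the mark attaches to the `<` delimiter just as it would to an Oriya consonant, which is the degenerate case the ticket is about.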
Subject: [UNI] degenerates: Mo or Mn Unicode characters combine with punctuation
Changing the subject to indicate that our current functionality is not technically incorrect. I am not sure whether I want to add an LTA tag to this, since I have not yet determined whether resolving this in any way is feasible or wanted from a technical standpoint. I am definitely going to leave this open and will add notes if there is new information.
This bug has been open a while. I have not forgotten it; I had just not reached a final decision. After further thought I'm closing this as WONTFIX. Changing the behavior would needlessly complicate our grapheme concatenation, and I believe it may also break some of the grapheme concatenation tests.

