degenerates: Mo or Mn Unicode characters combine with punctuation #5902

p6rt · 2016-12-21T19:59:59Z

Migrated from rt.perl.org#130384 (status was 'rejected')

Searchable as RT130384$

p6rt · 2016-12-21T19:59:59Z

From @samcv

say "ୈ"; # U+0B48 ORIYA VOWEL SIGN AI
Bogus statement
at /home/samantha/git/roast/EVAL_0:1
------> <BOL>⏏'ୈ'
expecting any of:
prefix
term

Discovered this while trying to add a test to roast to cover the
Indic_Positional_Category Unicode property.

The most telling part of the bug is:
say Q<ୈtest<ୈ
OUTPUT: test

It seems these combining characters are combining with characters they should
not combine with.

If I try Q style quoting normally:

Q<ୈ>
===SORRY!=== Error while compiling:
Couldn't find terminator <ୈ (corresponding <ୈ was at line 1)
at line 2

It seems this is also true for other Mn or Mo characters.
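
A minimal way to inspect the clustering directly is to compare codepoint and grapheme counts for the offending string. This is only a sketch: the `\x[0B48]` escape stands in for the literal character, the variable name `$s` is illustrative, and the results depend on the backend, so no output is asserted here.

```raku
# '<' followed by U+0B48 ORIYA VOWEL SIGN AI, written with an escape
my $s = "<\x[0B48]";

say $s.codes;                 # number of codepoints in the string
say $s.chars;                 # number of graphemes in the string
say $s.ords.map(*.uniname);   # Unicode name of each codepoint
say "\x[0B48]".uniprop;       # General_Category of the combining mark
```

If `.chars` is smaller than `.codes`, the `<` has been merged into a single grapheme with the mark, which would explain why the quoting construct cannot find a plain `>` or `<` delimiter.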

p6rt · 2016-12-24T04:55:14Z

From @samcv

It looks like, according to the Unicode grapheme rules, ‘degenerates’ do not
have to be accounted for in supporting the spec.

Ignore degenerates. No special provisions are made to get marginally better
behavior for degenerate cases that never occur in practice, such as an A
followed by an Indic combining mark.

So we don't *have* to support this case, but the spec makes it very clear that
the grapheme separation rules are allowed to cover more cases which may not be
covered by the rules laid out in
http://unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table

These degenerate cases are also not covered by any of the grapheme tests that
Unicode provides, so we are free to be smarter here if we wish.
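
To make the "degenerate" case concrete: the default rules in UAX #29 do not break before an Extend or SpacingMark character (rules GB9 and GB9a), whatever the preceding character is, unless it is a control character. The sketch below compares a sensible base (U+0B15 ORIYA LETTER KA) with the degenerate bases `A` and `<`. The choice of bases and the variable names are assumptions made for illustration, and no output is asserted since it depends on the backend.

```raku
# U+0B48 ORIYA VOWEL SIGN AI, the combining mark from the report
my $mark = "\x[0B48]";

# One sensible base (ORIYA LETTER KA) and two degenerate ones
for "\x[0B15]", "A", "<" -> $base {
    my $s = $base ~ $mark;
    say $s.ords.map(*.uniname).join(' + '),
        ': ', $s.codes, ' codepoints, ',
        $s.chars, ' grapheme(s)';
}
```

An implementation that follows only the default rules would be expected to report the same grapheme count for all three strings; treating the degenerate ones differently is the optional "smarter" behavior discussed above.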

p6rt · 2016-12-29T00:19:48Z

From @samcv

Looks like JVM handles these degenerates nicely:

JVM:
say Q<ୈtest> #> <ୈtest

say "<ୈ".chars #> 2

Moar:
say "<ୈ".chars #> 1

p6rt · 2017-01-01T10:36:26Z

From @samcv

On Wed, 28 Dec 2016 16:19:48 -0800, samantham@posteo.net wrote:

Looks like JVM handles these degenerates nicely:

JVM:
say Q<ୈtest> #> <ୈtest

It looks like the JVM backend doesn't implement character counting except by codepoint, so it is simply not aware of graphemes at all; it only works at the codepoint level.

p6rt · 2017-03-20T08:00:14Z

From @samcv

Changing the subject to indicate that our current functionality is not technically incorrect. I am not sure whether to add an LTA tag to this, since I have not yet determined whether resolving it in any way is feasible or wanted from a technical standpoint.

I am definitely going to leave this open and will add more information or notes if there is new information.

p6rt · 2017-07-16T06:18:59Z

From @samcv

This bug has been open a while. I have not forgotten it; I had just not reached a final decision. After further thought I'm closing this as WONTFIX. It would needlessly complicate our grapheme concatenation, and in addition I believe it may break some of the grapheme concatenation tests.

p6rt · 2017-07-16T06:19:00Z

@samcv - Status changed from 'new' to 'rejected'

p6rt closed this as completed Jul 16, 2017

p6rt added the uni label Jan 5, 2020
