degenerates: Mo or Mn Unicode characters combine with punctuation #5902

Closed
p6rt opened this issue Dec 21, 2016 · 7 comments
@p6rt

p6rt commented Dec 21, 2016

Migrated from rt.perl.org#130384 (status was 'rejected')

Searchable as RT130384$

@p6rt

p6rt commented Dec 21, 2016

From @samcv

say "ୈ"; # U+0B48 ORIYA VOWEL SIGN AI
Bogus statement
at /home/samantha/git/roast/EVAL_0:1
------> <BOL>⏏'ୈ'
  expecting any of:
  prefix
  term

Discovered this while trying to add a test to roast to cover the
Indic_Positional_Category Unicode property.

The most telling part of the bug is:
say Q<ୈtest<ୈ
OUTPUT: test

It seems these combining characters are combining with characters they should
not combine with.

If I try Q style quoting normally​:

Q<ୈ>
===SORRY!=== Error while compiling:
Couldn't find terminator <ୈ (corresponding <ୈ was at line 1)
at line 2

It seems this is also true for other Mn or Mo characters.
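To check what kind of character is involved, it can be inspected outside Raku. The following is a minimal Python sketch using only the standard `unicodedata` module; the helper name `describe` is made up for illustration and is not part of Rakudo or roast.

```python
import unicodedata

def describe(ch):
    """Return (name, general category, canonical decomposition) for one codepoint."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decomposition(ch),
    )

mark = "\u0b48"  # the character from the report above
name, category, decomposition = describe(mark)

print(name)           # the Unicode character name
print(category)       # a general category in the M* (mark) family
print(decomposition)  # canonical decomposition, if the character has one
```

Any category starting with `M` identifies a combining mark, which is the class of characters this report is about.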

@p6rt

p6rt commented Dec 24, 2016

From @samcv

It looks like, according to the Unicode grapheme cluster rules, ‘degenerates’ do
not have to be accounted for in order to support the spec:

Ignore degenerates. No special provisions are made to get marginally better
behavior for degenerate cases that never occur in practice, such as an A
followed by an Indic combining mark.

So we don't *have* to support this case, but the spec makes it very clear that
the grapheme separation rules are allowed to cover more cases which may not be
covered by the rules laid out in
http://unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table

These degenerate cases are also not covered by any of the grapheme break tests
that Unicode provides, so we are free to be smarter here if we wish.
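The effect of the default rules on these degenerate cases can be sketched as follows. This is a deliberately simplified Python approximation, not MoarVM's implementation: it treats any character whose general category starts with `M` as attaching to the preceding character, standing in for the "do not break before Extend or SpacingMark" rules of UAX #29. The real rules use the Grapheme_Cluster_Break property and have many more cases. The function name `naive_clusters` is hypothetical.

```python
import unicodedata

def naive_clusters(text):
    """Group codepoints into clusters, gluing every mark onto the previous character.

    Assumption: general category M* is used as a rough stand-in for the
    Extend / SpacingMark classes of UAX #29.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch   # mark joins whatever precedes it, even punctuation
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

# The spec's own degenerate example: a Latin letter followed by an Indic mark.
print(len(naive_clusters("A\u0b48")))  # 1

# The string from the report: each "<" absorbs the following mark.
report = naive_clusters("Q<\u0b48test<\u0b48")
print(len(report))                     # 7
print(report[1] == report[6])          # True
```

Under this approximation the second and last clusters of the reported string are the same two-codepoint unit, which is consistent with the parser treating it as both the opening and the closing delimiter.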

@p6rt

p6rt commented Dec 29, 2016

From @samcv

Looks like JVM handles these degenerates nicely:

JVM:
say Q<ୈtest> #> <ୈtest

say "<ୈ".chars #> 2

Moar:
say "<ୈ".chars #> 1

@p6rt

p6rt commented Jan 1, 2017

From @samcv

On Wed, 28 Dec 2016 16:19:48 -0800, samantham@posteo.net wrote:

Looks like JVM handles these degenerates nicely:

JVM:
say Q<ୈtest> #> <ୈtest

Looks like the JVM backend doesn't implement character counting except by codepoint, so it simply has no notion of graphemes above the codepoint level.
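For comparison, this is what a purely codepoint-level count looks like. It is a small Python sketch rather than Raku, and it only illustrates codepoint counting in general, not the JVM backend's actual code.

```python
sample = "<\u0b48"  # "<" followed by U+0B48, as in the examples above

# Python strings are sequences of codepoints, so len() counts codepoints.
codepoints = [f"U+{ord(ch):04X}" for ch in sample]

print(codepoints)   # ['U+003C', 'U+0B48']
print(len(sample))  # 2
```

A count of 2 matches what the JVM backend reports, while MoarVM's count of 1 reflects grapheme-level counting.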

@p6rt

p6rt commented Mar 20, 2017

From @samcv

Changing the subject to indicate that our current functionality is not technically incorrect. I am not sure whether I want to add an LTA tag to this, since I have not yet determined whether resolving this in any way is feasible or wanted from a technical standpoint.

I am definitely going to leave this open and will add more information or notes if there is new information.

@p6rt

p6rt commented Jul 16, 2017

From @samcv

This bug has been open a while. I have not forgotten it; I had just not reached a final decision. After further thought I'm closing this as WONTFIX. Fixing it would needlessly complicate our grapheme concatenation, and I believe it may also break some of the grapheme concatenation tests.

@p6rt

p6rt commented Jul 16, 2017

@samcv - Status changed from 'new' to 'rejected'

@p6rt p6rt closed this as completed Jul 16, 2017
@p6rt p6rt added the uni label Jan 5, 2020