chr() aliases codepoint numbers mod 2**32 #6123

Closed
p6rt opened this issue Mar 4, 2017 · 10 comments

Comments


p6rt commented Mar 4, 2017

Migrated from rt.perl.org#130914 (status was 'resolved')

Searchable as RT130914$


p6rt commented Mar 4, 2017

From zefram@fysh.org

chr(0x100000001).ords
(1)
"\x[100000001]".ords
(1)
chr(-0xffffffff).ords
(1)

chr() is reducing the supplied codepoint number mod 2**32. The output
produced is not what the user asked for. chr() should instead just
signal an error for any codepoint outside the supported [0, 2**31) range.

-zefram
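The aliasing zefram describes is ordinary wrap-around into an unsigned 32-bit integer. A minimal Python sketch (illustrative only, not the Rakudo code) of what the buggy chr() effectively computes:

```python
# Model of the buggy behaviour: the codepoint is reduced mod 2**32
# before conversion, so wildly different inputs alias to the same char.
def buggy_chr(cp: int) -> str:
    return chr(cp % 2**32)

assert buggy_chr(0x100000001) == chr(1)   # 2**32 + 1 wraps to 1
assert buggy_chr(-0xFFFFFFFF) == chr(1)   # -(2**32 - 1) also wraps to 1
```

Both inputs from the transcript above land on codepoint 1, which is why `.ords` reports `(1)` in each case.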


p6rt commented Mar 4, 2017

From @lizmat

Fixed with rakudo/rakudo@20fa14be7a , tests needed.

On 4 Mar 2017, at 11:24, Zefram (via RT) <perl6-bugs-followup@perl.org> wrote:

# New Ticket Created by Zefram
# Please include the string: [perl #130914]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt-archive.perl.org/perl6/Ticket/Display.html?id=130914 >

chr(0x100000001).ords
(1)
"\x[100000001]".ords
(1)
chr(-0xffffffff).ords
(1)

chr() is reducing the supplied codepoint number mod 2**32. The output
produced is not what the user asked for. chr() should instead just
signal an error for any codepoint outside the supported [0, 2**31) range.

-zefram


p6rt commented Mar 4, 2017

The RT System itself - Status changed from 'new' to 'open'


p6rt commented Mar 19, 2017

From @usev6

I started to add a test or two for this issue, but then I found the following test in S29-conversions/ord_and_chr.t​:

  #?rakudo.moar todo 'chr max RT #124837'
  dies-ok {chr(0x10FFFF+1)}, "chr out of range (max)";

Looking at https://en.wikipedia.org/wiki/Code_point and http://www.unicode.org/glossary/#code_point I understand that U+10FFFF is indeed the maximum Unicode code point.

On the JVM backend we already throw for invalid code points (this is handled by class Character, method toChars under the hood: https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#toChars-int-):

$ ./perl6-j -e 'say chr(0x10FFFF+1)'
java.lang.IllegalArgumentException
  in block <unit> at -e line 1

So, IMHO we could do better on MoarVM as well. It feels to me that the check for valid code points shouldn't be implemented in NQP, but in MoarVM. Actually, MVM_unicode_get_name already has such a check implemented.
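For illustration, the kind of range check that java.lang.Character.toChars performs can be sketched in a few lines (Python here, purely illustrative; the real fix belongs in MoarVM/NQP):

```python
MAX_CODE_POINT = 0x10FFFF  # U+10FFFF is the last valid Unicode code point

def checked_chr(cp: int) -> str:
    """Reject any value outside [0, 0x10FFFF] instead of wrapping it."""
    if not (0 <= cp <= MAX_CODE_POINT):
        raise ValueError(f"invalid code point: {cp:#x}")
    return chr(cp)

# checked_chr(0x10FFFF + 1) raises ValueError instead of aliasing.
```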



p6rt commented Mar 20, 2017

From @toolforger

On 19.03.2017 at 23:00, Christian Bartolomaeus via RT wrote:

Looking at https://en.wikipedia.org/wiki/Code_point and
http://www.unicode.org/glossary/#code_point I understand that
U+10FFFF is indeed the maximum Unicode code point.

Yes, that's the maximum value you can encode in four bytes with UTF-8,
see https://en.wikipedia.org/wiki/UTF-8#Description.

I was wondering how the Unicode consortium might extend this limit, so I
investigated a bit.

TL;DR

I can confirm that 10ffff is going to remain the maximum for the
foreseeable future.

DETAILS

Technical limits:

  UTF-8 could be extended up to 0x4108410ffff [1]
  UTF-16 ("surrogate pairs") cannot be extended beyond 0x10ffff
  UTF-32 can be extended up to 0xffffffff (32 bits available)

Political limits:

Since Java chose to use surrogate pairs, and UTF-16 is not extensible,
any motion to extend the Unicode code range would be met with opposition
from Oracle, and from any language community that has a JVM
implementation and wants to be interoperable with Java libraries.

Code space exhaustion​:

Unicode assigns code points like this:
  characters: 128,237 code points
  private use: U+e000—U+f8ff (6,400 code points) [2]
               U+f0000—U+ffffd (65,534) [2]
               U+100000—U+10fffd (65,534) [2]
  surrogates: U+d800—U+dfff (2,048 code points) [2]
So out of the
  0x10ffff = 1,114,111 available code points,
  128,237 + 6,400 + 65,534 + 65,534 = 265,705 are in use, leaving
  848,406 free for future character set extension.

Assuming 10,000 new characters per year (which is conservative given the
numbers in [3]), the current encoding space will be exhausted in ca. 85
years.
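The exhaustion estimate is easy to recheck; a short Python sketch using the figures quoted from [2]:

```python
# Figures as quoted above (unicodebook statistics). Note the surrogate
# range is listed but not counted against assignable space in the sum.
assigned = 128_237                  # encoded characters
private  = 6_400 + 65_534 + 65_534  # the three private-use ranges
total    = 0x10FFFF                 # 1,114,111 available code points

in_use = assigned + private
free   = total - in_use

assert in_use == 265_705
assert free == 848_406
assert free // 10_000 == 84         # i.e. roughly 85 years at 10k/year
```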

Regards,
Jo

[1]
The Unicode consortium could extend the maximum value of UTF-8 by using
more prefixes:
  111110xx for 5-byte encoding
  1111110x for 6-byte encoding
  11111110 for 7-byte encoding
  11111111 for 8-byte encoding
No prefixes are possible for a longer encoding.

Bit counts for each prefix are:
  5-byte: nr of 4-byte encoding bits (21) + 5 = 26
  6-byte: 26 + 5 = 31
  7-byte: 31 + 5 = 36
  8-byte: 36 + 6 (prefix does not lose a bit) = 42
The maximum 8-byte-encoded value is
  0x10ffff + 2^26 + 2^31 + 2^36 + 2^42 = 0x4108410FFFF

[2]
Numbers taken from
http://unicodebook.readthedocs.io/unicode.html#statistics

[3]
See https://en.wikipedia.org/wiki/Unicode#Versions


p6rt commented Mar 20, 2017

From @usev6

On Mon, 20 Mar 2017 01:19:43 -0700, jo@durchholz.org wrote:

I was wondering how the Unicode consortium might extend this limit, so I
investigated a bit.

TL;DR

I can confirm that 10ffff is going to remain the maximum for the
foreseeable future.

Thanks for sharing your findings!

I looked some more at our code and the tests we have in roast. Things are complicated ... Probably it wasn't wise of me to mix the original point of this issue ("chr() is reducing the supplied codepoint number mod 2**32") with the question of the maximum allowed code point. But here we go.

At one point in 2014 we had additional validity checks for nqp::chr. Those checks also looked for the upper limit of 0x10ffff. According to the IRC logs [1] the checks were removed [2], because the Unicode Consortium made it clear in "Corrigendum #9: Clarification About Noncharacters" [3] that Noncharacters are not illegal, but reserved for private use. (See also the answer to the question "Are noncharacters invalid in Unicode strings and UTFs?" in the FAQ [4].)

AFAIU the check for the upper limit was useful, since 0x110000 and above are illegal (as opposed to the Noncharacters). Trying to add those checks back, I found failures in S32-str/encode.t on MoarVM. There are tests that expect the following code to live. The tests were added for RT 123673: https://rt-archive.perl.org/perl6/Ticket/Display.html?id=123673

$ ./perl6-m -e '"\x[FFFFFF]".sink; say "alive"' # .sink to avoid warning
alive

Another thing to note in this context: Since we have \x, the patch from lizmat didn't fix the whole mod 2**32 thing:

$ ./perl6-m -e 'chr(0x100000063).sink; say "alive"' # dies as expected
chr codepoint too large: 4294967395
  in block <unit> at -e line 1
$ ./perl6-m -e '"\x[100000063]".sink; say "alive"' # does not die
alive

So, adding the check for the upper limit for MoarVM [5] led to failing tests in S32-str/encode.t and did not help with the mod 2**32 problem. (AFAIU the conversion to 32 bit is done before the code from [5] in src/strings/ops.c runs.)

On the JVM backend things look a bit better. Adding similar code to method chr in src/vm/jvm/runtime/org/perl6/nqp/runtime/Ops.java helps with the upper limit for code points and helps with the mod 2**32 problem (since we cast to int after said check). The tests from S32-str/encode.t were failing before (they have been fudged for a while).

I'd be glad if someone with a deeper knowledge would double check if these tests are correct wrt "\x[FFFFFF]":
https://github.com/perl6/roast/blob/add852b082a2fca83dbefe03d890dd5939c5ff45/S32-str/encode.t#L70-L89

In case they are dubious, I'd propose to add a validity check for the upper limit to MVM_string_chr (MoarVM) and chr (JVM). That would only leave the mod 2**32 problem on MoarVM.

[1] https://irclog.perlgeek.de/perl6/2014-03-28#i_8509990 (and below)

[2] usev6/nqp@a4eda0bcd2 (JVM) and MoarVM/MoarVM@d93a73303f (MoarVM)

[3] http://www.unicode.org/versions/corrigendum9.html

[4] http://www.unicode.org/faq/private_use.html#nonchar8

[5] $ git diff

Inline Patch
diff --git a/src/strings/ops.c b/src/strings/ops.c
index 9bfa536..7e77d21 100644
--- a/src/strings/ops.c
+++ b/src/strings/ops.c
@@ -1919,6 +1919,8 @@ MVMString * MVM_string_chr(MVMThreadContext *tc, MVMCodepoint cp) {
 
     if (cp < 0)
         MVM_exception_throw_adhoc(tc, "chr codepoint cannot be negative");
+    else if (cp > 0x10ffff)
+        MVM_exception_throw_adhoc(tc, "chr codepoint cannot be greater than 0x10FFFF");
 
     MVM_unicode_normalizer_init(tc, &norm, MVM_NORMALIZE_NFG);
     if (!MVM_unicode_normalizer_process_codepoint_to_grapheme(tc, &norm, cp, &g)) {
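
Why the patch in [5] cannot catch the mod 2**32 case: the argument has already been narrowed to 32 bits by the time MVM_string_chr sees it. A toy Python model (not the actual VM code) of why the order of truncation and validation matters:

```python
MAX_CP = 0x10FFFF

def chr_check_after_truncation(cp: int) -> int:
    """Models the MoarVM path with patch [5]: the cast to a 32-bit
    codepoint happens first, so 0x100000063 arrives as 0x63 and
    silently passes the range check."""
    cp32 = cp & 0xFFFFFFFF
    if cp32 > MAX_CP:
        raise ValueError("codepoint too large")
    return cp32

def chr_check_before_truncation(cp: int) -> int:
    """Models the JVM-style path: validate the full integer before
    any narrowing cast."""
    if not (0 <= cp <= MAX_CP):
        raise ValueError("codepoint out of range")
    return cp

assert chr_check_after_truncation(0x100000063) == 0x63  # wrongly accepted
try:
    chr_check_before_truncation(0x100000063)
except ValueError:
    pass  # correctly rejected
```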



p6rt commented Nov 4, 2017

From @samcv

This now throws if the size is too large. Closing as resolved.

@p6rt p6rt closed this as completed Nov 4, 2017

p6rt commented Nov 4, 2017

@samcv - Status changed from 'open' to 'resolved'
