Report information
Id: 130912
Status: new
Priority: 0
Queue: perl6

Owner: Nobody
Requestors: zefram [at] fysh.org
Cc:
AdminCc:

Severity: (no value)
Tag: Bug
Platform: (no value)
Patch Status: (no value)
VM: (no value)



Subject: [BUG] Str.perl/repl fail on outside-Unicode codepoints
From: Zefram <zefram [...] fysh.org>
To: rakudobug [...] perl.org
Date: Sat, 4 Mar 2017 10:17:14 +0000
> "\x[110000]".ords
(1114112)
> "\x[110000]".gist.ords
(1114112)
> "\x[110000]".perl.ords
(34 1114112 34)
> "\x[110000]"
Error encoding UTF-8 string: could not encode codepoint 1114112
> "\x[110000]".perl
Error encoding UTF-8 string: could not encode codepoint 1114112

This string contains the first out-of-Unicode-range codepoint. There is a bug somewhere in the above, leading to the error messages, but it's a matter of opinion which part contains the bug.

Since the Str type is normally described as representing a Unicode string, it would be reasonable to say that it cannot contain an out-of-Unicode codepoint. In that view, the bug is that the string literal "\x[110000]" is accepted. It's also then a bug that chr(0x110000) evaluates without error, and so on for other ways of constructing a string.

If it is accepted that a Str can contain an out-of-Unicode codepoint, then methods such as .perl and .gist need to handle that appropriately. The range of characters that may be used in .perl output isn't explicitly stated, but it would certainly be reasonable to say that it should be a subset of Unicode. In that view, it is a bug that .perl uses this codepoint in its output: it should represent that grapheme non-literally, in the same way that it does for "\x[1]". Similar arguments apply to .gist, though not as strongly.

If it's accepted that text from .perl or .gist, intended for the user to see, may contain out-of-Unicode-range codepoints, then it is a bug that the repl fails to display such text. The UTF-8 codepoint-to-octets encoding extends up to codepoint 0x7fffffff, so there's an obvious way to output it if it were willing. Whether the user's terminal could render it is another matter, but if you're concerned about that then that would be a good reason to say that .perl and .gist shouldn't be including this sort of thing in their output.

-zefram
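For illustration, the extended codepoint-to-octets scheme mentioned above can be sketched as follows. This is a minimal sketch in Python, not anything taken from Rakudo or MoarVM: the function name encode_extended_utf8 is hypothetical, and the scheme shown is the original UTF-8 layout of up to six octets, which the current UTF-8 standard no longer permits above 0x10FFFF.

```python
def encode_extended_utf8(cp: int) -> bytes:
    """Encode a codepoint with the original UTF-8 layout (up to 0x7FFFFFFF).

    Hypothetical helper for illustration only; standard UTF-8 encoders
    reject anything above 0x10FFFF.
    """
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError(f"codepoint {cp:#x} out of range for extended UTF-8")
    if cp < 0x80:
        return bytes([cp])  # single-octet (ASCII) form

    # (exclusive upper limit, total octets, leading-octet marker bits)
    layouts = (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    )
    for limit, length, lead in layouts:
        if cp < limit:
            break

    octets = []
    # Emit continuation octets from least significant to most significant.
    for _ in range(length - 1):
        octets.append(0x80 | (cp & 0x3F))
        cp >>= 6
    octets.append(lead | cp)  # remaining high bits go in the leading octet
    return bytes(reversed(octets))


if __name__ == "__main__":
    # The codepoint from this report, one past the Unicode maximum.
    print(encode_extended_utf8(0x110000).hex(" "))
```

Under this scheme, 0x110000 takes the four-octet form, one step past the encoding of U+10FFFF. Whether any given terminal or decoder accepts such a sequence is a separate question.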

