Str.perl/repl fail on outside-Unicode codepoints #6122

Open
p6rt opened this issue Mar 4, 2017 · 1 comment
p6rt commented Mar 4, 2017

Migrated from rt.perl.org#130912 (status was 'new')

Searchable as RT130912$

p6rt commented Mar 4, 2017

From zefram@fysh.org

"\x[110000]".ords
(1114112)
"\x[110000]".gist.ords
(1114112)
"\x[110000]".perl.ords
(34 1114112 34)
"\x[110000]"
Error encoding UTF-8 string: could not encode codepoint 1114112
"\x[110000]".perl
Error encoding UTF-8 string: could not encode codepoint 1114112

This string contains the first out-of-Unicode-range codepoint. There is
a bug somewhere in the above, leading to the error messages, but it's
a matter of opinion which part contains the bug.

Since the Str type is normally described as representing a Unicode string,
it would be reasonable to say that it cannot contain an out-of-Unicode
codepoint. In that view, the bug is that the string literal "\x[110000]"
is accepted. It's also then a bug that chr(0x110000) evaluates without
error, and so on for other ways of constructing a string.
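
To make the strict view concrete, here is a small sketch of a constructor that refuses out-of-range values up front. It is written in Python purely for illustration, not in Perl 6, and `checked_chr` is a hypothetical name rather than any existing API. The point is only that if construction fails, no later method ever has to cope with such a codepoint.

```python
# Illustration only, in Python rather than Perl 6; checked_chr is a
# hypothetical helper, not an existing API.
MAX_UNICODE = 0x10FFFF  # last codepoint in the Unicode codespace


def checked_chr(codepoint: int) -> str:
    """Return the one-character string for codepoint, or raise if out of range."""
    if not 0 <= codepoint <= MAX_UNICODE:
        raise ValueError(
            f"codepoint {codepoint} (0x{codepoint:X}) is outside Unicode"
        )
    return chr(codepoint)


# 1114112 from the transcript is exactly one past the last valid codepoint.
assert 0x110000 == 1114112 == MAX_UNICODE + 1

print(checked_chr(0x10FFFF).encode("utf-8").hex())  # f48fbfbf
try:
    checked_chr(0x110000)
except ValueError as err:
    print(err)  # rejected at construction time
```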

If it is accepted that a Str can contain an out-of-Unicode codepoint,
then methods such as .perl and .gist need to handle that appropriately.
The range of characters that may be used in .perl output isn't explicitly
stated, but it would certainly be reasonable to say that it should be
a subset of Unicode. In that view, it is a bug that .perl uses this
codepoint in its output: it should represent that grapheme non-literally,
in the same way that it does for "\x[1]". Similar arguments apply to
.gist, though not as strongly.
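
A rough sketch of what "represent non-literally" could mean follows. It is Python, for illustration only, and `repr_codepoints` is a hypothetical helper, not how Rakudo implements .perl. It works on a list of integer codepoints and writes control characters and out-of-range values in the `\x[...]` escape form mentioned above, so the result stays within Unicode. It is deliberately simplified: a real .perl would also have to escape characters that interpolate inside double quotes.

```python
# Illustration only, in Python; repr_codepoints is a hypothetical helper and
# not Rakudo's actual .perl implementation.
MAX_UNICODE = 0x10FFFF


def repr_codepoints(codepoints):
    """Build a double-quoted literal, escaping anything unsafe to emit raw."""
    parts = []
    for cp in codepoints:
        if cp > MAX_UNICODE or cp < 0x20 or cp == 0x7F:
            # Controls and out-of-range values are written as \x[HEX] escapes.
            parts.append("\\x[%x]" % cp)
        elif cp in (0x22, 0x5C):
            # Escape the quote and the backslash so the literal stays parseable.
            parts.append("\\" + chr(cp))
        else:
            parts.append(chr(cp))
    return '"' + "".join(parts) + '"'


print(repr_codepoints([0x61, 0x1, 0x110000]))  # "a\x[1]\x[110000]"
```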

If it's accepted that text from .perl or .gist, intended for the user
to see, may contain out-of-Unicode-range codepoints, then it is a bug
that the repl fails to display such text. The UTF-8 codepoint-to-octets
encoding, as originally defined, extends up to codepoint 0x7fffffff, so
there's an obvious way for the repl to output it if it were willing. Whether the user's terminal could render
it is another matter, but if you're concerned about that then that would
be a good reason to say that .perl and .gist shouldn't be including this
sort of thing in their output.
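
The "obvious way" can be sketched as follows. This is Python for illustration only, and `extended_utf8` is a hypothetical helper, not anything in Rakudo. It follows the original UTF-8 bit layout of one to six octets, which covers every value up to 0x7fffffff.

```python
# Illustration only, in Python; extended_utf8 is a hypothetical helper that
# follows the original UTF-8 bit layout (1 to 6 octets, up to 0x7FFFFFFF).
def extended_utf8(codepoint: int) -> bytes:
    if not 0 <= codepoint <= 0x7FFFFFFF:
        raise ValueError("codepoint does not fit the 31-bit UTF-8 layout")
    if codepoint < 0x80:
        return bytes([codepoint])  # single octet, same as ASCII
    # (exclusive upper limit, lead-octet marker) for 2..6 octet sequences
    layouts = [(0x800, 0xC0), (0x10000, 0xE0), (0x200000, 0xF0),
               (0x4000000, 0xF8), (0x80000000, 0xFC)]
    for extra, (limit, lead) in enumerate(layouts, start=1):
        if codepoint < limit:
            break
    tail = []
    for _ in range(extra):
        tail.append(0x80 | (codepoint & 0x3F))  # low six bits per continuation octet
        codepoint >>= 6
    # What remains of codepoint fits in the free bits of the lead octet.
    return bytes([lead | codepoint] + tail[::-1])


print(extended_utf8(0x110000).hex())  # f4908080
```

Note that the current UTF-8 definition (RFC 3629) stops at U+10FFFF, so a conforming decoder or terminal may well reject a sequence like this one.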

-zefram

p6rt added the uni label Jan 5, 2020