Unicode broken for 0x10FFFF #4931

p5pRT · 2002-01-30T04:16:59Z

Migrated from rt.perl.org#8375 (status was 'resolved')

Searchable as RT8375

p5pRT · 2002-01-30T04:16:59Z

From msergeant@startechgroup.co.uk

This is a bug report for perl from matt@matt_dev.star.net.uk,
generated with the help of perlbug 1.33 running under perl v5.7.2.

[Please enter your report here]

In testing XML::SAX::PurePerl on bleadperl, the Unicode character entity
reference &#x10FFFF; causes a warning to be issued:

Unicode character 0x10ffff is illegal at blib/lib/XML/SAX/PurePerl.pm line
353.

This can be replicated by the following code:

$ perl5.7.2 -le 'print chr 0x10FFFF'
Unicode character 0x10ffff is illegal at -e line 1.

Interestingly the character seems to be converted correctly, despite
the strange warning.

The warning occurs regardless of whether I use pack("U") or chr.

The problem is in this block of code in utf8.c:

    else if (
             ((uv >= 0xFDD0 && uv <= 0xFDEF &&
               !(flags & UNICODE_ALLOW_FDD0))
              ||
              ((uv & 0xFFFF) == 0xFFFE &&
               !(flags & UNICODE_ALLOW_FFFE))
              ||
              ((uv & 0xFFFF) == 0xFFFF &&
               !(flags & UNICODE_ALLOW_FFFF))) &&
             /* UNICODE_ALLOW_SUPER includes
              * FFFEs and FFFFs beyond 0x10FFFF. */
             ((uv <= PERL_UNICODE_MAX) ||
              !(flags & UNICODE_ALLOW_SUPER))
             )
        Perl_warner(aTHX_ WARN_UTF8,
                    "Unicode character 0x%04"UVxf" is illegal", uv);

Which obviously (yeah ok so I had to stare for a while too) fails
only on 0x10FFFF (and presumably higher codepoints that end in FFFF,
but we don't support those yet IIRC).

Anyway, this patch fixes it:

--- utf8.c.old  Wed Jan 30 11:40:39 2002
+++ utf8.c      Wed Jan 30 11:54:28 2002
@@ -69,7 +69,7 @@
                    !(flags & UNICODE_ALLOW_FFFF))) &&
                  /* UNICODE_ALLOW_SUPER includes
                   * FFFEs and FFFFs beyond 0x10FFFF. */
-                 ((uv <= PERL_UNICODE_MAX) ||
+                 ((uv < PERL_UNICODE_MAX) ||
                   !(flags & UNICODE_ALLOW_SUPER))
                  )
              Perl_warner(aTHX_ WARN_UTF8,
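
A minimal sketch of the condition above as a standalone C predicate, to make it easier to see which inputs warn before and after the one-character patch. The flag values, the `warns` helper and its `patched` switch are assumptions made for illustration; only the shape of the test comes from the utf8.c excerpt in this report.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits: the ticket shows only the flag names,
 * not their real values in the Perl headers. */
#define ALLOW_FDD0   0x01u
#define ALLOW_FFFE   0x02u
#define ALLOW_FFFF   0x04u
#define ALLOW_SUPER  0x08u

#define UNICODE_MAX  0x10FFFFu

/* Mirrors the quoted test. `patched` selects `<` instead of `<=`
 * in the comparison against the maximum code point. */
static bool warns(uint32_t uv, unsigned flags, bool patched)
{
    bool suspicious =
        (uv >= 0xFDD0 && uv <= 0xFDEF && !(flags & ALLOW_FDD0)) ||
        ((uv & 0xFFFF) == 0xFFFE && !(flags & ALLOW_FFFE)) ||
        ((uv & 0xFFFF) == 0xFFFF && !(flags & ALLOW_FFFF));
    bool in_range = patched ? (uv < UNICODE_MAX) : (uv <= UNICODE_MAX);

    return suspicious && (in_range || !(flags & ALLOW_SUPER));
}
```

In this model, `warns(0x10FFFF, ALLOW_SUPER, false)` is true and `warns(0x10FFFF, ALLOW_SUPER, true)` is false, so the patch exempts exactly the last code point. When `ALLOW_SUPER` is clear, `!(flags & ALLOW_SUPER)` makes the range test irrelevant and the warning fires either way. Which flags the real callers pass is not stated in this report.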

All tests pass. I didn't know the right place to add a new test. But I added something to t/lib/warnings/utf8, though it's a crappy test and Schwern will probably shoot me:

--- t/lib/warnings/utf8.old     Wed Jan 30 12:06:53 2002
+++ t/lib/warnings/utf8 Wed Jan 30 11:44:56 2002
@@ -37,10 +37,12 @@
 my $surr = chr(0xD800);
 my $fff3 = chr(0xFFFE);
 my $ffff = chr(0xFFFF);
+my $top = chr(0x10FFFF); # shouldn't warn regardless
 no warnings 'utf8';
 $surr = chr(0xD800);
 $fffe = chr(0xFFFE);
 $ffff = chr(0xFFFF);
 EXPECT
 UTF-16 surrogate 0xd800 at - line 2.
 Unicode character 0xfffe is illegal at - line 3.

That's all for now. XML::SAX::PurePerl sweetly passes all tests here,
and now does the right thing wrt unicode on 5.00503, 5.6.1 and 5.7.2.

Site configuration information for perl v5.7.2:

Configured by matt at Wed Jan 30 10:25:52 GMT 2002.

Summary of my perl5 (revision 5.0 version 7 subversion 2 patch 14470)
configuration:
Platform:
osname=linux, osvers=2.2.14-5.0, archname=i686-linux
uname='linux matt_dev 2.2.14-5.0 #1 tue mar 7 21:07:39 est 2000 i686
unknown '
config_args='-ds -e -Dusedevel'
hint=recommended, useposix=true, d_sigaction=define
usethreads=undef use5005threads=undef useithreads=undef
usemultiplicity=undef
useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef
usemymalloc=n, bincompat5005=define
Compiler:
cc='cc', ccflags ='-fno-strict-aliasing -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
optimize='-O2',
cppflags='-fno-strict-aliasing -I/usr/include/gdbm'
ccversion='', gccversion='egcs-2.91.66 19990314/Linux (egcs-1.1.2
release)', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='cc', ldflags =' -L/usr/local/lib'
libpth=/usr/local/lib /lib /usr/lib
libs=-lnsl -lndbm -lgdbm -ldb -ldl -lm -lc -lposix -lcrypt -lutil
perllibs=-lnsl -ldl -lm -lc -lposix -lcrypt -lutil
libc=/lib/libc-2.1.3.so, so=so, useshrplib=false, libperl=libperl.a
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
DEVEL14470

@INC for perl v5.7.2:
/usr/local/lib/perl5/5.7.2/i686-linux
/usr/local/lib/perl5/5.7.2
/usr/local/lib/perl5/site_perl/5.7.2/i686-linux
/usr/local/lib/perl5/site_perl/5.7.2
/usr/local/lib/perl5/site_perl
.

Environment for perl v5.7.2:
HOME=/home/matt
LANG=en_US
LANGUAGE (unset)
LD_LIBRARY_PATH (unset)
LOGDIR (unset)

PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/X11R6/bin:/home/matt/bin:/usr/local/bin
PERL_BADLANG (unset)
SHELL=/bin/bash

Matt.
--
<:->get a SMart net</:->

p5pRT · 2002-01-30T04:46:09Z

From [Unknown Contact. See original ticket]

From: Matt Sergeant [mailto:msergeant@startechgroup.co.uk]

> --- utf8.c.old Wed Jan 30 11:40:39 2002
> +++ utf8.c Wed Jan 30 11:54:28 2002
> @@ -69,7 +69,7 @@
> !(flags & UNICODE_ALLOW_FFFF))) &&
> /* UNICODE_ALLOW_SUPER includes
> * FFFEs and FFFFs beyond 0x10FFFF. */
> - ((uv <= PERL_UNICODE_MAX) ||
> + ((uv < PERL_UNICODE_MAX) ||
> !(flags & UNICODE_ALLOW_SUPER))
> )
> Perl_warner(aTHX_ WARN_UTF8,

Darn, this only seems to fix it for chr(), pack("U") still exhibits the
warning. Anyone know why?

Matt.

p5pRT · 2002-01-30T07:01:39Z

From @jhi

On Wed, Jan 30, 2002 at 12:15:59PM -0000, Matt Sergeant wrote:

> This is a bug report for perl from matt@matt_dev.star.net.uk,
> generated with the help of perlbug 1.33 running under perl v5.7.2.
>
> -----------------------------------------------------------------
> [Please enter your report here]
>
> In testing XML::SAX::PurePerl on bleadperl, the unicode character entity
> reference &#x10FFFF; causes a warning to be issued:
>
> Unicode character 0x10ffff is illegal at blib/lib/XML/SAX/PurePerl.pm line
> 353.

Uhhh, the warning (recently added) is given because the character
0x10FFFF *IS* illegal (it is classified as a non-character, it's a
valid code point, but not a character). If user code is trying to
generate it, something is usually wrong (if intentional, one can turn
the warning off by "no warnings 'utf8';"). Therefore I do not
consider this to be a bug needing to be fixed.

What I notice, though, is that the current code does not warn for
characters beyond 0x10FFFF, which is definitely a bug.

--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

p5pRT · 2002-01-30T07:13:51Z

From [Unknown Contact. See original ticket]

On Wed, 30 Jan 2002, Jarkko Hietaniemi wrote:

> On Wed, Jan 30, 2002 at 12:15:59PM -0000, Matt Sergeant wrote:
>
> > This is a bug report for perl from matt@matt_dev.star.net.uk,
> > generated with the help of perlbug 1.33 running under perl v5.7.2.
> >
> > -----------------------------------------------------------------
> > [Please enter your report here]
> >
> > In testing XML::SAX::PurePerl on bleadperl, the unicode character entity
> > reference &#x10FFFF; causes a warning to be issued:
> >
> > Unicode character 0x10ffff is illegal at blib/lib/XML/SAX/PurePerl.pm line
> > 353.
>
> Uhhh, the warning (recently added) is given because the character
> 0x10FFFF *IS* illegal (it is classified as a non-character, it's a
> valid code point, but not a character). If user code is trying to
> generate it, something is usually wrong (if intentional, one can turn
> the warning off by "no warnings 'utf8';"). Therefore I do not
> consider this to be a bug needing to be fixed.

OK, well I'm not really sure how to handle this then... How can you turn
off 'utf8' warnings yet remain compatible with 5.005, and not use eval
EXPR for the entire chr() section of code?

BTW: 0x10FFFF is a valid XML character
http://www.w3.org/TR/REC-xml#NT-Char
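
The two notions of "valid" being compared here can be put side by side. This is a sketch only: `is_xml_char` follows the `Char` production at the URL above, `is_noncharacter` follows the Unicode noncharacter definition (U+FDD0..U+FDEF plus the last two code points of every plane), and both function names are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* XML 1.0 Char production:
 * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] */
static bool is_xml_char(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

/* Unicode noncharacters: U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF
 * at the end of each of the 17 planes. */
static bool is_noncharacter(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false;   /* beyond the Unicode code space entirely */
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}
```

Under these definitions 0x10FFFF satisfies both predicates: XML accepts it as a `Char` while Unicode classes it as a noncharacter. XML excludes only the plane-0 pair 0xFFFE and 0xFFFF.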

--

<:->Get a smart net</:->

p5pRT · 2002-01-30T07:21:12Z

From @jhi

> OK, well I'm not really sure how to handle this then... How can you turn
> off 'utf8' warnings yet remain compatible with 5.005, and not use eval

I'm curious: how do you do Unicode in 5.005...?

> EXPR for the entire chr() section of code?

eval for the 'no warnings'?

> BTW: 0x10FFFF is a valid XML character
> http://www.w3.org/TR/REC-xml#NT-Char

Sure. It's a valid code point-- but not a character. Maybe in XML
there is no difference between (allowed) code points and (allocated)
characters?

Unicode 3.1:

http://www.unicode.org/unicode/reports/tr27/index.html

and the end of the section VI, "Code Charts" (about 80% down
the document).

--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

p5pRT · 2002-01-30T07:27:21Z

From [Unknown Contact. See original ticket]

On Wed, 30 Jan 2002, Jarkko Hietaniemi wrote:

> > OK, well I'm not really sure how to handle this then... How can you turn
> > off 'utf8' warnings yet remain compatible with 5.005, and not use eval
>
> I'm curious: how do you do Unicode in 5.005...?

Punt and return bytes (in order to do the XML character tests we have
different regexps for 5.005 and 5.6+ that test multiple bytes on 5.5, and
the parser's "next" routine on 5.005 will read more bytes when it sees a
surrogate). But the same code needs to work on all platforms.

I guess I could do some use/require magic though. I'll try that next.

> > EXPR for the entire chr() section of code?
>
> eval for the 'no warnings'?

That'll eval in the eval's scope, and not propagate the no warnings
outside of that scope. Unless I've got eval's scoping rules screwed up.

> > BTW: 0x10FFFF is a valid XML character
> > http://www.w3.org/TR/REC-xml#NT-Char
>
> Sure. It's a valid code point-- but not a character. Maybe in XML
> there is no difference between (allowed) code points and (allocated)
> characters?

I don't know, and don't care that much. I just want it to work :-)

--

<:->Get a smart net</:->

p5pRT · 2002-01-30T07:55:57Z

From @jhi

> What I notice, though, is that the current code does not warn for
> characters beyond 0x10FFFF, which is definitely a bug.

Ahh, it's all coming back now... warning about such characters
causes pain in the complementing tr///... have to look at this later.

--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

p5pRT · 2002-01-30T08:10:33Z

From @gbarr

On Wed, Jan 30, 2002 at 05:01:01PM +0200, Jarkko Hietaniemi wrote:

> > Unicode character 0x10ffff is illegal at blib/lib/XML/SAX/PurePerl.pm line
> > 353.
>
> Uhhh, the warning (recently added) is given because the character
> 0x10FFFF *IS* illegal (it is classified as a non-character, it's a
> valid code point, but not a character). If user code is trying to
> generate it, something is usually wrong (if intentional, one can turn
> the warning off by "no warnings 'utf8';"). Therefore I do not
> consider this to be a bug needing to be fixed.

Should that not be "no warnings 'unicode';" ? As you have said it
is a valid code point, so it's valid in utf8, it's just not a valid
unicode character.

Graham.

p5pRT · 2002-01-30T08:12:07Z

From @jhi

On Wed, Jan 30, 2002 at 04:08:22PM +0000, Graham Barr wrote:

> On Wed, Jan 30, 2002 at 05:01:01PM +0200, Jarkko Hietaniemi wrote:
>
> > > Unicode character 0x10ffff is illegal at blib/lib/XML/SAX/PurePerl.pm line
> > > 353.
> >
> > Uhhh, the warning (recently added) is given because the character
> > 0x10FFFF *IS* illegal (it is classified as a non-character, it's a
> > valid code point, but not a character). If user code is trying to
> > generate it, something is usually wrong (if intentional, one can turn
> > the warning off by "no warnings 'utf8';"). Therefore I do not
> > consider this to be a bug needing to be fixed.
>
> Should that not be "no warnings 'unicode';" ? As you have said it

I guess so. But for backward compatibility also the old subpragma
should stay available, ugh....

> is a valid code point, so it's valid in utf8, it's just not a valid
> unicode character.
>
> Graham.

--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

p5pRT · 2002-01-30T10:49:27Z

From @TimToady

Jarkko Hietaniemi writes:
: > What I notice, though, is that the current code does not warn for
: > characters beyond 0x10FFFF, which is definitely a bug.
:
: Ahh, it's all coming back now... warning about such characters
: causes pain in the complementing tr///... have to look at this later.

I think the general policy of Perl should be that it is allowed to
think about bad thoughts, because that is the only way to understand
what's bad about the bad thoughts Perl receives on input. If there is
to be any self-censorship, it should be on the output, I believe.
That's why they're called "disciplines", after all. :-) So it's fine if
the default output discipline enforces that the internal representation
is transformed to well-formed UTF-8. It's even okay if the default
input discipline enforces well-formedness, as long as there's a way
to get at the raw badness.

But within Perl, character strings are simply sequences of integers.
The internal representation must be optimized for this concept, not for
any particular Unicode representation, whether UTF-8 or UTF-16 or
UTF-32. Any of these could be used as underlying representations, but
the abstraction of sequences of integers must be there explicitly in
the internal high-level string API. To oversimplify, the high-level
API must not have any parameters whose type contains the string "UTF".
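
To make the "sequence of integers" point concrete, here is a sketch of an encoder that does not stop at 0x10FFFF. It assumes the original 1-to-6-byte UTF-8 layout, which covers values up to 2**31 - 1. The function name is invented for this example and nothing here is meant to describe Perl's actual internals.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode any integer below 2**31 using the original 1-6 byte UTF-8
 * layout. Returns the number of bytes written, or 0 if cp is too large.
 * No Unicode policy is applied: surrogates, noncharacters and values
 * above 0x10FFFF are all encoded like any other integer. */
static size_t encode_utf8_ext(uint32_t cp, unsigned char out[6])
{
    size_t len, i;

    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if      (cp < 0x800)       len = 2;
    else if (cp < 0x10000)     len = 3;
    else if (cp < 0x200000)    len = 4;
    else if (cp < 0x4000000)   len = 5;
    else if (cp < 0x80000000u) len = 6;
    else return 0;

    /* Fill continuation bytes from the end, six bits at a time. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    /* Lead byte: `len` one-bits, a zero bit, then the remaining payload. */
    out[0] = (unsigned char)(((0xFF00u >> len) & 0xFFu) | cp);
    return len;
}
```

With this layout 0x10FFFF takes four bytes (F4 8F BF BF) and 0x200000 takes five (F8 88 80 80 80). Any check for "is this legal Unicode" would be a separate layer on top, applied where data enters or leaves the program.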

In the absence of other type information, these integers are assumed
to be Unicode code points. Additional strictures are possible and even
useful, but should not be the default (except for certain operations that
are explicitly designed for Unicode.)

For various reasons, some of which relate to the sequence-of-integer
abstraction, and some of which relate to "infinite" strings and arrays,
I think Perl 6 strings are likely to be represented by a list of
chunks, where each chunk is a sequence of integers of the same size or
representation, but different chunks can have different integer sizes
or representations. The abstract string interface must hide this from
any module that wishes to work at the abstract string level. In
particular, it must hide this from the regex engine, which works on
pure sequences in the abstract.

Note that I did not use the phrase "pure sequences of integers" in the
last sentence. The regex engine must not care if it is matching
characters from a string of known length, or tokens objects from an
array that is being grown arbitrarily on demand. Matching on UTF-32
is not good enough.

This is just a heads up for some of the stuff in Apocalypse 5.
Backtracking behavior will not necessarily be limited to regexes in
Perl 6, and if so, we have to consider very carefully how regex
backtracking, continuations, and temp variable unifications all work
together. (This is part of the reason I pushed earlier for the regex
opcodes to be meshed with the normal opcodes.)

I seriously intend that it be trivial to write a Perl parser (or any
other parser) in Perl, and that changing a grammar rule be as simple as
swapping in a different qr// (or a sub equivalent to a qr//). More
generally, I want logic programming to be one of the paradigms that
Perl supports. And as usual, I want to support it without forcing it
on people who aren't interested.

Sorry I can't be more clear yet. Story of my life. That's the basic
problem with the bear-of-very-little-brain approach. So please "bear"
with me.

[I've cross-posted because of the wide interest, but I don't want to
start a general frenzy cross-posted to all the lists. Please answer
specific points in separate messages, and please direct each followup
to the appropriate list. Thanks.]

Larry

p5pRT · 2002-01-30T11:02:35Z

From @jhi

On Wed, Jan 30, 2002 at 10:47:36AM -0800, Larry Wall wrote:

> Jarkko Hietaniemi writes:
> : > What I notice, though, is that the current code does not warn for
> : > characters beyond 0x10FFFF, which is definitely a bug.
> :
> : Ahh, it's all coming back now... warning about such characters
> : causes pain in the complementing tr///... have to look at this later.
>
> I think the general policy of Perl should be that it is allowed to
> think about bad thoughts, because that is the only way to understand
> what's bad about the bad thoughts Perl receives on input. If there is
> to be any self-censorship, it should be on the output, I believe.
> That's why they're called "disciplines", after all. :-) So it's fine if
> the default output discipline enforces that the internal representation
> is transformed to well-formed UTF-8. It's even okay if the default
> input discipline enforces well-formedness, as long as there's a way
> to get at the raw badness.
>
> But within Perl, character strings are simply sequences of integers.
> The internal representation must be optimized for this concept, not for
> any particular Unicode representation, whether UTF-8 or UTF-16 or
> UTF-32. Any of these could be used as underlying representations, but
> the abstraction of sequences of integers must be there explicitly in
> the internal high-level string API. To oversimplify, the high-level
> API must not have any parameters whose type contains the string "UTF".
>
> In the absence of other type information, these integers are assumed
> to be Unicode code points. Additional strictures are possible and even
> useful, but should not be the default (except for certain operations that
> are explicitly designed for Unicode.)

Coming back to the original issue :-) how should I read the above as
regards to chr(0x200000): warning by default or not? chr(0xffff)
warning or not? chr(0xD800) (half of a surrogate) warning or not?
This issue is independent of representation. (IIRC, Ken Lunde thought
warning on half-surrogates would be a good thing, since if one is
generating Unicode, generating just a half of a surrogate is
meaningless, kind of like generating only "@" for chr(ord("A"))).
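
For readers unfamiliar with why half a surrogate is meaningless, a short sketch of the UTF-16 pairing arithmetic follows. The helper names are illustrative and not taken from the Perl source.

```c
#include <stdbool.h>
#include <stdint.h>

static bool is_high_surrogate(uint32_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static bool is_low_surrogate(uint32_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Split a supplementary code point (0x10000..0x10FFFF) into a
 * high/low surrogate pair. Returns false outside that range. */
static bool to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return false;
    cp -= 0x10000;                           /* 20 bits remain */
    *hi = (uint16_t)(0xD800 + (cp >> 10));   /* top 10 bits */
    *lo = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* bottom 10 bits */
    return true;
}

/* Recombine a pair. Fails unless both halves are present and in the
 * right order, which is why a lone 0xD800 carries no character. */
static bool from_surrogates(uint16_t hi, uint16_t lo, uint32_t *cp)
{
    if (!is_high_surrogate(hi) || !is_low_surrogate(lo))
        return false;
    *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    return true;
}
```

For example, 0x10FFFF splits into the pair 0xDBFF, 0xDFFF, and `from_surrogates(0xD800, 0x0041, &cp)` fails because the second unit is not a low surrogate.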

--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

p5pRT · 2002-01-30T11:34:26Z

From @TimToady

Jarkko Hietaniemi writes:
: Coming back to the original issue :-) how should I read the above as
: regards to chr(0x200000): warning by default or not? chr(0xffff)
: warning or not? chr(0xD800) (half of a surrogate) warning or not?
: This issue is independent of representation. (IIRC, Ken Lunde thought
: warning on half-surrogates would be a good thing, since if one is
: generating Unicode, generating just a half of a surrogate is
: meaningless, kind of like generating only "@" for chr(ord("A"))).

I'd prefer to see warnings/coercions on by default for input and
output, and off by default for generic Perl code (but easy to turn
on). To me, chr() is a generic method of turning an integer into a
character, and doesn't have any Unicode limitations, whatever they are
today. What if I'm trying to construct illegal Unicode for testing
purposes? What if I'm composing a string to pass to a constructor
that will eventually tag the string as some other representation?
I just have the gut feeling that the programmer often knows better
than the language designer, and should be allowed to choose their own
pain threshold.

On the other hand, output to a different process that is expecting
Unicode is more constrained--by international treaty, as it were. But
passport control should be at the border, not at roadblocks at every
intersection. If people want to check their own passports more often,
that's their own lookout.

There are exceptions, of course. I just don't think chr() has to
be one of them.

Larry

p5pRT · 2002-01-30T11:45:13Z

From @jhi

On Wed, Jan 30, 2002 at 11:29:50AM -0800, Larry Wall wrote:

> Jarkko Hietaniemi writes:
> : Coming back to the original issue :-) how should I read the above as
> : regards to chr(0x200000): warning by default or not? chr(0xffff)
> : warning or not? chr(0xD800) (half of a surrogate) warning or not?
> : This issue is independent of representation. (IIRC, Ken Lunde thought
> : warning on half-surrogates would be a good thing, since if one is
> : generating Unicode, generating just a half of a surrogate is
> : meaningless, kind of like generating only "@" for chr(ord("A"))).
>
> I'd prefer to see warnings/coercions on by default for input and
> output, and off by default for generic Perl code (but easy to turn
> on). To me, chr() is a generic method of turning an integer into a

Okay. I'll dilute the warning to an optional one. It's the same
warning for chr(), pack(), and \x{...}, so all of them will work
the same.

> character, and doesn't have any Unicode limitations, whatever they are
> today. What if I'm trying to construct illegal Unicode for testing
> purposes? What if I'm composing a string to pass to a constructor
> that will eventually tag the string as some other representation?
> I just have the gut feeling that the programmer often knows better
> than the language designer, and should be allowed to choose their own
> pain threshold.
>
> On the other hand, output to a different process that is expecting
> Unicode is more constrained--by international treaty, as it were. But
> passport control should be at the border, not at roadblocks at every
> intersection. If people want to check their own passports more often,

But I like so much populating every bridge with at least one troll...

> that's their own lookout.
>
> There are exceptions, of course. I just don't think chr() has to
> be one of them.
>
> Larry

--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

p5pRT · 2002-01-30T23:50:36Z

From [Unknown Contact. See original ticket]

Larry Wall wrote:

> Jarkko Hietaniemi writes:
> : Coming back to the original issue :-) how should I read the above as
> : regards to chr(0x200000): warning by default or not? chr(0xffff)
> : warning or not? chr(0xD800) (half of a surrogate) warning or not?
> : This issue is independent of representation. (IIRC, Ken Lunde thought
> : warning on half-surrogates would be a good thing, since if one is
> : generating Unicode, generating just a half of a surrogate is
> : meaningless, kind of like generating only "@" for chr(ord("A"))).
>
> I'd prefer to see warnings/coercions on by default for input and
> output, and off by default for generic Perl code (but easy to turn
> on). To me, chr() is a generic method of turning an integer into a
> character, and doesn't have any Unicode limitations, whatever they are
> today. What if I'm trying to construct illegal Unicode for testing
> purposes?

If you're constructing illegal unicode because you *want* something
known to be illegal unicode, you get what you deserve ;)

> What if I'm composing a string to pass to a constructor
> that will eventually tag the string as some other representation?

And possibly more likely, What if I want to insert placeholders of some
sort in a string, which will be later substituted with something else,
and want to avoid using as a placeholder something which might validly
occur in the text?

Eg, suppose I had [bad] code like:
push @matches, $1 while s/( \( [^()]* \) )/chr @matches/xe;
s/([\x00-\x1F])/$matches[ord $1]/ge for @matches;
print $_, "\n" for @matches;

This code "works" if and only if there's no characters in the original
data which are the chr() of a number less than 32, and if there's less
than 32 pairs of [possibly nested] parentheses with data in them.

A "good" version of this might be:
push @matches, $1 while s/( \( [^()]* \) )/chr(0x20_0000+@matches)/xe;
my $x = sprintf q[[%c-%c]], 0x20_0000, 0x20_0000+$#matches;
s/($x)/$matches[ord($1)-0x20_0000]/ge for @matches, $_;
print $_, "\n" for @matches;

This code works as long as there's no characters in the original which
are the chr() of a number greater than or equal to 0x20_0000, which we
can probably be fairly certain of, no matter what kind of text it is.

I can also think of using chr representations of integers to do some
kind of funky GRT type thing, but I can't think of any useful example.

> I just have the gut feeling that the programmer often knows better
> than the language designer, and should be allowed to choose their own
> pain threshold.

Yes, definitely.

> On the other hand, output to a different process that is expecting
> Unicode is more constrained--by international treaty, as it were. But
> passport control should be at the border, not at roadblocks at every
> intersection. If people want to check their own passports more often,
> that's their own lookout.
>
> There are exceptions, of course. I just don't think chr() has to
> be one of them.

Hmm... would "\x{hex number}" and "\0oct number" be exceptions?

--
There's a wild Fandango loose in the theater!

p5pRT · 2002-01-31T08:01:24Z

From @jhi

> And possibly more likely, What if I want to insert placeholders of some
> sort in a string, which will be later substituted with something else,
> and want to avoid using as a placeholder something which might validly
> occur in the text?
>
> Eg, suppose I had [bad] code like:
> push @matches, $1 while s/( \( [^()]* \) )/chr @matches/xe;
> s/([\x00-\x1F])/$matches[ord $1]/ge for @matches;
> print $_, "\n" for @matches;
>
> This code "works" if and only if there's no characters in the original
> data which are the chr() of a number less than 32, and if there's less
> than 32 pairs of [possibly nested] parentheses with data in them.
>
> A "good" version of this might be:
> push @matches, $1 while s/( \( [^()]* \) )/chr(0x20_0000+@matches)/xe;
> my $x = sprintf q[[%c-%c]], 0x20_0000, 0x20_0000+$#matches;
> s/($x)/$matches[ord($1)-0x20_0000]/ge for @matches, $_;
> print $_, "\n" for @matches;
>
> This code works as long as there's no characters in the original which
> are the chr() of a number greater than or equal to 0x20_0000, which we
> can probably be fairly certain of, no matter what kind of text it is.

So you basically want to bring back the evil "the eighth bit is a magical
flag we can use for whatever we need it for, because no one will ever
need more than 127 characters" habit, now ported into Unicode? :-)

--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

p5pRT · 2002-02-01T03:19:48Z

From [Unknown Contact. See original ticket]

Jarkko Hietaniemi <jhi@iki.fi> writes:

> So you basically want to bring back the evil "the eighth bit is a magical
> flag we can use for whatever we need it for, because no one will ever
> need more than 127 characters" habit, now ported into Unicode? :-)

Maybe we could make our 35'th bit (or which ever is our grossest code
violation) a marker for this hack. Though of course the UTF-EBCDIC
don't have that luxury as they only have 2**31 code points to play with.

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/

p5pRT · 2002-02-02T12:06:18Z

From [Unknown Contact. See original ticket]

On Thu, 31 Jan 2002 18:00:41 +0200, jhi@iki.fi (Jarkko Hietaniemi)
wrote:

> > This code works as long as there's no characters in the original which
> > are the chr() of a number greater than or equal to 0x20_0000, which we
> > can probably be fairly certain of, no matter what kind of text it is.
>
> So you basically want to bring back the evil "the eighth bit is a magical
> flag we can use for whatever we need it for, because no one will ever
> need more than 127 characters" habit, now ported into Unicode? :-)

Weeell, I suppose one could say that it's justified since Unicode AFAIK
will never, ever allocate a character beyond U-10FFFF because of the way
they set themselves up.

However, that's what people thought when they used ASCII back in 7-bit
days....

Cheers,
Philip

p5pRT · 2002-02-02T12:06:18Z

From [Unknown Contact. See original ticket]

On Wed, 30 Jan 2002 12:15:59 -0000, msergeant@startechgroup.co.uk (Matt
Sergeant) wrote:

Anyway, this patch fixes it:

--- utf8.c.old Wed Jan 30 11:40:39 2002
+++ utf8.c Wed Jan 30 11:54:28 2002
@@ -69,7 +69,7 @@
!(flags & UNICODE_ALLOW_FFFF))) &&
/* UNICODE_ALLOW_SUPER includes
* FFFEs and FFFFs beyond 0x10FFFF. */
- ((uv <= PERL_UNICODE_MAX) ||
+ ((uv < PERL_UNICODE_MAX) ||
!(flags & UNICODE_ALLOW_SUPER))
)
Perl_warner(aTHX_ WARN_UTF8,

It's not the right patch. 0x7FFFF is equally as illegal a Unicode
character as 0xFFFF or 0x10FFFF. (I believe this is new as of Unicode
3.2?)

Cheers,
Philip
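
To make the objection above concrete: the noncharacters are not only U+FFFF and U+10FFFF but the last two code points of every plane (U+nFFFE and U+nFFFF for planes 0 through 16), plus the block U+FDD0..U+FDEF. A minimal sketch of that test follows. The function name `is_unicode_noncharacter` is hypothetical and this is not the code in utf8.c; it treats values above 0x10FFFF as a separate case rather than as noncharacters.

```c
#include <stdint.h>

/* Hypothetical helper for illustration only; not the actual utf8.c logic.
 * Returns 1 if cp is one of the 66 Unicode noncharacters, otherwise 0. */
int is_unicode_noncharacter(uint32_t cp)
{
    /* Values beyond the Unicode range are a different problem
     * (the "super" case in the thread), so they are not counted here. */
    if (cp > 0x10FFFF)
        return 0;

    /* The contiguous block U+FDD0..U+FDEF. */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return 1;

    /* The last two code points of every plane: U+nFFFE and U+nFFFF. */
    return (cp & 0xFFFE) == 0xFFFE;
}
```

With this check 0x7FFFF, 0xFFFF and 0x10FFFF are all flagged alike, which is the behaviour the comparison against `PERL_UNICODE_MAX` alone cannot give.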

p5pRT · 2002-02-02T12:06:19Z

From [Unknown Contact. See original ticket]

On Thu, 31 Jan 2002 02:58:27 -0500, goldbb2@earthlink.net (Benjamin
Goldberg) wrote:

And possibly more likely, What if I want to insert placeholders of some
sort in a string, which will be later substituted with something else,
and want to avoid using as a placeholder something which might validly
occur in the text?

You know, I believe this is the sort of thing 0xFFFF was deliberately
left unassigned for (and later turned into an illegal character) -- so
that it can be used as a magic flag. (The fact that it is equal to -1 in
16-bit signed integers is probably a plus, too.)

It's like C's practice of using \0 as an end-of-string marker; C
considers \0 to be "not a character", and Unicode says that U+FFFF is
guaranteed not to be a character. So you can use it for such
fenceposting or placeholding or whatever.

Cheers,
Philip
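
As a rough illustration of the fenceposting idea above, here is a sketch that uses U+FFFF to terminate an array of code points, the way C uses \0 to terminate a string. The names `CP_SENTINEL` and `cp_length` are made up for this example, and it assumes the data itself never contains U+FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* U+FFFF is guaranteed not to be a character, so it can mark the end. */
#define CP_SENTINEL 0xFFFFu

/* Hypothetical helper: count the code points before the U+FFFF sentinel.
 * Unlike a \0-terminated string, the data may contain U+0000. */
size_t cp_length(const uint32_t *s)
{
    size_t n = 0;
    while (s[n] != CP_SENTINEL)
        n++;
    return n;
}
```

For example, the array `{0x48, 0x0, 0x69, 0xFFFF}` has length 3 under this convention, with the embedded zero counted as ordinary data.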

p5pRT · 2002-02-02T12:06:20Z

From [Unknown Contact. See original ticket]

On Wed, 30 Jan 2002 17:01:01 +0200, jhi@iki.fi (Jarkko Hietaniemi)
wrote:

What I notice, though, is that the current code does not warn for
characters beyond 0x10FFFF, which is definitely a bug.

Do you mean characters such as 0x23FFFF, or any old character after
0x10FFFF regardless of the bottom sixteen bits?

Cheers,
Philip

p5pRT · 2002-02-02T16:07:07Z

From @jhi

On Sat, Feb 02, 2002 at 09:07:17PM +0100, Philip Newton wrote:

On Wed, 30 Jan 2002 17:01:01 +0200, jhi@iki.fi (Jarkko Hietaniemi)
wrote:

What I notice, though, is that the current code does not warn for
characters beyond 0x10FFFF, which is definitely a bug.

Do you mean characters such as 0x23FFFF, or any old character after
0x10FFFF regardless of the bottom sixteen bits?

I meant any past 0x10FFFF, such as 0x200000. But anyway, now it's as
Larry wanted, that is:

- on I/O warnings on by default
- otherwise (such as chr) warnings off by default

(and the any-past-0x10FFFF warning isn't yet on, since it causes
trouble in unexpected places, like tr///c and the ~ bitop...)

Cheers,
Philip

--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

p5pRT · 2002-02-02T16:08:29Z

From @jhi

On Sat, Feb 02, 2002 at 09:07:19PM +0100, Philip Newton wrote:

On Thu, 31 Jan 2002 18:00:41 +0200, jhi@iki.fi (Jarkko Hietaniemi)
wrote:

This code works as long as there's no characters in the original which
are the chr() of a number greater than or equal to 0x20_0000, which we
can probably be fairly certain of, no matter what kind of text it is.

So you basically want to bring back the evil "the eighth bit is a magical
flag we can use for whatever we need it for, because no one will ever
need more than 127 characters" habit, now ported into Unicode? :-)

Weeell, I suppose one could say that it's justified since Unicode AFAIK
will never, ever allocate a character beyond U-10FFFF because of the way
they set themselves up.

However, that's what people thought when they used ASCII back in 7-bit
days....

7 bits, 4kB, 64kB, 360 kB, 640kB, 32 bits, 2400 bps, ...

Cheers,
Philip

--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

p5pRT · 2002-02-02T19:43:40Z

From [Unknown Contact. See original ticket]

Philip Newton wrote:

On Thu, 31 Jan 2002 02:58:27 -0500, goldbb2@earthlink.net (Benjamin
Goldberg) wrote:

And possibly more likely, What if I want to insert placeholders of
some sort in a string, which will be later substituted with
something else, and want to avoid using as a placeholder something
which might validly occur in the text?

You know, I believe this is the sort of thing 0xFFFF was deliberately
left unassigned for (and later turned into an illegal character) -- so
that it can be used as a magic flag. (The fact that it is equal to -1
in 16-bit signed integers is probably a plus, too.)

It's like C's practice of using \0 as an end-of-string marker; C
considers \0 to be "not a character", and Unicode says that U+FFFF is
guaranteed not to be a character. So you can use it for such
fenceposting or placeholding or whatever.

But that only allows one to put in one *single* kind of flag, rather
than being able to encode numbers in general.

Using it would mean changing my code to something like:

push @matches, $1
while s/( \( [^()]* \) )/"\x{FFFF}".@matches."\x{FFFF}"/xe;
s/\x{FFFF}(\d+)\x{FFFF}/$matches[$1]/g for @matches, $_;
print $_, "\n" for @matches;

Sure, this would still work... at least with this particular example
where the delimiters are ( and ). But if the delimiters were allowed to
be digits, then my encoded number U+FFFF . number . U+FFFF could
conceivably get broken in the middle of the number.

I want something where *every* encoded number is a single atom, and
can't be confused with input data. chr(0x20_0000+number) has that
property, unless space aliens land on earth.

--
A child of 5 could understand this! Fetch me a child of 5
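
The "single atom" property argued for above can be sketched outside Perl as well. In the snippet below each placeholder is one code point at or above 0x200000, so it cannot be split in the middle and cannot collide with any Unicode character, since those stop at 0x10FFFF. The names `PLACEHOLDER_BASE`, `make_placeholder`, `is_placeholder` and `placeholder_index` are invented for this illustration.

```c
#include <stdint.h>

/* First value used for placeholders; well above the Unicode maximum 0x10FFFF. */
#define PLACEHOLDER_BASE 0x200000u

/* Encode a match index as a single out-of-range code point. */
uint32_t make_placeholder(uint32_t index)
{
    return PLACEHOLDER_BASE + index;
}

/* Anything at or above the base is treated as a placeholder, never as text. */
int is_placeholder(uint32_t cp)
{
    return cp >= PLACEHOLDER_BASE;
}

/* Recover the match index; only meaningful when is_placeholder(cp) is true. */
uint32_t placeholder_index(uint32_t cp)
{
    return cp - PLACEHOLDER_BASE;
}
```

The trade-off is the one raised in the thread: this relies on the input never containing values at or above 0x200000.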

p5pRT · 2002-02-02T20:41:30Z

From @jhi

chr(0x20_0000+number) has that property, unless space aliens land on
earth.

These are the kinds of statements that make the snowballs in hell look
expectantly at the thermometer. Or, in this case, the aliens look at
their calendar whether they can fit in an invasion.

--
A child of 5 could understand this! Fetch me a child of 5

--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

p5pRT closed this as completed Nov 28, 2003

p5pRT added affects-5.7 distro-Linux labels Oct 18, 2019
