Unicode broken for 0x10FFFF #4931
From msergeant@startechgroup.co.uk
This is a bug report for perl from matt@matt_dev.star.net.uk,
[Please enter your report here]
In testing XML::SAX::PurePerl on bleadperl, the unicode character entity
Unicode character 0x10ffff is illegal at blib/lib/XML/SAX/PurePerl.pm line
This can be replicated by the following code:
$ perl5.7.2 -le 'print chr 0x10FFFF'
Interestingly the character seems to be converted correctly, despite
The warning occurs regardless of if I use pack("U") or chr.
The problem is in this block of code in utf8.c:
else if (
Which obviously (yeah ok so I had to stare for a while too) fails
Anyway, this patch fixes it:
--- utf8.c.old Wed Jan 30 11:40:39 2002
+++ utf8.c Wed Jan 30 11:54:28 2002
@@ -69,7 +69,7 @@
!(flags & UNICODE_ALLOW_FFFF))) &&
/* UNICODE_ALLOW_SUPER includes
* FFFEs and FFFFs beyond 0x10FFFF. */
- ((uv <= PERL_UNICODE_MAX) ||
+ ((uv < PERL_UNICODE_MAX) ||
!(flags & UNICODE_ALLOW_SUPER))
)
Perl_warner(aTHX_ WARN_UTF8,
--- t/lib/warnings/utf8.old Wed Jan 30 12:06:53 2002
+++ t/lib/warnings/utf8 Wed Jan 30 11:44:56 2002
@@ -37,10 +37,12 @@
my $surr = chr(0xD800);
my $fff3 = chr(0xFFFE);
my $ffff = chr(0xFFFF);
+my $top = chr(0x10FFFF); # shouldn't warn regardless
no warnings 'utf8';
$surr = chr(0xD800);
$fffe = chr(0xFFFE);
$ffff = chr(0xFFFF);
EXPECT
UTF-16 surrogate 0xd800 at - line 2.
Unicode character 0xfffe is illegal at - line 3.
That's all for now. XML::SAX::PurePerl sweetly passes all tests here,
Site configuration information for perl v5.7.2:
Configured by matt at Wed Jan 30 10:25:52 GMT 2002.
Summary of my perl5 (revision 5.0 version 7 subversion 2 patch 14470)
Locally applied patches:
@INC for perl v5.7.2:
Environment for perl v5.7.2:
PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/X11R6/bin:/home/matt/bin:/usr/local/
Matt.
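The report observes that the character seems to be converted correctly despite the warning. A minimal sketch for checking that on a given perl, assuming one recent enough to provide utf8::encode (5.8 or later). The variable names are illustrative, and whether a warning would fire without the `no warnings 'utf8'` blocks depends on the perl version:

```perl
use strict;
use warnings;

# Build U+10FFFF both ways mentioned in the report: chr() and pack("U").
# The utf8 warning category is disabled for this block only.
my ($via_chr, $via_pack);
{
    no warnings 'utf8';
    $via_chr  = chr(0x10FFFF);
    $via_pack = pack("U", 0x10FFFF);
}

# Encode a copy to UTF-8 bytes so the conversion can be inspected.
my $bytes = $via_chr;
{
    no warnings 'utf8';
    utf8::encode($bytes);
}

printf "ord:           0x%X\n", ord($via_chr);
printf "same via pack: %s\n", ($via_chr eq $via_pack ? "yes" : "no");
printf "utf8 bytes:    %s\n",
    join(" ", map { sprintf("%02X", ord($_)) } split(//, $bytes));
```

In UTF-8, U+10FFFF is the four-byte sequence F4 8F BF BF, so that is the byte dump to look for.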
From [Unknown Contact. See original ticket]
Darn, this only seems to fix it for chr(), pack("U") still exhibits the
Matt.
From @jhi
On Wed, Jan 30, 2002 at 12:15:59PM -0000, Matt Sergeant wrote:
Uhhh, the warning (recently added) is given because the character
What I notice, though, is that the current code does not warn for
--
From [Unknown Contact. See original ticket]
On Wed, 30 Jan 2002, Jarkko Hietaniemi wrote:
OK, well I'm not really sure how to handle this then... How can you turn
BTW: 0x10FFFF is a valid XML character
--
From @jhi
I'm curious: how do you do Unicode in 5.005...?
eval for the 'no warnings'?
Sure. It's a valid code point-- but not a character. Maybe in XML
Unicode 3.1: http://www.unicode.org/unicode/reports/tr27/index.html
and the end of the section VI, "Code Charts" (about 80% down
--
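The "valid code point, but not a character" distinction corresponds to what Unicode calls noncharacters. A small sketch of that classification, assuming the usual definition (the last two code points of every plane, plus the block U+FDD0..U+FDEF); `is_noncharacter` is a hypothetical helper, not something perl provides:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $cp is a code point that Unicode reserves
# as a noncharacter.
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp < 0 || $cp > 0x10FFFF;        # outside the Unicode range
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;   # the 32-code-point block
    # Last two code points of each plane: U+nFFFE and U+nFFFF.
    my $is_plane_end = ($cp & 0xFFFE) == 0xFFFE;
    return $is_plane_end ? 1 : 0;
}

for my $cp (0xFFFD, 0xFFFE, 0xFFFF, 0x10FFFD, 0x10FFFF) {
    printf "U+%04X %s\n", $cp,
        is_noncharacter($cp) ? "noncharacter" : "not a noncharacter";
}
```

By this rule 0x10FFFF is classified the same way as 0xFFFF, which matches the wording of the warning in the report.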
From [Unknown Contact. See original ticket]
On Wed, 30 Jan 2002, Jarkko Hietaniemi wrote:
Punt and return bytes (in order to do the XML character tests we have
I guess I could do some use/require magic though. I'll try that next.
That'll eval in the eval's scope, and not propagate the no warnings
I don't know, and don't care that much. I just want it to work :-)
--
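One possible shape for the "use/require magic" mentioned above, offered as a sketch rather than what was actually done in XML::SAX::PurePerl. It relies on `no warnings LIST` being a compile-time call to `warnings->unimport(LIST)`, so a BEGIN block can make that call conditionally and still affect the enclosing lexical scope, unlike a string eval, whose pragmas stay inside the eval. The `$] >= 5.006` cutoff is an assumption (warnings.pm first shipped with perl 5.6), and `top_char` is a hypothetical name:

```perl
use strict;

# Hypothetical accessor used only for illustration.
sub top_char {
    # Runs while this sub body is being compiled, so the unimport applies
    # to the rest of this block, as a literal "no warnings 'utf8'" would.
    BEGIN {
        if ($] >= 5.006) {
            require warnings;
            warnings->unimport('utf8');
        }
    }
    return chr(0x10FFFF);   # compiled with the 'utf8' category off on 5.6+
}

printf "0x%X\n", ord(top_char());
```

On a perl older than 5.6 the BEGIN block does nothing, so the file still compiles there without a `no warnings` line.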
From @jhi
Ahh, it's all coming back now... warning about such characters
--
From @gbarr
On Wed, Jan 30, 2002 at 05:01:01PM +0200, Jarkko Hietaniemi wrote:
Should that not be "no warnings 'unicode';" ? As you have said it
Graham.
From @jhi
On Wed, Jan 30, 2002 at 04:08:22PM +0000, Graham Barr wrote:
I guess so. But for backward compatibility also the old subpragma
--
From @TimToady
Jarkko Hietaniemi writes:
I think the general policy of Perl should be that it is allowed to
But within Perl, character strings are simply sequences of integers.
In the absence of other type information, these integers are assumed
For various reasons, some of which relate to the sequence-of-integer
Note that I did not use the phrase "pure sequences of integers" in the
This is just a heads up for some of the stuff in Apocalypse 5.
I seriously intend that it be trivial to write a Perl parser (or any
Sorry I can't be more clear yet. Story of my life. That's the basic
[I've cross-posted because of the wide interest, but I don't want to
Larry
From @jhi
On Wed, Jan 30, 2002 at 10:47:36AM -0800, Larry Wall wrote:
Coming back to the original issue :-) how should I read the above as
--
From @TimToady
Jarkko Hietaniemi writes:
I'd prefer to see warnings/coercions on by default for input and
On the other hand, output to a different process that is expecting
There are exceptions, of course. I just don't think chr() has to
Larry
From @jhi
On Wed, Jan 30, 2002 at 11:29:50AM -0800, Larry Wall wrote:
Okay. I'll dilute the warning to an optional one. It's the same
But I like so much populating every bridge with at least one troll...
--
From [Unknown Contact. See original ticket]
Larry Wall wrote:
If you're constructing illegal unicode because you *want* something
And possibly more likely, What if I want to insert placeholders of some
Eg, suppose I had [bad] code like:
This code "works" if and only if there's no characters in the original
A "good" version of this might be:
This code works as long as there's no characters in the original which
I can also think of using chr representations of integers to do some
Yes, definitely.
Hmm... would "\x{hex number}" and "\0oct number" be exceptions?
--
From @jhi
So you basically want to bring back the evil "the eighth bit is a magical
--
From [Unknown Contact. See original ticket]
Jarkko Hietaniemi <jhi@iki.fi> writes:
Maybe we could make our 35'th bit (or which ever is our grossest code
--
From [Unknown Contact. See original ticket]
On Thu, 31 Jan 2002 18:00:41 +0200, jhi@iki.fi (Jarkko Hietaniemi)
Weeell, I suppose one could say that it's justified since Unicode AFAIK
However, that's what people thought when they used ASCII back in 7-bit
Cheers,
From [Unknown Contact. See original ticket]
On Wed, 30 Jan 2002 12:15:59 -0000, msergeant@startechgroup.co.uk (Matt
It's not the right patch. 0x7FFFF is equally as illegal a Unicode
Cheers,
From [Unknown Contact. See original ticket]
On Thu, 31 Jan 2002 02:58:27 -0500, goldbb2@earthlink.net (Benjamin
You know, I believe this is the sort of thing 0xFFFF was deliberately
It's like C's practice of using \0 as an end-of-string marker; C
Cheers,
From [Unknown Contact. See original ticket]
On Wed, 30 Jan 2002 17:01:01 +0200, jhi@iki.fi (Jarkko Hietaniemi)
Do you mean characters such as 0x23FFFF, or any old character after
Cheers,
From @jhi
On Sat, Feb 02, 2002 at 09:07:17PM +0100, Philip Newton wrote:
I meant any past 0x10FFFF, such as 0x200000. But anyway, now it's as
- on I/O warnings on by default
(and the any past 0x10FFFF warning isn't yet on, since it causes
--
From @jhi
On Sat, Feb 02, 2002 at 09:07:19PM +0100, Philip Newton wrote:
7 bits, 4kB, 64kB, 360 kB, 640kB, 32 bits, 2400 bps, ...
--
From [Unknown Contact. See original ticket]
Philip Newton wrote:
But that only allows one to put in one *single* kind of flag, rather
Using it would mean changing my code to something like:
push @matches, $1
Sure, this would still work... at least with this particular example
I want something where *every* encoded number is a single atom, and
--
From @jhi
These are the kinds of statements that make the snowballs in hell look
--
Migrated from rt.perl.org#8375 (status was 'resolved')
Searchable as RT8375$