Inconsistent handling of characters with value > 0x7FFF_FFFF and other issues #9260
From chris.hall@highwayman.com
Created by chris.hall@highwayman.com

Amongst the issues:

* Character values > 0x7FFF_FFFF are not consistently handled. IMO: the handling is so broken that it would be much better
* chr and pack respond differently to large and out of range
* pack can generate strings that unpack will not process.
* warnings about 'illegal' non-characters are arguably spurious. Treating 0xFFFF_FFFF as a non-character is interesting.
* IMO: chr(-1) complete nonsense == undef, not "a character I

Perl strings containing characters >0x7FFF_FFFF use a non-standard
Bits of Perl are happier with these non-standard sequences than

Consider:

1: use strict ;

which generates:

A: Unicode character 0x7fffffff is illegal at tb.pl line 11.
... repeated for 0xbf, 0xbf, 0xbf, 0xbf, 0xbd
f: Malformed UTF-8 character (byte 0xfe) in unpack at tb.pl line 18.
... repeated for 0x80, 0x80, 0x80, 0x80, 0x80
h: @w=\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{7FFFFFFF}\x{E0}
... repeated for 0xbf, 0xbf, 0xbf, 0xbf, 0xbd
n: Malformed UTF-8 character (byte 0xfe) in unpack at tb.pl line 23.
... repeated for 0x80, 0x80, 0x80, 0x80, 0x80
p: Malformed UTF-8 character (byte 0xfe) in unpack at tb.pl line 23.
... repeated for 0xbf, 0xbf, 0xbf, 0xbf, 0xbf
r: @w=\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{7FFFFFFF}\x{E0}

NOTES:

1. chr(n) is happy with characters > 0x7FFF_FFFF
   BUT: note the runtime warning about 0x7FFF_FFFF itself -- output line a.
   Unicode defines characters U+xxFFFF as non-characters, for all xx from
   These characters are NOT illegal. Unicode states:
   "Noncharacter code points are reserved for internal use, such as
   Characters > 0x10_FFFF are not known to Unicode.
   IMO, chr(n) should not be issuing warnings about non-characters at all.
   IMO, to project non-characters beyond the Unicode range is doubly
   FURTHER: although characters > 0x10_FFFF are beyond Unicode, and

2. Similarly "\x{8000_0000} and "\x{7FFF_FFFF}" -- output line A.

3. HOWEVER: utf8::valid() considers a string containing characters
   IMO allowing for characters 0x7FFF_FFFF in the first place is a mistake. But having allowed them, why flag the string as invalid ?

4. However: length() is happy, and issues no warning.
   Either length() is accepting the non-standard encoding, or some other

5. Lines 12 & 13 generate warnings about malformed UTF-8, at compile time.
   However, the run-time copes with these super-large characters.

6. substr is happy with the super-large characters -- line 16.

7. split is happy with the super-large characters -- line 26.

8. ord is happy with the super-large characters -- line 26.

9. unpack 'U' throws up all over super-large characters !
   See lines 18 & 23, and output d-h and l-r.
   unpack has no idea about the non-standard encoding of characters

10. pack 'U' complains about character values in much the same way as
    However, pack and chr are by no means consistent with each other,

11. pack 'U' is generating stuff that unpack 'U' cannot cope with !
    See lines 21-24 and output k-r

___________________________________________________________

Looking further at chr and pack:

1: use strict ;

On a 64-bit v5.8.8:

A: UTF-16 surrogate 0xd800 at tb2.pl line 6.

* chr(-1) generates a warning, not because it's complete rubbish,
  chr(-3) doesn't merit a warning.
* note that surrogates and non-characters are OK as far as utf8::valid
* pack is masking stuff to 32 bit unsigned !!
* both chr and pack are throwing warnings about surrogates

On a 32-bit v5.10.0:

A: Integer overflow in hexadecimal number at tb2.pl line 10.

* chr is mapping -ve values to U+FFFD -- without warning. This is as per documentation.
  However, character 0xFFFF_FFFF, merits a warning, but does NOT

IMO: this is a dog's dinner. I think:

- non-characters and surrogates should not trouble chr
- values that are invalid should generate undef, not U+FFFD
  a) cannot distinguish chr(0xFFFD) and chr(-10)
  b) U+FFFD is a replacement for a character that we don't
  [-1 is a banana. U+FFFD is an orange, which we may
- limiting characters to 0x7FFF_FFFF is no great loss, and

* pack 'U' is NOT mapping -ve values to U+FFFD !!

Perl Info
|
From jgmyers@proofpoint.com
This is similar to bug #43294.
I agree that allowing characters above the Unicode maximum of U+10FFFF
Allowing surrogates and the non-character U+FFFE in UTF-8 is a security |
The RT System itself - Status changed from 'new' to 'open' |
From chris.hall@highwayman.com
On Thu Mar 20 15:25:57 2008, jgmyers@proofpoint.com wrote:
Oh dear. I was actually trying to argue for decoupling general I think Perl's character general character handling has been mixed up IMO Perl should, internally, handle characters with values I would dispense with all the broken and incomplete handling Separately there is clearly the need to filter strict UTF-8 for Seems to me that the current code falls between two stools and is not
I don't think it's chr's job to police anything. The current * non-characters are OK for internal use, but not for external * use of strings for things other than Unicode [I note that printf '%vX' is suggested for IPv6. This implies * processing UTF-16 as strings of characters 0..0xFFFF. and trying to do two things at once -- e.g. allowing chr() to generate Applications that need strict UTF-8 (and possibly subsets thereof) need But I don't think the needs of strict UTF-8 should get in the way of -- |
From jgmyers@proofpoint.com
On Thu Mar 20 17:56:04 2008, chris_hall wrote:
I disagree--Perl should adopt and conform to the Unicode standard. Particularly heinous is the concept of calling something "utf8" that (Actually, Unicode does not define the UTF-8 byte sequence for U+FFFE as
By allowing values that are not permitted by Unicode, you are laying a
The requirements with respect to ill-formed sequences, including The requirements with respect to noncharacters are admittedly complex
I would agree with this, though I disagree about the need for
I disagree. It is chr's job to police chr('orange'). Similarly, it
Which then begs the question of what is "internal use" versus "external
To use strings for things other than Unicode, one should use byte
UTF-16 is a character encoding scheme and should be processed as a byte |
From chris.hall@highwayman.com
On Fri, 21 Mar 2008 you wrote
The standard already says that non-characters should not be exchanged I don't quite understand why you'd want to apply a UTF-16 decoder after
No, I'm suggesting removing all the clutter from simple character Applications that don't trust their input have their own issues, which I
It says they are ill-formed. It doesn't mandate what your application A quick and dirty application might just throw rubbish away, and might Another application might convert rubbish to U+FFFD and later whinge Yet another application might wish to give more specific diagnostics, Sequences between 0x10_FFFF and 0x7FFF_FFFF are well understood, though Similarly, the redundant longer forms, which UTF-8 says are ill-formed,
Except that non-characters are entirely legal, and may be essential to Then there's what to do with (a) unassigned characters, (b) private use ...
Indeed, and that may vary from application to application. So, not only is it (a) more general and (b) conceptually simpler to
I don't see why handling an IPv6 address as a short sequence of 16 bit In the old 8-bit character world what one did with characters was not
Not really. UTF-16 is defined in terms of 16 bit values. I can use
Well, this is it in a nut-shell. I don't think that Perl characters (that is, things that are components [OK, the size here is arbitrary... 31-bits fits with the well [Even if Perl characters were exactly Unicode characters, there would On top of a generic string data structure there should clearly be Chris |
From jgmyers@proofpoint.com
Chris Hall via RT wrote:
Unicode 5.0 conformance requirement C10 does mandate a restriction on
Unicode permits such behavior.
Unicode permits either behavior.
Again, Unicode conformance requirement C10 prohibits applications from
Software is sometimes constructed by connecting together modules that The potential attack does require a UTF-16 decoder that is sloppier than
Ill-formed sequences are invalid for everybody. By pushing the Even you seem to have been unaware of the seriously adverse security
Please provide an example of a reasonable application to which a
There are no such problems with any of these categories.
No, it's the converse. When you fail to provide a consistent definition
In neither case are they characters.
The old 8-bit character world is hardly a model of reasonableness. One The ability to store arbitrary 16 bit quantities in UTF-8 strings is
Perl strings are not word (16 bit) sequences. UTF-16 was a bad idea to begin with. Let it die a natural death, just
This is indeed it in a nut-shell. Perl has a choice: On one hand, it |
From perl@nevcal.com
On approximately 3/25/2008 3:27 PM, came the following characters from
Perl seems to have already made that choice... and chose TMTOWTDI. The language implements an extension of UTF-8 encoding rules for 70** The language has certain string operations* that implement certain Module Encode implements (as best as Dan and kibitzers can) UTF-8 So people that want to use utf8 strings as containers for 16-bit It appears that reported bugs get fixed, as time permits. It appears * This list is fairly well known, including "\l\L\u\U" uc ucfirst lc ** maybe it is 72? Larger than 64, apparently, and such values higher --
|
From chris.hall@highwayman.com
On Tue, 25 Mar 2008 you wrote
As reported: the 7 and 13 byte extended sequences are not properly IMO things are so broken (not even utf8::valid() likes the 7- and - stopping at 31 bit integers is at least consistent with well-known - 32 bit integers could be supported in 6 byte sequences (by treating ....the extent of brokenness recalls the guiding mantra: KISS.
The separation between the content of strings and Unicode is unclear. The name utf8 doesn't help !

A good example of this is chr(n) which:

- issues warnings if 'n' is a Unicode surrogate or non-character.
  These warnings are a nuisance for people using strings as containers
  Those wanting help with strict Unicode aren't materially helped by
- accepts characters beyond the Unicode range without warning.
  So isn't consistent in its "Unicode support".
- generates a chr(0xFFFD) in response to chr(-1).
  Which makes no sense where strings are used as containers for n-bit

There are plenty of bugs to go round :-} I hope that has not been lost in the discussion.

I'm not sure that the Encode Module is the right place for all support
It seems to me that Encode is to do with mapping between Perl characters
With Unicode there are additional, specific options required either to

* the non-characters -- all of them.
* the Replacement Character -- perhaps should not send these, or
* private use characters -- which may or may not be suitable for
* perhaps more general support for sub-sets of Unicode.
* dealing with canonical equivalences.

Now I suppose that a lot can be done by regular expressions and other

It's 72: 13 byte sequence, starting 0xFF followed by 12 bytes carrying 6

FWIW, on 64-bit integer machine:

$v = 0xFFFF_FFFF_FFFF_FFFD ;

work just fine. Though Perl whimpers:

Hexadecimal number > 0xffffffff non-portable at .... (compile time)

While:

$v = 0xFFFF_FFFF_FFFF_FFFF ;

whimpers:

Hexadecimal number > 0xffffffff non-portable at .... (compile time)

where the second whinge is baroque.

Chris |
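For reference, the byte layouts mentioned in this thread can be reproduced with a little arithmetic. The following is a minimal sketch in Python, not Perl's actual implementation. The function name `encode_extended` is hypothetical. It assumes that the 1 to 6 byte forms follow the original 31-bit UTF-8 pattern, that the 7-byte form (lead byte 0xFE) is used for values below 2**36, and that the 13-byte form is 0xFF followed by 12 continuation bytes of 6 bits each, as described above.

```python
def encode_extended(cp: int) -> bytes:
    """Encode a non-negative integer using UTF-8-style byte sequences.

    Illustrative only: the 7-byte (0xFE) and 13-byte (0xFF) forms are
    assumptions based on the layouts described in this thread.
    """
    if cp < 0:
        raise ValueError("negative values have no encoding")
    if cp < 0x80:
        return bytes([cp])  # single-byte (ASCII) form

    # (exclusive upper bound, total sequence length, lead-byte marker)
    forms = [
        (1 << 11, 2, 0xC0),
        (1 << 16, 3, 0xE0),
        (1 << 21, 4, 0xF0),
        (1 << 26, 5, 0xF8),
        (1 << 31, 6, 0xFC),   # up to 0x7FFF_FFFF
        (1 << 36, 7, 0xFE),   # assumed: 6 continuation bytes, 36 bits
        (1 << 72, 13, 0xFF),  # 12 continuation bytes, 72 bits
    ]
    for limit, length, lead in forms:
        if cp < limit:
            n_cont = length - 1
            # Each continuation byte carries 6 payload bits, high bits first.
            cont = [0x80 | ((cp >> (6 * i)) & 0x3F)
                    for i in range(n_cont - 1, -1, -1)]
            # Whatever is left above the continuation payload goes in the lead byte.
            return bytes([lead | (cp >> (6 * n_cont))] + cont)
    raise ValueError("value needs more than 72 bits")


if __name__ == "__main__":
    for value in (0x7FFF_FFFD, 0x8000_0000, 0xFFFF_FFFF, 0xFFFF_FFFF_FFFF_FFFD):
        print(hex(value), encode_extended(value).hex(" "))
```

Under these assumptions, 0x7FFF_FFFD encodes as `fd bf bf bf bf bd`, and 0x8000_0000 starts with the byte 0xfe, which lines up with the byte values quoted in the warnings in the original report.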
From chris.hall@highwayman.com
On Tue, 25 Mar 2008 John Gardiner Myers wrote
Sure. But the real point is that this doesn't specify how the error And, as previously discussed, the issue of ill-formed UTF-8 is only part
Sure. But the point is that there isn't a single correct approach, it
OK. I was going by what Unicode 5.0 says: "The definition of UTF-8 in Annex D of ISO/IEC 10646:2003 also allows
And any given application may wish to do one or the other.
It takes a narrow view of this. Obviously it is good to encourage ....
The existing handling is in a mess. I suggest that this is partly IMO the solution is (a) to simplify the base string and character data
As above. I grant that handling the redundant longer forms is not a big
Yes, it doesn't help clarify things.
Except that it would no longer be Unicode conformant. If you want to argue that non-characters are a Bad Thing, that's a Using private use characters instead simply moves the problem. If I use
Well... if you're troubled by the exchange of 66 non-character values An application that was *really* worried about what it was being sent
I agree that without a consistent definition you get a mess.
Looking at the "Character Encoding Model", where I said "characters" a
Granted that character encodings in the 8-bit world were tricky. But chr() didn't get upset about, for example, DEL (0x7F) or DLE (0x10) .....
Leaving to one side any questions about ill-formed sequences. What

* non-characters -- allow, filter out, replace, ... ?
* private-use characters -- allow, filter out, replace, ... ?
* unassigned characters -- allow, filter out, replace, ... ?
* canonical equivalences -- allow, filter out, replace, ... ?
  The standard acknowledges a security issue here, but punts it:
  "However, another level of alternate representation has raised
* requirements to handle only sub-sets of characters.
* other things, perhaps ?

Even surrogates are potentially tricky...

... in UTF-8 surrogate values are explicitly ill-formed.
... in UTF-16 they should travel in pairs, but I guess decoders need to
... but it appears that some code will combine surrogate code points
... so banning these values from Perl strings is problematic.

With ill-formed sequences the question is how to deal with the error
The point here is that the requirements are not simple and not
There is, absolutely, a crying need for clear and effective support for
This is a false alternative. Supporting generic "character" and string primitives does not preclude At present Perl is achieving neither. Chris PS: big-endian integers are sinful. |
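The categories argued over in this exchange (surrogates, the 66 non-characters, values beyond U+10FFFF, negative values) are simple to classify separately from the string primitive itself. The following is a minimal sketch in Python under that assumption. The names `is_surrogate`, `is_noncharacter` and `classify` are hypothetical, and the sketch takes no position on what an application should then do with each category (allow, filter out, replace).

```python
UNICODE_MAX = 0x10FFFF


def is_surrogate(cp: int) -> bool:
    """True for the UTF-16 surrogate code points U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """True for the Unicode noncharacters.

    These are U+FDD0..U+FDEF plus U+xxFFFE and U+xxFFFF in each of the
    17 planes. Values beyond U+10FFFF are deliberately not treated as
    noncharacters here.
    """
    if not 0 <= cp <= UNICODE_MAX:
        return False
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp: int) -> str:
    """Label an integer; the policy for each label is left to the caller."""
    if cp < 0:
        return "invalid"
    if cp > UNICODE_MAX:
        return "beyond-unicode"
    if is_surrogate(cp):
        return "surrogate"
    if is_noncharacter(cp):
        return "noncharacter"
    return "scalar"


# Count the noncharacters across the whole Unicode range.
NONCHARACTER_COUNT = sum(is_noncharacter(cp) for cp in range(UNICODE_MAX + 1))

if __name__ == "__main__":
    print(NONCHARACTER_COUNT)
    for value in (-1, 0xFFFD, 0xD800, 0xFFFE, 0x7FFF_FFFF):
        print(value, classify(value))
```

A check of this kind can sit in an encode/decode or validation layer, so that a strict-Unicode application and one using strings as generic integer containers can each apply their own policy.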
From @khwilliamson
After much further discussion and gnashing of teeth, this has been
Hexadecimal number > 0xffffffff non-portable at 51936.pl line 21.
The decision was made to allow any unsigned values to be stored as strings
When doing an operation that requires Unicode semantics on an
Unicode doesn't actually forbid the use of isolated surrogates in
The portion of the original ticket involving chr(-1) has not been
In any event, I believe much of the inglorious handling of this whole
--Karl Williamson |
@khwilliamson - Status changed from 'open' to 'resolved' |
Migrated from rt.perl.org#51936 (status was 'resolved')
Searchable as RT51936$