UTF-8 (strict) Encode and Decode detect only 1/66 non-characters #9259
From chris.hall@highwayman.com
Created by chris.hall@highwayman.com
Encode::encode('UTF-8', $foo) and Encode::decode('UTF-8', $bar) detect the
There are 65 other Unicode non-characters: U+FFFE which one would expect to be treated the same as U+FFFF. They aren't. They are accepted as normal characters.
This appears to be a bug. It's the same under Perl 5.10.0.
(Alternatively, one could argue that detecting the 0xFFFF non-character
Perl Info
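For reference, the count of 66 follows from the Unicode definition of non-characters: the 32 code points U+FDD0..U+FDEF, plus the last two code points (U+nFFFE and U+nFFFF) of each of the 17 planes. The sketch below is a language-neutral illustration in Python, not Perl's Encode implementation; the names `is_noncharacter` and `NONCHARACTERS` are hypothetical.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode non-characters."""
    if not 0 <= cp <= 0x10FFFF:
        return False  # outside the Unicode code space altogether
    if 0xFDD0 <= cp <= 0xFDEF:
        return True   # the contiguous block of 32 in the BMP
    # Last two code points of every plane: low 16 bits are FFFE or FFFF.
    return (cp & 0xFFFE) == 0xFFFE


# Enumerate the full set by brute force over the code space.
NONCHARACTERS = [cp for cp in range(0x110000) if is_noncharacter(cp)]

if __name__ == "__main__":
    print(len(NONCHARACTERS))
    print(", ".join(f"U+{cp:04X}" for cp in NONCHARACTERS[:4]))
```

A strict check that treats only U+FFFF specially would cover one member of this set, which is the behaviour the report describes.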
From jgmyers@proofpoint.com
This is related/duplicate to bugs 38722 and 43294. 43294 has a proposed
The RT System itself - Status changed from 'new' to 'open'
From chris.hall@highwayman.com
On Thu Mar 20 13:53:50 2008, jgmyers@proofpoint.com wrote:
Related, except for the confusion between strict UTF-8 and more general
My understanding is that the utf8::valid() and utf8::decode() functions
I agree there's a place for functions that are strict UTF-8. I don't
The bug I was reporting is, however, in the UTF-8 (strict) handling in
From jgmyers@proofpoint.com
My primary use of utf8::valid is to determine when it is necessary to
if (defined($out) && !utf8::valid($out)) {
This requires utf8::valid to do a strict check (as it does with my patch
From chris.hall@highwayman.com
On Fri, 21 Mar 2008 you wrote
Well, yes, for what you want that is what would be required.
The documentation says:
$flag = utf8::valid(STRING)
[INTERNAL] Test whether STRING is in a consistent state regarding
which is invoking 'UTF-8' in caps and stuff, which one understands from
So either the documentation is phouquée or the code is.
What you want is entirely reasonable.
I don't know what the performance issues are with Encode/Decode, but I
More generally I can see a rôle for a 'quick' scanner that might
1. broken sequences (probably including sequences starting 0xFE & FF)
2. surrogates
3. redundant sequences
4. values > 0x10FFFF
5. non-characters
6. replacement characters
7. private use characters
8. unassigned characters
that is: a scan function that takes a second argument to indicate what
Chris
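The idea of a scan function whose second argument selects categories can be sketched as follows. This is a hypothetical illustration in Python, not an existing Perl or Encode API: the names `Reject`, `classify` and `scan` are made up here, and the sketch works on already-decoded code points. It therefore covers only items 2 and 4 to 7 of the list above. Broken and redundant sequences (items 1 and 3) are byte-level properties, and unassigned characters (item 8) would need a Unicode database lookup, so all three are left out.

```python
from enum import IntFlag


class Reject(IntFlag):
    """Hypothetical flags naming the categories a caller wants reported."""
    SURROGATE = 1
    ABOVE_MAX = 2      # values > 0x10FFFF
    NONCHARACTER = 4
    REPLACEMENT = 8    # U+FFFD
    PRIVATE_USE = 16


def classify(cp: int) -> Reject:
    """Return the category flag for one code point, or Reject(0) if none."""
    if cp > 0x10FFFF:
        return Reject.ABOVE_MAX
    if 0xD800 <= cp <= 0xDFFF:
        return Reject.SURROGATE
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return Reject.NONCHARACTER
    if cp == 0xFFFD:
        return Reject.REPLACEMENT
    if (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD):
        return Reject.PRIVATE_USE
    return Reject(0)


def scan(code_points, reject: Reject) -> int:
    """Return the index of the first code point in a rejected category, or -1."""
    for i, cp in enumerate(code_points):
        if classify(cp) & reject:
            return i
    return -1


if __name__ == "__main__":
    data = [0x41, 0xFFFE, 0xD800]
    print(scan(data, Reject.NONCHARACTER))
    print(scan(data, Reject.SURROGATE | Reject.PRIVATE_USE))
```

Passing the categories as a bit mask lets one scanner serve both a strict caller, which sets every flag, and a lax one, which sets only a few.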
From @khwilliamson
All 66 characters are now known to both Encode and Decode
@khwilliamson - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#51918 (status was 'resolved')
Searchable as RT51918$