Report information
Id: 51918
Status: resolved
Priority: 0/
Queue: perl5

Owner: khw <khw [at] cpan.org>
Requestors: chris_hall <chris.hall [at] highwayman.com>
Cc: jgmyers <jgmyers [at] proofpoint.com>
AdminCc:

Operating System: (no value)
PatchStatus: (no value)
Severity: low
Type: Unicode
Perl Version: (no value)
Fixed In: (no value)



Subject: UTF-8 (strict) Encode and Decode detect only 1/66 non-characters
Date: Thu, 20 Mar 2008 09:39:54 +0000
To: perlbug [...] perl.org
From: Chris Hall <chris.hall [...] highwayman.com>
This is a bug report for perl from chris.hall@highwayman.com,
generated with the help of perlbug 1.35 running under perl v5.8.8.

-----------------------------------------------------------------
[Please enter your report here]

Encode::encode('UTF-8', $foo) and Encode::decode('UTF-8', $bar) detect the Unicode 'non-character' U+FFFF and treat it as an error.

There are 65 other Unicode non-characters:

  U+FFFE
  U+01FFFE, U+02FFFE, U+03FFFE, ... U+10FFFE
  U+01FFFF, U+02FFFF, U+03FFFF, ... U+10FFFF
  U+FDD0..U+FDEF

which one would expect to be treated the same as U+FFFF.

They aren't. They are accepted as normal characters.

This appears to be a bug.

It's the same under Perl 5.10.0.

(Alternatively, one could argue that detecting the 0xFFFF non-character is less than useful -- this is a perfectly good character, and has uses internally. Perhaps Encode should have an option to allow non-characters? Whichever way you cut it, all non-characters should be handled the same way.)

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=library
    severity=low
---
This perlbug was built using Perl v5.8.8 in the Red Hat build system.
It is being executed now by Perl v5.8.8 - Mon Nov 26 14:25:50 EST 2007.

Site configuration information for perl v5.8.8:

Configured by Red Hat, Inc. at Mon Nov 26 14:25:50 EST 2007.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.20-1.3001.fc6xen, archname=x86_64-linux-thread-multi
    uname='linux xenbuilder4.fedora.phx.redhat.com 2.6.20-1.3001.fc6xen #1 smp thu aug 9 16:18:42 edt 2007 x86_64 x86_64 x86_64 gnulinux '
    config_args='-des -Doptimize=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -Dversion=5.8.8 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Dprivlib=/usr/lib/perl5/5.8.8 -Dsitelib=/usr/lib/perl5/site_perl/5.8.8 -Dvendorlib=/usr/lib/perl5/vendor_perl/5.8.8 -Darchlib=/usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi -Dsitearch=/usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi -Dvendorarch=/usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi -Darchname=x86_64-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_sethostent_r_proto -Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto -Ud_setservent_r_proto -Dinc_version_list=5.8.7 5.8.6 5.8.5 -Dscriptdir=/usr/bin'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=define use64bitall=define uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='4.1.2 20070925 (Red Hat 4.1.2-33)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =''
    libpth=/usr/local/lib64 /lib64 /usr/lib64
    libs=-lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
    perllibs=-lresolv -lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.7'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic'

Locally applied patches:

---
@INC for perl v5.8.8:
    /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi
    /usr/lib64/perl5/site_perl/5.8.7/x86_64-linux-thread-multi
    /usr/lib64/perl5/site_perl/5.8.6/x86_64-linux-thread-multi
    /usr/lib64/perl5/site_perl/5.8.5/x86_64-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.8
    /usr/lib/perl5/site_perl/5.8.7
    /usr/lib/perl5/site_perl/5.8.6
    /usr/lib/perl5/site_perl/5.8.5
    /usr/lib/perl5/site_perl
    /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi
    /usr/lib64/perl5/vendor_perl/5.8.7/x86_64-linux-thread-multi
    /usr/lib64/perl5/vendor_perl/5.8.6/x86_64-linux-thread-multi
    /usr/lib64/perl5/vendor_perl/5.8.5/x86_64-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.8
    /usr/lib/perl5/vendor_perl/5.8.7
    /usr/lib/perl5/vendor_perl/5.8.6
    /usr/lib/perl5/vendor_perl/5.8.5
    /usr/lib/perl5/vendor_perl
    /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi
    /usr/lib/perl5/5.8.8
    .
---
Environment for perl v5.8.8:
    HOME=/home/GMCH
    LANG=en_GB.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash
--
Chris Hall highwayman.com
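
For reference, the 66 non-characters listed in the report can be enumerated mechanically. The following is only a minimal sketch of that arithmetic. The helper names noncharacters() and is_noncharacter() are made up for illustration and are not part of Encode or the utf8 module, and nothing here exercises Encode itself.

```perl
use strict;
use warnings;

# Hypothetical helper: return every Unicode non-character code point.
# That is U+FDD0..U+FDEF (32 values) plus the last two code points
# of each of the 17 planes (34 values).
sub noncharacters {
    my @cps = (0xFDD0 .. 0xFDEF);
    for my $plane (0 .. 0x10) {
        push @cps, ($plane << 16) | 0xFFFE, ($plane << 16) | 0xFFFF;
    }
    return @cps;
}

# Hypothetical helper: true if $cp is one of those code points.
sub is_noncharacter {
    my ($cp) = @_;
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;
    return 1 if $cp <= 0x10FFFF && ($cp & 0xFFFE) == 0xFFFE;
    return 0;
}

my @nonchars = noncharacters();
printf "%d non-characters\n", scalar @nonchars;    # prints "66 non-characters"
```

A list like this could be fed one code point at a time to whatever encode or decode call is under test, to see which of the 66 it rejects.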

This is related/duplicate to bugs 38722 and 43294. 43294 has a proposed fix.
On Thu Mar 20 13:53:50 2008, jgmyers@proofpoint.com wrote:
> This is related/duplicate to bugs 38722 and 43294. 43294 has a proposed
> fix.
Related, except for the confusion between strict UTF-8 and more general string handling.

My understanding is that the utf8::valid() and utf8::decode() functions are related to Perl's internal character handling -- which happens to be based on utf8. All I expect utf8::valid() to tell me is that Perl is happy with a character string (which I may have finagled from somewhere -- for example by fiddling with the utf8 status of the string).

I agree there's a place for functions that are strict UTF-8. I don't think that everything should be like that, though.

The bug I was reporting is, however, in the UTF-8 (strict) handling in Encode.

--
Chris Hall
My primary use of utf8::valid is to determine when it is necessary to take the performance hit of firing up the Encode machinery to clean a string obtained from an unreliable source:

    if (defined($out) && !utf8::valid($out)) {
        utf8::encode($out);                    # turn off utf-8 flag
        $out = Encode::decode('utf-8', $out);  # replace invalid chars with U+FFFD
    }

This requires utf8::valid to do a strict check (as it does with my patch for bug 43294).
Subject: Re: [perl #51918] UTF-8 (strict) Encode and Decode detect only 1/66 non-characters
Date: Sun, 23 Mar 2008 18:11:58 +0000
To: perlbug-followup [...] perl.org
From: Chris Hall <chris.hall [...] highwayman.com>
On Fri, 21 Mar 2008 you wrote
>My primary use of utf8::valid is to determine when it is necessary to
>take the performance hit of firing up the Encode machinery to clean a
>string obtained from an unreliable source:
>
>if (defined($out) && !utf8::valid($out)) {
>    utf8::encode($out); # turn off utf-8 flag
>    $out = Encode::decode('utf-8', $out); # replace invalid chars with U+FFFD
>}
>
>This requires utf8::valid to do a strict check (as it does with my patch
>for bug 43294).
Well, yes, for what you want that is what would be required.

The documentation says:

  $flag = utf8::valid(STRING)

    [INTERNAL] Test whether STRING is in a consistent state regarding
    UTF-8. Will return true is well-formed UTF-8 and has the UTF-8 flag
    on or if string is held as bytes (both these states are 'consistent').
    Main reason for this routine is to allow Perl's testsuite to check
    that operations have left strings in a consistent state.

which is invoking 'UTF-8' in caps and stuff, which one understands from the Encode documentation to mean 'strict' UTF-8.

So either the documentation is phouquée or the code is.

What you want is entirely reasonable. I don't know what the performance issues are with Encode/Decode, but I can see it is tempting to exploit the fact that Perl is using something like UTF-8.

More generally I can see a rôle for a 'quick' scanner that might identify strings that contain any or all of:

  1. broken sequences (probably including sequences starting 0xFE & FF)
  2. surrogates
  3. redundant sequences
  4. values > 0x10FFFF
  5. non-characters
  6. replacement characters
  7. private use characters
  8. unassigned characters

that is: a scan function that takes a second argument to indicate what the application considered 'invalid'. Some applications might like to filter for character blocks that were not locally supported.

Chris
--
Chris Hall highwayman.com
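
The scan function suggested above, with a second argument saying what the caller considers invalid, might look roughly like this. It is a sketch only: the SCAN_* flags, classify_codepoint() and scan_string() are hypothetical names, not an existing Perl or Encode API. It works on the characters of an already-decoded Perl string, so it covers only items 2, 4, 5, 6 and 7 of the list. Broken and redundant sequences (items 1 and 3) would need a byte-level scan, and unassigned characters (item 8) would need Unicode database tables.

```perl
use strict;
use warnings;

# Hypothetical category flags that a caller ORs together.
use constant {
    SCAN_SURROGATE   => 1 << 0,    # U+D800..U+DFFF
    SCAN_SUPER       => 1 << 1,    # values above U+10FFFF
    SCAN_NONCHAR     => 1 << 2,    # the 66 non-characters
    SCAN_REPLACEMENT => 1 << 3,    # U+FFFD
    SCAN_PRIVATE_USE => 1 << 4,    # U+E000..U+F8FF and planes 15 and 16
};

# Return the single category flag for a code point, or 0 if none applies.
sub classify_codepoint {
    my ($cp) = @_;
    return SCAN_SUPER     if $cp > 0x10FFFF;
    return SCAN_SURROGATE if $cp >= 0xD800 && $cp <= 0xDFFF;
    return SCAN_NONCHAR
        if ($cp >= 0xFDD0 && $cp <= 0xFDEF) || ($cp & 0xFFFE) == 0xFFFE;
    return SCAN_REPLACEMENT if $cp == 0xFFFD;
    # Non-characters were handled above, so everything left in
    # planes 15 and 16 is private use.
    return SCAN_PRIVATE_USE
        if ($cp >= 0xE000 && $cp <= 0xF8FF) || $cp >= 0xF0000;
    return 0;
}

# Return the offset of the first character whose category is in the
# $reject mask, or -1 if the string contains none.
sub scan_string {
    my ($str, $reject) = @_;
    for my $i (0 .. length($str) - 1) {
        my $cp = ord substr($str, $i, 1);
        return $i if classify_codepoint($cp) & $reject;
    }
    return -1;
}

my $sample = "ab" . chr(0xFDD0) . chr(0xFFFD);
print scan_string($sample, SCAN_NONCHAR | SCAN_SURROGATE), "\n";    # prints 2
print scan_string($sample, SCAN_PRIVATE_USE), "\n";                 # prints -1
```

Passing a bit mask keeps the policy with the caller: a strict interchange check could reject surrogates, values above U+10FFFF and non-characters, while an internal consistency check could pass 0 and accept everything.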

RT-Send-CC: perl5-porters [...] perl.org
All 66 characters are now known to both Encode and Decode --Karl Williamson

