UTF-8 (strict) Encode and Decode detect only 1/66 non-characters #9259

p5pRT · 2008-03-20T09:41:39Z

Migrated from rt.perl.org#51918 (status was 'resolved')

Searchable as RT51918$

p5pRT · 2008-03-20T09:41:41Z

From chris.hall@highwayman.com

Created by chris.hall@highwayman.com

Encode::encode('UTF-8', $foo) and Encode::decode('UTF-8', $bar) detect the
Unicode 'non-character' U+FFFF and treat it as an error.

There are 65 other Unicode non-characters:

U+FFFE
U+01FFFE, U+02FFFE, U+03FFFE, ... U+10FFFE
U+01FFFF, U+02FFFF, U+03FFFF, ... U+10FFFF
U+FDD0..U+FDEF

which one would expect to be treated the same as U+FFFF.

They aren't. They are accepted as normal characters.

This appears to be a bug.

It's the same under Perl 5.10.0.

(Alternatively, one could argue that detecting the 0xFFFF non-character
is less than useful -- this is a perfectly good character, and has uses
internally. Perhaps Encode should have an option to allow
non-characters? Whichever way you cut it, all non-characters should be
handled the same way.)
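
For reference, the full set of 66 can be generated from the ranges listed above. What follows is only a minimal sketch, not Encode's own code: is_noncharacter is a hypothetical helper name, and the probe at the end merely reports what the installed Encode does, which may differ from one Encode version to another.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Encode): true if $cp is one of the
# 66 Unicode non-characters.
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp < 0 || $cp > 0x10FFFF;
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;    # 32 code points in the BMP
    return ( $cp & 0xFFFE ) == 0xFFFE ? 1 : 0;     # U+nFFFE and U+nFFFF, planes 0..16
}

my @nonchars;
for my $cp ( 0 .. 0x10FFFF ) {
    push @nonchars, $cp if is_noncharacter($cp);
}
printf "%d non-characters\n", scalar @nonchars;    # prints "66 non-characters"

# Probe the installed Encode: which of them does strict UTF-8 refuse to encode?
my @rejected;
{
    no warnings 'utf8';    # chr() of a non-character can warn on some perls
    for my $cp (@nonchars) {
        my $char = chr $cp;    # a copy, since encode() may modify its input when CHECK is set
        my $ok = eval { Encode::encode( 'UTF-8', $char, Encode::FB_CROAK() ); 1 };
        push @rejected, $cp unless $ok;
    }
}
printf "strict UTF-8 rejected %d of %d\n", scalar @rejected, scalar @nonchars;
```

If the second count is neither 0 nor 66, the installed Encode is treating some non-characters differently from others, which is the inconsistency described in this report.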

Perl Info


Flags:
     category=library
     severity=low

This perlbug was built using Perl v5.8.8 in the Red Hat build system.
It is being executed now by Perl v5.8.8 - Mon Nov 26 14:25:50 EST 2007.

Site configuration information for perl v5.8.8:

Configured by Red Hat, Inc. at Mon Nov 26 14:25:50 EST 2007.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
   Platform:
     osname=linux, osvers=2.6.20-1.3001.fc6xen, archname=x86_64-linux-thread-multi
     uname='linux xenbuilder4.fedora.phx.redhat.com 2.6.20-1.3001.fc6xen #1 smp thu aug 9 16:18:42 edt 2007 x86_64 x86_64 x86_64 gnulinux '
     config_args='-des -Doptimize=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -Dversion=5.8.8 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Dprivlib=/usr/lib/perl5/5.8.8 -Dsitelib=/usr/lib/perl5/site_perl/5.8.8 -Dvendorlib=/usr/lib/perl5/vendor_perl/5.8.8 -Darchlib=/usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi -Dsitearch=/usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi -Dvendorarch=/usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi -Darchname=x86_64-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_sethostent_r_proto -Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto -Ud_setservent_r_proto -Dinc_version_list=5.8.7 5.8.6 5.8.5 -Dscriptdir=/usr/bin'
     hint=recommended, useposix=true, d_sigaction=define
     usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
     useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
     use64bitint=define use64bitall=define uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
     optimize='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic',
     cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -I/usr/include/gdbm'
     ccversion='', gccversion='4.1.2 20070925 (Red Hat 4.1.2-33)', gccosandvers=''
     intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
     ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
     alignbytes=8, prototype=define
   Linker and Libraries:
     ld='gcc', ldflags =''
     libpth=/usr/local/lib64 /lib64 /usr/lib64
     libs=-lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
     perllibs=-lresolv -lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
     libc=, so=so, useshrplib=true, libperl=libperl.so
     gnulibc_version='2.7'
   Dynamic Linking:
     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/CORE'
     cccdlflags='-fPIC', lddlflags='-shared -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic'

Locally applied patches:



@INC for perl v5.8.8:
     /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi
     /usr/lib64/perl5/site_perl/5.8.7/x86_64-linux-thread-multi
     /usr/lib64/perl5/site_perl/5.8.6/x86_64-linux-thread-multi
     /usr/lib64/perl5/site_perl/5.8.5/x86_64-linux-thread-multi
     /usr/lib/perl5/site_perl/5.8.8
     /usr/lib/perl5/site_perl/5.8.7
     /usr/lib/perl5/site_perl/5.8.6
     /usr/lib/perl5/site_perl/5.8.5
     /usr/lib/perl5/site_perl
     /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi
     /usr/lib64/perl5/vendor_perl/5.8.7/x86_64-linux-thread-multi
     /usr/lib64/perl5/vendor_perl/5.8.6/x86_64-linux-thread-multi
     /usr/lib64/perl5/vendor_perl/5.8.5/x86_64-linux-thread-multi
     /usr/lib/perl5/vendor_perl/5.8.8
     /usr/lib/perl5/vendor_perl/5.8.7
     /usr/lib/perl5/vendor_perl/5.8.6
     /usr/lib/perl5/vendor_perl/5.8.5
     /usr/lib/perl5/vendor_perl
     /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi
     /usr/lib/perl5/5.8.8
     .


Environment for perl v5.8.8:
     HOME=/home/GMCH
     LANG=en_GB.UTF-8
     LANGUAGE (unset)
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     PATH=/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
     PERL_BADLANG (unset)
     SHELL=/bin/bash

-- 
Chris Hall               highwayman.com

p5pRT · 2008-03-20T20:53:50Z

From jgmyers@proofpoint.com

This is related/duplicate to bugs 38722 and 43294. 43294 has a proposed
fix.

p5pRT · 2008-03-20T20:53:50Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2008-03-21T01:11:32Z

From chris.hall@highwayman.com

On Thu Mar 20 13:53:50 2008, jgmyers@proofpoint.com wrote:

> This is related/duplicate to bugs 38722 and 43294. 43294 has a proposed
> fix.

Related, except for the confusion between strict UTF-8 and more general
string handling.

My understanding is that the utf8::valid() and utf8::decode() functions
are related to Perl's internal character handling -- which happens to
be based on utf8. All I expect utf8::valid() to tell me is that Perl
is happy with a character string (which I may have finagled from
somewhere -- for example by fiddling with the utf8 status of the
string).

I agree there's a place for functions that are strict UTF-8. I don't
think that everything should be like that, though.

The bug I was reporting is, however, in the UTF-8 (strict) handling in
Encode.
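
To make the distinction concrete, here is a minimal sketch of how the two Encode names can be compared. It uses only documented Encode calls (find_encoding, decode, FB_CROAK); the sample bytes are my own choice, and what the second loop prints depends on the installed Encode, so no output is claimed for it.

```perl
use strict;
use warnings;
use Encode ();

# 'UTF-8' and 'utf8' resolve to different encoding objects.
for my $name ( 'UTF-8', 'utf8' ) {
    printf "%-5s -> %s\n", $name, Encode::find_encoding($name)->name;
}

# The three bytes that would encode the surrogate U+D800: a well-formed
# bit pattern, but not something strict UTF-8 is supposed to allow.
my $bytes = "\xED\xA0\x80";

for my $name ( 'UTF-8', 'utf8' ) {
    my $copy = $bytes;    # decode() may modify its input when CHECK is set
    my $ok = eval { Encode::decode( $name, $copy, Encode::FB_CROAK() ); 1 };
    printf "%-5s %s the surrogate bytes\n", $name, $ok ? 'accepts' : 'rejects';
}
```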
--
Chris Hall

p5pRT · 2008-03-21T19:32:36Z

From jgmyers@proofpoint.com

My primary use of utf8::valid is to determine when it is necessary to
take the performance hit of firing up the Encode machinery to clean a
string obtained from an unreliable source:

    if (defined($out) && !utf8::valid($out)) {
        utf8::encode($out); # turn off utf-8 flag
        $out = Encode::decode('utf-8', $out); # replace invalid chars with U+FFFD
    }

This requires utf8::valid to do a strict check (as it does with my patch
for bug 43294).

p5pRT · 2008-03-23T18:37:12Z

From chris.hall@highwayman.com

On Fri, 21 Mar 2008 you wrote:

> My primary use of utf8::valid is to determine when it is necessary to
> take the performance hit of firing up the Encode machinery to clean a
> string obtained from an unreliable source:
>
>     if (defined($out) && !utf8::valid($out)) {
>         utf8::encode($out); # turn off utf-8 flag
>         $out = Encode::decode('utf-8', $out); # replace invalid chars with U+FFFD
>     }
>
> This requires utf8::valid to do a strict check (as it does with my patch
> for bug 43294).

Well, yes, for what you want that is what would be required.

The documentation says:

$flag = utf8::valid(STRING)

[INTERNAL] Test whether STRING is in a consistent state regarding
UTF-8. Will return true is well-formed UTF-8 and has the UTF-8 flag on
or if string is held as bytes (both these states are 'consistent').
Main reason for this routine is to allow Perl's testsuite to check
that operations have left strings in a consistent state.

which is invoking 'UTF-8' in caps and stuff, which one understands from
the Encode documentation to mean 'strict' UTF-8.

So either the documentation is phouquée or the code is.

What you want is entirely reasonable.

I don't know what the performance issues are with Encode/Decode, but I
can see it is tempting to exploit the fact that Perl is using something
like UTF-8.

More generally I can see a rôle for a 'quick' scanner that might
identify strings that contain any or all of:

1. broken sequences (probably including sequences starting 0xFE & FF)

2. surrogates

3. redundant sequences

4. values > 0x10FFFF

5. non-characters

6. replacement characters

7. private use characters

8. unassigned characters

that is: a scan function that takes a second argument to indicate what
the application considers 'invalid'. Some applications might like to
filter for character blocks that are not locally supported.
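
A rough sketch of what such a scanner could look like, working on an already-decoded character string. Everything here is hypothetical: the SCAN_* flag names, category_of and scan_string are made up for illustration and exist in no module. Items 1 and 3 (broken and redundant sequences) are byte-level properties and item 8 (unassigned) needs the Unicode database, so they are left out.

```perl
use strict;
use warnings;

# Hypothetical flag names, one bit per category.
use constant {
    SCAN_SURROGATE   => 1 << 0,    # U+D800..U+DFFF
    SCAN_TOO_LARGE   => 1 << 1,    # above U+10FFFF
    SCAN_NONCHAR     => 1 << 2,    # the 66 non-characters
    SCAN_REPLACEMENT => 1 << 3,    # U+FFFD
    SCAN_PRIVATE_USE => 1 << 4,    # BMP and plane 15/16 private use areas
};

# Category bit for one code point, or 0 if it falls in none of them.
sub category_of {
    my ($cp) = @_;
    return SCAN_TOO_LARGE   if $cp > 0x10FFFF;
    return SCAN_SURROGATE   if $cp >= 0xD800 && $cp <= 0xDFFF;
    return SCAN_NONCHAR     if ( $cp >= 0xFDD0 && $cp <= 0xFDEF ) || ( $cp & 0xFFFE ) == 0xFFFE;
    return SCAN_REPLACEMENT if $cp == 0xFFFD;
    return SCAN_PRIVATE_USE if ( $cp >= 0xE000 && $cp <= 0xF8FF ) || $cp >= 0xF0000;
    return 0;
}

# Scan a character string; return a bitmask of the $wanted categories found.
sub scan_string {
    my ( $string, $wanted ) = @_;
    my $found = 0;
    for my $char ( split //, $string ) {
        $found |= category_of( ord $char ) & $wanted;
        last if $found == $wanted;    # every wanted category already seen
    }
    return $found;
}

# Usage: this application only cares about non-characters and U+FFFD,
# so the private use character in the string is ignored.
my $text = "ok" . chr(0xFFFD) . chr(0xE000);
my $hits = scan_string( $text, SCAN_NONCHAR | SCAN_REPLACEMENT );
print "replacement character present\n" if $hits & SCAN_REPLACEMENT;
```

Returning a bitmask rather than a boolean lets the caller see which of the categories it asked about were actually present.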

Chris
--
Chris Hall highwayman.com

p5pRT · 2011-01-10T02:33:44Z

From @khwilliamson

All 66 non-characters are now known to both Encode and Decode
--Karl Williamson

p5pRT · 2011-01-10T02:33:45Z

@khwilliamson - Status changed from 'open' to 'resolved'

p5pRT closed this as completed Jan 10, 2011

p5pRT added Severity Low type-Unicode labels Oct 18, 2019
