perl cannot read UTF-16 files with illegal unicode #10934

p5pRT · 2010-12-30T01:36:50Z

Migrated from rt.perl.org#81454 (status was 'rejected')

Searchable as RT81454$

p5pRT · 2010-12-30T01:36:51Z

From andrew@pimlott.net

Created by andrew@pimlott.net

Create UTF-8 and UTF-16LE files containing the character U+FDD0. (For
UTF-8, this is the bytes ef b7 93; for UTF-16LE, it is d3 fd.) With the
UTF-8 file as STDIN, run

binmode(STDIN, ':encoding(UTF-8)');
while (<STDIN>) { }

The program runs without complaint. With the UTF-16LE file as STDIN, run

binmode(STDIN, ':encoding(UTF-16LE)');
while (<STDIN>) { }

The program dies with

UTF-16LE:Unicode character fdd3 is illegal at ./bin/grep_high line 2.

This is a fatal error and I find no way to turn it off except perhaps to
call Encode::decode by hand. I have run across files like this in the real
world, and it would be nice to read them with the standard filehandle
mechanism. Also, the difference between UTF-8 and UTF-16 behavior seems
unjustified.

I suggest that this diagnostic be a warning, just like the "is illegal for
interchange" messages emitted in other contexts, and be disabled by "no
warnings 'utf8'". Also, this form of the diagnostic is not documented in
perldiag, even though it practically comes from the perl core.

Andrew

Perl Info


Flags:
    category=library
    severity=low
    module=Encode

Site configuration information for perl 5.10.1:

Configured by Debian Project at Tue Nov  2 09:44:07 UTC 2010.

Summary of my perl5 (revision 5 version 10 subversion 1) configuration:
   
  Platform:
    osname=linux, osvers=2.6.32.25-dsa-ia32, archname=i486-linux-gnu-thread-multi
    uname='linux murphy 2.6.32.25-dsa-ia32 #1 smp fri oct 29 10:49:58 cest 2010 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.1 -Dsitearch=/usr/local/lib/perl/5.10.1 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.1 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.4.5', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.11.2.so, so=so, useshrplib=true, libperl=libperl.so.5.10.1
    gnulibc_version='2.11.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib -fstack-protector'

Locally applied patches:
    DEBPKG:debian/arm_thread_stress_timeout - http://bugs.debian.org/501970 Raise the timeout of ext/threads/shared/t/stress.t to accommodate slower build hosts
    DEBPKG:debian/cpan_config_path - Set location of CPAN::Config to /etc/perl as /usr may not be writable.
    DEBPKG:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN.
    DEBPKG:debian/db_file_ver - http://bugs.debian.org/340047 Remove overly restrictive DB_File version check.
    DEBPKG:debian/doc_info - Replace generic man(1) instructions with Debian-specific information.
    DEBPKG:debian/enc2xs_inc - http://bugs.debian.org/290336 Tweak enc2xs to follow symlinks and ignore missing @INC directories.
    DEBPKG:debian/errno_ver - http://bugs.debian.org/343351 Remove Errno version check due to upgrade problems with long-running processes.
    DEBPKG:debian/extutils_hacks - Various debian-specific ExtUtils changes
    DEBPKG:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets.
    DEBPKG:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor.
    DEBPKG:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy.
    DEBPKG:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable.
    DEBPKG:debian/m68k_thread_stress - http://bugs.debian.org/495826 Disable some threads tests on m68k for now due to missing TLS.
    DEBPKG:debian/mod_paths - Tweak @INC ordering for Debian
    DEBPKG:debian/module_build_man_extensions - http://bugs.debian.org/479460 Adjust Module::Build manual page extensions for the Debian Perl policy
    DEBPKG:debian/perl_synopsis - http://bugs.debian.org/278323 Rearrange perl.pod
    DEBPKG:debian/prune_libs - http://bugs.debian.org/128355 Prune the list of libraries wanted to what we actually need.
    DEBPKG:debian/use_gdbm - Explicitly link against -lgdbm_compat in ODBM_File/NDBM_File. 
    DEBPKG:fixes/assorted_docs - http://bugs.debian.org/443733 [384f06a] Math::BigInt::CalcEmu documentation grammar fix
    DEBPKG:fixes/net_smtp_docs - http://bugs.debian.org/100195 [rt.cpan.org #36038] Document the Net::SMTP 'Port' option
    DEBPKG:fixes/processPL - http://bugs.debian.org/357264 [rt.cpan.org #17224] Always use PERLRUNINST when building perl modules.
    DEBPKG:debian/perlivp - http://bugs.debian.org/510895 Make perlivp skip include directories in /usr/local
    DEBPKG:fixes/pod2man-index-backslash - http://bugs.debian.org/521256 Escape backslashes in .IX entries
    DEBPKG:debian/disable-zlib-bundling - Disable zlib bundling in Compress::Raw::Zlib
    DEBPKG:fixes/kfreebsd_cppsymbols - http://bugs.debian.org/533098 [3b910a0] Add gcc predefined macros to $Config{cppsymbols} on GNU/kFreeBSD.
    DEBPKG:debian/cpanplus_definstalldirs - http://bugs.debian.org/533707 Configure CPANPLUS to use the site directories by default.
    DEBPKG:debian/cpanplus_config_path - Save local versions of CPANPLUS::Config::System into /etc/perl.
    DEBPKG:fixes/kfreebsd-filecopy-pipes - http://bugs.debian.org/537555 [16f708c] Fix File::Copy::copy with pipes on GNU/kFreeBSD
    DEBPKG:fixes/anon-tmpfile-dir - http://bugs.debian.org/528544 [perl #66452] Honor TMPDIR when open()ing an anonymous temporary file
    DEBPKG:fixes/abstract-sockets - http://bugs.debian.org/329291 [89904c0] Add support for Abstract namespace sockets.
    DEBPKG:fixes/hurd_cppsymbols - http://bugs.debian.org/544307 [eeb92b7] Add gcc predefined macros to $Config{cppsymbols} on GNU/Hurd.
    DEBPKG:fixes/autodie-flock - http://bugs.debian.org/543731 Allow for flock returning EAGAIN instead of EWOULDBLOCK on linux/parisc
    DEBPKG:fixes/archive-tar-instance-error - http://bugs.debian.org/539355 [rt.cpan.org #48879] Separate Archive::Tar instance error strings from each other
    DEBPKG:fixes/positive-gpos - http://bugs.debian.org/545234 [perl #69056] [c584a96] Fix \\G crash on first match
    DEBPKG:debian/devel-ppport-ia64-optim - http://bugs.debian.org/548943 Work around an ICE on ia64
    DEBPKG:fixes/trie-logic-match - http://bugs.debian.org/552291 [perl #69973] [0abd0d7] Fix a DoS in Unicode processing [CVE-2009-3626]
    DEBPKG:fixes/hppa-thread-eagain - http://bugs.debian.org/554218 make the threads-shared test suite more robust, fixing failures on hppa
    DEBPKG:fixes/crash-on-undefined-destroy - http://bugs.debian.org/564074 [perl #71952] [1f15e67] Fix a NULL pointer dereference when looking for a DESTROY method
    DEBPKG:fixes/tainted-errno - http://bugs.debian.org/574129 [perl #61976] [be1cf43] fix an errno stringification bug in taint mode
    DEBPKG:fixes/safe-upgrade - http://bugs.debian.org/582978 Upgrade Safe.pm to 2.25, fixing CVE-2010-1974
    DEBPKG:fixes/tell-crash - http://bugs.debian.org/578577 [f4817f3] Fix a tell() crash on bad arguments.
    DEBPKG:fixes/format-write-crash - http://bugs.debian.org/579537 [perl #22977] [421f30e] Fix a crash in format/write
    DEBPKG:fixes/arm-alignment - http://bugs.debian.org/289884 [f1c7503] Prevent gcc from optimizing the alignment test away on armel
    DEBPKG:fixes/fcgi-test - Fix a failure in CGI/t/fast.t when FCGI is installed
    DEBPKG:fixes/hurd-ccflags - http://bugs.debian.org/587901 Make hints/gnu.sh append to $ccflags rather than overriding them
    DEBPKG:debian/squelch-locale-warnings - http://bugs.debian.org/508764 Squelch locale warnings in Debian package maintainer scripts
    DEBPKG:fixes/lc-numeric-docs - http://bugs.debian.org/379329 [perl #78452] [903eb63] LC_NUMERIC documentation fixes
    DEBPKG:fixes/lc-numeric-sprintf - http://bugs.debian.org/601549 [perl #78632] [b3fd614] Fix sprintf not to ignore LC_NUMERIC with constants
    DEBPKG:fixes/concat-stack-corruption - http://bugs.debian.org/596105 [perl #78674] [e3393f5] Fix stack pointer corruption in pp_concat() with 'use encoding'
    DEBPKG:patchlevel - http://bugs.debian.org/567489 List packaged patches for 5.10.1-16 in patchlevel.h


@INC for perl 5.10.1:
    /home/andrew/lib/perl
    /home/andrew/local/lib/perl
    /etc/perl
    /usr/local/lib/perl/5.10.1
    /usr/local/share/perl/5.10.1
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    .


Environment for perl 5.10.1:
    HOME=/home/andrew
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LC_COLLATE=C
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/andrew/bin:/usr/sbin:/sbin:/home/andrew/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
    PERL5LIB=/home/andrew/lib/perl:/home/andrew/local/lib/perl
    PERLDB_OPTS=HistFile=/home/andrew/.perldb.history HistSize=1000
    PERL_BADLANG (unset)
    SHELL=/usr/bin/zsh

p5pRT · 2010-12-30T01:57:49Z

From andrew@pimlott.net

I said U+FDD0 but the example bytes were for U+FDD3. Everything else
holds.

Andrew

p5pRT · 2011-01-13T21:58:06Z

From @khwilliamson

I'm sorry, but this ticket should instead be against the CPAN module
Encode, and not Perl itself, and what you are requesting is actually an
enhancement; it isn't actually a bug. The url to enter Encode tickets is:
https://rt.cpan.org/Public/Bug/Report.html?Queue=Encode

Note that U+FDD3 and the other non-character code points are not
strictly "illegal", but they are "illegal for open interchange", and so
actually Encode is working as required by the Unicode standard, since it
could be getting this from open interchange. What you are really asking
for is a new input method that does not do strict checking for these
code points, perhaps with "lax" in its name. It is important to do the
strict checking by default, as one of the non-character code points is
U+FFFE, which could be used in UTF-16 as the first character in an
attack to cause the platform to think that the byte order is swapped
from what it actually is.

--Karl Williamson

p5pRT · 2011-01-13T21:58:07Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2011-01-13T21:58:07Z

@khwilliamson - Status changed from 'open' to 'rejected'

p5pRT · 2011-01-15T09:16:48Z

From andrew@pimlott.net

I will refile the issue with Encode. However, specifying the encoding
of filehandles is effectively a perl core feature. So I think it's
important to have intuitive and consistent behavior, assuming the user
doesn't know anything about Encode behind the curtains. I would
suggest:

- that all encodings treat these characters in the same way as the perl
core
- that the default behavior be a warning (with the same text as the core
warning--currently it is a little different)
- that the warning be controlled by 'use warnings'

The last requires that the warning flags be passed on to Encode. This
is more work but is more consistent to the programmer than adding a
"lax" encoding name. Also, when he looks up the error in perldiag(1),
that's what he is told to do.

Interesting point about the security implication. But U+FEFF could as
well be used maliciously, but it is accepted. And an attack might be
routed through UTF-8 or another encoding, so by the same reasoning it
should be an error there. (It might even be routed through a perl
program that constructs strings with chr(), so chr(0xFDD0) should be an
error.)

And it's a little strong to say that Encode is "working as required",
because Unicode says "Applications are free to use any of these
noncharacter code points internally", and Encode doesn't know anything
about the context of the program. (I probably biased this bug report
against myself by calling these characters "illegal", following perl's
terminology, instead of "noncharacters" ;-) )

Andrew

p5pRT · 2011-01-22T21:28:41Z

From @khwilliamson

On 01/14/2011 03:15 PM, Andrew Pimlott wrote:

I will refile the issue with Encode. However, specifying the encoding
of filehandles is effectively a perl core feature. So I think it's
important to have intuitive and consistent behavior, assuming the user
doesn't know anything about Encode behind the curtains.

I'm sorry that you're being exposed to Perl's internal organizational
structure. Our excuse would be that it's essentially an all volunteer
organization, with no funding to do things "right." (But I've noticed
that this is true as well for companies that would have the resources to
avoid this, should they choose.) In any event, Encode is a CPAN module
that exists and is maintained independently of the Perl core, and works
on any number of Perl versions. The code that causes the behavior you
don't like is in Encode. The functionality that Encode provides has
been deemed so useful that it's now shipped with the core; and, as is
often the case, the fact that it's independent only shows up as an issue
when there is a flaw.

As an aside, UTF-16 parallels UTF-8, in that if you use Encode with that
spelling of the encoding, you get the same strict behavior you do with
UTF-16. It's just that there are alternative spellings for 8 bit
encoding that give laxer behavior. I am of the opinion that those
spellings should be different than they are, and involve something like
"lax" to indicate to the user that they are getting that.

Security is serious business. I'm unhappy that Perl, including the
modules it ships with, has too many instances of not treating it that
way. I view it as unconscionable.

I would

suggest:

- that all encodings treat these characters in the same way as the perl
core
- that the default behavior be a warning (with the same text as the core
warning--currently it is a little different)
- that the warning be controlled by 'use warnings'

The last requires that the warning flags be passed on to Encode. This
is more work but is more consistent to the programmer than adding a
"lax" encoding name. Also, when he looks up the error in perldiag(1),
that's what he is told to do.

The default behavior can't be just a warning when a server is facing the
wide-world of hackers. The current behavior of failing when warnings
are enabled has been considered a bug, and is now changed in 5.13.9.

Interesting point about the security implication. But U+FEFF could as
well be used maliciously, but it is accepted. And an attack might be
routed through UTF-8 or another encoding, so by the same reasoning it
should be an error there. (It might even be routed through a perl
program that constructs strings with chr(), so chr(0xFDD0) should be an
error.)

I'm not sure if you were being ironic here, as it is self-contradictory;
so I have to assume you were being straight. You said in an earlier
post that the non-characters are legal internally. So chr() has to
accept them. I don't see how U+FEFF or encoding them in UTF-8 could be
security implications.

And it's a little strong to say that Encode is "working as required",
because Unicode says "Applications are free to use any of these
noncharacter code points internally", and Encode doesn't know anything
about the context of the program. (I probably biased this bug report
against myself by calling these characters "illegal", following perl's
terminology, instead of "noncharacters" ;-) )

Since Encode doesn't know anything about the context, it has to assume
the worst case. To do otherwise is to leave users unknowingly open to
attack.

There was no biasing involved. I've been trying to root out all
instances of calling these "illegal" in the 5.13.X series.

p5pRT · 2011-01-30T19:05:35Z

From andrew@pimlott.net

Excerpts from Karl Williamson's message of Sat Jan 22 13:16:45 -0800 2011:

I'm sorry that you're being exposed to Perl's internal organizational
structure.

No prob about that. I've filed the bug with Encode:
https://rt.cpan.org/Public/Bug/Display.html?id=64788

I'm just suggesting that some coordination with the core perl maintaiers
is warranted, since Encode is so closely integrated.

As an aside, UTF-16 parallels UTF-8, in that if you use Encode with that
spelling of the encoding, you get the same strict behavior you do with
UTF-16.

I don't think that's accurate--see the original example I posted:

binmode(STDIN, ':encoding(UTF-8)');
while (<STDIN>) { }

Input is EF B7 93, which decodes to U+FDD3, a noncharacter. There is no
diagnostic. (Unless this is changed in a recent dev version.)

The default behavior can't be just a warning when a server is facing the
wide-world of hackers.

That's a fair point, but consider the other side: I write code that is
correct according to Unicode and my application's semantics. One day,
my application fails because Encode surprisingly considers some valid
input illegal. It's a judgement call, and don't think the best default
policy is obvious. A warning is a reasonable compromise, IMO.

Even better would be to document the situation clearly:

As a security precaution, the following encodings consider Unicode
"noncharacters" to be malformed. If you want to decode Unicode
noncharacters, ...

Interesting point about the security implication. But U+FEFF could as
well be used maliciously, but it is accepted. And an attack might be
routed through UTF-8 or another encoding, so by the same reasoning it
should be an error there. (It might even be routed through a perl
program that constructs strings with chr(), so chr(0xFDD0) should be an
error.)

I'm not sure if you were being ironic here, as it is self-contradictory;
so I have to assume you were being straight. You said in an earlier
post that the non-characters are legal internally. So chr() has to
accept them.

I'd put it differently: perl and Encode have no idea what data is
"internal". Input being read from a filehandle might well be internal,
eg. from a file the application itself produced. An argument to chr()
might well be external, eg a number supplied by a potentially malicious
agent. Saying Encode::decode() should be more strict than chr() is
merely a rough heuristic.

By the way, to be clear, chr(0xFDD0) does throw a warning, so someone
already decided that this should be checked here. I say, the best thing
for the programmer is to be consistent: Have one way to say whether
noncharacters should be errors, warings, or non-issues, and honor it
everywhere. The "no warnings 'utf8'" pragma would be the natural way to
do this.

I don't see how U+FEFF or encoding them in UTF-8 could be
security implications.

Oops, I got that one wrong. But according to my understanding, U+FFFE
is the only character with this security implication. So this rationale
should not be used to brand other noncharacters "illegal".

Since Encode doesn't know anything about the context, it has to assume
the worst case. To do otherwise is to leave users unknowingly open to
attack.

I appreciate your caution. But it's a judgement call as to how paranoid
you should be, when you may cause unexpected errors in valid programs.
"has to assume the worst case" is an extreme philosophy.

There was no biasing involved. I've been trying to root out all
instances of calling these "illegal" in the 5.13.X series.

Great--clear diagnostics and documentation will make these issues less
trouble for everyone.

Andrew

p5pRT · 2011-01-31T06:34:54Z

From @ap

* Andrew Pimlott <andrew@pimlott.net> [2011-01-30 20:10]:

That's a fair point, but consider the other side: I write
code that is correct according to Unicode and my application's
semantics. One day, my application fails because Encode
surprisingly considers some valid input illegal. It's
a judgement call, and don't think the best default policy
is obvious. A warning is a reasonable compromise, IMO.

I don’t think compromises are desirable here. And programs that
do something useful invariably shuffle data through both internal
and external interfaces, so it must be possible to affect each
interface individually, rather than via some action-a-distance
quasi-/global setting like toggling warnings that effects them
all. It should be possible to specifically request strict or lax
treatment of input per interface (eg. by parametrising I/O layers
differently).

And that is already the plan for the future. (The :utf8 layer
will be safe by default and there’ll be ways to request decoding
that differ.) Other inconsistencies with insufficient checking
(or non-checking!) of input such as with :encoding(UTF-8) should
get cleared up over time. That Perl isn’t very clean in this area
is known. Karl Williamson has been putting tremendous energy into
scrubbing it all down for the last year or so.

For things like `chr` the situation is more difficult since perl
can’t tell which operation belongs to what interface. The warning
we have is probably the only feasible solution.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT closed this as completed Jan 13, 2011

p5pRT added Severity Low distro-Linux sendToCPAN type-library labels Oct 18, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perl cannot read UTF-16 files with illegal unicode #10934

perl cannot read UTF-16 files with illegal unicode #10934

p5pRT commented Dec 30, 2010

p5pRT commented Dec 30, 2010

p5pRT commented Dec 30, 2010

p5pRT commented Jan 13, 2011

p5pRT commented Jan 13, 2011

p5pRT commented Jan 13, 2011

p5pRT commented Jan 15, 2011

p5pRT commented Jan 22, 2011

p5pRT commented Jan 30, 2011

p5pRT commented Jan 31, 2011

perl cannot read UTF-16 files with illegal unicode #10934

perl cannot read UTF-16 files with illegal unicode #10934

Comments

p5pRT commented Dec 30, 2010

p5pRT commented Dec 30, 2010

From andrew@pimlott.net

Created by andrew@pimlott.net

p5pRT commented Dec 30, 2010

From andrew@pimlott.net

p5pRT commented Jan 13, 2011

From @khwilliamson

p5pRT commented Jan 13, 2011

p5pRT commented Jan 13, 2011

p5pRT commented Jan 15, 2011

From andrew@pimlott.net

p5pRT commented Jan 22, 2011

From @khwilliamson

p5pRT commented Jan 30, 2011

From andrew@pimlott.net

p5pRT commented Jan 31, 2011

From @ap