UTF-16 filters do not handle all surrogates gracefully #10118

Closed
p5pRT opened this issue Feb 2, 2010 · 6 comments

p5pRT commented Feb 2, 2010

Migrated from rt.perl.org#72414 (status was 'resolved')

Searchable as RT72414$


p5pRT commented Feb 2, 2010

From @nwc10

Created by @nwc10

Consider a script written in UTF-16BE, with a character whose surrogate pair
contains the octet 10:

$ ./perl -Ilib -MEncode -we 'print "\xFE\xFF", encode("UTF16-BE", "warn qq[Hello world]; # \x{12800}")' >script.pl

$ ./perl script.pl
Malformed UTF-16 surrogate.

But that isn't true:

$ iconv -f UTF-16BE -t UTF-8 <script.pl | ./perl
Hello world at - line 1.
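
To see where the octet 10 comes from, the surrogate pair for U+12800 can be worked out by hand. What follows is a minimal Python sketch for illustration only, not code from perl; the helper name `utf16be_surrogates` is made up here.

```python
def utf16be_surrogates(code_point: int) -> bytes:
    """Encode a supplementary-plane code point as a UTF-16BE surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 | (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")


pair = utf16be_surrogates(0x12800)
print(pair.hex(" "))   # d8 0a dc 00
print(0x0A in pair)    # True: the high surrogate's second octet is 10
```

The second octet of the high surrogate is 0x0A, which a byte-oriented reader would take for a newline.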

The problem is that utf16_textfilter() is reading "line" by "line", assuming
an encoding where an octet of 10 is end of line, and making no effort to
get all 4 octets of a surrogate pair before calling utf16_to_utf8()
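
The effect of reading up to an octet of 10 can be reproduced outside perl. The Python sketch below is an assumption-laden illustration, not the logic in toke.c: it omits the BOM, and it models the "line" read as a plain cut after the first 0x0A octet.

```python
# The script body from the report, encoded as UTF-16BE (BOM omitted for brevity).
script = "warn qq[Hello world]; # \U00012800".encode("utf-16-be")

# Naive "line" read: stop right after the first octet equal to 10 (0x0A),
# the way a reader that assumes a single-byte encoding would.
cut = script.index(b"\x0a") + 1
first_chunk, rest = script[:cut], script[cut:]

print(len(first_chunk) % 2)        # 0: the chunk still holds whole 16-bit units
print(first_chunk[-2:].hex(" "))   # d8 0a: a high surrogate with no partner
print(rest.hex(" "))               # dc 00: the low surrogate, left for the next read

try:
    first_chunk.decode("utf-16-be")
except UnicodeDecodeError as err:
    # A strict decoder rejects the chunk because the pair was split.
    print("decode failed:", err.reason)
```

The chunk has an even length, so checking for whole 16-bit units is not enough; the reader has to look at the last unit and fetch two more octets when it is a high surrogate.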

The latter (also) doesn't check for end of buffer when reading the second
half of a surrogate pair.
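
As a sketch of the missing check, a decoder can confirm that the low half is inside the buffer before reading it. This is hypothetical Python, not perl's utf16_to_utf8(); the function name and error messages are invented for illustration.

```python
from typing import List


def utf16be_to_code_points(buf: bytes) -> List[int]:
    """Decode UTF-16BE into code points without reading past the buffer end."""
    if len(buf) % 2:
        raise ValueError("odd number of octets")
    out = []
    i = 0
    while i < len(buf):
        unit = int.from_bytes(buf[i:i + 2], "big")
        i += 2
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: check that the low half is really in the buffer.
            if i + 2 > len(buf):
                raise ValueError("truncated surrogate pair at end of buffer")
            low = int.from_bytes(buf[i:i + 2], "big")
            if not 0xDC00 <= low <= 0xDFFF:
                raise ValueError("malformed UTF-16 surrogate")
            i += 2
            unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        out.append(unit)
    return out


print(hex(utf16be_to_code_points(b"\xd8\x0a\xdc\x00")[0]))   # 0x12800
```

Python slices cannot overrun the buffer, so the explicit length test stands in for the pointer comparison that C code would need at the same point.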

UTF-16LE will suffer the same bugs, once the reading-off-by-one bug is fixed
in utf16rev_textfilter()

Nicholas Clark

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.11.0.42:

Configured by nick at Fri Oct 16 20:36:59 BST 2009.

Summary of my perl5 (revision 5 version 11 subversion 0) configuration:
  Commit id: 20d0b1e9c410d995ea730a00781152c652d4b672
  Platform:
    osname=linux, osvers=2.6.18-xenu, archname=x86_64-linux-thread-multi
    uname='linux zazen 2.6.18-xenu #1 smp thu oct 4 12:23:41 bst 2007 x86_64 gnulinux '
    config_args='-Dusedevel=y -Dcc=ccache g++ -Dld=g++ -Ubincompat5005 -Uinstallusrbinperl -Dcf_email=nick@ccl4.org -Dperladmin=nick@ccl4.org -Dinc_version_list=  -Dinc_version_list_init=0 -Doptimize=-Os -Dusethreads -Duse64bitall -Uusemymalloc -Duseperlio -Dprefix=~/Sandpit/snap5.9.x-v5.11.0-218-g20d0b1e -Uusevendorprefix -Uvendorprefix=~/Sandpit/snap5.9.x-v5.11.0-218-g20d0b1e -Dinstallman1dir=none -Dinstallman3dir=none -Uuserelocatableinc -de'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='ccache g++', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-Os',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.3.2', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='g++', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
    libs=-lnsl -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=/lib/libc.so.6, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.7'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -Os -L/usr/local/lib -fstack-protector'

Locally applied patches:
    


@INC for perl 5.11.0.42:
    lib
    /home/nick/Sandpit/snap5.9.x-v5.11.0-218-g20d0b1e/lib/perl5/site_perl/5.11.0/x86_64-linux-thread-multi
    /home/nick/Sandpit/snap5.9.x-v5.11.0-218-g20d0b1e/lib/perl5/site_perl/5.11.0
    /home/nick/Sandpit/snap5.9.x-v5.11.0-218-g20d0b1e/lib/perl5/5.11.0/x86_64-linux-thread-multi
    /home/nick/Sandpit/snap5.9.x-v5.11.0-218-g20d0b1e/lib/perl5/5.11.0
    .


Environment for perl 5.11.0.42:
    HOME=/home/nick
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/nick/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/usr/local/sbin:/sbin:/usr/sbin
    PERL_BADLANG (unset)
    SHELL=/bin/bash


p5pRT commented Feb 27, 2017

From @jkeenan

On Tue, 02 Feb 2010 08:26:24 GMT, nicholas wrote:

This is a bug report for perl from nick@ccl4.org,
generated with the help of perlbug 1.39 running under perl 5.11.0.42.

-----------------------------------------------------------------
[Please describe your issue here]

Consider a script written in UTF-16BE, with a character whose surrogate pair
contains the octet 10:

$ ./perl -Ilib -MEncode -we 'print "\xFE\xFF", encode("UTF16-BE", "warn qq[Hello world]; # \x{12800}")' >script.pl

$ ./perl script.pl
Malformed UTF-16 surrogate.

But that isn't true:

$ iconv -f UTF-16BE -t UTF-8 <script.pl | ./perl
Hello world at - line 1.

The problem is that utf16_textfilter() is reading "line" by "line", assuming
an encoding where an octet of 10 is end of line, and making no effort to
get all 4 octets of a surrogate pair before calling utf16_to_utf8()

The latter (also) doesn't check for end of buffer when reading the second
half of a surrogate pair.

UTF-16LE will suffer the same bugs, once the reading-off-by-one bug is fixed
in utf16rev_textfilter()

Nicholas Clark

This problem appears to have been corrected somewhere between 5.10.1 and 5.12.5.

#####
$ perlbrew use perl-5.10.1
$ perl -v | head -2 | tail -1
This is perl, v5.10.1 (*) built for x86_64-linux
$ perl 72414-script.pl
Malformed UTF-16 surrogate.

$ perlbrew use perl-5.12.5
$ perl -v | head -2 | tail -1
This is perl 5, version 12, subversion 5 (v5.12.5) built for x86_64-linux
$ perl 72414-script.pl
Hello world at 72414-script.pl line 1.
#####

However, I haven't been able to figure out how to use Porting/bisect.pl to determine the commit at which the program first completed successfully. Suggestions?

Thank you very much.

--
James E Keenan (jkeenan@cpan.org)


p5pRT commented Feb 27, 2017

The RT System itself - Status changed from 'new' to 'open'


p5pRT commented Feb 27, 2017

From @hvds

On Sun, 26 Feb 2017 19:10:09 -0800, jkeenan wrote:

However, I haven't been able to figure out how to use
Porting/bisect.pl to determine the commit at which the program first
completed successfully. Suggestions?

Verify that the testcase exits non-zero on failure and zero on success:

% perl-5.10 ~/72414-script.pl
Malformed UTF-16 surrogate.
% echo $?
9
% perl-blead ~/72414-script.pl
Hello world at /home/hv/72414-script.pl line 1.
% echo $?
0
%

Check the docs for an example of "when was this fixed":

% perldoc Porting/bisect-runner.pl | grep -A1 'stop being an error'
  # When did this stop being an error?
  .../Porting/bisect.pl --expect-fail -e '1 // 2'
%

Bisect:

% Porting/bisect.pl --expect-fail -- ./perl -Ilib ~/72414-script.pl
[...]
ba77e4c is the first bad commit
commit ba77e4c
Author: Nicholas Clark <nick@ccl4.org>
Date: Thu Oct 22 19:39:30 2009 +0100

  S_utf16_textfilter() needs to avoid splitting UTF-16 surrogate pairs.
 
  Easier said than done.

:040000 040000 00e64049450c3e91b8d09afa4b676520cc75836e f73afa6dfba581efaa53915a40b8c611e07cf23f M t
:100644 100644 f795707e0d90fbc38ebad23b3b8944647530c5e0 f105505ea49664c0a0d00a89ecff57ccb32ee284 M toke.c
bisect run success
That took 1277 seconds.
%

The bisector could helpfully s/bad commit/good commit/ under expect-fail.

Hugo


p5pRT commented Feb 27, 2017

From @jkeenan

On Mon, 27 Feb 2017 12:12:18 GMT, hv wrote:

On Sun, 26 Feb 2017 19:10:09 -0800, jkeenan wrote:

However, I haven't been able to figure out how to use
Porting/bisect.pl to determine the commit at which the program first
completed successfully. Suggestions?

Verify that the testcase exits non-zero on failure and zero on success:

% perl-5.10 ~/72414-script.pl
Malformed UTF-16 surrogate.
% echo $?
9
% perl-blead ~/72414-script.pl
Hello world at /home/hv/72414-script.pl line 1.
% echo $?
0
%

Check the docs for an example of "when was this fixed":

% perldoc Porting/bisect-runner.pl | grep -A1 'stop being an error'
# When did this stop being an error?
.../Porting/bisect.pl --expect-fail -e '1 // 2'
%

Bisect:

% Porting/bisect.pl --expect-fail -- ./perl -Ilib ~/72414-script.pl
[...]
ba77e4c is the first bad commit
commit ba77e4c
Author: Nicholas Clark <nick@ccl4.org>
Date: Thu Oct 22 19:39:30 2009 +0100

S_utf16_textfilter() needs to avoid splitting UTF-16 surrogate pairs.

Easier said than done.

:040000 040000 00e64049450c3e91b8d09afa4b676520cc75836e f73afa6dfba581efaa53915a40b8c611e07cf23f M t
:100644 100644 f795707e0d90fbc38ebad23b3b8944647530c5e0 f105505ea49664c0a0d00a89ecff57ccb32ee284 M toke.c
bisect run success
That took 1277 seconds.
%

The bisector could helpfully s/bad commit/good commit/ under expect-fail.

Hugo

Bisection confirmed:

#####
# bad
$ git show | head -1
commit b3766b1
$ ./perl -Ilib /home/jkeenan/learn/perl/p5p/72414-script.pl
Malformed UTF-16 surrogate.

# good
$ git show | head -1
commit ba77e4c
$ ./perl -Ilib /home/jkeenan/learn/perl/p5p/72414-script.pl
Hello world at /home/jkeenan/learn/perl/p5p/72414-script.pl line 1.
#####

Hugo, Tux, alh +++ for assistance in bisection.

Marking ticket Resolved.

Thank you very much.
--
James E Keenan (jkeenan@cpan.org)


p5pRT commented Feb 27, 2017

@jkeenan - Status changed from 'open' to 'resolved'
