$PerlIO::encoding::fallback corrupts UTF-8 output #10131

Open
p5pRT opened this issue Feb 4, 2010 · 4 comments

p5pRT commented Feb 4, 2010

Migrated from rt.perl.org#72534 (status was 'open')

Searchable as RT72534$

p5pRT commented Feb 4, 2010

From loic.etienne@tech.swisssign.com

To: perlbug@perl.org
Subject: $PerlIO::encoding::fallback corrupts UTF-8 output
Reply-To: loic.etienne@tech.swisssign.com
Message-Id: <5.10.0_4674_1265291836@dev0003.int.swisssign.net>

This is a bug report for perl from loic.etienne@tech.swisssign.com,
generated with the help of perlbug 1.36 running under perl 5.10.0.


Setting
  $PerlIO::encoding::fallback = 0x0400;
before
  binmode(STDOUT, ':encoding(UTF-8)');
may corrupt the UTF-8 output of print STDOUT
when a multi-byte UTF-8 character straddles two output buffers.
Each part of the split character is then output as an XML entity,
although the byte sequence itself is valid UTF-8.

Example: &#xc3;&#x84; instead of the corresponding bytes.

IMHO, the encoding fallback should apply only to input, not to output,
since perl itself generates the bytes to be output.
A corrupted UTF-8 sequence can occur only if perl's internal string handling
is buggy (very unlikely).

Code to reproduce the bug (assuming that the output buffer size is 1024 bytes):

use strict;
use warnings;

use PerlIO::encoding;

#
# U+00C4 Ä LATIN CAPITAL LETTER A WITH DIAERESIS
# 2-byte UTF-8 sequence 0xC3 0x84
#
my $two_bytes_in_utf8 = chr(0xC4);

#
# The following $string is constructed in such a way that
# the last UTF-8 character of $string straddles the output buffer boundary:
# ... <0xC3 0x84> 0xC3 |buffer boundary| 0x84 <0xC3 0x84> ...
#     <utf8-char> oops                   oops <utf8-char>
#
# Note that $string itself is internally represented in ISO-8859-1
# but converted to UTF-8 by the output layer :encoding(UTF-8).
#
# The output buffer is assumed to hold 1024 bytes, thus 'x 512'.
# Use a value higher than 'x 512' on systems with a bigger output buffer.
#
my $string = 'a' . $two_bytes_in_utf8 x 512;

$PerlIO::encoding::fallback = 0x0400; # xml entities
binmode(STDOUT, ':encoding(UTF-8)');

# wrong output (&#xc3;&#x84; at the end)
print STDOUT "$string\n";

# correct output
syswrite(STDOUT, "$string\n");

# Note that if
# binmode(STDOUT, ':encoding(UTF-8)');
# occurred before
# $PerlIO::encoding::fallback = 0x0400;
# then the output of both print and syswrite would be correct.
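
The arithmetic behind 'x 512' can be checked independently of any I/O layer. The following is a minimal sketch, not part of the original report: it encodes the same string with Encode and looks at the bytes on either side of an assumed 1024-byte boundary. The variable names and the 1024-byte figure are assumptions taken from the report's comments; the real PerlIO buffer size depends on the perl build.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumed buffer size, as stated in the report; adjust for other builds.
my $buffer_size = 1024;

# Same construction as in the report: one ASCII byte, then 512 copies of
# U+00C4, each of which becomes the two bytes 0xC3 0x84 in UTF-8.
my $string = 'a' . chr(0xC4) x 512;
my $bytes  = encode('UTF-8', $string);

# The leading 'a' shifts every 2-byte character onto an odd offset, so the
# boundary at offset 1024 falls between the lead byte and its continuation.
my $before = ord substr($bytes, $buffer_size - 1, 1);
my $after  = ord substr($bytes, $buffer_size, 1);

printf "encoded length:            %d bytes\n", length $bytes;   # 1025
printf "last byte before boundary: 0x%02X\n", $before;           # 0xC3
printf "first byte after boundary: 0x%02X\n", $after;            # 0x84
```

For a different buffer size N (assumed even), the same reasoning gives a repeat count of N / 2.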



Flags:
  category=library
  severity=medium


Site configuration information for perl 5.10.0:

Configured by Debian Project at Thu Oct 1 22:36:47 UTC 2009.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
  osname=linux, osvers=2.6.24-23-server,
archname=x86_64-linux-gnu-thread-multi
  uname='linux crested 2.6.24-23-server #1 smp wed apr 1 22:14:30 utc 2009
x86_64 gnulinux '
  config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN
-Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr
-Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10
-Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5
-Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.0
-Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1
-Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1
-Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl
-Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm
-DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.0
-Dd_dosuid -des'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=define, usemultiplicity=define
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=define, use64bitall=define, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler:
  cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN
-fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64',
  optimize='-O2 -g',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe
-I/usr/local/include'
  ccversion='', gccversion='4.4.1', gccosandvers=''
  intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
  ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries:
  ld='cc', ldflags =' -L/usr/local/lib'
  libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
  libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
  perllibs=-ldl -lm -lpthread -lc -lcrypt
  libc=/lib/libc-2.10.1.so, so=so, useshrplib=true,
libperl=libperl.so.5.10.0
  gnulibc_version='2.10.1'
  Dynamic Linking:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'

Locally applied patches:


@INC for perl 5.10.0:
  /etc/perl
  /usr/local/lib/perl/5.10.0
  /usr/local/share/perl/5.10.0
  /usr/lib/perl5
  /usr/share/perl5
  /usr/lib/perl/5.10
  /usr/share/perl/5.10
  /usr/local/lib/site_perl
  .


Environment for perl 5.10.0:
  HOME=/home/etiennel
  LANG=en_US.UTF-8
  LANGUAGE=en_US:en
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/home/etiennel/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/home/etiennel/bin/
  PERL_BADLANG (unset)
  SHELL=/bin/bash

p5pRT commented May 26, 2013

From @jkeenan

This problem persists in Perl 5.18.0. Can someone familiar with PerlIO,
etc. take a crack at this?

Thank you very much.
Jim Keenan

p5pRT commented May 26, 2013

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented May 27, 2013

From @Leont

Between PerlIO::encoding and Encode, there's not nearly enough
communication about partial characters. The default fallback value
appears to work; anything else is tricky.

Leon
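
The reporter's closing remark and the comment above both point toward leaving the default fallback in place while the layer is pushed. A minimal sketch of that ordering follows. It assumes an in-memory filehandle in place of STDOUT so the result can be compared byte for byte. The handle, the repeat count of 20_000, and the variable names are illustrative choices, not taken from the thread.

```perl
use strict;
use warnings;
use Encode qw(encode);
use PerlIO::encoding;

# The leading 'a' puts every following 2-byte character on an odd offset,
# so any even-sized buffer boundary falls in the middle of a character.
my $string = 'a' . chr(0xC4) x 20_000;

# Push the :encoding layer first (here at open time), while the fallback
# still has its default value ...
open(my $out, '>:encoding(UTF-8)', \my $captured)
    or die "cannot open in-memory handle: $!";

# ... and only then change the fallback, mirroring the ordering that the
# report describes as producing correct output. 'local' restores the
# previous value when the block exits.
{
    local $PerlIO::encoding::fallback = 0x0400;   # xml entities, as in the report
    print {$out} $string;
    close($out) or die "close failed: $!";
}

# Compare against a direct encode of the same string.
my $expected = encode('UTF-8', $string);
print $captured eq $expected ? "output intact\n" : "output corrupted\n";
```

This only exercises the ordering reported to work; it is not a fix for the underlying problem.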
