Report information
Id: 37170
Status: resolved
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors: pflanze <christian [at] pflanze.mine.nu>
Cc:
AdminCc:

Operating System: Linux
PatchStatus: (no value)
Severity: low
Type: core
Perl Version: 5.8.7
Fixed In: (no value)



Subject: Taint mode still breaks utf8 handling
Date: 15 Sep 2005 09:52:36 -0000
To: perlbug [...] perl.org
From: chris [...] elvis-jaeger.mine.nu
This is a bug report for perl from christian.jaeger@ethlife.ethz.ch,
generated with the help of perlbug 1.35 running under perl v5.8.7.

-----------------------------------------------------------------
[Please enter your report here]

I'm in the process of "porting" a perl web app (fastcgi, running with
-T flag) from perl 5.005_03 to current releases.

I first had problems with 5.8.4: when I read in a block of data using
read, about like this:

    use Encode;
    open F,"some/file/containing_utf8_text" or die $!;
    my $buf;
    read F,$buf,10,1000 or die $!;
    my $str= Encode::decode_utf8($buf);

gave a $str which still had the utf8 byte sequences as characters (and

    print "utf8?: ", Encode::is_utf8($str) ? "yes" : "no", "\n";

gave "no" iirc)

(I'm actually using my own wrappers around open and read, so I didn't
test the exact code as above).

I narrowed the problem down to the usage of the -T flag. I found that
any one of the following would make the decoding work correctly:

- switching off tainting mode
- detainting $buf before decoding it, like:
      $buf=~ /(.*)/s or die;
      my $str= Encode::decode_utf8($1);
- upgrading to perl 5.8.7 (5.8.7-3 from Debian testing)

"Fine, it has been fixed" I thought. But now I have realized that
something else still doesn't work under taint mode. Sorry that I'm a
bit vague below, I'm under pressure to finish the project; please
contact me if you need more information. For now I'm simply turning
off taint mode.

(What I'm doing is, I write a list of strings to one file, first
writing the lengths of each, so that I know how to split the file
contents into the strings again when reading back in:

    my $d= [ list of strings or string refs ];
    my $f= ... filehandle to new output file, blessed to a class which has an xprint method.
    my @is_utf8;
    for(@$d) {
        my $rft;
        my $is_utf8;
        #
        if (defined($rft=Scalar::Util::reftype($_)) and $rft eq "SCALAR") {
            $is_utf8= Encode::is_utf8($$_);
            Eile->log("reference ".($is_utf8 ? "is" : "is not")." utf8");
            Encode::_utf8_off($$_) if $is_utf8;
            $f->xprint(pack('l',length($$_)), ($is_utf8 ? "1" : "0") );
        } else {
            $is_utf8= Encode::is_utf8($_);
            Eile->log("string ".($is_utf8 ? "is" : "is not")." utf8");
            Encode::_utf8_off($_) if $is_utf8;
            $f->xprint(pack('l',length($_)), ($is_utf8 ? "1" : "0") );
        }
        push @is_utf8,$is_utf8;
    }
    $f->xprint(pack('l',-1),"|");# "|" is chosen arbitrarily, it's not used anywhere.
    for(@$d) {
        my $is_utf8= shift @is_utf8;
        my $rft;
        if (defined($rft=Scalar::Util::reftype($_)) and $rft eq "SCALAR") {
            $f->xprint($$_);
            Encode::_utf8_on($$_) if $is_utf8;
        } else {
            $f->xprint($_);
            Encode::_utf8_on($_) if $is_utf8;
        }
    }
)

The problem is that sometimes Encode::is_utf8 reports false on a
string, even when I know it must contain unicode characters:

- the file being written to disk *does* contain utf8 sequences.
- the flag being written to disk is false. (Encode::is_utf8 gave false)
- the length being written into the header is too short (which means
  that the length builtin reported the length in unicode code points,
  not bytes -- how can this be if Encode::is_utf8 is false?).

As I said, again switching off taint mode seems to make it work fine.

(The strings being written above were coming from LWP (from HTTP get
requests) -- maybe they were tainted for this reason.)

Thanks for your work,
Christian.

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=low
---
Site configuration information for perl v5.8.7:

Configured by Debian Project at Thu Jun 9 00:28:22 EST 2005.

Summary of my perl5 (revision 5 version 8 subversion 7) configuration:
  Platform:
    osname=linux, osvers=2.4.27-ti1211, archname=i386-linux-thread-multi
    uname='linux kosh 2.4.27-ti1211 #1 sun sep 19 18:17:45 est 2004 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i386-linux -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.7 -Dsitearch=/usr/local/lib/perl/5.8.7 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.7 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='3.3.6 (Debian 1:3.3.6-6)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.3.2.so, so=so, useshrplib=true, libperl=libperl.so.5.8.7
    gnulibc_version='2.3.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:

---
@INC for perl v5.8.7:
    /etc/perl
    /usr/local/lib/perl/5.8.7
    /usr/local/share/perl/5.8.7
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.8
    /usr/share/perl/5.8
    /usr/local/lib/site_perl
    /usr/local/lib/perl/5.8.4
    /usr/local/share/perl/5.8.4
    /usr/local/lib/perl/5.8.3
    /usr/local/share/perl/5.8.3
    .

---
Environment for perl v5.8.7:
    HOME=/home/chris
    LANG=de_CH
    LANGUAGE (unset)
    LC_CTYPE=de_CH
    LC_NUMERIC=C
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/Gambit-C/bin:/opt/j2sdk_nb/j2sdk1.4.2/bin/:/home/chris/local/bin:/home/chris/bin:/root/local/bin:/root/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/bin/X11:/usr/local/sbin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

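The second workaround listed in the report (detainting the buffer before decoding it) could be written roughly as below. This is only a sketch: the helper name decode_untainted is hypothetical, and the value is copied out of the capture into an ordinary lexical first so that Encode never sees the magical $1. A blanket /(.*)/s capture clears the taint flag without validating anything, so it is reasonable only for data that goes on to be decoded, not for data passed to a shell or used in a file name.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of the reported application):
# copy the bytes through a regex capture, which clears the taint flag,
# then decode the plain, untainted copy.
sub decode_untainted {
    my ($buf) = @_;
    my ($clean) = $buf =~ /\A(.*)\z/s
        or die "could not capture buffer";
    return Encode::decode_utf8($clean);
}

my $bytes = "gr\xc3\xbc\xc3\x9fe";      # 7 UTF-8 bytes, 5 characters
my $str   = decode_untainted($bytes);
print "characters: ", length($str), "\n";
```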
CC: Perl 5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #37170] Taint mode still breaks utf8 handling
Date: Thu, 15 Sep 2005 21:29:01 +0900
To: Jaeger Christian <christian.jaeger [...] ethlife.ethz.ch>
From: Dan Kogai <dankogai [...] dan.co.jp>
Christian and Porters,

Thanks for your report.

On Sep 15, 2005, at 18:53, Christian Jaeger (via RT) wrote:
> - the file being written to disk *does* contain utf8 sequences.
> - the flag being written to disk is false. (Encode::is_utf8 gave
>   false)
> - the length being written into the header is too short (which
>   means that the length builtin reported the length in unicode code
>   points, not bytes -- how can this be if Encode::is_utf8 is false?).

I could not duplicate the symptom on perl 5.8.7 but on 5.8.6 I did.

    # use strict;
    use Encode;
    my $fn = 'test.txt';

    sub readwrite{
        my $str = shift;
        open my $fh, ">:utf8", $fn or die "$fn : $!";
        print $fh $str;
        close $fh;
        open my $fh, "<:raw", $fn or die "$fn : $!";
        read $fh, my $buf, -s $fn;
        close $fh;
        unlink $fn;
        return $buf;
    }

    sub checkstr{
        my $str = shift;
        print "Encode::is_utf8(\$str) = ", Encode::is_utf8($str), "\n";
        print "utf8::is_utf8(\$str) = ", utf8::is_utf8($str), "\n";
    }

    my $ascii = join '', map { chr $_ } 0x20..0x7e;     # only ascii
    my $utf8  = join '', map { chr $_ } 0x2020..0x207e; # now Unicode;
    checkstr(decode_utf8(readwrite $ascii));
    checkstr(decode_utf8(readwrite $utf8));
    __END__

You run the code as follows (on my Mac OS X v10.4.2):

> % /usr/bin/perl utf8flag.pl
> Perl Version is 5.008006, Encode Version is 2.08
> Encode::is_utf8($str) =
> utf8::is_utf8($str) =
> Encode::is_utf8($str) = 1
> utf8::is_utf8($str) = 1
> % /usr/bin/perl -T utf8flag.pl
> Perl Version is 5.008006, Encode Version is 2.08
> Encode::is_utf8($str) =
> utf8::is_utf8($str) =
> Encode::is_utf8($str) =
> utf8::is_utf8($str) = 1
> % perl utf8flag.pl
> Perl Version is 5.008007, Encode Version is 2.10
> Encode::is_utf8($str) = 1
> utf8::is_utf8($str) = 1
> Encode::is_utf8($str) = 1
> utf8::is_utf8($str) = 1
> % perl -T utf8flag.pl
> Perl Version is 5.008007, Encode Version is 2.10
> Encode::is_utf8($str) = 1
> utf8::is_utf8($str) = 1
> Encode::is_utf8($str) = 1
> utf8::is_utf8($str) = 1

As you see, on 5.8.6 utf8::is_utf8() works fine while Encode::is_utf8()
does not. Also note that on 5.8.7 the flag is set UNCONDITIONALLY,
whether the string contains U+100 and above or not.

    /* universal.c */
    XS(XS_utf8_is_utf8)
    {
        dXSARGS;
        if (items != 1)
            Perl_croak(aTHX_ "Usage: utf8::is_utf8(sv)");
        {
            SV * sv = ST(0);
            {
                if (SvUTF8(sv))
                    XSRETURN_YES;
                else
                    XSRETURN_NO;
            }
        }
        XSRETURN_EMPTY;
    }
    /* end of code */

    /* ext/Encode/Encode.xs */
    bool
    is_utf8(sv, check = 0)
    SV * sv
    int check
    CODE:
    {
        if (SvGMAGICAL(sv)) /* it could be $1, for example */
            sv = newSVsv(sv); /* GMAGIG will be done */
        if (SvPOK(sv)) {
            RETVAL = SvUTF8(sv) ? TRUE : FALSE;
            if (RETVAL &&
                check &&
                !is_utf8_string((U8*)SvPVX(sv), SvCUR(sv)))
                RETVAL = FALSE;
        } else {
            RETVAL = FALSE;
        }
        if (sv != ST(0))
            SvREFCNT_dec(sv); /* it was a temp copy */
    }
    OUTPUT:
        RETVAL
    /* end of code */

Though not harmful, the behavior of 5.8.7 does not match what the
Encode documentation says. Should I fix the pod accordingly, or did
this just reveal an undocumented bug?

Dan the Encode Maintainer

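When chasing a discrepancy like the one above, it can help to log both flag checks next to both length measures for the same string. The following is a minimal sketch, not code from this ticket: the helper name describe_string is hypothetical, and it relies only on bytes::length, utf8::is_utf8, Encode::is_utf8 and Scalar::Util::tainted. On an affected perl run with -T, a string for which the two is_utf8 columns disagree would show the symptom discussed here.

```perl
use strict;
use warnings;
use bytes ();                   # makes bytes::length available without enabling the pragma
use Encode ();
use Scalar::Util qw(tainted);

# Hypothetical diagnostic helper: one line of flag and length state.
sub describe_string {
    my ($s) = @_;
    return sprintf(
        "chars=%d bytes=%d utf8::is_utf8=%d Encode::is_utf8=%d tainted=%d",
        length($s),                     # length in characters
        bytes::length($s),              # length of the internal byte buffer
        utf8::is_utf8($s)   ? 1 : 0,
        Encode::is_utf8($s) ? 1 : 0,
        tainted($s)         ? 1 : 0,
    );
}

print describe_string("\x{263A}"), "\n";    # one character above U+00FF
print describe_string("abc"), "\n";         # plain ASCII
```

tainted() only ever reports true when perl itself runs in taint mode, so the last column is meaningful only under -T.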
Subject: Re: [perl #37170] Taint mode still breaks utf8 handling
Date: Thu, 15 Sep 2005 14:34:00 +0200
To: perlbug-followup [...] perl.org
From: Christian Jaeger <christian.jaeger [...] ethlife.ethz.ch>
Hello

Thanks for your reply.

At 5:29 -0700 on 15.09.2005, Dan Kogai via RT wrote:
>I could not duplicate the symptom on perl 5.8.7 but on 5.8.6 I did.
>...
>you run the code as follows (on my Mac OS X v10.4.2);

With my perl 5.8.7 I'm getting:

    chris@elvis-5 chris > perl ./bugreport-test1
    Encode::is_utf8($str) = 1
    utf8::is_utf8($str) = 1
    Encode::is_utf8($str) = 1
    utf8::is_utf8($str) = 1
    chris@elvis-5 chris > perl -T ./bugreport-test1
    Encode::is_utf8($str) = 1
    utf8::is_utf8($str) = 1
    Encode::is_utf8($str) = 1
    utf8::is_utf8($str) = 1

(thus the same as you with that version)

>As you see, on 5.8.6 utf8::is_utf8() works fine while Encode::is_utf8
>() does not.

Interesting, I will try my app with -T again with utf8::is_utf8.

> Also note on 5.8.7 the flag is set UNCONDITIONALLY,
>whether the string contains U+100 and above or not.

Yes, but that's fine for me.

Your test case can't explain the second problem I mentioned: I somehow
had a case where, before writing to the file, a string (originating
from LWP) gave false from Encode::is_utf8 but still gave a shorter
length() than the byte length in the file it was then written to (so I
would have guessed that the utf8 flag was on).

One thing to note: I'm not opening files with ">:utf8" or "<:raw", but
with

    sysopen($fh,$path, O_EXCL|O_CREAT|O_RDWR, $mode)

for writing, and

    open $fh,"<",$path

for reading. That's the reason why I'm manually toggling off the utf8
flag of the strings which have it (as shown in the code I pasted in my
bug report) for the duration of the write, and using decode_utf8
later. I think I can't use ">:utf8", because not all strings I write
have the utf8 flag on (some of the strings are binary data).

Christian.

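One way to mix text and binary strings in a single file without toggling the flag by hand is to encode explicitly and take the length of the encoded bytes. The sketch below assumes a simplified layout in which each record carries its own header (4-byte length from pack('l'), then a "1"/"0" marker, then the payload), rather than the report's layout of all headers first and all payloads after. The names pack_record and unpack_record are hypothetical, and utf8::is_utf8 is used for the check because it was the one that behaved correctly in the tests above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: build one length-prefixed record as bytes.
sub pack_record {
    my ($s) = @_;
    my $is_text = utf8::is_utf8($s) ? 1 : 0;
    # Character strings are encoded to UTF-8 bytes; byte strings pass through.
    my $bytes = $is_text ? Encode::encode_utf8($s) : $s;
    return pack('l', length $bytes) . $is_text . $bytes;
}

# Hypothetical helper: recover the original string from one record.
sub unpack_record {
    my ($rec) = @_;
    my ($len, $flag) = unpack('l a1', $rec);    # 4-byte length, 1-byte marker
    my $bytes = substr($rec, 5, $len);          # payload follows the 5-byte header
    return $flag ? Encode::decode_utf8($bytes) : $bytes;
}

my $text = "\x{263A}abc";                       # character string, flag on
my $rec  = pack_record($text);
print "record is ", length($rec), " bytes\n";
print "round trip ok\n" if unpack_record($rec) eq $text;
```

Because the length is taken after encoding, the header always counts bytes, whatever state the flag of the original string is in.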

This looks like the same bug as reported in:

    #32687: Encode::is_utf8 on tainted UTF8 string returns false

...still unresolved in 5.8.8.

Mark

