Skip Menu |
Report information
Id: 24846
Status: resolved
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors: jesse [at] bestpractical.com
Cc:
AdminCc:

Operating System: All
PatchStatus: (no value)
Severity: medium
Type: core
Perl Version:
  • 5.8.0
  • 5.8.1
  • 5.8.2
Fixed In: (no value)



Subject: [PATCH] Apparent utf8 bug in join() in 5.8.[012]
Date: Wed, 7 Jan 2004 23:13:03 -0500
To: perlbug [...] bugs.perl.org
From: Jesse Vincent <jesse [...] bestpractical.com>
Download (untitled) / with headers
text/plain 7.6k
Yeah, I'm still not quite sure I believe it myself, but IO::Scalar exercises join with UTF8 and non-UTF8 data causing RT to end up with corrupted attachments fairly often. After patching IO::Scalar to work around this by emulating join using concatenation, the issue disappears. ---- From: Nicholas Adrian Vinen <hb@pandora.x256.com> Hello, I am a consultant for a company which uses RT for their internal support. They asked me to fix a problem they were having where attaching binary files to a ticket caused the file to become corrupt sometimes. They tracked it down to the case where the mod_perl session which serves the request to add the attachment to the ticket has previously been used to perform some ticket-related operation. I finally tracked down this problem to a bug in perl. Here is a detailed description of the problem: When you attach a file to a ticket using RT it saves the file you attach into a file into /tmp. It then adds a MIME::Body::File record to the MIME::Entity which represents the ticket. Later, it calls make_singlepart() on the MIME::Entity, which converts the entity into a string. During this process, it calls as_string() on the MIME::Body::File. This causes the file to be read in and printed into a string using the IO::Scalar object. IO::Scalar's print() function calls the function join() on the data as it is read in, before that data is appended onto the destination string. The problem occurs inside join(). join() recycles string objects into which it does the joining, which it later returns. It never touches the UTF8 flag on these strings. So, on the initial run, it has no strings to recycle (or few), and when they are created they are set to ASCII. So all the results of join() are ASCII, which is what MIME and RT wants, as ASCII is also what is used for processing binary data. The problem is, on the second and subsequent executions of RT within the perl system, the recycled strings often have the UTF8 flag set. So, join ('', $string), where $string is ASCII, will often return a UTF8 string. When this UTF8 string is later converted into ASCII it is modified, and so the binary data is corrupted. The solution is to apply the following patch to perl (tested with perl 5.8.2), which sets the UTF8 flag on the returned string to something sensible. diff -u perl-5.8.2/doop.c perl-5.8.2-patched/doop.c --- perl-5.8.2/doop.c 2003-09-30 10:09:51.000000000 -0700 +++ perl-5.8.2-patched/doop.c 2004-01-05 23:23:13.000000000 -0800 @@ -647,6 +647,9 @@ register STRLEN len; STRLEN delimlen; STRLEN tmplen; + int utf8; + + utf8 = (SvUTF8(del)!=0); (void) SvPV(del, delimlen); /* stringify and get the delimlen */ /* SvCUR assumes it's SvPOK() and woe betide you if it's not. */ @@ -674,22 +677,37 @@ SvTAINTED_off(sv); if (items-- > 0) { - if (*mark) + if (*mark) { + utf8 += (SvUTF8(*mark)!=0); sv_catsv(sv, *mark); + } mark++; } if (delimlen) { for (; items > 0; items--,mark++) { sv_catsv(sv,del); + utf8 += (SvUTF8(*mark)!=0); sv_catsv(sv,*mark); } } else { - for (; items > 0; items--,mark++) + for (; items > 0; items--,mark++) { + utf8 += (SvUTF8(*mark)!=0); sv_catsv(sv,*mark); + } } SvSETMAGIC(sv); + if( utf8 ) + { + if( utf8 != sp-oldmark+1 && ckWARN_d(WARN_UTF8) ) + { + Perl_warner(aTHX_ packWARN(WARN_UTF8), "Joining UTF8 and ASCII strings"); + } + SvUTF8_on(sv); + } else { + SvUTF8_off(sv); + } } void There may be other perl functions with similar problems; this is beyond the scope of my job, however I hope that the maintainers of perl will be proactive in attempting to find and fix any similar problems, as the way they have added UTF8 support to perl doesn't make it obvious when such bugs exist. I'd say that any built-in function that returns a string should be checked for (a) setting the UTF8 flag at all and (b) whether the value it sets it to is sensible. Also I think warnings when mixed types of strings are passed into functions are sensible as this can be dangerous, and as we don't know what character set the ASCII strings are in, the routines themselves can't really handle this case properly if any extended characters are present. I hope this helps. Nicholas On Tue, Jan 06, 2004 at 01:46:22PM -0500, Jesse Vincent wrote: Show quoted text
> Hey Nicholas, > > Thanks very much for the patch. I've forwarded it on to some > perl 5 porters who pushed back a bit. They've asked for a clear > statement of _exactly_ what's being recycled, along with a simple > testcase for reproducing the bug. I'd be happy to contribute myself, > but I'm really not a C person :/
I wish I knew! I had a really hard time reading the perl code. Here is what I could tell: somehow calling join() in perl causes Perl_do_join to be called with 3 main arguments. One is the delimiter, on is the array of strings to be joined, and one is a string into which to put the result. I don't know where exactly it gets the string which is the destination - the code calls a macro which I chased several levels deep. This is the function which does it, as far as I can tell: PP(pp_join) { dSP; dMARK; dTARGET; MARK++; do_join(TARG, *MARK, MARK, SP); SP = MARK; SETs(TARG); RETURN; } PP() looks like this: #define PP OP * Perl_##s(pTHX) pTHX looks like this: #define pTHX register struct perl_thread *thr PERL_UNUSED_DECL TARG is the string which is being joined into. dTARGET is a macro which looks like this: #define dTARGET SV * GETTARGET GETTARGET looks like this: #define GETTARGET targ = PAD_SV(PL_op->op_targ) PAD_SV looks like this: #define PAD_SV(po) (PL_curpad[po]) PL_curpad looks like this: #define PL_curpad (*Perl_Tcurpad_ptr(aTHX)) aTHX is defined like this: #define aTHX thr I think you can see why I wasn't very specific :( it's a mess... and here is where I lose the trail because I can't find Perl_Tcurpad_ptr defined anywere. However, here is what I can tell. The target comes out of the some 'curpad' member of 'thr' which is the current perl thread context. I *think* curpad is like a stack and the returned value from join goes onto the end of the stack. I think it is these values which are being recycled. Certainly the target has to come from somewhere, and it doesn't look like it is being allocated, it looks to me like it's being taken from an array. op_targ seems to be which element of the array that the result should go into but I'm not sure what defines this. I'd have to dig a lot more to find out and I'm already quite lost. The simple test case is this: install RT3 on a server with a single process with MaxRequestsPerChildren>1, using mod_perl, and attach a binary file to a ticket twice in a row :) That's the simplest test case I have. I've tried to reproduce it in small scripts but I can't, and now that I've fixed it on the servers that I have access to, I don't really want to break it again just to write a test case. I think, however, the fact that the join function never sets or unsets the utf8 flag on its target string means that it can't be operating 100% correctly. Anyway, this is as much information as I can give you as I need to get on with the next project now. Nicholas P.S. I tried to cross-post the email I sent onto the rt-devel and rt-users mailing lists but was rejected because I'm not subscribed to them. Perhaps you can do that for me? -- http://www.bestpractical.com/rt -- Trouble Ticketing. Free.
CC: bugs-bitbucket [...] netlabs.develooper.com
Subject: Re: [perl #24846] [PATCH] Apparent utf8 bug in join() in 5.8.[012]
Date: Fri, 09 Jan 2004 14:22:03 +0000
To: perl5-porters [...] perl.org
From: Nick Ing-Simmons <nick.ing-simmons [...] elixent.com>
Download (untitled) / with headers
text/plain 553b
Jesse Vincent <perl5-porters@perl.org> writes: Show quoted text
># New Ticket Created by Jesse Vincent ># Please include the string: [perl #24846] ># in the subject line of all future correspondence about this issue. ># <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=24846 > > > > >Yeah, I'm still not quite sure I believe it myself, but IO::Scalar >exercises join with UTF8 and non-UTF8 data causing RT to end up with >corrupted attachments fairly often.
IO::Scalar is (or should be) largely redundant in perl5.8.* as you can open(my $fh,"+<",\$scalar);
CC: perl5-porters [...] perl.org, bugs-bitbucket [...] netlabs.develooper.com
Subject: Re: [perl #24846] [PATCH] Apparent utf8 bug in join() in 5.8.[012]
Date: Fri, 9 Jan 2004 15:18:13 +0000
To: Nick Ing-Simmons <nick.ing-simmons [...] elixent.com>
From: Tim Bunce <Tim.Bunce [...] pobox.com>
Download (untitled) / with headers
text/plain 714b
On Fri, Jan 09, 2004 at 02:22:03PM +0000, Nick Ing-Simmons wrote: Show quoted text
> Jesse Vincent <perl5-porters@perl.org> writes:
> ># New Ticket Created by Jesse Vincent > ># Please include the string: [perl #24846] > ># in the subject line of all future correspondence about this issue. > ># <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=24846 >
>
> >Yeah, I'm still not quite sure I believe it myself, but IO::Scalar > >exercises join with UTF8 and non-UTF8 data causing RT to end up with > >corrupted attachments fairly often.
> > IO::Scalar is (or should be) largely redundant in perl5.8.* > as you can > > open(my $fh,"+<",\$scalar);
IO::Scalar maybe largely redundant in perl5.8.*, but join() isn't. Tim.
CC: bugs-bitbucket [...] netlabs.develooper.com
Subject: Re: [perl #24846] [PATCH] Apparent utf8 bug in join() in 5.8.[012]
Date: 11 Jan 2004 22:37:42 +0100
To: perl5-porters [...] perl.org
From: Slaven Rezic <slaven [...] rezic.de>
Download (untitled) / with headers
text/plain 996b
Jesse Vincent (via RT) <perlbug-followup@perl.org> writes: Show quoted text
> # New Ticket Created by Jesse Vincent > # Please include the string: [perl #24846] > # in the subject line of all future correspondence about this issue. > # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=24846 > > > > > Yeah, I'm still not quite sure I believe it myself, but IO::Scalar > exercises join with UTF8 and non-UTF8 data causing RT to end up with > corrupted attachments fairly often. After patching IO::Scalar to work > around this by emulating join using concatenation, the issue disappears. > >
Here's a test case: use strict; use Encode qw(is_utf8); use Test::More qw(no_plan); my $ascii = "abc\304"; my $utf8 = "abc\x{0100}"; for ($utf8, $ascii) { my $res = join("", $_); is(is_utf8($res), $_ eq $utf8); } __END__ Regards, Slaven -- Slaven Rezic - slaven@rezic.de tksm - Perl/Tk program for searching and replacing in multiple files http://ptktools.sourceforge.net/#tksm
Subject: Re: [perl #24846] [PATCH] Apparent utf8 bug in join() in 5.8.[012]
Date: Mon, 12 Jan 2004 11:19:37 +0900
To: perl5-porters [...] perl.org
From: SADAHIRO Tomoyuki <bqw10602 [...] nifty.com>
Download (untitled) / with headers
text/plain 2.6k
Show quoted text
> The problem occurs inside join(). join() recycles string objects > into which it does the joining, which it later returns. It never > touches the UTF8 flag on these strings. So, on the initial run, it has > no strings to recycle (or few), and when they are created they are set > to ASCII. So all the results of join() are ASCII, which is what MIME and > RT wants, as ASCII is also what is used for processing binary data. The > problem is, on the second and subsequent executions of RT within the perl > system, the recycled strings often have the UTF8 flag set. So, join ('', > $string), where $string is ASCII, will often return a UTF8 string. When > this UTF8 string is later converted into ASCII it is modified, and so > the binary data is corrupted. > > The solution is to apply the following patch to perl (tested with > perl 5.8.2), which sets the UTF8 flag on the returned string to > something sensible.
This is parhaps due to SvPOK_only_UTF8() in sv_setpv() which leaves UTF8 flag as it was. I disagree warning when UTF8 and ASCII are mixed. I think it would upset encoding.pm which allows byte strings as in arbitrary encoding other than the system-native encoding (ASCII/Latin1 or EBCDIC). ### \A patch against perl-5.8.3 RC1 diff -urN perl~/doop.c perl/doop.c --- perl~/doop.c Fri Dec 19 05:47:58 2003 +++ perl/doop.c Mon Jan 12 10:08:10 2004 @@ -668,6 +668,10 @@ } sv_setpv(sv, ""); + /* sv_setpv retains old UTF8ness [perl #24846] */ + if (SvUTF8(sv)) + SvUTF8_off(sv); + if (PL_tainting && SvMAGICAL(sv)) SvTAINTED_off(sv); diff -urN perl~/t/op/join.t perl/t/op/join.t --- perl~/t/op/join.t Sat Dec 30 16:16:18 2000 +++ perl/t/op/join.t Mon Jan 12 10:34:22 2004 @@ -1,6 +1,6 @@ #!./perl -print "1..14\n"; +print "1..18\n"; @x = (1, 2, 3); if (join(':',@x) eq '1:2:3') {print "ok 1\n";} else {print "not ok 1\n";} @@ -65,3 +65,29 @@ print "ok 14\n"; } +{ # [perl #24846] $jb2 should be in bytes, not in utf8. + my $b = "abc\304"; + my $u = "abc\x{0100}"; + + sub join_into_my_variable { + my $r = join("", @_); + return $r; + } + + my $jb1 = join_into_my_variable("", $b); + my $ju1 = join_into_my_variable("", $u); + my $jb2 = join_into_my_variable("", $b); + my $ju2 = join_into_my_variable("", $u); + + print "not " unless unpack('H*', $jb1) eq unpack('H*', $b); + print "ok 15\n"; + + print "not " unless unpack('H*', $ju1) eq unpack('H*', $u); + print "ok 16\n"; + + print "not " unless unpack('H*', $jb2) eq unpack('H*', $b); + print "ok 17\n"; + + print "not " unless unpack('H*', $ju2) eq unpack('H*', $u); + print "ok 18\n"; +} ### \z patch Regards SADAHIRO Tomoyuki
Subject: Re: [perl #24846] [PATCH] Apparent utf8 bug in join() in 5.8.[012]
Date: Mon, 12 Jan 2004 11:27:53 +0100
To: perl5-porters [...] perl.org
From: Rafael Garcia-Suarez <rgarciasuarez [...] free.fr>
Download (untitled) / with headers
text/plain 451b
SADAHIRO Tomoyuki wrote: Show quoted text
> I disagree warning when UTF8 and ASCII are mixed.
So do I. Show quoted text
> I think it would upset encoding.pm > which allows byte strings as in arbitrary encoding > other than the system-native encoding (ASCII/Latin1 or EBCDIC). > > ### \A patch against perl-5.8.3 RC1 > diff -urN perl~/doop.c perl/doop.c > --- perl~/doop.c Fri Dec 19 05:47:58 2003 > +++ perl/doop.c Mon Jan 12 10:08:10 2004
Thanks, applied to bleadperl as #22117.


This service is sponsored and maintained by Best Practical Solutions and runs on Perl.org infrastructure.

For issues related to this RT instance (aka "perlbug"), please contact perlbug-admin at perl.org