Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PATCH] Apparent utf8 bug in join() in 5.8.[012] #7023

Closed
p5pRT opened this issue Jan 8, 2004 · 8 comments
Closed

[PATCH] Apparent utf8 bug in join() in 5.8.[012] #7023

p5pRT opened this issue Jan 8, 2004 · 8 comments

Comments

@p5pRT
Copy link

p5pRT commented Jan 8, 2004

Migrated from rt.perl.org#24846 (status was 'resolved')

Searchable as RT24846$

@p5pRT
Copy link
Author

p5pRT commented Jan 8, 2004

From @obra

Yeah, I'm still not quite sure I believe it myself, but IO​::Scalar
exercises join with UTF8 and non-UTF8 data causing RT to end up with
corrupted attachments fairly often. After patching IO​::Scalar to work
around this by emulating join using concatenation, the issue disappears.


From​: Nicholas Adrian Vinen <hb@​pandora.x256.com>

  Hello,
  I am a consultant for a company which uses RT for their internal
  support. They asked me to fix a problem they were having where
attaching binary files to a ticket caused the file to become corrupt
sometimes. They tracked it down to the case where the mod_perl
session which serves the request to add the attachment to the ticket
has previously been used to perform some ticket-related operation. I
finally tracked down this problem to a bug in perl. Here is a detailed
description of the problem​:

  When you attach a file to a ticket using RT it saves the file you
  attach into a file into /tmp. It then adds a MIME​::Body​::File
record to the MIME​::Entity which represents the ticket. Later,
it calls make_singlepart() on the MIME​::Entity, which converts the
entity into a string. During this process, it calls as_string() on the
MIME​::Body​::File. This causes the file to be read in and printed into a
string using the IO​::Scalar object. IO​::Scalar's print() function calls
the function join() on the data as it is read in, before that data is
appended onto the destination string.

  The problem occurs inside join(). join() recycles string objects
  into which it does the joining, which it later returns. It never
touches the UTF8 flag on these strings. So, on the initial run, it has
no strings to recycle (or few), and when they are created they are set
to ASCII. So all the results of join() are ASCII, which is what MIME and
RT wants, as ASCII is also what is used for processing binary data. The
problem is, on the second and subsequent executions of RT within the perl
system, the recycled strings often have the UTF8 flag set. So, join ('',
$string), where $string is ASCII, will often return a UTF8 string. When
this UTF8 string is later converted into ASCII it is modified, and so
the binary data is corrupted.

  The solution is to apply the following patch to perl (tested with
  perl 5.8.2), which sets the UTF8 flag on the returned string to
something sensible.

Inline Patch
diff -u perl-5.8.2/doop.c perl-5.8.2-patched/doop.c
--- perl-5.8.2/doop.c   2003-09-30 10:09:51.000000000 -0700
+++ perl-5.8.2-patched/doop.c   2004-01-05 23:23:13.000000000 -0800
@@ -647,6 +647,9 @@
     register STRLEN len;
     STRLEN delimlen;
     STRLEN tmplen;
+    int utf8;
+
+    utf8 = (SvUTF8(del)!=0);
 
     (void) SvPV(del, delimlen); /* stringify and get the delimlen */
     /* SvCUR assumes it's SvPOK() and woe betide you if it's not. */
@@ -674,22 +677,37 @@
        SvTAINTED_off(sv);
 
     if (items-- > 0) {
-       if (*mark)
+       if (*mark) {
+           utf8 += (SvUTF8(*mark)!=0);
            sv_catsv(sv, *mark);
+       }
        mark++;
     }
 
     if (delimlen) {
        for (; items > 0; items--,mark++) {
            sv_catsv(sv,del);
+           utf8 += (SvUTF8(*mark)!=0);
            sv_catsv(sv,*mark);
        }
     }
     else {
-       for (; items > 0; items--,mark++)
+       for (; items > 0; items--,mark++) {
+           utf8 += (SvUTF8(*mark)!=0);
            sv_catsv(sv,*mark);
+       }
     }
     SvSETMAGIC(sv);
+    if( utf8 )
+    {
+        if( utf8 != sp-oldmark+1 && ckWARN_d(WARN_UTF8) )
+       {
+           Perl_warner(aTHX_ packWARN(WARN_UTF8), "Joining UTF8 and ASCII strings");
+       }
+        SvUTF8_on(sv);
+    } else {
+        SvUTF8_off(sv);
+    }
 }
 
 void

      There may be other perl functions with similar problems; this is
      beyond the scope of my job, however I hope that the maintainers of
perl will be proactive in attempting to find and fix any similar problems, as the way they have added UTF8 support to perl doesn't make it obvious when such bugs exist\. I'd say that any built\-in function that returns a string should be checked for \(a\) setting the UTF8 flag at all and \(b\) whether the value it sets it to is sensible\. Also I think warnings when mixed types of strings are passed into functions are sensible as this can be dangerous\, and as we don't know what character set the ASCII strings are in\, the routines themselves can't really handle this case properly if any extended characters are present\.

  I hope this helps.

  Nicholas

On Tue, Jan 06, 2004 at 01​:46​:22PM -0500, Jesse Vincent wrote​:

Hey Nicholas,

Thanks very much for the patch\. I've forwarded it on to some

perl 5 porters who pushed back a bit. They've asked for a clear
statement of _exactly_ what's being recycled, along with a simple
testcase for reproducing the bug. I'd be happy to contribute myself,
but I'm really not a C person :/

  I wish I knew! I had a really hard time reading the perl code. Here
is what I could tell​: somehow calling join() in perl causes Perl_do_join
to be called with 3 main arguments. One is the delimiter, on is the
array of strings to be joined, and one is a string into which to put the
result. I don't know where exactly it gets the string which is the
destination - the code calls a macro which I chased several levels deep.
This is the function which does it, as far as I can tell​:

PP(pp_join)
{
  dSP; dMARK; dTARGET;
  MARK++;
  do_join(TARG, *MARK, MARK, SP);
  SP = MARK;
  SETs(TARG);
  RETURN;
}

  PP() looks like this​:

#define PP OP * Perl_##s(pTHX)

  pTHX looks like this​:

#define pTHX register struct perl_thread *thr PERL_UNUSED_DECL

  TARG is the string which is being joined into. dTARGET is a macro
which looks like this​:

#define dTARGET SV * GETTARGET

  GETTARGET looks like this​:

#define GETTARGET targ = PAD_SV(PL_op->op_targ)

  PAD_SV looks like this​:

#define PAD_SV(po) (PL_curpad[po])

  PL_curpad looks like this​:

#define PL_curpad (*Perl_Tcurpad_ptr(aTHX))

  aTHX is defined like this​:

#define aTHX thr

  I think you can see why I wasn't very specific :( it's a mess...
and here is where I lose the trail because I can't find
Perl_Tcurpad_ptr defined anywere. However, here is what I can tell. The
target comes out of the some 'curpad' member of 'thr' which is the
current perl thread context. I *think* curpad is like a stack and the
returned value from join goes onto the end of the stack. I think it is
these values which are being recycled. Certainly the target has to come
from somewhere, and it doesn't look like it is being allocated, it looks
to me like it's being taken from an array. op_targ seems to be which
element of the array that the result should go into but I'm not sure
what defines this. I'd have to dig a lot more to find out and I'm
already quite lost.

  The simple test case is this​: install RT3 on a server with a single
process with MaxRequestsPerChildren>1, using mod_perl, and attach a
binary file to a ticket twice in a row :) That's the simplest test
case I have. I've tried to reproduce it in small scripts but I can't,
and now that I've fixed it on the servers that I have access to, I don't
really want to break it again just to write a test case. I think,
however, the fact that the join function never sets or unsets the utf8
flag on its target string means that it can't be operating 100%
correctly. Anyway, this is as much information as I can give you as I
need to get on with the next project now.

  Nicholas

P.S. I tried to cross-post the email I sent onto the rt-devel and
rt-users mailing lists but was rejected because I'm not subscribed to
them. Perhaps you can do that for me?

--
http​://www.bestpractical.com/rt -- Trouble Ticketing. Free.

@p5pRT
Copy link
Author

p5pRT commented Jan 9, 2004

From nick.ing-simmons@elixent.com

Jesse Vincent <perl5-porters@​perl.org> writes​:

# New Ticket Created by Jesse Vincent
# Please include the string​: [perl #24846]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=24846 >

Yeah, I'm still not quite sure I believe it myself, but IO​::Scalar
exercises join with UTF8 and non-UTF8 data causing RT to end up with
corrupted attachments fairly often.

IO​::Scalar is (or should be) largely redundant in perl5.8.*
as you can

open(my $fh,"+<",\$scalar);

@p5pRT
Copy link
Author

p5pRT commented Jan 9, 2004

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Jan 9, 2004

From @timbunce

On Fri, Jan 09, 2004 at 02​:22​:03PM +0000, Nick Ing-Simmons wrote​:

Jesse Vincent <perl5-porters@​perl.org> writes​:

# New Ticket Created by Jesse Vincent
# Please include the string​: [perl #24846]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=24846 >

Yeah, I'm still not quite sure I believe it myself, but IO​::Scalar
exercises join with UTF8 and non-UTF8 data causing RT to end up with
corrupted attachments fairly often.

IO​::Scalar is (or should be) largely redundant in perl5.8.*
as you can

open(my $fh,"+<",\$scalar);

IO​::Scalar maybe largely redundant in perl5.8.*, but join() isn't.

Tim.

@p5pRT
Copy link
Author

p5pRT commented Jan 11, 2004

From @eserte

Jesse Vincent (via RT) <perlbug-followup@​perl.org> writes​:

# New Ticket Created by Jesse Vincent
# Please include the string​: [perl #24846]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=24846 >

Yeah, I'm still not quite sure I believe it myself, but IO​::Scalar
exercises join with UTF8 and non-UTF8 data causing RT to end up with
corrupted attachments fairly often. After patching IO​::Scalar to work
around this by emulating join using concatenation, the issue disappears.

Here's a test case​:

use strict;
use Encode qw(is_utf8);
use Test​::More qw(no_plan);
my $ascii = "abc\304";
my $utf8 = "abc\x{0100}";
for ($utf8, $ascii) {
  my $res = join("", $_);
  is(is_utf8($res), $_ eq $utf8);
}
__END__

Regards,
  Slaven

--
Slaven Rezic - slaven@​rezic.de

  tksm - Perl/Tk program for searching and replacing in multiple files
  http​://ptktools.sourceforge.net/#tksm

@p5pRT
Copy link
Author

p5pRT commented Jan 12, 2004

From BQW10602@nifty.com

  The problem occurs inside join\(\)\. join\(\) recycles string objects
  into which it does the joining\, which it later returns\. It never

touches the UTF8 flag on these strings. So, on the initial run, it has
no strings to recycle (or few), and when they are created they are set
to ASCII. So all the results of join() are ASCII, which is what MIME and
RT wants, as ASCII is also what is used for processing binary data. The
problem is, on the second and subsequent executions of RT within the perl
system, the recycled strings often have the UTF8 flag set. So, join ('',
$string), where $string is ASCII, will often return a UTF8 string. When
this UTF8 string is later converted into ASCII it is modified, and so
the binary data is corrupted.

  The solution is to apply the following patch to perl \(tested with
  perl 5\.8\.2\)\, which sets the UTF8 flag on the returned string to

something sensible.

This is parhaps due to SvPOK_only_UTF8() in sv_setpv()
which leaves UTF8 flag as it was.

I disagree warning when UTF8 and ASCII are mixed.
I think it would upset encoding.pm
which allows byte strings as in arbitrary encoding
other than the system-native encoding (ASCII/Latin1 or EBCDIC).

### \A patch against perl-5.8.3 RC1

Inline Patch
diff -urN perl~/doop.c perl/doop.c
--- perl~/doop.c	Fri Dec 19 05:47:58 2003
+++ perl/doop.c	Mon Jan 12 10:08:10 2004
@@ -668,6 +668,10 @@
     }
 
     sv_setpv(sv, "");
+    /* sv_setpv retains old UTF8ness [perl #24846] */
+    if (SvUTF8(sv))
+	SvUTF8_off(sv);
+
     if (PL_tainting && SvMAGICAL(sv))
 	SvTAINTED_off(sv);
 
diff -urN perl~/t/op/join.t perl/t/op/join.t
--- perl~/t/op/join.t	Sat Dec 30 16:16:18 2000
+++ perl/t/op/join.t	Mon Jan 12 10:34:22 2004
@@ -1,6 +1,6 @@
 #!./perl
 
-print "1..14\n";
+print "1..18\n";
 
 @x = (1, 2, 3);
 if (join(':',@x) eq '1:2:3') {print "ok 1\n";} else {print "not ok 1\n";}
@@ -65,3 +65,29 @@
   print "ok 14\n";
 }
 
+{ # [perl #24846] $jb2 should be in bytes, not in utf8.
+  my $b = "abc\304";
+  my $u = "abc\x{0100}";
+
+  sub join_into_my_variable {
+    my $r = join("", @_);
+    return $r;
+  }
+
+  my $jb1 = join_into_my_variable("", $b);
+  my $ju1 = join_into_my_variable("", $u);
+  my $jb2 = join_into_my_variable("", $b);
+  my $ju2 = join_into_my_variable("", $u);
+
+  print "not " unless unpack('H*', $jb1) eq unpack('H*', $b);
+  print "ok 15\n";
+
+  print "not " unless unpack('H*', $ju1) eq unpack('H*', $u);
+  print "ok 16\n";
+
+  print "not " unless unpack('H*', $jb2) eq unpack('H*', $b);
+  print "ok 17\n";
+
+  print "not " unless unpack('H*', $ju2) eq unpack('H*', $u);
+  print "ok 18\n";
+}
### \z patch

Regards
SADAHIRO Tomoyuki

@p5pRT
Copy link
Author

p5pRT commented Jan 12, 2004

From @rgs

SADAHIRO Tomoyuki wrote​:

I disagree warning when UTF8 and ASCII are mixed.

So do I.

I think it would upset encoding.pm
which allows byte strings as in arbitrary encoding
other than the system-native encoding (ASCII/Latin1 or EBCDIC).

### \A patch against perl-5.8.3 RC1
diff -urN perl~/doop.c perl/doop.c
--- perl~/doop.c Fri Dec 19 05​:47​:58 2003
+++ perl/doop.c Mon Jan 12 10​:08​:10 2004

Thanks, applied to bleadperl as #22117.

@p5pRT
Copy link
Author

p5pRT commented Jan 12, 2004

@rgs - Status changed from 'open' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant