Report information
Id: 122148
Status: resolved
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors: mmartinec <Mark.Martinec [at] ijs.si>
Cc:
AdminCc:

Operating System: freebsd
PatchStatus: (no value)
Severity: medium
Type: core
Perl Version: 5.20.0
Fixed In:
  • 5.21.2
  • 5.22.0



To: perlbug [...] perl.org
Subject: "Malformed UTF-8 character (unexpected end of string)" on a tainted string in 5.20
Date: Sat, 21 Jun 2014 03:53:05 +0200
From: Mark Martinec <Mark.Martinec [...] ijs.si>
This is a bug report for perl from Mark.Martinec@ijs.si,
generated with the help of perlbug 1.40 running under perl 5.20.0.

-----------------------------------------------------------------
[Please describe your issue here]

Under perl 5.20.0 the following program fails (or warns) on:

   Malformed UTF-8 character (unexpected end of string)
   in substitution iterator at ./test.pl line 16.

if the character string is tainted. Leaving out the -T
option or not having the string tainted avoids the problem.

#!/usr/bin/perl -T
use strict;
use warnings;
use warnings FATAL => 'utf8';
use Encode;

my $taint = substr($ENV{PATH}, 0,0);  # tainted empty string

# just a convenient way to represent some real (spam) subject text:
my $s =
   '=?gb2312?B?zMbYkNSKIEx5ZGlhIL7o2ZvWrsir0MLToof4ob5Ma'.
   'XBzeaG/w9fJq76nyq+2zMi5IHwg11TUgeL5VmljdG9yaWEgJiDFy'.
   '9DAxN1Mb3JyZXR0YSC+6Nmbob5HcmVlbmJlbGyhv7Pk64rG9w==?=';
my $chars = decode('MIME-Header', $s) . $taint;

$chars =~ s{ ([\x00-\x1f\x7f\\]) }{ sprintf('\\u%.4X',ord($1)) }xgse;

binmode(STDOUT,':utf8');
printf("%s\n", $chars);

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=medium
---
Site configuration information for perl 5.20.0:

Configured by root at Mon Jun 16 15:09:13 UTC 2014.

Summary of my perl5 (revision 5 version 20 subversion 0) configuration:

  Platform:
    osname=freebsd, osvers=10.0-release, archname=amd64-freebsd-thread-multi
    uname='freebsd 10amd64-ws-default-job-03 10.0-release freebsd 10.0-release amd64 '
    config_args='-sde -Dprefix=/usr/local -Darchlib=/usr/local/lib/perl5/5.20/mach -Dprivlib=/usr/local/lib/perl5/5.20 -Dman3dir=/usr/local/lib/perl5/5.20/perl/man/man3 -Dman1dir=/usr/local/man/man1 -Dsitearch=/usr/local/lib/perl5/site_perl/5.20/mach -Dsitelib=/usr/local/lib/perl5/site_perl/5.20 -Dscriptdir=/usr/local/bin -Dsiteman3dir=/usr/local/lib/perl5/5.20/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Ui_malloc -Ui_iconv -Uinstallusrbinperl -Dcc=cc -Duseshrplib -Dinc_version_list=none -Dccflags=-DAPPLLIB_EXP="/usr/local/lib/perl5/5.20/BSDPAN" -Doptimize=-g -DDEBUGGING -Ui_gdbm -Duse64bitint -Dusethreads=y -Dusemymalloc=n'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-DAPPLLIB_EXP="/usr/local/lib/perl5/5.20/BSDPAN" -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
    optimize='-g',
    cppflags='-DAPPLLIB_EXP="/usr/local/lib/perl5/5.20/BSDPAN" -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.2.1 Compatible FreeBSD Clang 3.3 (tags/RELEASE_33/final 183502)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags ='-pthread -Wl,-E -fstack-protector -L/usr/local/lib'
    libpth=/usr/lib /usr/local/lib /usr/include/clang/3.3 /usr/lib
    libs=-lm -lcrypt -lutil
    perllibs=-lm -lcrypt -lutil
    libc=, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' -Wl,-R/usr/local/lib/perl5/5.20/mach/CORE'
    cccdlflags='-DPIC -fPIC', lddlflags='-shared -L/usr/local/lib -fstack-protector'

---
@INC for perl 5.20.0:
    /usr/local/lib/perl5/5.20/BSDPAN
    /usr/local/lib/perl5/site_perl/5.20/mach
    /usr/local/lib/perl5/site_perl/5.20
    /usr/local/lib/perl5/5.20/mach
    /usr/local/lib/perl5/5.20
    .

---
Environment for perl 5.20.0:
    HOME=/home/mark
    LANG (unset)
    LANGUAGE (unset)
    LC_ALL=en_US.UTF-8
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/bin:/usr/local/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/kde4/bin/:/usr/X11R6/bin
    PERL_BADLANG (unset)
    SHELL=/usr/local/bin/bash
From: Karl Williamson <public [...] khwilliamson.com>
Date: Sat, 21 Jun 2014 13:06:42 -0600
Subject: Re: [perl #122148] "Malformed UTF-8 character (unexpected end of string)" on a tainted string in 5.20
To: perl5-porters [...] perl.org, "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>
On 06/20/2014 07:53 PM, Mark Martinec (via RT) wrote:
> # New Ticket Created by Mark Martinec
> # Please include the string: [perl #122148]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt.perl.org/Ticket/Display.html?id=122148 >
>
>
> This is a bug report for perl from Mark.Martinec@ijs.si,
> generated with the help of perlbug 1.40 running under perl 5.20.0.
>
>
> -----------------------------------------------------------------
> [Please describe your issue here]
>
> Under perl 5.20.0 the following program fails (or warns) on:
>
>    Malformed UTF-8 character (unexpected end of string)
>    in substitution iterator at ./test.pl line 16.
>
> if the character string is tainted. Leaving out the -T
> option or not having the string tainted avoids the problem.
>
>
>
> #!/usr/bin/perl -T
> use strict;
> use warnings;
> use warnings FATAL => 'utf8';
> use Encode;
>
> my $taint = substr($ENV{PATH}, 0,0);  # tainted empty string
>
> # just a convenient way to represent some real (spam) subject text:
> my $s =
>    '=?gb2312?B?zMbYkNSKIEx5ZGlhIL7o2ZvWrsir0MLToof4ob5Ma'.
>    'XBzeaG/w9fJq76nyq+2zMi5IHwg11TUgeL5VmljdG9yaWEgJiDFy'.
>    '9DAxN1Mb3JyZXR0YSC+6Nmbob5HcmVlbmJlbGyhv7Pk64rG9w==?=';
> my $chars = decode('MIME-Header', $s) . $taint;
>
> $chars =~ s{ ([\x00-\x1f\x7f\\]) }{ sprintf('\\u%.4X',ord($1)) }xgse;
>
> binmode(STDOUT,':utf8');
> printf("%s\n", $chars);
>
>
> [Please do not change anything below this line]
> -----------------------------------------------------------------
I bisected this to:

commit 25fdce4a165b6305e760d4c8d94404ce055657a0
Author: Father Chrysostomos <sprout@cpan.org>
Date:   Tue Jul 23 13:15:34 2013 -0700

    Stop pos() from being confused by changing utf8ness

    The value of pos() is stored as a byte offset. If it is stored on a
    tied variable or a reference (or glob), then the stringification could
    change, resulting in pos() now pointing to a different character
    offset or pointing to the middle of a character:

    $ ./perl -Ilib -le '$x = bless [], chr 256; pos $x=1; bless $x, a; print pos $x'
    2
    $ ./perl -Ilib -le '$x = bless [], chr 256; pos $x=1; bless $x, "\x{1000}"; print pos $x'
    Malformed UTF-8 character (unexpected end of string) in match position at -e line 1.
    0

    So pos() should be stored as a character offset.

    The regular expression engine expects byte offsets always, so allow it
    to store bytes when possible (a pure non-magical string) but use
    characters otherwise.

    This does result in more complexity than I should like, but the
    alternative (always storing a character offset) would slow down regular
    expressions, which is a big no-no.

:100644 100644 01a9e8b77bbb01605b78d988ccc0e83f6d826c74 0126bb59097724b42e9ef260093bf2e2719896a8 M dump.c
:100644 100644 f92ba8e39aeee756ec30b559cf3d5985f7de936f 73951d931cdcce8f569b6a6f12a5f9ee3bc8dd3c M embed.fnc
:100644 100644 4c62a834a9faf05791a734339918e5edc8ee7ee7 49700ca3524675c935ef8c7b56c1cb32990a4fdc M embed.h
:040000 040000 7911f5b1bc1d3695b1418177e4b09bbdd322a910 d91b549da4bb2b85b35e033a23f7805cb1f6913a M ext
:100644 100644 48cc187e4a52b563c5fa2c089e2fd6fe1e19ae43 b33cd3fd6ff4dc3476ea5f69ae7c0b0ffca347e6 M inline.h
:100644 100644 f18a98a33a8191c25142b327bc65a701b1c03a3e c2d2186e96fc2121dc2a7d10288e1ecf20acb996 M mg.c
:100644 100644 de673d424377cc07dca6e520cb476586c66e7732 29e339f82c9ac3b163dc2a6bfedb0ac378b7eb5c M mg.h
:100644 100644 46294b3d5c6ef3748c670b0370b9c62c3e484e74 032b93920147733fb81a479a0c80c36e8590950f M pp.c
:100644 100644 6cb26bd02c24a44d1716c5a47466a9ce94ce9f54 b71648c4981b827537e5198f0bf39ed83fcb63ed M pp_ctl.c
:100644 100644 6068d2197391090354548e193460f41095861bd5 afecce8e111ddaad992ac23bfd1e148d059fbcd4 M pp_hot.c
:100644 100644 5c06505b6237240a434b8ae8e1151c9015f6fdee 7326fb805ed6f254d55bf5b83156c0957c00cd07 M proto.h
:100644 100644 d207d0d951de3dddc27b0ebc61b1e0fe7b48c4b6 44690b3280401a18ddaa30195def419a57848d14 M regexec.c
:100644 100644 9c8fd30225c3f17bcc1899a7839eb99cc82e3638 fd6425f4033a262891c135afcff21156b4cac815 M regexp.h
:100644 100644 4e4a917dca198a065e05ee988c5edd8af1c9e784 3945ab991a372c112db49a2e22f7b58d8f90aaab M sv.c
:100644 100644 6d8a40e8f6e874c521611087f1bbee36e2b74c68 2f0eabc74aedc5bcbfed0cd0e25b89a2733bacbe M sv.h
:040000 040000 b6e9c8126409d1fbe6f717d603fcf77986b58871 f3876f58ada74dc0e900b818d98e4314341ec79c M t
bisect run success
That took 705 seconds
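
The distinction the bisected commit draws between byte offsets and character offsets can be seen from plain Perl. The following is only a minimal sketch: char_to_byte_offset is a hypothetical helper written for illustration, not anything in the perl core, and it assumes the string's internal representation is UTF-8, which is the case once a string contains a character above U+00FF.

```perl
use strict;
use warnings;

# Hypothetical helper (not a core function): the byte offset, within the
# UTF-8 encoding of $string, that corresponds to character offset $char_off.
sub char_to_byte_offset {
    my ($string, $char_off) = @_;
    my $prefix = substr($string, 0, $char_off);
    utf8::encode($prefix);      # turn the prefix into UTF-8 octets, in place
    return length $prefix;      # length of an octet string counts bytes
}

# U+1000 occupies three bytes in UTF-8, so the two offsets diverge after it.
my $text = "ab\x{1000}cd";
$text =~ /\x{1000}/g;           # leaves pos() just past the wide character

my $char_pos = pos($text);
my $byte_pos = char_to_byte_offset($text, $char_pos);

printf "pos() in characters: %d, equivalent byte offset: %d\n",
    $char_pos, $byte_pos;
```

Perl code only ever sees the character offset from pos(); the byte offset is what the regular expression engine works with internally, according to the commit message above.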
Date: Tue, 24 Jun 2014 11:53:02 +0100
From: Dave Mitchell <davem [...] iabyn.com>
To: Karl Williamson <public [...] khwilliamson.com>
Subject: Re: [perl #122148] "Malformed UTF-8 character (unexpected end of string)" on a tainted string in 5.20
CC: perl5-porters [...] perl.org, "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>
On Sat, Jun 21, 2014 at 01:06:42PM -0600, Karl Williamson wrote:
> On 06/20/2014 07:53 PM, Mark Martinec (via RT) wrote:
> >Under perl 5.20.0 the following program fails (or warns) on:
> >
> >   Malformed UTF-8 character (unexpected end of string)
> >   in substitution iterator at ./test.pl line 16.

I can reduce the demo code to the following:

    $ p -Twe '$_ = "XXXX\x{1000}aaaaaaaaaaaaaaaaaXX" . $^X; s/X/"xxxxxx"/ge'
    Malformed UTF-8 character (unexpected end of string) in substitution iterator at -e line 1.
    $

I haven't looked into it any further yet.

-- 
Wesley Crusher gets beaten up by his classmates for being a smarmy git,
and consequently has a go at making some friends of his own age for a
change.
    -- Things That Never Happen in "Star Trek" #18
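
For anyone wanting to check a given perl against the reduced case, a script along the following lines may help. It is a sketch, not part of the ticket: substitute_and_collect_warnings is a hypothetical helper name, and the string is only tainted (so the bug can only show up) when the script is run with -T on an affected perl such as the 5.20.0 of the report. Without -T it merely confirms that the substitution gives the expected result.

```perl
use strict;
use warnings;

# Hypothetical helper: run the substitution from the reduced test case on a
# copy of $input and return the result plus any warnings it raised.
sub substitute_and_collect_warnings {
    my ($input) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    (my $copy = $input) =~ s/X/"xxxxxx"/ge;
    return ($copy, \@warnings);
}

# Appending $^X taints the string only when perl runs in taint mode (-T).
my $input = "XXXX\x{1000}aaaaaaaaaaaaaaaaaXX" . $^X;
my ($out, $warnings) = substitute_and_collect_warnings($input);

my $malformed = grep { /Malformed UTF-8/ } @$warnings;
printf "taint mode: %s\n", ${^TAINT} ? 'on' : 'off';
printf "malformed UTF-8 warnings: %d\n", $malformed;
```

Run it as "perl -T script.pl" to exercise the tainted path; a non-zero warning count there would suggest the perl in use still has the problem.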
CC: perl5-porters [...] perl.org, "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>
Subject: Re: [perl #122148] "Malformed UTF-8 character (unexpected end of string)" on a tainted string in 5.20
To: Karl Williamson <public [...] khwilliamson.com>
Date: Wed, 2 Jul 2014 17:43:12 +0100
From: Dave Mitchell <davem [...] iabyn.com>
On Tue, Jun 24, 2014 at 11:53:02AM +0100, Dave Mitchell wrote:
> On Sat, Jun 21, 2014 at 01:06:42PM -0600, Karl Williamson wrote:
> > On 06/20/2014 07:53 PM, Mark Martinec (via RT) wrote:
> > >Under perl 5.20.0 the following program fails (or warns) on:
> > >
> > >   Malformed UTF-8 character (unexpected end of string)
> > >   in substitution iterator at ./test.pl line 16.
>
> I can reduce the demo code to the following:
>
>     $ p -Twe '$_ = "XXXX\x{1000}aaaaaaaaaaaaaaaaaXX" . $^X; s/X/"xxxxxx"/ge'
>     Malformed UTF-8 character (unexpected end of string) in substitution iterator at -e line 1.
>     $
>
> I haven't looked into it any further yet.

Now fixed with the following. A good candidate for 5.20.1

commit cda67c9995c6d90b71a0939aaae084e1869b8248
Author:     David Mitchell <davem@iabyn.com>
AuthorDate: Wed Jul 2 17:13:45 2014 +0100
Commit:     David Mitchell <davem@iabyn.com>
CommitDate: Wed Jul 2 17:22:52 2014 +0100

    s///e on tainted utf8 strings got pos() messed up

    RT #122148:

    In 5.20, commit 25fdce4a165 changed the way pos() was stored in magic
    attached to SVs from being a byte offset to a char offset, *except*
    that, for efficiency, strings being used for pattern matching were
    kept as byte offsets (with a flag indicating thus), *except* where the
    SV already had magic attached (such as taint, as in the bug report and
    in this commit's test), in which case it kept it as chars.

    The code that updated pos() after an iteration of s///e was faulty:
    the string buffer it used for converting byte lengths to char lengths
    (via utf8_length()) was the wrong buffer: rather than using the src
    string being matched against, it was using the destination string
    being built up via iterations of s///.

    Once double-byte utf8 chars were involved, all the pos() calculations
    went wrong, and utf8 warnings started mysteriously appearing.

-- 
No man treats a motor car as foolishly as he treats another human being.
When the car will not go, he does not attribute its annoying behaviour to
sin, he does not say, You are a wicked motorcar, and I shall not give you
any more petrol until you go. He attempts to find out what is wrong and
set it right.
    -- Bertrand Russell,
       Has Religion Made Useful Contributions to Civilization?
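
The wrong-buffer arithmetic described in the commit message can be modelled in plain Perl. This is a sketch under stated assumptions, not the actual C code: chars_in_prefix is a hypothetical stand-in for the C-level utf8_length(), and the destination string is assumed to hold the text built up by the time the fifth X of the reduced test case has been replaced. Only its first 25 bytes matter here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the C-level utf8_length(): count the characters
# in the first $byte_off bytes of a UTF-8 octet string. Returns undef when
# the prefix stops in the middle of a multi-byte character.
sub chars_in_prefix {
    my ($octets, $byte_off) = @_;
    my $prefix = substr($octets, 0, $byte_off);
    return utf8::decode($prefix) ? length($prefix) : undef;
}

# Source string from the reduced test case (without the tainted suffix) and
# the assumed state of the destination after the fifth replacement.
my $src = "XXXX\x{1000}" . ("a" x 17) . "XX";
my $dst = ("xxxxxx" x 4) . "\x{1000}" . ("a" x 17) . "xxxxxx";

my $src_bytes = $src;
utf8::encode($src_bytes);
my $dst_bytes = $dst;
utf8::encode($dst_bytes);

# Byte offset in the source just past the fifth X: 4 + 3 + 17 + 1.
my $byte_pos = 25;

my $right = chars_in_prefix($src_bytes, $byte_pos);  # measured on the source
my $wrong = chars_in_prefix($dst_bytes, $byte_pos);  # measured on the destination

printf "source buffer:      %s\n", defined $right ? $right : 'malformed prefix';
printf "destination buffer: %s\n", defined $wrong ? $wrong : 'malformed prefix';
```

Measured against the source, byte 25 falls on a character boundary. Measured against the destination, the same byte count stops one byte into the three-byte U+1000, which is consistent with the "unexpected end of string" warning in the report.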
Subject: Your ticket against Perl 5 has been resolved
Thanks for submitting this ticket.

The issue should be resolved with the release today of Perl v5.22.

If you find that the problem persists, feel free to reopen this ticket.

-- 
Karl Williamson for the Perl 5 porters team

