"Malformed UTF-8 character (unexpected end of string)" on a tainted string in 5.20 #13948

Closed
p5pRT opened this issue Jun 21, 2014 · 8 comments

Comments

@p5pRT

p5pRT commented Jun 21, 2014

Migrated from rt.perl.org#122148 (status was 'resolved')

Searchable as RT122148$

@p5pRT

p5pRT commented Jun 21, 2014

From Mark.Martinec@ijs.si

Created by Mark.Martinec@ijs.si

Under perl 5.20.0 the following program fails (or warns) on:

  Malformed UTF-8 character (unexpected end of string)
  in substitution iterator at ./test.pl line 16.

if the character string is tainted. Leaving out the -T
option or not having the string tainted avoids the problem.

#!/usr/bin/perl -T
use strict;
use warnings;
use warnings FATAL => 'utf8';
use Encode;

my $taint = substr($ENV{PATH}, 0,0); # tainted empty string

# just a convenient way to represent some real (spam) subject text:
my $s =
  '=?gb2312?B?zMbYkNSKIEx5ZGlhIL7o2ZvWrsir0MLToof4ob5Ma'.
  'XBzeaG/w9fJq76nyq+2zMi5IHwg11TUgeL5VmljdG9yaWEgJiDFy'.
  '9DAxN1Mb3JyZXR0YSC+6Nmbob5HcmVlbmJlbGyhv7Pk64rG9w==?=';
my $chars = decode('MIME-Header', $s) . $taint;

$chars =~ s{ ([\x00-\x1f\x7f\\]) }{ sprintf('\\u%.4X',ord($1)) }xgse;

binmode(STDOUT,':utf8');
printf("%s\n", $chars);
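The two preconditions named above (a tainted string that also holds wide characters) can be checked before running the substitution. A minimal sketch, assuming nothing beyond core modules; `string_state` is a hypothetical helper written for this example, and `tainted` only ever reports true when perl runs with -T:

```perl
use strict;
use warnings;
use Scalar::Util qw(tainted);
use Encode qw(encode_utf8);

# Hypothetical helper: report the properties of a string that matter here.
sub string_state {
    my ($str) = @_;
    return {
        tainted => tainted($str) ? 1 : 0,        # true only under -T
        utf8    => utf8::is_utf8($str) ? 1 : 0,  # internal UTF-8 flag
        chars   => length($str),                 # length in characters
        bytes   => length(encode_utf8($str)),    # length once UTF-8 encoded
    };
}

my $taint = substr($ENV{PATH} // '', 0, 0);      # tainted empty string under -T
my $state = string_state("XXXX\x{1000}aa" . $taint);
printf "%s=%d\n", $_, $state->{$_} for sort keys %$state;
```

When `chars` and `bytes` differ and `tainted` is set, the string is of the kind that triggers the warning in this report.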

Perl Info

Flags:
     category=core
     severity=medium

Site configuration information for perl 5.20.0:

Configured by root at Mon Jun 16 15:09:13 UTC 2014.

Summary of my perl5 (revision 5 version 20 subversion 0) configuration:

   Platform:
     osname=freebsd, osvers=10.0-release, 
archname=amd64-freebsd-thread-multi
     uname='freebsd 10amd64-ws-default-job-03 10.0-release freebsd 
10.0-release amd64 '
     config_args='-sde -Dprefix=/usr/local 
-Darchlib=/usr/local/lib/perl5/5.20/mach 
-Dprivlib=/usr/local/lib/perl5/5.20 
-Dman3dir=/usr/local/lib/perl5/5.20/perl/man/man3 
-Dman1dir=/usr/local/man/man1 
-Dsitearch=/usr/local/lib/perl5/site_perl/5.20/mach 
-Dsitelib=/usr/local/lib/perl5/site_perl/5.20 -Dscriptdir=/usr/local/bin 
-Dsiteman3dir=/usr/local/lib/perl5/5.20/man/man3 
-Dsiteman1dir=/usr/local/man/man1 -Ui_malloc -Ui_iconv 
-Uinstallusrbinperl -Dcc=cc -Duseshrplib -Dinc_version_list=none 
-Dccflags=-DAPPLLIB_EXP="/usr/local/lib/perl5/5.20/BSDPAN" -Doptimize=-g 
-DDEBUGGING -Ui_gdbm -Duse64bitint -Dusethreads=y -Dusemymalloc=n'
     hint=recommended, useposix=true, d_sigaction=define
     useithreads=define, usemultiplicity=define
     use64bitint=define, use64bitall=define, uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='cc', ccflags ='-DAPPLLIB_EXP="/usr/local/lib/perl5/5.20/BSDPAN" 
-DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -DDEBUGGING -fno-strict-aliasing 
-pipe -fstack-protector -I/usr/local/include',
     optimize='-g',
     cppflags='-DAPPLLIB_EXP="/usr/local/lib/perl5/5.20/BSDPAN" 
-DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -DDEBUGGING -fno-strict-aliasing 
-pipe -fstack-protector -I/usr/local/include'
     ccversion='', gccversion='4.2.1 Compatible FreeBSD Clang 3.3 
(tags/RELEASE_33/final 183502)', gccosandvers=''
     intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
     ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', 
lseeksize=8
     alignbytes=8, prototype=define
   Linker and Libraries:
     ld='cc', ldflags ='-pthread -Wl,-E  -fstack-protector 
-L/usr/local/lib'
     libpth=/usr/lib /usr/local/lib /usr/include/clang/3.3 /usr/lib
     libs=-lm -lcrypt -lutil
     perllibs=-lm -lcrypt -lutil
     libc=, so=so, useshrplib=true, libperl=libperl.so
     gnulibc_version=''
   Dynamic Linking:
     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='  
-Wl,-R/usr/local/lib/perl5/5.20/mach/CORE'
     cccdlflags='-DPIC -fPIC', lddlflags='-shared  -L/usr/local/lib 
-fstack-protector'



@INC for perl 5.20.0:
     /usr/local/lib/perl5/5.20/BSDPAN
     /usr/local/lib/perl5/site_perl/5.20/mach
     /usr/local/lib/perl5/site_perl/5.20
     /usr/local/lib/perl5/5.20/mach
     /usr/local/lib/perl5/5.20
     .


Environment for perl 5.20.0:
     HOME=/home/mark
     LANG (unset)
     LANGUAGE (unset)
     LC_ALL=en_US.UTF-8
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     
PATH=/usr/local/bin:/usr/local/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/kde4/bin/:/usr/X11R6/bin
     PERL_BADLANG (unset)
     SHELL=/usr/local/bin/bash


@p5pRT

p5pRT commented Jun 21, 2014

From @khwilliamson

On 06/20/2014 07​:53 PM, Mark Martinec (via RT) wrote​:

# New Ticket Created by Mark Martinec
# Please include the string​: [perl #122148]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=122148 >

This is a bug report for perl from Mark.Martinec@​ijs.si,
generated with the help of perlbug 1.40 running under perl 5.20.0.

-----------------------------------------------------------------
[Please describe your issue here]

Under perl 5.20.0 the following program fails (or warns) on:

  Malformed UTF-8 character (unexpected end of string)
  in substitution iterator at ./test.pl line 16.

if the character string is tainted. Leaving out the -T
option or not having the string tainted avoids the problem.

#!/usr/bin/perl -T
use strict;
use warnings;
use warnings FATAL => 'utf8';
use Encode;

my $taint = substr($ENV{PATH}, 0,0); # tainted empty string

# just a convenient way to represent some real (spam) subject text:
my $s =
'=?gb2312?B?zMbYkNSKIEx5ZGlhIL7o2ZvWrsir0MLToof4ob5Ma'.
'XBzeaG/w9fJq76nyq+2zMi5IHwg11TUgeL5VmljdG9yaWEgJiDFy'.
'9DAxN1Mb3JyZXR0YSC+6Nmbob5HcmVlbmJlbGyhv7Pk64rG9w==?=';
my $chars = decode('MIME-Header', $s) . $taint;

$chars =~ s{ ([\x00-\x1f\x7f\\]) }{ sprintf('\\u%.4X',ord($1)) }xgse;

binmode(STDOUT,':utf8');
printf("%s\n", $chars);

[Please do not change anything below this line]
-----------------------------------------------------------------

I bisected this to:
commit 25fdce4
Author: Father Chrysostomos <sprout@cpan.org>
Date: Tue Jul 23 13:15:34 2013 -0700

  Stop pos() from being confused by changing utf8ness

  The value of pos() is stored as a byte offset. If it is stored on a
  tied variable or a reference (or glob), then the stringification could
  change, resulting in pos() now pointing to a different character
  offset or pointing to the middle of a character:

  $ ./perl -Ilib -le '$x = bless [], chr 256; pos $x=1; bless $x, a; print pos $x'
  2
  $ ./perl -Ilib -le '$x = bless [], chr 256; pos $x=1; bless $x, "\x{1000}"; print pos $x'
  Malformed UTF-8 character (unexpected end of string) in match position at -e line 1.
  0

  So pos() should be stored as a character offset.

  The regular expression engine expects byte offsets always, so allow it
  to store bytes when possible (a pure non-magical string) but use
  characters otherwise.

  This does result in more complexity than I should like, but the
  alternative (always storing a character offset) would slow down regular
  expressions, which is a big no-no.

:100644 100644 01a9e8b77bbb01605b78d988ccc0e83f6d826c74
0126bb59097724b42e9ef260093bf2e2719896a8 M dump.c
:100644 100644 f92ba8e39aeee756ec30b559cf3d5985f7de936f
73951d931cdcce8f569b6a6f12a5f9ee3bc8dd3c M embed.fnc
:100644 100644 4c62a834a9faf05791a734339918e5edc8ee7ee7
49700ca3524675c935ef8c7b56c1cb32990a4fdc M embed.h
:040000 040000 7911f5b1bc1d3695b1418177e4b09bbdd322a910
d91b549da4bb2b85b35e033a23f7805cb1f6913a M ext
:100644 100644 48cc187e4a52b563c5fa2c089e2fd6fe1e19ae43
b33cd3fd6ff4dc3476ea5f69ae7c0b0ffca347e6 M inline.h
:100644 100644 f18a98a33a8191c25142b327bc65a701b1c03a3e
c2d2186e96fc2121dc2a7d10288e1ecf20acb996 M mg.c
:100644 100644 de673d424377cc07dca6e520cb476586c66e7732
29e339f82c9ac3b163dc2a6bfedb0ac378b7eb5c M mg.h
:100644 100644 46294b3d5c6ef3748c670b0370b9c62c3e484e74
032b93920147733fb81a479a0c80c36e8590950f M pp.c
:100644 100644 6cb26bd02c24a44d1716c5a47466a9ce94ce9f54
b71648c4981b827537e5198f0bf39ed83fcb63ed M pp_ctl.c
:100644 100644 6068d2197391090354548e193460f41095861bd5
afecce8e111ddaad992ac23bfd1e148d059fbcd4 M pp_hot.c
:100644 100644 5c06505b6237240a434b8ae8e1151c9015f6fdee
7326fb805ed6f254d55bf5b83156c0957c00cd07 M proto.h
:100644 100644 d207d0d951de3dddc27b0ebc61b1e0fe7b48c4b6
44690b3280401a18ddaa30195def419a57848d14 M regexec.c
:100644 100644 9c8fd30225c3f17bcc1899a7839eb99cc82e3638
fd6425f4033a262891c135afcff21156b4cac815 M regexp.h
:100644 100644 4e4a917dca198a065e05ee988c5edd8af1c9e784
3945ab991a372c112db49a2e22f7b58d8f90aaab M sv.c
:100644 100644 6d8a40e8f6e874c521611087f1bbee36e2b74c68
2f0eabc74aedc5bcbfed0cd0e25b89a2733bacbe M sv.h
:040000 040000 b6e9c8126409d1fbe6f717d603fcf77986b58871
f3876f58ada74dc0e900b818d98e4314341ec79c M t
bisect run success
That took 705 seconds
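
The byte-versus-character distinction in that commit message can be illustrated from Perl space. This is only a sketch: `char_to_byte_offset` and `byte_to_char_offset` are hypothetical helpers written for this example, not perl internals, and they use `Encode` to stand in for the internal UTF-8 buffer.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode);

# Hypothetical helper: byte offset in the UTF-8 encoding that corresponds
# to a given character offset.
sub char_to_byte_offset {
    my ($str, $char_off) = @_;
    return length(encode_utf8(substr($str, 0, $char_off)));
}

# Hypothetical helper: character offset that corresponds to a byte offset.
# Croaks if the byte offset lands in the middle of a character.
sub byte_to_char_offset {
    my ($str, $byte_off) = @_;
    my $bytes = substr(encode_utf8($str), 0, $byte_off);
    return length(decode('UTF-8', $bytes, Encode::FB_CROAK));
}

my $str = "XXXX\x{1000}aaa";
$str =~ /\x{1000}/g;                 # leaves pos() just past U+1000
printf "pos() in characters: %d\n", pos($str);
printf "same position in bytes: %d\n", char_to_byte_offset($str, pos($str));
```

U+1000 takes three bytes in UTF-8, so the two offsets diverge for every position after it. A byte offset interpreted against a different string's contents can therefore land inside a character, which is the situation the "Malformed UTF-8 character" warning reports.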

@p5pRT

p5pRT commented Jun 21, 2014

The RT System itself - Status changed from 'new' to 'open'

@p5pRT

p5pRT commented Jun 24, 2014

From @iabyn

On Sat, Jun 21, 2014 at 01​:06​:42PM -0600, Karl Williamson wrote​:

On 06/20/2014 07​:53 PM, Mark Martinec (via RT) wrote​:

Under perl 5.20.0 the following program fails (or warns) on​:

Malformed UTF-8 character (unexpected end of string)
in substitution iterator at ./test.pl line 16.

I can reduce the demo code to the following​:

  $ p -Twe '$_ = "XXXX\x{1000}aaaaaaaaaaaaaaaaaXX" . $^X; s/X/"xxxxxx"/ge'
  Malformed UTF-8 character (unexpected end of string) in substitution iterator at -e line 1.
  $

I haven't looked into it any further yet.

--
Wesley Crusher gets beaten up by his classmates for being a smarmy git,
and consequently has a go at making some friends of his own age for a
change.
  -- Things That Never Happen in "Star Trek" #18

@p5pRT

p5pRT commented Jul 2, 2014

From @iabyn

On Tue, Jun 24, 2014 at 11​:53​:02AM +0100, Dave Mitchell wrote​:

On Sat, Jun 21, 2014 at 01​:06​:42PM -0600, Karl Williamson wrote​:

On 06/20/2014 07​:53 PM, Mark Martinec (via RT) wrote​:

Under perl 5.20.0 the following program fails (or warns) on​:

Malformed UTF-8 character (unexpected end of string)
in substitution iterator at ./test.pl line 16.

I can reduce the demo code to the following:

  $ p -Twe '$_ = "XXXX\x{1000}aaaaaaaaaaaaaaaaaXX" . $^X; s/X/"xxxxxx"/ge'
  Malformed UTF-8 character (unexpected end of string) in substitution iterator at -e line 1.
  $

I haven't looked into it any further yet.

Now fixed with the following. A good candidate for 5.20.1

commit cda67c9
Author: David Mitchell <davem@iabyn.com>
AuthorDate: Wed Jul 2 17:13:45 2014 +0100
Commit: David Mitchell <davem@iabyn.com>
CommitDate: Wed Jul 2 17:22:52 2014 +0100

  s///e on tainted utf8 strings got pos() messed up

  RT #122148: In 5.20, commit 25fdce4 changed the way pos() was stored
  in magic attached to SVs from being a byte offset to a char offset,
  *except* that, for efficiency, strings being used for pattern matching
  were kept as byte offsets (with a flag indicating thus), *except* where
  the SV already had magic attached (such as taint, as in the bug report and
  in this commit's test), in which case it kept it as chars.

  The code that updated pos() after an iteration of s///e was faulty: the
  string buffer it used for converting byte lengths to char lengths (via
  utf8_length()) was the wrong buffer: rather than using the src string
  being matched against, it was using the destination string being built up
  via iterations of s///. Once double-byte utf8 chars were involved, all the
  pos() calculations went wrong, and utf8 warnings started mysteriously
  appearing.
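
The effect of measuring against the wrong buffer can be modelled in a few lines of Perl. This is a simplified illustration, not the C code that was fixed: `chars_in_prefix` is a hypothetical stand-in for what `utf8_length()` computes over a byte range, and the sample strings are made up.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical stand-in for utf8_length(): count the characters held in the
# first $byte_len bytes of a UTF-8 byte buffer.
sub chars_in_prefix {
    my ($utf8_bytes, $byte_len) = @_;
    my $prefix = substr($utf8_bytes, 0, $byte_len);
    # Every byte that is not a continuation byte (0x80..0xBF) starts a character.
    my $count = () = $prefix =~ /[^\x80-\xBF]/g;
    return $count;
}

my $src = encode_utf8("X\x{1000}X");  # source buffer, bytes: 58 E1 80 80 58
my $dst = "xxxxxx";                   # text built up so far by s///, all ASCII

my $match_end = 4;                    # byte offset in $src just past U+1000

# Correct: the byte offset belongs to $src, so measure $src.
printf "measured against src: %d chars\n", chars_in_prefix($src, $match_end);

# Faulty: the same byte offset measured against the unrelated $dst buffer.
printf "measured against dst: %d chars\n", chars_in_prefix($dst, $match_end);
```

The two results disagree as soon as the source contains a multi-byte character before the match position, so a pos() derived from the destination buffer no longer points at a character boundary in the source.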

--
No man treats a motor car as foolishly as he treats another human being.
When the car will not go, he does not attribute its annoying behaviour to
sin, he does not say, You are a wicked motorcar, and I shall not give you
any more petrol until you go. He attempts to find out what is wrong and
set it right.
  -- Bertrand Russell,
  Has Religion Made Useful Contributions to Civilization?

@p5pRT

p5pRT commented Jul 3, 2014

@iabyn - Status changed from 'open' to 'pending release'

@p5pRT

p5pRT commented Jun 2, 2015

From @khwilliamson

Thanks for submitting this ticket.

The issue should be resolved with the release today of Perl v5.22. If you find that the problem persists, feel free to reopen this ticket.

--
Karl Williamson for the Perl 5 porters team

@p5pRT p5pRT closed this as completed Jun 2, 2015
@p5pRT

p5pRT commented Jun 2, 2015

@khwilliamson - Status changed from 'pending release' to 'resolved'
