chop fails on decoded string with trailing nul #8021

p5pRT · 2005-07-16T03:50:05Z

Migrated from rt.perl.org#36569 (status was 'resolved')

Searchable as RT36569$

p5pRT · 2005-07-16T03:50:06Z

From jonathan-hankins@mindspring.com

This is a bug report for perl from jonathan-hankins@mindspring.com,
generated with the help of perlbug 1.35 running under perl v5.8.4.

I ran into this, and wondered if it is a bug.

I have tested on perl 5.8.4 with Encode.pm version 1.99_01 (from
Debian package) and 2.10 (from CPAN).

Basically, if I take a string with a trailing nul, encode it (to any
encoding, even "ascii"), decode it, then chop it, chop returns undef
and the string still has the trailing nul. If the string instead has
a trailing newline (for example), the chop works correctly.

Am I missing something?

Here is sample output from my test code below:

--
@asc (en/de-coded) before chop
$VAR1 = "hello, world!\n";
$VAR2 = "goodbye, cruel world!\0";

@asc2 (untouched) before chop
$VAR1 = "hello, world!\n";
$VAR2 = "goodbye, cruel world!\0";

@asc (en/de-coded) after chop
$VAR1 = "hello, world!";
$VAR2 = "goodbye, cruel world!\0";

@asc2 (untouched) after chop
$VAR1 = "hello, world!";
$VAR2 = "goodbye, cruel world!";
--

And here is my code:

--
#!/usr/bin/perl -w

use strict;
use Encode;
use Data::Dumper;

$Data::Dumper::Useqq = 1;

my @asc = ("hello, world!\n", "goodbye, cruel world!\0");
my @asc2 = @asc; # copy of untouched strings

my @utf = (encode('UTF-16LE', $asc[0]),
           encode('UTF-16LE', $asc[1]));

@asc = (decode('UTF-16LE', $utf[0]),
        decode('UTF-16LE', $utf[1]));

print "\n\n";
print "\@asc (en/de-coded) before chop\n", Dumper(@asc), "\n";
print "\@asc2 (untouched) before chop\n", Dumper(@asc2), "\n";
chop @asc;
chop @asc2;
print "\@asc (en/de-coded) after chop\n", Dumper(@asc), "\n";
print "\@asc2 (untouched) after chop\n", Dumper(@asc2), "\n";
print "\n\n";
--

--

Jonathan Hankins Homewood City Schools

jhankins@homewood.k12.al.us

> I ran into this, and wondered if it is a bug.

Looks like a bug to me. At first glance, I'd describe it as a case where
chop is incapable of removing a null byte from the end of a utf8 string.

Here is another demonstration:

#!/usr/bin/perl

use Encode;

$_ = "foo\0";
while ( /\x00$/ ) {
    printf "chopping from %d bytes\n", length();
    chop;
    sleep 1;
}
printf "okay: %d bytes left\n", length();

$_ = decode( 'ascii', "foo\0" );
while ( /\x00$/ ) {
    printf "chopping from %d utf8 chars\n", length();
    chop;
    sleep 1;
}
printf "okay: %d chars left\n", length();

__END__

For me (macosx 10.3.9/darwin 7.9/perl 5.8.1, and freebsd 5.4/perl 5.8.6),
the second while loop never finishes -- chop never removes the final null.
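
The same comparison can probably be made without Encode at all. The sketch below is only an illustration and assumes that the internal UTF8 flag is what separates the two cases; it uses utf8::upgrade to turn the flag on for an otherwise identical string. The variable names are made up for the example.

```perl
use strict;
use warnings;

# Two strings with identical characters but different internal storage.
my $bytes = "foo\0";          # plain byte string, UTF8 flag off
my $chars = "foo\0";
utf8::upgrade($chars);        # same characters, UTF8 flag on

printf "bytes flagged: %d, chars flagged: %d\n",
    utf8::is_utf8($bytes) ? 1 : 0,
    utf8::is_utf8($chars) ? 1 : 0;

# chop should remove and return the trailing NUL in both cases.
my $chopped_bytes = chop $bytes;
my $chopped_chars = chop $chars;

for my $case ( [ bytes => $chopped_bytes, $bytes ],
               [ chars => $chopped_chars, $chars ] ) {
    my ($name, $chopped, $rest) = @$case;
    my $got = ( defined $chopped && $chopped eq "\0" ) ? 'NUL' : 'nothing';
    printf "%s: chop returned %s, %d left\n", $name, $got, length $rest;
}
```

On a perl that handles this correctly, both lines should report a NUL returned and 3 characters left; on an affected perl the flagged string would be expected to keep its trailing NUL.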

David Graff Linguistic Data Consortium
graff@ldc.upenn.edu 3600 Market St., Suite 810
voice: (215) 898-0887 University of Pennsylvania
fax: (215) 573-2175 Philadelphia, PA 19104
http://www.ldc.upenn.edu


Flags:
category=core
severity=low

Site configuration information for perl v5.8.4:

Configured by Debian Project at Tue Mar 8 20:31:23 EST 2005.

Summary of my perl5 (revision 5 version 8 subversion 4) configuration:
Platform:
osname=linux, osvers=2.4.27-ti1211, archname=i386-linux-thread-multi
uname='linux kosh 2.4.27-ti1211 #1 sun sep 19 18:17:45 est 2004 i686 gnulinux '
config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i386-linux -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.4 -Dsitearch=/usr/local/lib/perl/5.8.4 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.4 -Dd_dosuid -des'
hint=recommended, useposix=true, d_sigaction=define
usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-O2',
cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -I/usr/local/include'
ccversion='', gccversion='3.3.5 (Debian 1:3.3.5-9)', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='cc', ldflags =' -L/usr/local/lib'
libpth=/usr/local/lib /lib /usr/lib
libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
perllibs=-ldl -lm -lpthread -lc -lcrypt
libc=/lib/libc-2.3.2.so, so=so, useshrplib=true, libperl=libperl.so.5.8.4
gnulibc_version='2.3.2'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:

@INC for perl v5.8.4:
/etc/perl
/usr/local/lib/perl/5.8.4
/usr/local/share/perl/5.8.4
/usr/lib/perl5
/usr/share/perl5
/usr/lib/perl/5.8
/usr/share/perl/5.8
/usr/local/lib/site_perl
.

Environment for perl v5.8.4:
HOME=/home/jhankins
LANG=en_US
LANGUAGE (unset)
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/jhankins/bin:/usr/bin/mh
PERL_BADLANG (unset)
SHELL=/usr/bin/zsh

p5pRT · 2005-07-16T13:05:04Z

From BQW10602@nifty.com

> This is a bug report for perl from jonathan-hankins@mindspring.com,
> generated with the help of perlbug 1.35 running under perl v5.8.4.
>
> I ran into this, and wondered if it is a bug.
>
> I have tested on perl 5.8.4 with Encode.pm version 1.99_01 (from
> Debian package) and 2.10 (from CPAN).

Thanks for the report.

utf8_to_uvchr((U8*)s, 0) used in do_chop() returns 0,
not only if the octet sequence from *s is malformed,
but also if *s == '\0'. The return value 0 should be
for U+0000 (NUL) rather than malformedness. Oops :-<

P.S. by the way, when the string in utf8 ends with malformed
octet(s), what should chop() do?
It has returned undef without modification of the string.

SADAHIRO Tomoyuki

Inline Patch

diff -ur perl~/doop.c perl/doop.c
--- perl~/doop.c	Mon Jul 11 04:49:52 2005
+++ perl/doop.c	Sat Jul 16 21:53:44 2005
@@ -977,7 +977,7 @@
 	    s = send - 1;
 	    while (s > start && UTF8_IS_CONTINUATION(*s))
 		s--;
-	    if (utf8_to_uvchr((U8*)s, 0)) {
+	    if (is_utf8_string((U8*)s, send - s)) {
 		sv_setpvn(astr, s, send - s);
 		*s = '\0';
 		SvCUR_set(sv, s - start);
diff -ur perl~/t/op/chop.t perl/t/op/chop.t
--- perl~/t/op/chop.t	Fri Jan 23 23:19:45 2004
+++ perl/t/op/chop.t	Sat Jul 16 20:59:16 2005
@@ -6,7 +6,7 @@
     require './test.pl';
 }
 
-plan tests => 133;
+plan tests => 137;
 
 $_ = 'abc';
 $c = do foo();
@@ -221,4 +221,14 @@
     $a = "A$/";
     $b = chomp $a;
     is ($b, 2);
+}
+
+{
+    # [perl #36569] chop fails on decoded string with trailing nul
+    my $asc = "perl\0";
+    my $utf = "perl".pack('U',0); # marked as utf8
+    is(chop($asc), "\0", "chopping ascii NUL");
+    is(chop($utf), "\0", "chopping utf8 NUL");
+    is($asc, "perl", "chopped ascii NUL");
+    is($utf, "perl", "chopped utf8 NUL");
 }
END OF PATCH
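
The ambiguity behind the one-line change to doop.c can be modelled at the Perl level. This is a rough sketch, not the C code: code_point_or_zero and is_well_formed are hypothetical helpers written for this example, standing in for a decoder that signals failure with 0 and for a separate validity check. Both are built on utf8::decode, which returns false for malformed input.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a decoder that reports failure by returning 0.
sub code_point_or_zero {
    my ($octets) = @_;
    my $copy = $octets;                     # utf8::decode works in place
    return 0 unless utf8::decode($copy);    # malformed UTF-8
    return 0 unless length($copy) == 1;     # not exactly one character
    return ord $copy;                       # a genuine NUL also yields 0
}

# Hypothetical stand-in for a check that answers only "valid or not".
sub is_well_formed {
    my ($octets) = @_;
    my $copy = $octets;
    return utf8::decode($copy) ? 1 : 0;
}

for my $case ( [ 'NUL',       "\x00"     ],
               [ 'truncated', "\xC3"     ],
               [ 'e-acute',   "\xC3\xA9" ] ) {
    my ($name, $octets) = @$case;
    printf "%-10s value=%-4d well-formed=%d\n",
        $name, code_point_or_zero($octets), is_well_formed($octets);
}
```

The first helper gives the same answer for a real NUL and for a truncated sequence, so testing its result for truth cannot tell them apart; the second helper can, which is the distinction the patch relies on.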

p5pRT · 2005-07-16T13:05:06Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2005-07-16T13:48:34Z

From @Tux

On Sat, 16 Jul 2005 22:05:13 +0900, SADAHIRO Tomoyuki <bqw10602@nifty.com>
wrote:

> > This is a bug report for perl from jonathan-hankins@mindspring.com,
> > generated with the help of perlbug 1.35 running under perl v5.8.4.
> >
> > I ran into this, and wondered if it is a bug.
> >
> > I have tested on perl 5.8.4 with Encode.pm version 1.99_01 (from
> > Debian package) and 2.10 (from CPAN).
>
> Thanks for the report.

Thanks for the fast patch. Applied as change #25158

> utf8_to_uvchr((U8*)s, 0) used in do_chop() returns 0,
> not only if the octet sequence from *s is malformed,
> but also if *s == '\0'. The return value 0 should be
> for U+0000 (NUL) rather than malformedness. Oops :-<
>
> P.S. by the way, when the string in utf8 ends with malformed
> octet(s), what should chop() do?
> It has returned undef without modification of the string.

Seems reasonable, though just cutting off one byte of the string would maybe
be more of an expected behaviour. Maybe


--
H.Merijn Brand Amsterdam Perl Mongers (http://amsterdam.pm.org/)
using Perl 5.6.2, 5.8.0, 5.8.5, & 5.9.2 on HP-UX 10.20, 11.00 & 11.11,
AIX 4.3 & 5.2, SuSE 9.2 & 9.3, and Cygwin. http://www.cmve.net/~merijn
Smoking perl: http://www.test-smoke.org, perl QA: http://qa.perl.org
reports to: smokers-reports@perl.org, perl-qa@perl.org

p5pRT · 2005-07-19T03:33:54Z

From @ysth

On Sat, Jul 16, 2005 at 03:48:10PM +0200, H.Merijn Brand wrote:

> On Sat, 16 Jul 2005 22:05:13 +0900, SADAHIRO Tomoyuki <bqw10602@nifty.com>
> wrote:
>
> > > This is a bug report for perl from jonathan-hankins@mindspring.com,
> > > generated with the help of perlbug 1.35 running under perl v5.8.4.
> > >
> > > I ran into this, and wondered if it is a bug.
> > >
> > > I have tested on perl 5.8.4 with Encode.pm version 1.99_01 (from
> > > Debian package) and 2.10 (from CPAN).
> >
> > Thanks for the report.
>
> Thanks for the fast patch. Applied as change #25158
>
> > utf8_to_uvchr((U8*)s, 0) used in do_chop() returns 0,
> > not only if the octet sequence from *s is malformed,
> > but also if *s == '\0'. The return value 0 should be
> > for U+0000 (NUL) rather than malformedness. Oops :-<
> >
> > P.S. by the way, when the string in utf8 ends with malformed
> > octet(s), what should chop() do?
> > It has returned undef without modification of the string.
>
> Seems reasonable, though just cutting off one byte of the string would maybe
> be more of an expected behaviour. Maybe

Was there more to that sentence?

I'd vote for removing and returning a malformed char, from the last
non continuation byte on (or just the unexpected continuation bytes,
if the problem was too many of them).

That way, the data error is propagated onto the return value (as IMO
it should be), and a full-buffer problem will result in at most one
bad char. In fact, I could see being able to rely on this being
advantageous to buffering code (both XS and perl):

    fill buffer with bytes
    chop char and set aside
    process buffer
    move chopped char to start of buffer
    repeat
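
The loop above could be sketched roughly as follows. Since how chop() should treat a malformed tail is still the open question in this thread, the sketch does not use chop at all: it works on raw bytes and sets aside an incomplete trailing UTF-8 sequence by hand. split_incomplete_tail is a hypothetical helper written for this example, and the three-byte chunks only simulate reads that cut characters in half.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Split a byte buffer into (complete, tail), where tail is an incomplete
# UTF-8 sequence at the very end of the buffer (possibly empty).
sub split_incomplete_tail {
    my ($bytes) = @_;
    my $len = length $bytes;
    my ($i, $steps) = ($len, 0);

    # Walk back over at most three continuation bytes (10xxxxxx).
    while ( $i > 0 && $steps < 3
            && ( ord( substr $bytes, $i - 1, 1 ) & 0xC0 ) == 0x80 ) {
        $i--;
        $steps++;
    }
    return ( $bytes, '' ) if $i == 0;    # no lead byte found, leave as is

    # How many bytes does the lead byte announce, and how many are present?
    my $lead = ord substr $bytes, $i - 1, 1;
    my $need = $lead >= 0xF0 ? 4 : $lead >= 0xE0 ? 3 : $lead >= 0xC0 ? 2 : 1;
    my $have = $len - ( $i - 1 );
    return ( $bytes, '' ) if $have >= $need;

    return ( substr( $bytes, 0, $i - 1 ), substr( $bytes, $i - 1 ) );
}

my $text   = "na\x{ef}ve \x{20ac}5";
my $bytes  = encode( 'UTF-8', $text );
my @chunks = $bytes =~ /(.{1,3})/gs;     # simulate small fixed-size reads

my ( $carry, $out ) = ( '', '' );
for my $chunk (@chunks) {
    # fill buffer: leftover bytes from the last round plus the new chunk
    my ( $complete, $tail ) = split_incomplete_tail( $carry . $chunk );
    $out  .= decode( 'UTF-8', $complete );   # process buffer
    $carry = $tail;                          # set aside the partial char
}

print $out eq $text ? "round trip ok\n" : "round trip failed\n";
```

Only a truncated final character is carried over; other kinds of malformed input are passed through to decode unchanged, so this is not a general validator.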

p5pRT · 2005-07-19T06:13:54Z

From @Tux

On Mon, 18 Jul 2005 20:33:28 -0700, Yitzchak Scott-Thoennes
<sthoenna@efn.org> wrote:

> On Sat, Jul 16, 2005 at 03:48:10PM +0200, H.Merijn Brand wrote:
> > On Sat, 16 Jul 2005 22:05:13 +0900, SADAHIRO Tomoyuki <bqw10602@nifty.com>
> > > utf8_to_uvchr((U8*)s, 0) used in do_chop() returns 0,
> > > not only if the octet sequence from *s is malformed,
> > > but also if *s == '\0'. The return value 0 should be
> > > for U+0000 (NUL) rather than malformedness. Oops :-<
> > >
> > > P.S. by the way, when the string in utf8 ends with malformed
> > > octet(s), what should chop() do?
> > > It has returned undef without modification of the string.
> >
> > Seems reasonable, though just cutting off one byte of the string would
> > maybe be more of an expected behaviour. Maybe
>
> Was there more to that sentence?

No, I stopped after maybe. Because the more I thought about it, the less
certain I was about *any* opinion I might have. I decided to leave that to
the utf8 experts.

> I'd vote for removing and returning a malformed char, from the last
> non continuation byte on (or just the unexpected continuation bytes,
> if the problem was too many of them).
>
> That way, the data error is propagated onto the return value (as IMO
> it should be), and a full-buffer problem will result in at most one
> bad char. In fact, I could see being able to rely on this being
> advantageous to buffering code (both XS and perl):
>
>     fill buffer with bytes
>     chop char and set aside
>     process buffer
>     move chopped char to start of buffer
>     repeat

--
H.Merijn Brand Amsterdam Perl Mongers (http://amsterdam.pm.org/)
using Perl 5.6.2, 5.8.0, 5.8.5, & 5.9.2 on HP-UX 10.20, 11.00 & 11.11,
AIX 4.3 & 5.2, SuSE 9.2 & 9.3, and Cygwin. http://www.cmve.net/~merijn
Smoking perl: http://www.test-smoke.org, perl QA: http://qa.perl.org
reports to: smokers-reports@perl.org, perl-qa@perl.org

p5pRT · 2008-05-24T14:03:56Z

p5p@spam.wizbit.be - Status changed from 'open' to 'resolved'

p5pRT closed this as completed May 24, 2008

p5pRT added the Severity Low label Oct 18, 2019
