chop fails on decoded string with trailing nul #8021

Closed
p5pRT opened this issue Jul 16, 2005 · 7 comments

@p5pRT

p5pRT commented Jul 16, 2005

Migrated from rt.perl.org#36569 (status was 'resolved')

Searchable as RT36569$

@p5pRT

p5pRT commented Jul 16, 2005

From jonathan-hankins@mindspring.com

This is a bug report for perl from jonathan-hankins@mindspring.com,
generated with the help of perlbug 1.35 running under perl v5.8.4.

I ran into this, and wondered if it is a bug.

I have tested on perl 5.8.4 with Encode.pm version 1.99_01 (from
Debian package) and 2.10 (from CPAN).

Basically, if I take a string with a trailing nul, encode it (to any
encoding, even "ascii"), decode it, then chop it, chop returns undef
and the string still has the trailing nul. If the string instead has
a trailing newline (for example), the chop works correctly.

Am I missing something?

Here is sample output from my test code below:

--
@asc (en/de-coded) before chop
$VAR1 = "hello, world!\n";
$VAR2 = "goodbye, cruel world!\0";

@asc2 (untouched) before chop
$VAR1 = "hello, world!\n";
$VAR2 = "goodbye, cruel world!\0";

@asc (en/de-coded) after chop
$VAR1 = "hello, world!";
$VAR2 = "goodbye, cruel world!\0";

@asc2 (untouched) after chop
$VAR1 = "hello, world!";
$VAR2 = "goodbye, cruel world!";
--

And here is my code:

--
#!/usr/bin/perl -w

use strict;
use Encode;
use Data::Dumper;

$Data::Dumper::Useqq = 1;

my @asc = ("hello, world!\n", "goodbye, cruel world!\0");
my @asc2 = @asc; # copy of untouched strings

my @utf = (encode('UTF-16LE', $asc[0]),
  encode('UTF-16LE', $asc[1]));

@asc = (decode('UTF-16LE', $utf[0]),
  decode('UTF-16LE', $utf[1]));

print "\n\n";
print "\@asc (en/de-coded) before chop\n", Dumper(@asc), "\n";
print "\@asc2 (untouched) before chop\n", Dumper(@asc2), "\n";
chop @asc;
chop @asc2;
print "\@asc (en/de-coded) after chop\n", Dumper(@asc), "\n";
print "\@asc2 (untouched) after chop\n", Dumper(@asc2), "\n";
print "\n\n";
--

--


Jonathan Hankins Homewood City Schools

jhankins@homewood.k12.al.us


I ran into this, and wondered if it is a bug.

Looks like a bug to me. At first glance, I'd describe it as a case where
chop is incapable of removing a null byte from the end of a utf8 string.

Here is another demonstration:

#!/usr/bin/perl

use Encode;

$_ = "foo\0";
while ( /\x00$/ ) {
  printf "chopping from %d bytes\n", length();
  chop;
  sleep 1;
}
printf "okay: %d bytes left\n", length();

$_ = decode( 'ascii', "foo\0" );
while ( /\x00$/ ) {
  printf "chopping from %d utf8 chars\n", length();
  chop;
  sleep 1;
}
printf "okay: %d chars left\n", length();

__END__

For me (macosx 10.3.9/darwin 7.9/perl 5.8.1, and freebsd 5.4/perl 5.8.6),
the second while loop never finishes -- chop never removes the final null.


David Graff Linguistic Data Consortium
graff@ldc.upenn.edu 3600 Market St., Suite 810
voice: (215) 898-0887 University of Pennsylvania
fax: (215) 573-2175 Philadelphia, PA 19104
  http://www.ldc.upenn.edu

--=-=-=


Flags:
  category=core
  severity=low


Site configuration information for perl v5.8.4:

Configured by Debian Project at Tue Mar 8 20:31:23 EST 2005.

Summary of my perl5 (revision 5 version 8 subversion 4) configuration:
  Platform:
  osname=linux, osvers=2.4.27-ti1211, archname=i386-linux-thread-multi
  uname='linux kosh 2.4.27-ti1211 #1 sun sep 19 18:17:45 est 2004 i686 gnulinux '
  config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i386-linux -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.4 -Dsitearch=/usr/local/lib/perl/5.8.4 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.4 -Dd_dosuid -des'
  hint=recommended, useposix=true, d_sigaction=define
  usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
  useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
  use64bitint=undef use64bitall=undef uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler:
  cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O2',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -I/usr/local/include'
  ccversion='', gccversion='3.3.5 (Debian 1:3.3.5-9)', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries:
  ld='cc', ldflags =' -L/usr/local/lib'
  libpth=/usr/local/lib /lib /usr/lib
  libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
  perllibs=-ldl -lm -lpthread -lc -lcrypt
  libc=/lib/libc-2.3.2.so, so=so, useshrplib=true, libperl=libperl.so.5.8.4
  gnulibc_version='2.3.2'
  Dynamic Linking:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
 


@INC for perl v5.8.4:
  /etc/perl
  /usr/local/lib/perl/5.8.4
  /usr/local/share/perl/5.8.4
  /usr/lib/perl5
  /usr/share/perl5
  /usr/lib/perl/5.8
  /usr/share/perl/5.8
  /usr/local/lib/site_perl
  .


Environment for perl v5.8.4:
  HOME=/home/jhankins
  LANG=en_US
  LANGUAGE (unset)
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/jhankins/bin:/usr/bin/mh
  PERL_BADLANG (unset)
  SHELL=/usr/bin/zsh

@p5pRT

p5pRT commented Jul 16, 2005

From BQW10602@nifty.com

This is a bug report for perl from jonathan-hankins@mindspring.com,
generated with the help of perlbug 1.35 running under perl v5.8.4.

I ran into this, and wondered if it is a bug.

I have tested on perl 5.8.4 with Encode.pm version 1.99_01 (from
Debian package) and 2.10 (from CPAN).

Thanks for the report.

utf8_to_uvchr((U8*)s, 0) used in do_chop() returns 0,
not only if the octet sequence from *s is malformed,
but also if *s == '\0'. The return value 0 should be
for U+0000 (NUL) rather than malformedness. Oops :-<

P.S. by the way, when the string in utf8 ends with malformed
octet(s), how should chop() do?
It has returned undef without modification of the string.

SADAHIRO Tomoyuki

Inline Patch
diff -ur perl~/doop.c perl/doop.c
--- perl~/doop.c	Mon Jul 11 04:49:52 2005
+++ perl/doop.c	Sat Jul 16 21:53:44 2005
@@ -977,7 +977,7 @@
 	    s = send - 1;
 	    while (s > start && UTF8_IS_CONTINUATION(*s))
 		s--;
-	    if (utf8_to_uvchr((U8*)s, 0)) {
+	    if (is_utf8_string((U8*)s, send - s)) {
 		sv_setpvn(astr, s, send - s);
 		*s = '\0';
 		SvCUR_set(sv, s - start);
diff -ur perl~/t/op/chop.t perl/t/op/chop.t
--- perl~/t/op/chop.t	Fri Jan 23 23:19:45 2004
+++ perl/t/op/chop.t	Sat Jul 16 20:59:16 2005
@@ -6,7 +6,7 @@
     require './test.pl';
 }
 
-plan tests => 133;
+plan tests => 137;
 
 $_ = 'abc';
 $c = do foo();
@@ -221,4 +221,14 @@
     $a = "A$/";
     $b = chomp $a;
     is ($b, 2);
+}
+
+{
+    # [perl #36569] chop fails on decoded string with trailing nul
+    my $asc = "perl\0";
+    my $utf = "perl".pack('U',0); # marked as utf8
+    is(chop($asc), "\0", "chopping ascii NUL");
+    is(chop($utf), "\0", "chopping utf8 NUL");
+    is($asc, "perl", "chopped ascii NUL");
+    is($utf, "perl", "chopped utf8 NUL");
 }
END OF PATCH
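
The ambiguity described in this comment can be reproduced outside of perl with a small, self-contained sketch. This is not the code in doop.c: `toy_decode`, `toy_is_valid_char` and `toy_chop` are hypothetical names, and the decoder only understands 1- and 2-byte sequences to keep the example short. The point is only that a decoder which returns 0 for "malformed" cannot also report U+0000, whereas a separate validity check can.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decoder: returns the code point starting at s, or 0 when the
 * sequence is malformed. A valid NUL also yields 0, which is the ambiguity. */
unsigned long toy_decode(const unsigned char *s, size_t len)
{
    if (len >= 1 && s[0] < 0x80)
        return s[0];                            /* ASCII, NUL included */
    if (len >= 2 && s[0] >= 0xC2 && s[0] <= 0xDF && (s[1] & 0xC0) == 0x80)
        return ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    return 0;                                   /* malformed */
}

/* Hypothetical validity check: 1 if s[0..len) is exactly one well-formed
 * character (1- or 2-byte forms only in this sketch). */
int toy_is_valid_char(const unsigned char *s, size_t len)
{
    if (len == 1)
        return s[0] < 0x80;
    if (len == 2)
        return s[0] >= 0xC2 && s[0] <= 0xDF && (s[1] & 0xC0) == 0x80;
    return 0;
}

/* Remove the last character of buf[0..*len). With use_validity == 0 the
 * decision is based on the decoder's return value (old behaviour in the
 * thread); with use_validity == 1 it is based on the validity check.
 * Returns 1 if a character was removed, 0 if the buffer was left alone. */
int toy_chop(unsigned char *buf, size_t *len, int use_validity)
{
    size_t s;
    int ok;

    assert(buf != NULL && len != NULL);
    if (*len == 0)
        return 0;

    s = *len - 1;
    while (s > 0 && (buf[s] & 0xC0) == 0x80)    /* back up over continuation bytes */
        s--;

    ok = use_validity ? toy_is_valid_char(buf + s, *len - s)
                      : toy_decode(buf + s, *len - s) != 0;
    if (!ok)
        return 0;

    *len = s;
    return 1;
}
```

Called on the five bytes of "perl\0", `toy_chop` with `use_validity` set to 0 leaves the buffer untouched, while the same call with `use_validity` set to 1 shortens it to four bytes.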

@p5pRT

p5pRT commented Jul 16, 2005

The RT System itself - Status changed from 'new' to 'open'

@p5pRT

p5pRT commented Jul 16, 2005

From @Tux

On Sat, 16 Jul 2005 22:05:13 +0900, SADAHIRO Tomoyuki <bqw10602@nifty.com>
wrote:

This is a bug report for perl from jonathan-hankins@mindspring.com,
generated with the help of perlbug 1.35 running under perl v5.8.4.

I ran into this, and wondered if it is a bug.

I have tested on perl 5.8.4 with Encode.pm version 1.99_01 (from
Debian package) and 2.10 (from CPAN).

Thanks for the report.

Thanks for the fast patch. Applied as change #25158

utf8_to_uvchr((U8*)s, 0) used in do_chop() returns 0,
not only if the octet sequence from *s is malformed,
but also if *s == '\0'. The return value 0 should be
for U+0000 (NUL) rather than malformedness. Oops :-<

P.S. by the way, when the string in utf8 ends with malformed
octet(s), how should chop() do?
It has returned undef without modification of the string.

Seems reasonable, though just cutting off one byte of the string would maybe
more of an expected behaviour. Maybe

SADAHIRO Tomoyuki


--
H.Merijn Brand Amsterdam Perl Mongers (http://amsterdam.pm.org/)
using Perl 5.6.2, 5.8.0, 5.8.5, & 5.9.2 on HP-UX 10.20, 11.00 & 11.11,
AIX 4.3 & 5.2, SuSE 9.2 & 9.3, and Cygwin. http://www.cmve.net/~merijn
Smoking perl: http://www.test-smoke.org, perl QA: http://qa.perl.org
reports to: smokers-reports@perl.org, perl-qa@perl.org

@p5pRT

p5pRT commented Jul 19, 2005

From @ysth

On Sat, Jul 16, 2005 at 03:48:10PM +0200, H.Merijn Brand wrote:

On Sat, 16 Jul 2005 22:05:13 +0900, SADAHIRO Tomoyuki <bqw10602@nifty.com>
wrote:

This is a bug report for perl from jonathan-hankins@mindspring.com,
generated with the help of perlbug 1.35 running under perl v5.8.4.

I ran into this, and wondered if it is a bug.

I have tested on perl 5.8.4 with Encode.pm version 1.99_01 (from
Debian package) and 2.10 (from CPAN).

Thanks for the report.

Thanks for the fast patch. Applied as change #25158

utf8_to_uvchr((U8*)s, 0) used in do_chop() returns 0,
not only if the octet sequence from *s is malformed,
but also if *s == '\0'. The return value 0 should be
for U+0000 (NUL) rather than malformedness. Oops :-<

P.S. by the way, when the string in utf8 ends with malformed
octet(s), how should chop() do?
It has returned undef without modification of the string.

Seems reasonable, though just cutting off one byte of the string would maybe
more of an expected behaviour. Maybe

Was there more to that sentence?

I'd vote for removing and returning a malformed char, from the last
non continuation byte on (or just the unexpected continuation bytes,
if the problem was too many of them).

That way, the data error is propagated onto the return value (as IMO
it should be), and a full-buffer problem will result in at most one
bad char. In fact, I could see being able to rely on this being
advantageous to buffering code (both XS and perl):

  fill buffer with bytes
  chop char and set aside
  process buffer
  move chopped char to start of buffer
  repeat
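
The buffering loop outlined above can be sketched in plain C, independent of perl's chop. The helper names (`toy_seq_len`, `toy_incomplete_tail`, `toy_shift_tail`) are hypothetical, and the sketch assumes the only defect at the end of the buffer is a character cut short by the buffer boundary; other kinds of malformed input are simply left in place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Total length of a UTF-8 sequence implied by its lead byte,
 * or 0 if the byte cannot start a sequence. */
size_t toy_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;                  /* continuation byte or invalid lead */
}

/* Number of bytes at the end of buf[0..len) that belong to a character
 * truncated by the buffer boundary; 0 if the buffer ends on a boundary. */
size_t toy_incomplete_tail(const unsigned char *buf, size_t len)
{
    size_t s = len;
    size_t have, need;

    assert(buf != NULL);

    /* Walk back over at most three continuation bytes to the lead byte. */
    while (s > 0 && len - s < 3 && (buf[s - 1] & 0xC0) == 0x80)
        s--;
    if (s == 0)
        return 0;
    s--;                                   /* s now indexes the candidate lead byte */

    have = len - s;                        /* bytes present for the last character */
    need = toy_seq_len(buf[s]);            /* bytes the lead byte promises */
    return need > have ? have : 0;
}

/* Move the set-aside tail to the start of the buffer and return its length,
 * which is the offset where the next fill should continue writing. */
size_t toy_shift_tail(unsigned char *buf, size_t len, size_t tail)
{
    assert(buf != NULL && tail <= len);
    memmove(buf, buf + (len - tail), tail);
    return tail;
}
```

A caller would fill the buffer, ask `toy_incomplete_tail` how many trailing bytes to set aside, process the first `len - tail` bytes, then call `toy_shift_tail` and refill starting at the returned offset.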

@p5pRT

p5pRT commented Jul 19, 2005

From @Tux

On Mon, 18 Jul 2005 20:33:28 -0700, Yitzchak Scott-Thoennes
<sthoenna@efn.org> wrote:

On Sat, Jul 16, 2005 at 03:48:10PM +0200, H.Merijn Brand wrote:

On Sat, 16 Jul 2005 22:05:13 +0900, SADAHIRO Tomoyuki <bqw10602@nifty.com>

utf8_to_uvchr((U8*)s, 0) used in do_chop() returns 0,
not only if the octet sequence from *s is malformed,
but also if *s == '\0'. The return value 0 should be
for U+0000 (NUL) rather than malformedness. Oops :-<

P.S. by the way, when the string in utf8 ends with malformed
octet(s), how should chop() do?
It has returned undef without modification of the string.

Seems reasonable, though just cutting off one byte of the string would
maybe more of an expected behaviour. Maybe

Was there more to that sentence?

No, I stopped after maybe. Because the more I thought about it, the less
certain I was about *any* opinion I might have. I decided to leave that to
the utf8 experts

I'd vote for removing and returning a malformed char, from the last
non continuation byte on (or just the unexpected continuation bytes,
if the problem was too many of them).

That way, the data error is propagated onto the return value (as IMO
it should be), and a full-buffer problem will result in at most one
bad char. In fact, I could see being able to rely on this being
advantageous to buffering code (both XS and perl):

fill buffer with bytes
chop char and set aside
process buffer
move chopped char to start of buffer
repeat

--
H.Merijn Brand Amsterdam Perl Mongers (http://amsterdam.pm.org/)
using Perl 5.6.2, 5.8.0, 5.8.5, & 5.9.2 on HP-UX 10.20, 11.00 & 11.11,
AIX 4.3 & 5.2, SuSE 9.2 & 9.3, and Cygwin. http://www.cmve.net/~merijn
Smoking perl: http://www.test-smoke.org, perl QA: http://qa.perl.org
reports to: smokers-reports@perl.org, perl-qa@perl.org

@p5pRT

p5pRT commented May 24, 2008

p5p@spam.wizbit.be - Status changed from 'open' to 'resolved'
