chomp ignores utf8 #7033

p5pRT · 2004-01-12T21:10:19Z

Migrated from rt.perl.org#24888 (status was 'resolved')

Searchable as RT24888$

p5pRT · 2004-01-12T21:10:20Z

From @nwc10

Created by @nwc10

While working my way down doop.c, I discovered that chomp completely ignores
utf8 flags in both the chomped string and $/
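
To make the expected behaviour concrete, here is a minimal sketch of what "paying attention to the utf8 flags" means at the Perl level. It is not taken from t/op/chop.t; the variable names are illustrative only. It builds the same character (U+00A3) in both internal representations and chomps every combination of target and $/. On a perl that honours the flags, each combination should remove exactly one character and leave "N".

```perl
use strict;
use warnings;

# The same character, POUND SIGN (U+00A3), in both internal representations.
my $as_bytes = "\xa3";             # stored as the single byte 0xA3
my $as_utf8  = "\xa3";
utf8::upgrade($as_utf8);           # stored as 0xC2 0xA3, UTF8 flag on

my @results;
for my $rs ($as_bytes, $as_utf8) {
    for my $end ($as_bytes, $as_utf8) {
        local $/ = $rs;
        my $string  = "N" . $end;
        my $removed = chomp $string;
        push @results, {
            rs_flag  => utf8::is_utf8($rs)  ? 1 : 0,
            end_flag => utf8::is_utf8($end) ? 1 : 0,
            removed  => $removed,
            left     => $string,
        };
    }
}

# One line per combination of representations.
printf "rs utf8=%d, end utf8=%d: removed %d, left '%s'\n",
    @{$_}{qw(rs_flag end_flag removed left)} for @results;
```

The two scalars compare equal with `eq`, so any difference between the four lines of output would come purely from the internal encoding, which is the bug being reported.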

With the following patch to t/op/chop.t there are many test failures.
I'm not sure of the most efficient way to patch Perl_do_chomp to cure them.
I guess use the existing byte comparison code if utf8 flags are the same
on both the target and $/, and do conversion otherwise, but I'm not going to
look further until after 5.8.3 is released.

ok 52 - start=78 end=78
ok 53 - start=78 end=163
not ok 54 - start=78 end=163 (end as bytes)
# Failed at t/op/chop.t line 203
# got 'NÂ'
# expected 'NÂ£'
ok 55 - start=78 end=163 ($/ as bytes)
ok 56 - start=78 end=164
not ok 57 - start=78 end=164 (end as bytes)
# Failed at t/op/chop.t line 203
# got 'N'
# expected 'NÂ¤'
not ok 58 - start=78 end=164 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got 'N'
# expected 'N¤'
ok 59 - start=78 end=1296
not ok 60 - start=78 end=1296 (end as bytes)
# Failed at t/op/chop.t line 203
# got 'N'
# expected 'NÔ
not ok 61 - start=78 end=1296 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got 'N'
Wide character in print at ./test.pl line 38.
# expected 'NÔ
ok 62 - start=163 end=78
ok 63 - start=163 end=163
not ok 64 - start=163 end=163 (end as bytes)
# Failed at t/op/chop.t line 203
# got '£Â'
# expected '£Â£'
ok 65 - start=163 end=163 ($/ as bytes)
ok 66 - start=163 end=164
not ok 67 - start=163 end=164 (end as bytes)
# Failed at t/op/chop.t line 203
# got '£'
# expected '£Â¤'
not ok 68 - start=163 end=164 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got '£'
# expected '£¤'
ok 69 - start=163 end=1296
not ok 70 - start=163 end=1296 (end as bytes)
# Failed at t/op/chop.t line 203
# got '£'
# expected '£Ô
not ok 71 - start=163 end=1296 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got '£'
Wide character in print at ./test.pl line 38.
# expected 'Â£Ô
ok 72 - start=164 end=78
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 94.
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 95.
not ok 73 - start=164 end=163
# Failed at t/op/chop.t line 193
Wide character in print at ./test.pl line 38.
# got 'Â¤Â'
# expected '¤'
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 94.
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 95.
not ok 74 - start=164 end=163 (end as bytes)
# Failed at t/op/chop.t line 203
Wide character in print at ./test.pl line 38.
# got 'Â¤Ã�Â'
# expected '¤Â£'
not ok 75 - start=164 end=163 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got '¤'
# expected '¤£'
ok 76 - start=164 end=164
not ok 77 - start=164 end=164 (end as bytes)
# Failed at t/op/chop.t line 203
# got '¤Â'
# expected '¤Â¤'
not ok 78 - start=164 end=164 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got '¤'
# expected '¤¤'
ok 79 - start=164 end=1296
ok 80 - start=164 end=1296 (end as bytes)
not ok 81 - start=164 end=1296 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got '¤'
Wide character in print at ./test.pl line 38.
# expected 'Â¤Ô
ok 82 - start=1296 end=78
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 94.
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 95.
not ok 83 - start=1296 end=163
# Failed at t/op/chop.t line 193
Wide character in print at ./test.pl line 38.
# got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 94.
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 95.
not ok 84 - start=1296 end=163 (end as bytes)
# Failed at t/op/chop.t line 203
Wide character in print at ./test.pl line 38.
# got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
not ok 85 - start=1296 end=163 ($/ as bytes)
# Failed at t/op/chop.t line 209
Wide character in print at ./test.pl line 38.
# got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
ok 86 - start=1296 end=164
not ok 87 - start=1296 end=164 (end as bytes)
# Failed at t/op/chop.t line 203
Wide character in print at ./test.pl line 38.
# got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
not ok 88 - start=1296 end=164 ($/ as bytes)
# Failed at t/op/chop.t line 209
Wide character in print at ./test.pl line 38.
# got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
ok 89 - start=1296 end=1296
ok 90 - start=1296 end=1296 (end as bytes)
not ok 91 - start=1296 end=1296 ($/ as bytes)
# Failed at t/op/chop.t line 209
Wide character in print at ./test.pl line 38.
# got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô

This is not a new utf8 bug.

Inline Patch

--- t/op/chop.t.orig	Mon Nov  4 06:34:41 2002
+++ t/op/chop.t	Mon Jan 12 20:56:02 2004
@@ -6,7 +6,7 @@ BEGIN {
     require './test.pl';
 }
 
-plan tests => 51;
+plan tests => 91;
 
 $_ = 'abc';
 $c = do foo();
@@ -183,3 +183,29 @@ ok($@ =~ /Can\'t modify.*chop.*in.*assig
 eval 'chomp($x, $y) = (1, 2);';
 ok($@ =~ /Can\'t modify.*chom?p.*in.*assignment/);
 
+my @chars = ("N", "\xa3", substr ("\xa4\x{100}", 0, 1), chr 1296);
+foreach my $start (@chars) {
+  foreach my $end (@chars) {
+    local $/ = $end;
+    my $message = "start=" . ord ($start) . " end=" . ord $end;
+    my $string = $start . $end;
+    chomp $string;
+    is ($string, $start, $message);
+
+    my $end_utf8 = $end;
+    utf8::encode ($end_utf8);
+    next if $end_utf8 eq $end;
+
+    # $end ne $end_utf8, so these should not chomp.
+    $string = $start . $end_utf8;
+    my $chomped = $string;
+    chomp $chomped;
+    is ($chomped, $string, "$message (end as bytes)");
+
+    $/ = $end_utf8;
+    $string = $start . $end;
+    $chomped = $string;
+    chomp $chomped;
+    is ($chomped, $string, "$message (\$/ as bytes)");
+  }
+}

Perl Info


Flags:
    category=core
    severity=medium

Site configuration information for perl v5.8.3:

Configured by nick at Fri Jan  9 10:31:25 GMT 2004.

Summary of my perl5 (revision 5.0 version 8 subversion 3) configuration:
  Platform:
    osname=linux, osvers=2.4.19-rmk4, archname=armv4l-linux
    uname='linux bagpuss.unfortu.net 2.4.19-rmk4 #3 fri oct 25 21:57:55 bst 2002 armv4l unknown '
    config_args='-Dusedevel=y -Dcc=ccache gcc-3.0 -Dld=gcc -Ubincompat5005 -Uinstallusrbinperl -Dcf_email=nick@ccl4.org -Dperladmin=nick@ccl4.org -Dinc_version_list=  -Dinc_version_list_init=0 -Doptimize=-O1 -Dusethreads=n -Dprefix=/usr/local/perl5.8.3/ -Dinstallman1dir=none -Dinstallman3dir=none -de'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='ccache gcc-3.0', ccflags ='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O1',
    cppflags='-fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='3.0.4', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=8
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldbm -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.2.5.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.2.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    MAINT22085


@INC for perl v5.8.3:
    lib
    /usr/local/perl5.8.3/lib/5.8.3/armv4l-linux
    /usr/local/perl5.8.3/lib/5.8.3
    /usr/local/perl5.8.3/lib/site_perl/5.8.3/armv4l-linux
    /usr/local/perl5.8.3/lib/site_perl/5.8.3
    /usr/local/perl5.8.3/lib/site_perl
    .


Environment for perl v5.8.3:
    HOME=/home/nick
    LANG (unset)
    LANGUAGE (unset)
    LC_CTYPE=en_GB.ISO-8859-1
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/nick/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games:/sbin:/usr/sbin:/usr/local/sbin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

p5pRT · 2004-01-13T06:17:35Z

From @ysth

On Mon, Jan 12, 2004 at 09:10:21PM -0000, Nicholas Clark <perlbug-followup@perl.org> wrote:

> While working my way down doop.c, I discovered that chomp completely ignores
> utf8 flags in both the chomped string and $/

If you are auditing the code for UTF8 issues, you might take a look
for cases where SvUTF8/DO_UTF8 precedes SvPV (or whatever else calls
sv_2pv_flags). That order fails for an overloaded stringify that
returns UTF8, since the UTF8 flag isn't set until the stringification
happens. I noticed one in do_vop, but haven't fixed it yet.

p5pRT · 2004-01-13T06:17:37Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2004-01-15T00:26:08Z

From @nwc10

Moral - don't use a character for a test case which happens to be a
substring of its UTF8 representation, unless you specifically need this
effect. (ie my testcase was slightly wrong)
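
A short sketch of the trap, for illustration (the variable names are made up here, not taken from the test file). For code points U+0080 to U+00BF the UTF-8 encoding is 0xC2 followed by a byte equal to the code point itself, so the encoded form of "\xa3" ends with the byte "\xa3". Chomping "N" plus the encoded bytes with $/ = "\xa3" therefore legitimately removes one byte, which is what the "end as bytes" tests for 163 and 164 wrongly counted as a failure.

```perl
use strict;
use warnings;

my $end      = "\xa3";            # POUND SIGN, one character, one byte
my $end_utf8 = $end;
utf8::encode($end_utf8);          # now the two bytes 0xC2 0xA3

# The encoded form ends with the very byte that $end consists of.
my $is_suffix = substr($end_utf8, -1) eq $end ? 1 : 0;

# So with $/ = "\xa3" a byte-for-byte chomp correctly strips that last byte.
local $/ = $end;
my $string  = "N" . $end_utf8;
my $removed = chomp $string;

printf "suffix=%d removed=%d remaining bytes=%vX\n",
    $is_suffix, $removed, $string;
```

A character such as chr 1296 does not have this property, because none of the bytes of its encoding can equal a code point above 0xFF.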

On Mon, Jan 12, 2004 at 09:10:21PM -0000, Nicholas Clark wrote:

> While working my way down doop.c, I discovered that chomp completely ignores
> utf8 flags in both the chomped string and $/

Change 22155 fixes this.

I presume that chomp really also ought to pay attention to the encoding
pragma in the mixed bytes/utf8 case? [ie more work still needed]

Nicholas Clark

==== //depot/perl/doop.c#140 (text) ====

@@ -1008,6 +1008,7 @@
     STRLEN len;
     STRLEN n_a;
     char *s;
+    char *temp_buffer = NULL;
 
     if (RsSNARF(PL_rs))
         return 0;
@@ -1059,6 +1060,27 @@
         else {
             STRLEN rslen;
             char *rsptr = SvPV(PL_rs, rslen);
+            if (SvUTF8(PL_rs) != SvUTF8(sv)) {
+                /* Assumption is that rs is shorter than the scalar. */
+                if (SvUTF8(PL_rs)) {
+                    /* RS is utf8, scalar is 8 bit. */
+                    bool is_utf8 = TRUE;
+                    temp_buffer = (char*)bytes_from_utf8((U8*)rsptr,
+                                                         &rslen, &is_utf8);
+                    if (is_utf8) {
+                        /* Cannot downgrade, therefore cannot possibly match
+                         */
+                        assert (temp_buffer == rsptr);
+                        temp_buffer = NULL;
+                        goto nope;
+                    }
+                    rsptr = temp_buffer;
+                } else {
+                    /* RS is 8 bit, scalar is utf8. */
+                    temp_buffer = (char*)bytes_to_utf8((U8*)rsptr, &rslen);
+                    rsptr = temp_buffer;
+                }
+            }
             if (rslen == 1) {
                 if (*s != *rsptr)
                     goto nope;
@@ -1081,6 +1103,7 @@
         SvSETMAGIC(sv);
     }
   nope:
+    Safefree(temp_buffer);
     return count;
 }

p5pRT · 2004-01-15T00:59:53Z

From mjtg@cam.ac.uk

Nicholas Clark <nick@ccl4.org> wrote

> + /* Assumption is that rs is shorter than the scalar. */

That comment looks scary. Where is the assumption made? What happens
if it is false? Do you actually mean

/* if rs is longer than the scalar, these conversions are a waste
of time, but the case is rare enough that we don't care */

?

Secondly:

While investigating where the assumption might be made, I peered inside
bytes_from_utf8() / bytes_to_utf8(). I note that the converted string
is placed in a buffer allocated by Newz(), but can't see anywhere that
the buffer is freed. Why isn't this a horrendous memory leak
(for existing uses, as well as the new ones)?

Please tell me I'm missing something obvious.

Mike Guy

p5pRT · 2004-01-15T09:31:27Z

From nick.ing-simmons@elixent.com

Mike Guy <mjtg@cam.ac.uk> writes:

> Secondly:
>
> While investigating where the assumption might be made, I peered inside
> bytes_from_utf8() / bytes_to_utf8(). I note that the converted string
> is placed in a buffer allocated by Newz(), but can't see anywhere that
> the buffer is freed. Why isn't this a horrendous memory leak
> (for existing uses, as well as the new ones)?
>
> Please tell me I'm missing something obvious.

You mean this bit:

nope:
+ Safefree(temp_buffer);
return count;
}

p5pRT · 2004-01-15T19:14:25Z

From BQW10602@nifty.com

On Thu, 15 Jan 2004 00:25:00 +0000
Nicholas Clark <nick@ccl4.org> wrote:

> Moral - don't use a character for a test case which happens to be a
> substring of its UTF8 representation, unless you specifically need this
> effect. (ie my testcase was slightly wrong)
>
> On Mon, Jan 12, 2004 at 09:10:21PM -0000, Nicholas Clark wrote:
>
> > While working my way down doop.c, I discovered that chomp completely ignores
> > utf8 flags in both the chomped string and $/
>
> Change 22155 fixes this.
>
> I presume that chomp really also ought to pay attention to the encoding
> pragma in the mixed bytes/utf8 case? [ie more work still needed]

Hello.

(1) chomp() returns the number of *characters* removed.
So isn't <count += rslen;> (a number of bytes) wrong?

(2) For some multibyte or stateful encodings, when either the string
or $/ is in bytes, recoding to utf8 is required.
(That must be inefficient for single-byte encodings...)

(but AFAIK, encoding.pm unicodifies strings in many cases,
like literals and inputs. So strings with UTF8 off might be
rare under the encoding pragma.)

(3) For many CJK encodings, comparison in bytes has a problem
that does not arise for single-byte encodings or for Unicode
encodings (UTF-X).

Say the encoding is shift-jis. "\x81\x40" is IDSP (U+3000),
while "\x40" is '@' (as in ASCII).
Then, after <$/ = "\x40"; $a = "\x81\x40"; chomp($a);>,
$a should not be chomped (ending up with $a eq "\x81" is very bad.)

Encodings which have this problem include
big5, euc-jp, GBK, iso-2022-jp, johab, shift-jis, UHC.
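
The trailing-byte problem in (3) can be seen without the encoding pragma at all. The following is only a sketch: it assumes the Encode module with its Japanese encodings is installed, uses Encode::decode directly rather than ${^ENCODING}, and the variable names are illustrative. It compares the tail of the string with the separator twice, once on the raw bytes and once after decoding both to characters.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $idsp_bytes = "\x81\x40";   # IDEOGRAPHIC SPACE (U+3000) in Shift-JIS
my $at_bytes   = "\x40";       # COMMERCIAL AT, same byte as in ASCII

# Byte-wise view: the second byte of the two-byte character equals '@',
# so a naive memNE-style comparison of the tail reports a match.
my $bytes_match =
    substr($idsp_bytes, -length($at_bytes)) eq $at_bytes ? 1 : 0;

# Character-wise view: decode both sides first, then compare the tails.
my $idsp = decode('shiftjis', $idsp_bytes);
my $at   = decode('shiftjis', $at_bytes);
my $chars_match = substr($idsp, -length($at)) eq $at ? 1 : 0;

printf "bytes match: %d, characters match: %d\n",
    $bytes_match, $chars_match;
```

The byte comparison matches while the character comparison does not, which is why the record separator has to be compared after recoding whenever such an encoding is in effect.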

### $ patch

Inline Patch

diff -urN perl~/doop.c perl/doop.c
--- perl~/doop.c	Thu Jan 15 23:58:30 2004
+++ perl/doop.c	Fri Jan 16 03:19:14 2004
@@ -1007,6 +1007,7 @@
     STRLEN n_a;
     char *s;
     char *temp_buffer = NULL;
+    SV* svrecode = Nullsv;
 
     if (RsSNARF(PL_rs))
 	return 0;
@@ -1042,6 +1043,18 @@
         if (SvREADONLY(sv))
             Perl_croak(aTHX_ PL_no_modify);
     }
+
+    if (PL_encoding) {
+	if (!SvUTF8(sv)) {
+	/* XXX, here sv is utf8-ized as a side-effect!
+	   If encoding.pm is used properly, almost string-generating
+	   operations, including literal strings, chr(), input data, etc.
+	   should have been utf8-ized already, right?
+	*/
+	    sv_recode_to_utf8(sv, PL_encoding);
+	}
+    }
+
     s = SvPV(sv, len);
     if (s && len) {
 	s += --len;
@@ -1056,8 +1069,13 @@
 	    }
 	}
 	else {
-	    STRLEN rslen;
+	    STRLEN rslen, rs_charlen;
 	    char *rsptr = SvPV(PL_rs, rslen);
+
+	    rs_charlen = SvUTF8(PL_rs)
+		? sv_len_utf8(PL_rs)
+		: rslen;
+
 	    if (SvUTF8(PL_rs) != SvUTF8(sv)) {
 		/* Assumption is that rs is shorter than the scalar.  */
 		if (SvUTF8(PL_rs)) {
@@ -1073,7 +1091,16 @@
 			goto nope;
 		    }
 		    rsptr = temp_buffer;
-		} else {
+		}
+		else if (PL_encoding) {
+		    /* RS is 8 bit, encoding.pm is used.
+		     * Do not recode PL_rs as a side-effect. */
+		   svrecode = newSVpvn(rsptr, rslen);
+		   sv_recode_to_utf8(svrecode, PL_encoding);
+		   rsptr = SvPV(svrecode, rslen);
+		   rs_charlen = sv_len_utf8(svrecode);
+		}
+		else {
 		    /* RS is 8 bit, scalar is utf8.  */
 		    temp_buffer = (char*)bytes_to_utf8((U8*)rsptr, &rslen);
 		    rsptr = temp_buffer;
@@ -1091,7 +1118,7 @@
 		s -= rslen - 1;
 		if (memNE(s, rsptr, rslen))
 		    goto nope;
-		count += rslen;
+		count += rs_charlen;
 	    }
 	}
 	s = SvPV_force(sv, n_a);
@@ -1101,6 +1128,10 @@
 	SvSETMAGIC(sv);
     }
   nope:
+
+    if (svrecode)
+	 SvREFCNT_dec(svrecode);
+
     Safefree(temp_buffer);
     return count;
 }
diff -urN perl~/t/op/chop.t perl/t/op/chop.t
--- perl~/t/op/chop.t	Fri Jan 16 03:08:14 2004
+++ perl/t/op/chop.t	Fri Jan 16 02:50:22 2004
@@ -6,7 +6,7 @@
     require './test.pl';
 }
 
-plan tests => 91;
+plan tests => 93;
 
 $_ = 'abc';
 $c = do foo();
@@ -208,4 +208,17 @@
     chomp $chomped;
     is ($chomped, $string, "$message (\$/ as bytes)");
   }
+}
+
+{
+    # returns length in characters, but not in bytes.
+    $/ = "\x{100}";
+    $a = "A$/";
+    $b = chomp $a;
+    is ($b, 1);
+
+    $/ = "\x{100}\x{101}";
+    $a = "A$/";
+    $b = chomp $a;
+    is ($b, 2);
 }

### ^ patch

I wrote tests for chomping bytes in several CJK encodings as a new file.
I'm not sure where this test should be placed (say, perl/t/uni/ ?).

### ^ new test
BEGIN {
    if ($ENV{'PERL_CORE'}){
        chdir 't';
        unshift @INC, '../lib';
    }
    require Config; import Config;
    if ($Config{'extensions'} !~ /\bEncode\b/) {
        print "1..0 # Skip: Encode was not built\n";
        exit 0;
    }
    if (ord("A") == 193) {
        print "1..0 # Skip: EBCDIC\n";
        exit 0;
    }
    unless (PerlIO::Layer->find('perlio')){
        print "1..0 # Skip: PerlIO required\n";
        exit 0;
    }
    eval 'use Encode';
    if ($@ =~ /dynamic loading not available/) {
        print "1..0 # Skip: no dynamic loading, no Encode\n";
        exit 0;
    }
}

use strict;
use Test::More tests => (4 * 4 * 4) * (3); # (@char ** 3) * (keys %mbchars)

# %mbchars = (encoding => { bytes => utf8, ... }, ...);
# * pack('C*') is expected to return bytes even if ${^ENCODING} is true.
our %mbchars = (
    'big-5' => {
        pack('C*', 0x40) => pack('U*', 0x40), # COMMERCIAL AT
        pack('C*', 0xA4, 0x40) => "\x{4E00}", # CJK-4E00
    },
    'euc-jp' => {
        pack('C*', 0xB0, 0xA1) => "\x{4E9C}", # CJK-4E9C
        pack('C*', 0x8F, 0xB0, 0xA1) => "\x{4E02}", # CJK-4E02
    },
    'shift-jis' => {
        pack('C*', 0xA9) => "\x{FF69}", # halfwidth katakana small U
        pack('C*', 0x82, 0xA9) => "\x{304B}", # hiragana KA
    },
);

for my $enc (sort keys %mbchars) {
    local ${^ENCODING} = find_encoding($enc);
    my @char = (sort(keys %{ $mbchars{$enc} }),
                sort(values %{ $mbchars{$enc} }));

    for my $rs (@char) {
        local $/ = $rs;
        for my $start (@char) {
            for my $end (@char) {
                my $string = $start.$end;
                my $expect = $end eq $rs ? $start : $string;
                chomp $string;
                is($string, $expect);
            }
        }
    }
}
### ^ new test

regards
SADAHIRO Tomoyuki

p5pRT · 2004-01-23T14:07:22Z

From @nwc10

On Fri, Jan 16, 2004 at 04:13:00AM +0900, SADAHIRO Tomoyuki wrote:

> (1) chomp() returns number of *characters* removed.
> So, should <count += rslen;> (number of bytes) not be good?

Yes, bug. Well spotted
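
The difference between the two counts is easy to see from Perl. This is a minimal sketch along the lines of the tests added to t/op/chop.t in the patch, with illustrative variable names: the separator is two characters long but occupies four bytes once encoded as UTF-8, and chomp is documented to return the number of characters removed.

```perl
use strict;
use warnings;

local $/ = "\x{100}\x{101}";      # two characters
my $string  = "A" . $/;
my $removed = chomp $string;      # should count characters, not bytes

# Length of the same separator measured in UTF-8 bytes, for comparison.
my $rs_bytes = do {
    my $copy = $/;
    utf8::encode($copy);
    length $copy;
};

printf "chomp returned %d; separator is %d characters, %d bytes\n",
    $removed, length($/), $rs_bytes;
```

With `count += rslen` the return value would have tracked the byte length instead, which only differs from the character length when the comparison is done on UTF-8 data.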

> (2) For some multibyte or stateful encoding, and in the case
> that either string or $/ is in bytes, recoding to utf8 is required.
> (It must be inefficient for single byte encodings...)
>
> (but AFAIK, encoding.pm unicodifies strings in many cases,
> like literals and inputs. So strings with UTF8 off might be
> rare under encoding pragma.)

I've no idea, but for the moment I'm happy to assume that they are rare,
and concentrate on correctness.

> (3) For many CJK encodings, comparison in bytes has a problem
> which is not problematic for single byte encodings nor Unicode
> encodings (UTF-X).
>
> Say, suppose encoding is shift-jis. "\x81\x40" is IDSP (U+3000),
> while "\x40" is '@' (as like ASCII).
> Then, when saying <$/ = "\x40"; $a = "\x81\x40"; chomp($a);>,
> $a should not be chomped ($a eq "\x81" is very bad.)
>
> Encodings which has such a problem include
> big5, euc-jp, GBK, iso-2022-jp, johab, shift-jis, UHC.

This is what your new test tests?

> I wrote tests for chomp bytes in many CJK encoding as a new file.
> I'm not sure where this test should be placed (say, perl/t/uni/ ?).

I'm not sure either, but for now it's t/uni/chomp.t

"Thanks, applied"

Thanks for sorting out all the loose ends I left.

Nicholas Clark

p5pRT · 2004-02-02T03:35:30Z

From @nwc10

On Fri, Jan 16, 2004 at 04:13:00AM +0900, SADAHIRO Tomoyuki wrote:

> (1) chomp() returns number of *characters* removed.
> So, should <count += rslen;> (number of bytes) not be good?

Yes, bug. Well spotted

> (2) For some multibyte or stateful encoding, and in the case
> that either string or $/ is in bytes, recoding to utf8 is required.
> (It must be inefficient for single byte encodings...)
>
> (but AFAIK, encoding.pm unicodifies strings in many cases,
> like literals and inputs. So strings with UTF8 off might be
> rare under encoding pragma.)

I've no idea, but for the moment I'm happy to assume that they are rare,
and concentrate on correctness.

> (3) For many CJK encodings, comparison in bytes has a problem
> which is not problematic for single byte encodings nor Unicode
> encodings (UTF-X).
>
> Say, suppose encoding is shift-jis. "\x81\x40" is IDSP (U+3000),
> while "\x40" is '@' (as like ASCII).
> Then, when saying <$/ = "\x40"; $a = "\x81\x40"; chomp($a);>,
> $a should not be chomped ($a eq "\x81" is very bad.)
>
> Encodings which has such a problem include
> big5, euc-jp, GBK, iso-2022-jp, johab, shift-jis, UHC.

This is what your new test tests?

> I wrote tests for chomp bytes in many CJK encoding as a new file.
> I'm not sure where this test should be placed (say, perl/t/uni/ ?).

I'm not sure either, but for now it's t/uni/chomp.t

"Thanks, applied"

Thanks for sorting out all the loose ends I left.

Nicholas Clark

p5pRT · 2006-03-30T17:12:07Z

From @nwc10

Was fixed by change 22155.

p5pRT · 2006-03-30T17:12:08Z

@nwc10 - Status changed from 'open' to 'resolved'

p5pRT closed this as completed Mar 30, 2006

p5pRT added Severity Medium distro-Linux type-core labels Oct 18, 2019
