chomp ignores utf8 #7033

Closed
p5pRT opened this issue Jan 12, 2004 · 11 comments

Comments

@p5pRT

p5pRT commented Jan 12, 2004

Migrated from rt.perl.org#24888 (status was 'resolved')

Searchable as RT24888$

@p5pRT
Author

p5pRT commented Jan 12, 2004

From @nwc10

Created by @nwc10

While working my way down doop.c, I discovered that chomp completely ignores
utf8 flags in both the chomped string and $/

With the following patch to t/op/chop.t there are many test failures.
I'm not sure of the most efficient way to patch Perl_do_chomp to cure them.
I guess use the existing byte comparison code if utf8 flags are the same
on both the target and $/, and do conversion otherwise, but I'm not going to
look further until after 5.8.3 is released.

ok 52 - start=78 end=78
ok 53 - start=78 end=163
not ok 54 - start=78 end=163 (end as bytes)
# Failed at t/op/chop.t line 203
# got 'NÂ'
# expected 'N£'
ok 55 - start=78 end=163 ($/ as bytes)
ok 56 - start=78 end=164
not ok 57 - start=78 end=164 (end as bytes)
# Failed at t/op/chop.t line 203
# got 'N'
# expected 'N¤'
not ok 58 - start=78 end=164 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got 'N'
# expected 'N¤'
ok 59 - start=78 end=1296
not ok 60 - start=78 end=1296 (end as bytes)
# Failed at t/op/chop.t line 203
# got 'N'
# expected 'NÔ
not ok 61 - start=78 end=1296 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got 'N'
Wide character in print at ./test.pl line 38.
# expected 'NÔ
ok 62 - start=163 end=78
ok 63 - start=163 end=163
not ok 64 - start=163 end=163 (end as bytes)
# Failed at t/op/chop.t line 203
# got '£Â'
# expected '£Â£'
ok 65 - start=163 end=163 ($/ as bytes)
ok 66 - start=163 end=164
not ok 67 - start=163 end=164 (end as bytes)
# Failed at t/op/chop.t line 203
# got '£'
# expected '£Â¤'
not ok 68 - start=163 end=164 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got '£'
# expected '£¤'
ok 69 - start=163 end=1296
not ok 70 - start=163 end=1296 (end as bytes)
# Failed at t/op/chop.t line 203
# got '£'
# expected '£Ô
not ok 71 - start=163 end=1296 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got '£'
Wide character in print at ./test.pl line 38.
# expected '£Ô
ok 72 - start=164 end=78
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 94.
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 95.
not ok 73 - start=164 end=163
# Failed at t/op/chop.t line 193
Wide character in print at ./test.pl line 38.
# got '¤Â'
# expected '¤'
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 94.
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 95.
not ok 74 - start=164 end=163 (end as bytes)
# Failed at t/op/chop.t line 203
Wide character in print at ./test.pl line 38.
# got '¤Ã�Â'
# expected '¤Â£'
not ok 75 - start=164 end=163 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got '¤'
# expected '¤£'
ok 76 - start=164 end=164
not ok 77 - start=164 end=164 (end as bytes)
# Failed at t/op/chop.t line 203
# got '¤Â'
# expected '¤Â¤'
not ok 78 - start=164 end=164 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got '¤'
# expected '¤¤'
ok 79 - start=164 end=1296
ok 80 - start=164 end=1296 (end as bytes)
not ok 81 - start=164 end=1296 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got '¤'
Wide character in print at ./test.pl line 38.
# expected '¤Ô
ok 82 - start=1296 end=78
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 94.
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 95.
not ok 83 - start=1296 end=163
# Failed at t/op/chop.t line 193
Wide character in print at ./test.pl line 38.
# got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 94.
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 95.
not ok 84 - start=1296 end=163 (end as bytes)
# Failed at t/op/chop.t line 203
Wide character in print at ./test.pl line 38.
# got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
not ok 85 - start=1296 end=163 ($/ as bytes)
# Failed at t/op/chop.t line 209
Wide character in print at ./test.pl line 38.
# got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
ok 86 - start=1296 end=164
not ok 87 - start=1296 end=164 (end as bytes)
# Failed at t/op/chop.t line 203
Wide character in print at ./test.pl line 38.
# got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
not ok 88 - start=1296 end=164 ($/ as bytes)
# Failed at t/op/chop.t line 209
Wide character in print at ./test.pl line 38.
# got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
ok 89 - start=1296 end=1296
ok 90 - start=1296 end=1296 (end as bytes)
not ok 91 - start=1296 end=1296 ($/ as bytes)
# Failed at t/op/chop.t line 209
Wide character in print at ./test.pl line 38.
# got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô

This is not a new utf8 bug.

Inline Patch
--- t/op/chop.t.orig	Mon Nov  4 06:34:41 2002
+++ t/op/chop.t	Mon Jan 12 20:56:02 2004
@@ -6,7 +6,7 @@ BEGIN {
     require './test.pl';
 }
 
-plan tests => 51;
+plan tests => 91;
 
 $_ = 'abc';
 $c = do foo();
@@ -183,3 +183,29 @@ ok($@ =~ /Can\'t modify.*chop.*in.*assig
 eval 'chomp($x, $y) = (1, 2);';
 ok($@ =~ /Can\'t modify.*chom?p.*in.*assignment/);
 
+my @chars = ("N", "\xa3", substr ("\xa4\x{100}", 0, 1), chr 1296);
+foreach my $start (@chars) {
+  foreach my $end (@chars) {
+    local $/ = $end;
+    my $message = "start=" . ord ($start) . " end=" . ord $end;
+    my $string = $start . $end;
+    chomp $string;
+    is ($string, $start, $message);
+
+    my $end_utf8 = $end;
+    utf8::encode ($end_utf8);
+    next if $end_utf8 eq $end;
+
+    # $end ne $end_utf8, so these should not chomp.
+    $string = $start . $end_utf8;
+    my $chomped = $string;
+    chomp $chomped;
+    is ($chomped, $string, "$message (end as bytes)");
+
+    $/ = $end_utf8;
+    $string = $start . $end;
+    $chomped = $string;
+    chomp $chomped;
+    is ($chomped, $string, "$message (\$/ as bytes)");
+  }
+}
Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl v5.8.3:

Configured by nick at Fri Jan  9 10:31:25 GMT 2004.

Summary of my perl5 (revision 5.0 version 8 subversion 3) configuration:
  Platform:
    osname=linux, osvers=2.4.19-rmk4, archname=armv4l-linux
    uname='linux bagpuss.unfortu.net 2.4.19-rmk4 #3 fri oct 25 21:57:55 bst 2002 armv4l unknown '
    config_args='-Dusedevel=y -Dcc=ccache gcc-3.0 -Dld=gcc -Ubincompat5005 -Uinstallusrbinperl -Dcf_email=nick@ccl4.org -Dperladmin=nick@ccl4.org -Dinc_version_list=  -Dinc_version_list_init=0 -Doptimize=-O1 -Dusethreads=n -Dprefix=/usr/local/perl5.8.3/ -Dinstallman1dir=none -Dinstallman3dir=none -de'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='ccache gcc-3.0', ccflags ='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O1',
    cppflags='-fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='3.0.4', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=8
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldbm -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.2.5.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.2.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    MAINT22085


@INC for perl v5.8.3:
    lib
    /usr/local/perl5.8.3/lib/5.8.3/armv4l-linux
    /usr/local/perl5.8.3/lib/5.8.3
    /usr/local/perl5.8.3/lib/site_perl/5.8.3/armv4l-linux
    /usr/local/perl5.8.3/lib/site_perl/5.8.3
    /usr/local/perl5.8.3/lib/site_perl
    .


Environment for perl v5.8.3:
    HOME=/home/nick
    LANG (unset)
    LANGUAGE (unset)
    LC_CTYPE=en_GB.ISO-8859-1
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/nick/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games:/sbin:/usr/sbin:/usr/local/sbin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Author

p5pRT commented Jan 13, 2004

From @ysth

On Mon, Jan 12, 2004 at 09:10:21PM -0000, Nicholas Clark <perlbug-followup@perl.org> wrote:

While working my way down doop.c, I discovered that chomp completely ignores
utf8 flags in both the chomped string and $/

If you are auditing the code for UTF8 issues, you might take a look
for cases where SvUTF8/DO_UTF8 precedes SvPV (or whatever else calls
sv_2pv_flags), since this will fail for an overloaded stringify that
returns UTF8 (the UTF8 flag isn't set until the stringify happens). I
noticed one in do_vop, but haven't fixed it yet.

@p5pRT
Author

p5pRT commented Jan 13, 2004

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Author

p5pRT commented Jan 15, 2004

From @nwc10

Moral - don't use a character for a test case which happens to be a
substring of its UTF8 representation, unless you specifically need this
effect. (ie my testcase was slightly wrong)
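
To illustrate the moral: a code point in the range 0x80 to 0xBF encodes in UTF-8 as 0xC2 followed by the same byte value, so the one-byte character is a suffix of its own UTF-8 encoding. U+00A3 and U+00A4 from the test case both fall in that range, which is why a byte-wise chomp of "N\xc2\xa3" with $/ = "\xa3" legitimately leaves "N\xc2". A minimal sketch in plain C follows; the helper names `latin1_to_utf8` and `is_suffix_of_own_utf8` are made up for illustration and are not part of the Perl API.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one Latin-1 code point (0x00..0xFF) as UTF-8.
 * Writes 1 or 2 bytes to out and returns the number written.
 * Hypothetical helper, not a Perl core function. */
size_t latin1_to_utf8(unsigned char c, unsigned char out[2])
{
    assert(out != NULL);
    if (c < 0x80) {
        out[0] = c;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (c >> 6));   /* 0xC2 or 0xC3 */
    out[1] = (unsigned char)(0x80 | (c & 0x3F)); /* continuation byte */
    return 2;
}

/* True if the single byte c equals the last byte of its own UTF-8
 * encoding, i.e. a byte-wise comparison against c would also match the
 * tail of the encoded form. */
int is_suffix_of_own_utf8(unsigned char c)
{
    unsigned char buf[2];
    size_t n = latin1_to_utf8(c, buf);
    return buf[n - 1] == c;
}
```

Under this sketch 0xA3 and 0xA4 report true, while a character such as 0xE9 (encoded as C3 A9) reports false and would make a safer test value.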

On Mon, Jan 12, 2004 at 09:10:21PM -0000, Nicholas Clark wrote:

While working my way down doop.c, I discovered that chomp completely ignores
utf8 flags in both the chomped string and $/

Change 22155 fixes this.

I presume that chomp really also ought to pay attention to the encoding
pragma in the mixed bytes/utf8 case? [ie more work still needed]

Nicholas Clark

==== //depot/perl/doop.c#140 (text) ====

@@ -1008,6 +1008,7 @@
     STRLEN len;
     STRLEN n_a;
     char *s;
+    char *temp_buffer = NULL;
 
     if (RsSNARF(PL_rs))
         return 0;
@@ -1059,6 +1060,27 @@
         else {
             STRLEN rslen;
             char *rsptr = SvPV(PL_rs, rslen);
+            if (SvUTF8(PL_rs) != SvUTF8(sv)) {
+                /* Assumption is that rs is shorter than the scalar. */
+                if (SvUTF8(PL_rs)) {
+                    /* RS is utf8, scalar is 8 bit. */
+                    bool is_utf8 = TRUE;
+                    temp_buffer = (char*)bytes_from_utf8((U8*)rsptr,
+                                                         &rslen, &is_utf8);
+                    if (is_utf8) {
+                        /* Cannot downgrade, therefore cannot possibly match
+                         */
+                        assert (temp_buffer == rsptr);
+                        temp_buffer = NULL;
+                        goto nope;
+                    }
+                    rsptr = temp_buffer;
+                } else {
+                    /* RS is 8 bit, scalar is utf8. */
+                    temp_buffer = (char*)bytes_to_utf8((U8*)rsptr, &rslen);
+                    rsptr = temp_buffer;
+                }
+            }
             if (rslen == 1) {
                 if (*s != *rsptr)
                     goto nope;
@@ -1081,6 +1103,7 @@
         SvSETMAGIC(sv);
     }
   nope:
+    Safefree(temp_buffer);
     return count;
 }
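
For readers who want to experiment outside the Perl core, here is a self-contained sketch of the same approach, simplified to plain buffers. `upgrade_latin1` and `downgrade_utf8` are hypothetical stand-ins for `bytes_to_utf8()` and `bytes_from_utf8()`, and `chomp_bytes` is an invented name; the real code works on SVs inside `Perl_do_chomp`. The sketch only handles code points up to 0xFF on the 8-bit side, and the temporary buffer is freed on every path, in the spirit of the `Safefree(temp_buffer)` at `nope:`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Upgrade Latin-1 bytes to UTF-8. Returns a malloc'd buffer that the
 * caller must free and updates *len. Stand-in for bytes_to_utf8(). */
unsigned char *upgrade_latin1(const unsigned char *s, size_t *len)
{
    unsigned char *out = malloc(*len * 2 + 1);
    size_t i, n = 0;
    if (!out)
        return NULL;
    for (i = 0; i < *len; i++) {
        if (s[i] < 0x80) {
            out[n++] = s[i];
        } else {
            out[n++] = (unsigned char)(0xC0 | (s[i] >> 6));
            out[n++] = (unsigned char)(0x80 | (s[i] & 0x3F));
        }
    }
    out[n] = 0;
    *len = n;
    return out;
}

/* Downgrade UTF-8 to Latin-1. Returns NULL when a code point above 0xFF
 * (or malformed input) makes that impossible. Stand-in for
 * bytes_from_utf8(). */
unsigned char *downgrade_utf8(const unsigned char *s, size_t *len)
{
    unsigned char *out = malloc(*len + 1);
    size_t i = 0, n = 0;
    if (!out)
        return NULL;
    while (i < *len) {
        if (s[i] < 0x80) {
            out[n++] = s[i++];
        } else if ((s[i] == 0xC2 || s[i] == 0xC3) && i + 1 < *len
                   && (s[i + 1] & 0xC0) == 0x80) {
            out[n++] = (unsigned char)(((s[i] & 0x03) << 6)
                                       | (s[i + 1] & 0x3F));
            i += 2;
        } else {
            free(out);
            return NULL;
        }
    }
    out[n] = 0;
    *len = n;
    return out;
}

/* Return how many bytes would be removed from the end of str, or 0 if rs
 * is not a suffix. rs is first converted to the encoding of str. */
size_t chomp_bytes(const unsigned char *str, size_t len, int str_utf8,
                   const unsigned char *rs, size_t rslen, int rs_utf8)
{
    unsigned char *temp = NULL;
    size_t removed = 0;

    assert(str != NULL && rs != NULL);
    if (!rs_utf8 != !str_utf8) {
        temp = rs_utf8 ? downgrade_utf8(rs, &rslen)
                       : upgrade_latin1(rs, &rslen);
        if (!temp)
            return 0; /* conversion failed, so treat as no match */
        rs = temp;
    }
    if (rslen > 0 && rslen <= len
        && memcmp(str + len - rslen, rs, rslen) == 0)
        removed = rslen;
    free(temp);
    return removed;
}
```

With the scalar "N" plus U+00A3 stored as UTF-8 and the separator given as the single byte 0xA3, `chomp_bytes` upgrades the separator to C2 A3 and reports two bytes removed. With an 8-bit scalar and a separator of U+0510 the downgrade fails and nothing is removed.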

@p5pRT
Author

p5pRT commented Jan 15, 2004

From mjtg@cam.ac.uk

Nicholas Clark <nick@ccl4.org> wrote

+ /* Assumption is that rs is shorter than the scalar. */

That comment looks scary. Where is the assumption made? What happens
if it is false? Do you actually mean

  /* if rs is longer than the scalar, these conversions are a waste
  of time, but the case is rare enough that we don't care */

?

Secondly:

While investigating where the assumption might be made, I peered inside
bytes_from_utf8() / bytes_to_utf8(). I note that the converted string
is placed in a buffer allocated by Newz(), but can't see anywhere that
the buffer is freed. Why isn't this a horrendous memory leak
(for existing uses, as well as the new ones)?

Please tell me I'm missing something obvious.

Mike Guy

@p5pRT
Author

p5pRT commented Jan 15, 2004

From nick.ing-simmons@elixent.com

Mike Guy <mjtg@cam.ac.uk> writes:

Secondly:

While investigating where the assumption might be made, I peered inside
bytes_from_utf8() / bytes_to_utf8(). I note that the converted string
is placed in a buffer allocated by Newz(), but can't see anywhere that
the buffer is freed. Why isn't this a horrendous memory leak
(for existing uses, as well as the new ones)?

Please tell me I'm missing something obvious.

You mean this bit:

   nope:
+    Safefree(temp_buffer);
     return count;
 }

@p5pRT
Author

p5pRT commented Jan 15, 2004

From BQW10602@nifty.com

On Thu, 15 Jan 2004 00:25:00 +0000
Nicholas Clark <nick@ccl4.org> wrote:

Moral - don't use a character for a test case which happens to be a
substring of its UTF8 representation, unless you specifically need this
effect. (ie my testcase was slightly wrong)

On Mon, Jan 12, 2004 at 09:10:21PM -0000, Nicholas Clark wrote:

While working my way down doop.c, I discovered that chomp completely ignores
utf8 flags in both the chomped string and $/

Change 22155 fixes this.

I presume that chomp really also ought to pay attention to the encoding
pragma in the mixed bytes/utf8 case? [ie more work still needed]

Hello.

(1) chomp() returns the number of *characters* removed.
So <count += rslen;> (a number of bytes) is not right, is it?

(2) For some multibyte or stateful encodings, when either the string
or $/ is in bytes, recoding to utf8 is required.
  (That must be inefficient for single-byte encodings...)

(But AFAIK encoding.pm unicodifies strings in many cases,
such as literals and input. So strings with UTF8 off might be
rare under the encoding pragma.)

(3) For many CJK encodings, comparison in bytes has a problem
that does not arise for single-byte encodings or Unicode
encodings (UTF-X).

Say the encoding is shift-jis. "\x81\x40" is IDSP (U+3000),
while "\x40" is '@' (as in ASCII).
Then, after <$/ = "\x40"; $a = "\x81\x40"; chomp($a);>,
$a should not be chomped ($a eq "\x81" is very bad).

Encodings that have this problem include
  big5, euc-jp, GBK, iso-2022-jp, johab, shift-jis, UHC.
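
Point (3) can be sketched in plain C. The check below scans from the start of the buffer so that a trail byte is never mistaken for a standalone character. It is only an illustration under stated assumptions: the function names are invented, and the lead-byte ranges 0x81..0x9F and 0xE0..0xFC are the commonly used Shift-JIS ranges rather than anything taken from the Perl source. The patch itself takes a different route and recodes to UTF-8 with sv_recode_to_utf8().

```c
#include <assert.h>
#include <stddef.h>

/* Lead bytes of two-byte Shift-JIS characters (assumed ranges). */
int sjis_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Return 1 if the final *character* of s, decoded as Shift-JIS, is the
 * single-byte character c, and 0 otherwise. A naive s[len - 1] == c test
 * would wrongly match the trail byte of a two-byte character. */
int sjis_ends_with_single(const unsigned char *s, size_t len, unsigned char c)
{
    size_t i = 0, last_start = 0, last_width = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        /* A lead byte consumes the following byte as its trail byte. */
        size_t w = (sjis_is_lead_byte(s[i]) && i + 1 < len) ? 2 : 1;
        last_start = i;
        last_width = w;
        i += w;
    }
    return last_width == 1 && s[last_start] == c;
}
```

For the buffer 81 40 and the separator 0x40 the function reports no match, because 0x40 is the trail byte of IDSP, whereas "A" followed by 0x40 does match.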

### $ patch

Inline Patch
diff -urN perl~/doop.c perl/doop.c
--- perl~/doop.c	Thu Jan 15 23:58:30 2004
+++ perl/doop.c	Fri Jan 16 03:19:14 2004
@@ -1007,6 +1007,7 @@
     STRLEN n_a;
     char *s;
     char *temp_buffer = NULL;
+    SV* svrecode = Nullsv;
 
     if (RsSNARF(PL_rs))
 	return 0;
@@ -1042,6 +1043,18 @@
         if (SvREADONLY(sv))
             Perl_croak(aTHX_ PL_no_modify);
     }
+
+    if (PL_encoding) {
+	if (!SvUTF8(sv)) {
+	/* XXX, here sv is utf8-ized as a side-effect!
+	   If encoding.pm is used properly, almost string-generating
+	   operations, including literal strings, chr(), input data, etc.
+	   should have been utf8-ized already, right?
+	*/
+	    sv_recode_to_utf8(sv, PL_encoding);
+	}
+    }
+
     s = SvPV(sv, len);
     if (s && len) {
 	s += --len;
@@ -1056,8 +1069,13 @@
 	    }
 	}
 	else {
-	    STRLEN rslen;
+	    STRLEN rslen, rs_charlen;
 	    char *rsptr = SvPV(PL_rs, rslen);
+
+	    rs_charlen = SvUTF8(PL_rs)
+		? sv_len_utf8(PL_rs)
+		: rslen;
+
 	    if (SvUTF8(PL_rs) != SvUTF8(sv)) {
 		/* Assumption is that rs is shorter than the scalar.  */
 		if (SvUTF8(PL_rs)) {
@@ -1073,7 +1091,16 @@
 			goto nope;
 		    }
 		    rsptr = temp_buffer;
-		} else {
+		}
+		else if (PL_encoding) {
+		    /* RS is 8 bit, encoding.pm is used.
+		     * Do not recode PL_rs as a side-effect. */
+		   svrecode = newSVpvn(rsptr, rslen);
+		   sv_recode_to_utf8(svrecode, PL_encoding);
+		   rsptr = SvPV(svrecode, rslen);
+		   rs_charlen = sv_len_utf8(svrecode);
+		}
+		else {
 		    /* RS is 8 bit, scalar is utf8.  */
 		    temp_buffer = (char*)bytes_to_utf8((U8*)rsptr, &rslen);
 		    rsptr = temp_buffer;
@@ -1091,7 +1118,7 @@
 		s -= rslen - 1;
 		if (memNE(s, rsptr, rslen))
 		    goto nope;
-		count += rslen;
+		count += rs_charlen;
 	    }
 	}
 	s = SvPV_force(sv, n_a);
@@ -1101,6 +1128,10 @@
 	SvSETMAGIC(sv);
     }
   nope:
+
+    if (svrecode)
+	 SvREFCNT_dec(svrecode);
+
     Safefree(temp_buffer);
     return count;
 }
diff -urN perl~/t/op/chop.t perl/t/op/chop.t
--- perl~/t/op/chop.t	Fri Jan 16 03:08:14 2004
+++ perl/t/op/chop.t	Fri Jan 16 02:50:22 2004
@@ -6,7 +6,7 @@
     require './test.pl';
 }
 
-plan tests => 91;
+plan tests => 93;
 
 $_ = 'abc';
 $c = do foo();
@@ -208,4 +208,17 @@
     chomp $chomped;
     is ($chomped, $string, "$message (\$/ as bytes)");
   }
+}
+
+{
+    # returns length in characters, but not in bytes.
+    $/ = "\x{100}";
+    $a = "A$/";
+    $b = chomp $a;
+    is ($b, 1);
+
+    $/ = "\x{100}\x{101}";
+    $a = "A$/";
+    $b = chomp $a;
+    is ($b, 2);
 }
### ^ patch
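
The rs_charlen change above returns a character count instead of a byte count. The difference can be sketched in plain C as follows; `utf8_char_count` is a hypothetical helper for illustration (the patch itself uses sv_len_utf8()), and it assumes well-formed UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a well-formed UTF-8 buffer by counting every byte
 * that is not a continuation byte (bit pattern 10xxxxxx). */
size_t utf8_char_count(const unsigned char *s, size_t len)
{
    size_t i, n = 0;

    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

A separator of "\x{100}\x{101}" occupies four bytes (C4 80 C4 81) but counts as two characters, which is the value the new chop.t tests expect chomp to return.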

I wrote tests for chomping bytes in several CJK encodings as a new file.
I'm not sure where this test should be placed (say, perl/t/uni/ ?).

### ^ new test
BEGIN {
    if ($ENV{'PERL_CORE'}){
        chdir 't';
        unshift @INC, '../lib';
    }
    require Config; import Config;
    if ($Config{'extensions'} !~ /\bEncode\b/) {
        print "1..0 # Skip: Encode was not built\n";
        exit 0;
    }
    if (ord("A") == 193) {
        print "1..0 # Skip: EBCDIC\n";
        exit 0;
    }
    unless (PerlIO::Layer->find('perlio')){
        print "1..0 # Skip: PerlIO required\n";
        exit 0;
    }
    eval 'use Encode';
    if ($@ =~ /dynamic loading not available/) {
        print "1..0 # Skip: no dynamic loading, no Encode\n";
        exit 0;
    }
}

use strict;
use Test::More tests => (4 * 4 * 4) * (3); # (@char ** 3) * (keys %mbchars)

# %mbchars = (encoding => { bytes => utf8, ... }, ...);
# * pack('C*') is expected to return bytes even if ${^ENCODING} is true.
our %mbchars = (
    'big-5' => {
        pack('C*', 0x40) => pack('U*', 0x40), # COMMERCIAL AT
        pack('C*', 0xA4, 0x40) => "\x{4E00}", # CJK-4E00
    },
    'euc-jp' => {
        pack('C*', 0xB0, 0xA1) => "\x{4E9C}", # CJK-4E9C
        pack('C*', 0x8F, 0xB0, 0xA1) => "\x{4E02}", # CJK-4E02
    },
    'shift-jis' => {
        pack('C*', 0xA9) => "\x{FF69}", # halfwidth katakana small U
        pack('C*', 0x82, 0xA9) => "\x{304B}", # hiragana KA
    },
);

for my $enc (sort keys %mbchars) {
    local ${^ENCODING} = find_encoding($enc);
    my @char = (sort(keys %{ $mbchars{$enc} }),
                sort(values %{ $mbchars{$enc} }));

    for my $rs (@char) {
        local $/ = $rs;
        for my $start (@char) {
            for my $end (@char) {
                my $string = $start.$end;
                my $expect = $end eq $rs ? $start : $string;
                chomp $string;
                is($string, $expect);
            }
        }
    }
}
### ^ new test

regards
SADAHIRO Tomoyuki

@p5pRT
Author

p5pRT commented Jan 23, 2004

From @nwc10

On Fri, Jan 16, 2004 at 04:13:00AM +0900, SADAHIRO Tomoyuki wrote:

(1) chomp() returns the number of *characters* removed.
So <count += rslen;> (a number of bytes) is not right, is it?

Yes, bug. Well spotted.

(2) For some multibyte or stateful encodings, when either the string
or $/ is in bytes, recoding to utf8 is required.
(That must be inefficient for single-byte encodings...)

(But AFAIK encoding.pm unicodifies strings in many cases,
such as literals and input. So strings with UTF8 off might be
rare under the encoding pragma.)

I've no idea, but for the moment I'm happy to assume that they are rare,
and concentrate on correctness.

(3) For many CJK encodings, comparison in bytes has a problem
that does not arise for single-byte encodings or Unicode
encodings (UTF-X).

Say the encoding is shift-jis. "\x81\x40" is IDSP (U+3000),
while "\x40" is '@' (as in ASCII).
Then, after <$/ = "\x40"; $a = "\x81\x40"; chomp($a);>,
$a should not be chomped ($a eq "\x81" is very bad).

Encodings that have this problem include
big5, euc-jp, GBK, iso-2022-jp, johab, shift-jis, UHC.

This is what your new test tests?

I wrote tests for chomping bytes in several CJK encodings as a new file.
I'm not sure where this test should be placed (say, perl/t/uni/ ?).

I'm not sure either, but for now it's t/uni/chomp.t

"Thanks, applied"

Thanks for sorting out all the loose ends I left.

Nicholas Clark

@p5pRT
Author

p5pRT commented Mar 30, 2006

From @nwc10

Was fixed by change 22155.

@p5pRT
Author

p5pRT commented Mar 30, 2006

@nwc10 - Status changed from 'open' to 'resolved'
