chomp ignores utf8 #7033
From @nwc10
Created by @nwc10
While working my way down doop.c, I discovered that chomp completely ignores
With the following patch to t/op/chop.t there are many test failures.
ok 52 - start=78 end=78
This is not a new utf8 bug.
--- t/op/chop.t.orig Mon Nov 4 06:34:41 2002
+++ t/op/chop.t Mon Jan 12 20:56:02 2004
@@ -6,7 +6,7 @@ BEGIN {
require './test.pl';
}
-plan tests => 51;
+plan tests => 91;
$_ = 'abc';
$c = do foo();
@@ -183,3 +183,29 @@ ok($@ =~ /Can\'t modify.*chop.*in.*assig
eval 'chomp($x, $y) = (1, 2);';
ok($@ =~ /Can\'t modify.*chom?p.*in.*assignment/);
+my @chars = ("N", "\xa3", substr ("\xa4\x{100}", 0, 1), chr 1296);
+foreach my $start (@chars) {
+ foreach my $end (@chars) {
+ local $/ = $end;
+ my $message = "start=" . ord ($start) . " end=" . ord $end;
+ my $string = $start . $end;
+ chomp $string;
+ is ($string, $start, $message);
+
+ my $end_utf8 = $end;
+ utf8::encode ($end_utf8);
+ next if $end_utf8 eq $end;
+
+ # $end ne $end_utf8, so these should not chomp.
+ $string = $start . $end_utf8;
+ my $chomped = $string;
+ chomp $chomped;
+ is ($chomped, $string, "$message (end as bytes)");
+
+ $/ = $end_utf8;
+ $string = $start . $end;
+ $chomped = $string;
+ chomp $chomped;
+ is ($chomped, $string, "$message (\$/ as bytes)");
+ }
+}
|
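The failure mode the test patch above probes can be sketched outside of Perl. This is a minimal illustration, not the doop.c code: `latin1_to_utf8` and `ends_with` are hypothetical helpers, and the byte values assume the string is "N" followed by U+00A3. It shows that comparing raw bytes when only one side is UTF-8 encoded can either match in the middle of a character or miss a genuine match, and that upgrading the 8-bit side first gives the expected answer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: upgrade Latin-1 bytes to UTF-8.
 * dst must have room for 2 * len bytes. Returns the UTF-8 length. */
static size_t latin1_to_utf8(const unsigned char *src, size_t len,
                             unsigned char *dst)
{
    size_t out = 0;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        if (src[i] < 0x80) {
            dst[out++] = src[i];                 /* ASCII is unchanged */
        } else {
            /* 0x80..0xFF become a two-byte sequence */
            dst[out++] = (unsigned char)(0xC0 | (src[i] >> 6));
            dst[out++] = (unsigned char)(0x80 | (src[i] & 0x3F));
        }
    }
    return out;
}

/* Hypothetical helper: returns 1 if buf ends with the bytes of suffix. */
static int ends_with(const unsigned char *buf, size_t len,
                     const unsigned char *suffix, size_t slen)
{
    assert(buf != NULL && suffix != NULL);
    if (slen == 0 || slen > len)
        return 0;
    return memcmp(buf + len - slen, suffix, slen) == 0;
}
```

With the UTF-8 string `4E C2 A3` and the 8-bit separator `A3`, the raw comparison matches the last byte only, so stripping it would leave a malformed string; after upgrading the separator to `C2 A3` the match covers the whole character.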
From @ysth
On Mon, Jan 12, 2004 at 09:10:21PM -0000, Nicholas Clark <perlbug-followup@perl.org> wrote:
If you are auditing the code for UTF8 issues, you might take a look |
The RT System itself - Status changed from 'new' to 'open' |
From @nwc10
Moral - don't use a character for a test case which happens to be a
On Mon, Jan 12, 2004 at 09:10:21PM -0000, Nicholas Clark wrote:
Change 22155 fixes this.
I presume that chomp really also ought to pay attention to the encoding
Nicholas Clark
==== //depot/perl/doop.c#140 (text) ====
@@ -1008,6 +1008,7 @@
if (RsSNARF(PL_rs)) |
From mjtg@cam.ac.uk
Nicholas Clark <nick@ccl4.org> wrote
That comment looks scary. Where is the assumption made?
What happens /* if rs is longer than the scalar, these conversions are a waste ?
Secondly: While investigating where the assumption might be made, I peered inside
Please tell me I'm missing something obvious.
Mike Guy |
From nick.ing-simmons@elixent.com
Mike Guy <mjtg@cam.ac.uk> writes:
You mean this bit:
|
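The question above, about what happens when the record separator is longer than the scalar, can be illustrated with a generic sketch. This is not the doop.c implementation: `chomp_bytes` is a hypothetical byte-level helper that works on NUL-terminated buffers and does not model paragraph mode, slurp mode, or UTF-8. The point is only that the length guard has to come before any pointer arithmetic on the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical byte-level chomp: strips one trailing copy of rs from the
 * NUL-terminated buffer s and returns the number of bytes removed. */
static size_t chomp_bytes(char *s, const char *rs)
{
    size_t len, rslen;

    assert(s != NULL && rs != NULL);
    len = strlen(s);
    rslen = strlen(rs);

    /* Guard: an empty or over-long separator can never match. Without
     * this check, s + len - rslen would point before the buffer. */
    if (rslen == 0 || rslen > len)
        return 0;

    if (memcmp(s + len - rslen, rs, rslen) != 0)
        return 0;

    s[len - rslen] = '\0';   /* truncate in place */
    return rslen;
}
```

For example, calling it on the buffer `"\n"` with the separator `"\r\n"` returns 0 and leaves the buffer untouched, because the guard rejects the comparison outright.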
From BQW10602@nifty.com
On Thu, 15 Jan 2004 00:25:00 +0000
Hello.
(1) chomp() returns number of *characters* removed.
(2) For some multibyte or stateful encoding, and in the case
(but AFAIK, encoding.pm unicodifies strings in many cases,
(3) For many CJK encodings, comparison in bytes has a problem
Say, suppose encoding is shift-jis. "\x81\x40" is IDSP (U+3000),
Encodings which has such a problem include
### $ patch
diff -urN perl~/doop.c perl/doop.c
--- perl~/doop.c Thu Jan 15 23:58:30 2004
+++ perl/doop.c Fri Jan 16 03:19:14 2004
@@ -1007,6 +1007,7 @@
STRLEN n_a;
char *s;
char *temp_buffer = NULL;
+ SV* svrecode = Nullsv;
if (RsSNARF(PL_rs))
return 0;
@@ -1042,6 +1043,18 @@
if (SvREADONLY(sv))
Perl_croak(aTHX_ PL_no_modify);
}
+
+ if (PL_encoding) {
+ if (!SvUTF8(sv)) {
+ /* XXX, here sv is utf8-ized as a side-effect!
+ If encoding.pm is used properly, almost string-generating
+ operations, including literal strings, chr(), input data, etc.
+ should have been utf8-ized already, right?
+ */
+ sv_recode_to_utf8(sv, PL_encoding);
+ }
+ }
+
s = SvPV(sv, len);
if (s && len) {
s += --len;
@@ -1056,8 +1069,13 @@
}
}
else {
- STRLEN rslen;
+ STRLEN rslen, rs_charlen;
char *rsptr = SvPV(PL_rs, rslen);
+
+ rs_charlen = SvUTF8(PL_rs)
+ ? sv_len_utf8(PL_rs)
+ : rslen;
+
if (SvUTF8(PL_rs) != SvUTF8(sv)) {
/* Assumption is that rs is shorter than the scalar. */
if (SvUTF8(PL_rs)) {
@@ -1073,7 +1091,16 @@
goto nope;
}
rsptr = temp_buffer;
- } else {
+ }
+ else if (PL_encoding) {
+ /* RS is 8 bit, encoding.pm is used.
+ * Do not recode PL_rs as a side-effect. */
+ svrecode = newSVpvn(rsptr, rslen);
+ sv_recode_to_utf8(svrecode, PL_encoding);
+ rsptr = SvPV(svrecode, rslen);
+ rs_charlen = sv_len_utf8(svrecode);
+ }
+ else {
/* RS is 8 bit, scalar is utf8. */
temp_buffer = (char*)bytes_to_utf8((U8*)rsptr, &rslen);
rsptr = temp_buffer;
@@ -1091,7 +1118,7 @@
s -= rslen - 1;
if (memNE(s, rsptr, rslen))
goto nope;
- count += rslen;
+ count += rs_charlen;
}
}
s = SvPV_force(sv, n_a);
@@ -1101,6 +1128,10 @@
SvSETMAGIC(sv);
}
nope:
+
+ if (svrecode)
+ SvREFCNT_dec(svrecode);
+
Safefree(temp_buffer);
return count;
}
diff -urN perl~/t/op/chop.t perl/t/op/chop.t
--- perl~/t/op/chop.t Fri Jan 16 03:08:14 2004
+++ perl/t/op/chop.t Fri Jan 16 02:50:22 2004
@@ -6,7 +6,7 @@
require './test.pl';
}
-plan tests => 91;
+plan tests => 93;
$_ = 'abc';
$c = do foo();
@@ -208,4 +208,17 @@
chomp $chomped;
is ($chomped, $string, "$message (\$/ as bytes)");
}
+}
+
+{
+ # returns length in characters, but not in bytes.
+ $/ = "\x{100}";
+ $a = "A$/";
+ $b = chomp $a;
+ is ($b, 1);
+
+ $/ = "\x{100}\x{101}";
+ $a = "A$/";
+ $b = chomp $a;
+ is ($b, 2);
}
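The two new tests at the end of the patch above expect chomp to report characters rather than bytes. A minimal sketch of that distinction, assuming well-formed UTF-8 input: `utf8_char_count` is a hypothetical stand-in for what the patch obtains from `sv_len_utf8`, and it simply counts the bytes that are not continuation bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of characters in a well-formed UTF-8
 * buffer. Continuation bytes have the bit pattern 10xxxxxx, so every
 * other byte starts a new character. */
static size_t utf8_char_count(const unsigned char *s, size_t len)
{
    size_t chars = 0;

    assert(s != NULL);
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

U+0100 encodes as the two bytes `C4 80`, so a separator of `"\x{100}"` is 2 bytes but 1 character, and `"\x{100}\x{101}"` (`C4 80 C4 81`) is 4 bytes but 2 characters, which matches the values 1 and 2 checked by the tests.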
I wrote tests for chomp bytes in many CJK encoding as a new file.
### ^ new test
use strict;
# %mbchars = (encoding => { bytes => utf8, ... }, ...);
for my $enc (sort keys %mbchars) {
for my $rs (@char) {
regards |
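Point (3) above can be made concrete with a small sketch. In Shift-JIS the second byte of "\x81\x40" is 0x40, which on its own is ASCII "@", so a plain byte comparison against a separator of "@" matches inside the two-byte character. The code below is an illustration only and uses a different technique from the patch (which recodes to UTF-8 via `sv_recode_to_utf8`): the helpers are hypothetical, and the lead-byte ranges (0x81-0x9F, 0xE0-0xFC) are an assumption covering common Shift-JIS variants.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed Shift-JIS lead-byte ranges; variants differ at the upper end. */
static int sjis_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Returns 1 if byte offset off (off <= len) starts a character,
 * found by scanning forward from the start of the buffer. */
static int sjis_is_boundary(const unsigned char *buf, size_t len, size_t off)
{
    size_t i = 0;

    assert(buf != NULL && off <= len);
    while (i < off)
        i += sjis_is_lead(buf[i]) ? 2 : 1;
    return i == off;
}

/* Suffix match that also requires the match to start on a character
 * boundary, so a trail byte cannot be mistaken for a separator. */
static int sjis_ends_with(const unsigned char *buf, size_t len,
                          const unsigned char *rs, size_t rslen)
{
    assert(buf != NULL && rs != NULL);
    if (rslen == 0 || rslen > len)
        return 0;
    if (memcmp(buf + len - rslen, rs, rslen) != 0)
        return 0;
    return sjis_is_boundary(buf, len, len - rslen);
}
```

With the buffer `41 81 40` and the separator `40`, the bytes match at offset 2 but that offset falls inside the character `81 40`, so the boundary check rejects it; the separator `81 40` matches at offset 1, which is a boundary.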
From @nwc10
On Fri, Jan 16, 2004 at 04:13:00AM +0900, SADAHIRO Tomoyuki wrote:
Yes, bug. Well spotted
I've no idea, but for the moment I'm happy to assume that they are rare,
This is what your new test tests?
I'm not sure either, but for now it's t/uni/chomp.t
"Thanks, applied"
Thanks for sorting out all the loose ends I left.
Nicholas Clark |
From @nwc10
Was fixed by change 22155. |
@nwc10 - Status changed from 'open' to 'resolved' |
Migrated from rt.perl.org#24888 (status was 'resolved')
Searchable as RT24888