Bug in lc and uc (interaction between UTF-8, substr, and lc/uc) #8343
From perl@benizi.com
Created by perl@benizi.com
Problem with lc/uc interacting with substr and _utf8_on. Second substr(lc($var),0) on the same _utf8_on'ed $var is the wrong
perl bug.pl [test-string]
For each string in the split:
Output should be:
Actual output is:
# sample program demonstrating problem
# expected results
# actual results
# golfed test case (should produce 'abc', not 'ab')
Additional oddness/data:
Affected functions: only lc/uc (not ucfirst/lcfirst). Only in substr(lc(),0)
Perl Info
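The report's own test programs did not survive in this copy of the thread. As a rough illustration only, the pattern it describes (the same `substr(lc($var), 0)` expression evaluated repeatedly on `_utf8_on`'ed strings) might be exercised like this. The input strings and variable names here are assumptions, not the reporter's originals.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical inputs (not from the report): "a" and the UTF-8 bytes of "ßbc".
my @inputs = ("a", "\xc3\x9fbc");
my @mismatches;

for my $var (@inputs) {
    Encode::_utf8_on($var);                # flag the bytes as UTF-8, in place
    my $direct     = lc($var);             # plain lc
    my $via_substr = substr(lc($var), 0);  # this one op is reused on every iteration
    push @mismatches, length($var) if $via_substr ne $direct;
}

print @mismatches
    ? "mismatch seen\n"
    : "substr(lc(...), 0) agrees with lc(...)\n";
```

On a perl that includes the fix discussed later in this thread, the two values should agree for every input.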
From perl@benizi.com
Still in 5.9.3 for i686-linux. (Tested that before I submitted, but
perl@benizi.com - Status changed from 'new' to 'open'
From @andk
Looks like a hairy troll has jhidden for quite a while :)
----Program----
----Output of .../pHIziQK/perl-5.8.0@18529/bin/perl----
----EOF ($?='0')----
----EOF ($?='0')----
Change 18530 by hv@hv-crypt.org on 2003/01/21 01:37:03
integrate (by hand) #18353 and #18359 from maint-5.8:
OK, maybe it helps to binary search along the maint-5.8 stretch...
----Program----
----Output of .../pAyq3oR/perl-5.8.0@18352/bin/perl----
----EOF ($?='0')----
----EOF ($?='0')----
Change 18353 by jhi@lyta on 2002/12/26 02:07:06
Introduce a cache for UTF-8 data: length and byte<->char mapping
--
From @nwc10
On Fri, Feb 24, 2006 at 04:46:09AM +0100, Andreas J. Koenig wrote:
Thanks. This confirms my suspicion that it was the UTF-8 caching code
As part of my TPF grant I'm going to look at all this, so if no-one else
As a workaround, I think that re-assigning the value to itself before the lc
Nicholas Clark
From BQW10602@nifty.com
On Fri, 24 Feb 2006 10:13:15 +0000, Nicholas Clark <nick@ccl4.org> wrote
Should the magic on TARG be reset? (Or don't use TARG?)
ucfirst() also has this bug, when ulen != tculen (see pp_ucfirst).
for (split /:/, shift||"a:ßbc") {
cf. The result on Perl 5.8.0
postincrement $_++ is also buggy.
#!perl
cf. The result on Perl 5.8.0
In contrast, preincrement ++$_ is good (pp_preinc doesn't use TARG).
#!perl
Regards
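The `ulen != tculen` condition can be made concrete with a small sketch. This is not Sadahiro-san's elided test program. It assumes a string whose first character is ß (U+00DF), which title-cases to the two-character sequence "Ss" (as the patch's own test comment later notes), plus a trailing `\x{100}` so that the string is stored as UTF-8.

```perl
use strict;
use warnings;

# "\x{df}" is ß; the trailing "\x{100}" forces UTF-8 storage of the string.
my $in  = "\x{df}yz\x{100}";
my $out = ucfirst($in);

# The title-cased first character is longer than the original one,
# so the result has one more character than the input.
printf "length in=%d out=%d\n", length($in), length($out);

# The check relevant to this bug: substr() on a fresh ucfirst() result
# should see the same string as the stored copy.
my $agrees = substr(ucfirst($in), 0) eq $out;
print $agrees ? "substr agrees\n" : "substr disagrees\n";
```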
From @jhi
Nicholas Clark wrote:
Not to put too much pressure on Sadahiro-san but he has traditionally
From BQW10602@nifty.com
On Fri, 24 Feb 2006 19:15:53 +0200, Jarkko Hietaniemi <jhietaniemi@gmail.com> wrote
However this bug against *other magics* seemed to exist even in
SvSETMAGIC(sv) does not affect TARG if sv != TARG.
Regards,
The changes of pp.c are for pp_ucfirst, pp_uc, and pp_lc.
diff -ur perl-current@27323/pp.c perl/pp.c
--- perl-current@27323/pp.c Sat Feb 25 09:41:08 2006
+++ perl/pp.c Sat Feb 25 16:59:13 2006
@@ -3350,7 +3350,8 @@
if (slen > ulen)
sv_catpvn(TARG, (char*)(s + ulen), slen - ulen);
SvUTF8_on(TARG);
- SETs(TARG);
+ sv = TARG;
+ SETs(sv);
}
else {
s = (U8*)SvPV_force_nomg(sv, slen);
@@ -3402,7 +3403,8 @@
if (!len) {
SvUTF8_off(TARG); /* decontaminate */
sv_setpvn(TARG, "", 0);
- SETs(TARG);
+ sv = TARG;
+ SETs(sv);
}
else {
STRLEN min = len + 1;
@@ -3435,7 +3437,8 @@
*d = '\0';
SvUTF8_on(TARG);
SvCUR_set(TARG, d - (U8*)SvPVX_const(TARG));
- SETs(TARG);
+ sv = TARG;
+ SETs(sv);
}
}
else {
@@ -3487,7 +3490,8 @@
if (!len) {
SvUTF8_off(TARG); /* decontaminate */
sv_setpvn(TARG, "", 0);
- SETs(TARG);
+ sv = TARG;
+ SETs(sv);
}
else {
STRLEN min = len + 1;
@@ -3540,7 +3544,8 @@
*d = '\0';
SvUTF8_on(TARG);
SvCUR_set(TARG, d - (U8*)SvPVX_const(TARG));
- SETs(TARG);
+ sv = TARG;
+ SETs(sv);
}
}
else {
diff -ur perl-current@27323/t/op/lc.t perl/t/op/lc.t
--- perl-current@27323/t/op/lc.t Tue Nov 08 00:50:29 2005
+++ perl/t/op/lc.t Sat Feb 25 18:08:54 2006
@@ -6,7 +6,7 @@
require './test.pl';
}
-plan tests => 59;
+plan tests => 77;
$a = "HELLO.* world";
$b = "hello.* WORLD";
@@ -163,3 +163,39 @@
is($a, v10, "[perl #18857]");
}
}
+
+
+# [perl #38619] Bug in lc and uc (interaction between UTF-8, substr, and lc/uc)
+
+for ("a\x{100}", "xyz\x{100}") {
+ is(substr(uc($_), 0), uc($_), "[perl #38619] uc");
+}
+for ("A\x{100}", "XYZ\x{100}") {
+ is(substr(lc($_), 0), lc($_), "[perl #38619] lc");
+}
+for ("a\x{100}", "ßyz\x{100}") { # ß to Ss (different length)
+ is(substr(ucfirst($_), 0), ucfirst($_), "[perl #38619] ucfirst");
+}
+
+# Related to [perl #38619]
+# the original report concerns PERL_MAGIC_utf8.
+# these cases concern PERL_MAGIC_regex_global.
+
+for (map { $_ } "a\x{100}", "abc\x{100}", "\x{100}") {
+ chop; # get ("a", "abc", "") in utf8
+ my $return = uc($_) =~ /\G(.?)/g;
+ my $result = $return ? $1 : "not";
+ my $expect = (uc($_) =~ /(.?)/g)[0];
+ is($return, 1, "[perl #38619]");
+ is($result, $expect, "[perl #38619]");
+}
+
+for (map { $_ } "A\x{100}", "ABC\x{100}", "\x{100}") {
+ chop; # get ("A", "ABC", "") in utf8
+ my $return = lc($_) =~ /\G(.?)/g;
+ my $result = $return ? $1 : "not";
+ my $expect = (lc($_) =~ /(.?)/g)[0];
+ is($return, 1, "[perl #38619]");
+ is($result, $expect, "[perl #38619]");
+}
+
##### END OF PATCH
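For readers who want to try the patch's checks outside the core test suite, a minimal standalone sketch follows. It mirrors the substr and `\G` cases above but uses a hypothetical `check` helper instead of the `is` function from `./test.pl`, so it is an approximation of the tests rather than the tests themselves.

```perl
use strict;
use warnings;

my @failures;

# Hypothetical helper: record the name of any comparison that fails.
sub check {
    my ($got, $expected, $name) = @_;
    push @failures, $name unless defined $got && $got eq $expected;
}

# PERL_MAGIC_utf8 cases: substr() on a fresh uc/lc result.
for my $s ("a\x{100}", "xyz\x{100}") {
    check(substr(uc($s), 0), uc($s), "uc, length " . length($s));
}
for my $s ("A\x{100}", "XYZ\x{100}") {
    check(substr(lc($s), 0), lc($s), "lc, length " . length($s));
}

# PERL_MAGIC_regex_global case: \G must start at position 0 each time.
for my $s ("a\x{100}", "abc\x{100}", "\x{100}") {
    my $copy = $s;
    chop $copy;                               # "a", "abc", "" stored as UTF-8
    my $matched = uc($copy) =~ /\G(.?)/g;
    my $first   = $matched ? $1 : "not";
    my $expect  = (uc($copy) =~ /(.?)/g)[0];
    check($first, $expect, "\\G after uc, length " . length($copy));
}

print @failures ? "failed: @failures\n" : "all checks passed\n";
```

Each loop deliberately runs the same `uc` or `lc` op on strings of different lengths, since the thread attributes the bug to state left on the op's reused TARG.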
From @nwc10
On Sat, Feb 25, 2006 at 06:16:45PM +0900, SADAHIRO Tomoyuki wrote:
:-)
D'oh! Thanks, applied (change 27329)
Nicholas Clark
@iabyn - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#38619 (status was 'resolved')
Searchable as RT38619$