Bug in lc and uc (interaction between UTF-8, substr, and lc/uc) #8343

Closed
p5pRT opened this issue Feb 23, 2006 · 10 comments

@p5pRT

p5pRT commented Feb 23, 2006

Migrated from rt.perl.org#38619 (status was 'resolved')

Searchable as RT38619$

@p5pRT

p5pRT commented Feb 23, 2006

From perl@benizi.com

Created by perl@benizi.com

Problem with lc/uc interacting with substr and _utf8_on.

The second substr(lc($var), 0) on the same _utf8_on'ed $var has the wrong
length and, in preliminary results, seems to be limited to the length of
the first substr(lc($var), 0). Adding further iterations leads to further
weirdness. The test program below can be called as:

perl bug.pl [test-string]
The test string is split on /:/ and defaults to 'a:bc'.

For each string in the split:
  _utf8_on, and print string <TAB> substr(lc(string), 0)

Output should be:
  string1 <TAB> string1
  string2 <TAB> string2
  ...

Actual output is:
  string1 <TAB> string1
  string2 <TAB> string3
  ...
(where string3 is the first length(string1) characters of string2)

# sample program demonstrating problem
$ cat bug.pl
#!/usr/bin/perl -l
use strict;
use warnings;
use Encode qw/_utf8_on/;
for (split /:/, shift||'a:bc') {
  _utf8_on($_);
  print "$_\t", substr(lc($_), 0);
}

# expected results
$ cat expected_output
a a
bc bc

# actual results
$ perl bug.pl
a a
bc b

# golfed test case (should produce 'abc', not 'ab')
$ perl -MEncode=_utf8_on -e '_utf8_on($_),print substr lc,0 for qw<a bc>,$/'
ab

Additional oddness/data:
Affected versions: >=5.8.1
Confirmed unaffected: linux-i686 5.8.0, solaris 5.8.0

Affected functions: only lc/uc (not ucfirst/lcfirst), and only in the
substr(lc(), 0) order (i.e. lc(substr($_, 0)) is not affected).

Perl Info

Flags:
     category=core
     severity=low

Site configuration information for perl v5.8.7:

Configured by Gentoo at Sat Feb  4 23:34:18 EST 2006.

Summary of my perl5 (revision 5 version 8 subversion 7) configuration:
   Platform:
     osname=linux, osvers=2.6.11-gentoo-r6, archname=i686-linux
     uname='linux elation 2.6.11-gentoo-r6 #4 thu may 12 16:36:25 edt 2005 i686 intel(r) pentium(r) 4 cpu 3.00ghz genuineintel gnulinux '
     config_args='-des -Darchname=i686-linux -Dcccdlflags=-fPIC -Dccdlflags=-rdynamic -Dcc=i686-pc-linux-gnu-gcc -Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr -Dlocincpth=  -Doptimize=-O2 -march=pentium4 -fomit-frame-pointer -Duselargefiles -Dd_semctl_semun -Dscriptdir=/usr/bin -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dinstallman1dir=/usr/share/man/man1 -Dinstallman3dir=/usr/share/man/man3 -Dman1ext=1 -Dman3ext=3pm -Dinc_version_list=5.8.0 5.8.0/i686-linux 5.8.2 5.8.2/i686-linux 5.8.4 5.8.4/i686-linux 5.8.5 5.8.5/i686-linux 5.8.6 5.8.6/i686-linux  -Dcf_by=Gentoo -Ud_csh -Di_ndbm -Di_gdbm -Di_db'
     hint=recommended, useposix=true, d_sigaction=define
     usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
     useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
     use64bitint=undef use64bitall=undef uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='i686-pc-linux-gnu-gcc', ccflags ='-fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
     optimize='-O2 -march=pentium4 -fomit-frame-pointer',
     cppflags='-fno-strict-aliasing -pipe'
     ccversion='', gccversion='3.4.4 (Gentoo 3.4.4-r1, ssp-3.4.4-1.0, pie-8.7.8)', gccosandvers=''
     intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
     ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
     alignbytes=4, prototype=define
   Linker and Libraries:
     ld='i686-pc-linux-gnu-gcc', ldflags =' -L/usr/local/lib'
     libpth=/usr/local/lib /lib /usr/lib
     libs=-lpthread -lnsl -lndbm -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
     perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
     libc=/lib/libc-2.3.5.so, so=so, useshrplib=false, libperl=libperl.a
     gnulibc_version='2.3.5'
   Dynamic Linking:
     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
     cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:



@INC for perl v5.8.7:
     /etc/perl
     /usr/lib/perl5/site_perl/5.8.7/i686-linux
     /usr/lib/perl5/site_perl/5.8.7
     /usr/lib/perl5/site_perl/5.8.5
     /usr/lib/perl5/site_perl/5.8.5/i686-linux
     /usr/lib/perl5/site_perl/5.8.6
     /usr/lib/perl5/site_perl/5.8.6/i686-linux
     /usr/lib/perl5/site_perl
     /usr/lib/perl5/vendor_perl/5.8.7/i686-linux
     /usr/lib/perl5/vendor_perl/5.8.7
     /usr/lib/perl5/vendor_perl/5.8.5
     /usr/lib/perl5/vendor_perl/5.8.5/i686-linux
     /usr/lib/perl5/vendor_perl/5.8.6
     /usr/lib/perl5/vendor_perl/5.8.6/i686-linux
     /usr/lib/perl5/vendor_perl
     /usr/lib/perl5/5.8.7/i686-linux
     /usr/lib/perl5/5.8.7
     /usr/local/lib/site_perl
     .


Environment for perl v5.8.7:
     HOME=/home/bhaskell
     LANG (unset)
     LANGUAGE (unset)
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     PATH=/home/bhaskell/bin:/home/bhaskell/wn/bin:/usr/kde/3.4/bin:/bin:/usr/bin:/opt/bin:/usr/i686-pc-linux-gnu/gcc-bin/3.4.4:/opt/ati/bin:/opt/ghc/bin:/opt/blackdown-jdk-1.4.2.02/bin:/opt/blackdown-jdk-1.4.2.02/jre/bin:/usr/qt/3/bin:/usr/kde/3.4/bin:/usr/kde/3.3/bin:/usr/games/bin:/var/qmail/bin:/usr/cogsci/bin:/people/bhaskell/bin
     PERL_BADLANG (unset)
     SHELL=/bin/zsh

@p5pRT

p5pRT commented Feb 23, 2006

From perl@benizi.com

Still in 5.9.3 for i686-linux. (Tested that before I submitted, but
forgot to mention it).

@p5pRT

p5pRT commented Feb 23, 2006

perl@benizi.com - Status changed from 'new' to 'open'

@p5pRT

p5pRT commented Feb 24, 2006

From @andk

Looks like a hairy troll has jhidden for quite a while:)

----Program----
use strict;
use warnings;
use Encode qw/_utf8_on/;
for (split /:/, 'a:bc') {
  _utf8_on($_);
  my $p = join "", "$_ ", substr(lc($_), 0);
  print $p =~ /^(a a|bc bc)$/ ? "ok # $p\n" : "not ok # $p\n";
}

----Output of .../pHIziQK/perl-5.8.0@18529/bin/perl----
ok # a a
ok # bc bc

----EOF ($?='0')----
----Output of .../pCRFA94/perl-5.8.0@18530/bin/perl----
ok # a a
not ok # bc b

----EOF ($?='0')----

Change 18530 by hv@hv-crypt.org on 2003/01/21 01:37:03

  integrate (by hand) #18353 and #18359 from maint-5.8:

OK, maybe it helps to binary search along the maint-5.8 stretch...

----Program----
use strict;
use warnings;
use Encode qw/_utf8_on/;
for (split /:/, 'a:bc') {
  _utf8_on($_);
  my $p = join "", "$_ ", substr(lc($_), 0);
  print $p =~ /^(a a|bc bc)$/ ? "ok # $p\n" : "not ok # $p\n";
}

----Output of .../pAyq3oR/perl-5.8.0@18352/bin/perl----
ok # a a
ok # bc bc

----EOF ($?='0')----
----Output of .../pZpX8E8/perl-5.8.0@18353/bin/perl----
ok # a a
not ok # bc b

----EOF ($?='0')----

Change 18353 by jhi@lyta on 2002/12/26 02:07:06

  Introduce a cache for UTF-8 data: length and byte<->char mapping
  are stored in a new type of magic. Speeds up length(), substr(),
  index(), rindex(), pos(), and some parts of s///.
 
  The speedup varies a lot (on the usual suspects: what is the
  access pattern of the data, compiler, CPU), but should be at
  least one order of magnitude, and getting to the same magnitude
  as byte string speeds, and in some cases (length on unchanged data)
  even reaching the byte string speed. On the other hand, in some
  cases (index) the byte speed is still faster by a factor of five
  or so, but the bottleneck there does not seem to be any more
  the byte<->char mapping (instead, the fbm_instr() speed).
 
  There is one cache slot for the length, and only two for the
  byte<->char mapping (the first one for the start->offset,
  and the second for the offset->offset+length, when talking
  in substr() terms).
 
  Code this hairy is bound to have hairy trolls hiding under it.

--
andreas
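
The failure mode that the change description above points to, a cached length that outlives the value it was computed for, can be illustrated with a toy model. This is only a sketch in plain Perl: ToyScalar and all of its methods are hypothetical names invented for illustration, and nothing here reflects how perl's internals actually store the cache.

```perl
use strict;
use warnings;

# Toy model only: a "scalar" that caches its length. These names are
# invented for illustration and do not exist in perl's internals.
package ToyScalar;

sub new {
    my ($class) = @_;
    return bless { value => '', cached_len => undef }, $class;
}

# Correct store: drop the cached length whenever the value changes.
sub store {
    my ($self, $value) = @_;
    $self->{value}      = $value;
    $self->{cached_len} = undef;
    return $self;
}

# Buggy store: overwrite the value but leave the old cache in place.
sub store_without_invalidating {
    my ($self, $value) = @_;
    $self->{value} = $value;
    return $self;
}

# Length lookup that trusts the cache when one is present.
sub len {
    my ($self) = @_;
    $self->{cached_len} = length($self->{value})
        unless defined $self->{cached_len};
    return $self->{cached_len};
}

# "substr from offset 0 to the end", bounded by the (possibly stale) length.
sub tail_from_zero {
    my ($self) = @_;
    return substr($self->{value}, 0, $self->len);
}

package main;

# One object reused across calls, standing in for a reused scratch target.
my $scratch = ToyScalar->new;
my @buggy = map { $scratch->store_without_invalidating($_)->tail_from_zero } qw(a bc);
my @fixed = map { $scratch->store($_)->tail_from_zero } qw(a bc);

print "stale cache: @buggy\n";   # second item is cut to the first item's length
print "invalidated: @fixed\n";   # both items come back whole
```

In the model, the second value is truncated to the first value's length only when the store skips invalidation, which matches the shape of the "bc" becoming "b" symptom in the report.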

@p5pRT

p5pRT commented Feb 24, 2006

From @nwc10

On Fri, Feb 24, 2006 at 04:46:09AM +0100, Andreas J. Koenig wrote:

> Looks like a hairy troll has jhidden for quite a while:)
>
> Change 18353 by jhi@lyta on 2002/12/26 02:07:06
>
>   Introduce a cache for UTF-8 data: length and byte<->char mapping
>   are stored in a new type of magic. Speeds up length(), substr(),
>   index(), rindex(), pos(), and some parts of s///.

Thanks. This confirms my suspicion that it was the UTF-8 caching code
introduced with 5.8.1

As part of my TPF grant I'm going to look at all this, so if no-one else
beats me to finding the specific bug in the existing code, it will be
resolved in the next 3 months.

As a work around, I think that re-assigning the value to itself before the lc
or uc will clear the cache, and lc and uc will then give the correct answer.

Nicholas Clark
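
The two ways of avoiding the problem mentioned so far in this thread can be written side by side. This is a minimal sketch: the sub names are hypothetical, utf8::upgrade stands in for the report's _utf8_on, and the thread does not confirm that the self-assignment form actually sidesteps the bug on an affected perl (it is offered above only as "I think"). The reordered form is the one the original report lists as unaffected.

```perl
use strict;
use warnings;

# Form suggested above (unconfirmed in this thread): assign the value to
# itself before calling lc, in the hope of dropping any cached UTF-8 data.
sub lc_tail_reassign {
    my ($s) = @_;
    utf8::upgrade($s);   # force the UTF-8 internal representation
    $s = $s;             # "re-assigning the value to itself"
    return substr(lc($s), 0);
}

# Form the original report lists as unaffected: substr first, then lc.
sub lc_tail_reordered {
    my ($s) = @_;
    utf8::upgrade($s);
    return lc(substr($s, 0));
}

# Print input, then the result of each form, tab separated.
for my $word (qw(A BC)) {
    print join("\t", $word, lc_tail_reassign($word), lc_tail_reordered($word)), "\n";
}
```

On a perl without the bug, both columns show the full lowercased string, so the script is only informative when run on one of the affected versions.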

@p5pRT

p5pRT commented Feb 24, 2006

From BQW10602@nifty.com

On Fri, 24 Feb 2006 10:13:15 +0000, Nicholas Clark <nick@ccl4.org> wrote

> On Fri, Feb 24, 2006 at 04:46:09AM +0100, Andreas J. Koenig wrote:
>
> > Looks like a hairy troll has jhidden for quite a while:)
> >
> > Change 18353 by jhi@lyta on 2002/12/26 02:07:06
> >
> >   Introduce a cache for UTF-8 data: length and byte<->char mapping
> >   are stored in a new type of magic. Speeds up length(), substr(),
> >   index(), rindex(), pos(), and some parts of s///.
>
> Thanks. This confirms my suspicion that it was the UTF-8 caching code
> introduced with 5.8.1
>
> As part of my TPF grant I'm going to look at all this, so if no-one else
> beats me to finding the specific bug in the existing code, it will be
> resolved in the next 3 months.
>
> As a work around, I think that re-assigning the value to itself before the lc
> or uc will clear the cache, and lc and uc will then give the correct answer.

Should the magic on TARG be reset? (Or don't use TARG?)

ucfirst() also has this bug, when ulen != tculen (see pp_ucfirst).

for (split /:/, shift||"a:ßbc") {
  utf8::upgrade($_);
  print "$_\t", substr(ucfirst($_), 0), "\t$_\n";
}
__END__
a A a
ßbc S ßbc

cf. The result on Perl 5.8.0
a A a
ßbc Ssbc ßbc

postincrement $_++ is also buggy.

#!perl
use strict;
use warnings;
use Encode qw/_utf8_on/;
for (split /:/, shift||'a:bc') {
  _utf8_on($_);
  print "$_\t", substr($_++, 0), "\t$_\n";
}
__END__
a a b
bc b bd

cf. The result on Perl 5.8.0
a a b
bc bc bd

In contrast, preincrement ++$_ is good (pp_preinc doesn't use TARG).

#!perl
use strict;
use warnings;
use Encode qw/_utf8_on/;
for (split /:/, shift||'a:bc') {
  _utf8_on($_);
  print "$_\t", substr(++$_, 0), "\t$_\n";
}
__END__
a b b
bc bd bd

Regards
SADAHIRO Tomoyuki

@p5pRT

p5pRT commented Feb 24, 2006

From @jhi

Nicholas Clark wrote:

> On Fri, Feb 24, 2006 at 04:46:09AM +0100, Andreas J. Koenig wrote:
>
> > Looks like a hairy troll has jhidden for quite a while:)
> >
> > Change 18353 by jhi@lyta on 2002/12/26 02:07:06
> >
> >   Introduce a cache for UTF-8 data: length and byte<->char mapping
> >   are stored in a new type of magic. Speeds up length(), substr(),
> >   index(), rindex(), pos(), and some parts of s///.
>
> Thanks. This confirms my suspicion that it was the UTF-8 caching code
> introduced with 5.8.1
>
> As part of my TPF grant I'm going to look at all this, so if no-one else
> beats me to finding the specific bug in the existing code, it will be
> resolved in the next 3 months.

Not to put too much pressure on Sadahiro-san but he has traditionally
been able to fix all UTF-8 bugs extremely fast... Our brains must be
similarly wired. Or, rather, his brain is wired the right way to
*untangle* the spaghetti I left behind.

> As a work around, I think that re-assigning the value to itself before the lc
> or uc will clear the cache, and lc and uc will then give the correct answer.
>
> Nicholas Clark

@p5pRT

p5pRT commented Feb 25, 2006

From BQW10602@nifty.com

On Fri, 24 Feb 2006 19:15:53 +0200, Jarkko Hietaniemi <jhietaniemi@gmail.com> wrote

> Nicholas Clark wrote:
>
> > On Fri, Feb 24, 2006 at 04:46:09AM +0100, Andreas J. Koenig wrote:
> >
> > > Looks like a hairy troll has jhidden for quite a while:)
> > >
> > > Change 18353 by jhi@lyta on 2002/12/26 02:07:06
> > >
> > >   Introduce a cache for UTF-8 data: length and byte<->char mapping
> > >   are stored in a new type of magic. Speeds up length(), substr(),
> > >   index(), rindex(), pos(), and some parts of s///.
> >
> > Thanks. This confirms my suspicion that it was the UTF-8 caching code
> > introduced with 5.8.1
> >
> > As part of my TPF grant I'm going to look at all this, so if no-one else
> > beats me to finding the specific bug in the existing code, it will be
> > resolved in the next 3 months.
>
> Not to put too much pressure on Sadahiro-san but he has traditionally
> been able to fix all UTF-8 bugs extremely fast... Our brains must be
> similarly wired. Or, rather, his brain is wired the right way to
> *untangle* the spaghetti I left behind.

However this bug against *other magics* seemed to exist even in
perl 5.6.1, as shown below for m//g in scalar context.

SvSETMAGIC(sv) does not affect TARG if sv != TARG.

Regards,
SADAHIRO Tomoyuki

The changes of pp.c are for pp_ucfirst, pp_uc, and pp_lc.

Inline Patch
diff -ur perl-current@27323/pp.c perl/pp.c
--- perl-current@27323/pp.c	Sat Feb 25 09:41:08 2006
+++ perl/pp.c	Sat Feb 25 16:59:13 2006
@@ -3350,7 +3350,8 @@
 	    if (slen > ulen)
 	        sv_catpvn(TARG, (char*)(s + ulen), slen - ulen);
 	    SvUTF8_on(TARG);
-	    SETs(TARG);
+	    sv = TARG;
+	    SETs(sv);
 	}
 	else {
 	    s = (U8*)SvPV_force_nomg(sv, slen);
@@ -3402,7 +3403,8 @@
 	if (!len) {
 	    SvUTF8_off(TARG);				/* decontaminate */
 	    sv_setpvn(TARG, "", 0);
-	    SETs(TARG);
+	    sv = TARG;
+	    SETs(sv);
 	}
 	else {
 	    STRLEN min = len + 1;
@@ -3435,7 +3437,8 @@
 	    *d = '\0';
 	    SvUTF8_on(TARG);
 	    SvCUR_set(TARG, d - (U8*)SvPVX_const(TARG));
-	    SETs(TARG);
+	    sv = TARG;
+	    SETs(sv);
 	}
     }
     else {
@@ -3487,7 +3490,8 @@
 	if (!len) {
 	    SvUTF8_off(TARG);				/* decontaminate */
 	    sv_setpvn(TARG, "", 0);
-	    SETs(TARG);
+	    sv = TARG;
+	    SETs(sv);
 	}
 	else {
 	    STRLEN min = len + 1;
@@ -3540,7 +3544,8 @@
 	    *d = '\0';
 	    SvUTF8_on(TARG);
 	    SvCUR_set(TARG, d - (U8*)SvPVX_const(TARG));
-	    SETs(TARG);
+	    sv = TARG;
+	    SETs(sv);
 	}
     }
     else {
diff -ur perl-current@27323/t/op/lc.t perl/t/op/lc.t
--- perl-current@27323/t/op/lc.t	Tue Nov 08 00:50:29 2005
+++ perl/t/op/lc.t	Sat Feb 25 18:08:54 2006
@@ -6,7 +6,7 @@
     require './test.pl';
 }
 
-plan tests => 59;
+plan tests => 77;
 
 $a = "HELLO.* world";
 $b = "hello.* WORLD";
@@ -163,3 +163,39 @@
 	is($a, v10, "[perl #18857]");
     } 
 }
+
+
+# [perl #38619] Bug in lc and uc (interaction between UTF-8, substr, and lc/uc)
+
+for ("a\x{100}", "xyz\x{100}") {
+    is(substr(uc($_), 0), uc($_), "[perl #38619] uc");
+}
+for ("A\x{100}", "XYZ\x{100}") {
+    is(substr(lc($_), 0), lc($_), "[perl #38619] lc");
+}
+for ("a\x{100}", "ßyz\x{100}") { # ß to Ss (different length)
+    is(substr(ucfirst($_), 0), ucfirst($_), "[perl #38619] ucfirst");
+}
+
+# Related to [perl #38619]
+# the original report concerns PERL_MAGIC_utf8.
+# these cases concern PERL_MAGIC_regex_global.
+
+for (map { $_ } "a\x{100}", "abc\x{100}", "\x{100}") {
+    chop; # get ("a", "abc", "") in utf8
+    my $return =  uc($_) =~ /\G(.?)/g;
+    my $result = $return ? $1 : "not";
+    my $expect = (uc($_) =~ /(.?)/g)[0];
+    is($return, 1,       "[perl #38619]");
+    is($result, $expect, "[perl #38619]");
+}
+
+for (map { $_ } "A\x{100}", "ABC\x{100}", "\x{100}") {
+    chop; # get ("A", "ABC", "") in utf8
+    my $return =  lc($_) =~ /\G(.?)/g;
+    my $result = $return ? $1 : "not";
+    my $expect = (lc($_) =~ /(.?)/g)[0];
+    is($return, 1,       "[perl #38619]");
+    is($result, $expect, "[perl #38619]");
+}
+
##### END OF PATCH
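
The m//g case covered by the second half of the new tests can also be checked outside the core test harness. The following is a sketch under stated assumptions: it uses Test::More instead of the core ./test.pl, builds the UTF-8 strings with utf8::upgrade rather than the patch's chop trick, and the test descriptions are made up here. It checks that a scalar-context /\G.../g match on the value returned by uc starts at the beginning of the string on every iteration, rather than at a match position left over from the previous one.

```perl
use strict;
use warnings;
use Test::More tests => 6;

for my $plain ("a", "abc", "") {
    my $str = $plain;
    utf8::upgrade($str);                     # force the UTF-8 internal form

    # Scalar-context m//g applied directly to uc's return value.
    my $matched = uc($str) =~ /\G(.?)/g;
    my $got     = $matched ? $1 : "no match";

    # With no stale match position, \G anchors at offset 0, so the capture
    # is the first character of the uppercased string (or empty).
    my $want = substr(uc($plain), 0, 1);

    ok($matched, "m//g on uc() result matches for '$plain'");
    is($got, $want, "capture starts at offset 0 for '$plain'");
}
```

The loop matters: the same uc call is executed three times, which is the situation in which state attached to its return value could carry over from one iteration to the next.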

@p5pRT

p5pRT commented Feb 25, 2006

From @nwc10

On Sat, Feb 25, 2006 at 06:16:45PM +0900, SADAHIRO Tomoyuki wrote:

> > Not to put too much pressure on Sadahiro-san but he has traditionally
> > been able to fix all UTF-8 bugs extremely fast... Our brains must be

:-)

> However this bug against *other magics* seemed to exist even in
> perl 5.6.1, as shown below for m//g in scalar context.
>
> SvSETMAGIC(sv) does not affect TARG if sv != TARG.

D'oh!

Thanks, applied (change 27329)

Nicholas Clark

@p5pRT

p5pRT commented May 21, 2006

@iabyn - Status changed from 'open' to 'resolved'
