approx. 10 times faster utf8 string operations #7021

p5pRT · 2004-01-06T18:54:00Z

Migrated from rt.perl.org#24826 (status was 'resolved')

Searchable as RT24826$

p5pRT · 2004-01-06T18:54:29Z

From roal@anet.at

This is a bug report for perl from roal@anet.at,
generated with the help of perlbug 1.34 running under perl v5.8.2.

The perlunicode pod says:

In Perl 5.8.0 the slowness was often quite spectacular;
in Perl 5.8.1 a caching scheme was introduced which will hopefully make the
slowness somewhat less spectacular, at least for some operations. In general,
operations with UTF-8 encoded strings are still slower.

Regular expressions have always been what Perl is famous for, and are certainly
one reason for Perl's name, the Practical Extraction and Report Language.
Unfortunately, they are still not efficient on strings that are
flagged as UTF-8.

I have investigated this and found a way to make regular expression operations
(including case-insensitive matching) and lower- and uppercasing on UTF-8 encoded
strings about 10 times faster than before.

Below is the test script I used to measure the effect on a simple pure
ASCII string. I got the following results:

On Perl 5.8.0, case-insensitive searches on utf8 strings are always extremely slow.
The patched files improve performance only a little
(524 s -> 474 s on a Windows machine).

On Perl 5.8.2, performance is much better, but still very poor by default.
With the patched files it improves dramatically
(77 s -> 9 s on the same Windows machine and 775 s -> 66 s on a slower BSD/OS system).
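
As a quick check of the "about 10 times" figure, the speedup factors can be computed directly from the timings quoted above. This is only a minimal sketch: the hash layout, the labels and the `speedup` helper are illustrative names, not part of the original test script.

```perl
use strict;
use warnings;

# [before, after] wall-clock seconds for the utf8 run, copied from this report.
my %timings = (
    'MSWin32, Perl 5.8.0' => [524, 474],
    'MSWin32, Perl 5.8.2' => [ 77,   9],
    'bsdos, Perl 5.8.2'   => [775,  66],
);

# Speedup factor = time before the patch / time after the patch.
sub speedup {
    my ($before, $after) = @_;
    return $before / $after;
}

for my $label (sort keys %timings) {
    printf "%-20s %.1fx faster\n", $label, speedup(@{ $timings{$label} });
}
```

The factors come out at roughly 1.1 for Perl 5.8.0 and roughly 8.6 (MSWin32) and 11.7 (bsdos) for Perl 5.8.2, so "about 10 times" applies to 5.8.2 only.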

Save the code given below as "utf8.pl" and run it with

perl utf8.pl

to get test results like those shown below, or pass a different multiplier for building the test string,
for example 10000 (written as 1e4):

perl utf8.pl 1e4

My results were:

with default Perl 5.8.2:

UTF-8 Performance Test on MSWin32 with Perl 5.8.2
Tue Jan 6 16:50:55 2004: Test String created: pure ASCII 'A-Za-z' x 50000 (2.5 MB)

String is now treated as bytes
Tue Jan 6 16:50:55 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 16:50:59 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 4 seconds

String is now treated as utf8
Tue Jan 6 16:51:55 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 16:52:16 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 77 seconds

Switching to utf8 semantics required the following additional files to load:
unicore/Canonical.pl
unicore/Exact.pl
unicore/To/Fold.pl
unicore/To/Lower.pl
unicore/To/Upper.pl
unicore/lib/Word.pl
utf8.pm
utf8_heavy.pl
--

UTF-8 Performance Test on bsdos with Perl 5.8.2
Tue Jan 6 06:33:46 2004: Test String created: pure ASCII 'A-Za-z' x 50000 (2.5 MB)

String is now treated as bytes
Tue Jan 6 06:33:47 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 06:34:16 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 30 seconds

String is now treated as utf8
Tue Jan 6 06:45:21 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 06:47:11 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 775 seconds

with Perl 5.8.2, after the patch:

UTF-8 Performance Test on MSWin32 with Perl 5.8.2
Tue Jan 6 16:54:24 2004: Test String created: pure ASCII 'A-Za-z' x 50000 (2.5 MB)

String is now treated as bytes
Tue Jan 6 16:54:24 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 16:54:28 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 4 seconds

String is now treated as utf8
Tue Jan 6 16:54:30 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 16:54:37 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 9 seconds
--

UTF-8 Performance Test on bsdos with Perl 5.8.2
Tue Jan 6 06:52:35 2004: Test String created: pure ASCII 'A-Za-z' x 50000 (2.5 MB)

String is now treated as bytes
Tue Jan 6 06:52:36 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 06:53:05 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 30 seconds

String is now treated as utf8
Tue Jan 6 06:53:14 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 06:54:11 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 66 seconds

with default Perl 5.8.0:

UTF-8 Performance Test on MSWin32 with Perl 5.8.0
Tue Jan 6 16:56:56 2004: Test String created: pure ASCII 'A-Za-z' x 50000 (2.5 MB)

String is now treated as bytes
Tue Jan 6 16:56:56 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 16:57:00 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 4 seconds

String is now treated as utf8
Tue Jan 6 17:05:24 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 17:05:44 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 524 seconds

with Perl 5.8.0, after the patch:

UTF-8 Performance Test on MSWin32 with Perl 5.8.0
Tue Jan 6 17:07:14 2004: Test String created: pure ASCII 'A-Za-z' x 50000 (2.5 MB)

String is now treated as bytes
Tue Jan 6 17:07:14 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 17:07:19 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 5 seconds

String is now treated as utf8
Tue Jan 6 17:15:05 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 17:15:13 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 474 seconds

The workaround used for the patch:

Entirely remove the '%utf8::ToSpecFUNCTION = (...)' definition from 'unicore/To/FUNCTION.pl',
where FUNCTION stands for 'Fold', 'Lower' and 'Upper'. That is,

from 'unicore/To/Fold.pl' -> remove '%utf8::ToSpecFold = (...)'
from 'unicore/To/Lower.pl' -> remove '%utf8::ToSpecLower = (...)'
from 'unicore/To/Upper.pl' -> remove '%utf8::ToSpecUpper = (...)'

As soon as the variable name '%utf8::ToSpecFUNCTION' is used anywhere in Perl code, some black magic
turns on and causes the horrible slowdown on utf8 strings. %utf8::ToSpecFUNCTION does not even need
to be defined to "enable" that "black magic"!

This is just a workaround, but a very effective one. I guess that the real solution to the problem
is somewhere within 'utf8.c', which contains this comment:

The "special" is a string like "utf8::ToSpecLower", which means the
hash %utf8::ToSpecLower. The access to the hash is through
Perl_to_utf8_case().

Further investigation of that C source may reveal the cause.

best,
rob.

=cut

######## start 'utf8.pl' test script ########
# Robert Allerstorfer 2004 01 06
# applies to Perl 5.8.2
#
$^W = 1;
use strict;
use 5.008;
require Encode;
# Encode is required because even in Perl 5.8.2 there is nothing like utf8::_utf8_on($octets)

# Optional first argument: how many times to repeat 'A-Za-z' (default 50000).
my $multiply = @ARGV ? shift(@ARGV) + 0 : 0;
$multiply ||= 5e4;

printf "UTF-8 Performance Test on $^O with Perl %vd\n", $^V;
my $string = join("", ('A'..'Z', 'a'..'z')) x $multiply;
($_) = now();
print "$_: Test String created: pure ASCII 'A-Za-z' x $multiply (";
print int((length($string) / 1024**2 * 10) + .5) / 10, " MB)\n\n";

# First pass: the string is plain bytes (UTF-8 flag off).
utf8::encode($string);
(undef, my $t0) = now();
search($string);
my @inckeys0 = keys %INC;
(undef, my $t1) = now();
print "Required time: ", $t1 - $t0, " seconds\n\n";

# Second pass: same bytes, but with the UTF-8 flag switched on.
Encode::_utf8_on($string);
search($string);
(undef, my $t2) = now();
print "Required time: ", $t2 - $t1, " seconds\n\n";

# Report which files were loaded only for the utf8 pass.
print "Switching to utf8 semantics required the following additional files to load:\n\t";
my %seen;
@seen{@inckeys0} = ();
my @newinckeys;
foreach (keys %INC) {
    push @newinckeys, $_
        unless exists $seen{$_};
}
print "$_\n\t" foreach (sort @newinckeys);
print "\n";
exit 0;

# Return the current time as (human-readable string, epoch seconds).
sub now {
    my $t = time;
    return scalar localtime($t), $t;
}

# Run a case-insensitive substitution and a per-character lc/uc count.
sub search {
    my $string = shift;
    print "String is now treated as ";
    my $utf8_flag = $] < 5.008001 ? Encode::is_utf8($string) : utf8::is_utf8($string);
    print $utf8_flag ? "utf8" : "bytes", "\n";

    my $term = "abc";
    my $matches = $string =~ s/($term)/$1/gi;
    ($_) = now();
    print "$_: $matches case-insensitive occurencies of '$term' found in String\n";

    my ($lc, $uc) = (0, 0);
    while ($string =~ /(\w)/g) {
        if ($1 eq lc $1) {
            $lc++;
        } elsif ($1 eq uc $1) {
            $uc++;
        }
    }
    ($_) = now();
    print "$_: $lc lowercase and $uc uppercase characters found in String\n";
}

__END__
#
######## end of script ########

Flags:
category=core
severity=high

Site configuration information for perl v5.8.2:

Configured by roal at Fri Dec 19 04:41:37 EST 2003.

Summary of my perl5 (revision 5.0 version 8 subversion 2) configuration:
Platform:
osname=bsdos, osvers=4.2, archname=i386-bsdos
uname='bsdos bsdos.anet.at 4.2 bsdi bsdos 4.2 kernel #0: wed oct 25 17:38:20 mdt 2000 polk@hephaestus.bsdi.com:mntproto4.2-i386usrsrcsyscompilegeneric i386 '
config_args='-es -Duseshrplib -Adefine:libperl=p2x582.so -Dccdlflags=-Wl,-rpath,. -Ud_procselfexe -Uinstallusrbinperl -Dinstallprefix=~~/perl -Dprefix=~~/perl -Dcf_email=info@anet.at'
hint=recommended, useposix=true, d_sigaction=define
usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-fno-strict-aliasing -I/usr/local/include',
optimize='-O2',
cppflags='-fno-strict-aliasing -I/usr/local/include'
ccversion='', gccversion='2.95.2 19991024 (release)', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='ld', ldflags =' -L/usr/X11/lib -L/usr/local/lib'
libpth=~/usr/lib /usr/lib /usr/local/lib /usr/shlib /shlib /lib /usr/X11/lib
libs=-lutil -lbind -ldl -lm -lc
perllibs=-lutil -lbind -ldl -lm -lc
libc=/shlib/libc.so, so=so, useshrplib=true, libperl=p2x582.so
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-rpath,. -Wl,-rpath,/usr/home/roal/perl/lib/5.8.2/i386-bsdos/CORE'
cccdlflags='-fPIC', lddlflags='-shared -x -L/usr/X11/lib -L/usr/local/lib'

Locally applied patches:
ACTIVEPERL_LOCAL_PATCHES_ENTRY
21846 Configure gets d_u32align wrong
21739 [perl #24493] install.html not working
21737 Ooops. left an XXX comment in, and worse still it's a // comment
21735 utf8 keys now work for tied hashes
21734 Accessing unicode keys in tie hashes via hv_exists was broken
21733 ext/threads/t/problem.t
21732 Config::myconfig() fails under ithreads
21728 Update perlhist with 5.6.2
21723 Include 'SCCS' in the list of dir names ignored by installperl
21718 Empty subroutine as object method segfaults in 5.8.2 (sometimes)
21714 Fix bug #24380: assigning list with duplicated keys to a hash
21706 [perl #24460] [DOC PATCH] the begincheck program
21693 must copy changes from win32/makeifle.mk to wince/makefile.ce
21691 Update the list of pumpkings in perlhist.pod
21687 [PATCH 5.6.2-RC1 pod/perlhist.pod] Updated
21677 OS/2 docu
21676 Bug #24407: key for shared hash got stringified into wrong pool
21673 Be sure to use -fPIC not -fpic on Linux/SPARC
21672 extending the hash attack test
21671 Benchmark.pm cmpthese segfault
21662 'make minitest' fails for op/cproto and op/pat
21586 Comment that this 'optimisation' is actually a necessary fixup
21548 Sync with Pod::Perldoc 3.12
21540 Fix backward-compatibility issues in if.pm

@INC for perl v5.8.2:
/usr/home/roal/perl/lib/5.8.2/i386-bsdos
/usr/home/roal/perl/lib/5.8.2
/usr/home/roal/perl/lib/site_perl/5.8.2/i386-bsdos
/usr/home/roal/perl/lib/site_perl/5.8.2
/usr/home/roal/perl/lib/site_perl
.

Environment for perl v5.8.2:
HOME=/usr/home/roal
LANG (unset)
LANGUAGE (unset)
LC_CTYPE=ISO8859-1
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=/usr/home/roal/bin:/bin:/usr/bin:/usr/X11/bin:/usr/contrib/bin:/usr/contrib/mh/bin:/usr/games:/usr/local/bin
PERL_BADLANG (unset)
SHELL=/bin/bash

p5pRT · 2004-01-06T19:24:24Z

From roal@anet.at

The test script to download as an attachment

p5pRT · 2004-01-06T19:24:24Z

From roal@anet.at

utf8.pl

p5pRT · 2004-01-06T19:24:27Z

roal@anet.at - Status changed from 'new' to 'open'

p5pRT · 2004-01-06T21:13:18Z

roal@anet.at - Status changed from 'open' to 'new'

p5pRT · 2004-01-24T23:44:29Z

From perlbug-followup@perl.org

p5pRT · 2004-01-24T23:44:31Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2004-01-24T23:45:06Z

@rspier - Status changed from 'open' to 'new'

p5pRT · 2004-03-03T07:38:16Z

From @jhi

This was recently brought up in perl-unicode@perl.org:
http://bugs6.perl.org/rt3/Ticket/Display.html?id=24826
The proposed analysis that the funky %utf8::ToSpecFoo tables in lib/unicore/To/Foo.pl
were part of the puzzle was right; the proposed cure of removing them was not quite
(that would make parts of the tests op/lc and op/pat, and almost all of uni/*, fail).

The problem was that in utf8.c:to_utf8_case(), for each /i character match and for
each lc/uc/lcfirst/ucfirst, a throw-away SV was created and a sprintf() made into that SV,
even when the Unicode code point (the character ordinal, in case Unicodese is Greek to you)
had no chance of having any special casings [1]. Ouch.

Now the special casings are checked only if there is a chance they will be needed
(either the code point is U+00DF or it is higher than 0xFF [2]), and even when they
are checked, no SV is created but instead the bytes of the UTF-8 encoding of the code
point are used directly as the hash key.

The speed improvements for /i and the lc/etc are significant, a factor of about 5-10.
I will wear a dunce cap for the rest of today, while you may apply the attached patch.

[1] Take a look at lib/unicore/CaseFoldings.txt, lib/unicore/SpecCase.txt, and
http://www.unicode.org/unicode/reports/tr18/, especially
http://www.unicode.org/unicode/reports/tr18/#Default_Loose_Matches
and realize that Unicodese "loose matches" is our /i.

[2] Glaring at the casing data, it seems that 0x12F would work too: after 0xDF the next
specially casing code point is 0x130, if I didn't miss anything.
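
The gating described in this message can be sketched in Perl. This is not the C code from utf8.c or the attached patch; it is a minimal illustration under stated assumptions. The names `utf8_bytes`, `may_have_special_casing`, `upper_cp` and `%special_upper` are hypothetical, and the table holds a single illustrative entry (U+00DF uppercases to "SS") instead of the real %utf8::ToSpecUpper data.

```perl
use strict;
use warnings;

# Return the UTF-8 encoding of a code point as a byte string.
sub utf8_bytes {
    my ($cp) = @_;
    my $s = chr($cp);
    utf8::encode($s);
    return $s;
}

# Cheap pre-check from the message: only U+00DF and code points
# above 0xFF can have special casings.
sub may_have_special_casing {
    my ($cp) = @_;
    return $cp == 0xDF || $cp > 0xFF;
}

# Hypothetical stand-in for a %utf8::ToSpecUpper-style table, keyed by
# the UTF-8 bytes of the code point. One illustrative entry only.
my %special_upper = (
    utf8_bytes(0xDF) => "SS",
);

my $lookups = 0;    # counts how often the special table is consulted

# Uppercase a single code point, consulting the special table only
# when the pre-check says it might matter.
sub upper_cp {
    my ($cp) = @_;
    if (may_have_special_casing($cp)) {
        $lookups++;
        my $special = $special_upper{ utf8_bytes($cp) };
        return $special if defined $special;
    }
    return uc chr $cp;
}

print upper_cp(ord 'a'), " after $lookups table lookups\n";
print upper_cp(0xDF), " after $lookups table lookups\n";
```

For pure ASCII input such as the test string in the report, the table is never consulted, which is where the saving comes from in this sketch.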

p5pRT · 2004-03-03T07:38:16Z

From @jhi

casing.pat.bz2

p5pRT · 2004-03-03T07:38:16Z

From @jhi

--
Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen

p5pRT · 2004-03-03T07:38:19Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2004-03-03T22:16:30Z

From @rgs

Jarkko Hietaniemi wrote:

> Now the special casings are checked only if there is a chance they will be needed
> (either the code point is U+00DF or it is higher than 0xFF [2]), and even when they
> are checked, no SV is created but instead the bytes of the UTF-8 encoding of the code
> point are used directly as the hash key.
>
> The speed improvements for /i and the lc/etc are significant, a factor of about 5-10.
> I will wear a dunce cap for the rest of today, while you may apply the attached patch.

Which is now done, thanks, to blead, as #22427.

p5pRT · 2004-03-07T20:24:53Z

From @nwc10

On Wed, Mar 03, 2004 at 09:37:21AM +0200, Jarkko Hietaniemi wrote:

> I will wear a dunce cap for the rest of today, while you may apply the attached patch.

Er, yes, but thanks for digging into this. I fear that no-one currently
subscribed to perl5-porters actually understands the intricacies of the
implementation of perl's Unicode support.

Nicholas Clark

p5pRT · 2004-03-07T21:21:23Z

From @jhi

>> I will wear a dunce cap for the rest of today, while you may apply the attached patch.
>
> Er, yes, but thanks for digging into this. I fear that no-one currently
> subscribed to perl5-porters actually understands the intricacies of the
> implementation of perl's Unicode support.

You speak as if I understood those :-)

Sadahiro Tomoyuki and Inaba Hiroto used to have a very good handle on things
Unicode, and Dan Kogai of course can deal with anything Encode. Hey, I sense
a certain trend there...

> Nicholas Clark

--
Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen

p5pRT · 2004-03-08T10:19:20Z

From roal@anet.at

On Sun, 7 Mar 2004, 20:24 GMT+00 (21:24 local time) Nicholas Clark wrote:

> Er, yes, but thanks for digging into this. I fear that no-one currently
> subscribed to perl5-porters actually understands the intricacies of the
> implementation of perl's Unicode support.

Larry gave an interesting insight into handling Unicode in Perl 6 vs.
Perl 5 in this post:

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&frame=right&th=732a1f27a1510614&seekm=20040303075022.GA8915%40wall.org#link12

which was actually the first response I received to the report of the
utf8 bug that Jarkko has fixed so quickly. Thanks Larry! :-) I will
be very happy to see that (hopefully) impressive speedup in Perl
5.8.4, since unfortunately it didn't go into 5.8.3.

best,
rob.

p5pRT · 2004-03-24T23:34:10Z

@iabyn - Status changed from 'open' to 'resolved'

p5pRT closed this as completed Mar 24, 2004

p5pRT added Severity High distro-All type-core labels Oct 18, 2019
