approx. 10 times faster utf8 string operations #7021

Closed
p5pRT opened this issue Jan 6, 2004 · 17 comments

@p5pRT

p5pRT commented Jan 6, 2004

Migrated from rt.perl.org#24826 (status was 'resolved')

Searchable as RT24826$

@p5pRT

p5pRT commented Jan 6, 2004

From roal@anet.at

This is a bug report for perl from roal@anet.at,
generated with the help of perlbug 1.34 running under perl v5.8.2.

The perlunicode pod says

  In Perl 5.8.0 the slowness was often quite spectacular;
  in Perl 5.8.1 a caching scheme was introduced which will hopefully make the
  slowness somewhat less spectacular, at least for some operations. In general,
  operations with UTF-8 encoded strings are still slower.

Regular expressions have always been what Perl is famous for, and they are certainly
one reason for Perl's name, the Practical Extraction and Report Language.
Unfortunately, they are still very inefficient on strings that are
flagged as UTF-8.

I have investigated this and found a way to make regular expression operations
involving case-insensitivity, as well as lower- and uppercasing, about 10 times
faster than before on UTF-8 encoded strings.

Below is the test script I used to measure the effect, on a simple pure
ASCII string. I got the following results:

On Perl 5.8.0, case-insensitive searches on utf8 strings are always extremely slow!
The patched files improve the performance only a little
(524 s -> 474 s on a Windows machine).

On Perl 5.8.2, the performance is much better, but still very poor by default!
Fortunately, with the patched files the performance really speeds up!!!
(77 s -> 9 s on the same Windows machine and 775 s -> 66 s on a slower BSD/OS system).
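
For reference, the speedup factors implied by the timings quoted above can be worked out with a few lines of Perl. This is only a sketch: the numbers are the ones reported in this ticket, while the helper name speedup() and the labels are made up for illustration.

```perl
use strict;
use warnings;

# utf8-pass timings in seconds as reported in this ticket: [label, before, after]
my @runs = (
    [ 'MSWin32, Perl 5.8.2', 77,  9   ],
    [ 'bsdos, Perl 5.8.2',   775, 66  ],
    [ 'MSWin32, Perl 5.8.0', 524, 474 ],
);

# Hypothetical helper: how many times faster the patched run is
sub speedup {
    my ($before, $after) = @_;
    return $before / $after;
}

for my $run (@runs) {
    my ($label, $before, $after) = @$run;
    printf "%-20s %3d s -> %3d s  (%.1fx faster)\n",
        $label, $before, $after, speedup($before, $after);
}
```

That works out to roughly 8.6x and 11.7x on Perl 5.8.2, but only about 1.1x on Perl 5.8.0.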

Save the code given below as "utf8.pl" and run it by executing

  perl utf8.pl

to get test results like those shown below, or pass a different multiplier for building the test string,
for example 10000 (which is equal to 1e4):

  perl utf8.pl 1e4

My results were:

with default Perl 5.8.2:

UTF-8 Performance Test on MSWin32 with Perl 5.8.2
Tue Jan 6 16:50:55 2004: Test String created: pure ASCII 'A-Za-z' x 50000 (2.5 MB)

String is now treated as bytes
Tue Jan 6 16:50:55 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 16:50:59 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 4 seconds

String is now treated as utf8
Tue Jan 6 16:51:55 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 16:52:16 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 77 seconds

Switching to utf8 semantics required the following additional files to load:
  unicore/Canonical.pl
  unicore/Exact.pl
  unicore/To/Fold.pl
  unicore/To/Lower.pl
  unicore/To/Upper.pl
  unicore/lib/Word.pl
  utf8.pm
  utf8_heavy.pl
--

UTF-8 Performance Test on bsdos with Perl 5.8.2
Tue Jan 6 06:33:46 2004: Test String created: pure ASCII 'A-Za-z' x 50000 (2.5 MB)

String is now treated as bytes
Tue Jan 6 06:33:47 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 06:34:16 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 30 seconds

String is now treated as utf8
Tue Jan 6 06:45:21 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 06:47:11 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 775 seconds

with Perl 5.8.2, after the patch:

UTF-8 Performance Test on MSWin32 with Perl 5.8.2
Tue Jan 6 16:54:24 2004: Test String created: pure ASCII 'A-Za-z' x 50000 (2.5 MB)

String is now treated as bytes
Tue Jan 6 16:54:24 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 16:54:28 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 4 seconds

String is now treated as utf8
Tue Jan 6 16:54:30 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 16:54:37 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 9 seconds
--

UTF-8 Performance Test on bsdos with Perl 5.8.2
Tue Jan 6 06:52:35 2004: Test String created: pure ASCII 'A-Za-z' x 50000 (2.5 MB)

String is now treated as bytes
Tue Jan 6 06:52:36 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 06:53:05 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 30 seconds

String is now treated as utf8
Tue Jan 6 06:53:14 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 06:54:11 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 66 seconds

with default Perl 5.8.0:

UTF-8 Performance Test on MSWin32 with Perl 5.8.0
Tue Jan 6 16:56:56 2004: Test String created: pure ASCII 'A-Za-z' x 50000 (2.5 MB)

String is now treated as bytes
Tue Jan 6 16:56:56 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 16:57:00 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 4 seconds

String is now treated as utf8
Tue Jan 6 17:05:24 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 17:05:44 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 524 seconds

with Perl 5.8.0, after the patch:

UTF-8 Performance Test on MSWin32 with Perl 5.8.0
Tue Jan 6 17:07:14 2004: Test String created: pure ASCII 'A-Za-z' x 50000 (2.5 MB)

String is now treated as bytes
Tue Jan 6 17:07:14 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 17:07:19 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 5 seconds

String is now treated as utf8
Tue Jan 6 17:15:05 2004: 100000 case-insensitive occurencies of 'abc' found in String
Tue Jan 6 17:15:13 2004: 1300000 lowercase and 1300000 uppercase characters found in String
Required time: 474 seconds

The solution (the patch):


Entirely remove the '%utf8::ToSpecFUNCTION = (...)' definition from 'unicore/To/FUNCTION.pl',
where FUNCTION stands for 'Fold', 'Lower' or 'Upper'. That is,

from 'unicore/To/Fold.pl' -> remove '%utf8::ToSpecFold = (...)'
from 'unicore/To/Lower.pl' -> remove '%utf8::ToSpecLower = (...)'
from 'unicore/To/Upper.pl' -> remove '%utf8::ToSpecUpper = (...)'

As soon as the variable name '%utf8::ToSpecFUNCTION' is merely used anywhere in Perl code, some black magic
kicks in and causes the horrible slowdown on utf8 strings! %utf8::ToSpecFUNCTION does not even need
to be defined to "enable" that "black magic"!

This is just a workaround, but a very effective one. I guess that the real solution to the problem
is somewhere within 'utf8.c', which contains

  The "special" is a string like "utf8::ToSpecLower", which means the
  hash %utf8::ToSpecLower. The access to the hash is through
  Perl_to_utf8_case().

Further investigation of that C source may reveal the reason for this.

best,
rob.

=cut

######## start 'utf8.pl' test script ########
# Robert Allerstorfer 2004 01 06
# applies to Perl 5.8.2
#
$^W = 1;
use strict;
use 5.008;
require Encode;
# Encode is required because even in Perl 5.8.2 there is nothing like utf8::_utf8_on($octets)

# Optional first argument: how often 'A-Za-z' is repeated (default 5e4)
my $multiply = @ARGV ? shift(@ARGV) + 0 : 0;
$multiply ||= 5e4;

printf "UTF-8 Performance Test on $^O with Perl %vd\n", $^V;
my $string = join ("", ('A'..'Z', 'a'..'z')) x $multiply;
($_) = &now;
print "$_: Test String created: pure ASCII 'A-Za-z' x $multiply (";
print int((length($string) / 1024**2 * 10) + .5) / 10, " MB)\n\n";

# First pass: the string carries no UTF-8 flag (byte semantics)
utf8::encode($string);
(undef, my $t0) = &now;
&search($string);
my @inckeys0 = keys %INC;
(undef, my $t1) = &now;
print "Required time: ", $t1 - $t0, " seconds\n\n";

# Second pass: same octets, but with the UTF-8 flag switched on
Encode::_utf8_on($string);
&search($string);
(undef, my $t2) = &now;
print "Required time: ", $t2 - $t1, " seconds\n\n";

# Report the files that were loaded only for the utf8 pass
print "Switching to utf8 semantics required the following additional files to load:\n\t";
my %seen;
@seen{@inckeys0} = ();
my @newinckeys;
foreach (keys %INC) {
  push @newinckeys, $_
    unless exists $seen{$_};
}
print "$_\n\t" foreach (sort @newinckeys);
print "\n";
exit 0;

# Returns the current time as (human-readable string, epoch seconds)
sub now {
  my $t = time;
  return scalar localtime($t), $t;
}

sub search {
  my $string = shift;
  print "String is now treated as ";
  my $utf8_flag = $] < 5.008001 ? Encode::is_utf8($string) : utf8::is_utf8($string);
  print $utf8_flag ? "utf8" : "bytes", "\n";

  # Case-insensitive matching: s///gi returns the number of substitutions
  my $term = "abc";
  my $matches = $string =~ s/($term)/$1/gi;
  ($_) = &now;
  print "$_: $matches case-insensitive occurencies of '$term' found in String\n";

  # lc/uc on every word character
  my ($lc, $uc) = (0, 0);
  while ($string =~ /(\w)/g) {
    if ($1 eq lc $1) {
      $lc ++;
    } elsif ($1 eq uc $1) {
      $uc ++;
    }
  }
  ($_) = &now;
  print "$_: $lc lowercase and $uc uppercase characters found in String\n";
}

__END__
#
######## end of script ########
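
The two passes of the script differ only in the string's internal UTF-8 flag. A much smaller sketch of that toggle, using the same Encode::_utf8_on() call on a 52-character string (the variable names here are illustrative and not part of utf8.pl):

```perl
use strict;
use warnings;
require Encode;

my $string = join "", ('A'..'Z', 'a'..'z');
utf8::encode($string);                        # plain octets, UTF-8 flag off
my $flag_off = utf8::is_utf8($string) ? 1 : 0;

my $copy = $string;
Encode::_utf8_on($copy);                      # same octets, UTF-8 flag on
my $flag_on = utf8::is_utf8($copy) ? 1 : 0;

# For pure ASCII data the flag changes neither the content nor the matches
my $same       = ($copy eq $string) ? 1 : 0;
my $hits_bytes = () = $string =~ /abc/gi;     # counts 'ABC' and 'abc'
my $hits_utf8  = () = $copy   =~ /abc/gi;

print "flag off: $flag_off, flag on: $flag_on, equal: $same\n";
print "matches as bytes: $hits_bytes, matches as utf8: $hits_utf8\n";
```

Since the content and the match counts are identical either way, any timing difference between the two passes comes from how perl handles flagged strings, not from the data.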


Flags:
  category=core
  severity=high


Site configuration information for perl v5.8.2:

Configured by roal at Fri Dec 19 04:41:37 EST 2003.

Summary of my perl5 (revision 5.0 version 8 subversion 2) configuration:
  Platform:
    osname=bsdos, osvers=4.2, archname=i386-bsdos
    uname='bsdos bsdos.anet.at 4.2 bsdi bsdos 4.2 kernel #0: wed oct 25 17:38:20 mdt 2000 polk@hephaestus.bsdi.com:mntproto4.2-i386usrsrcsyscompilegeneric i386 '
    config_args='-es -Duseshrplib -Adefine:libperl=p2x582.so -Dccdlflags=-Wl,-rpath,. -Ud_procselfexe -Uinstallusrbinperl -Dinstallprefix=/perl -Dprefix=/perl -Dcf_email=info@anet.at'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -I/usr/local/include',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='2.95.2 19991024 (release)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='ld', ldflags =' -L/usr/X11/lib -L/usr/local/lib'
    libpth=~/usr/lib /usr/lib /usr/local/lib /usr/shlib /shlib /lib /usr/X11/lib
    libs=-lutil -lbind -ldl -lm -lc
    perllibs=-lutil -lbind -ldl -lm -lc
    libc=/shlib/libc.so, so=so, useshrplib=true, libperl=p2x582.so
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-rpath,. -Wl,-rpath,/usr/home/roal/perl/lib/5.8.2/i386-bsdos/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -x -L/usr/X11/lib -L/usr/local/lib'

Locally applied patches:
  ACTIVEPERL_LOCAL_PATCHES_ENTRY
  21846 Configure gets d_u32align wrong
  21739 [perl #24493] install.html not working
  21737 Ooops. left an XXX comment in, and worse still it's a // comment
  21735 utf8 keys now work for tied hashes
  21734 Accessing unicode keys in tie hashes via hv_exists was broken
  21733 ext/threads/t/problem.t
  21732 Config::myconfig() fails under ithreads
  21728 Update perlhist with 5.6.2
  21723 Include 'SCCS' in the list of dir names ignored by installperl
  21718 Empty subroutine as object method segfaults in 5.8.2 (sometimes)
  21714 Fix bug #24380: assigning list with duplicated keys to a hash
  21706 [perl #24460] [DOC PATCH] the begincheck program
  21693 must copy changes from win32/makeifle.mk to wince/makefile.ce
  21691 Update the list of pumpkings in perlhist.pod
  21687 [PATCH 5.6.2-RC1 pod/perlhist.pod] Updated
  21677 OS/2 docu
  21676 Bug #24407: key for shared hash got stringified into wrong pool
  21673 Be sure to use -fPIC not -fpic on Linux/SPARC
  21672 extending the hash attack test
  21671 Benchmark.pm cmpthese segfault
  21662 'make minitest' fails for op/cproto and op/pat
  21586 Comment that this 'optimisation' is actually a necessary fixup
  21548 Sync with Pod::Perldoc 3.12
  21540 Fix backward-compatibility issues in if.pm


@INC for perl v5.8.2:
  /usr/home/roal/perl/lib/5.8.2/i386-bsdos
  /usr/home/roal/perl/lib/5.8.2
  /usr/home/roal/perl/lib/site_perl/5.8.2/i386-bsdos
  /usr/home/roal/perl/lib/site_perl/5.8.2
  /usr/home/roal/perl/lib/site_perl
  .


Environment for perl v5.8.2:
  HOME=/usr/home/roal
  LANG (unset)
  LANGUAGE (unset)
  LC_CTYPE=ISO8859-1
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/usr/home/roal/bin:/bin:/usr/bin:/usr/X11/bin:/usr/contrib/bin:/usr/contrib/mh/bin:/usr/games:/usr/local/bin
  PERL_BADLANG (unset)
  SHELL=/bin/bash

@p5pRT

p5pRT commented Jan 6, 2004

From roal@anet.at

The test script to download as an attachment

@p5pRT

p5pRT commented Jan 6, 2004

From roal@anet.at

utf8.pl

@p5pRT

p5pRT commented Jan 6, 2004

roal@anet.at - Status changed from 'new' to 'open'

@p5pRT

p5pRT commented Jan 6, 2004

roal@anet.at - Status changed from 'open' to 'new'

@p5pRT

p5pRT commented Jan 24, 2004

From perlbug-followup@perl.org

@p5pRT

p5pRT commented Jan 24, 2004

The RT System itself - Status changed from 'new' to 'open'

@p5pRT

p5pRT commented Jan 24, 2004

@rspier - Status changed from 'open' to 'new'

@p5pRT

p5pRT commented Mar 3, 2004

From @jhi

This was recently brought up in perl-unicode@perl.org:
http://bugs6.perl.org/rt3/Ticket/Display.html?id=24826
The proposed analysis that the funky %utf8::ToSpecFoo tables in lib/unicore/To/Foo.pl
were part of the puzzle was right; the proposed cure of removing them was not quite
(that would make parts of the tests op/lc and op/pat, and almost all of uni/*, fail).

The problem was that in utf8.c:to_utf8_case(), for each /i character match and for
each lc/uc/lcfirst/ucfirst, a throw-away SV was created and a sprintf() made into that SV,
even when the Unicode code point (the character ordinal, in case Unicodese is Greek to you)
had no chance of having any special casings [1]. Ouch.

Now the special casings are checked only if there is a chance they will be needed
(either the code point is U+00DF or it is higher than 0xFF [2]), and even when they
are checked, no SV is created; instead the bytes of the UTF-8 encoding of the code
point are used directly as the hash key.

The speed improvements for /i and the lc/etc. are significant, a factor of about 5-10.
I will wear a dunce cap for the rest of today, while you may apply the attached patch.

[1] Take a look at lib/unicore/CaseFoldings.txt, lib/unicore/SpecCase.txt, and
  http://www.unicode.org/unicode/reports/tr18/, especially
  http://www.unicode.org/unicode/reports/tr18/#Default_Loose_Matches
  and realize that Unicodese "loose matches" is our /i.

[2] Glaring at the casing data it seems that 0x12F would work, too: after 0xDF the next
  specially casing code point is 0x130, if I didn't miss anything.
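
The gating described in this comment can be illustrated at the Perl level. This is not the patch itself (which changes C code in utf8.c); it is only a sketch under stated assumptions: %spec_upper is a hypothetical one-entry stand-in for the real special-casing tables, and may_have_special_casing() and upper_char() are made-up names.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a special-casing table. The keys are the UTF-8
# bytes of the code point, not a string formatted on every call.
my %spec_upper;
{
    my $key = "\x{DF}";      # U+00DF LATIN SMALL LETTER SHARP S
    utf8::encode($key);      # $key now holds the two bytes C3 9F
    $spec_upper{$key} = "SS";
}

my $lookups = 0;             # how often the table is actually consulted

# Only U+00DF and code points above 0xFF can have special casings
sub may_have_special_casing {
    my ($cp) = @_;
    return $cp == 0xDF || $cp > 0xFF;
}

sub upper_char {
    my ($char) = @_;
    if (may_have_special_casing(ord $char)) {
        $lookups++;
        my $key = $char;
        utf8::encode($key);  # UTF-8 bytes of the character serve as hash key
        return $spec_upper{$key} if exists $spec_upper{$key};
    }
    return uc $char;         # ordinary one-to-one case mapping
}

upper_char($_) for 'a' .. 'z';
print "table lookups for 26 ASCII letters: $lookups\n";
```

In this sketch the 26 ASCII letters never touch the table, which is the point of the check: a pure ASCII test string like the one in the report skips the special-casing lookup entirely.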

@p5pRT

p5pRT commented Mar 3, 2004

From @jhi

casing.pat.bz2

@p5pRT

p5pRT commented Mar 3, 2004

From @jhi

--
Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen

@p5pRT

p5pRT commented Mar 3, 2004

The RT System itself - Status changed from 'new' to 'open'

@p5pRT

p5pRT commented Mar 3, 2004

From @rgs

Jarkko Hietaniemi wrote:

> Now the special casings are checked only if there is a chance they will be needed
> (either the code point is U+00DF or it is higher than 0xFF [2]), and even when they
> are checked, no SV is created but instead the bytes of the UTF-8 encoding of the code
> point are used directly as the hash key.
>
> The speed improvements for /i and the lc/etc are significant, a factor of about 5-10.
> I will wear a dunce cap for the rest of today, while you may apply the attached patch.

Which is now done, thanks, to blead, as #22427.

@p5pRT

p5pRT commented Mar 7, 2004

From @nwc10

On Wed, Mar 03, 2004 at 09:37:21AM +0200, Jarkko Hietaniemi wrote:

> I will wear a dunce cap for the rest of today, while you may apply the
> attached patch.

Er, yes, but thanks for digging into this. I fear that no-one currently
subscribed to perl5-porters actually understands the intricacies of the
implementation of perl's Unicode support.

Nicholas Clark

@p5pRT

p5pRT commented Mar 7, 2004

From @jhi

>> I will wear a dunce cap for the rest of today, while you may apply the
>> attached patch.
>
> Er, yes, but thanks for digging into this. I fear that no-one currently
> subscribed to perl5-porters actually understands the intricacies of the
> implementation of perl's Unicode support.

You speak as if I understood those :-)

Sadahiro Tomoyuki and Inaba Hiroto used to have a very good handle on things
Unicode, and Dan Kogai of course can deal with anything Encode. Hey, I sense
a certain trend there...

> Nicholas Clark

--
Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen

@p5pRT

p5pRT commented Mar 8, 2004

From roal@anet.at

On Sun, 7 Mar 2004, 20:24 GMT+00 (21:24 local time) Nicholas Clark wrote:

> Er, yes, but thanks for digging into this. I fear that no-one currently
> subscribed to perl5-porters actually understands the intricacies of the
> implementation of perl's Unicode support.

Larry gave an interesting insight into handling Unicode in Perl 6 vs.
Perl 5 in this post:

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&frame=right&th=732a1f27a1510614&seekm=20040303075022.GA8915%40wall.org#link12

It was actually the first response I received to the report of this
utf8 bug, which Jarkko has fixed so quickly. Thanks Larry! :-) I will
be very happy to see that (hopefully) impressive speedup in Perl
5.8.4, since unfortunately it didn't go into 5.8.3.

best,
rob.

@p5pRT

p5pRT commented Mar 24, 2004

@iabyn - Status changed from 'open' to 'resolved'
