approx. 10 times faster utf8 string operations #7021
Comments
From roal@anet.at:

This is a bug report for perl from roal@anet.at,

The perlunicode pod says
In Perl 5.8.0 the slowness was often quite spectacular;

Regular expressions have always been what Perl is so famous for, and are certainly

I have investigated on this and found a solution to make regular expression operations

Below, there is the test script I used to measure the effectiveness, on a simple pure

On Perl 5.8.0, case-insensitive searches on utf8 strings are always extremely slow!
On Perl 5.8.2, the performance is much better, but still very poor by default!

Save the code given below as "utf8.pl" and run it by executing
perl utf8.pl
to get test results as shown below, or, with another multiplier value used to create the test string,
perl utf8.pl 1e4

My results have been:

with default Perl 5.8.2:
UTF-8 Performance Test on MSWin32 with Perl 5.8.2
String is now treated as bytes
String is now treated as utf8
Switching to utf8 semantics required the following additional files to load:
UTF-8 Performance Test on bsdos with Perl 5.8.2
String is now treated as bytes
String is now treated as utf8

with Perl 5.8.2, after the patch:
UTF-8 Performance Test on MSWin32 with Perl 5.8.2
String is now treated as bytes
String is now treated as utf8
UTF-8 Performance Test on bsdos with Perl 5.8.2
String is now treated as bytes
String is now treated as utf8

with default Perl 5.8.0:
UTF-8 Performance Test on MSWin32 with Perl 5.8.0
String is now treated as bytes
String is now treated as utf8

with Perl 5.8.0, after the patch:
UTF-8 Performance Test on MSWin32 with Perl 5.8.0
String is now treated as bytes
String is now treated as utf8

The Solution for the patch:
Entirely remove the '%utf8::ToSpecFUNCTION = (...)' definition from 'unicore/To/FUNCTION.pl',
from 'unicore/To/Fold.pl' -> remove '%utf8::ToSpecFold = (...)'

Even when only the variable name '%utf8::ToSpecFUNCTION' is used in Perl code anywhere, some black magic

This is just a workaround, but a very effective one.
I guess that the real solution for that problem

The "special" is a string like "utf8::ToSpecLower", which means the

Further investigation into that C source may find the reason of this.

best,

=cut

######## start 'utf8.pl' test script ########
my $multiply = @ARGV ? shift(@ARGV) + 0 : 0;
printf "UTF-8 Performance Test on
utf8::encode($string);
Encode::_utf8_on($string);
print "Switching to utf8 semantics required the following additional files to load:\n\t";
sub now {
sub search {
my $term = "abc";
my ($lc, $uc) = (0, 0);
__END__

Flags:
Site configuration information for perl v5.8.2:
Configured by roal at Fri Dec 19 04:41:37 EST 2003.
Summary of my perl5 (revision 5.0 version 8 subversion 2) configuration:
Locally applied patches:
@INC for perl v5.8.2:
Environment for perl v5.8.2:
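For readers who want to reproduce this kind of comparison, here is a minimal sketch of the measurement. It is not the reporter's utf8.pl, which survives only in fragments in the report; the helper name timed_search, the sample text, the default multiplier, and the use of utf8::upgrade (instead of the utf8::encode / Encode::_utf8_on pair seen in the fragments) are assumptions made for illustration.

```perl
#!/usr/bin/perl
# Sketch only: NOT the reporter's utf8.pl. All names here are illustrative.
use strict;
use warnings;
use Time::HiRes qw(time);

# Count case-insensitive matches of $term in $string and time the scan.
# Returns (match count, elapsed seconds).
sub timed_search {
    my ($string, $term) = @_;
    my $start = time;
    my $count = 0;
    $count++ while $string =~ /\Q$term\E/gi;
    return ($count, time - $start);
}

# Multiplier for the test string; the default of 1000 is an assumption.
my $multiply = @ARGV ? shift(@ARGV) + 0 : 1000;
my $string   = "abc ABC xyz " x $multiply;

# Pass 1: plain byte string (UTF-8 flag off).
my ($n_bytes, $t_bytes) = timed_search($string, "abc");

# Pass 2: same characters, but with the UTF-8 flag turned on.
utf8::upgrade($string);
my ($n_utf8, $t_utf8) = timed_search($string, "abc");

printf "bytes: %d matches in %.4fs\n", $n_bytes, $t_bytes;
printf "utf8:  %d matches in %.4fs\n", $n_utf8,  $t_utf8;
```

Both passes should report the same number of matches; only the timings are expected to differ, and by how much depends on the Perl version and platform.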
From roal@anet.at: The test script to download as an attachment
roal@anet.at - Status changed from 'new' to 'open'
roal@anet.at - Status changed from 'open' to 'new'
The RT System itself - Status changed from 'new' to 'open'
@rspier - Status changed from 'open' to 'new'
From @jhi:

This was recently brought up in perl-unicode@perl.org:

The problem was that in utf8.c:to_utf8_case() for each /i character

Now the special casings are checked only if there is a chance they will

The speed improvements for /i and the lc/etc are significant, a factor

[1] Take a look at lib/unicore/CaseFoldings.txt,
[2] Glaring at the casing data it seems that 0x12F would work, too,
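The idea of consulting the special casings only when they can possibly apply can be sketched in Perl as follows. This is an illustration of the general technique, not the code in utf8.c: the table %special_upper, the helper names, and the exact guard condition are all hypothetical, and the counter exists only to show how often the expensive path is taken.

```perl
use strict;
use warnings;

# Hypothetical table of multi-character ("special") uppercase mappings.
# The real data lives in the Unicode casing files; these two entries
# are only examples.
my %special_upper = (
    0x00DF => "SS",   # LATIN SMALL LETTER SHARP S
    0xFB00 => "FF",   # LATIN SMALL LIGATURE FF
);

my $lookups = 0;      # how many times the expensive lookup ran

# Stand-in for the costly special-casing lookup.
sub special_lookup {
    my ($cp) = @_;
    $lookups++;
    return $special_upper{$cp};
}

# Uppercase a single code point, consulting the special table only
# when the cheap guard says an entry could exist. The guard below is
# an assumption chosen to match the hypothetical table above.
sub upper_of {
    my ($cp) = @_;
    if ($cp == 0xDF || $cp > 0xFF) {
        my $special = special_lookup($cp);
        return $special if defined $special;
    }
    return uc chr $cp;   # ordinary one-to-one mapping
}

print join(" ", map { upper_of($_) } 0x61, 0x62, 0xDF), "\n";  # A B SS
print "expensive lookups: $lookups\n";                          # 1
```

For plain ASCII input the guard fails immediately, so the per-character cost of the special lookup disappears, which is the kind of saving the message describes for /i matching and lc and friends.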
The RT System itself - Status changed from 'new' to 'open'
From @rgs:

Jarkko Hietaniemi wrote:
Which is now done, thanks, to blead, as #22427.
From @nwc10:

On Wed, Mar 03, 2004 at 09:37:21AM +0200, Jarkko Hietaniemi wrote:
Er, yes, but thanks for digging into this. I fear that no-one currently

Nicholas Clark
From @jhi:

You speak as if I understood those :-)

Sadahiro Tomoyuki and Inaba Hiroto used to have a very good handle on
From roal@anet.at:

On Sun, 7 Mar 2004, 20:24 GMT+00 (21:24 local time) Nicholas Clark

Larry gave an interesting insight in handling Unicode in Perl 6 vs.

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&frame=right&th=732a1f27a1510614&seekm=20040303075022.GA8915%40wall.org#link12

which was actually the first response I received to the report of that

best,
@iabyn - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#24826 (status was 'resolved')
Searchable as RT24826$