
regexp very slow on UTF8 (over 100.000 times slower than without UTF8) #9774

Closed
p5pRT opened this issue Jun 22, 2009 · 14 comments
p5pRT commented Jun 22, 2009

Migrated from rt.perl.org#66852 (status was 'resolved')

Searchable as RT66852$


p5pRT commented Jun 22, 2009

From Andreas.Hansson@axis.com

Created by andhan@axis.com

A regexp on >32 M chars takes 0 seconds without UTF8.
A regexp on 32 K chars takes 10 seconds with UTF8.

Test with the code below:

#!/usr/bin/perl -w

my $load = "Abra ka dabra\n ----- /var/log/messages -----\n";
while (length($load) < 32000) {
  $load .= $load;
}

# TEST 1
my $starttime = time();
print "testing with utf8=OFF...\n";
_parse($load);
print "Time ".(time() - $starttime)." seconds\n";

# TEST 2
utf8::upgrade($load);
$starttime = time();
print "testing with utf8=ON...\n";
_parse($load);
print "Time ".(time() - $starttime)." seconds\n";

sub _parse {
  my ($history) = @_;

  if ($history =~ / (?:.*)------/s)
  {
    # ... Do something;
  }
}

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.10.0:

Configured by Debian Project at Thu Jan  1 12:43:38 UTC 2009.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.26-1-686, archname=i486-linux-gnu-thread-multi
    uname='linux rebekka 2.6.26-1-686 #1 smp mon dec 15 18:15:07 utc 2008 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.0 -Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.3.2', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.7.so, so=so, useshrplib=true, libperl=libperl.so.5.10.0
    gnulibc_version='2.7'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'

Locally applied patches:



@INC for perl 5.10.0:
    /etc/perl
    /usr/local/lib/perl/5.10.0
    /usr/local/share/perl/5.10.0
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    .


Environment for perl 5.10.0:
    HOME=/home/andhan
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/andhan/bin.linux:/usr/local/bin:/usr/bin:/bin:/usr/games:/sbin:/usr/sbin:/usr/local/sbin
    PERL_BADLANG (unset)
    SHELL=/bin/bash


p5pRT commented Jun 23, 2009

From p5p@spam.wizbit.be

On Mon Jun 22 04:51:07 2009, Andreas.Hansson@axis.com wrote:

This is a bug report for perl from andhan@axis.com,
generated with the help of perlbug 1.36 running under perl 5.10.0.

-----------------------------------------------------------------
[Please enter your report here]
A regexp on >32 M chars takes 0 seconds without UTF8
A regexp on 32 K chars takes 10 seconds with UTF8.

#!/usr/bin/perl -w

my $load = "Abra ka dabra\n ----- /var/log/messages -----\n";
while (length($load) < 32000) {
  $load .= $load;
}

# TEST 1
my $starttime = time();
print "testing with utf8=OFF...\n";
_parse($load);
print "Time ".(time() - $starttime)." seconds\n";

# TEST 2
utf8::upgrade($load);
$starttime = time();
print "testing with utf8=ON...\n";
_parse($load);
print "Time ".(time() - $starttime)." seconds\n";

sub _parse {
  my ($history) = @_;

  if ($history =~ / (?:.*)------/s)
  {
    print "Match\n";
  }
}
__END__

Behaviour is the same on all versions of perl;

$ perl-5.10.0 rt-66852.pl
testing with utf8=OFF...
Time 0 seconds
testing with utf8=ON...
Time 7 seconds

Running with a shorter string and re 'debug':

#!/usr/bin/perl -w

use re 'debug';
my $load = "Abra ka dabra\n ----- /var/log/messages -----\n";

# TEST 1
my $starttime = time();
print "testing with utf8=OFF...\n";
_parse($load);
print "Time ".(time() - $starttime)." seconds\n";

# TEST 2
utf8::upgrade($load);
$starttime = time();
print "testing with utf8=ON...\n";
_parse($load);
print "Time ".(time() - $starttime)." seconds\n";

sub _parse {
  my ($history) = @_;

  if ($history =~ / (?:.*)------/s)
  {
    print "Match\n";
  }
}
__END__

$ perl-5.10.0 rt-66852.pl

Compiling REx " (?:.*)------"
Final program:
  1: EXACT < > (3)
  3: STAR (5)
  4: SANY (0)
  5: EXACT <------> (8)
  8: END (0)
anchored " " at 0 floating "------" at 1..2147483647 (checking floating) minlen 7

testing with utf8=OFF...
Guessing start of match in sv for REx " (?:.*)------" against "Abra ka dabra%n ----- /var/log/messages -----%n"
Did not find floating substr "------"...
Match rejected by optimizer
Time 0 seconds

testing with utf8=ON...
Matching REx " (?:.*)------" against "Abra ka dabra%n ----- /var/log/messages -----%n"
UTF-8 string...
   4 <Abra> < ka dabra>      |  1:EXACT < >(3)
   5 <Abra > <ka dabra%n>    |  3:STAR(5)
                                 SANY can match 40 times out of 2147483647...
[.....]
Match failed
Time 0 seconds

When the data is not UTF-8, the match is immediately rejected by the optimizer.

When the data is UTF-8, the match is not rejected by the optimizer but is actually run through the engine.

Best regards,

Bram


p5pRT commented Jun 23, 2009

The RT System itself - Status changed from 'new' to 'open'


p5pRT commented Jun 24, 2009

From @davidnicol

On Tue, Jun 23, 2009 at 4:02 PM, Bram via RT <perlbug-followup@perl.org> wrote:

When the data is in UTF8 then the match is not rejected by the
optimizer but actually tested by the engine.

Bram

So you can optimize it yourself at the Perl level:

$ diff -U5 66852_orig.pl 66852.pl

Inline Patch
--- 66852_orig.pl       2009-06-23 22:22:32.000000000 -0500
+++ 66852.pl    2009-06-23 22:23:22.000000000 -0500
@@ -25,11 +25,12 @@


 sub _parse {
        my ($history) = @_;

-       if ($history =~ / (?:.*)------/s)
+       if ($history =~ /------/
+           and $history =~ / (?:.*)------/s)
        {
                print "Match\n";
        }
 }
 __END__

gave

$ perl 66852.pl
testing with utf8=OFF...
Time 0 seconds
testing with utf8=ON...
Time 0 seconds
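The guard in the patch above can also be written with index(), which avoids the regex engine entirely for the cheap pre-check. A minimal, self-contained sketch of the same idea (the parse_guarded name is mine, not from the thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the same ~32K test string as the original report, upgraded to
# force the slow UTF-8 code path.
my $load = "Abra ka dabra\n ----- /var/log/messages -----\n";
$load .= $load while length($load) < 32000;
utf8::upgrade($load);

sub parse_guarded {
    my ($history) = @_;
    # Cheap pre-check: the full pattern can only succeed if the literal
    # "------" occurs somewhere in the string. index() never enters the
    # regex engine, so it is fast regardless of the string's utf8ness.
    return 0 if index($history, '------') < 0;
    return $history =~ / (?:.*)------/s ? 1 : 0;
}

print parse_guarded($load) ? "Match\n" : "No match\n";   # prints "No match"
```

On perls affected by this ticket, the guard restores the near-instant rejection that the optimizer already provides in the non-UTF-8 case.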


p5pRT commented Jun 24, 2009

From Andreas.Hansson@axis.com

Thank you for your investigation.

For information, I tested the same code with PHP 5, and it seems that PHP's mb_ereg also performs poorly...

Test result:
:~/test$ php test.php
testing with utf8=OFF...
Time 0.000917911529541 seconds
testing with utf8=ON...
Time 2.08961200714 seconds

Code:
<?php

$load = "Abra ka dabra\n ----- /var/log/messages -----\n";
while (strlen($load) < 32000) {
  $load .= $load;
}

// TEST 1
$starttime = microtime(true);
print "testing with utf8=OFF...\n";
_parse_iso8859($load);
print "Time ".(microtime(true) - $starttime)." seconds\n";

// TEST 2
$starttime = microtime(true);
print "testing with utf8=ON...\n";
_parse_utf8($load);
print "Time ".(microtime(true) - $starttime)." seconds\n";

function _parse_iso8859($history) {
  if (ereg(" (.*)------", $history, $regs))
  {
    print "Match\n";
  }
}

function _parse_utf8($history) {
  mb_regex_encoding("UTF-8");
  mb_internal_encoding("UTF-8");
  if (mb_ereg(" (.*)------", $history, $regs))
  {
    print "Match\n";
  }
}

?>



p5pRT commented Jun 24, 2009

From @demerphq

2009/6/24 Andreas Hansson <Andreas.Hansson@axis.com>:


When the data is not UTF-8, the match is immediately rejected by the optimizer.

When the data is UTF-8, the match is not rejected by the optimizer but is actually run through the engine.

Just thinking aloud for the sake of the ticket, but I think one might
argue that this is a bug.

Why should utf8 affect case-insensitive matching? All one has to do is
(up|down)grade the strings first and then use their literal bytes as
the string to search for, and it would be fast.

So, er, why don't we do that, I wonder...

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"


p5pRT commented Jun 24, 2009

From perl@nevcal.com

On approximately 6/24/2009 9:28 AM, came the following characters from the keyboard of demerphq:


Just thinking aloud for sake of the ticket, but i think one might
argue that this is a bug.

Why should utf8 affect case-insensitive matching? All one has to do is
(up|down)grade the strings first and then use their literal bytes as
the string to search for and it would be fast.

So, er, why dont we do that i wonder...

I've wondered that for a long time. The arguments I've heard include:

1) Compatibility... upgrading the user's string might break something
else that expects it to stay non-utf8 ... and memory usage might go up
significantly.

2) Performance... updating the user's string to a temporary (to avoid
#1) is slow for large strings, and may not help the memory usage problem
(which may be coincident).

3) Not all strings are text; some may even be combinations of binary and
text; upgrading may destroy invariants for other parts of the program,
that don't properly deal with strings as strings, but rather expect them
to be octets (which should be considered a bug these days, but wasn't
always, and the compatibility police are concerned).

I may have missed some.

--
Glenn -- http​://nevcal.com/

A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


p5pRT commented Jun 24, 2009

From @demerphq

2009/6/24 Glenn Linderman <perl@nevcal.com>:


I've wondered that for a long time.  The arguments I've heard include​:

1) Compatibility... upgrading the user's string might break something else
that expects it to stay non-utf8 ... and memory usage might go up
significantly.

2) Performance... updating the user's string to a temporary (to avoid #1) is
slow for large strings, and may not help the memory usage problem (which may
be coincident).

3) Not all strings are text; some may even be combinations of binary and
text; upgrading may destroy invariants for other parts of the program, that
don't properly deal with strings as strings, but rather expect them to be
octets (which should be considered a bug these days, but wasn't always, and
the compatibility police are concerned).

Actually, my point was slightly different here. The strings I'm talking
about are the strings stored by the regex engine: the ones used to
find the "must match" strings so we can fail fast. These are generally
short snippets, they are not user exposed, and the string being matched
against would not change.

So for instance the pattern in the two cases we are discussing here is
the same; it is only the utf8ness of the string being searched that
differs. And it actually turns out that the constant string we are
searching for is represented the same way in utf8, so in this case
there is actually no reason to disable that optimisation.

Or am I missing a subtlety here that you covered?

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
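Yves' claim above, that the constant substring is represented the same way with or without utf8, can be spot-checked from Perl itself. A small sketch (variable names are mine, not from the thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# "------" is pure ASCII, so its octets are identical whether or not
# the scalar carries the UTF-8 flag.
my $native   = "------";
my $upgraded = "------";
utf8::upgrade($upgraded);     # same characters, UTF-8 flag now set

my $octets = $upgraded;
utf8::encode($octets);        # drop the flag, keep the raw UTF-8 octets

print $native eq $octets ? "identical octets\n" : "octets differ\n";

# A non-ASCII literal is the counterexample: 0xE9 (e-acute) becomes
# two octets (0xC3 0xA9) once encoded as UTF-8.
my $latin = "caf\xE9";
my $up    = $latin;
utf8::upgrade($up);
utf8::encode($up);
printf "%d vs %d octets\n", length($latin), length($up);   # 4 vs 5
```

So an ASCII-only "must match" substring could be searched for byte-wise on either representation without change.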


p5pRT commented Jun 24, 2009

From @khwilliamson

demerphq wrote:


I don't know if this is relevant, but I recollect, when looking at the
regexec code, that there is a lot of inefficiency there when strings get
converted to utf8 and then thrown away when backtracking.


p5pRT commented Jun 24, 2009

From perl@nevcal.com

On approximately 6/24/2009 12:28 PM, came the following characters from the keyboard of demerphq:

2009/6/24 Glenn Linderman <perl@​nevcal.com>​:

On approximately 6/24/2009 9​:28 AM, came the following characters from the
keyboard of demerphq​:

2009/6/24 Andreas Hansson <Andreas.Hansson@​axis.com>​:

testing with utf8=ON...
Matching REx " (?​:.*)------" against "Abra ka dabra%n ----- /var/log/
messages -----%n"
UTF-8 string...
4 <Abra> < ka dabra> | 1​:EXACT < >(3)
5 <Abra > <ka dabra%n> | 3​:STAR(5)
SANY can match 40 times out of
2147483647...
[.....]
Match failed
Time 0 seconds

When the data is non UTF8 then the match is immediatly rejected by the
optimizer.

When the data is in UTF8 then the match is not rejected by the
optimizer but actually tested by the engine.

Just thinking aloud for sake of the ticket, but i think one might
argue that this is a bug.

Why should utf8 affect case-insensitive matching? All one has to do is
(up|down)grade the strings first and then use their literal bytes as
the string to search for and it would be fast.

So, er, why dont we do that i wonder...

I've wondered that for a long time. The arguments I've heard include​:

1) Compatibility... upgrading the user's string might break something else
that expects it to stay non-utf8 ... and memory usage might go up
significantly.

2) Performance... updating the user's string to a temporary (to avoid #1) is
slow for large strings, and may not help the memory usage problem (which may
be coincident).

3) Not all strings are text; some may even be combinations of binary and
text; upgrading may destroy invariants for other parts of the program, that
don't properly deal with strings as strings, but rather expect them to be
octets (which should be considered a bug these days, but wasn't always, and
the compatibility police are concerned).

Actually, my point was slightly different here. The strings im talking
about are the strings stored by the regex engine. The ones used to
find the "must match" strings so we fail fast. These are generally
short snippets, and are not user exposed and the string being matched
against would not change.

So for instance the pattern in the two cases we are discussing here is
the same, it is only the utf8ness of the string being searched that
differs, and it actually turns out that the constant string we are
searching for will be represented the same as it is in utf8, so in
this case there is actually no reason to disable that optimisation.

Or am i missing a subtlety here that you covered?

No, I overlooked the subtlety that you were discussing!

For internal strings stored and used by the regex engine in the compiled
form, it seems to me that there would be no user reason not to store
whatever is useful. I suppose it would be possible to create extremely
large match strings, but I'm not sure why one would do that... I've
never seen an application read several hundred KB of text and use it as
part of a regex match string :)

So under the assumption that the strings are short and internal, there
seem to be four cases:

1) match against non-utf8 strings
2) match against utf8 strings
3) match against both
4) (for completeness only) match against neither

So then there are two sub-cases:

a) the internal string has the same representation in utf8 and non-utf8
b) the internal string has different representations in utf8 and non-utf8

regex was initially designed for 1a only, I presume. When utf8 came
along, so did all the other cases.

Seems like for cases 1a, 2a, and 3a, the match should be done as octets.
This is unconditional because it is subcase a.

To handle the b subcases is more complex. Either there has to be logic
to convert the internal string(s) to the same representation as the
match string as a preprocessing step, or both representations could be
saved, and the appropriate one chosen based on the representation of the
match string.

Of course, handling things like case-insensitivity couldn't be done that
way, but this issue doesn't involve that.

Should I guess that the current reason is that distance in characters
between the two exact matches cannot, in the utf8 match string case, be
easily computed from the length in bytes, for setting $+[0] and $-[0],
so the optimizer rejects the shortcut?

--
Glenn -- http​://nevcal.com/

A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
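Glenn's subcase (a) vs (b) split above reduces to a one-line test: an internal literal has identical octets in both representations exactly when it is ASCII-only (every code point below 0x80). A sketch (the function name is mine, not from the thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Subcase (a): same octets with or without the UTF-8 flag. This holds
# iff the literal contains only ASCII characters; code points in
# 0x80..0xFF take one octet natively but two in UTF-8 (subcase (b)).
sub same_repr_in_both {
    my ($literal) = @_;
    return $literal !~ /[^\x00-\x7F]/;
}

print same_repr_in_both("------")   ? "subcase a\n" : "subcase b\n";  # subcase a
print same_repr_in_both("na\xEFve") ? "subcase a\n" : "subcase b\n";  # subcase b
```

Under this test the optimizer could, in principle, keep the byte-wise substring search whenever the "must match" literal passes subcase (a).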


p5pRT commented Oct 20, 2014

From victor@vsespb.ru

I tested, and it seems the issue is fixed in 5.20. Should it be backported to 5.18?



p5pRT commented Oct 20, 2014

From @iabyn

On Mon, Oct 20, 2014 at 06:21:47AM -0700, Victor Efimov via RT wrote:

I tested and seems issue if fixed in 5.20. Should it be backported to 5.18?

It's likely it was fixed by my heavy reworking of the re_intuit_start()
function; this consisted of around 100 accumulated commits and wouldn't
be suitable for back-porting.

--
"Procrastination grows to fill the available time"
  -- Mitchell's corollary to Parkinson's Law


p5pRT commented Jul 20, 2016

From @dcollinsn

On Mon Oct 20 08:16:08 2014, davem wrote:


Confirmed fixed, closing.

--
Respectfully,
Dan Collins


p5pRT commented Jul 20, 2016

@dcollinsn - Status changed from 'open' to 'resolved'
