Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regex performance: braces { } slower than star * #14855

Closed
p5pRT opened this issue Aug 14, 2015 · 18 comments
Closed

regex performance: braces { } slower than star * #14855

p5pRT opened this issue Aug 14, 2015 · 18 comments

Comments

@p5pRT
Copy link

p5pRT commented Aug 14, 2015

Migrated from rt.perl.org#125810 (status was 'resolved')

Searchable as RT125810$

@p5pRT
Copy link
Author

p5pRT commented Aug 14, 2015

From @mauke

Created by @mauke

#!/usr/bin/perl
use strict;
use warnings;

use Benchmark qw(cmpthese);

my $target = 'abcdefg' x 1_000;

cmpthese -3, {
  star => sub { $target =~ /.*[xy]/s },
  brace => sub { $target =~ /.{0,}[xy]/s },
};

__END__

The only difference between the regexes is that one uses * notation to express
"0 or more" and the other uses {0,}. Semantically there's no difference, but
look at these numbers​:

| Rate brace star
| brace 1.80/s -- -100%
| star 6297/s 349374% --

If I'm reading this right, * runs 3500 times faster than { } for no good
reason.

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.22.0:

Configured by mauke at Sat Jul  4 14:56:57 CEST 2015.

Summary of my perl5 (revision 5 version 22 subversion 0) configuration:
   
  Platform:
    osname=linux, osvers=4.0.1-1-arch, archname=i686-linux
    uname='linux simplicio 4.0.1-1-arch #1 smp preempt wed apr 29 12:15:20 cest 2015 i686 gnulinux '
    config_args=''
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='5.1.0', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234, doublekind=3
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12, longdblkind=3
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags ='-fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/i686-pc-linux-gnu/5.1.0/include-fixed /usr/lib /lib
    libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.21.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.21'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'



@INC for perl 5.22.0:
    /home/mauke/usr/lib/perl5/site_perl/5.22.0/i686-linux
    /home/mauke/usr/lib/perl5/site_perl/5.22.0
    /home/mauke/usr/lib/perl5/5.22.0/i686-linux
    /home/mauke/usr/lib/perl5/5.22.0
    .


Environment for perl 5.22.0:
    HOME=/home/mauke
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LC_COLLATE=POSIX
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/mauke/perl5/perlbrew/bin:/home/mauke/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl
    PERLBREW_BASHRC_VERSION=0.69
    PERLBREW_HOME=/home/mauke/.perlbrew
    PERLBREW_ROOT=/home/mauke/perl5/perlbrew
    PERL_BADLANG (unset)
    PERL_UNICODE=SAL
    SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Aug 14, 2015

From Mark@Overmeer.net

* l.mai@​web.de (perlbug-followup@​perl.org) [150814 15​:35]​:

# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=125810 >

cmpthese -3, {
star => sub { $target =~ /.*[xy]/s },
brace => sub { $target =~ /.{0,}[xy]/s },
};

The only difference between the regexes is that one uses * notation to express
"0 or more" and the other uses {0,}. Semantically there's no difference, but
look at these numbers​:

| Rate brace star
| brace 1.80/s -- -100%
| star 6297/s 349374% --

Interesting... the brace version shows quadratic behavior

'abcdefg' x 1000 (on my system)
  Rate brace star
  brace 3.11/s -- -100%
  star 10346/s 333055%

'abcdefg' x 500 (on my system)

  Rate brace star
  brace 12.4/s -- -100%
  star 20693/s 166535%

'abcdefg' x 250 (on my system)

  Rate brace star
  brace 49.5/s -- -100%
  star 41417/s 83531%

So, halving the size of the pattern, make .* run double as fast,
but .{0,} four times.
--
Regards,

  MarkOv


  Mark Overmeer MSc MARKOV Solutions
  Mark@​Overmeer.net solutions@​overmeer.net
http​://Mark.Overmeer.net http​://solutions.overmeer.net

@p5pRT
Copy link
Author

p5pRT commented Aug 14, 2015

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Aug 14, 2015

From @Abigail

On Fri, Aug 14, 2015 at 05​:43​:00PM +0200, Mark Overmeer wrote​:

* l.mai@​web.de (perlbug-followup@​perl.org) [150814 15​:35]​:

# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=125810 >

cmpthese -3, {
star => sub { $target =~ /.*[xy]/s },
brace => sub { $target =~ /.{0,}[xy]/s },
};

The only difference between the regexes is that one uses * notation to express
"0 or more" and the other uses {0,}. Semantically there's no difference, but
look at these numbers​:

| Rate brace star
| brace 1.80/s -- -100%
| star 6297/s 349374% --

Interesting... the brace version shows quadratic behavior

'abcdefg' x 1000 (on my system)
Rate brace star
brace 3.11/s -- -100%
star 10346/s 333055%

'abcdefg' x 500 (on my system)

         Rate   brace    star
brace  12\.4/s      \-\-   \-100%
star  20693/s 166535%     

'abcdefg' x 250 (on my system)

         Rate  brace   star
brace  49\.5/s     \-\-  \-100%
star  41417/s 83531%   

So, halving the size of the pattern, make .* run double as fast,
but .{0,} four times.

There's a shortcut in the case of /.*/.

In the /.*/ case, the regexp engine only tries to match the string
starting from the beginning, trying to find [xy] after consuming
"abcdefg" x 1000, then after ("abcdefg" x 999) . "abcdef", all
the way down to finding [xy] right after consuming "". Having found
nothing, it concludes there cannot be a match, and give up.

Not so in the /.{0,}/ case. Then it will restart trying to match,
but now starting from the first b. Hence the quadratic behaviour.

Also note, /.*/ is NOT the same as /.{0,}/; at least, not internally.
Using "use re 'debug'" shows that /.*/ becomes a "STAR" op, while
/.{0,}/ becomes "CURLY{0,32767}" (but it can still match more than
32767 times).

I do agree with the OP though that there should not be a user
visible difference.

Abigail

@p5pRT
Copy link
Author

p5pRT commented Aug 14, 2015

From @iabyn

On Fri, Aug 14, 2015 at 06​:36​:33PM +0200, Abigail wrote​:

I do agree with the OP though that there should not be a user
visible difference.

/.*XXX/ is a silly thing to do. Perl spots this and optimises it.
Arguably it could be extended to spot {0,} too, but it this a likely
use case?

--
In economics, the exam questions are the same every year.
They just change the answers.

@p5pRT
Copy link
Author

p5pRT commented Aug 14, 2015

From @mauke

On Fri Aug 14 10​:26​:40 2015, davem wrote​:

On Fri, Aug 14, 2015 at 06​:36​:33PM +0200, Abigail wrote​:

I do agree with the OP though that there should not be a user
visible difference.

/.*XXX/ is a silly thing to do. Perl spots this and optimises it.
Arguably it could be extended to spot {0,} too, but it this a likely
use case?

I've attached a patch attempt that fixes the reported issue. It doesn't seem to break any core tests.

However, I haven't actually looked at the bigger context of the code. Someone who's seen S_regpiece before should review this - I suspect the goto logic can be simplified.

@p5pRT
Copy link
Author

p5pRT commented Aug 14, 2015

From @mauke

0001-PoC-125810.patch
From 844df4809f00fd6c6d03369af59f3c4730ddd34d Mon Sep 17 00:00:00 2001
From: Lukas Mai <l.mai@web.de>
Date: Fri, 14 Aug 2015 21:21:38 +0200
Subject: [PATCH] PoC [#125810]

---
 regcomp.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/regcomp.c b/regcomp.c
index f08f08f..3b15a07 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -10893,6 +10893,20 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 
 	  do_curly:
 	    if ((flags&SIMPLE)) {
+                if (min == 0 && max == REG_INFTY) {
+                    reginsert(pRExC_state, STAR, ret, depth+1);
+                    ret->flags = 0;
+                    MARK_NAUGHTY(4);
+                    RExC_seen |= REG_UNBOUNDED_QUANTIFIER_SEEN;
+                    goto nest_check;
+                }
+                if (min == 1 && max == REG_INFTY) {
+                    reginsert(pRExC_state, PLUS, ret, depth+1);
+                    ret->flags = 0;
+                    MARK_NAUGHTY(3);
+                    RExC_seen |= REG_UNBOUNDED_QUANTIFIER_SEEN;
+                    goto nest_check;
+                }
                 MARK_NAUGHTY_EXP(2, 2);
 		reginsert(pRExC_state, CURLY, ret, depth+1);
                 Set_Node_Offset(ret, parse_start+1); /* MJD */
@@ -10966,22 +10980,10 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 
     *flagp = (op != '+') ? (WORST|SPSTART|HASWIDTH) : (WORST|HASWIDTH);
 
-    if (op == '*' && (flags&SIMPLE)) {
-	reginsert(pRExC_state, STAR, ret, depth+1);
-	ret->flags = 0;
-	MARK_NAUGHTY(4);
-        RExC_seen |= REG_UNBOUNDED_QUANTIFIER_SEEN;
-    }
-    else if (op == '*') {
+    if (op == '*') {
 	min = 0;
 	goto do_curly;
     }
-    else if (op == '+' && (flags&SIMPLE)) {
-	reginsert(pRExC_state, PLUS, ret, depth+1);
-	ret->flags = 0;
-	MARK_NAUGHTY(3);
-        RExC_seen |= REG_UNBOUNDED_QUANTIFIER_SEEN;
-    }
     else if (op == '+') {
 	min = 1;
 	goto do_curly;
-- 
2.5.0

@p5pRT
Copy link
Author

p5pRT commented Aug 14, 2015

From @Abigail

On Fri, Aug 14, 2015 at 06​:26​:05PM +0100, Dave Mitchell wrote​:

On Fri, Aug 14, 2015 at 06​:36​:33PM +0200, Abigail wrote​:

I do agree with the OP though that there should not be a user
visible difference.

/.*XXX/ is a silly thing to do. Perl spots this and optimises it.

Why is /.*XXX/ a silly thing to do?

Arguably it could be extended to spot {0,} too, but it this a likely
use case?

I don't know, but it would be nice if * really was just a shortcut for {0,}.
If only to prevent "experienced" people snapping "don't write {0,}, you
noob, * is faster".

And although I don't know how Perl spots things and optimizes it, it seems
to me that recognizing {0,} isn't fundamentally harder than spotting *.

I had never realized before that {0,} could be slower than *. I've to
check Regexp​::Common it's not using it.

Abigail

@p5pRT
Copy link
Author

p5pRT commented Aug 14, 2015

From @jandubois

On Fri, Aug 14, 2015 at 1​:37 PM, Abigail <abigail@​abigail.be> wrote​:

On Fri, Aug 14, 2015 at 06​:26​:05PM +0100, Dave Mitchell wrote​:

/.*XXX/ is a silly thing to do. Perl spots this and optimises it.

Why is /.*XXX/ a silly thing to do?

Because the .* doesn't serve any useful purpose. It's like writing
$a+1+0 instead of $a+1.

You rely on the computer to optimize it away, but there is no reason
to put it there in the first place.

How about /^.*XXX.*$/ for even more silliness. :)

Cheers,
-Jan

@p5pRT
Copy link
Author

p5pRT commented Aug 14, 2015

From @Abigail

On Fri, Aug 14, 2015 at 02​:13​:49PM -0700, Jan Dubois wrote​:

On Fri, Aug 14, 2015 at 1​:37 PM, Abigail <abigail@​abigail.be> wrote​:

On Fri, Aug 14, 2015 at 06​:26​:05PM +0100, Dave Mitchell wrote​:

/.*XXX/ is a silly thing to do. Perl spots this and optimises it.

Why is /.*XXX/ a silly thing to do?

Because the .* doesn't serve any useful purpose. It's like writing
$a+1+0 instead of $a+1.

Except that it isn't.

.* isn't a noop; it forces the XXX part to match at the last possible
place in the string, where /XXX/ would match at the first possible
place. This effects $&, pos(), and any captures in XXX. I also assume
that the bug report had the code trimmed down to a short snippet
showing the unwanted behaviour, instead of copy-and-pasting some
real production code.

There is the same performance difference in /(.*)[xy]/ vs /(.{0,})[xy]/.

You rely on the computer to optimize it away, but there is no reason
to put it there in the first place.

How about /^.*XXX.*$/ for even more silliness. :)

That's something different as well.

Abigail

@p5pRT
Copy link
Author

p5pRT commented Aug 15, 2015

From Stromeko@nexgo.de

Dave Mitchell writes​:

/.*XXX/ is a silly thing to do. Perl spots this and optimises it.
Arguably it could be extended to spot {0,} too, but it this a likely
use case?

So what is the non-silly thing to "greedily match anything, but only so
much that XXX matches at the end" aka "match XXX in the last possible
position"? I agree that this is usually misused and when the match data
isn't consumed it makes no difference which XXX you matched. But the
match resulting from /XXX/ is different even when there's only a single
occurence.

Regards,
Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Factory and User Sound Singles for Waldorf Q+, Q and microQ​:
http​://Synth.Stromeko.net/Downloads.html#WaldorfSounds

@p5pRT
Copy link
Author

p5pRT commented Aug 15, 2015

From @khwilliamson

On 08/14/2015 01​:25 PM, l.mai@​web.de via RT wrote​:

On Fri Aug 14 10​:26​:40 2015, davem wrote​:

On Fri, Aug 14, 2015 at 06​:36​:33PM +0200, Abigail wrote​:

I do agree with the OP though that there should not be a user
visible difference.

/.*XXX/ is a silly thing to do. Perl spots this and optimises it.
Arguably it could be extended to spot {0,} too, but it this a likely
use case?

I've attached a patch attempt that fixes the reported issue. It doesn't seem to break any core tests.

However, I haven't actually looked at the bigger context of the code. Someone who's seen S_regpiece before should review this - I suspect the goto logic can be simplified.

---
via perlbug​: queue​: perl5 status​: open
https://rt-archive.perl.org/perl5/Ticket/Display.html?id=125810

Your patch looks good to me. It's now smoking as

smoke-me/khw-star (v5.23.1-197-gcfa19b0)​:

@p5pRT
Copy link
Author

p5pRT commented Aug 17, 2015

From @ikegami

On Fri, Aug 14, 2015 at 4​:37 PM, Abigail <abigail@​abigail.be> wrote​:

On Fri, Aug 14, 2015 at 06​:26​:05PM +0100, Dave Mitchell wrote​:

On Fri, Aug 14, 2015 at 06​:36​:33PM +0200, Abigail wrote​:

I do agree with the OP though that there should not be a user
visible difference.

/.*XXX/ is a silly thing to do. Perl spots this and optimises it.

Why is /.*XXX/ a silly thing to do?

Because it's effectively

/^.*?.*XXX/

it should have been written

/^.*XXX/

@p5pRT
Copy link
Author

p5pRT commented Aug 24, 2015

From @khwilliamson

On 08/15/2015 01​:49 PM, Karl Williamson wrote​:

On 08/14/2015 01​:25 PM, l.mai@​web.de via RT wrote​:

On Fri Aug 14 10​:26​:40 2015, davem wrote​:

On Fri, Aug 14, 2015 at 06​:36​:33PM +0200, Abigail wrote​:

I do agree with the OP though that there should not be a user
visible difference.

/.*XXX/ is a silly thing to do. Perl spots this and optimises it.
Arguably it could be extended to spot {0,} too, but it this a likely
use case?

I've attached a patch attempt that fixes the reported issue. It
doesn't seem to break any core tests.

However, I haven't actually looked at the bigger context of the code.
Someone who's seen S_regpiece before should review this - I suspect
the goto logic can be simplified.

---
via perlbug​: queue​: perl5 status​: open
https://rt-archive.perl.org/perl5/Ticket/Display.html?id=125810

Your patch looks good to me. It's now smoking as

smoke-me/khw-star (v5.23.1-197-gcfa19b0)​:

I will apply your patch, but it needs a better commit message than "PoC"

@p5pRT
Copy link
Author

p5pRT commented Aug 25, 2015

From @mauke

On Sun Aug 23 21​:28​:04 2015, public@​khwilliamson.com wrote​:

On 08/15/2015 01​:49 PM, Karl Williamson wrote​:

Your patch looks good to me. It's now smoking as

smoke-me/khw-star (v5.23.1-197-gcfa19b0)​:

I will apply your patch, but it needs a better commit message than "PoC"

I've applied it myself in 02853b7.
Some tests were added in 49bc8c2.

@p5pRT
Copy link
Author

p5pRT commented Aug 31, 2015

From @arc

Abigail <abigail@​abigail.be> wrote​:

Using "use re 'debug'" shows that […]
/.{0,}/ becomes "CURLY{0,32767}" (but it can still match more than
32767 times).

This is somewhat at a tangent to the rest of this thread, but as of
e4105e8, blead displays /.{0,}/ as
"CURLY{0,INFTY}" when debugging (rather than with the spurious 32767
constant).

--
Aaron Crane ** http​://aaroncrane.co.uk/

@p5pRT
Copy link
Author

p5pRT commented Nov 11, 2016

From @mauke

On Tue, 25 Aug 2015 01​:17​:25 -0700, mauke- wrote​:

I've applied it myself in 02853b7.
Some tests were added in 49bc8c2.

The fix has been released.

@p5pRT p5pRT closed this as completed Nov 11, 2016
@p5pRT
Copy link
Author

p5pRT commented Nov 11, 2016

@mauke - Status changed from 'open' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant