Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A multiline regex that starts with /^/m is much slower than the corresponding one that starts with /\n/ #16094

Open
p5pRT opened this issue Jul 31, 2017 · 6 comments

Comments

@p5pRT
Copy link

p5pRT commented Jul 31, 2017

Migrated from rt.perl.org#131822 (status was 'new')

Searchable as RT131822$

@p5pRT
Copy link
Author

p5pRT commented Jul 31, 2017

From @shlomif

A multiline regex that starts with /^/m is much slower than the corresponding
one that starts with /\n/. Below is Dave Mitchell's analysis​:

The code can be reduced to the following​:

  my $nomatch = <<EOF;
  Start 1 =
  Boo
  End
  EOF

  my $match = <<EOF;
  Start 1 =
  End
  EOF

  $_ = ($nomatch x 10_000) . $match;
  my $n = $ARGV[0] ? '^' : '\n';
  m/${n}Start [0-9]+ =\nEnd\n/m or die;

$ time perl5260o ~/tmp/p 0; time perl5260o ~/tmp/p 1

real 0m0.004s
user 0m0.002s
sys 0m0.002s

real 0m0.691s
user 0m0.690s
sys 0m0.001s

It's probably down to this in regexec_flags()​:

  /* note that with PREGf_IMPLICIT, intuit can only fail
  * or return the start position, so it's of limited utility.
  * Nevertheless, I made the decision that the potential for
  * quick fail was still worth it - DAPM */

Basically the '^' causes it to (fruitlessly) run intuit at the start of
every line; the \n instead causes it to just fbm to the next "\nStart"
string.

I may need to revisit that decision. The whole 'pick the next viable start
position' logic in regexec_flags() needs an overhaul, and its on my list
of things to do (but not currently near the top).

===========

Please look into fixing it in a future version.

@joevt
Copy link

joevt commented Oct 4, 2022

Searching a file with 200 thousand lines for certain patterns using the following:

time perl -0777 -pE '
	$found = 0;
	while ( /^(\t*).*\n(\\ d........... ........ ..... .*? .....\.\n)/mg ) { $found++; }
	print STDERR $found . "\n";
' "/tmp/testfile.txt" > "/tmp/testfile2.txt"

The file contains 48 matches.

  • ^ = 53.47s
  • replace ^ with (?<=\n) = 0.49s
  • replace ^ with \n = 0.15s
  • use ^ but replace \\ with . = 0.88s. Why is . an improvement over \\?
perl -v

This is perl 5, version 30, subversion 3 (v5.30.3) built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)

@hvds
Copy link
Contributor

hvds commented Oct 4, 2022

@joevt can you provide a way for us to replicate your findings, without having to attach a very large file? Perhaps by pointing to some corpus already available on the web that has similar timings, or by constructing the corpus from multiple copies of something much smaller? This perlmonks node shows an example of how that might be possible.

Note that #17427 also reports an issue quite similar to this ticket, but is not specific to /^/m. It is possible that if we could fix that more general suboptimal behaviour when searching for floating substrings, this special case would also disappear.

@joevt
Copy link

joevt commented Oct 8, 2022

@hvds I'm having difficulty trying to contrive a method to create a file that reproduces the results since I don't know what about the file is causing the problem. Is 400K zipped too large?

The only problem with (?<=\n) is that it won't match at the beginning of the text file. That's not a problem in my case.

I found that (?>^) takes about as much time as (?<=\n). Is (?>^) a valid substitution for ^ (at least at the beginning of the regex) ? I think so - it at least finds the 48 matches I'm searching for.

@hvds
Copy link
Contributor

hvds commented Oct 8, 2022

@hvds I'm having difficulty trying to contrive a method to create a file that reproduces the results since I don't know what about the file is causing the problem. Is 400K zipped too large?

It's probably a bit large for here, but you can email it to me - my email address is under Hugo in AUTHORS. Hopefully some analysis should let me create a more succinct reproducer.

@joevt
Copy link

joevt commented Aug 10, 2023

testfile.txt.zip

testfile=testfile.txt
{
printf "# test with (?<=\\\\n)\n"
time perl -0777 -pE '
	$found = 0;
	while ( /(?<=\n)(\t*)fffff-eee\n(\\ d........... ........ ..... .*? .....\.\n)/mg ) { $found++; }
	print STDERR "found " . $found . " times\n";
' "$testfile" > /dev/null

printf "# test with (?>^)\n"
time perl -0777 -pE '
	$found = 0;
	while ( /(?>^)(\t*)fffff-eee\n(\\ d........... ........ ..... .*? .....\.\n)/mg ) { $found++; }
	print STDERR "found " . $found . " times\n";
' "$testfile" > /dev/null

printf "# test with ^\n"
time perl -0777 -pE '
	$found = 0;
	while ( /^(\t*)fffff-eee\n(\\ d........... ........ ..... .*? .....\.\n)/mg ) { $found++; }
	print STDERR "found " . $found . " times\n";
' "$testfile" > /dev/null
} 2>&1
# test with (?<=\n)
found 1 times
perl -0777 -pE  "$testfile" > /dev/null  0.12s user 0.01s system 95% cpu 0.133 total
# test with (?>^)
found 1 times
perl -0777 -pE  "$testfile" > /dev/null  0.11s user 0.01s system 96% cpu 0.134 total
# test with ^
found 1 times
perl -0777 -pE  "$testfile" > /dev/null  3.21s user 0.23s system 97% cpu 3.548 total

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants