Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

/\G^/ seems abnormally slow #14220

Open
p5pRT opened this issue Nov 8, 2014 · 3 comments
Open

/\G^/ seems abnormally slow #14220

p5pRT opened this issue Nov 8, 2014 · 3 comments

Comments

@p5pRT
Copy link

p5pRT commented Nov 8, 2014

Migrated from rt.perl.org#123156 (status was 'open')

Searchable as RT123156$

@p5pRT
Copy link
Author

p5pRT commented Nov 8, 2014

From @jimav

This is a bug report for perl from jim.avera@​gmail.com,
generated with the help of perlbug 1.39 running under perl 5.18.2.


Regexs like /\G^/msgc seem abnormally slow.

The run-time appears to be quadratic in the total length of the string
being searched, as if a failed match causes the *entire* string to be
re-processed, even though the /gc should mean just back up to where \G was.

The problem is specific to using ^ i.e. the exact same regex without ^
runs quadratically faster.

The demo script below scans a large string using a sequence of stanzas
beginning with either /\G^x/msgc or /\Gx/msgc
(neither of which ever match).

Here are my results​:
  Scan 5k bytes using \G^x​: 8.43/s
  Scan 10k bytes using \G^x​: 2.11/s
  Scan 20k bytes using \G^x​: .53/s
  Scan 40k bytes using \G^x​: .13/s
  Scan 50k bytes using \Gx : 168.07/s

#!/usr/bin/perl
use strict; use warnings;
use Benchmark;
sub test($$) {
  my ($re, $datasize) = @​_;
  my $data = "foo\n" x ($datasize/5);
  for (;;) {
  if ($data =~ /$re/msgc) { die "unexpected match" }
  elsif ($data =~ /\G(\w+)([^\n]*\n)/sgc) { }
  elsif ($data =~ /\G\z/sgc) { last; }
  else { die "bug" }
  }
  $data =~ /\G./msg && die "bug"; # reset pos
}
timethis (700, sub{ test('\Gx', 50000) }, "Scan 50k bytes using \\Gx ");
timethis (30, sub{ test('\G^x', 5000) }, "Scan 5k bytes using \\G^x");
timethis (15, sub{ test('\G^x', 10000) }, "Scan 10k bytes using \\G^x");
timethis (5, sub{ test('\G^x', 20000) }, "Scan 20k bytes using \\G^x");
timethis (3, sub{ test('\G^x', 40000) }, "Scan 40k bytes using \\G^x");



Flags​:
  category=core
  severity=medium


Site configuration information for perl 5.18.2​:

Configured by Debian Project at Thu Mar 27 18​:28​:21 UTC 2014.

Summary of my perl5 (revision 5 version 18 subversion 2) configuration​:
 
  Platform​:
  osname=linux, osvers=3.2.0-58-generic, archname=x86_64-linux-gnu-thread-multi
  uname='linux brownie 3.2.0-58-generic #88-ubuntu smp tue dec 3 17​:37​:58 utc 2013 x86_64 x86_64 x86_64 gnulinux '
  config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -Dldflags= -Wl,-Bsymbolic-functions -Wl,-z,relro -Dlddlflags=-shared -Wl,-Bsymbolic-functions -Wl,-z,relro -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.18 -Darchlib=/usr/lib/perl/5.18 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.18.2 -Dsitearch=/usr/local/lib/perl/5.18.2 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Duse64bitint -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Ui_libutil -Uversiononly -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.18.2 -des'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=define, usemultiplicity=define
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=define, use64bitall=define, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fstack-protector -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O2 -g',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fstack-protector -fno-strict-aliasing -pipe -I/usr/local/include'
  ccversion='', gccversion='4.8.2', gccosandvers=''
  intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
  ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib /usr/lib
  libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
  perllibs=-ldl -lm -lpthread -lc -lcrypt
  libc=, so=so, useshrplib=true, libperl=libperl.so.5.18.2
  gnulibc_version='2.19'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib -fstack-protector'

Locally applied patches​:
  DEBPKG​:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN.
  DEBPKG​:debian/db_file_ver - http​://bugs.debian.org/340047 Remove overly restrictive DB_File version check.
  DEBPKG​:debian/doc_info - Replace generic man(1) instructions with Debian-specific information.
  DEBPKG​:debian/enc2xs_inc - http​://bugs.debian.org/290336 Tweak enc2xs to follow symlinks and ignore missing @​INC directories.
  DEBPKG​:debian/errno_ver - http​://bugs.debian.org/343351 Remove Errno version check due to upgrade problems with long-running processes.
  DEBPKG​:debian/libperl_embed_doc - http​://bugs.debian.org/186778 Note that libperl-dev package is required for embedded linking
  DEBPKG​:fixes/respect_umask - Respect umask during installation
  DEBPKG​:debian/writable_site_dirs - Set umask approproately for site install directories
  DEBPKG​:debian/extutils_set_libperl_path - EU​:MM​: Set location of libperl.a to /usr/lib
  DEBPKG​:debian/no_packlist_perllocal - Don't install .packlist or perllocal.pod for perl or vendor
  DEBPKG​:debian/prefix_changes - Fiddle with *PREFIX and variables written to the makefile
  DEBPKG​:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets.
  DEBPKG​:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor.
  DEBPKG​:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy.
  DEBPKG​:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable.
  DEBPKG​:debian/mod_paths - Tweak @​INC ordering for Debian
  DEBPKG​:debian/module_build_man_extensions - http​://bugs.debian.org/479460 Adjust Module​::Build manual page extensions for the Debian Perl policy
  DEBPKG​:debian/prune_libs - http​://bugs.debian.org/128355 Prune the list of libraries wanted to what we actually need.
  DEBPKG​:fixes/net_smtp_docs - [rt.cpan.org #36038] http​://bugs.debian.org/100195 Document the Net​::SMTP 'Port' option
  DEBPKG​:debian/perlivp - http​://bugs.debian.org/510895 Make perlivp skip include directories in /usr/local
  DEBPKG​:debian/cpanplus_definstalldirs - http​://bugs.debian.org/533707 Configure CPANPLUS to use the site directories by default.
  DEBPKG​:debian/cpanplus_config_path - Save local versions of CPANPLUS​::Config​::System into /etc/perl.
  DEBPKG​:debian/deprecate-with-apt - http​://bugs.debian.org/702096 Point users to Debian packages of deprecated core modules
  DEBPKG​:debian/squelch-locale-warnings - http​://bugs.debian.org/508764 Squelch locale warnings in Debian package maintainer scripts
  DEBPKG​:debian/skip-upstream-git-tests - Skip tests specific to the upstream Git repository
  DEBPKG​:debian/patchlevel - http​://bugs.debian.org/567489 List packaged patches for 5.18.2-2ubuntu1 in patchlevel.h
  DEBPKG​:debian/skip-kfreebsd-crash - http​://bugs.debian.org/628493 [perl #96272] Skip a crashing test case in t/op/threads.t on GNU/kFreeBSD
  DEBPKG​:fixes/document_makemaker_ccflags - http​://bugs.debian.org/628522 [rt.cpan.org #68613] Document that CCFLAGS should include $Config{ccflags}
  DEBPKG​:debian/find_html2text - http​://bugs.debian.org/640479 Configure CPAN​::Distribution with correct name of html2text
  DEBPKG​:debian/hurd_test_skip_stack - http​://bugs.debian.org/650175 Disable failing GNU/Hurd tests dist/threads/t/stack.t
  DEBPKG​:fixes/manpage_name_Test-Harness - http​://bugs.debian.org/650451 [rt.cpan.org #73399] cpan/Test-Harness​: add NAME headings in modules with POD
  DEBPKG​:debian/makemaker-pasthru - http​://bugs.debian.org/660195 [rt.cpan.org #28632] Make EU​::MM pass LD through to recursive Makefile.PL invocations
  DEBPKG​:debian/perl5db-x-terminal-emulator.patch - http​://bugs.debian.org/668490 Invoke x-terminal-emulator rather than xterm in perl5db.pl
  DEBPKG​:debian/cpan-missing-site-dirs - http​://bugs.debian.org/688842 Fix CPAN​::FirstTime defaults with nonexisting site dirs if a parent is writable
  DEBPKG​:fixes/memoize_storable_nstore - [rt.cpan.org #77790] http​://bugs.debian.org/587650 Memoize​::Storable​: respect 'nstore' option not respected
  DEBPKG​:fixes/net_ftp_failed_command - [rt.cpan.org #37700] http​://bugs.debian.org/491062 Net​::FTP​: cope gracefully with a failed command
  DEBPKG​:fixes/perlbug-patchlist - [3541c11] http​://bugs.debian.org/710842 [perl #118433] Make perlbug look up the list of local patches at run time
  DEBPKG​:fixes/module_metadata_security_doc - [68cdd4b] CVE-2013-1437 documentation fix
  DEBPKG​:fixes/module_metadata_taint_fix - [bff978f] http​://bugs.debian.org/722210 [rt.cpan.org #88576] untaint version, if needed, in Module​::Metadata
  DEBPKG​:fixes/IPC-SysV-spelling - http​://bugs.debian.org/730558 [rt.cpan.org #86736] Fix spelling of IPC_CREAT in IPC-SysV documentation
  DEBPKG​:fixes/fix-undef-source -


@​INC for perl 5.18.2​:
  /home/jima/lib/perl
  /home/jima/perl5/lib/perl5/x86_64-linux-gnu-thread-multi
  /home/jima/perl5/lib/perl5/x86_64-linux-gnu-thread-multi
  /home/jima/perl5/lib/perl5
  /etc/perl
  /usr/local/lib/perl/5.18.2
  /usr/local/share/perl/5.18.2
  /usr/lib/perl5
  /usr/share/perl5
  /usr/lib/perl/5.18
  /usr/share/perl/5.18
  /usr/local/lib/site_perl
  .


Environment for perl 5.18.2​:
  HOME=/home/jima
  LANG=en_US.UTF-8
  LANGUAGE=en_US
  LD_LIBRARY_PATH=/home/jima/local/lib
  LOGDIR (unset)
  PATH=/home/jima/perl5/bin​:/home/jima/bin​:/home/jima/local/bin​:/home/jima/jima_tools/x86_64/bin​:/home/jima/jima_tools/bin​:/opt/Adobe/Reader9/bin​:/usr/bin​:/bin​:/usr/sbin​:/sbin​:/usr/bin/X11​:/usr/local/bin​:/usr/local/sbin​:/usr/games​:/usr/local/games​:/usr/lib/jvm/java-7-oracle/bin​:/usr/lib/jvm/java-7-oracle/db/bin​:/usr/lib/jvm/java-7-oracle/jre/bin​:.
  PERL5LIB=/home/jima/lib/perl​:/home/jima/perl5/lib/perl5/x86_64-linux-gnu-thread-multi​:/home/jima/perl5/lib/perl5
  PERL_BADLANG (unset)
  PERL_LOCAL_LIB_ROOT=/home/jima/perl5
  PERL_MB_OPT=--install_base /home/jima/perl5
  PERL_MM_OPT=INSTALL_BASE=/home/jima/perl5
  SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Nov 10, 2014

From @iabyn

On Fri, Nov 07, 2014 at 11​:19​:14PM -0800, via RT wrote​:

Regexs like /\G^/msgc seem abnormally slow.

The run-time appears to be quadratic in the total length of the string
being searched, as if a failed match causes the *entire* string to be
re-processed, even though the /gc should mean just back up to where \G was.

The problem is specific to using ^ i.e. the exact same regex without ^
runs quadratically faster.

The demo script below scans a large string using a sequence of stanzas
beginning with either /\G^x/msgc or /\Gx/msgc
(neither of which ever match).

Here are my results​:
Scan 5k bytes using \G^x​: 8.43/s
Scan 10k bytes using \G^x​: 2.11/s
Scan 20k bytes using \G^x​: .53/s
Scan 40k bytes using \G^x​: .13/s
Scan 50k bytes using \Gx : 168.07/s

#!/usr/bin/perl
use strict; use warnings;
use Benchmark;
sub test($$) {
my ($re, $datasize) = @​_;
my $data = "foo\n" x ($datasize/5);
for (;;) {
if ($data =~ /$re/msgc) { die "unexpected match" }
elsif ($data =~ /\G(\w+)([^\n]*\n)/sgc) { }
elsif ($data =~ /\G\z/sgc) { last; }
else { die "bug" }
}
$data =~ /\G./msg && die "bug"; # reset pos
}
timethis (700, sub{ test('\Gx', 50000) }, "Scan 50k bytes using \\Gx ");
timethis (30, sub{ test('\G^x', 5000) }, "Scan 5k bytes using \\G^x");
timethis (15, sub{ test('\G^x', 10000) }, "Scan 10k bytes using \\G^x");
timethis (5, sub{ test('\G^x', 20000) }, "Scan 20k bytes using \\G^x");
timethis (3, sub{ test('\G^x', 40000) }, "Scan 40k bytes using \\G^x");

Turns out the optimiser (re_intuit_start()) doesn't cope too well
with multiple anchors (both \G and /^/m), and currently prioritises
the second one; so it goes hunting for /^x/m across most of the string
length (from pos()..end), rather than only looking in the one position
based on the \G restriction.

I'm halfway though refactoring re_intuit_start() (that work went into
5.20); I'll add this to my big list of things to do once I get round to
finishing it off.

--
Please note that ash-trays are provided for the use of smokers,
whereas the floor is provided for the use of all patrons.
  -- Bill Royston

@p5pRT
Copy link
Author

p5pRT commented Nov 10, 2014

The RT System itself - Status changed from 'new' to 'open'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants