Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perlre(1) and paired double quote regex searches -- #15498

Open
p5pRT opened this issue Aug 6, 2016 · 13 comments
Open

perlre(1) and paired double quote regex searches -- #15498

p5pRT opened this issue Aug 6, 2016 · 13 comments

Comments

@p5pRT
Copy link

p5pRT commented Aug 6, 2016

Migrated from rt.perl.org#128864 (status was 'open')

Searchable as RT128864$

@p5pRT
Copy link
Author

p5pRT commented Aug 6, 2016

From aab@purdue.edu

This is a comment and a possible suggestion.

I've been playing around (read a LOT of hacking) with the CPAN "JavaScript​::Beautifier" module to increase its performance and completeness. In one part of its 'get_next_token()' sub, it extracts quoted strings. I swiped the (first encountered) perlre(1) example

  /"(?​:[^"\\]++|\\.)*+"/

as the basis for the regex pattern to use. Unfortunately, during my testing, I found a javascript file, https://github.com/ternjs/acorn/blob/master/test/jquery-string.js, that causes that dread "Complex regular subexpression recursion limit" error message. I next tried variations of the second example

  /"(?>(?​:(?>[^"\\]+)|\\.)*)"/

with the same result. FWIW - the opening quote in the 'jquery-string.js' file is on the first line, the closing quote is on last line, 10314 lines later, and the file is full of \" .

I ended up using the regex expression

  /\G((?>$peek.*?(?<!\\)$peek))/s

where $peek is the quote character. It seems to work fine but I'll bet that there are probably a bunch of "gotchas" that I haven't encountered yet.

Suggestion​: update the perlre(1) example(s) ala double quoted strings to something that doesn't go "recursion limit" flaky.

-- Thanks,

-- Paul Townsend

@p5pRT
Copy link
Author

p5pRT commented Aug 6, 2016

From @cpansprout

On Sat Aug 06 14​:10​:37 2016, aab@​purdue.edu wrote​:

This is a comment and a possible suggestion.

I've been playing around (read a LOT of hacking) with the CPAN
"JavaScript​::Beautifier" module to increase its performance and
completeness. In one part of its 'get_next_token()' sub, it extracts
quoted strings. I swiped the (first encountered) perlre(1) example

/"(?​:[^"\\]++|\\.)*+"/

as the basis for the regex pattern to use. Unfortunately, during my
testing, I found a javascript file,
https://github.com/ternjs/acorn/blob/master/test/jquery-string.js,
that causes that dread "Complex regular subexpression recursion limit"
error message.

I am surprised. I thought the ++ would avoid that.

I next tried variations of the second example

/"(?>(?​:(?>[^"\\]+)|\\.)*)"/

with the same result. FWIW - the opening quote in the 'jquery-
string.js' file is on the first line, the closing quote is on last
line, 10314 lines later, and the file is full of \" .

I ended up using the regex expression

/\G((?>$peek.*?(?<!\\)$peek))/s

where $peek is the quote character. It seems to work fine but I'll
bet that there are probably a bunch of "gotchas" that I haven't
encountered yet.

Such as this code snippet​:

  alert("\\" + "\\");

In a browser, that pops up an alert message with two backslashes. If I run your regular expression on it without the \G​:

$peek = '"';
<<\ENd =~ /((?>$peek.*?(?<!\\)$peek))/s;
alert("\\" + "\\");
ENd
print "$1\n";
__END__

it gives me​:

"\\" + "

FWIW, JE uses​:

  /\G (?​: '([^'\\]*(?​:\\.[^'\\]*)*)'
  |
  "([^"\\]*(?​:\\.[^"\\]*)*)" )/xcgs

The basic pattern is​:

  normal* ( special normal* )*

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Aug 6, 2016

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Aug 8, 2016

From aab@purdue.edu

Yeah, I found the "\\" not working. Boo Hiss.

________________________________
From​: Father Chrysostomos via RT <perlbug-followup@​perl.org>
Sent​: Saturday, August 6, 2016 7​:10​:14 PM
To​: Paul Townsend
Subject​: [perl #128864] perlre(1) and paired double quote regex searches --

On Sat Aug 06 14​:10​:37 2016, aab@​purdue.edu wrote​:

This is a comment and a possible suggestion.

I've been playing around (read a LOT of hacking) with the CPAN
"JavaScript​::Beautifier" module to increase its performance and
completeness. In one part of its 'get_next_token()' sub, it extracts
quoted strings. I swiped the (first encountered) perlre(1) example

/"(?​:[^"\\]++|\\.)*+"/

as the basis for the regex pattern to use. Unfortunately, during my
testing, I found a javascript file,
https://github.com/ternjs/acorn/blob/master/test/jquery-string.js,
that causes that dread "Complex regular subexpression recursion limit"
error message.

I am surprised. I thought the ++ would avoid that.

I next tried variations of the second example

/"(?>(?​:(?>[^"\\]+)|\\.)*)"/

with the same result. FWIW - the opening quote in the 'jquery-
string.js' file is on the first line, the closing quote is on last
line, 10314 lines later, and the file is full of \" .

I ended up using the regex expression

/\G((?>$peek.*?(?<!\\)$peek))/s

where $peek is the quote character. It seems to work fine but I'll
bet that there are probably a bunch of "gotchas" that I haven't
encountered yet.

Such as this code snippet​:

  alert("\\" + "\\");

In a browser, that pops up an alert message with two backslashes. If I run your regular expression on it without the \G​:

$peek = '"';
<<\ENd =~ /((?>$peek.*?(?<!\\)$peek))/s;
alert("\\" + "\\");
ENd
print "$1\n";
__END__

it gives me​:

"\\" + "

FWIW, JE uses​:

  /\G (?​: '([^'\\]*(?​:\\.[^'\\]*)*)'
  |
  "([^"\\]*(?​:\\.[^"\\]*)*)" )/xcgs

The basic pattern is​:

  normal* ( special normal* )*

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Aug 8, 2016

From aab@purdue.edu

Further testing indicates that a small change in the first perlre(1) example enables that pattern to work even on the jsquery-string.js file

  old​: /"(?​:[^"\\]++|\\.)*+"/

  new​: /"(?​:[^"\\]++|(\\.)++)*+"/

-- Thanks,

-- Paul Townsend

________________________________
From​: Father Chrysostomos via RT <perlbug-followup@​perl.org>
Sent​: Saturday, August 6, 2016 7​:10 PM
To​: Paul Townsend
Subject​: [perl #128864] perlre(1) and paired double quote regex searches --

On Sat Aug 06 14​:10​:37 2016, aab@​purdue.edu wrote​:

This is a comment and a possible suggestion.

I've been playing around (read a LOT of hacking) with the CPAN
"JavaScript​::Beautifier" module to increase its performance and
completeness. In one part of its 'get_next_token()' sub, it extracts
quoted strings. I swiped the (first encountered) perlre(1) example

/"(?​:[^"\\]++|\\.)*+"/

as the basis for the regex pattern to use. Unfortunately, during my
testing, I found a javascript file,
https://github.com/ternjs/acorn/blob/master/test/jquery-string.js,
that causes that dread "Complex regular subexpression recursion limit"
error message.

I am surprised. I thought the ++ would avoid that.

I next tried variations of the second example

/"(?>(?​:(?>[^"\\]+)|\\.)*)"/

with the same result. FWIW - the opening quote in the 'jquery-
string.js' file is on the first line, the closing quote is on last
line, 10314 lines later, and the file is full of \" .

I ended up using the regex expression

/\G((?>$peek.*?(?<!\\)$peek))/s

where $peek is the quote character. It seems to work fine but I'll
bet that there are probably a bunch of "gotchas" that I haven't
encountered yet.

Such as this code snippet​:

  alert("\\" + "\\");

In a browser, that pops up an alert message with two backslashes. If I run your regular expression on it without the \G​:

$peek = '"';
<<\ENd =~ /((?>$peek.*?(?<!\\)$peek))/s;
alert("\\" + "\\");
ENd
print "$1\n";
__END__

it gives me​:

"\\" + "

FWIW, JE uses​:

  /\G (?​: '([^'\\]*(?​:\\.[^'\\]*)*)'
  |
  "([^"\\]*(?​:\\.[^"\\]*)*)" )/xcgs

The basic pattern is​:

  normal* ( special normal* )*

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Sep 22, 2016

From @Abigail

On Sat, Aug 06, 2016 at 04​:10​:14PM -0700, Father Chrysostomos via RT wrote​:

On Sat Aug 06 14​:10​:37 2016, aab@​purdue.edu wrote​:

This is a comment and a possible suggestion.

I've been playing around (read a LOT of hacking) with the CPAN
"JavaScript​::Beautifier" module to increase its performance and
completeness. In one part of its 'get_next_token()' sub, it extracts
quoted strings. I swiped the (first encountered) perlre(1) example

/"(?​:[^"\\]++|\\.)*+"/

as the basis for the regex pattern to use. Unfortunately, during my
testing, I found a javascript file,
https://github.com/ternjs/acorn/blob/master/test/jquery-string.js,
that causes that dread "Complex regular subexpression recursion limit"
error message.

I am surprised. I thought the ++ would avoid that.

I'm surprised as well. Mostly by the claim it's the most efficient
way of finding such strings. The alternation inside the loop is
a red flag.

[ Snip ]

FWIW, JE uses​:

    /\\G \(?&#8203;: '\(\[^'\\\\\]\*\(?&#8203;:\\\\\.\[^'\\\\\]\*\)\*\)'
              |
            "\(\[^"\\\\\]\*\(?&#8203;:\\\\\.\[^"\\\\\]\*\)\*\)"  \)/xcgs

The basic pattern is​:

normal\* \( special normal\* \)\*

Yes. That's the loop unrolling technique Jeffrey Friedl taught us
almost 20 years ago.

A benchmark shows it's faster than the suggestion made by perlre.
More surprisingly, using possessive quantifiers is slightly slower,
both in the perlre suggestion, and the unrolling technique​:

  #!/opt/perl/bin/perl

  use 5.010;

  use strict;
  use warnings;
  no warnings 'syntax';

  my $RUNS = 1_000;

  my %patterns = (
  unroll_pos => qr /"[^"\\]*+(?​:\\(?s​:.)[^"\\]*+)*+"/,
  unroll_reg => qr /"[^"\\]*(?​:\\(?s​:.)[^"\\]*)*"/,
  perlre_pos => qr /"(?​:[^"\\]++|\\(?s​:.))*+"/,
  perlre_reg => qr /"(?​:[^"\\]+|\\(?s​:.))*"/,
  );

  my $text = `cat jquery-string.js`;

  my %results;

  while (my ($name, $pattern) = each %patterns) {
  my ($u1, $s1) = times;
  foreach (1 .. $RUNS) {
  my @​r = $text =~ /$pattern/g;
  }
  my ($u2, $s2) = times;
  $results {$name} = $u2 + $s2 - $u1 - $s1;
  }

  foreach my $name (sort {$results {$a} <=> $results {$b}} keys %results) {
  printf "Avg. run time for '%s'​: %.4f\n" => $name,
  $results {$name} / $RUNS;
  }

  __END__
  Avg. run time for 'unroll_reg'​: 0.0998
  Avg. run time for 'unroll_pos'​: 0.1055
  Avg. run time for 'perlre_reg'​: 0.1131
  Avg. run time for 'perlre_pos'​: 0.1195

It's tempting to change the pattern in perlre to the more efficient
unrolling technique, but the pattern is there to demonstrate how
possessive quantifiers work.

But we may want to scratch the part about it claiming to be the most
efficient way.

Abigail

@p5pRT
Copy link
Author

p5pRT commented Sep 27, 2016

From @xsawyerx

On 09/22/2016 03​:25 AM, Abigail wrote​:

[...]
It's tempting to change the pattern in perlre to the more efficient
unrolling technique, but the pattern is there to demonstrate how
possessive quantifiers work.

What about adding a line that says, "This is meant to demonstrate
possessive quantifiers. If you're looking for efficiency, you could
write the following​:..." and adding the more efficient example?

@p5pRT
Copy link
Author

p5pRT commented Sep 28, 2016

From @jkeenan

On Sat Aug 06 14​:10​:37 2016, aab@​purdue.edu wrote​:

This is a comment and a possible suggestion.

To return to the original problem for a moment ...

I've been playing around (read a LOT of hacking) with the CPAN
"JavaScript​::Beautifier" module to increase its performance and
completeness. In one part of its 'get_next_token()' sub, it extracts
quoted strings. I swiped the (first encountered) perlre(1) example

/"(?​:[^"\\]++|\\.)*+"/

as the basis for the regex pattern to use. Unfortunately, during my
testing, I found a javascript file,
https://github.com/ternjs/acorn/blob/master/test/jquery-string.js,
that causes that dread "Complex regular subexpression recursion limit"
error message. I next tried variations of the second example

/"(?>(?​:(?>[^"\\]+)|\\.)*)"/

with the same result. FWIW - the opening quote in the 'jquery-
string.js' file is on the first line, the closing quote is on last
line, 10314 lines later, and the file is full of \" .

I was not able to reproduce that error message. See attached 128864-test-long-js.pl. Running that program produced​:

#####
$ perl 128864-test-long-js.pl
No
No
No
No
#####

Is there a flaw in my test program or does this error only occur on certain platforms?

Thank you very much.

--
James E Keenan (jkeenan@​cpan.org)

@p5pRT
Copy link
Author

p5pRT commented Sep 28, 2016

From @jkeenan

128864-test-long-js.pl

@p5pRT
Copy link
Author

p5pRT commented Sep 29, 2016

From aab@purdue.edu

I just realized the other day that I specified he wrong URL for 'jquery-string.js'. The original

  https://github.com/ternjs/acorn/blob/master/test/jquery-string.js

is a browserfied HTML version of the raw file. The raw file itself is at

  https​://raw.githubusercontent.com/ternjs/acorn/master/test/jquery-string.js<https://raw.githubusercontent.com/ternjs/acorn>

Reading between the lines, I suspect that the last several commenters have used the HTML file as the source.

I apologize for the confusion. I'm just learning the ins/outs of the github site

FWIW - the current HTML file has 83692 quoted strings. All of the raw files double quotes have been converted to '"'. The raw file has a single quoted string that's 338729 characters long and contains a very large number of '\.' pairs many of which are consecutive.

.

All of Abigail's test patterns fail using the raw file. If you change her "unroll_pos" pattern

  /"[^"\\]*+(?​:\\(?s​:.)[^"\\]*+)*+"/

to

  /"[^"\\]*+(?​:(?s​:\\.)++[^"\\]*+)*+"/

I apologize for the confusion. I'm just learning the ins/outs of the github site

/

it works but is slightly slower in general.

My testing of a modified Abigail script tells me that possesive patterns are faster for the raw file but slower for the HTML file.

-- Paul Townsend

________________________________
From​: James E Keenan via RT <perlbug-followup@​perl.org>
Sent​: Wednesday, September 28, 2016 5​:34 PM
To​: Paul Townsend
Subject​: [perl #128864] perlre(1) and paired double quote regex searches --

On Sat Aug 06 14​:10​:37 2016, aab@​purdue.edu wrote​:

This is a comment and a possible suggestion.

To return to the original problem for a moment ...

James E Keenan (jkeenan@​cpan.org)

@p5pRT
Copy link
Author

p5pRT commented Sep 29, 2016

From @jkeenan

On Thu Sep 29 12​:34​:27 2016, aab@​purdue.edu wrote​:

I just realized the other day that I specified he wrong URL for
'jquery-string.js'. The original

https://github.com/ternjs/acorn/blob/master/test/jquery-string.js

is a browserfied HTML version of the raw file. The raw file itself is
at

https://raw.githubusercontent.com/ternjs/acorn/master/test/jquery-
string.js<https://raw.githubusercontent.com/ternjs/acorn>

Okay, I have just now used 'wget' to obtain that file and have run it thru my script. I got exactly the same results as previously, i.e., 4 "No\n" with no fatal errors.

Thank you very much.

--
James E Keenan (jkeenan@​cpan.org)

@p5pRT
Copy link
Author

p5pRT commented Oct 6, 2016

From aab@purdue.edu

Okay, here's my contribution. I took Abigails's script and went a bit overboard hacking it (see attachment).

# Pattern name modifiers​:
# pos = uses possessive quantifiers, i.e., *+ instead of * and ++ instead of +
# reg = uses non-possessive quantifiers, i.e., * or + or nothing
# pr = uses a combination of possessive/non-possessive quantifiers
# vb = uses variable backslashdot pattern, i.e., (?​:\\.)* or (?​:\\.)+
# na = does not use alternation(|)

URL = https://raw.githubusercontent.com/ternjs/acorn/master/test/jquery-string.js
Average CPU Seconds per regex invocation per pattern​: invocations = 60000
pattern-name CPU/regex state 'qr/.../' compiled-pattern
unroll_pr2_vb 0.00572 pass (?^​:"[^"\\]*(?​:(?s​:\\.)++[^"\\]*)*")
unroll_pr1_vb 0.00577 pass (?^​:"[^"\\]*(?​:(?s​:\\.)++[^"\\]*)*+")
perlre_pos_vbna 0.00597 pass (?^​:"(?​:[^"\\]++(?s​:\\.)*+)*+")
perlre_pr2_vbna 0.00600 pass (?^​:"(?​:[^"\\]+(?s​:\\.)*+)*")
unroll_pos_vb 0.00602 pass (?^​:"[^"\\]*+(?​:(?s​:\\.)++[^"\\]*+)*+")
perlre_pr1_vbna 0.00752 pass (?^​:"(?​:[^"\\]+(?s​:\\.)*)*+")
unroll_reg_vb 0.00759 pass (?^​:"[^"\\]*(?​:(?s​:\\.)+[^"\\]*)*")
perlre_reg_vbna 0.00762 pass (?^​:"(?​:[^"\\]+(?s​:\\.)*)*")
perlre_pos_vb 0.01020 pass (?^​:"(?​:[^"\\]++|(?s​:\\.)*+)*+")
perlre_reg_vb 0.01151 pass (?^​:"(?​:[^"\\]+|(?s​:\\.)*)*")
perlre_reg 0.13182 fail (?^​:"(?​:[^"\\]+|\\(?s​:.))*")
unroll_reg 0.14338 fail (?^​:"[^"\\]*(?​:\\(?s​:.)[^"\\]*)*")
unroll_pos 4.10578 fail (?^​:"[^"\\]*+(?​:\\(?s​:.)[^"\\]*+)*+")
perlre_pos 6.31433 fail (?^​:"(?​:[^"\\]++|\\(?s​:.))*+")

URL = https://github.com/ternjs/acorn/blob/master/test/jquery-string.js
Average CPU Seconds per regex invocation per pattern​: invocations = 60000
pattern-name CPU/regex state 'qr/.../' compiled-pattern
unroll_reg 0.11773 pass (?^​:"[^"\\]*(?​:\\(?s​:.)[^"\\]*)*")
unroll_reg_vb 0.11941 pass (?^​:"[^"\\]*(?​:(?s​:\\.)+[^"\\]*)*")
unroll_pos 0.12037 pass (?^​:"[^"\\]*+(?​:\\(?s​:.)[^"\\]*+)*+")
unroll_pr2_vb 0.12207 pass (?^​:"[^"\\]*(?​:(?s​:\\.)++[^"\\]*)*")
unroll_pr1_vb 0.12430 pass (?^​:"[^"\\]*(?​:(?s​:\\.)++[^"\\]*)*+")
unroll_pos_vb 0.12483 pass (?^​:"[^"\\]*+(?​:(?s​:\\.)++[^"\\]*+)*+")
perlre_reg_vbna 0.12925 pass (?^​:"(?​:[^"\\]+(?s​:\\.)*)*")
perlre_pr1_vbna 0.12994 pass (?^​:"(?​:[^"\\]+(?s​:\\.)*)*+")
perlre_reg 0.13298 pass (?^​:"(?​:[^"\\]+|\\(?s​:.))*")
perlre_pr2_vbna 0.13323 pass (?^​:"(?​:[^"\\]+(?s​:\\.)*+)*")
perlre_reg_vb 0.13626 pass (?^​:"(?​:[^"\\]+|(?s​:\\.)*)*")
perlre_pos_vbna 0.13668 pass (?^​:"(?​:[^"\\]++(?s​:\\.)*+)*+")
perlre_pos 0.13795 pass (?^​:"(?​:[^"\\]++|\\(?s​:.))*+")
perlre_pos_vb 0.14314 pass (?^​:"(?​:[^"\\]++|(?s​:\\.)*+)*+")

The first set is for the raw 'jquery-string.js' file and the second for its HTMLized version. Note that Abigail's four patterns all fail with the raw file but her 'unroll_reg' pattern is the fastest for the HTMLized file. Actually, all of the 'unroll_...' patterns are generally faster than any of the 'perlre_...' ones for that file.

The raw file has a single quoted string that is 338729 characters long and contains a large count of (often consecutive) '\.' pairs.

The HTML file contains 83692 quoted strings and all of the raw files double quotes have been converted to '"'. Although I didn't really check, I don't think any of the quoted strings contains a '\.' pair.

If you KNOW that the files you are scanning do not have a large number of '\.' pairs, especially consecutive ones, use the 'unroll_reg' pattern. The possessive pattern, 'unroll_pos' is not far behind. If you want a pattern that is fast and works for all quoted strings, use the 'unroll_reg_vb' pattern. If you want a pattern that's fast for files with a lot of '\.' pairs, use the 'unroll_pr2_vb' pattern.

It should be noted that the script was executed on a Dell Optiplex-760 box with two 3GHz CPUs using a Cygwin 2.6.0 installation that sits on top of a Windows Vista 32-bit OS. Perl's Time​::HiRes​::clock() and times() functions use a clock that has a 1 millisecond resolution so some of the raw file timings are probably variable. I'm not sure what it means but Time​::HiRes​::clock_getres() says approximately 15 milliseconds. Boy, it would be nice to have clock with a higher resolution.

Mr. Keenan. You can see the perl error messages describing why Abigail's patterns failed if you run the attached script with the '-d' (debug) option.

The script can also be used check the patterns against your favorite quoted string test file by using the '-t file' option.

-- Thanks,
-- Paul Townsend

________________________________
From​: James E Keenan via RT <perlbug-followup@​perl.org>
Sent​: Thursday, September 29, 2016 6​:13 PM
To​: Paul Townsend
Subject​: [perl #128864] perlre(1) and paired double quote regex searches --

On Thu Sep 29 12​:34​:27 2016, aab@​purdue.edu wrote​:

I just realized the other day that I specified he wrong URL for
'jquery-string.js'. The original

https://github.com/ternjs/acorn/blob/master/test/jquery-string.js

is a browserfied HTML version of the raw file. The raw file itself is
at

https://raw.githubusercontent.com/ternjs/acorn/master/test/jquery-
string.js<https://raw.githubusercontent.com/ternjs/acorn>

Okay, I have just now used 'wget' to obtain that file and have run it thru my script. I got exactly the same results as previously, i.e., 4 "No\n" with no fatal errors.

Thank you very much.

--
James E Keenan (jkeenan@​cpan.org)

@p5pRT
Copy link
Author

p5pRT commented Oct 6, 2016

From aab@purdue.edu

perl_quote_patterns

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants