perlre(1) and paired double quote regex searches -- #15498
From aab@purdue.edu
This is a comment and a possible suggestion.
I've been playing around (read: a LOT of hacking) with the CPAN "JavaScript::Beautifier" module to increase its performance and completeness. In one part of its 'get_next_token()' sub, it extracts quoted strings. I swiped the (first encountered) perlre(1) example
/"(?:[^"\\]++|\\.)*+"/
as the basis for the regex pattern to use. Unfortunately, during my testing, I found a JavaScript file, https://github.com/ternjs/acorn/blob/master/test/jquery-string.js, that causes the dreaded "Complex regular subexpression recursion limit" error message. I next tried variations of the second example
/"(?>(?:(?>[^"\\]+)|\\.)*)"/
with the same result. FWIW, the opening quote in the 'jquery-string.js' file is on the first line, the closing quote is on the last line, 10314 lines later, and the file is full of \" .
I ended up using the regex
/\G((?>$peek.*?(?<!\\)$peek))/s
where $peek is the quote character. It seems to work fine, but I'll bet that there are a bunch of "gotchas" that I haven't encountered yet.
Suggestion: update the perlre(1) example(s) for double-quoted strings to something that doesn't go "recursion limit" flaky.
-- Thanks, -- Paul Townsend |
From @cpansprout
On Sat Aug 06 14:10:37 2016, aab@purdue.edu wrote:
I am surprised. I thought the ++ would avoid that.
Such as this code snippet:
alert("\\" + "\\");
In a browser, that pops up an alert message with two backslashes. If I run your regular expression on it without the \G:
$peek = '"';
it gives me:
"\\" + "
FWIW, JE uses:
/\G (?: '([^'\\]*(?:\\.[^'\\]*)*)'
The basic pattern is:
normal* ( special normal* )*
-- Father Chrysostomos |
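The failure described in this comment can be reproduced with a few lines of Perl. The following is a minimal sketch, not code from the thread: it runs the lookbehind pattern (without the \G) and a double-quote version of the "normal* ( special normal* )*" pattern on the alert() snippet. The double-quote variant and the variable names are assumptions made for illustration; the JE pattern quoted above is the single-quote form.

```perl
use strict;
use warnings;

# The JavaScript snippet from the comment. A single-quoted heredoc keeps
# every backslash literal, so each string literal holds the two characters \\.
my $js = <<'JS';
alert("\\" + "\\");
JS
chomp $js;

my $peek = '"';

# Lookbehind pattern from the original report, without the \G anchor.
my ($lookbehind) = $js =~ /((?>$peek.*?(?<!\\)$peek))/s;

# Unrolled loop, normal* ( special normal* )*, adapted to double quotes
# (an assumption; the pattern quoted from JE handles single quotes).
my ($unrolled) = $js =~ /("[^"\\]*(?:\\.[^"\\]*)*")/s;

print "lookbehind: $lookbehind\n";   # "\\" + "
print "unrolled:   $unrolled\n";     # "\\"
```

The lookbehind rejects the real closing quote because it is preceded by a backslash, even though that backslash is itself escaped. The unrolled form consumes backslash pairs left to right, so it does not make that mistake.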
The RT System itself - Status changed from 'new' to 'open' |
From aab@purdue.edu
Yeah, I found the "\\" not working. Boo Hiss. |
From aab@purdue.edu
Further testing indicates that a small change to the first perlre(1) example enables that pattern to work even on the jquery-string.js file:
old: /"(?:[^"\\]++|\\.)*+"/
new: /"(?:[^"\\]++|(\\.)++)*+"/
-- Thanks, -- Paul Townsend |
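One way to compare the old and new perlre(1) patterns without downloading jquery-string.js is to build a synthetic string packed with consecutive \" pairs. The sketch below does that and records any warnings Perl emits. The pair count, the hash keys, and the synthetic data are all assumptions for illustration. Whether the "Complex regular subexpression recursion limit" warning appears depends on the Perl version, so no expected output is shown.

```perl
use strict;
use warnings;

# Synthetic stand-in for the long jQuery string: a single quoted string
# made of consecutive \" pairs. The pair count is an arbitrary assumption.
my $pairs = 100_000;
my $text  = '"' . ('\\"' x $pairs) . '"';

# Patterns as quoted in the thread; the hash keys are illustrative names.
my %patterns = (
    perlre_old => qr/"(?:[^"\\]++|\\.)*+"/,
    perlre_new => qr/"(?:[^"\\]++|(\\.)++)*+"/,
);

for my $name (sort keys %patterns) {
    # Collect warnings instead of letting them go straight to STDERR.
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };

    my $status = $text =~ /\A$patterns{$name}\z/ ? 'matched' : 'no match';
    printf "%-10s %s, %d warning(s)\n", $name, $status, scalar @warnings;
    print "    $_" for @warnings;
}
```

In the new pattern, a run of escapes is consumed by the inner (\\.)++ in a single pass of the outer group, so the outer loop iterates far fewer times on input like this.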
From @Abigail
On Sat, Aug 06, 2016 at 04:10:14PM -0700, Father Chrysostomos via RT wrote:
I'm surprised as well. Mostly by the claim it's the most efficient
[ Snip ]
Yes. That's the loop unrolling technique Jeffrey Friedl taught us
A benchmark shows it's faster than the suggestion made by perlre.
#!/opt/perl/bin/perl
use 5.010;
use strict;
my $RUNS = 1_000;
my %patterns = (
my $text = `cat jquery-string.js`;
my %results;
while (my ($name, $pattern) = each %patterns) {
foreach my $name (sort {$results {$a} <=> $results {$b}} keys %results) {
__END__
It's tempting to change the pattern in perlre to the more efficient
But we may want to scratch the part about it claiming to be the most
Abigail |
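The benchmark script in this comment survives only in fragments, so the following is a separate, minimal sketch of the same kind of comparison rather than a reconstruction of it. It uses the core Benchmark module on hypothetical synthetic data. The two patterns are the perlre(1) example and the "unroll_pos" pattern quoted elsewhere in this thread; the helper name count_strings and the test text are assumptions.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: 2000 short quoted strings, each with a few escapes.
my $text = join ' ',
    map { qq{"item $_ says \\"hi\\" and \\\\ done"} } 1 .. 2_000;

my %patterns = (
    perlre     => qr/"(?:[^"\\]++|\\.)*+"/,
    unroll_pos => qr/"[^"\\]*+(?:\\(?s:.)[^"\\]*+)*+"/,
);

# Count how many quoted strings one pattern finds in the whole text.
sub count_strings {
    my ($re) = @_;
    my $n = () = $text =~ /$re/g;
    return $n;
}

# Sanity check first: both patterns should find the same number of strings.
printf "%-10s found %d strings\n", $_, count_strings($patterns{$_})
    for sort keys %patterns;

# Run each pattern for at least one CPU second and print a comparison table.
cmpthese(-1, {
    map { my $re = $patterns{$_}; ($_ => sub { count_strings($re) }) }
        keys %patterns
});
```

Timings from a sketch like this depend heavily on the input. As later comments in this thread point out, a file with one enormous string full of escapes behaves very differently from a file with many short strings.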
From @xsawyerx
On 09/22/2016 03:25 AM, Abigail wrote:
What about adding a line that says, "This is meant to demonstrate |
From @jkeenan
On Sat Aug 06 14:10:37 2016, aab@purdue.edu wrote:
To return to the original problem for a moment ...
I was not able to reproduce that error message. See attached 128864-test-long-js.pl. Running that program produced: ##### Is there a flaw in my test program or does this error only occur on certain platforms? Thank you very much. -- |
From aab@purdue.edu
I just realized the other day that I specified the wrong URL for 'jquery-string.js'. The original, https://github.com/ternjs/acorn/blob/master/test/jquery-string.js, is a browserfied HTML version of the raw file. The raw file itself is at https://raw.githubusercontent.com/ternjs/acorn/master/test/jquery-string.js
Reading between the lines, I suspect that the last several commenters have used the HTML file as the source. I apologize for the confusion. I'm just learning the ins/outs of the github site.
FWIW, the current HTML file has 83692 quoted strings. All of the raw file's double quotes have been converted to '&quot;'. The raw file has a single quoted string that's 338729 characters long and contains a very large number of '\.' pairs, many of which are consecutive.
All of Abigail's test patterns fail using the raw file. If you change her "unroll_pos" pattern
/"[^"\\]*+(?:\\(?s:.)[^"\\]*+)*+"/
to
/"[^"\\]*+(?:(?s:\\.)++[^"\\]*+)*+"/
it works but is slightly slower in general. My testing of a modified Abigail script tells me that possessive patterns are faster for the raw file but slower for the HTML file.
-- Paul Townsend |
From @jkeenan
On Thu Sep 29 12:34:27 2016, aab@purdue.edu wrote:
Okay, I have just now used 'wget' to obtain that file and have run it thru my script. I got exactly the same results as previously, i.e., 4 "No\n" with no fatal errors. Thank you very much. -- |
From aab@purdue.edu
Okay, here's my contribution. I took Abigail's script and went a bit overboard hacking it (see attachment).
# Pattern name modifiers:
URL = https://raw.githubusercontent.com/ternjs/acorn/master/test/jquery-string.js
URL = https://github.com/ternjs/acorn/blob/master/test/jquery-string.js
The first set is for the raw 'jquery-string.js' file and the second for its HTMLized version. Note that Abigail's four patterns all fail with the raw file, but her 'unroll_reg' pattern is the fastest for the HTMLized file. Actually, all of the 'unroll_...' patterns are generally faster than any of the 'perlre_...' ones for that file.
The raw file has a single quoted string that is 338729 characters long and contains a large count of (often consecutive) '\.' pairs. The HTML file contains 83692 quoted strings, and all of the raw file's double quotes have been converted to '&quot;'. Although I didn't really check, I don't think any of the quoted strings contains a '\.' pair.
If you KNOW that the files you are scanning do not have a large number of '\.' pairs, especially consecutive ones, use the 'unroll_reg' pattern. The possessive pattern, 'unroll_pos', is not far behind.
If you want a pattern that is fast and works for all quoted strings, use the 'unroll_reg_vb' pattern.
If you want a pattern that's fast for files with a lot of '\.' pairs, use the 'unroll_pr2_vb' pattern.
It should be noted that the script was executed on a Dell Optiplex-760 box with two 3GHz CPUs using a Cygwin 2.6.0 installation that sits on top of a Windows Vista 32-bit OS. Perl's Time::HiRes::clock() and times() functions use a clock that has a 1 millisecond resolution, so some of the raw file timings are probably variable. I'm not sure what it means, but Time::HiRes::clock_getres() says approximately 15 milliseconds. Boy, it would be nice to have a clock with a higher resolution.
Mr. Keenan: you can see the perl error messages describing why Abigail's patterns failed if you run the attached script with the '-d' (debug) option. The script can also be used to check the patterns against your favorite quoted string test file by using the '-t file' option.
-- Thanks, |
Migrated from rt.perl.org#128864 (status was 'open')
Searchable as RT128864$