(*SKIP) not triggering correctly #14979

Open
p5pRT opened this issue Oct 12, 2015 · 8 comments

Comments

@p5pRT

p5pRT commented Oct 12, 2015

Migrated from rt.perl.org#126327 (status was 'open')

Searchable as RT126327$

@p5pRT
Author

p5pRT commented Oct 12, 2015

From 0perlbugs@rexegg.com

(*SKIP) should get triggered if the engine attempts to backtrack across it.

Perhaps due to internal optimizations, (*SKIP) is not getting triggered in cases where backtracking across (*SKIP) is expected.

=== Case 1 ===
if ('aaaardvark aaardwolf' =~ /a{1,2}(*SKIP)ard\w+/ ) { print "\$&=\"$&\"\n"; }
# After failing to match the "r", an attempt to backtrack into the {1,2} should trigger (*SKIP)
# expected: aaardwolf
# matched: aaardvark
# note: PCRE matches the expected aaardwolf, as does Python's alternate "regex" package

=== Case 2 ===
if ('aaaardvark aaardwolf' =~ /aa(*SKIP)ard\w+/ ) { print "\$&=\"$&\"\n"; }
# matched: aaardvark
# This is more open to interpretation. Even though there is nothing to backtrack to the left of (*SKIP), a naive path exploration would cause the engine to backtrack to the beginning of the string, triggering (*SKIP). This is the interpretation chosen by PCRE (even though internally it does not backtrack) as well as Python's alternate "regex" Package.
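
For reference, the "naive path exploration" reading of Case 2 can be modelled by hand. This is only a sketch of the semantics I expect, not a description of how Perl, PCRE or any other engine is implemented. The function name `naive_skip_match` is made up for illustration, and the pattern aa(*SKIP)ard\w+ is hard-coded into it.

```perl
use strict;
use warnings;

# Hand-rolled model of /aa(*SKIP)ard\w+/ with no start-position optimization.
# Every start offset is tried in order; once "aa" has matched, a later failure
# moves the next attempt to the offset where (*SKIP) was reached.
sub naive_skip_match {
    my ($s) = @_;
    my $i = 0;
    while ($i < length $s) {
        if (substr($s, $i, 2) eq 'aa') {
            my $skip = $i + 2;              # position recorded by (*SKIP)
            pos($s) = $skip;
            if ($s =~ /\Gard\w+/gc) {       # rest of the pattern, anchored here
                return substr($s, $i, pos($s) - $i);
            }
            $i = $skip;                     # backtracking across (*SKIP)
        }
        else {
            $i++;                           # ordinary bump-along
        }
    }
    return;
}

print naive_skip_match('aaaardvark aaardwolf'), "\n";   # prints aaardwolf
```

Under this model the attempts at offsets 0 and 2 both fail after "aa", so offsets 1 and 3 are never tried and the first successful start is the second word.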

@p5pRT
Author

p5pRT commented Oct 12, 2015

From 0perlbugs@rexegg.com

The Case 2 behavior is also inconsistent with the ever so popular (*SKIP)(*FAIL) construct, where the engine fires the (*SKIP) even when there is nothing to backtrack to the left of it. For instance,

if ('tatatiti' =~ /tata(*SKIP)(*FAIL)|.{4}/ ) { print "\$&=\"$&\"\n"; }
# $&="titi"
# this shows that (*SKIP) fired

@p5pRT
Author

p5pRT commented Oct 12, 2015

From 0perlbugs@rexegg.com

Case 2 is also inconsistent with
if ('123ABC' =~ /123(*SKIP)B|.{3}/ ) { print "\$&='$&'\n"; }

where (*SKIP) fires (correctly IMO) even though there is nothing to backtrack to the left of it, eventually matching ABC
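
One way to see that (*SKIP) fires in these alternation patterns is to run each one with and without the verb. A minimal sketch follows; the helper name `first_match` is hypothetical, and the results noted for the (*SKIP) variants are the ones reported in this ticket.

```perl
use strict;
use warnings;

# Return the text of the first match, or undef when nothing matches.
sub first_match {
    my ($string, $re) = @_;
    return $string =~ $re ? $& : undef;
}

# Without the verb, the second alternative matches at offset 0.
print first_match('123ABC',   qr/123B|.{3}/),              "\n";  # 123
print first_match('tatatiti', qr/tata(*FAIL)|.{4}/),       "\n";  # tata

# With (*SKIP), the failure after the prefix moves the next attempt past it.
print first_match('123ABC',   qr/123(*SKIP)B|.{3}/),       "\n";  # ABC, as reported above
print first_match('tatatiti', qr/tata(*SKIP)(*FAIL)|.{4}/), "\n"; # titi, as reported above
```

Neither pattern has a literal that every match must contain, which may be why the start-position optimization does not interfere here the way it does in Case 2.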

@p5pRT
Author

p5pRT commented Oct 12, 2015

From @demerphq

On 12 October 2015 at 02:40, Rex <perlbug-followup@perl.org> wrote:

# New Ticket Created by Rex
# Please include the string: [perl #126327]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=126327 >

(*SKIP) should get triggered if the engine attempts to backtrack across it.

Perhaps due to internal optimizations, (*SKIP) is not getting triggered in cases where backtracking across (*SKIP) is expected.

Yes, other optimizations kick in which mean that in some cases it does
not even try the pattern.

=== Case 1 ===
if ('aaaardvark aaardwolf' =~ /a{1,2}(*SKIP)ard\w+/ ) { print "\$&=\"$&\"\n"; }
# After failing to match the "r", an attempt to backtrack into the {1,2} should trigger (*SKIP)
# expected: aaardwolf
# matched: aaardvark
# note: PCRE matches the expected aaardwolf, as does Python's alternate "regex" package

The mandatory minimal string in the pattern is "aard". If we do not see
"aard" in the string then we do not even try the regex engine.

=== Case 2 ===
if ('aaaardvark aaardwolf' =~ /aa(*SKIP)ard\w+/ ) { print "\$&=\"$&\"\n"; }
# matched: aaardvark
# This is more open to interpretation. Even though there is nothing to backtrack to the left of (*SKIP), a naive path exploration would cause the engine to backtrack to the beginning of the string, triggering (*SKIP). This is the interpretation chosen by PCRE (even though internally it does not backtrack) as well as Python's alternate "regex" Package.

Again this is an interaction with the minimal substring optimization. We
jump directly to the 2nd char, which is the first place the
mandatory substring "aaard" is found.
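
The strings the optimizer has picked for a compiled pattern can be inspected with `regmust` from the core `re` module. The sketch below only prints whatever the running perl reports, so the output may differ between versions; the wrapper name `must_strings` is hypothetical.

```perl
use strict;
use warnings;
use re qw(regmust);

# Ask the optimizer which fixed strings it requires for a compiled pattern.
# Either entry may be undef when no such string was found.
sub must_strings {
    my ($qr) = @_;
    my ($anchored, $floating) = regmust($qr);
    return { anchored => $anchored, floating => $floating };
}

for my $qr (qr/a{1,2}(*SKIP)ard\w+/, qr/aa(*SKIP)ard\w+/) {
    my $must = must_strings($qr);
    printf "%-30s anchored=%s floating=%s\n", $qr,
        $must->{anchored} // '(none)',
        $must->{floating} // '(none)';
}
```

Running the same patterns under `use re 'debug'` shows the start-position decisions in more detail.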

When I originally implemented these directives I decided that they
would NOT disable general optimizations. In a few cases I had been
frustrated by (??{}) and (?{}) doing so, and decided not to repeat
the same for the backtracking verbs.

I think probably the best way to fix this is to have a modifier flag
which disables start position optimisations, so people can opt in if
they wish.

Alternatively, maybe I just didn't make the right decision about which
verbs should disable optimisations.

I was going to say that you can stick (??{ "" }) in your pattern to
disable the required string optimisation, but either I misremember
that it used to work, or something has changed with how it works.

I will try to follow up on this stuff when I get time.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Author

p5pRT commented Oct 12, 2015

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Author

p5pRT commented Oct 12, 2015

From 0perlbugs@rexegg.com

Hi Yves,

Thank you very much for looking into this!

Let me preface this by saying that I find it wonderful that these verbs are even in the language. Thank you for this facility and all your hard work. A few weeks ago (*SKIP) and (*PRUNE) were picked up by Python's alternate `regex` package, so their influence is slowly spreading.

I'm a little sad about the crippling of (*SKIP) by optimizations. Could this be a case where it's less important to save time by studying the pattern than to preserve the intent expressed by the pattern writer?

You mentioned two possible directions: disabling optimizations for (*SKIP), or introducing a verb to do that. If you choose the second direction, may I suggest (*NO_START_OPT)?
This would make it compatible with PCRE. This modifier is explained in this section about start-of-pattern modifiers:
http://www.rexegg.com/regex-modifiers.html#pcre

Usually PCRE regex borrows from Perl, but there have been occasions when the reverse has taken place:
http://www.rexegg.com/pcre-documentation.html#perl_pcre

I'm preparing a long page to explain backtracking control verbs in the three engines that support them (Perl, PCRE and to a lesser extent Python via the alternate regex package), and that's how I happened to notice these behaviors.

With many thanks and kindest regards,

Rex
