Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(*THEN) broken inside condition subpattern #11443

Open
p5pRT opened this issue Jun 15, 2011 · 29 comments
Open

(*THEN) broken inside condition subpattern #11443

p5pRT opened this issue Jun 15, 2011 · 29 comments

Comments

@p5pRT
Copy link

p5pRT commented Jun 15, 2011

Migrated from rt.perl.org#92898 (status was 'open')

Searchable as RT92898$

@p5pRT
Copy link
Author

p5pRT commented Jun 15, 2011

From ph10@hermes.cam.ac.uk

Created by ph10@cam.ac.uk

It seems to me that, if what precedes (*THEN) in a branch matches
only a fixed string (no backtracking points), then the behaviour
should be exactly the same as if (*THEN) is not present. Here is
an example where that is not so​:

Pattern​: /^.*?(?(?=a)a|b(*THEN)c)/
Subject​: ba
Result​: no match

Pattern​: /^.*?(?(?=a)a|bc)/
Subject​: ba
Result​: matches "ba"

I noticed this because I have just fixed the same bug in PCRE.

Philip Hazel

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.12.3:

Configured by Gentoo at Tue May 10 17:35:58 BST 2011.

Summary of my perl5 (revision 5 version 12 subversion 3) configuration:
   
  Platform:
    osname=linux, osvers=2.6.37-gentoo-r4, archname=i686-linux
    uname='linux quercite 2.6.37-gentoo-r4 #1 smp preempt tue may 10 17:11:00 bst 2011 i686 intel(r) pentium(r) 4 cpu 2.80ghz genuineintel gnulinux '
    config_args='-des -Duseshrplib -Darchname=i686-linux -Dcc=i686-pc-linux-gnu-gcc -Doptimize=-O2 -march=pentium4 -funroll-loops -fprefetch-loop-arrays -pipe -Dldflags=-Wl,-O1 -Wl,--as-needed -Dprefix=/usr -Dsiteprefix=/usr -Dvendorprefix=/usr -Dscriptdir=/usr/bin -Dprivlib=/usr/lib/perl5/5.12.3 -Darchlib=/usr/lib/perl5/5.12.3/i686-linux -Dsitelib=/usr/lib/perl5/site_perl/5.12.3 -Dsitearch=/usr/lib/perl5/site_perl/5.12.3/i686-linux -Dvendorlib=/usr/lib/perl5/vendor_perl/5.12.3 -Dvendorarch=/usr/lib/perl5/vendor_perl/5.12.3/i686-linux -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/share/man/man1 -Dsiteman3dir=/usr/share/man/man3 -Dvendorman1dir=/usr/share/man/man1 -Dvendorman3dir=/usr/share/man/man3 -Dman1ext=1 -Dman3ext=3pm -Dlibperl=libperl.so.5.12.3 -Dlocincpth=  -Duselargefiles -Dd_semctl_semun -Dcf_by=Gentoo -Dmyhostname=localhost -Dperladmin=root@localhost -Dinstallusrbinperl=n -Ud_csh -Uusenm -Di_ndbm -Di_gdbm -Di_db -DDEBUGGING=none -Dinc_version_list=5.12.2/i686-linux 5.12.2 5.12.1/i686-linux 5.12.1 5.12.0/i686-linux 5.12.0 '
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='i686-pc-linux-gnu-gcc', ccflags ='-fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -march=pentium4 -funroll-loops -fprefetch-loop-arrays -pipe',
    cppflags='-fno-strict-aliasing -pipe'
    ccversion='', gccversion='4.4.5', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='i686-pc-linux-gnu-gcc', ldflags ='-Wl,-O1 -Wl,--as-needed -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.11.3.so, so=so, useshrplib=true, libperl=libperl.so.5.12.3
    gnulibc_version='2.11.3'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -march=pentium4 -funroll-loops -fprefetch-loop-arrays -pipe -L/usr/local/lib -Wl,-O1 -Wl,--as-needed'

Locally applied patches:
    0001-gentoo_MakeMaker-RUNPATH.diff
    0002-gentoo_config_over.diff
    0003-gentoo_cpan_definstalldirs.diff
    0004-gentoo_cpanplus_definstalldirs.diff
    0005-gentoo_create-libperl-soname.diff
    0006-gentoo_MakeMaker-delete_packlist.diff
    0007-fixes_8d66b3f9_h2hp_fix.diff
    0008-fixes_f178b03b_h2ph_using_deprecated_goto.diff
    0009-gentoo_mod-paths.diff
    0010-gentoo_enc2xs.diff
    0011-gentoo_IO-Compress_AutoLoader_dropped_from_Compress-Zlib.diff
    0012-gentoo_drop-fstack-protector.diff
    0013-fixes_539689e74a_unwarrantedly_laundering_tainted_data.diff


@INC for perl 5.12.3:
    /etc/perl
    /usr/lib/perl5/site_perl/5.12.3/i686-linux
    /usr/lib/perl5/site_perl/5.12.3
    /usr/lib/perl5/vendor_perl/5.12.3/i686-linux
    /usr/lib/perl5/vendor_perl/5.12.3
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.12.2/i686-linux
    /usr/lib/perl5/vendor_perl/5.12.2
    /usr/lib/perl5/vendor_perl
    /usr/lib/perl5/5.12.3/i686-linux
    /usr/lib/perl5/5.12.3
    /usr/local/lib/site_perl
    .


Environment for perl 5.12.3:
    HOME=/home/ph10
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/ph10/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/i686-pc-linux-gnu/gcc-bin/4.4.5:/sbin:/usr/sbin:.:/usr/X11R6/bin:/usr/local/exim/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Sep 11, 2011

From @cpansprout

On Wed Jun 15 11​:26​:50 2011, ph10@​hermes.cam.ac.uk wrote​:

It seems to me that, if what precedes (*THEN) in a branch matches
only a fixed string (no backtracking points), then the behaviour
should be exactly the same as if (*THEN) is not present. Here is
an example where that is not so​:

Pattern​: /^.*?(?(?=a)a|b(*THEN)c)/
Subject​: ba
Result​: no match

Pattern​: /^.*?(?(?=a)a|bc)/
Subject​: ba
Result​: matches "ba"

I noticed this because I have just fixed the same bug in PCRE.

Do the pipes in the (?(...)...) condition expression count as regular
alternations? It seems they don’t. Should they?

@p5pRT
Copy link
Author

p5pRT commented Sep 11, 2011

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Sep 12, 2011

From ph10@hermes.cam.ac.uk

On Sun, 11 Sep 2011, Father Chrysostomos via RT wrote​:

On Wed Jun 15 11​:26​:50 2011, ph10@​hermes.cam.ac.uk wrote​:

It seems to me that, if what precedes (*THEN) in a branch matches
only a fixed string (no backtracking points), then the behaviour
should be exactly the same as if (*THEN) is not present. Here is
an example where that is not so​:

Pattern​: /^.*?(?(?=a)a|b(*THEN)c)/
Subject​: ba
Result​: no match

Pattern​: /^.*?(?(?=a)a|bc)/
Subject​: ba
Result​: matches "ba"

I noticed this because I have just fixed the same bug in PCRE.

Do the pipes in the (?(...)...) condition expression count as regular
alternations? It seems they don’t. Should they?

The documentation for THEN says that it tries "the next alternation in
the innermost enclosing group". And it also says "Note that if this
operator is used and NOT inside of an alternation then it acts exactly
like the "(*PRUNE)" operator."

So yes, I guess it all depends on whether or not the branches of a
conditional subpattern count as regular alternations. My feeling is that
they should, so that all branches behave in the same way in regard to
(*THEN). That is, if (*THEN) is backtracked onto, it skips any previous
backtracking points in the current branch, and then fails the branch. If
there are other branches (in non-conditional groups), or other
backtracking points prior to the current group, they will be activated.

While thinking about this and experimenting, I've just discovered
another oddity of (*THEN).

Pattern​: /a+?(*THEN)c/
Subject​: aaac
Result​: Perl 5.012003 matches "aaac"

However, PCRE matches only "ac". The same thing happens with (*PRUNE).
PCRE is failing the entire match when backtracking onto (*THEN), and
moving on to the next position in the subject. Perl seems to be
backtracking to the a+? item. This seems not to be in accordance with
the documentation for (*PRUNE).

Regards,
Philip

--
Philip Hazel

@p5pRT
Copy link
Author

p5pRT commented Sep 18, 2011

From @cpansprout

On Mon Sep 12 06​:24​:25 2011, ph10@​hermes.cam.ac.uk wrote​:

On Sun, 11 Sep 2011, Father Chrysostomos via RT wrote​:

On Wed Jun 15 11​:26​:50 2011, ph10@​hermes.cam.ac.uk wrote​:

It seems to me that, if what precedes (*THEN) in a branch matches
only a fixed string (no backtracking points), then the behaviour
should be exactly the same as if (*THEN) is not present. Here is
an example where that is not so​:

Pattern​: /^.*?(?(?=a)a|b(*THEN)c)/
Subject​: ba
Result​: no match

Pattern​: /^.*?(?(?=a)a|bc)/
Subject​: ba
Result​: matches "ba"

I noticed this because I have just fixed the same bug in PCRE.

Do the pipes in the (?(...)...) condition expression count as regular
alternations? It seems they don’t. Should they?

The documentation for THEN says that it tries "the next alternation in
the innermost enclosing group". And it also says "Note that if this
operator is used and NOT inside of an alternation then it acts exactly
like the "(*PRUNE)" operator."

So yes, I guess it all depends on whether or not the branches of a
conditional subpattern count as regular alternations. My feeling is that
they should, so that all branches behave in the same way in regard to
(*THEN). That is, if (*THEN) is backtracked onto, it skips any previous
backtracking points in the current branch, and then fails the branch. If
there are other branches (in non-conditional groups), or other
backtracking points prior to the current group, they will be activated.

So you are saying that (?(condition)foo(*THEN)bar|baz) should jump out
of the conditional group (since the |baz part is not a backtracking
point, but is only reached when the condition is false), but that
(?(condition)foo(*THEN)bar) should fail the whole pattern (there being
no |bar)?

I think it ends up being too confusing. The | in a conditional has
nothing to do with regular alternation. That view is reinforced by the
fact that only one pipe is permitted​:

$ perl -le' /(?(1)foo|bar|baz)/'
Switch (?(condition)... contains too many branches in regex; marked by
<-- HERE in m/(?(1)foo|bar| <-- HERE baz)/ at -e line 1.

While thinking about this and experimenting, I've just discovered
another oddity of (*THEN).

Pattern​: /a+?(*THEN)c/
Subject​: aaac
Result​: Perl 5.012003 matches "aaac"

That’s strange. In 5.14 it doesn’t match. I don’t know which is worse.

However, PCRE matches only "ac". The same thing happens with (*PRUNE).
PCRE is failing the entire match when backtracking onto (*THEN), and
moving on to the next position in the subject. Perl seems to be
backtracking to the a+? item. This seems not to be in accordance with
the documentation for (*PRUNE).

Regards,
Philip

@p5pRT
Copy link
Author

p5pRT commented Sep 20, 2011

From ph10@hermes.cam.ac.uk

On Sun, 18 Sep 2011, Father Chrysostomos via RT wrote​:

So you are saying that (?(condition)foo(*THEN)bar|baz) should jump out
of the conditional group (since the |baz part is not a backtracking
point, but is only reached when the condition is false), but that
(?(condition)foo(*THEN)bar) should fail the whole pattern (there being
no |bar)?

No, I'm not.

I think it ends up being too confusing. The | in a conditional has
nothing to do with regular alternation. That view is reinforced by the
fact that only one pipe is permitted​:

That is certainly true, but to me, as a simple-minded person, it *looks*
like a regular alternation. The only difference is that the matching
engine just tries one of the alternatives rather than both. Consider

/^.*?(?(?=a)a(*THEN)b|c)/ Pattern
ac Subject
 
It starts off trying with zero matches of "a". The condition is true, so
it matches a, fails on b, and then backtracks to (*THEN). In a "normal"
group it would try the next alternative, but a conditional group behaves
as if there is only one alternative, so it should just backtrack as if
the group had failed, thereby trying again with one "a" matching .* and
so eventually succeeding.

I think the same should happen for this example​:

/^.*?(?(?=a)a(*THEN)b)c/
ac

Further investigation shows up another issue. If (*THEN) appears in a
regular (non-conditional) group that has no alternatives, its effect
again extends beyond the group.

/^.*?(a(*THEN)b)c/
aabc

Perl gives "no match"; PCRE currently matches. However, if we give it
a dummy alternative​:

/^.*?(a(*THEN)b|z)c/

then Perl (5.012003) does match. That seems very counter-intuitive to
me. Perhaps, however this does tie in with the way Perl handles
conditional groups, since they seem to have the same behaviour.

The text in perlre for *THEN says "when backtracked into on failure, it
causes the regex engine to try the next alternation in the innermost
enclosing group". It doesn't say what happens if there are no
alternations or indeed if *THEN occurs in the final alternation. A check
with

/^.*?(z|a(*THEN)b)c/

shows that Perl does match in this case too.

While thinking about this and experimenting, I've just discovered
another oddity of (*THEN).

Pattern​: /a+?(*THEN)c/
Subject​: aaac
Result​: Perl 5.012003 matches "aaac"

That’s strange. In 5.14 it doesn’t match. I don’t know which is worse.

I sometimes wonder whether these new backtracking verbs are going to
prove more trouble than they are worth.

Regards,
Philip

--
Philip Hazel

@p5pRT
Copy link
Author

p5pRT commented Sep 20, 2011

From @nwc10

On Mon, Sep 19, 2011 at 05​:49​:19PM +0100, Philip Hazel wrote​:

I sometimes wonder whether these new backtracking verbs are going to
prove more trouble than they are worth.

About the only possibly useful input I think I can have after reading
the thread of this bug, and the entire documentation​:

  =item C<(*THEN)> C<(*THEN​:NAME)>
 
  This is similar to the "cut group" operator C<​::> from Perl 6. Like
  C<(*PRUNE)>, this verb always matches, and when backtracked into on
  failure, it causes the regex engine to try the next alternation in the
  innermost enclosing group (capturing or otherwise).
 
  Its name comes from the observation that this operation combined with the
  alternation operator (C<|>) can be used to create what is essentially a
  pattern-based if/then/else block​:
 
  ( COND (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )
 
  Note that if this operator is used and NOT inside of an alternation then
  it acts exactly like the C<(*PRUNE)> operator.
 
  / A (*PRUNE) B /
 
  is the same as
 
  / A (*THEN) B /
 
  but
 
  / ( A (*THEN) B | C (*THEN) D ) /
 
  is not the same as
 
  / ( A (*PRUNE) B | C (*PRUNE) D ) /
 
  as after matching the A but failing on the B the C<(*THEN)> verb will
  backtrack and try C; but the C<(*PRUNE)> verb will simply fail.
 
  =back

is that I'm sadly thinking that you're right about the trouble/worth trade.

I don't actually understand any of this. Which isn't a good sign, as based
on previous experience I'm going to make the possibly arrogant assumption
that *I* am not the one at fault for the lack of understanding.

[Nothing uses (*THEN) in the core, other than the 14 lines of tests for it]

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Sep 20, 2011

From @ilmari

Nicholas Clark <nick@​ccl4.org> writes​:

[Nothing uses (*THEN) in the core, other than the 14 lines of tests for it]

Nor on CPAN, other than various re​::engine​:: tests.

<http​://grep.cpan.me/?q=%5C%28%5C*THEN%5C%29>

--
ilmari
"A disappointingly low fraction of the human race is,
at any given time, on fire." - Stig Sandbeck Mathisen

@p5pRT
Copy link
Author

p5pRT commented Sep 21, 2011

From ph10@hermes.cam.ac.uk

On Tue, 20 Sep 2011, Nicholas Clark wrote​:

About the only possibly useful input I think I can have after reading
the thread of this bug, and the entire documentation​:

<snip>

is that I'm sadly thinking that you're right about the trouble/worth trade.

I don't actually understand any of this. Which isn't a good sign, as based
on previous experience I'm going to make the possibly arrogant assumption
that *I* am not the one at fault for the lack of understanding.

[Nothing uses (*THEN) in the core, other than the 14 lines of tests for it]

I've been thinking about this some more. My naive understanding of *THEN
is basically this​: it is effectively just another way of doing what
(?>...) does, but with possibly simpler syntax and the added feature of
*THEN​:NAME. The emphasis on alternation is really a red herring.
Thus, if you have

  A (*THEN) B
 
(where A and B are complex patterns) the matching engine, having passed
(*THEN) and subsequently failed in B, no longer backtracks into A. This
would be the same​:

  (?>A)B

If you go along with this, it follows that, if (*THEN) is within a
group, for example,

  C (A (*THEN) B)
 
then a failure in B must backtrack into C, just as would happen with

  C ((?>A) B)
 
In other words, the effect of (*THEN) within a group does not propagate
back beyond the start of the group. IMHO this should also apply to
conditional groups which, after all, behave sort of like a group with
only one branch (just that there's a choice of which one each time).

Now, it seems that Perl thinks differently to me. There seems to be the
concept of "group with no alternation" and "group with alternation" as
two different things that are handled differently. (And a conditional
group is of the former type.) This was shown up by my example, where
adding what was in effect a dummy alternative to a group with only one
branch caused a change in behaviour. It is a valid concept, but I humbly
submit that it is very confusing and unexpected. The fact that (*THEN)
in these two examples behaves differently is, to me at least,
counter-intuitive​:

  A (B (*THEN) C)
  A (B (*THEN) C | D)

In the first, Perl fails the match without any backtracking if C fails;
in the second, it backtracks into A. Or that is what my experiments
imply.

Regards,
Philip

PS On the matter of value/worth, some of the other backtracking verbs
behave oddly too, though admittedly with silly examples​:

Pattern​: (*ACCEPT)a
Subject​: bax
Result​: Perl matches "", with "ax" as the remainder of the subject,
  in other words, the match point is after "b".

Pattern​: (*ACCEPT)
Subject​: bax
Result​: Perl matches "", with "bax" as the remainder of the subject,
  in other words, the match point is at the start.

This is Perl 5.012003.

--
Philip Hazel

@p5pRT
Copy link
Author

p5pRT commented Sep 21, 2011

From @ap

* Philip Hazel <ph10@​hermes.cam.ac.uk> [2011-09-21 11​:20]​:

I've been thinking about this some more. My naive understanding
of *THEN is basically this​: it is effectively just another way
of doing what (?>...) does, but with possibly simpler syntax
and the added feature of *THEN​:NAME. The emphasis on
alternation is really a red herring. Thus, if you have

A (*THEN) B

(where A and B are complex patterns) the matching engine,
having passed (*THEN) and subsequently failed in B, no longer
backtracks into A. This would be the same​:

(?>A)B

That is how I understand it as well.

In which case the emphasis on alternation is not entirely a red
herring, as the question it tries to address is that both (*THEN)
and (*PRUNE) could be considered the closing paren of an atomic
group which differ in where the opening paren is inferred to be
(start of pattern vs start of group) – which makes the most
practical difference inside alternations, where (*PRUNE) causes
a very different effect from (*THEN). In most other cases the
difference in matching is very much subtler.

It occurs to me that “THEN” is really the wrong thing to call
this predicate. It should be maybe “ASLONGAS”. That is

  A(*THEN)B

translates to something like “match A as long as B also matches
from that point on”.

Regards,
--
Aristotle Pagaltzis // <http​://plasmasturm.org/>

@p5pRT
Copy link
Author

p5pRT commented Sep 21, 2011

From @cpansprout

On Wed Sep 21 02​:17​:32 2011, ph10@​hermes.cam.ac.uk wrote​:

The fact that (*THEN)
in these two examples behaves differently is, to me at least,
counter-intuitive​:

A (B (*THEN) C)
A (B (*THEN) C | D)

In the first, Perl fails the match without any backtracking if C
fails;
in the second, it backtracks into A. Or that is what my experiments
imply.

Oddly, I don’t find that counterintuitive at all. Do we need three
versions of prune/then?

@p5pRT
Copy link
Author

p5pRT commented Sep 21, 2011

From ph10@hermes.cam.ac.uk

On Wed, 21 Sep 2011, Father Chrysostomos via RT wrote​:

A (B (*THEN) C)
A (B (*THEN) C | D)

In the first, Perl fails the match without any backtracking if C
fails;
in the second, it backtracks into A. Or that is what my experiments
imply.

Oddly, I don’t find that counterintuitive at all. Do we need three
versions of prune/then?

Aha! One person's intuition is always another's totally craziness. :-)
I wonder what the percentages each way would be if we surveyed the
general Perl-using population? At least one other person thinks as I
do, because it was a bug report for PCRE - which was behaving more like
Perl - that got me into this issue in the first place.

Is there a forum where we could ask the following question?

  Folks, consider the pattern ^A(B(*THEN)C), where A, B, and C are
  complex patterns. If matching fails in C, do you expect that
  (a) the entire match should fail, or
  (b) the matching should backtrack into A?

I will ask this question on the pcre-dev mailing list and see what
answers (if any) I get. I might try Jeffrey Friedl as well. I will not
be in the least offended if I am "outvoted".

It seems that you intuitively think that a group without a | is a
different kind of animal to a group that contains a |, whereas I don't.
I just think that a group without a | has "one alternative", maybe
better expressed as "one branch" (since "alternative" implies at least
one of two).

I can, however, understand your logic; I have to say that to me it seems
rather mathematically pedantic (with respect :-).

I'd rather not created yet another version of prune/then! I *thought* I
understood these verbs. It seemed to me that they provide different
"strengths" of pruning when backtracked onto, as follows (from weakest
to strongest)​:

*THEN fails the current alternation branch, and restarts at the next
alternation in the current group, or fails the whole group if there are
no more alternatives.

*PRUNE fails the current match, but allows an advance to the next
starting position (unless anchored).

*SKIP is like *PRUNE, but can skip forward more than one character.

*COMMIT fails the entire matching process, not allowing any further
advance in the subject.

That's fairly straightforward; the issue between us is what constitutes
"the current group". Having two verbs (one for me, one for you) is just
a recipe for even more confusion.

Regards,
Philip

--
Philip Hazel

@p5pRT
Copy link
Author

p5pRT commented Sep 21, 2011

From @cpansprout

On Wed Sep 21 12​:04​:45 2011, ph10@​hermes.cam.ac.uk wrote​:

On Wed, 21 Sep 2011, Father Chrysostomos via RT wrote​:

Oddly, I don’t find that counterintuitive at all. Do we need three
versions of prune/then?

Aha! One person's intuition is always another's totally craziness. :-)
I wonder what the percentages each way would be if we surveyed the
general Perl-using population? At least one other person thinks as I
do, because it was a bug report for PCRE - which was behaving more like
Perl - that got me into this issue in the first place.

Is there a forum where we could ask the following question?

Folks, consider the pattern ^A(B(*THEN)C), where A, B, and C are
complex patterns. If matching fails in C, do you expect that
(a) the entire match should fail, or
(b) the matching should backtrack into A?

I will ask this question on the pcre-dev mailing list and see what
answers (if any) I get. I might try Jeffrey Friedl as well. I will not
be in the least offended if I am "outvoted".

It seems that you intuitively think that a group without a | is a
different kind of animal to a group that contains a |, whereas I don't.
I just think that a group without a | has "one alternative", maybe
better expressed as "one branch" (since "alternative" implies at least
one of two).

I can, however, understand your logic; I have to say that to me it seems
rather mathematically pedantic (with respect :-).

I'd rather not created yet another version of prune/then! I *thought* I
understood these verbs. It seemed to me that they provide different
"strengths" of pruning when backtracked onto, as follows (from weakest
to strongest)​:

Come to think of it, we already have your version of (*THEN), as
(?>...). I know I’m just repeating what you said by that. But it has
just occurred to me that it belongs at the top of this list.

So your ^A(B(*THEN)C) translates into ^A((?>B)C).

My understanding of (*THEN) (and perl’s implementation), would be

  [?>^A(B]C)

where I’m using [?>...], as it doesn’t nest. So the latter meaning
seems more useful, as there is no other way to get it.

*THEN fails the current alternation branch, and restarts at the next
alternation in the current group, or fails the whole group if there are
no more alternatives.

*PRUNE fails the current match, but allows an advance to the next
starting position (unless anchored).

*SKIP is like *PRUNE, but can skip forward more than one character.

*COMMIT fails the entire matching process, not allowing any further
advance in the subject.

That's fairly straightforward; the issue between us is what constitutes
"the current group". Having two verbs (one for me, one for you) is just
a recipe for even more confusion.

Regards,
Philip

@p5pRT
Copy link
Author

p5pRT commented Sep 23, 2011

From ph10@hermes.cam.ac.uk

On Wed, 21 Sep 2011, Father Chrysostomos via RT wrote​:

Come to think of it, we already have your version of (*THEN), as
(?>...). I know I’m just repeating what you said by that. But it has
just occurred to me that it belongs at the top of this list.

So your ^A(B(*THEN)C) translates into ^A((?>B)C).

Sure ... but Perl has never shied away from having "more than one way to
do it" has it? :-)

You will be please to hear that Jeff Friedl, who has only just
discovered (*THEN) and friends - I guess he's been doing other stuff -
agrees with your and Perl's interpretation. He even pointed out that
adding a non-matching alternation would change the behaviour. I haven't
had any responses from the mailing list I asked. So on a sample of one
third party, it looks like my intuition is invalid.

At a practical level, I am not sure it is feasible to change PCRE to
behave like Perl, so I may have to settle for documenting the
difference. There are precedents (i.e. other differences).

Thanks for picking this up and having this discussion.

Regards,
Philip

--
Philip Hazel

@p5pRT
Copy link
Author

p5pRT commented Sep 24, 2011

From @cpansprout

On Fri Sep 23 02​:15​:05 2011, ph10@​hermes.cam.ac.uk wrote​:

On Wed, 21 Sep 2011, Father Chrysostomos via RT wrote​:

Come to think of it, we already have your version of (*THEN), as
(?>...). I know I’m just repeating what you said by that. But it has
just occurred to me that it belongs at the top of this list.

So your ^A(B(*THEN)C) translates into ^A((?>B)C).

Sure ... but Perl has never shied away from having "more than one way to
do it" has it? :-)

You will be please to hear that Jeff Friedl, who has only just
discovered (*THEN) and friends - I guess he's been doing other stuff -
agrees with your and Perl's interpretation. He even pointed out that
adding a non-matching alternation would change the behaviour. I haven't
had any responses from the mailing list I asked. So on a sample of one
third party, it looks like my intuition is invalid.

I wouldn’t say it’s invalid, as Perl’s documentation is a little
unclear. I hope I have clarified that with commit ac9d848.

At a practical level, I am not sure it is feasible to change PCRE to
behave like Perl, so I may have to settle for documenting the
difference. There are precedents (i.e. other differences).

Thanks for picking this up and having this discussion.

Regards,
Philip

@p5pRT
Copy link
Author

p5pRT commented Sep 24, 2011

From @cpansprout

On Sun Sep 18 13​:33​:28 2011, sprout wrote​:

On Mon Sep 12 06​:24​:25 2011, ph10@​hermes.cam.ac.uk wrote​:

another oddity of (*THEN).

Pattern​: /a+?(*THEN)c/
Subject​: aaac
Result​: Perl 5.012003 matches "aaac"

That’s strange. In 5.14 it doesn’t match. I don’t know which is worse.

The change occurred with this commit​:

commit d1c771f
Author​: Bram <p5p@​perl.wizbit.be>
Date​: Thu Aug 26 13​:27​:24 2010 +0200

  VERB nodes in the regex engine should NOT be marked as JUMPABLE.
 
  JUMPABLE nodes can be ignored during certain phases of regex execution,
  including ones where backtracking is affected. This change disables this
  behviour so that the VERBS can perform their desired results.
 
  Committer has taken the liberty of modifying the patch so that all
  VERBS are jumped, thus making the JUMPABLE expression a little simpler.
  I have left Bram's change to JUMPABLE intact, but inside of a comment
  for now.
 
  See discussion in thread for [perl #71942] *COMMIT bypasses optimisation
  for futher details.
 
  http​://rt.perl.org/rt3/Ticket/Display.html?id=71942
 
  There appears to be room for futher optimisation here
  by moving the JUMPABLE logic to regex-compile time. Currently
  it is arguable that the "optimisation" this patch seeks to avoid
  is actually not an optimisation at all, as it happens OVER AND OVER
  during execution of a match, thus the extra effort might actually
  outweight the benefit, especially on large strings.

Inline Patch
diff --git a/regexec.c b/regexec.c
index 35ef8d4..ec4c4b0 100644
--- a/regexec.c
+++ b/regexec.c
@@ -252,7 +252,8 @@
     OP(rn) == EVAL ||   \
     OP(rn) == SUSPEND || OP(rn) == IFMATCH || \
     OP(rn) == PLUS || OP(rn) == MINMOD || \
-    OP(rn) == KEEPS || (PL_regkind[OP(rn)] == VERB) || \
+    OP(rn) == KEEPS || \
+    /*(PL_regkind[OP(rn)] == VERB && OP(rn) != PRUNE && OP(rn) !=
COMMIT && OP(rn) != MARKPOINT && OP(rn) != SKIP && OP(rn) != CUTGROUP) || \*/\\   \(PL\_regkind\[OP\(rn\)\] == CURLY && ARG1\(rn\) > 0\) \\ \) \#define IS\_EXACT\(rn\) \(PL\_regkind\[OP\(rn\)\] == EXACT\) diff \-\-git a/t/re/pat\_advanced\.t b/t/re/pat\_advanced\.t \.\.\.snipped\.\.\.

@p5pRT
Copy link
Author

p5pRT commented Sep 26, 2011

From ph10@hermes.cam.ac.uk

On Fri, 23 Sep 2011, Father Chrysostomos via RT wrote​:

You will be please to hear that Jeff Friedl, who has only just
discovered (*THEN) and friends - I guess he's been doing other stuff -
agrees with your and Perl's interpretation. He even pointed out that
adding a non-matching alternation would change the behaviour. I haven't
had any responses from the mailing list I asked. So on a sample of one
third party, it looks like my intuition is invalid.

I wouldn’t say it’s invalid, as Perl’s documentation is a little
unclear. I hope I have clarified that with commit ac9d848.

Thanks.

No doubt that will eventually make it into a release and I'll see it.
For your interest, two people on the pcre-dev list have now responded to
the query, and they went for my interpretation. So at least I'm not
alone, but I would of course prefer PCRE to be Perl-compatible. I'm
still brooding about this.

Regards,
Philip

--
Philip Hazel

@p5pRT
Copy link
Author

p5pRT commented Sep 27, 2011

From @ikegami

On Sat, Sep 24, 2011 at 9​:41 AM, Father Chrysostomos via RT <
perlbug-followup@​perl.org> wrote​:

On Sun Sep 18 13​:33​:28 2011, sprout wrote​:

On Mon Sep 12 06​:24​:25 2011, ph10@​hermes.cam.ac.uk wrote​:

another oddity of (*THEN).

Pattern​: /a+?(*THEN)c/
Subject​: aaac
Result​: Perl 5.012003 matches "aaac"

That’s strange. In 5.14 it doesn’t match. I don’t know which is worse.

According to the 5.14.1 docs, it shouldn't match.

Note that if [the (*PRUNE)] operator is used and NOT inside of an
alternation then it acts exactly like the (*PRUNE) operator.

Consider the pattern A (*PRUNE) B, where A and B are complex patterns. Until
the (*PRUNE) verb is reached, A may backtrack as necessary to match. Once it
is reached, matching continues in B, which may also backtrack as necessary;
however, should B not match, then no further backtracking will take place,
and the pattern will fail outright at the current starting position.
The change occurred with this commit​:

The following should match according to the docs​:

/a+(*THEN)c/

/a+?(?=c)(*THEN)c/

@p5pRT
Copy link
Author

p5pRT commented Sep 27, 2011

From @ikegami

On Mon, Sep 26, 2011 at 11​:03 PM, Eric Brine <ikegami@​adaelis.com> wrote​:

According to the 5.14.1 docs, it shouldn't match.

Note that if [the (*PRUNE)] operator is used and NOT inside of an
alternation then it acts exactly like the (*PRUNE) operator.

Oops, that should be (*NEXT) at the beginning.

@p5pRT
Copy link
Author

p5pRT commented Sep 27, 2011

From @ikegami

On Mon, Sep 26, 2011 at 11​:04 PM, Eric Brine <ikegami@​adaelis.com> wrote​:

On Mon, Sep 26, 2011 at 11​:03 PM, Eric Brine <ikegami@​adaelis.com> wrote​:

According to the 5.14.1 docs, it shouldn't match.

Note that if [the (*PRUNE)] operator is used and NOT inside of an
alternation then it acts exactly like the (*PRUNE) operator.

Oops, that should be (*NEXT) at the beginning.

ARGH!

(*THEN)

@p5pRT
Copy link
Author

p5pRT commented Sep 27, 2011

From @cpansprout

On Mon Sep 26 20​:04​:14 2011, ikegami@​adaelis.com wrote​:

On Sat, Sep 24, 2011 at 9​:41 AM, Father Chrysostomos via RT <
perlbug-followup@​perl.org> wrote​:

On Sun Sep 18 13​:33​:28 2011, sprout wrote​:

On Mon Sep 12 06​:24​:25 2011, ph10@​hermes.cam.ac.uk wrote​:

another oddity of (*THEN).

Pattern​: /a+?(*THEN)c/
Subject​: aaac
Result​: Perl 5.012003 matches "aaac"

That’s strange. In 5.14 it doesn’t match. I don’t know which is worse.

According to the 5.14.1 docs, it shouldn't match.

Note that if [the (*PRUNE)] operator is used and NOT inside of an
alternation then it acts exactly like the (*PRUNE) operator.

Consider the pattern A (*PRUNE) B, where A and B are complex patterns.
Until
the (*PRUNE) verb is reached, A may backtrack as necessary to match.
Once it
is reached, matching continues in B, which may also backtrack as
necessary;
however, should B not match, then no further backtracking will take place,
and the pattern will fail outright at the current starting position.

*at the current starting position*

(*THEN) acts like (*COMMIT) in the example given above.

@p5pRT
Copy link
Author

p5pRT commented Sep 27, 2011

From @ikegami

On Tue, Sep 27, 2011 at 1​:22 AM, Father Chrysostomos via RT <
perlbug-followup@​perl.org> wrote​:

On Mon Sep 26 20​:04​:14 2011, ikegami@​adaelis.com wrote​:

On Sat, Sep 24, 2011 at 9​:41 AM, Father Chrysostomos via RT <
perlbug-followup@​perl.org> wrote​:

On Sun Sep 18 13​:33​:28 2011, sprout wrote​:

On Mon Sep 12 06​:24​:25 2011, ph10@​hermes.cam.ac.uk wrote​:

another oddity of (*THEN).

Pattern​: /a+?(*THEN)c/
Subject​: aaac
Result​: Perl 5.012003 matches "aaac"

That’s strange. In 5.14 it doesn’t match. I don’t know which is
worse.

According to the 5.14.1 docs, it shouldn't match.

Note that if [the (*THEN)] operator is used and NOT inside of an
alternation then it acts exactly like the (*PRUNE) operator.

Consider the pattern A (*PRUNE) B, where A and B are complex patterns.
Until
the (*PRUNE) verb is reached, A may backtrack as necessary to match.
Once it
is reached, matching continues in B, which may also backtrack as
necessary;
however, should B not match, then no further backtracking will take
place,
and the pattern will fail outright at the current starting position.

*at the current starting position*

How can you check at what position it failed? I don't even know what that
means (and this is in our documentation).

@p5pRT
Copy link
Author

p5pRT commented Sep 27, 2011

From @ikegami

On Tue, Sep 27, 2011 at 2​:44 PM, Eric Brine <ikegami@​adaelis.com> wrote​:

On Tue, Sep 27, 2011 at 1​:22 AM, Father Chrysostomos via RT <
perlbug-followup@​perl.org> wrote​:

On Mon Sep 26 20​:04​:14 2011, ikegami@​adaelis.com wrote​:

On Sat, Sep 24, 2011 at 9​:41 AM, Father Chrysostomos via RT <
perlbug-followup@​perl.org> wrote​:

On Sun Sep 18 13​:33​:28 2011, sprout wrote​:

On Mon Sep 12 06​:24​:25 2011, ph10@​hermes.cam.ac.uk wrote​:

another oddity of (*THEN).

Pattern​: /a+?(*THEN)c/
Subject​: aaac
Result​: Perl 5.012003 matches "aaac"

That’s strange. In 5.14 it doesn’t match. I don’t know which is
worse.

According to the 5.14.1 docs, it shouldn't match.

Note that if [the (*THEN)] operator is used and NOT inside of an

alternation then it acts exactly like the (*PRUNE) operator.

Consider the pattern A (*PRUNE) B, where A and B are complex patterns.
Until
the (*PRUNE) verb is reached, A may backtrack as necessary to match.
Once it
is reached, matching continues in B, which may also backtrack as
necessary;
however, should B not match, then no further backtracking will take
place,
and the pattern will fail outright at the current starting position.

*at the current starting position*

How can you check at what position it failed? I don't even know what that
means (and this is in our documentation).

Ok, I understand what the docs are *trying* to convery.

The docs aren't just unclear, they are self-contradictory. In two places,
even!

  - Failing outright is exactly the opposite the failing for only one start
  position.
  - Allowing no further backtracking to take place is the opposite of
  allowing backtracking to try at a different start position.

@p5pRT
Copy link
Author

p5pRT commented Sep 28, 2011

From ph10@hermes.cam.ac.uk

On Mon, 26 Sep 2011, Father Chrysostomos via RT wrote​:

*at the current starting position*

(*THEN) acts like (*COMMIT) in the example given above.

Aha.

Since we are quoting "man perlre", here's another bit of confusion​:

  "(?PARNO)" "(?-PARNO)" "(?+PARNO)" "(?R)" "(?0)"
  Similar to "(??{ code })" except it does not involve
  compiling any code, instead it treats the contents of a
  capture buffer as an independent pattern that must match at
  the current position.

Consider this pattern (ignore white space)​:

  ^.*? (?1) c (?(DEFINE) (a(*THEN)b) )
 
When I read "an independent pattern", it suggests to me that the effect
of (*THEN) is confined to that independent pattern. However, experiment
with Perl shows that I am wrong here as well. The pattern as it stands
does not match "aabc", but if |(*F) is added after b, it does.

As it happens, PCRE processes recursions/subroutines differently to
Perl, or perhaps I should say that the other way round, since PCRE had
them first. :-) So another difference probably doesn't matter much.

However, I'm still thinking about this whole issue with regard to PCRE
implementation. It tries to resolve where (*THEN) should back up to at
compile time, but looking at the example above, it seems to me that it
can only be done dynamically, at run time.

Regards,
Philip

--
Philip Hazel

@p5pRT
Copy link
Author

p5pRT commented Jun 22, 2013

From @demerphq

Sorry about the incredibly laggy reply. :-(

FC recently brought to my attention these threads, which I had managed
to overlook or forget or whatever.

I am doing my best to work my way through them and will reply as I go.

On 20 September 2011 18​:32, Philip Hazel <ph10@​hermes.cam.ac.uk> wrote​:

On Tue, 20 Sep 2011, Nicholas Clark wrote​:
I've been thinking about this some more. My naive understanding of *THEN
is basically this​: it is effectively just another way of doing what
(?>...) does, but with possibly simpler syntax and the added feature of
*THEN​:NAME. The emphasis on alternation is really a red herring.
Thus, if you have

A (*THEN) B

(where A and B are complex patterns) the matching engine, having passed
(*THEN) and subsequently failed in B, no longer backtracks into A. This
would be the same​:

(?>A)B

If you go along with this, it follows that, if (*THEN) is within a
group, for example,

C (A (*THEN) B)

then a failure in B must backtrack into C, just as would happen with

C ((?>A) B)

In other words, the effect of (*THEN) within a group does not propagate
back beyond the start of the group. IMHO this should also apply to
conditional groups which, after all, behave sort of like a group with
only one branch (just that there's a choice of which one each time).

Now, it seems that Perl thinks differently to me. There seems to be the
concept of "group with no alternation" and "group with alternation" as
two different things that are handled differently. (And a conditional
group is of the former type.)

I think you are right that perl thinks differently to you. Probably
because of implementation details.

Perl's regex engine doesn't have a concept of a "group". It has a
concept of an "alternation" (BRANCH), and the special behavior you
associate with "group" parens perl actually associates with the BRANCH
operator "|".

There is no group parens in the pattern /foo|bar|baz/ however there is
an alternation.

The only part the group parens play is to tell the parser where the
alternation starts and ends.

You could say that (?​: | ) are some kind of really complicated ternary
operator. IE, not different constructs, but optional part of a single
operator.

Ditto for the association of groups to quantifiers. In reality the
group parens are part of the quantifier and have no behavior of their
own.

So, when we have this​:

(?​:foo|bar)*

the (?​: ) are part of the | operator in that they denote the beginning
and end of the alternation, and they are part of the * quantifier,
because they tell it what pattern that quantifier applies to.

Another aspect of this is that to perl /a(?​:foo)b/ is the same as
/afoob/. Since there is neither a quantifier on the (?​:) nor a
modifier in the ?​: nor an alternation inside, it is a complete no-op.
In other words it is not the case that (?​:foo) is a group with only
one alternation. It is simply "foo".

So thinking of a (?​: ... ) as a "thing" is wrong from the Perl
perspective. It only becomes a thing when it has a quantifier
attached, or an alternation inside. Even when there is a modifier in
the ?​: it still isnt a thing, it modifies the rules for interpreting
the ....

IOW, at an implementation level it is not that case that writing
/foo|bar|baz/ is a short hand way of writing /(?​:foo|bar|baz)/.

Consider this output​:

$ perl -Mre=debug -e'/[ac]|[bc]/'
Compiling REx "[ac]|[bc]"
Final program​:
  1​: BRANCH (13)
  2​: ANYOF[ac][] (25)
  13​: BRANCH (FAIL)
  14​: ANYOF[bc][] (25)
  25​: END (0)
minlen 1
Freeing REx​: "[ac]|[bc]"

$ perl -Mre=debug -e'/(?​:[ac]|[bc])/'
Compiling REx "(?​:[ac]|[bc])"
Final program​:
  1​: BRANCH (13)
  2​: ANYOF[ac][] (26)
  13​: BRANCH (FAIL)
  14​: ANYOF[bc][] (26)
  25​: TAIL (26)
  26​: END (0)
minlen 1
Freeing REx​: "(?​:[ac]|[bc])"

The program generated is the same. There is no operator for GROUP,
instead there is a chain of BRANCH operators ending in a TAIL.

(*THEN) affects how we transition from BRANCH to BRANCH, and if we
arent in a BRANCH acts like a (*PRUNE).

This gets hard to debug in Perl if you dont use charclasses with at
least two chars in them, as Perl will optimize to a TRIE.

Anyway, I can see how it would be reasonable to consider (?​:foo) to be
an alternation with only one branch, but to Perl it isnt, and that is
how you should interpret the verb operators.

The intent of the (*THEN) operator is for something like this​:

/ 1111 ( (?​:101)+ (*THEN) 000 | (?​: 110 | 011 )+ (*THEN) 000 ) 0101 /x

If we match say 1000 '101's, and we cannot match '000' no amount of
backtracking will allow the '000' to match. Perl is not good at
detecting this, and therefore we allow the user to use the (*THEN)
verb to tell Perl that the thing on the left side of the (*THEN) cant
match this branch through backtracking, and we should go to the next
BRANCH to see what it would do. If there is no branch then it is the
same thing as saying (*PRUNE), in that it effectively says that there
is no match at this starting position.

Here is an example of how this can help. wc -l is useful but crude
count of how much work the regex engine does for a given match. As you
can see, using the hints makes for a faster match.

$ for i in 1 2 4 8 16 32 64 128 256; do perl -Mre=Debug,EXECUTE
-le'("xx".("abcd" x $ARGV[0]). "gyy")=/^x+ (?​: [abcd]+ [pq] |
(?​:ab|cd)+ [xy] | (?​:abcd)+ [gh] ) y+ /x;' $i 2>&1 | wc -l; done;
61
97
169
313
601
1177
2329
4633
9241
$ for i in 1 2 4 8 16 32 64 128 256; do perl -Mre=Debug,EXECUTE
-le'("xx".("abcd" x $ARGV[0]). "gyy")=
/^x+ (?​: [abcd]+ (*THEN) [pq] |
(?​:ab|cd)+ (*THEN) [xy] | (?​:abcd)+ (*THEN) [gh] ) y+ /x;' $i 2>&1 |
wc -l; done;
55
77
121
209
385
737
1441
2849
5665

The behavior of /A+? (*THEN) BC / appeared to be broken in some cases.
For instance for the literals A B and C, we /should/ match "ABC",
however we trigger the PREGf_SKIP optimization, which basically turns
/A+ B/ into /A+ (*SKIP) B/, which is not correct for /A+? (*THEN) BC/.
You can see this by adding (?{}) at the end of the pattern​:

$ ./perl -le'"aaabc"=/a+?(*THEN)bc/ and print "match​: $&"'
$ ./perl -le'"aaabc"=
/a+?(*THEN)bc(?{})/ and print "match​: $&"'
match​: abc

The (?{}) disables this optimization, and make it do the expected thing.

Your original bug report was about this​:

./perl -Ilib -Mre=Debug,All,FLAGS -le'"ba"=~/^.*?(?(?=a)a|b(*THEN)c)/
and print "match​: $&"'

which does not match, probably because of an optimisation in .*? handling.

An unanchored match /P/ should match the same thing as /^.*?P/, and
we see that Perl does match for that form​:

./perl -le'"ba"=~/((?(?=a)a|b(*THEN)c))/ and print "match​: $&\ncond​: $1"'
match​: a
cond​: a

And if we removed the non-greedy quantifier modifier​:

./perl -le'"ba"=~/^.*((?(?=a)a|b(*THEN)c))/ and print "match​: $&\ncond​: $1"'
match​: ba
cond​: a

Both of which match as expected, however this still doesnt match​:

./perl -le'"ba"=~/^.*?((?(?=a)a|b(*THEN)c))/ and print "match​: $&\ncond​: $1"'

which I continue to investigate.

I have pushed b8f6efd to fix the case
of /a+?(*THEN)b/ but I have more to do.

I also pushed 337ff30 which shows the intflags.

You need to use -Mre=Debug,All,FLAGS to see the flags. I might change
that a bit later.

Yves

@p5pRT
Copy link
Author

p5pRT commented Jun 23, 2013

From ph10@hermes.cam.ac.uk

On Sat, 22 Jun 2013, demerphq wrote​:

Sorry about the incredibly laggy reply. :-(

Hi Yves,

No problem; glad somebody has picked this up. For myself, I can no
longer remember exactly what I was complaining about!

I think you are right that perl thinks differently to you. Probably
because of implementation details.

Thanks for taking the time to give a long explanation of the way Perl
works, which indeed is different to the way PCRE works - and of course
it's far too late to change that now.

In a recent release of PCRE we have formulated some reasonably
consistent rules for way that backtracking verbs are handled, and
documented them. In many cases the behaviour is the same as Perl's, but
there are some differences.

Anyway, I can see how it would be reasonable to consider (?​:foo) to be
an alternation with only one branch, but to Perl it isnt, and that is
how you should interpret the verb operators.

PCRE does now get this (perhaps not 100%, but to some extent) "right" in
its current treatment of (*THEN). The documentation even says this​: "A
subpattern that does not contain a | character is just a part of the
enclosing alternative; it is not a nested alternation with only one
alternative."

Regards,
Philip

--
Philip Hazel

@p5pRT
Copy link
Author

p5pRT commented Jun 23, 2013

From @demerphq

On 23 June 2013 13​:18, <ph10@​hermes.cam.ac.uk> wrote​:

On Sat, 22 Jun 2013, demerphq wrote​:

Sorry about the incredibly laggy reply. :-(

Hi Yves,

No problem; glad somebody has picked this up. For myself, I can no
longer remember exactly what I was complaining about!

FC pointed out a list of VERB issues and I am trying to work through
them and decide if they are buggy, or ambiguous or whatnot.

Definitely some of things that have been reported have been bugs. A
few have been interpretation issues.

I think you are right that perl thinks differently to you. Probably
because of implementation details.

Thanks for taking the time to give a long explanation of the way Perl
works, which indeed is different to the way PCRE works - and of course
it's far too late to change that now.

Indeed. Both projects have a very long history and some things are
difficult to change.

In a recent release of PCRE we have formulated some reasonably
consistent rules for way that backtracking verbs are handled, and
documented them. In many cases the behaviour is the same as Perl's, but
there are some differences.

Id like to hear about the differences. This thread makes me wonder how
many are due to bugs in how the various optimizations interact and how
many are due to other issues.

Anyway, I can see how it would be reasonable to consider (?​:foo) to be
an alternation with only one branch, but to Perl it isnt, and that is
how you should interpret the verb operators.

PCRE does now get this (perhaps not 100%, but to some extent) "right" in
its current treatment of (*THEN). The documentation even says this​: "A
subpattern that does not contain a | character is just a part of the
enclosing alternative; it is not a nested alternation with only one
alternative."

Cool. I will try to find time to read the PCRE docs and look for
discrepancies and investigate.

Cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Jun 23, 2013

From ph10@hermes.cam.ac.uk

On Sun, 23 Jun 2013, demerphq wrote​:

Cool. I will try to find time to read the PCRE docs and look for
discrepancies and investigate.

This is perhaps the relevant section of the "pcrepattern" man page​:


  More than one backtracking verb

  If more than one backtracking verb is present in a pattern, the one
  that is backtracked onto first acts. For example, consider this pat-
  tern, where A, B, etc. are complex pattern fragments​:

  (A(*COMMIT)B(*THEN)C|ABD)

  If A matches but B fails, the backtrack to (*COMMIT) causes the entire
  match to fail. However, if A and B match, but C fails, the backtrack to
  (*THEN) causes the next alternative (ABD) to be tried. This behaviour
  is consistent, but is not always the same as Perl's. It means that if
  two or more backtracking verbs appear in succession, all the the last
  of them has no effect. Consider this example​:

  ...(*COMMIT)(*PRUNE)...

  If there is a matching failure to the right, backtracking onto (*PRUNE)
  causes it to be triggered, and its action is taken. There can never be
  a backtrack onto (*COMMIT).

  Backtracking verbs in repeated groups

  PCRE differs from Perl in its handling of backtracking verbs in
  repeated groups. For example, consider​:

  /(a(*COMMIT)b)+ac/

  If the subject is "abac", Perl matches, but PCRE fails because the
  (*COMMIT) in the second repeat of the group acts.

  Backtracking verbs in assertions

  (*FAIL) in an assertion has its normal effect​: it forces an immediate
  backtrack.

  (*ACCEPT) in a positive assertion causes the assertion to succeed with-
  out any further processing. In a negative assertion, (*ACCEPT) causes
  the assertion to fail without any further processing.

  The other backtracking verbs are not treated specially if they appear
  in a positive assertion. In particular, (*THEN) skips to the next
  alternative in the innermost enclosing group that has alternations,
  whether or not this is within the assertion.

  Negative assertions are, however, different, in order to ensure that
  changing a positive assertion into a negative assertion changes its
  result. Backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a neg-
  ative assertion to be true, without considering any further alternative
  branches in the assertion. Backtracking into (*THEN) causes it to skip
  to the next enclosing alternative within the assertion (the normal be-
  haviour), but if the assertion does not have such an alternative,
  (*THEN) behaves like (*PRUNE).

  Backtracking verbs in subroutines

  These behaviours occur whether or not the subpattern is called recur-
  sively. Perl's treatment of subroutines is different in some cases.

  (*FAIL) in a subpattern called as a subroutine has its normal effect​:
  it forces an immediate backtrack.

  (*ACCEPT) in a subpattern called as a subroutine causes the subroutine
  match to succeed without any further processing. Matching then contin-
  ues after the subroutine call.

  (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine
  cause the subroutine match to fail.

  (*THEN) skips to the next alternative in the innermost enclosing group
  within the subpattern that has alternatives. If there is no such group
  within the subpattern, (*THEN) causes the subroutine match to fail.


Regards,
Philip

--
Philip Hazel

@khwilliamson
Copy link
Contributor

@demerphq ping

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants