Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(*COMMIT) behaviour when inside a repeated group #17074

Open
p5pRT opened this issue Jul 2, 2019 · 6 comments
Open

(*COMMIT) behaviour when inside a repeated group #17074

p5pRT opened this issue Jul 2, 2019 · 6 comments

Comments

@p5pRT
Copy link

p5pRT commented Jul 2, 2019

Migrated from rt.perl.org#134254 (status was 'open')

Searchable as RT134254$

@p5pRT
Copy link
Author

p5pRT commented Jul 2, 2019

From ph10@hermes.cam.ac.uk

Created by ph10@cam.ac.uk

Please see the following example​:

$ perl -e 'if (abcd =~ /\A(?​:.(*COMMIT))*c/) { print "yes >$&<\n"; } else { print "no \n"; }'
yes >abc<

Why does that pattern match? The repeated group should swallow all of "abcd",
then fail to match "c" and so backtrack onto (*COMMIT), which should fail the
whole match, shouldn't it?

Philip

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.30.0:

Configured by builduser at Tue Jun 18 19:57:11 CEST 2019.

Summary of my perl5 (revision 5 version 30 subversion 0) configuration:
   
  Platform:
    osname=linux
    osvers=5.1.8-arch1-1-arch
    archname=x86_64-linux-thread-multi
    uname='linux flo-64 5.1.8-arch1-1-arch #1 smp preempt sun jun 9 20:28:28 utc 2019 x86_64 gnulinux '
    config_args='-des -Dusethreads -Duseshrplib -Doptimize=-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt -Dprefix=/usr -Dvendorprefix=/usr -Dprivlib=/usr/share/perl5/core_perl -Darchlib=/usr/lib/perl5/5.30/core_perl -Dsitelib=/usr/share/perl5/site_perl -Dsitearch=/usr/lib/perl5/5.30/site_perl -Dvendorlib=/usr/share/perl5/vendor_perl -Dvendorarch=/usr/lib/perl5/5.30/vendor_perl -Dscriptdir=/usr/bin/core_perl -Dsitescript=/usr/bin/site_perl -Dvendorscript=/usr/bin/vendor_perl -Dinc_version_list=none -Dman1ext=1perl -Dman3ext=3perl -Dcccdlflags='-fPIC' -Dlddlflags=-shared -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -Dldflags=-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now'
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=define
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
    bincompat5005=undef
  Compiler:
    cc='cc'
    ccflags ='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2'
    optimize='-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt'
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='9.1.0'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='cc'
    ldflags ='-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/include-fixed /usr/lib /lib/../lib /usr/lib/../lib /lib /lib64 /usr/lib64
    libs=-lpthread -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lpthread -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.29.so
    so=so
    useshrplib=true
    libperl=libperl.so
    gnulibc_version='2.29'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.30/core_perl/CORE'
    cccdlflags='-fPIC'
    lddlflags='-shared -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -L/usr/local/lib -fstack-protector-strong'



@INC for perl 5.30.0:
    /usr/lib/perl5/5.30/site_perl
    /usr/share/perl5/site_perl
    /usr/lib/perl5/5.30/vendor_perl
    /usr/share/perl5/vendor_perl
    /usr/lib/perl5/5.30/core_perl
    /usr/share/perl5/core_perl


Environment for perl 5.30.0:
    HOME=/home/ph10
    LANG=en_GB.utf8
    LANGUAGE=en_GB.utf8
    LC_ALL=C
    LC_COLLATE=C
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/ph10/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/android-sdk/platform-tools:/opt/android-sdk/tools:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/usr/sbin:.:/opt/android-sdk/platform-tools:/opt/android-sdk/tools:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Jul 2, 2019

From @demerphq

On Tue, 2 Jul 2019 at 11​:17, Philip Hazel (via RT)
<perlbug-followup@​perl.org> wrote​:

# New Ticket Created by Philip Hazel
# Please include the string​: [perl #134254]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=134254 >

To​: perlbug@​perl.org
Reply-To​: ph10@​cam.ac.uk
Subject​: (*COMMIT) behaviour when inside a repeated group
From​: ph10@​cam.ac.uk
Cc​: builduser
Message-Id​: <5.30.0_5771_1562058560@​quercite>

This is a bug report for perl from ph10@​cam.ac.uk,
generated with the help of perlbug 1.41 running under perl 5.30.0.

-----------------------------------------------------------------
[Please describe your issue here]

Please see the following example​:

$ perl -e 'if (abcd =~ /\A(?​:.(*COMMIT))*c/) { print "yes >$&<\n"; } else { print "no \n"; }'
yes >abc<

Why does that pattern match? The repeated group should swallow all of "abcd",
then fail to match "c" and so backtrack onto (*COMMIT), which should fail the
whole match, shouldn't it?

Well I have to admit its a bit murky, but I think the match should
succeed. (*COMMIT) relates to how perl moves the start point on
failure. This is an anchored match, so it effectively has a (*COMMIT)
at the very start anyway, and the match does not fail at the given
start point, it succeeds. I think you are confusing atomic matches and
(*COMMIT). Change that to /\A(?​:.(*COMMIT))*+c/ and it behaves as you
say.

From the docs​:
  It's a zero-width pattern similar to "(*SKIP)", except that when
  backtracked into on failure it causes the match to fail outright.
  No further attempts to find a valid match by advancing the start
  pointer will occur again.

In this case technically we never backtracked into the COMMIT, and we
certainly did not attempt to advance the start pointer. So I don't
think the commit should have fired. But I admit the other
interpretation is not unreasonable. We have never really specified how
quantifiers and verbs should interact, and I could go along with an
interpretation that says we are doing it wrong. Might need Larry to
rule on it tho, this was copied from Perl6.

Yves

@p5pRT
Copy link
Author

p5pRT commented Jul 2, 2019

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Jul 2, 2019

From ph10@hermes.cam.ac.uk

On Tue, 2 Jul 2019, yves orton via RT wrote​:

Well I have to admit its a bit murky, but I think the match should
succeed. (*COMMIT) relates to how perl moves the start point on
failure. This is an anchored match, so it effectively has a (*COMMIT)
at the very start anyway, and the match does not fail at the given
start point, it succeeds.

Hmm. I can now see that there is an alternative interpretation of
(*COMMIT), which is "If (*COMMIT) is passed during a match attempt,
carry on as normal, but if the overall match fails, do not advance the
start pointer." However, under that interpretation, this match should
succeed​:

$ perl -e 'if (abcd =~ /.*(*COMMIT)c/) { print "yes >$&<\n"; } else { print "no \n"; }'
no

Incidentally, the \A at the start of my previous example is not
relevant. Leaving it out makes no difference. So the question is why are
these two different?

/.*(*COMMIT)c/ does not match "abcd"
/(?​:.(*COMMIT))*c/ does match "abcd" (matches "abc")

Under your interpretation, shouldn't the first one match? After all, it
can find a match without advancing the starting point.

From the docs​:
It's a zero-width pattern similar to "(*SKIP)", except that when
backtracked into on failure it causes the match to fail outright.
No further attempts to find a valid match by advancing the start
pointer will occur again.

In this case technically we never backtracked into the COMMIT,

As an outsider looking at Perl, I don't understand that. I thought that
passing (*COMMIT) always established a backtrack point, so (unless it
was inside an atomic group) a subsequent failure must then backtrack
onto it. How has this not happened?

We have never really specified how quantifiers and verbs should
interact, and I could go along with an interpretation that says we are
doing it wrong. Might need Larry to rule on it tho, this was copied
from Perl6.

I await the ruling!

Regards,
Philip

--
Philip Hazel

@p5pRT
Copy link
Author

p5pRT commented Jul 2, 2019

From @demerphq

On Tue, 2 Jul 2019 at 16​:44, <ph10@​hermes.cam.ac.uk> wrote​:

On Tue, 2 Jul 2019, yves orton via RT wrote​:

Well I have to admit its a bit murky, but I think the match should
succeed. (*COMMIT) relates to how perl moves the start point on
failure. This is an anchored match, so it effectively has a (*COMMIT)
at the very start anyway, and the match does not fail at the given
start point, it succeeds.

Hmm. I can now see that there is an alternative interpretation of
(*COMMIT), which is "If (*COMMIT) is passed during a match attempt,
carry on as normal, but if the overall match fails, do not advance the
start pointer." However, under that interpretation, this match should
succeed​:

$ perl -e 'if (abcd =~ /.*(*COMMIT)c/) { print "yes >$&<\n"; } else { print "no \n"; }'
no

No, because this one definitely DOES backtrack into into the COMMIT
op. Right after it fails to match 'c'.

Incidentally, the \A at the start of my previous example is not
relevant. Leaving it out makes no difference. So the question is why are
these two different?

/.*(*COMMIT)c/ does not match "abcd"
/(?​:.(*COMMIT))*c/ does match "abcd" (matches "abc")

Because the * is on a group, the group is rolled back as a group, so
we never backtrack into the COMMIT in a technical sense, instead we
unroll the attempt to match the group as a whole. I am talking purely
implementation here. Try add -Mre=debug and you will see what I mean
in the two cases.

Under your interpretation, shouldn't the first one match? After all, it
can find a match without advancing the starting point.

Again no, because it backtracks into the (*COMMIT). I am not saying
this is *right* by the way, just that it is what is happening.

From the docs​:
It's a zero-width pattern similar to "(*SKIP)", except that when
backtracked into on failure it causes the match to fail outright.
No further attempts to find a valid match by advancing the start
pointer will occur again.

In this case technically we never backtracked into the COMMIT,

As an outsider looking at Perl, I don't understand that. I thought that
passing (*COMMIT) always established a backtrack point, so (unless it
was inside an atomic group) a subsequent failure must then backtrack
onto it. How has this not happened?

Because (?​:...)* is compiled into a construct called CURLYM which does
the work of unrolling the group, and we never actually enter the
COMMIT regop via backtracking. You could argue it is an optimisation
gone wrong. You could also argue that a commit in a quantifier doesnt
make sense.

The problem I think is that there is a clash between the purist DFA
interpretation of a star quantifier, and the backtracking that we
actually use.

Lets just take a simple case​:

"aaab"=~/a*x/

The rules dont really specify if we ever actually "fail to match" with
a star quantifier.

You can say that "a*" matches the first three a's and then fails on
the b and backtracks by one then allowing the x to match. And if you
did you likely would match what actually happens in the regex engine.
Roughly speaking this would be backtracking interpretation.

BUT

There is another equally valid interpretation that a* matches ONLY the
first three a's and never even tries to match 'a' to 'x'. This would
not match how the engine actually works in perl, but is perfectly
consistent with a purist DFA for the regex.

IOW, we would have a state machine like this​:

/ Input
state | a | x | anything-else
1 | 1 | 2 | 3
2 | ACCEPT
3 | FAIL

This machine would never backtrack on "aaab"=~/a*x/;

So at a certain level it doesn't surprise me that we show both
behaviors, they both are reasonable in themselves. Its just they
aren't consistent with each other....

Yves

@p5pRT
Copy link
Author

p5pRT commented Jul 2, 2019

From ph10@hermes.cam.ac.uk

On Tue, 2 Jul 2019, yves orton via RT wrote​:

$ perl -e 'if (abcd =~ /.*(*COMMIT)c/) { print "yes >$&<\n"; } else { print "no \n"; }'
no

No, because this one definitely DOES backtrack into into the COMMIT
op. Right after it fails to match 'c'.

Yves,

Thanks for taking the time to explain all this to me. So if
/.*(*COMMIT)c/ backtracks onto the (*COMMIT), presumably
/(.*(*COMMIT))c/ would also backtrack onto (*COMMIT) (it does). Seems
surprising that /(.(*COMMIT))*c/ doesn't, but hey, what do I know?
(This all actually arose from a PCRE user's question, incidentally. I'm
not devious enough to think of putting (*COMMIT) inside a quantifier. :-)

Because the * is on a group, the group is rolled back as a group, so
we never backtrack into the COMMIT in a technical sense, instead we
unroll the attempt to match the group as a whole.

But surely there can be backtracks into other things in a group? How
about something like "aaaab" ~ /(a+)+a/ which, as I understand it, will
swallow all the a's in the first match of the group, fail to match a
second attempt at the group so go on to try the final "a", then fail
again, and so must backtrack into the group in order to match. But of
course this is all implementation dependent, and subject to
optimizations as well.

I am talking purely implementation here. Try add -Mre=debug and you
will see what I mean in the two cases.

I've tried to avoid learning anything about Perl internals :-) but
thanks for that. I may have to...

Again no, because it backtracks into the (*COMMIT). I am not saying
this is *right* by the way, just that it is what is happening.

Understood.

Because (?​:...)* is compiled into a construct called CURLYM which does
the work of unrolling the group, and we never actually enter the
COMMIT regop via backtracking. You could argue it is an optimisation
gone wrong. You could also argue that a commit in a quantifier doesnt
make sense.

Indeed! These are are very pathological examples.

So at a certain level it doesn't surprise me that we show both
behaviors, they both are reasonable in themselves. Its just they
aren't consistent with each other....

Indeed again.

Regards,
Philip

--
Philip Hazel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants