Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*COMMIT bug? #16625

Open
p5pRT opened this issue Jul 12, 2018 · 10 comments
Open

*COMMIT bug? #16625

p5pRT opened this issue Jul 12, 2018 · 10 comments

Comments

@p5pRT
Copy link

p5pRT commented Jul 12, 2018

Migrated from rt.perl.org#133366 (status was 'open')

Searchable as RT133366$

@p5pRT
Copy link
Author

p5pRT commented Jul 12, 2018

From ph10@hermes.cam.ac.uk

Created by ph10@cam.ac.uk

Consider this example​:

perl -e 'if ("ABCABD" =~ /(A(*COMMIT)|B)(A|B)D/) { print "yes $&\n"; } else { print "no\n"; }

On my Linux box the output is "yes ABD". But surely it should not match? After
matching the initial "A", shouldn't the (*COMMIT) stop any "bumaplong"
happening? If the (*COMMIT) is moved to before the second "A" in the pattern,
it does indeed print "no".

Regards,
Philip

--
Philip Hazel

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.26.2:

Configured by builduser at Wed Apr 18 22:20:25 CEST 2018.

Summary of my perl5 (revision 5 version 26 subversion 2) configuration:
   
  Platform:
    osname=linux
    osvers=4.15.15-1-arch
    archname=x86_64-linux-thread-multi
    uname='linux flo-64 4.15.15-1-arch #1 smp preempt sat mar 31 23:59:25 utc 2018 x86_64 gnulinux '
    config_args='-des -Dusethreads -Duseshrplib -Doptimize=-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt -Dprefix=/usr -Dvendorprefix=/usr -Dprivlib=/usr/share/perl5/core_perl -Darchlib=/usr/lib/perl5/5.26/core_perl -Dsitelib=/usr/share/perl5/site_perl -Dsitearch=/usr/lib/perl5/5.26/site_perl -Dvendorlib=/usr/share/perl5/vendor_perl -Dvendorarch=/usr/lib/perl5/5.26/vendor_perl -Dscriptdir=/usr/bin/core_perl -Dsitescript=/usr/bin/site_perl -Dvendorscript=/usr/bin/vendor_perl -Dinc_version_list=none -Dman1ext=1perl -Dman3ext=3perl -Dcccdlflags='-fPIC' -Dlddlflags=-shared -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -Dldflags=-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now'
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=define
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
    bincompat5005=undef
  Compiler:
    cc='cc'
    ccflags ='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2'
    optimize='-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt'
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='7.3.1 20180312'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='cc'
    ldflags ='-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/x86_64-pc-linux-gnu/7.3.1/include-fixed /usr/lib /lib/../lib /usr/lib/../lib /lib /lib64 /usr/lib64
    libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.26.so
    so=so
    useshrplib=true
    libperl=libperl.so
    gnulibc_version='2.26'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.26/core_perl/CORE'
    cccdlflags='-fPIC'
    lddlflags='-shared -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -L/usr/local/lib -fstack-protector-strong'



@INC for perl 5.26.2:
    /usr/lib/perl5/5.26/site_perl
    /usr/share/perl5/site_perl
    /usr/lib/perl5/5.26/vendor_perl
    /usr/share/perl5/vendor_perl
    /usr/lib/perl5/5.26/core_perl
    /usr/share/perl5/core_perl


Environment for perl 5.26.2:
    HOME=/home/ph10
    LANG=en_GB.utf8
    LANGUAGE=en_GB.utf8
    LC_ALL=C
    LC_COLLATE=C
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/ph10/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/android-sdk/platform-tools:/opt/android-sdk/tools:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/usr/sbin:.:/opt/android-sdk/platform-tools:/opt/android-sdk/tools:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Jul 12, 2018

From @Abigail

On Thu, Jul 12, 2018 at 10​:02​:55AM -0700, Philip Hazel (via RT) wrote​:

# New Ticket Created by Philip Hazel
# Please include the string​: [perl #133366]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=133366 >

Subject​: Odd *COMMIT behaviour
To​: perlbug@​perl.org
Message-Id​: <5.26.2_28082_1531414535@​quercite>
Reply-To​: ph10@​cam.ac.uk
Cc​: builduser
From​: ph10@​cam.ac.uk

This is a bug report for perl from ph10@​cam.ac.uk,
generated with the help of perlbug 1.40 running under perl 5.26.2.

-----------------------------------------------------------------
[Please describe your issue here]

Consider this example​:

perl -e 'if ("ABCABD" =~ /(A(*COMMIT)|B)(A|B)D/) { print "yes $&\n"; } else { print "no\n"; }

On my Linux box the output is "yes ABD". But surely it should not match? After
matching the initial "A", shouldn't the (*COMMIT) stop any "bumaplong"
happening? If the (*COMMIT) is moved to before the second "A" in the pattern,
it does indeed print "no".

I think you are right.

What seems to happen is that the optimizer is too eager. It finds the 'D'
in both the pattern and the string, and determines that if the pattern
matches, it should start matching 2 characters before the 'D'. This is
a valid strategy if the (*COMMIT) is not present, but not in the presence
of it.

  use re 'debug';
  "ABCABD" =~ /(A(*COMMIT)|B)(A|B)D/;
  __END__

  Compiling REx "(A(*COMMIT)|B)(A|B)D"
  Final program​:
  1​: OPEN1 (3)
  3​: TRIE-EXACT[AB] (11)
  <A> (6)
  6​: COMMIT (11)
  <B> (11)
  11​: CLOSE1 (13)
  13​: OPEN2 (15)
  15​: TRIE-EXACT[AB] (21)
  <A>
  <B>
  21​: CLOSE2 (23)
  23​: EXACT <D> (25)
  25​: END (0)
  anchored "D" at 2 (checking anchored) stclass AHOCORASICK-EXACT[AB] minlen 3
  Matching REx "(A(*COMMIT)|B)(A|B)D" against "ABCABD"
  Intuit​: trying to determine minimum start position...
  doing 'check' fbm scan, [2..6] gave 5
  Found anchored substr "D" at offset 5 (rx_origin now 3)...
  (multiline anchor test skipped)
  try at offset...
  Intuit​: Successfully guessed​: match at offset 3
  3 <ABC> <ABD> | 0| 1​:OPEN1(3)
  3 <ABC> <ABD> | 0| 3​:TRIE-EXACT[AB](11)
  3 <ABC> <ABD> | 0| State​: 1 Accepted​: N Charid​: 1 CP​: 41 After State​: 2
  4 <ABCA> <BD> | 0| State​: 2 Accepted​: Y Charid​: 0 CP​: 0 After State​: 0
  | 0| got 1 possible matches
  | 0| TRIE matched word #1, continuing
  4 <ABCA> <BD> | 1| 6​:COMMIT(11)
  4 <ABCA> <BD> | 2| 11​:CLOSE1(13)
  4 <ABCA> <BD> | 2| 13​:OPEN2(15)
  4 <ABCA> <BD> | 2| 15​:TRIE-EXACT[AB](21)
  4 <ABCA> <BD> | 2| State​: 1 Accepted​: N Charid​: 2 CP​: 42 After State​: 3
  5 <ABCAB> <D> | 2| State​: 3 Accepted​: Y Charid​: 0 CP​: 0 After State​: 0
  | 2| got 1 possible matches
  | 2| TRIE matched word #2, continuing
  | 2| only one match left, short-circuiting​: #2 <B>
  5 <ABCAB> <D> | 2| 21​:CLOSE2(23)
  5 <ABCAB> <D> | 2| 23​:EXACT <D>(25)
  6 <ABCABD> <> | 2| 25​:END(0)
  Match successful!
  Freeing REx​: "(A(*COMMIT)|B)(A|B)D"
 

Abigail

@p5pRT
Copy link
Author

p5pRT commented Jul 12, 2018

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Jul 13, 2018

From ph10@hermes.cam.ac.uk

On Thu, 12 Jul 2018, Abigail via RT wrote​:

What seems to happen is that the optimizer is too eager. It finds the 'D'
in both the pattern and the string, and determines that if the pattern
matches, it should start matching 2 characters before the 'D'. This is
a valid strategy if the (*COMMIT) is not present, but not in the presence
of it.

Aha! Optimizing does that kind of thing - PCRE has/had similar issues.
One of the PCRE users has been checking out all kinds of stuff and
comparing with Perl. There seem to be some odd anomalies in the handling
of captures in negative assertions and in the handling of (*VERB)s in
subroutine calls in Perl. Would it be helpful if I reported these as
independent issues?

Regards,
Philip

--
Philip Hazel

@p5pRT
Copy link
Author

p5pRT commented Jul 13, 2018

From @Abigail

On Fri, Jul 13, 2018 at 08​:28​:20AM +0100, ph10@​hermes.cam.ac.uk wrote​:

On Thu, 12 Jul 2018, Abigail via RT wrote​:

What seems to happen is that the optimizer is too eager. It finds the 'D'
in both the pattern and the string, and determines that if the pattern
matches, it should start matching 2 characters before the 'D'. This is
a valid strategy if the (*COMMIT) is not present, but not in the presence
of it.

Aha! Optimizing does that kind of thing - PCRE has/had similar issues.
One of the PCRE users has been checking out all kinds of stuff and
comparing with Perl. There seem to be some odd anomalies in the handling
of captures in negative assertions and in the handling of (*VERB)s in
subroutine calls in Perl. Would it be helpful if I reported these as
independent issues?

I think it would.

Abigail

@p5pRT
Copy link
Author

p5pRT commented Jul 13, 2018

From @demerphq

On Fri, 13 Jul 2018, 03​:28 , <ph10@​hermes.cam.ac.uk> wrote​:

On Thu, 12 Jul 2018, Abigail via RT wrote​:

What seems to happen is that the optimizer is too eager. It finds the 'D'
in both the pattern and the string, and determines that if the pattern
matches, it should start matching 2 characters before the 'D'. This is
a valid strategy if the (*COMMIT) is not present, but not in the presence
of it.

Aha! Optimizing does that kind of thing - PCRE has/had similar issues.
One of the PCRE users has been checking out all kinds of stuff and
comparing with Perl. There seem to be some odd anomalies in the handling
of captures in negative assertions and in the handling of (*VERB)s in
subroutine calls in Perl. Would it be helpful if I reported these as
independent issues

Definitely. Although I have to admit we might not "simply" fix this but
instead add a modifier to disable the start position optimizations and
advise users to consider using the modifier in such cases where it matters.

Fwiw there is already an internal flag which disables such optimisations.
For instance add a (??{""}) to the pattern and you should get the expected
results.

Yves

@p5pRT
Copy link
Author

p5pRT commented Jul 13, 2018

From ph10@hermes.cam.ac.uk

On Thu, 12 Jul 2018, Abigail via RT wrote​:

What seems to happen is that the optimizer is too eager. It finds the 'D'
in both the pattern and the string, and determines that if the pattern
matches, it should start matching 2 characters before the 'D'. This is
a valid strategy if the (*COMMIT) is not present, but not in the presence
of it.

This same issue is present if (*COMMIT) is encountered in subroutine
call​:

$ perl -e 'if ("ABXABD" =~ /(?1)(A(*COMMIT)|B)D/) { print "yes\n"; } else { print "no\n"; } '
yes
$ perl -e 'if ("ABXABD" =~ /(??{""})(?1)(A(*COMMIT)|B)D/) { print "yes\n"; } else { print "no\n"; } '
no
$

Turning off the optimization changes the result.

Regards,
Philip

--
Philip Hazel

@p5pRT
Copy link
Author

p5pRT commented Jul 16, 2018

From @Abigail

On Fri, Jul 13, 2018 at 12​:27​:19PM -0400, demerphq wrote​:

On Fri, 13 Jul 2018, 03​:28 , <ph10@​hermes.cam.ac.uk> wrote​:

On Thu, 12 Jul 2018, Abigail via RT wrote​:

What seems to happen is that the optimizer is too eager. It finds the 'D'
in both the pattern and the string, and determines that if the pattern
matches, it should start matching 2 characters before the 'D'. This is
a valid strategy if the (*COMMIT) is not present, but not in the presence
of it.

Aha! Optimizing does that kind of thing - PCRE has/had similar issues.
One of the PCRE users has been checking out all kinds of stuff and
comparing with Perl. There seem to be some odd anomalies in the handling
of captures in negative assertions and in the handling of (*VERB)s in
subroutine calls in Perl. Would it be helpful if I reported these as
independent issues

Definitely. Although I have to admit we might not "simply" fix this but
instead add a modifier to disable the start position optimizations and
advise users to consider using the modifier in such cases where it matters.

That just feels wrong to me. That reads as "we know there's a bug, we
have a fix, but you need to jump through some extra hoops to enable
the fix". Not to mention that most people will not know when they
should enable this.

Fwiw there is already an internal flag which disables such optimisations.
For instance add a (??{""}) to the pattern and you should get the expected
results.

Can't we make it so that the mere presence of a "(*COMMIT)" disables
this optimization?

Abigail

@p5pRT
Copy link
Author

p5pRT commented Jul 16, 2018

From ph10@hermes.cam.ac.uk

On Mon, 16 Jul 2018, Abigail wrote​:

Can't we make it so that the mere presence of a "(*COMMIT)" disables
this optimization?

I thought about doing something similar for PCRE, where starting a
pattern with (*COMMIT) (that is, force match at starting position)
really needs to disable the "skip rapidly ahead to the first matching
character before looking at the pattern" optimization. However, always
disabling this optimization when *COMMIT is present is overkill, because
if it appears later in the pattern there is no problem. Maybe there's a
similar consideration in Perl? [In the PCRE case, one can actually make
this out to be a feature - "commit at first matching character" instead
of "commit where you start looking" - and there is a switch to turn the
optimization off.]

This is all annoyingly messy.

Regards,
Philip

--
Philip Hazel

@p5pRT
Copy link
Author

p5pRT commented Jul 20, 2018

From @demerphq

On Mon, 16 Jul 2018 at 12​:39, Abigail <abigail@​abigail.be> wrote​:

On Fri, Jul 13, 2018 at 12​:27​:19PM -0400, demerphq wrote​:

On Fri, 13 Jul 2018, 03​:28 , <ph10@​hermes.cam.ac.uk> wrote​:

On Thu, 12 Jul 2018, Abigail via RT wrote​:

What seems to happen is that the optimizer is too eager. It finds the 'D'
in both the pattern and the string, and determines that if the pattern
matches, it should start matching 2 characters before the 'D'. This is
a valid strategy if the (*COMMIT) is not present, but not in the presence
of it.

Aha! Optimizing does that kind of thing - PCRE has/had similar issues.
One of the PCRE users has been checking out all kinds of stuff and
comparing with Perl. There seem to be some odd anomalies in the handling
of captures in negative assertions and in the handling of (*VERB)s in
subroutine calls in Perl. Would it be helpful if I reported these as
independent issues

Definitely. Although I have to admit we might not "simply" fix this but
instead add a modifier to disable the start position optimizations and
advise users to consider using the modifier in such cases where it matters.

That just feels wrong to me. That reads as "we know there's a bug, we
have a fix, but you need to jump through some extra hoops to enable
the fix". Not to mention that most people will not know when they
should enable this.

I think there is a tension between fixing bugs and performance in this case.

Disabling the start point optimizations can have a very serious affect
on performance.

Doing so in any pattern that uses a verb, just to fix an edge case set
of errors that I don't think most people would care about or even
notice, seems wrong.

Fwiw there is already an internal flag which disables such optimisations.
For instance add a (??{""}) to the pattern and you should get the expected
results.

Can't we make it so that the mere presence of a "(*COMMIT)" disables
this optimization?

We can, but it isn't clear to me that we should.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants