Unexpected behavior with a regular expression #15666

Closed
p5pRT opened this issue Oct 16, 2016 · 10 comments

p5pRT commented Oct 16, 2016

Migrated from rt.perl.org#129897 (status was 'resolved')

Searchable as RT129897$

p5pRT commented Oct 16, 2016

From @jormalaaksonen

Created by @jormalaaksonen

#! /usr/bin/perl

use strict;
use diagnostics;

my @a = "riiaan" =~ /.*?(xs|p)*(a(a)|i(i))n/;
my @b = ($&, @a);
for my $i ( 0 .. $#b ) {
  if (defined $b[$i]) {
    print "$i $-[$i] $+[$i] [$b[$i]]\n";
  } else {
    print "$i\n";
  }
}

Running it shows me:

0 0 6 [riiaan]
1
2 3 5 [aa]
3 4 5 [a]
4 2 3 [i]

so the last regexp group (i) seems to have been matched as "ri(i)aan"
even though it should not have matched at all. The match can be
avoided e.g. by removing "?", "x" or the first "|" from the regexp.
Then the output is correct:

0 0 6 [riiaan]
1
2 3 5 [aa]
3 4 5 [a]
4
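
The inconsistency in the first output can also be detected mechanically: a nested group that took part in the match should lie within the span of its enclosing group. The following is a minimal sketch and not part of the original report; the helper name stale_children and the hand-written map of child group to parent group are assumptions made for illustration. Going by the offsets shown above, an affected perl (such as the 5.22.1 used in the report) would be expected to flag group 4, since it spans 2..3 while its enclosing group 2 spans 3..5.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the numbers of child groups whose span falls outside the span of
# their enclosing (parent) group. $start and $end are copies of @- and @+;
# $parent_of maps a child group number to its parent group number
# (hypothetical helper, written for this illustration only).
sub stale_children {
    my ($start, $end, $parent_of) = @_;
    my @stale;
    for my $child (sort { $a <=> $b } keys %$parent_of) {
        my $parent = $parent_of->{$child};
        next unless defined $start->[$child];    # child did not participate
        if (!defined $start->[$parent]
            || $start->[$child] < $start->[$parent]
            || $end->[$child]   > $end->[$parent]) {
            push @stale, $child;
        }
    }
    return @stale;
}

my $matched = "riiaan" =~ /.*?(xs|p)*(a(a)|i(i))n/;

# Copy the offset arrays immediately, before any other match can change them.
my @start = @-;
my @end   = @+;

# In the reported pattern, groups 3 and 4 are both nested inside group 2.
my @stale = $matched ? stale_children(\@start, \@end, { 3 => 2, 4 => 2 }) : ();
print @stale ? "stale groups: @stale\n" : "captures look consistent\n";
```

The parent map has to be written by hand for each pattern, so this is only practical as a spot check for a pattern that is already suspected of misbehaving.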

Any hint if I'm doing something wrong or not doing something I should
do?

Yours,

Jorma

Perl Info

Flags:
    category=core
    severity=high

Site configuration information for perl 5.22.1:

Configured by Debian Project at Sun Mar 13 11:54:18 UTC 2016.

Summary of my perl5 (revision 5 version 22 subversion 1) configuration:
   
  Platform:
    osname=linux, osvers=3.16.0, archname=x86_64-linux-gnu-thread-multi
    uname='linux localhost 3.16.0 #1 smp debian 3.16.0 x86_64 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dcc=x86_64-linux-gnu-gcc -Dcpp=x86_64-linux-gnu-cpp -Dld=x86_64-linux-gnu-gcc -Dccflags=-DDEBIAN -Wdate-time -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Dldflags= -Wl,-Bsymbolic-functions -Wl,-z,relro -Dlddlflags=-shared -Wl,-Bsymbolic-functions -Wl,-z,relro -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.22 -Darchlib=/usr/lib/x86_64-linux-gnu/perl/5.22 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/x86_64-linux-gnu/perl5/5.22 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.22.1 -Dsitearch=/usr/local/lib/x86_64-linux-gnu/perl/5.22.1 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Duse64bitint -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Ui_libutil -Uversiononly -DDEBUGGING=-g -Doptimize=-O2 -dEs -Duseshrplib -Dlibperl=libperl.so.5.22.1'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='x86_64-linux-gnu-gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='5.3.1 20160311', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678, doublekind=3
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16, longdblkind=3
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='x86_64-linux-gnu-gcc', ldflags =' -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/5/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=libc-2.21.so, so=so, useshrplib=true, libperl=libperl.so.5.22
    gnulibc_version='2.21'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib -fstack-protector-strong'

Locally applied patches:
    DEBPKG:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN.
    DEBPKG:debian/db_file_ver - http://bugs.debian.org/340047 Remove overly restrictive DB_File version check.
    DEBPKG:debian/doc_info - Replace generic man(1) instructions with Debian-specific information.
    DEBPKG:debian/enc2xs_inc - http://bugs.debian.org/290336 Tweak enc2xs to follow symlinks and ignore missing @INC directories.
    DEBPKG:debian/errno_ver - http://bugs.debian.org/343351 Remove Errno version check due to upgrade problems with long-running processes.
    DEBPKG:debian/libperl_embed_doc - http://bugs.debian.org/186778 Note that libperl-dev package is required for embedded linking
    DEBPKG:fixes/respect_umask - Respect umask during installation
    DEBPKG:debian/writable_site_dirs - Set umask approproately for site install directories
    DEBPKG:debian/extutils_set_libperl_path - EU:MM: set location of libperl.a under /usr/lib
    DEBPKG:debian/no_packlist_perllocal - Don't install .packlist or perllocal.pod for perl or vendor
    DEBPKG:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets.
    DEBPKG:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor.
    DEBPKG:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy.
    DEBPKG:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable.
    DEBPKG:debian/mod_paths - Tweak @INC ordering for Debian
    DEBPKG:debian/prune_libs - http://bugs.debian.org/128355 Prune the list of libraries wanted to what we actually need.
    DEBPKG:fixes/net_smtp_docs - [rt.cpan.org #36038] http://bugs.debian.org/100195 Document the Net::SMTP 'Port' option
    DEBPKG:debian/perlivp - http://bugs.debian.org/510895 Make perlivp skip include directories in /usr/local
    DEBPKG:debian/deprecate-with-apt - http://bugs.debian.org/747628 Point users to Debian packages of deprecated core modules
    DEBPKG:debian/squelch-locale-warnings - http://bugs.debian.org/508764 Squelch locale warnings in Debian package maintainer scripts
    DEBPKG:debian/skip-upstream-git-tests - Skip tests specific to the upstream Git repository
    DEBPKG:debian/patchlevel - http://bugs.debian.org/567489 List packaged patches for 5.22.1-9 in patchlevel.h
    DEBPKG:debian/skip-kfreebsd-crash - http://bugs.debian.org/628493 [perl #96272] Skip a crashing test case in t/op/threads.t on GNU/kFreeBSD
    DEBPKG:fixes/document_makemaker_ccflags - http://bugs.debian.org/628522 [rt.cpan.org #68613] Document that CCFLAGS should include $Config{ccflags}
    DEBPKG:debian/find_html2text - http://bugs.debian.org/640479 Configure CPAN::Distribution with correct name of html2text
    DEBPKG:debian/perl5db-x-terminal-emulator.patch - http://bugs.debian.org/668490 Invoke x-terminal-emulator rather than xterm in perl5db.pl
    DEBPKG:debian/cpan-missing-site-dirs - http://bugs.debian.org/688842 Fix CPAN::FirstTime defaults with nonexisting site dirs if a parent is writable
    DEBPKG:fixes/memoize_storable_nstore - [rt.cpan.org #77790] http://bugs.debian.org/587650 Memoize::Storable: respect 'nstore' option not respected
    DEBPKG:debian/regen-skip - Skip a regeneration check in unrelated git repositories
    DEBPKG:debian/makemaker-pasthru - http://bugs.debian.org/758471 Pass LD settings through to subdirectories
    DEBPKG:fixes/pod_man_reproducible_date - http://bugs.debian.org/759405 Support POD_MAN_DATE in Pod::Man for the left-hand footer
    DEBPKG:debian/locale-robustness - http://bugs.debian.org/782068 [perl #124310] Make t/run/locale.t survive missing locales masked by LC_ALL
    DEBPKG:fixes/podman-utc - http://bugs.debian.org/780259 Make the embedded date from Pod::Man reproducible
    DEBPKG:fixes/podman-utc-docs - http://bugs.debian.org/780259 Documentation and test suite updates for UTC fix
    DEBPKG:fixes/podman-empty-date - http://bugs.debian.org/780259 Support an empty POD_MAN_DATE environment variable
    DEBPKG:fixes/podman-pipe - http://bugs.debian.org/777405 Better errors for man pages from standard input
    DEBPKG:debian/pod2man-customized - Update porting/customized.dat for pod2man modifications
    DEBPKG:debian/makemaker-manext - http://bugs.debian.org/247370 Make EU::MakeMaker honour MANnEXT settings in generated manpage headers
    DEBPKG:debian/makemaker_customized - Update t/porting/customized.dat for files patched in Debian
    DEBPKG:debian/do-not-record-build-date - [6baa8db] http://bugs.debian.org/774422 [perl #125830] Allow overriding the compile time in "perl -V" output
    DEBPKG:fixes/podman-source-date-epoch - http://bugs.debian.org/801621 Make Pod::Man honor the SOURCE_DATE_EPOCH environment variable
    DEBPKG:fixes/podman-source-date-epoch-cleanups - http://bugs.debian.org/801621 Coding style and documentation for SOURCE_EPOCH_DATE
    DEBPKG:fixes/podman-source-date-epoch-testfix - http://bugs.debian.org/807086 Guard for building with SOURCE_DATE_EPOCH or POD_MAN_DATE set
    DEBPKG:debian/devel-ppport-reproducibility - http://bugs.debian.org/801523 Sort the list of XS code files when generating RealPPPort.xs
    DEBPKG:fixes/encode-unicode-bom - http://bugs.debian.org/798727 [rt.cpan.org #107043] Address https://rt.cpan.org/Public/Bug/Display.html?id=107043
    DEBPKG:debian/encode-unicode-bom-doc - http://bugs.debian.org/798727 Document Debian backport of Encode::Unicode fix
    DEBPKG:debian/kfreebsd-softupdates - http://bugs.debian.org/796798 Work around Debian Bug#796798
    DEBPKG:fixes/autodie-scope - http://bugs.debian.org/798096 Fix a scoping issue with "no autodie" and the "system" sub
    DEBPKG:debian/debugperl-compat-fix - [perl #127212] http://bugs.debian.org/810326 Disable PERL_TRACK_MEMPOOL for debugging builds
    DEBPKG:fixes/CVE-2015-8607_file_spec_taint_fix - http://bugs.debian.org/810719 [perl #126862] ensure File::Spec::canonpath() preserves taint
    DEBPKG:fixes/mkstemp-umask - http://bugs.debian.org/810924 [perl #127322] [e57270b] Fix umask for mkstemp(3) calls
    DEBPKG:fixes/crosscompile-no-targethost - [perl #127234] Fix the Configure escape with usecrosscompile but no targethost
    DEBPKG:fixes/podlators-no-encode - [rt.cpan.org #111156] Degrade gracefully if utf8 is requested but Encode is not available
    DEBPKG:debian/cross-time-hires - [rt.cpan.org #111391] Add an environment variable to skip running configuration probes
    DEBPKG:fixes/encode-unicode-pod - Unicode.pm: Fix POD error
    DEBPKG:fixes/memoize-pod - [rt.cpan.org #89441] Fix POD errors in Memoize
    DEBPKG:fixes/ok-pod - Added encoding for pod.
    DEBPKG:fixes/CVE-2016-2381_duplicate_env - remove duplicate environment variables from environ


@INC for perl 5.22.1:
    /etc/perl
    /usr/local/lib/x86_64-linux-gnu/perl/5.22.1
    /usr/local/share/perl/5.22.1
    /usr/lib/x86_64-linux-gnu/perl5/5.22
    /usr/share/perl5
    /usr/lib/x86_64-linux-gnu/perl/5.22
    /usr/share/perl/5.22
    /usr/local/lib/site_perl
    /usr/lib/x86_64-linux-gnu/perl-base
    .


Environment for perl 5.22.1:
    HOME=/home/jorma
    LANG=fi_FI.UTF-8
    LANGUAGE=en_US
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/jorma/bin:/home/jorma/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/sbin:/usr/sbin:/home/jorma/hosts
    PERL_BADLANG (unset)
    SHELL=/bin/tcsh
-- 
Jorma Laaksonen                             jorma.laaksonen@aalto.fi
Teaching Researcher                 http://users.ics.aalto.fi/jorma/
Dr. of Science in Technology                    mob. +358-50-3058719
Department of Computer Science                  fax. +358-9-47023277
Aalto University School of Science
Konemiehentie 2, PO Box 15400, FI-00076 Aalto, Finland

p5pRT commented Oct 17, 2016

From zefram@fysh.org

Jorma Laaksonen wrote:

Any hint if I'm doing something wrong or not doing something I should
do?

No, that's all supported usage. You're quite right about the behaviour
being erroneous.

-zefram

p5pRT commented Oct 17, 2016

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Oct 17, 2016

From @demerphq

On 17 October 2016 at 04:50, Zefram <zefram@fysh.org> wrote:

Jorma Laaksonen wrote:

Any hint if I'm doing something wrong or not doing something I should
do?

No, that's all supported usage. You're quite right about the behaviour
being erroneous.

Agreed.

It seems to be a bug about unwinding .*? although it also interacts
with TRIE code in ways I don't entirely understand. (Making the code
not produce a TRIE fixes the bug, but on the other hand, so does
removing the .*?)

Nevertheless I can fix the bug (while possibly introducing new bugs)
with the code in yves/fix_129897
c09f087

I would prefer that Dave have a look into this, as I don't entirely
understand why my patch fixes things for this case, while in most
other cases it is not needed.

The key point is that when we fail a .*? match we should unwind and
reset any buffers we matched after our current point. But STAR and
PLUS do not initialize the proper member fields so that we can do this
unwinding properly.

I have to admit that this bug is quite surprising. I would have
thought that with a bug like this we would fail our regex tests
completely, but apparently not.

Of course, it may have to do with the fact that the form of this bug
is incredibly horrible. Having an unanchored .* at the beginning of a
pattern is a good way to make your regex quadratic on failure. (We may
trigger an optimisation that automagically adds the anchor, and we may
not....)

So it may simply be that most times we don't trigger this bug, but I
admit it's not obvious to me why not.
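
As an illustration of the remark about the leading .*?: an unanchored match already tries successive start positions, so when only the captured groups are needed the prefix can often be dropped. The sketch below is an assumption-laden example, not something posted in the thread. It compares the reported pattern with the same pattern minus the prefix; note that the whole match ($&) then starts at the first position where the rest of the pattern matches rather than at offset 0.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "riiaan";

# Pattern from the report, and the same pattern without the leading ".*?".
my @with    = $str =~ /.*?(xs|p)*(a(a)|i(i))n/;
my @without = $str =~ /(xs|p)*(a(a)|i(i))n/;

# Render undefined groups visibly so the two lists can be compared by eye.
sub show { return join ",", map { defined $_ ? "[$_]" : "undef" } @_ }

print "with .*?   : ", show(@with), "\n";
print "without .*?: ", show(@without), "\n";
```

On a perl without the bug the two lists should agree; per the discussion above, the variant without the prefix is also the one that sidesteps the problem on an affected perl.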

Yves

commit c09f087
Author: Yves Orton <demerphq@gmail.com>
Date: Mon Oct 17 18:29:43 2016 +0200

  provisional patch to fix [perl #129897]

diff --git a/regexec.c b/regexec.c
index e9e23f2..0cde487 100644
--- a/regexec.c
+++ b/regexec.c
@@ -7868,6 +7868,8 @@ NULL

         case STAR:             /*  /A*B/ where A is width 1 char */
            ST.paren = 0;
+            ST.lastparen      = rex->lastparen;
+           ST.lastcloseparen = rex->lastcloseparen;
            ST.min = 0;
            ST.max = REG_INFTY;
            scan = NEXTOPER(scan);
@@ -7875,6 +7877,8 @@ NULL

         case PLUS:             /*  /A+B/ where A is width 1 char */
            ST.paren = 0;
+            ST.lastparen      = rex->lastparen;
+           ST.lastcloseparen = rex->lastcloseparen;
            ST.min = 1;
            ST.max = REG_INFTY;
            scan = NEXTOPER(scan);
@@ -7900,6 +7904,8 @@ NULL
            ST.paren = 0;
            ST.min = ARG1(scan);  /* min to match */
            ST.max = ARG2(scan);  /* max to match */
+            ST.lastparen      = rex->lastparen;
+           ST.lastcloseparen = rex->lastcloseparen;
            scan = NEXTOPER(scan) + NODE_STEP_REGNODE;
          repeat:
            /*
@@ -8013,7 +8019,7 @@ NULL
            /* failed to find B in a non-greedy match where c1,c2 valid */

            REGCP_UNWIND(ST.cp);
-            if (ST.paren) {
+            if ( 1 || ST.paren ) {
                 UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
             }
            /* Couldn't or didn't -- move forward. */
@@ -8086,7 +8092,7 @@ NULL
            /* failed to find B in a non-greedy match where c1,c2 invalid */

            REGCP_UNWIND(ST.cp);
-            if (ST.paren) {
+            if ( 1 || ST.paren ) {
                 UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
             }
            /* failed -- move forward one */
@@ -8147,7 +8153,7 @@ NULL
            /* failed to find B in a greedy match */

            REGCP_UNWIND(ST.cp);
-            if (ST.paren) {
+            if ( 1 || ST.paren ) {
                 UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
             }
            /*  back up. */

-- 

perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented Oct 17, 2016

From @demerphq

On 17 October 2016 at 18:37, demerphq <demerphq@gmail.com> wrote:

So it may simply be that most times we don't trigger this bug, but I
admit it's not obvious to me why not.

Because my analysis was wrong. Dave, forget it; there is nothing you need to poke into.

Put simply, the "short-circuit" logic in the TRIE code should not
trigger when there is a jump table.
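
For readers stuck on an affected perl, the earlier observation that not producing a TRIE avoids the bug suggests a possible workaround. The sketch below is only that: it relies on the documented ${^RE_TRIE_MAXBUF} variable (see perlvar), where a negative value asks the regex compiler not to apply the trie optimisation. Whether this actually avoids the bug on a given perl has not been verified here. The pattern is kept in a string so that it is compiled at run time, after the variable has been set.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pattern from the report, held as a string so that compilation is deferred.
my $pattern = '.*?(xs|p)*(a(a)|i(i))n';

my $re = do {
    # A negative value disables the trie optimisation for regexes compiled
    # while it is in effect; local restores the old value afterwards.
    local ${^RE_TRIE_MAXBUF} = -1;
    qr/$pattern/;    # compiled here, at run time, because it interpolates
};

my @groups = "riiaan" =~ $re;
print join(",", map { defined $_ ? "[$_]" : "undef" } @groups), "\n";
```

Running the same pattern under "use re 'debug'" with and without the setting is one way to check whether TRIE nodes are still being generated.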

I have a patch ready, but I am having issues talking to the master
repo right now.

Yves

p5pRT commented Nov 2, 2016

From @jormalaaksonen

On Mon Oct 17 14:40:47 2016, demerphq wrote:

Put simply, the "short-circuit" logic in the TRIE code should not
trigger when there is a jump table.

I have a patch ready, but I am having issues talking to the master
repo right now.

Yves

Thank you for your rapid responses and the patch. I'm happy to confirm that the fix has removed all the problems I had associated with this behavior of perl.

Thanks,

Jorma

p5pRT commented Nov 11, 2016

From @hvds

On Mon, 17 Oct 2016 14:40:47 -0700, demerphq wrote:

I have a patch ready, but I am having issues talking to the master
repo right now.

It appears this did eventually go in as cfe04db:
  regexec.c: fix perl #129897: trie short circuit breaks capture buffers
 
  There is an optimisation when a trie matches only one thing
  which causes it to fall through to the following code without
  setting up a stack unwind frame. This breaks if we are using
  a trie jump table where we might change state that will need
  to be unwound on failure.

.. with a followup to fix the test in ac2365f.
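
A regression check for the reported case could look roughly like the following. This is a sketch using Test::More and the pattern from the original report; it is not the test that was committed to the perl test suite, and the test descriptions are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Test::More tests => 4;

# Pattern and subject string taken from the original report.
my @groups = "riiaan" =~ /.*?(xs|p)*(a(a)|i(i))n/;

ok(!defined $groups[0], "group 1 (xs|p) did not participate");
is($groups[1], "aa",    "group 2 captured 'aa'");
is($groups[2], "a",     "group 3 captured the nested 'a'");

# This is the assertion that an affected perl is expected to fail,
# because the abandoned 'i(i)' branch leaves a stale capture behind.
ok(!defined $groups[3], "group 4 is undefined after backtracking");
```

On a perl that includes the fix all four assertions should pass.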

I'm setting it to 'pending release' - Yves, please correct it if that was inappropriate.

Hugo

p5pRT commented Nov 11, 2016

@hvds - Status changed from 'open' to 'pending release'

p5pRT commented May 30, 2017

From @khwilliamson

Thank you for filing this report. You have helped make Perl better.

With the release today of Perl 5.26.0, this and 210 other issues have been
resolved.

Perl 5.26.0 may be downloaded via:
https://metacpan.org/release/XSAWYERX/perl-5.26.0

If you find that the problem persists, feel free to reopen this ticket.

p5pRT commented May 30, 2017

@khwilliamson - Status changed from 'pending release' to 'resolved'
