Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Regex matches when it shouldn't #15753

Closed
p5pRT opened this issue Dec 10, 2016 · 15 comments
Closed

Bug: Regex matches when it shouldn't #15753

p5pRT opened this issue Dec 10, 2016 · 15 comments

Comments

@p5pRT
Copy link

p5pRT commented Dec 10, 2016

Migrated from rt.perl.org#130307 (status was 'resolved')

Searchable as RT130307$

@p5pRT
Copy link
Author

p5pRT commented Dec 10, 2016

From @ikegami

Created by @ikegami

$ perl -E'say $& if "aaagaggaaaaggggaaaaggggaaaaggggaaa" =~
/(?​:(?​:g{3,}[atgc]{1,7}){3,}g{3,})/'
gggaaaaggggaaaagggg

This shouldn't match.

http​://stackoverflow.com/questions/41071498/perl-regular-expression-mismatches-repetitive-strings

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl 5.24.0:

Configured by ikegami at Wed May 25 21:22:43 PDT 2016.

Summary of my perl5 (revision 5 version 24 subversion 0) configuration:

  Platform:
    osname=linux, osvers=3.2.61-grsec-modsign,
archname=x86_64-linux-thread-multi
    uname='linux springfield 3.2.61-grsec-modsign #1 smp tue aug 12
09:58:26 utc 2014 x86_64 x86_64 x86_64 gnulinux '
    config_args='-de -Dprefix=/home/ikegami/usr/perlbrew/perls/5.24.0t
-DPERL_SUB_DEPTH_WARN=1000 -Dusethreads
-Aeval:scriptdir=/home/ikegami/usr/perlbrew/perls/5.24.0t/bin'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fwrapv
-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe
-fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.6.3', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678,
doublekind=3
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16,
longdblkind=3
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/4.6/include-fixed
/usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib
/usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
    libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
-lgdbm_compat
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.15.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.15'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib
-fstack-protector'



@INC for perl 5.24.0:

/home/ikegami/usr/perlbrew/perls/5.24.0t/lib/site_perl/5.24.0/x86_64-linux-thread-multi
    /home/ikegami/usr/perlbrew/perls/5.24.0t/lib/site_perl/5.24.0

/home/ikegami/usr/perlbrew/perls/5.24.0t/lib/5.24.0/x86_64-linux-thread-multi
    /home/ikegami/usr/perlbrew/perls/5.24.0t/lib/5.24.0
    .


Environment for perl 5.24.0:
    HOME=/home/ikegami
    LANG (unset)
    LANGUAGE (unset)
    LC_COLLATE=C
    LD_LIBRARY_PATH=/home/ikegami/lib
    LOGDIR (unset)

PATH=/home/ikegami/usr/perlbrew/bin:/home/ikegami/usr/perlbrew/perls/latest/bin:.:/home/ikegami/bin:/home/ikegami/.gems/bin:/usr/lib/ruby/gems/1.8/bin/:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
    PERLBREW_BASHRC_VERSION=0.71
    PERLBREW_HOME=/home/ikegami/.perlbrew
    PERLBREW_MANPATH=/home/ikegami/usr/perlbrew/perls/latest/man

PERLBREW_PATH=/home/ikegami/usr/perlbrew/bin:/home/ikegami/usr/perlbrew/perls/latest/bin
    PERLBREW_PERL=latest
    PERLBREW_ROOT=/home/ikegami/usr/perlbrew
    PERLBREW_VERSION=0.71
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Dec 10, 2016

From @demerphq

On 10 December 2016 at 05​:38, Eric Brine <perlbug-followup@​perl.org> wrote​:

# New Ticket Created by "Eric Brine"
# Please include the string​: [perl #130307]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=130307 >

This is a bug report for perl from ikegami@​adaelis.com,
generated with the help of perlbug 1.40 running under perl 5.24.0.

-----------------------------------------------------------------
[Please describe your issue here]

$ perl -E'say $& if "aaagaggaaaaggggaaaaggggaaaaggggaaa" =~
/(?​:(?​:g{3,}[atgc]{1,7}){3,}g{3,})/'
gggaaaaggggaaaagggg

This shouldn't match.

http​://stackoverflow.com/questions/41071498/perl-regular-expression-mismatches-repetitive-strings

Agreed, it should not match.

A bit of superficial poking suggests two areas to follow up on​:

first, if I recode it to eliminate the [atgc]{1,7} the match fails as
expected. (For instance using [atgc](?[atgc](?​:..)?)? ), so the logic
is related to the {1,7} handling somehow.

second, i see the super linear cache kicking in, right before we
match, which seems a bit suspicious.

hugo and dave know that code the best i would say,

On the other hand, it shows off Dave's constant string optimisation
rework to some advantage. It detects quickly the match couldnt start
before the 11'th char (iirc).

cheers.
yves

@p5pRT
Copy link
Author

p5pRT commented Dec 10, 2016

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Dec 10, 2016

From @iabyn

On Fri, Dec 09, 2016 at 08​:38​:10PM -0800, Eric Brine wrote​:

# New Ticket Created by "Eric Brine"
# Please include the string​: [perl #130307]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=130307 >

This is a bug report for perl from ikegami@​adaelis.com,
generated with the help of perlbug 1.40 running under perl 5.24.0.

-----------------------------------------------------------------
[Please describe your issue here]

$ perl -E'say $& if "aaagaggaaaaggggaaaaggggaaaaggggaaa" =~
/(?​:(?​:g{3,}[atgc]{1,7}){3,}g{3,})/'
gggaaaaggggaaaagggg

This shouldn't match.

http​://stackoverflow.com/questions/41071498/perl-regular-expression-mismatches-repetitive-strings

I'm not seeing a bug.

Changing the g to an X to make it more visually distinct, and adding some
captures to show what's happening, then with this​:

print "[$1][$2][$3][$4] = [$&]\n" if "aaaXaXXaaaaXXXXaaaaXXXXaaaaXXXXaaa" =~
/( (?​:(X{3,}) ([atXc]{1,7}) ){3,}) (X{3,})/x;

I get​:

[XXXaaaaXXXXaaaaX][XXX][XaaaaX][XXX] = [XXXaaaaXXXXaaaaXXXX]

Breaking $& down, we get

  first second third
X{3,} [atXc]{1,7}) [atXc]{1,7}) [atXc]{1,7}) X{3,}
----- ------------ ------------ ------------ -----
  XXX aaaaXX X XaaaaX XXX

(the split between the first and second iterations is speculative - the
capture only shows the third and final iteration).

Note that there's no requirement that the second and subsequent iterations
of [atXc]{1,7} have to exactly match the first iteration of [atXc]{1,7}, or
even to be the same length.

--
The crew of the Enterprise encounter an alien life form which is
surprisingly neither humanoid nor made from pure energy.
  -- Things That Never Happen in "Star Trek" #22

@p5pRT
Copy link
Author

p5pRT commented Dec 10, 2016

From @demerphq

On 10 December 2016 at 10​:49, Dave Mitchell <davem@​iabyn.com> wrote​:

On Fri, Dec 09, 2016 at 08​:38​:10PM -0800, Eric Brine wrote​:

# New Ticket Created by "Eric Brine"
# Please include the string​: [perl #130307]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=130307 >

This is a bug report for perl from ikegami@​adaelis.com,
generated with the help of perlbug 1.40 running under perl 5.24.0.

-----------------------------------------------------------------
[Please describe your issue here]

$ perl -E'say $& if "aaagaggaaaaggggaaaaggggaaaaggggaaa" =~
/(?​:(?​:g{3,}[atgc]{1,7}){3,}g{3,})/'
gggaaaaggggaaaagggg

This shouldn't match.

http​://stackoverflow.com/questions/41071498/perl-regular-expression-mismatches-repetitive-strings

I'm not seeing a bug.

Changing the g to an X to make it more visually distinct, and adding some
captures to show what's happening, then with this​:

print "[$1][$2][$3][$4] = [$&]\n" if "aaaXaXXaaaaXXXXaaaaXXXXaaaaXXXXaaa" =~
/( (?​:(X{3,}) ([atXc]{1,7}) ){3,}) (X{3,})/x;

I get​:

[XXXaaaaXXXXaaaaX][XXX][XaaaaX][XXX] = [XXXaaaaXXXXaaaaXXXX]

Which does not match the pattern stated. we have /(?​:X{3,}Y{1,7}){3,}X{3,}/.

$1 should contain at least 9 X's, and contain at least 3 sequences of
'X' which are 3 chars or longer.

In total we need​: 3 X's, followed by 1-7 chars, 3 X's, followed by 1-7
chars, 3 X's, followed by 1-7 chars, followed by 3 Xes.

the minimum string this should accept is

XXXaXXXaXXXaXXX

Note the difference to the accepted match, it does not have enough
XXX'es in it.There should be at least 3 * 3 + 3 = 12, the match only
has 11.

Also if you reformulate the pattern to not use {3,} it does not match.

If we accept that X{3,} is the same as XXX+ and that X{1,7} is the
same as X(?​:X(?​:X(?​:X(?​:X(?​:XX?)?)?)?)?)? then the following should
match the same as when the counted quantifier is used​:

perl -le' print
"aaaXaXXaaaaXXXXaaaaXXXXaaaaXXXXaaa" =~/
XXX+ [atXc](?​:[atXc](?​:[atXc](?​:[atXc](?​:[atXc](?​:[atXc][atXc]?)?)?)?)?)?
XXX+ [atXc](?​:[atXc](?​:[atXc](?​:[atXc](?​:[atXc](?​:[atXc][atXc]?)?)?)?)?)?
(?​:XXX+ [atXc](?​:[atXc](?​:[atXc](?​:[atXc](?​:[atXc](?​:[atXc][atXc]?)?)?)?)?)?)+
XXX+
/x ? "matched" : "failed"; '

But it prints "failed".

Breaking $& down, we get

       first         second         third

X{3,} [atXc]{1,7}) [atXc]{1,7}) [atXc]{1,7}) X{3,}
----- ------------ ------------ ------------ -----
XXX aaaaXX X XaaaaX XXX

(the split between the first and second iterations is speculative - the
capture only shows the third and final iteration).

Note that there's no requirement that the second and subsequent iterations
of [atXc]{1,7} have to exactly match the first iteration of [atXc]{1,7}, or
even to be the same length.

Also, if I run the match twice, the second time on $& from the first,
the second time it fails​:

perl -E'say $& if "aaagaggaaaaggggaaaaggggaaaaggggaaa" =~
/(?​:(?​:g{3,}[atgc]{1,7}){3,}g{3,})/; my $match= $&amp;; say
$match=~/(?​:(?​:g{3,}[atgc]{1,7}){3,}g{3,})/ ? "matched" : "failed";;'
gggaaaaggggaaaagggg
failed

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Dec 10, 2016

From @hvds

On Sat, 10 Dec 2016 01​:50​:00 -0800, davem wrote​:

I'm not seeing a bug.

Changing the g to an X to make it more visually distinct
[...]
/( (?​:(X{3,}) ([atXc]{1,7}) ){3,}) (X{3,})/x;

For this to match, I'd expect the target string to need 4 distinct groups of X{3,}, so I'm surprised that you're not surprised.

Reducing it slightly, and then expanding some of the quantifiers, I have this​:
% perl -wle 'print $1 if "XXX-XXX-XXX-" =~ m{( (?​: XXX*[-X][-X]?[-X]?[-X]? ){3,} ) XXX*}x'
XXX-XXX-X
%
.. but that $1 should have at least 3 copies of the mandatory XX in /XXX*[-X][-X]?[-X]?[-X]?/, and it doesn't.

Like Yves, I suspect the superlinear cache, I'll try to dig some over the weekend.

Hugo

@p5pRT
Copy link
Author

p5pRT commented Dec 10, 2016

From @iabyn

On Sat, Dec 10, 2016 at 11​:34​:21AM +0100, demerphq wrote​:

On 10 December 2016 at 10​:49, Dave Mitchell <davem@​iabyn.com> wrote​:

I'm not seeing a bug.

Changing the g to an X to make it more visually distinct, and adding some
captures to show what's happening, then with this​:

print "[$1][$2][$3][$4] = [$&]\n" if "aaaXaXXaaaaXXXXaaaaXXXXaaaaXXXXaaa" =~
/( (?​:(X{3,}) ([atXc]{1,7}) ){3,}) (X{3,})/x;

I get​:

[XXXaaaaXXXXaaaaX][XXX][XaaaaX][XXX] = [XXXaaaaXXXXaaaaXXXX]

Which does not match the pattern stated. we have /(?​:X{3,}Y{1,7}){3,}X{3,}/.

Ah yeah - total misread on my part.

Need to think some more.

--
But Pity stayed his hand. "It's a pity I've run out of bullets",
he thought. -- "Bored of the Rings"

@p5pRT
Copy link
Author

p5pRT commented Dec 11, 2016

From explorer@joaquinferrero.com

Confirmed the workaround found by the stackoverflow author with v5.24.0​:

False positive​:

$ perl -E 'say $& while "aaagaggaaaaggggaaaaggggaaaaggggaaa" =~ /((G{3,}[ATGC]{1,7}){3,}G{3,})/gi'
gggaaaaggggaaaagggg

Workaround​: remove /i

$ perl -E'say $& while "aaagaggaaaaggggaaaaggggaaaaggggaaa" =~ /(([gG]{3,}[aAtTgGcC]{1,7}){3,}[gG]{3,})/g'

@p5pRT
Copy link
Author

p5pRT commented Dec 11, 2016

From @hvds

I believe the root cause here is that after enabling the super-linear cache, when we short-circuit on a cache hit we fail to do proper cleanup on backtracking.

The particular issue is that this chunk​:
  if (reginfo->info_aux->poscache[offset] & mask) {
  DEBUG_EXECUTE_r( Perl_re_exec_indentf( aTHX_ "whilem​: (cache) already tried at this position...\n",
  depth)
  );
  sayNO; /* cache records failure */
  }
.. needs at least to --cur_curlyx->u.curlyx.count, but I'm not yet sure precisely what other backtrack cleanup it needs to do. Hopefully a bit of history-digging will help me clarify that.

This helps to highlight what's going wrong​:

% ./miniperl -Dr -we '"ABC.EFG.IJK.." =~ m{(((([A-Z]{2,})([.A-Z]{1,4})(?{ local @​t = (@​t, "[$4 $5]"); warn sprintf "whilem captures @​t\n"})){3,})[A-Z]{2,})}' 2>&1 | grep whilem | grep -C4 'matched 4'
whilem captures [AB C.] [EFG .] [IJK ..]
  | 10| whilem​: matched 3 out of 3..32767
  | 10| whilem​: (cache) already tried at this position...
whilem captures [AB C.] [EFG .] [IJK .]
  | 10| whilem​: matched 4 out of 3..32767
  | 10| whilem​: (cache) already tried at this position...
whilem captures [AB C.] [EFG .] [IJ K..]
  | 10| whilem​: matched 5 out of 3..32767
  | 10| whilem​: (cache) already tried at this position...
%

Hugo

@p5pRT
Copy link
Author

p5pRT commented Dec 12, 2016

From @iabyn

On Sun, Dec 11, 2016 at 08​:04​:28AM -0800, Hugo van der Sanden via RT wrote​:

I believe the root cause here is that after enabling the super-linear
cache, when we short-circuit on a cache hit we fail to do proper cleanup
on backtracking.

The particular issue is that this chunk​:
if (reginfo->info_aux->poscache[offset] & mask) {
DEBUG_EXECUTE_r( Perl_re_exec_indentf( aTHX_ "whilem​: (cache) already tried at this position...\n",
depth)
);
sayNO; /* cache records failure */
}
.. needs at least to --cur_curlyx->u.curlyx.count, but I'm not yet sure
precisely what other backtrack cleanup it needs to do. Hopefully a bit
of history-digging will help me clarify that.

I think that should be sufficient. We only need to undo what's done
between the

  case WHILEM​:

and the

  if (cache hit)
  sayNO;

since we're trying to pretend that the WHILEM match never happened.

I'm amazed this was never encountered before.

--
No man treats a motor car as foolishly as he treats another human being.
When the car will not go, he does not attribute its annoying behaviour to
sin, he does not say, You are a wicked motorcar, and I shall not give you
any more petrol until you go. He attempts to find out what is wrong and
set it right.
  -- Bertrand Russell,
  Has Religion Made Useful Contributions to Civilization?

@p5pRT
Copy link
Author

p5pRT commented Dec 12, 2016

From @hvds

On Sun, 11 Dec 2016 08​:04​:28 -0800, hv wrote​:

I believe the root cause here is that after enabling the super-linear
cache, when we short-circuit on a cache hit we fail to do proper
cleanup on backtracking.

I've now pushed d3c48e8​:
  [perl #130307] Correctly unwind on cache hit
 
  We've already incremented curlyx.count in the WHILEM branch before
  we check for a hit in the super-linear cache, so must reverse that
  on the sayNO.

I don't see other side-effects that need reversing, but it'd be useful to have a second opinion from DaveM if he has time for a look.

When tracing through what was happening here, it also seemed to me that the way this works is not very effective - even once the cache is turned on, it seems to do a lot of repetitive work before getting as far as recording any fails - so I plan to look at this further to see if there are additional opportunities for improvement.

I don't think this particularly merits a perldelta entry; I also assume it does not qualify for maint (the bug was introduced in 2006 by c476f42), but should be a safe and easy backport if it does.

Hugo

@p5pRT
Copy link
Author

p5pRT commented Dec 12, 2016

@hvds - Status changed from 'open' to 'pending release'

@p5pRT
Copy link
Author

p5pRT commented Dec 12, 2016

From @hvds

On Mon, 12 Dec 2016 07​:26​:21 -0800, hv wrote​:

I don't see other side-effects that need reversing, but it'd be useful
to have a second opinion from DaveM if he has time for a look.

Ah, I see he already has, we crossed in the post.

Thanks Dave.

Hugo

@p5pRT
Copy link
Author

p5pRT commented May 30, 2017

From @khwilliamson

Thank you for filing this report. You have helped make Perl better.

With the release today of Perl 5.26.0, this and 210 other issues have been
resolved.

Perl 5.26.0 may be downloaded via​:
https://metacpan.org/release/XSAWYERX/perl-5.26.0

If you find that the problem persists, feel free to reopen this ticket.

@p5pRT
Copy link
Author

p5pRT commented May 30, 2017

@khwilliamson - Status changed from 'pending release' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant