Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perl regular expression bug in v5.18 #14775

Closed
p5pRT opened this issue Jun 26, 2015 · 6 comments
Closed

Perl regular expression bug in v5.18 #14775

p5pRT opened this issue Jun 26, 2015 · 6 comments

Comments

@p5pRT
Copy link

p5pRT commented Jun 26, 2015

Migrated from rt.perl.org#125491 (status was 'resolved')

Searchable as RT125491$

@p5pRT
Copy link
Author

p5pRT commented Jun 26, 2015

From @mschout

Created by @mschout

This is a bug report for perl from mschout@​gkg.net,
generated with the help of perlbug 1.39 running under perl 5.18.4.

-----------------------------------------------------------------
I have found a rather innocuous regular expression that does not work in perl
v5.18, but works in v5.16 and works again in v5.20.

Here is the regular expression substitution​:

  s|^(?​:\s*h)?/||i

When given the string 'h//w/b', this should strip off the "h/" at the
beginning, resulting in '/w/b'.

NOTE​: This is a reduced down version from the original that triggered this bug
for me​:

  s|^(?​:\s*https?​:)?//[^/]+||i

Under perl v5.18, this does not work​:

perl 5.18.4​:

$ perl -E '$x = "h//w/b"; $x =~ s|^(?​:\s*h)?/||i; say $x'

  h//w/b

However, this works perfectly on v5.16.3, and also on v5.20.0​:

perl 5.16.3​:
$ perl -E '$x = "h//w/b"; $x =~ s|^(?​:\s*h)?/||i; say $x'

  /w/b

perl 5.20.0​:

$ perl -E '$x = "h//w/b"; $x =~ s|^(?​:\s*h)?/||i; say $x'

  /w/b

Curiously, the following changes to the regex yield the correct results​:

- remove the "\s*"​: s|^(?​:h)?/||i -> works correctly in 5.18
- remove the "/i"​: s|^(?​:\s*h)?/|| -> works correctly in 5.18

A git bisect shows that this broke in the following commit on v5.17​:

b25a103 is the first bad commit
commit b25a103
Author​: Karl Williamson <public@​khwilliamson.com>
Date​: Thu Dec 27 12​:34​:41 2012 -0700

  regcomp.c​: Clean up ANYOF_CLASS handling.
 
  The ANYOF_CLASS flag is used in ANYOF nodes (for [bracketed] and the
  synthetic start class) only when matching something like \w, [​:punct​:]
  etc., under /l (locale). It should not be set unless /l is specified.
  However, it was always getting set for the synthetic start class. This
  commit fixes that. The previous code was masking errors in which it was
  being tested for unnecessarily, and for much of the 5.17 series, the
  synthetic start class was always set to test for locale, which was a
  waste of cpu when no locale was specified.

:100644 100644 eee952f376f066c6757208aae350b70dfb9fc9e3 bf8d7e0bab1079f84d490f4cc93e6737516ef47d M regcomp.c
:100644 100644 8eb849cc1ec789d4c488525c587dd102298236b2 be3970d0ed9e3fca1ab95be89354c7a8ca1aeb39 M regcomp.h
:040000 040000 b7cb07f7f1266276b13171049a9fe59bef27f891 224e7347bcc874f5265c0ac313a66e9f7489d9d3 M

And reverse bisecting for when this bug was fixed in 5.19, I found that this bug went away with the following commit​:

commit cdd87c1
Author​: Karl Williamson <public@​khwilliamson.com>
Date​: Sun Sep 22 21​:36​:29 2013 -0600

  Teach regex optimizer to handle above-Latin1
 
  Until this commit, the regular expression optimizer has essentially
  punted on above-Latin1 code points. Under some circumstances, they
  would be taken into account, more or less, but often, the generated
  synthetic start class would end up matching all above-Latin1 code
  points. With the advent of inversion lists, it becomes feasible to
  actually fully handle such code points, as inversion lists are a
  convenient way to express arbitrary lists of code points and take their
  union, intersection, etc. This commit changes the optimizer to use
  inversion lists for operating on the code points the synthetic start
  class can match.
 
  I don't much understand the overall operation of the optimizer. I'm
  told that previous porters found that perturbing it caused unexpected
  behaviors. I had promised to get this change in 5.18, but didn't. I'm
  trying to get it in early enough into the 5.20 preliminary series that
  any problems will surface before 5.20 ships.
 
  This commit doesn't change the macro level logic, but does significantly
  change various micro level things. Thus the 'and' and 'or' subroutines
  have been rewritten to use inversion lists. I'm pretty confident that
  they do what their names suggest. I re-derived the equations for what
  these operations should do, getting the same results in some cases, but
  extending others where the previous code mostly punted. The derivations
  are given in comments in the respective routines.
 
  Some of the code is greatly simplified, as it no longer has to treat
  above-Latin1 specially.
 
  It is now feasible for /i matching of above-Latin1 code points to know
  explicitly the folds that should be in the synthetic start class. But
  more prepatory work needs to be done before putting that into place.
  ...

:100644 100644 ec203f9c1f3ea42c65324e632c746042c32954f1 3dd62f946eedd99f87489fecfeb1acd86e2d250b M embed.fnc
:100644 100644 fca8736feb457d00987232649f8b563aef4f24fa 5fc9171d233fd06b4d6cc08570cdb41f976d0b0b M embed.h
:100644 100644 568cdf733c7784cc3f5a059583d9ae9a262ae72a 33d00ef42cf01836e9365e4a9086cf64b678b74b M proto.h
:100644 100644 ec24583f1f7bec42e186a4241353b9688bf6f2de efefd0a17ba14144286dc55c149d6ee422b18249 M regcomp.c
:100644 100644 f0153fc12c842c26d928d3303feb519430b73a07 eccb46690a4d252fb107044116973e3baa0cbd2c M regcomp.h

I realize fully that 5.18 is officially "End of life", but this is a rather
severe regression and 5.18 still widespread deployment at this time, so it
would be really nice to get this fixed in maint-5.18 somehow. Given that this
is an extremely simple regular expression I suspect there may be many other
cases where regular expressions are broken in 5.18. Perhaps the patch that fixed
this in 5.19 can be backported to maint-5.18? I am not familiar at all with
what is going on in this commit to attempt that.

Perl Info

Flags:
    category=core
    severity=high

Site configuration information for perl 5.18.4:

Configured by mschout at Fri Jun 26 15:13:44 CDT 2015.

Summary of my perl5 (revision 5 version 18 subversion 4) configuration:
   
  Platform:
    osname=darwin, osvers=14.3.0, archname=darwin-2level
    uname='darwin lothlorien.snowcrash.lan 14.3.0 darwin kernel version 14.3.0: mon mar 23 11:59:05 pdt 2015; root:xnu-2782.20.48~5release_x86_64 x86_64 '
    config_args='-de -Dprefix=/Users/mschout/perl5/perlbrew/perls/perl-5.18.4 -Accflags=-fPIC -Aeval:scriptdir=/Users/mschout/perl5/perlbrew/perls/perl-5.18.4/bin'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-common -DPERL_DARWIN -fPIC -fno-strict-aliasing -pipe -fstack-protector -I/opt/local/include',
    optimize='-O3',
    cppflags='-fno-common -DPERL_DARWIN -fPIC -fno-strict-aliasing -pipe -fstack-protector -I/opt/local/include'
    ccversion='', gccversion='4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags =' -fstack-protector -L/opt/local/lib'
    libpth=/opt/local/lib /usr/lib
    libs=-lgdbm -ldbm -ldl -lm -lutil -lc
    perllibs=-ldl -lm -lutil -lc
    libc=, so=dylib, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup -L/opt/local/lib -fstack-protector'

Locally applied patches:
    


@INC for perl 5.18.4:
    /Users/mschout/perl5/perlbrew/perls/perl-5.18.4/lib/site_perl/5.18.4/darwin-2level
    /Users/mschout/perl5/perlbrew/perls/perl-5.18.4/lib/site_perl/5.18.4
    /Users/mschout/perl5/perlbrew/perls/perl-5.18.4/lib/5.18.4/darwin-2level
    /Users/mschout/perl5/perlbrew/perls/perl-5.18.4/lib/5.18.4
    .


Environment for perl 5.18.4:
    DYLD_LIBRARY_PATH (unset)
    HOME=/Users/mschout
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/Users/mschout/perl5/perlbrew/bin:/Users/mschout/perl5/perlbrew/perls/perl-5.18.4/bin:/opt/local/bin:/opt/local/sbin:/opt/local/bin:/opt/local/sbin:/Users/mschout/.rvm/gems/ruby-2.1.1/bin:/Users/mschout/.rvm/gems/ruby-2.1.1@global/bin:/Users/mschout/.rvm/rubies/ruby-2.1.1/bin:/opt/local/bin:/opt/local/sbin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Users/mschout/bin:/Users/mschout/Dropbox/bin:/opt/local/libexec/gnubin:/Users/mschout/.rvm/bin:/Users/mschout/.rvm/bin
    PERLBREW_BASHRC_VERSION=0.69
    PERLBREW_HOME=/Users/mschout/.perlbrew
    PERLBREW_MANPATH=/Users/mschout/perl5/perlbrew/perls/perl-5.18.4/man
    PERLBREW_PATH=/Users/mschout/perl5/perlbrew/bin:/Users/mschout/perl5/perlbrew/perls/perl-5.18.4/bin
    PERLBREW_PERL=perl-5.18.4
    PERLBREW_ROOT=/Users/mschout/perl5/perlbrew
    PERLBREW_VERSION=0.69
    PERL_BADLANG (unset)
    PERL_LOCAL_LIB_ROOT=
    SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Jun 28, 2015

From @cowens

It looks like it can be further reduced to​:

perl -E 'say map "<$_>", "a/" =~ m{(b*a)?/}i'

The following all fail

perl -E 'say map "<$_>", "a%" =~ m{(\s*a)?%}i'
perl -E 'say map "<$_>", "a1" =~ m{(b*a)?1}i'
perl -E 'say map "<$_>", "a\x{2603}" =~ m{(b*a)?\x{2603}}i'

The following (as far as I can see as a user) semantically equivalent
regexes work

perl -E 'say map "<$_>", "ac" =~ m{(b*a)?c}i'
perl -E 'say map "<$_>", "a/" =~ m{(.*a)?/}i'

It is interesting to note that the regex succeeds, but matches nothing.
Also interesting, if the b* matches at least one character, then it works​:

perl -E 'say map "<$_>", "ba/" =~ m{(b*a)?/}i'

On Sat, Jun 27, 2015 at 2​:20 AM Michael Schout <perlbug-followup@​perl.org>
wrote​:

# New Ticket Created by Michael Schout
# Please include the string​: [perl #125491]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=125491 >

This is a bug report for perl from mschout@​gkg.net,
generated with the help of perlbug 1.39 running under perl 5.18.4.

-----------------------------------------------------------------
I have found a rather innocuous regular expression that does not work in
perl
v5.18, but works in v5.16 and works again in v5.20.

Here is the regular expression substitution​:

s|^\(?&#8203;:\\s\*h\)?/||i

When given the string 'h//w/b', this should strip off the "h/" at the
beginning, resulting in '/w/b'.

NOTE​: This is a reduced down version from the original that triggered this
bug
for me​:

s|^\(?&#8203;:\\s\*https?&#8203;:\)?//\[^/\]\+||i

Under perl v5.18, this does not work​:

perl 5.18.4​:

$ perl -E '$x = "h//w/b"; $x =~ s|^(?​:\s*h)?/||i; say $x'

h//w/b

However, this works perfectly on v5.16.3, and also on v5.20.0​:

perl 5.16.3​:
$ perl -E '$x = "h//w/b"; $x =~ s|^(?​:\s*h)?/||i; say $x'

/w/b

perl 5.20.0​:

$ perl -E '$x = "h//w/b"; $x =~ s|^(?​:\s*h)?/||i; say $x'

/w/b

Curiously, the following changes to the regex yield the correct results​:

- remove the "\s*"​: s|^(?​:h)?/||i -> works correctly in 5.18
- remove the "/i"​: s|^(?​:\s*h)?/|| -> works correctly in 5.18

A git bisect shows that this broke in the following commit on v5.17​:

b25a103 is the first bad commit
commit b25a103
Author​: Karl Williamson <public@​khwilliamson.com>
Date​: Thu Dec 27 12​:34​:41 2012 -0700

regcomp\.c&#8203;: Clean up ANYOF\_CLASS handling\.

The ANYOF\_CLASS flag is used in ANYOF nodes \(for \[bracketed\] and the
synthetic start class\) only when matching something like \\w\, \[&#8203;:punct&#8203;:\]
etc\.\, under /l \(locale\)\.  It should not be set unless /l is specified\.
However\, it was always getting set for the synthetic start class\.  This
commit fixes that\.  The previous code was masking errors in which it

was
being tested for unnecessarily, and for much of the 5.17 series, the
synthetic start class was always set to test for locale, which was a
waste of cpu when no locale was specified.

:100644 100644 eee952f376f066c6757208aae350b70dfb9fc9e3
bf8d7e0bab1079f84d490f4cc93e6737516ef47d M regcomp.c
:100644 100644 8eb849cc1ec789d4c488525c587dd102298236b2
be3970d0ed9e3fca1ab95be89354c7a8ca1aeb39 M regcomp.h
:040000 040000 b7cb07f7f1266276b13171049a9fe59bef27f891
224e7347bcc874f5265c0ac313a66e9f7489d9d3 M

And reverse bisecting for when this bug was fixed in 5.19, I found that
this bug went away with the following commit​:

commit cdd87c1
Author​: Karl Williamson <public@​khwilliamson.com>
Date​: Sun Sep 22 21​:36​:29 2013 -0600

Teach regex optimizer to handle above\-Latin1

Until this commit\, the regular expression optimizer has essentially
punted on above\-Latin1 code points\.  Under some circumstances\, they
would be taken into account\, more or less\, but often\, the generated
synthetic start class would end up matching all above\-Latin1 code
points\.  With the advent of inversion lists\, it becomes feasible to
actually fully handle such code points\, as inversion lists are a
convenient way to express arbitrary lists of code points and take their
union\, intersection\, etc\.  This commit changes the optimizer to use
inversion lists for operating on the code points the synthetic start
class can match\.

I don't much understand the overall operation of the optimizer\.  I'm
told that previous porters found that perturbing it caused unexpected
behaviors\.  I had promised to get this change in 5\.18\, but didn't\.  I'm
trying to get it in early enough into the 5\.20 preliminary series that
any problems will surface before 5\.20 ships\.

This commit doesn't change the macro level logic\, but does

significantly
change various micro level things. Thus the 'and' and 'or' subroutines
have been rewritten to use inversion lists. I'm pretty confident that
they do what their names suggest. I re-derived the equations for what
these operations should do, getting the same results in some cases, but
extending others where the previous code mostly punted. The
derivations
are given in comments in the respective routines.

Some of the code is greatly simplified\, as it no longer has to treat
above\-Latin1 specially\.

It is now feasible for /i matching of above\-Latin1 code points to know
explicitly the folds that should be in the synthetic start class\.  But
more prepatory work needs to be done before putting that into place\.
\.\.\.

:100644 100644 ec203f9c1f3ea42c65324e632c746042c32954f1
3dd62f946eedd99f87489fecfeb1acd86e2d250b M embed.fnc
:100644 100644 fca8736feb457d00987232649f8b563aef4f24fa
5fc9171d233fd06b4d6cc08570cdb41f976d0b0b M embed.h
:100644 100644 568cdf733c7784cc3f5a059583d9ae9a262ae72a
33d00ef42cf01836e9365e4a9086cf64b678b74b M proto.h
:100644 100644 ec24583f1f7bec42e186a4241353b9688bf6f2de
efefd0a17ba14144286dc55c149d6ee422b18249 M regcomp.c
:100644 100644 f0153fc12c842c26d928d3303feb519430b73a07
eccb46690a4d252fb107044116973e3baa0cbd2c M regcomp.h

I realize fully that 5.18 is officially "End of life", but this is a rather
severe regression and 5.18 still widespread deployment at this time, so it
would be really nice to get this fixed in maint-5.18 somehow. Given that
this
is an extremely simple regular expression I suspect there may be many other
cases where regular expressions are broken in 5.18. Perhaps the patch that
fixed
this in 5.19 can be backported to maint-5.18? I am not familiar at all with
what is going on in this commit to attempt that.

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags​:
category=core
severity=high
---
Site configuration information for perl 5.18.4​:

Configured by mschout at Fri Jun 26 15​:13​:44 CDT 2015.

Summary of my perl5 (revision 5 version 18 subversion 4) configuration​:

Platform​:
osname=darwin, osvers=14.3.0, archname=darwin-2level
uname='darwin lothlorien.snowcrash.lan 14.3.0 darwin kernel version
14.3.0​: mon mar 23 11​:59​:05 pdt 2015; root​:xnu-2782.20.48~5release_x86_64
x86_64 '
config_args='-de
-Dprefix=/Users/mschout/perl5/perlbrew/perls/perl-5.18.4 -Accflags=-fPIC
-Aeval​:scriptdir=/Users/mschout/perl5/perlbrew/perls/perl-5.18.4/bin'
hint=recommended, useposix=true, d_sigaction=define
useithreads=undef, usemultiplicity=undef
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=define, use64bitall=define, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler​:
cc='cc', ccflags ='-fno-common -DPERL_DARWIN -fPIC
-fno-strict-aliasing -pipe -fstack-protector -I/opt/local/include',
optimize='-O3',
cppflags='-fno-common -DPERL_DARWIN -fPIC -fno-strict-aliasing -pipe
-fstack-protector -I/opt/local/include'
ccversion='', gccversion='4.2.1 Compatible Apple LLVM 6.1.0
(clang-602.0.53)', gccosandvers=''
intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries​:
ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags ='
-fstack-protector -L/opt/local/lib'
libpth=/opt/local/lib /usr/lib
libs=-lgdbm -ldbm -ldl -lm -lutil -lc
perllibs=-ldl -lm -lutil -lc
libc=, so=dylib, useshrplib=false, libperl=libperl.a
gnulibc_version=''
Dynamic Linking​:
dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup
-L/opt/local/lib -fstack-protector'

Locally applied patches​:

---
@​INC for perl 5.18.4​:

/Users/mschout/perl5/perlbrew/perls/perl-5.18.4/lib/site_perl/5.18.4/darwin-2level
/Users/mschout/perl5/perlbrew/perls/perl-5.18.4/lib/site_perl/5.18.4

/Users/mschout/perl5/perlbrew/perls/perl-5.18.4/lib/5.18.4/darwin-2level
/Users/mschout/perl5/perlbrew/perls/perl-5.18.4/lib/5.18.4
.

---
Environment for perl 5.18.4​:
DYLD_LIBRARY_PATH (unset)
HOME=/Users/mschout
LANG=en_US.UTF-8
LANGUAGE (unset)
LD_LIBRARY_PATH (unset)
LOGDIR (unset)

PATH=/Users/mschout/perl5/perlbrew/bin​:/Users/mschout/perl5/perlbrew/perls/perl-5.18.4/bin​:/opt/local/bin​:/opt/local/sbin​:/opt/local/bin​:/opt/local/sbin​:/Users/mschout/.rvm/gems/ruby-2.1.1/bin​:/Users/mschout/.rvm/gems/ruby-2.1.1@​global
/bin​:/Users/mschout/.rvm/rubies/ruby-2.1.1/bin​:/opt/local/bin​:/opt/local/sbin​:/usr/local/bin​:/usr/bin​:/bin​:/usr/sbin​:/sbin​:/opt/X11/bin​:/Users/mschout/bin​:/Users/mschout/Dropbox/bin​:/opt/local/libexec/gnubin​:/Users/mschout/.rvm/bin​:/Users/mschout/.rvm/bin
PERLBREW_BASHRC_VERSION=0.69
PERLBREW_HOME=/Users/mschout/.perlbrew
PERLBREW_MANPATH=/Users/mschout/perl5/perlbrew/perls/perl-5.18.4/man

PERLBREW_PATH=/Users/mschout/perl5/perlbrew/bin​:/Users/mschout/perl5/perlbrew/perls/perl-5.18.4/bin
PERLBREW_PERL=perl-5.18.4
PERLBREW_ROOT=/Users/mschout/perl5/perlbrew
PERLBREW_VERSION=0.69
PERL_BADLANG (unset)
PERL_LOCAL_LIB_ROOT=
SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Jun 28, 2015

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Aug 24, 2015

From @khwilliamson

I'm reluctant to try to fix this bug in 5.18.5. The commit that fixed it in 5.20 had 937 changed lines in regcomp.c It would be very dangerous to try to include parts of that commit. The commit that broke it is quite small, but reverting that might break something else.
--
Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Aug 6, 2016

From @khwilliamson

This ticket has been overtaken by events. It asked that a regression that was fixed in 5.20 be fixed in a 5.18 maintenance release, even though it wasn't discovered until 5.18 was out of maintenance. That didn't happen because the fixing commit was quite complicated. Now that 5.18 is even further past being maintained, there is not going to be a 5.18 maintenance release, so I'm closing this ticket
--
Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Aug 6, 2016

@khwilliamson - Status changed from 'open' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant