//g flag on regex with UTF-8 source causes regex optimiser to wrongly reject a match #16806

p5pRT · 2019-01-09T14:11:06Z

Migrated from rt.perl.org#133756 (status was 'resolved')

Searchable as RT133756$

p5pRT · 2019-01-09T14:11:06Z

From @nwc10

Created by @nwc10

For the case where the eval'd source code contains Ā in a code comment this
doesn't match. Where the code comment is ÿ it does:

nicholas@dromedary-001 perl6$ cat /tmp/rule.pl
use strict;
use warnings;

my $mr = "L\xFCften Kalt";
my $text = "L\xFCften Kalt";

# Culprit here is the //g flag:
for my $char ("\xFF", "\x100", "\xFF", "\x100") {
my $got = eval "\$text =~ /$mr/g; # $char";

if ($got) {
print "Y\n";
} elsif ($@) {
print "\$\@: $@\n";
} else {
print "n\n";
}
}

__END__
nicholas@dromedary-001 perl6$ ./perl -Ilib /tmp/rule.pl
Y
n
Y
n

If one removes the //g flag, it does:

nicholas@dromedary-001 perl6$ cat /tmp/rule.pl-no-g
use strict;
use warnings;

my $mr = "L\xFCften Kalt";
my $text = "L\xFCften Kalt";

# Culprit here is the //g flag:
for my $char ("\xFF", "\x100", "\xFF", "\x100") {
my $got = eval "\$text =~ /$mr/; # $char";

if ($got) {
print "Y\n";
} elsif ($@) {
print "\$\@: $@\n";
} else {
print "n\n";
}
}

__END__
nicholas@dromedary-001 perl6$ ./perl -Ilib /tmp/rule.pl-no-g
Y
Y
Y
Y

I would expect that it should match always, independent of whether //g is
present, or whether the source code is encoded as UTF-8 or octets.

Running with -Dr suggests that the culprit is the regex optimiser,
"Regex match can't succeed, so not even tried":

[snip]

EXECUTING...

Compiling REx "L%x{fc}ften Kalt"
~ tying lastbr EXACT <L\x{fc}ften Kalt> (1) to ender END (5) offset 4
rarest char at 1
Final program:
1: EXACT <L\x{fc}ften Kalt> (5)
5: END (0)
anchored "L%x{fc}ften Kalt" at 0..0 (checking anchored isall) minlen 11
Matching REx "L%x{fc}ften Kalt" against "L%x{fc}ften Kalt"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [0..11] gave 0
Found anchored substr "L%x{fc}ften Kalt" at offset 0 (rx_origin now 0)...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
Freeing REx: "L%x{fc}ften Kalt"
Y
Compiling REx "L%x{fc}ften Kalt"
~ tying lastbr EXACT <L\x{fc}ften Kalt> (1) to ender END (5) offset 4
rarest char at 1
Final program:
1: EXACT <L\x{fc}ften Kalt> (5)
5: END (0)
anchored "L%x{fc}ften Kalt" at 0..0 (checking anchored isall) minlen 11
Matching REx "L%x{fc}ften Kalt" against ""
Regex match can't succeed, so not even tried
Freeing REx: "L%x{fc}ften Kalt"
n
Compiling REx "L%x{fc}ften Kalt"
~ tying lastbr EXACT <L\x{fc}ften Kalt> (1) to ender END (5) offset 4
rarest char at 1
Final program:
1: EXACT <L\x{fc}ften Kalt> (5)
5: END (0)
anchored "L%x{fc}ften Kalt" at 0..0 (checking anchored isall) minlen 11
Matching REx "L%x{fc}ften Kalt" against "L%x{fc}ften Kalt"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [0..11] gave 0
Found anchored substr "L%x{fc}ften Kalt" at offset 0 (rx_origin now 0)...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
Freeing REx: "L%x{fc}ften Kalt"
Y
Compiling REx "L%x{fc}ften Kalt"
~ tying lastbr EXACT <L\x{fc}ften Kalt> (1) to ender END (5) offset 4
rarest char at 1
Final program:
1: EXACT <L\x{fc}ften Kalt> (5)
5: END (0)
anchored "L%x{fc}ften Kalt" at 0..0 (checking anchored isall) minlen 11
Matching REx "L%x{fc}ften Kalt" against ""
Regex match can't succeed, so not even tried
Freeing REx: "L%x{fc}ften Kalt"
n

This does not seem to be a regression - everything I sampled back to 5.6.0
shows the same Y/n/Y/n behaviour.

Nicholas Clark

Perl Info


Flags:
    category=core
    severity=low

Site configuration information for perl 5.29.7:

Configured by nicholas at Wed Jan  9 14:11:11 CET 2019.

Summary of my perl5 (revision 5 version 29 subversion 7) configuration:
  Commit id: 5203d63deea0ef134714a48c272a928fbbe64ce1
  Platform:
    osname=linux
    osvers=2.6.32-358.el6.x86_64
    archname=x86_64-linux
    uname='linux dromedary-001.ams6.corp.booking.com 2.6.32-358.el6.x86_64 #1 smp fri feb 22 00:31:26 utc 2013 x86_64 x86_64 x86_64 gnulinux '
    config_args='-Dusedevel -Dcc=ccache /usr/local/gcc49/bin/gcc -Wl,-rpath=/usr/local/gcc49/lib64 -Dld=/usr/local/gcc49/bin/gcc -Wl,-rpath=/usr/local/gcc49/lib64 -Dcf_email=nick@ccl4.org -Dperladmin=nick@ccl4.org -Dinc_version_list=  -Dinc_version_list_init=0 -Accflags=-DDEBUGGING -g -Doptimize=-Og -Uusethreads -Uuselongdouble -Uuse64bitall -Dprefix=~/Sandpit/snap-v5.29.6-94-g5203d63dee -Uusevendorprefix -Uvendorprefix=~/Sandpit/snap-v5.29.6-94-g5203d63dee -de'
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=undef
    usemultiplicity=undef
    use64bitint=define
    use64bitall=undef
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
    bincompat5005=undef
  Compiler:
    cc='/usr/local/gcc49/bin/gcc -Wl,-rpath=/usr/local/gcc49/lib64'
    ccflags ='-DDEBUGGING -g -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2'
    optimize='-Og'
    cppflags='-DDEBUGGING -g -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='4.9.0'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='/usr/local/gcc49/bin/gcc -Wl,-rpath=/usr/local/gcc49/lib64'
    ldflags =' -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/local/gcc49/lib /usr/local/gcc49/lib/gcc/x86_64-unknown-linux-gnu/4.9.0/include-fixed /usr/lib /lib/../lib64 /usr/lib/../lib64 /lib /lib64 /usr/lib64 /usr/local/lib64
    libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.12.so
    so=so
    useshrplib=false
    libperl=libperl.a
    gnulibc_version='2.12'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E'
    cccdlflags='-fPIC'
    lddlflags='-shared -Og -L/usr/local/lib -fstack-protector-strong'



@INC for perl 5.29.7:
    lib
    /home/nicholas/Sandpit/snap-v5.29.6-94-g5203d63dee/lib/perl5/site_perl/5.29.7/x86_64-linux
    /home/nicholas/Sandpit/snap-v5.29.6-94-g5203d63dee/lib/perl5/site_perl/5.29.7
    /home/nicholas/Sandpit/snap-v5.29.6-94-g5203d63dee/lib/perl5/5.29.7/x86_64-linux
    /home/nicholas/Sandpit/snap-v5.29.6-94-g5203d63dee/lib/perl5/5.29.7


Environment for perl 5.29.7:
    HOME=/home/nicholas
    LANG (unset)
    LANGUAGE (unset)
    LC_ALL=en_GB.utf8
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/nicholas/bin:/opt/local/bin:/opt/local/sbin:/usr/lib64/ccache:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/sbin:/sbin:/usr/sbin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

p5pRT · 2019-01-09T14:37:05Z

@nwc10 - Status changed from 'new' to 'open'

p5pRT · 2019-01-09T16:50:35Z

From @khwilliamson

On 1/9/19 7:11 AM, Nicholas Clark (via RT) wrote:

# New Ticket Created by Nicholas Clark
# Please include the string: [perl #133756]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=133756 >

This is a bug report for perl from nick@ccl4.org,
generated with the help of perlbug 1.41 running under perl 5.29.7.

-----------------------------------------------------------------
[Please describe your issue here]

For the case where the eval'd source code contains Ā in a code comment this
doesn't match. Where the code comment is ÿ it does:

nicholas@dromedary-001 perl6$ cat /tmp/rule.pl
use strict;
use warnings;

my $mr = "L\xFCften Kalt";
my $text = "L\xFCften Kalt";

# Culprit here is the //g flag:
for my $char ("\xFF", "\x100", "\xFF", "\x100") {
my $got = eval "\$text =~ /$mr/g; # $char";
 if $$got$ \{
print "Y\\n";
 \} elsif $$@&#8203;$ \{
print "\\$\\@&#8203;&#8203;: $@&#8203;\\n";
 \} else \{
print "n\\n";
 \}
}

__END__
nicholas@dromedary-001 perl6$ ./perl -Ilib /tmp/rule.pl
Y
n
Y
n

If one removes the //g flag, it does:

nicholas@dromedary-001 perl6$ cat /tmp/rule.pl-no-g
use strict;
use warnings;

my $mr = "L\xFCften Kalt";
my $text = "L\xFCften Kalt";

# Culprit here is the //g flag:
for my $char ("\xFF", "\x100", "\xFF", "\x100") {
my $got = eval "\$text =~ /$mr/; # $char";
 if $$got$ \{
print "Y\\n";
 \} elsif $$@&#8203;$ \{
print "\\$\\@&#8203;&#8203;: $@&#8203;\\n";
 \} else \{
print "n\\n";
 \}
}

__END__
nicholas@dromedary-001 perl6$ ./perl -Ilib /tmp/rule.pl-no-g
Y
Y
Y
Y

I would expect that it should match always, independent of whether //g is
present, or whether the source code is encoded as UTF-8 or octets.

Running with -Dr suggests that the culprit is the regex optimiser,
"Regex match can't succeed, so not even tried":

[snip]

My gvim syntax highlighter immediately showed that \x100 is \x10
followed by a "0". Without that, I would have expected that $char
contained a single character: \x{100}. The /g would cause the second
character, the "0" (U+0030) to be attempted to be matched. I haven't
investigated further, because my guess is that is what is going on here.
If you say there is more to it, then I'll investigate further.

p5pRT · 2019-01-09T17:14:32Z

From @nwc10

On Wed, Jan 09, 2019 at 09:49:59AM -0700, Karl Williamson wrote:

My gvim syntax highlighter immediately showed that \x100 is \x10 followed by
a "0". Without that, I would have expected that $char contained a single
character: \x{100}. The /g would cause the second character, the "0"
(U+0030) to be attempted to be matched. I haven't investigated further,
because my guess is that is what is going on here. If you say there is more
to it, then I'll investigate further.

Thanks for the rapid response. I think that you might be right, but will
investigate further tomorrow at work with a fresh head.

This would mean that I've failed to correctly reduce the original problem
to a representitive test case. (The problem might *still* be PEBKAC, but
the original discrepency in the much larger input and generated regex looked
like a bug.)

Nicholas Clark

p5pRT · 2019-01-09T17:28:30Z

From @khwilliamson

On 1/9/19 10:14 AM, Nicholas Clark wrote:

On Wed, Jan 09, 2019 at 09:49:59AM -0700, Karl Williamson wrote:

My gvim syntax highlighter immediately showed that \x100 is \x10 followed by
a "0". Without that, I would have expected that $char contained a single
character: \x{100}. The /g would cause the second character, the "0"
(U+0030) to be attempted to be matched. I haven't investigated further,
because my guess is that is what is going on here. If you say there is more
to it, then I'll investigate further.

Thanks for the rapid response. I think that you might be right, but will
investigate further tomorrow at work with a fresh head.

This would mean that I've failed to correctly reduce the original problem
to a representitive test case. (The problem might *still* be PEBKAC, but
the original discrepency in the much larger input and generated regex looked
like a bug.)

I looked at it a little more, and it seems weird that a comment would
cause this variance in behavior. OTOH, eval of UTF-8 is buggy, and
that's why we have the unicode_eval feature.

Nicholas Clark

p5pRT · 2019-01-10T00:27:35Z

From @tonycoz

You're going to facepalm yourself.

On Wed, 09 Jan 2019 06:11:06 -0800, nicholas wrote:

For the case where the eval'd source code contains Ā in a code comment
this
doesn't match. Where the code comment is ÿ it does:

nicholas@dromedary-001 perl6$ cat /tmp/rule.pl
use strict;
use warnings;

my $mr = "L\xFCften Kalt";
my $text = "L\xFCften Kalt";

# Culprit here is the //g flag:
for my $char ("\xFF", "\x100", "\xFF", "\x100") {

Add:

pos($text) = 0;

here and they all pass.

my $got = eval "\\$text =~ /$mr/g; \# $char";
if ($got) {
print "Y\n";
} elsif ($@) {
print "\$\@: $@\n";
} else {
print "n\n";
}
}

Tony

p5pRT · 2019-01-10T09:41:19Z

From @nwc10

On Wed, Jan 09, 2019 at 04:27:35PM -0800, Tony Cook via RT wrote:

You're going to facepalm yourself.

Yes. :-)

On Wed, 09 Jan 2019 06:11:06 -0800, nicholas wrote:

For the case where the eval'd source code contains Ā in a code comment
this
doesn't match. Where the code comment is ÿ it does:

nicholas@dromedary-001 perl6$ cat /tmp/rule.pl
use strict;
use warnings;

my $mr = "L\xFCften Kalt";
my $text = "L\xFCften Kalt";

# Culprit here is the //g flag:
for my $char ("\xFF", "\x100", "\xFF", "\x100") {

Add:

pos($text) = 0;

here and they all pass.

Yes. I totally failed to figure out the proper test case. I now have a fresh
head, the office coffee machine bootstrapped, and...

On Wed, Jan 09, 2019 at 10:25:42AM -0700, Karl Williamson wrote:

I looked at it a little more, and it seems weird that a comment would cause
this variance in behavior. OTOH, eval of UTF-8 is buggy, and that's why we
have the unicode_eval feature.

OK, *here* is the test case. It's very sensitive to string length:

nc@I34 perl2$ cat /tmp/133756.pl
#!perl

my $mr = "Programme: Automatic plus, Baumwolle, Pflegeleicht, Synthetic, Finish Wolle, Express, Jeans, Schongl\344tten, L\374ften Warm, L\374ften Kalt";
my $text = $mr;

if (shift) {
utf8::downgrade($mr);
} else {
utf8::upgrade($mr);
}

# use Devel::Peek; Dump($mr);

my $got = eval "\$text =~ /$mr/i;";

if ($got) {
print "Y\n";
exit 1;
} elsif ($@) {
print "\$\@: $@\n";
die;
} else {
print "n\n";
exit 0;
}

__END__
nc@I34 perl2$ perl /tmp/133756.pl 1
Y
nc@I34 perl2$ perl /tmp/133756.pl 0
n

It's "fixed" in blead:

792b461 is the first bad commit
commit 792b461
Author: Karl Williamson <khw@cpan.org>
Date: Sat Feb 3 10:25:31 2018 -0700

regcomp.c: Pack EXACTish nodes more fully

Prior to this commit, nodes that are to match a string exactly, or
possibly case insensitively used only half the potential space available
(that being limited by the length field which is a U8). (The optimizer
might later pack some together to make a larger node.) Talking it over
with Yves, we suspect that this is a relic of an earlier time. It makes
more sense to have longer nodes when possible to lower overhead in
the matching engine.

:100644 100644 6dbfed52ab7d78f0909fce42f000127a13295155 74c8eb2e21b622e7c40dd726933f457353bfa9ee M regcomp.c
bisect run success
That took 491 seconds.

and that commit is just a refactoring - no bug fixes, no new tests:

commit 792b461
Author: Karl Williamson <khw@cpan.org>
Date: Sat Feb 3 10:25:31 2018 -0700

regcomp.c: Pack EXACTish nodes more fully

Prior to this commit, nodes that are to match a string exactly, or
possibly case insensitively used only half the potential space available
(that being limited by the length field which is a U8). (The optimizer
might later pack some together to make a larger node.) Talking it over
with Yves, we suspect that this is a relic of an earlier time. It makes
more sense to have longer nodes when possible to lower overhead in
the matching engine.

regcomp.c | 33 ++++++++++++++-------------------
1 file changed, 14 insertions(+), 19 deletions(-)

Inline Patch

diff --git a/regcomp.c b/regcomp.c
index 6dbfed52ab..74c8eb2e21 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13287,8 +13287,12 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 	    UV ender = 0;
 	    char *p;
 	    char *s;
-#define MAX_NODE_STRING_SIZE 127
+
+/* This allows us to fill a node with just enough spare so that if the final
+ * character folds, its expansion is guaranteed to fit */
+#define MAX_NODE_STRING_SIZE (255-UTF8_MAXBYTES_CASE)
 	    char foldbuf[MAX_NODE_STRING_SIZE+UTF8_MAXBYTES_CASE+1];
+
 	    char *s0;
 	    U8 upper_parse = MAX_NODE_STRING_SIZE;
             U8 node_type = compute_EXACTish(pRExC_state);
@@ -13332,24 +13336,15 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
              * use a pseudo regnode like 'EXACT_ORIG_FOLD' */
             maybe_exact = FOLD && PASS2;
 
-	    /* XXX The node can hold up to 255 bytes, yet this only goes to
-             * 127.  I (khw) do not know why.  Keeping it somewhat less than
-             * 255 allows us to not have to worry about overflow due to
-             * converting to utf8 and fold expansion, but that value is
-             * 255-UTF8_MAXBYTES_CASE.  join_exact() may join adjacent nodes
-             * split up by this limit into a single one using the real max of
-             * 255.  Even at 127, this breaks under rare circumstances.  If
-             * folding, we do not want to split a node at a character that is a
-             * non-final in a multi-char fold, as an input string could just
-             * happen to want to match across the node boundary.  The join
-             * would solve that problem if the join actually happens.  But a
-             * series of more than two nodes in a row each of 127 would cause
-             * the first join to succeed to get to 254, but then there wouldn't
-             * be room for the next one, which could at be one of those split
-             * multi-char folds.  I don't know of any fool-proof solution.  One
-             * could back off to end with only a code point that isn't such a
-             * non-final, but it is possible for there not to be any in the
-             * entire node. */
+            /* This breaks under rare circumstances.  If folding, we do not
+             * want to split a node at a character that is a non-final in a
+             * multi-char fold, as an input string could just happen to want to
+             * match across the node boundary.  The code at the end of the loop
+             * looks for this, and backs off until it finds not such a
+             * character, but it is possible (though extremely, extremely
+             * unlikely) for all characters in the node to be non-final fold
+             * ones, in which case we just leave the node fully filled, and
+             * hope that it doesn't match the string in just the wrong place */
 
             assert(   ! UTF     /* Is at the beginning of a character */
                    || UTF8_IS_INVARIANT(UCHARAT(RExC_parse))

So I think that we still have a real live bug here, now just better hidden. And sorry for the distraction of the poor test case\.

Given that that commit just changes the value of MAX_NODE_STRING_SIZE
I figure that I *ought* to be able to tweak my test script to use a
different sized string and still expose a problem, but I'm failing to
make it (not) work.

The original problem was revealed as exactly 2 lines of difference in a pair
of 1.2G files, which are produced from something like 7M of generated code,
when I tested a refactoring of the how that code is generated.

So whatever it is, it's obscure.

Nicholas Clark

p5pRT · 2019-01-11T05:01:12Z

From @khwilliamson

On 1/10/19 2:05 AM, Nicholas Clark wrote:

On Wed, Jan 09, 2019 at 04:27:35PM -0800, Tony Cook via RT wrote:

You're going to facepalm yourself.

Yes. :-)

On Wed, 09 Jan 2019 06:11:06 -0800, nicholas wrote:

For the case where the eval'd source code contains Ā in a code comment
this
doesn't match. Where the code comment is ÿ it does:

nicholas@dromedary-001 perl6$ cat /tmp/rule.pl
use strict;
use warnings;

my $mr = "L\xFCften Kalt";
my $text = "L\xFCften Kalt";

# Culprit here is the //g flag:
for my $char ("\xFF", "\x100", "\xFF", "\x100") {

Add:

pos($text) = 0;

here and they all pass.

Yes. I totally failed to figure out the proper test case. I now have a fresh
head, the office coffee machine bootstrapped, and...

On Wed, Jan 09, 2019 at 10:25:42AM -0700, Karl Williamson wrote:

I looked at it a little more, and it seems weird that a comment would cause
this variance in behavior. OTOH, eval of UTF-8 is buggy, and that's why we
have the unicode_eval feature.

OK, *here* is the test case. It's very sensitive to string length:

nc@I34 perl2$ cat /tmp/133756.pl
#!perl

my $mr = "Programme: Automatic plus, Baumwolle, Pflegeleicht, Synthetic, Finish Wolle, Express, Jeans, Schongl\344tten, L\374ften Warm, L\374ften Kalt";
my $text = $mr;

if (shift) {
utf8::downgrade($mr);
} else {
utf8::upgrade($mr);
}

# use Devel::Peek; Dump($mr);

my $got = eval "\$text =~ /$mr/i;";

if ($got) {
print "Y\n";
exit 1;
} elsif ($@) {
print "\$\@: $@\n";
die;
} else {
print "n\n";
exit 0;
}

__END__
nc@I34 perl2$ perl /tmp/133756.pl 1
Y
nc@I34 perl2$ perl /tmp/133756.pl 0
n

It's "fixed" in blead:

792b461 is the first bad commit
commit 792b461
Author: Karl Williamson <khw@cpan.org>
Date: Sat Feb 3 10:25:31 2018 -0700
 regcomp\.c&#8203;: Pack EXACTish nodes more fully

 Prior to this commit\, nodes that are to match a string exactly\, or
 possibly case insensitively used only half the potential space available
 $that being limited by the length field which is a U8$\.  $The optimizer
 might later pack some together to make a larger node\.$  Talking it over
 with Yves\, we suspect that this is a relic of an earlier time\.  It makes
 more sense to have longer nodes when possible to lower overhead in
 the matching engine\.
:100644 100644 6dbfed52ab7d78f0909fce42f000127a13295155 74c8eb2e21b622e7c40dd726933f457353bfa9ee M regcomp.c
bisect run success
That took 491 seconds.

and that commit is just a refactoring - no bug fixes, no new tests:

commit 792b461
Author: Karl Williamson <khw@cpan.org>
Date: Sat Feb 3 10:25:31 2018 -0700
 regcomp\.c&#8203;: Pack EXACTish nodes more fully
 
 Prior to this commit\, nodes that are to match a string exactly\, or
 possibly case insensitively used only half the potential space available
 $that being limited by the length field which is a U8$\.  $The optimizer
 might later pack some together to make a larger node\.$  Talking it over
 with Yves\, we suspect that this is a relic of an earlier time\.  It makes
 more sense to have longer nodes when possible to lower overhead in
 the matching engine\.
---
regcomp.c | 33 ++++++++++++++-------------------
1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/regcomp.c b/regcomp.c
index 6dbfed52ab..74c8eb2e21 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13287,8 +13287,12 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
UV ender = 0;
char *p;
char *s;
-#define MAX_NODE_STRING_SIZE 127
+
+/* This allows us to fill a node with just enough spare so that if the final
+ * character folds, its expansion is guaranteed to fit */
+#define MAX_NODE_STRING_SIZE (255-UTF8_MAXBYTES_CASE)
char foldbuf[MAX_NODE_STRING_SIZE+UTF8_MAXBYTES_CASE+1];
+
char *s0;
U8 upper_parse = MAX_NODE_STRING_SIZE;
U8 node_type = compute_EXACTish(pRExC_state);
@@ -13332,24 +13336,15 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
* use a pseudo regnode like 'EXACT_ORIG_FOLD' */
maybe_exact = FOLD && PASS2;

- /* XXX The node can hold up to 255 bytes, yet this only goes to
- * 127. I (khw) do not know why. Keeping it somewhat less than
- * 255 allows us to not have to worry about overflow due to
- * converting to utf8 and fold expansion, but that value is
- * 255-UTF8_MAXBYTES_CASE. join_exact() may join adjacent nodes
- * split up by this limit into a single one using the real max of
- * 255. Even at 127, this breaks under rare circumstances. If
- * folding, we do not want to split a node at a character that is a
- * non-final in a multi-char fold, as an input string could just
- * happen to want to match across the node boundary. The join
- * would solve that problem if the join actually happens. But a
- * series of more than two nodes in a row each of 127 would cause
- * the first join to succeed to get to 254, but then there wouldn't
- * be room for the next one, which could at be one of those split
- * multi-char folds. I don't know of any fool-proof solution. One
- * could back off to end with only a code point that isn't such a
- * non-final, but it is possible for there not to be any in the
- * entire node. */
+ /* This breaks under rare circumstances. If folding, we do not
+ * want to split a node at a character that is a non-final in a
+ * multi-char fold, as an input string could just happen to want to
+ * match across the node boundary. The code at the end of the loop
+ * looks for this, and backs off until it finds not such a
+ * character, but it is possible (though extremely, extremely
+ * unlikely) for all characters in the node to be non-final fold
+ * ones, in which case we just leave the node fully filled, and
+ * hope that it doesn't match the string in just the wrong place */
          assert$   \! UTF     /\* Is at the beginning of a character \*/
                 || UTF8\_IS\_INVARIANT\(UCHARAT\(RExC\_parse$\)
So I think that we still have a real live bug here, now just better hidden.
And sorry for the distraction of the poor test case.

Given that that commit just changes the value of MAX_NODE_STRING_SIZE
I figure that I *ought* to be able to tweak my test script to use a
different sized string and still expose a problem, but I'm failing to
make it (not) work.

The original problem was revealed as exactly 2 lines of difference in a pair
of 1.2G files, which are produced from something like 7M of generated code,
when I tested a refactoring of the how that code is generated.

So whatever it is, it's obscure.

Nicholas Clark

Thank you for reporting this.

And your surmisal was correct. There is a real bug, as hopefully
explained by this commit message, now in blead:

From 88c2ae302de11db0b66ff2269e2c4e9bf46efd69 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Thu, 10 Jan 2019 20:59:31 -0700
Subject: [PATCH 2/2] PATCH: [perl #133756] Failure to match properly

This was caused by a counting error.

An EXACTFish regnode has a finite length it can hold for the string
being matched. If that length is exceeded, a 2nd node is used for the
next segment of the string, for as many regnodes as are needed.

A problem occurs if a regnode ends with one of the 22 characters in
Unicode 11 that occur in non-final positions of a multi-character fold.
The design of the pattern matching engine doesn't allow matches across
regnodes. Consider, for example if a node ended in the letter 'f' and
the next node begins with the letter 'i'. That sequence should match,
under /i, the ligature "fi" (U+FB01). But it wouldn't because the
pattern splits them across nodes. The solution I adopted was to forbid
a node to end with one of those 22 characters if there is another string
node that follows it. This is not fool proof, for example, if the
entire node consisted of only these characters, one would have to split
it at some position (In that case, we just take as much of the string as
will fit.) But for real life applications, it is good enough.

What happens if a node ends with one of the 22, is that the node is
shortened so that those are instead placed at the beginning of the
following node. When the code encounters this situation, it backs off
until it finds a character that isn't a non-final fold one, and closes
the node with that one.

A /i node is filled with the fold of the input, for several reasons.
The most obvious is that it saves time, you can skip folding the pattern
at runtime. But there are reasons based on the design of the optimzer
as well, which I won't go into here, but are documented in regcomp.c.
When we back out the final characters in a node, we also have to back
out the corresponding unfolded characters in the input, so that those
can be (folded) into the following node. Since the number of characters
in the fold may not be the same as unfolded, there is not an easily
discernable correspondence between the input and the folded output.
That means that generally, what has to be done is that the input is
reparsed from the beginning of the node, but the permitted length has
been shortened (we know precisely how much to shorten it to) so that it
will end with something other than the 22. But, the code saves the
previous input character's position (for other reasons), so if we only
have to backup one character, we can just use that and not have to
reparse.

This bug was that the code thought a two character backup was really a
one character one, and did not reparse the node, creating an off-by-one
error, and a character was simply omitted in the pattern (that should
have started the following node). And the input had two of the 22
characters adjacent to each other in just the right positions that the
node was split. The bisect showed that when the node size was changed
the bug went away, at least for this particular input string. But a
different, longer, string would have triggered the bug, and this commit
fixes that.

This bug is actually very unlikely to occur in most real world
applications. That is because other changes in the regex compiler have
caused nodes to be split so that things that don't particpate in folds
at all are separated out into EXACT nodes. (The reason for that is it
allows the optimizer things to grab on to under /i that it wouldn't
otherwise have known about.) That means that anything like this string
would never cause the bug to happen because blanks and commas, etc.
would be in separate nodes, and so no node would ever get large enough
to fill the 238 available byte slots in a node (235 on EBCDIC). Only a
long string without punctuation would trigger it. I have artificially
constructed such a string in the tests added by this commit.

One of the 22 characters is 't', so long strings of DNA "ACTG" could
trigger this bug. I find it somewhat amusing that this is something
like a DNA transcription error, which occurs in nature at very low
rates, but selection, it is believed, will make sure the error rate is
above zero.

p5pRT · 2019-01-11T13:48:17Z

From @jkeenan

On Fri, 11 Jan 2019 05:01:12 GMT, public@khwilliamson.com wrote:

[snip]

So whatever it is, it's obscure.

Nicholas Clark

Thank you for reporting this.

And your surmisal was correct. There is a real bug, as hopefully
explained by this commit message, now in blead:

From 88c2ae302de11db0b66ff2269e2c4e9bf46efd69 Mon Sep 17 00:00:00 2001

I don't think the line above is correct. When I update blead the commit appears to be:

#####
commit 7397626 (HEAD -> blead, origin/blead, origin/HEAD)
Author: Karl Williamson <khw@cpan.org>
AuthorDate: Thu Jan 10 22:59:31 2019
Commit: Karl Williamson <khw@cpan.org>
CommitDate: Thu Jan 10 23:56:24 2019

PATCH: [perl #133756] Failure to match properly

...
#####

Thank you very much.

--
James E Keenan (jkeenan@cpan.org)

p5pRT · 2019-01-21T01:14:33Z

From @khwilliamson

On Fri, 11 Jan 2019 05:48:17 -0800, jkeenan wrote:

On Fri, 11 Jan 2019 05:01:12 GMT, public@khwilliamson.com wrote:

[snip]

So whatever it is, it's obscure.

Nicholas Clark

Thank you for reporting this.

And your surmisal was correct. There is a real bug, as hopefully
explained by this commit message, now in blead:

From 88c2ae302de11db0b66ff2269e2c4e9bf46efd69 Mon Sep 17 00:00:00
2001

I don't think the line above is correct. When I update blead the
commit appears to be:

#####
commit 7397626 (HEAD -> blead,
origin/blead, origin/HEAD)
Author: Karl Williamson <khw@cpan.org>
AuthorDate: Thu Jan 10 22:59:31 2019
Commit: Karl Williamson <khw@cpan.org>
CommitDate: Thu Jan 10 23:56:24 2019

PATCH: [perl #133756] Failure to match properly

This is the correct commit number for the fix. Now closing this ticket.

...
#####

Thank you very much.

--
Karl Williamson

p5pRT · 2019-01-21T01:15:01Z

@khwilliamson - Status changed from 'open' to 'pending release'

p5pRT · 2019-05-22T15:07:04Z

From @khwilliamson

Thank you for filing this report. You have helped make Perl better.

With the release today of Perl 5.30.0, this and 160 other issues have been
resolved.

Perl 5.30.0 may be downloaded via:
https://metacpan.org/release/XSAWYERX/perl-5.30.0

If you find that the problem persists, feel free to reopen this ticket.

p5pRT · 2019-05-22T15:07:04Z

@khwilliamson - Status changed from 'pending release' to 'resolved'

p5pRT closed this as completed May 22, 2019

p5pRT added Severity Low labels Oct 19, 2019

p5pRT added affects-5.24 type-core labels Oct 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

//g flag on regex with UTF-8 source causes regex optimiser to wrongly reject a match #16806

//g flag on regex with UTF-8 source causes regex optimiser to wrongly reject a match #16806

p5pRT commented Jan 9, 2019

p5pRT commented Jan 9, 2019

p5pRT commented Jan 9, 2019

p5pRT commented Jan 9, 2019

p5pRT commented Jan 9, 2019

p5pRT commented Jan 9, 2019

p5pRT commented Jan 10, 2019

p5pRT commented Jan 10, 2019

p5pRT commented Jan 11, 2019

p5pRT commented Jan 11, 2019

p5pRT commented Jan 21, 2019

p5pRT commented Jan 21, 2019

p5pRT commented May 22, 2019

p5pRT commented May 22, 2019

//g flag on regex with UTF-8 source causes regex optimiser to wrongly reject a match #16806

//g flag on regex with UTF-8 source causes regex optimiser to wrongly reject a match #16806

Comments

p5pRT commented Jan 9, 2019

p5pRT commented Jan 9, 2019

From @nwc10

Created by @nwc10

p5pRT commented Jan 9, 2019

p5pRT commented Jan 9, 2019

From @khwilliamson

p5pRT commented Jan 9, 2019

From @nwc10

p5pRT commented Jan 9, 2019

From @khwilliamson

p5pRT commented Jan 10, 2019

From @tonycoz

p5pRT commented Jan 10, 2019

From @nwc10

p5pRT commented Jan 11, 2019

From @khwilliamson

p5pRT commented Jan 11, 2019

From @jkeenan

p5pRT commented Jan 21, 2019

From @khwilliamson

p5pRT commented Jan 21, 2019

p5pRT commented May 22, 2019

From @khwilliamson

p5pRT commented May 22, 2019