Skip Menu |
Report information
Id: 133756
Status: resolved
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors: nicholas <nick [at] ccl4.org>
Cc:
AdminCc:

Operating System: (no value)
PatchStatus: (no value)
Severity: low
Type: core
Perl Version:
  • 5.29.7
  • 5.6.0
  • 5.6.1
  • 5.6.2
  • 5.7.0
  • 5.7.1
  • 5.7.2
  • 5.8.0
  • 5.8.1
  • 5.8.2
  • 5.8.3
  • 5.8.4
  • 5.8.5
  • 5.8.6
  • 5.8.7
  • 5.8.8
  • 5.8.9
  • 5.9.0
  • 5.9.1
  • 5.9.2
  • 5.9.3
  • 5.9.4
  • 5.9.5
  • 5.0.1
  • 5.10.0
  • 5.10.1
  • 5.11.0
  • 5.11.1
  • 5.11.2
  • 5.11.3
  • 5.11.4
  • 5.11.5
  • 5.12.0
  • 5.12.1
  • 5.12.2
  • 5.12.3
  • 5.12.4
  • 5.12.5
  • 5.12.6
  • 5.13.0
  • 5.13.1
  • 5.13.2
  • 5.13.3
  • 5.13.4
  • 5.13.5
  • 5.13.6
  • 5.13.7
  • 5.13.8
  • 5.13.9
  • 5.13.10
  • 5.13.11
  • 5.14.0
  • 5.14.1
  • 5.14.2
  • 5.14.3
  • 5.14.4
  • 5.14.5
  • 5.15.0
  • 5.15.1
  • 5.15.2
  • 5.15.3
  • 5.15.4
  • 5.15.5
  • 5.15.6
  • 5.15.7
  • 5.15.8
  • 5.15.9
  • 5.15.10
  • 5.16.0
  • 5.16.1
  • 5.16.2
  • 5.16.3
  • 5.16.4
  • 5.17.0
  • 5.17.1
  • 5.17.2
  • 5.17.3
  • 5.17.4
  • 5.17.5
  • 5.17.6
  • 5.17.7
  • 5.17.8
  • 5.17.9
  • 5.17.10
  • 5.17.11
  • 5.17.12
  • 5.18.0
  • 5.18.1
  • 5.18.2
  • 5.18.3
  • 5.18.4
  • 5.18.5
  • 5.19.0
  • 5.19.1
  • 5.19.2
  • 5.19.3
  • 5.19.4
  • 5.19.5
  • 5.19.6
  • 5.19.7
  • 5.19.8
  • 5.19.9
  • 5.19.10
  • 5.19.11
  • 5.20.0
  • 5.20.1
  • 5.20.2
  • 5.20.3
  • 5.21.0
  • 5.21.1
  • 5.21.2
  • 5.21.3
  • 5.21.4
  • 5.21.5
  • 5.21.6
  • 5.21.7
  • 5.21.8
  • 5.21.9
  • 5.21.10
  • 5.21.11
  • 5.22.0
  • 5.22.1
  • 5.22.2
  • 5.22.3
  • 5.22.4
  • 5.23.0
  • 5.23.1
  • 5.23.2
  • 5.23.3
  • 5.23.4
  • 5.23.5
  • 5.23.6
  • 5.23.7
  • 5.23.8
  • 5.23.9
  • 5.24.0
  • 5.24.1
  • 5.24.2
  • 5.24.3
  • 5.24.4
  • 5.25.0
  • 5.25.1
  • 5.25.2
  • 5.25.3
  • 5.25.4
  • 5.25.5
  • 5.25.6
  • 5.25.7
  • 5.25.8
  • 5.25.9
  • 5.25.10
  • 5.25.11
  • 5.25.12
  • 5.25.13
  • 5.26.0
  • 5.26.1
  • 5.26.2
  • 5.26.3
  • 5.27.0
  • 5.27.1
  • 5.27.2
  • 5.27.3
  • 5.27.4
  • 5.27.5
  • 5.27.6
  • 5.27.7
  • 5.27.8
  • 5.27.9
  • 5.27.10
  • 5.27.11
  • 5.27.12
  • 5.28.0
  • 5.28.1
  • 5.29.0
  • 5.29.1
  • 5.29.2
  • 5.29.3
  • 5.29.4
  • 5.29.5
  • 5.29.6
Fixed In: (no value)



Subject: //g flag on regex with UTF-8 source causes regex optimiser to wrongly reject a match
From: Nicholas Clark <nick [...] ccl4.org>
To: perlbug [...] perl.org
Date: Wed, 9 Jan 2019 13:35:51 +0000
Download (untitled) / with headers
text/plain 7.1k
This is a bug report for perl from nick@ccl4.org, generated with the help of perlbug 1.41 running under perl 5.29.7. ----------------------------------------------------------------- [Please describe your issue here] For the case where the eval'd source code contains Ā in a code comment this doesn't match. Where the code comment is ÿ it does: nicholas@dromedary-001 perl6$ cat /tmp/rule.pl use strict; use warnings; my $mr = "L\xFCften Kalt"; my $text = "L\xFCften Kalt"; # Culprit here is the //g flag: for my $char ("\xFF", "\x100", "\xFF", "\x100") { my $got = eval "\$text =~ /$mr/g; # $char"; if ($got) { print "Y\n"; } elsif ($@) { print "\$\@: $@\n"; } else { print "n\n"; } } __END__ nicholas@dromedary-001 perl6$ ./perl -Ilib /tmp/rule.pl Y n Y n If one removes the //g flag, it does: nicholas@dromedary-001 perl6$ cat /tmp/rule.pl-no-g use strict; use warnings; my $mr = "L\xFCften Kalt"; my $text = "L\xFCften Kalt"; # Culprit here is the //g flag: for my $char ("\xFF", "\x100", "\xFF", "\x100") { my $got = eval "\$text =~ /$mr/; # $char"; if ($got) { print "Y\n"; } elsif ($@) { print "\$\@: $@\n"; } else { print "n\n"; } } __END__ nicholas@dromedary-001 perl6$ ./perl -Ilib /tmp/rule.pl-no-g Y Y Y Y I would expect that it should match always, independent of whether //g is present, or whether the source code is encoded as UTF-8 or octets. Running with -Dr suggests that the culprit is the regex optimiser, "Regex match can't succeed, so not even tried": [snip] EXECUTING... Compiling REx "L%x{fc}ften Kalt" ~ tying lastbr EXACT <L\x{fc}ften Kalt> (1) to ender END (5) offset 4 rarest char at 1 Final program: 1: EXACT <L\x{fc}ften Kalt> (5) 5: END (0) anchored "L%x{fc}ften Kalt" at 0..0 (checking anchored isall) minlen 11 Matching REx "L%x{fc}ften Kalt" against "L%x{fc}ften Kalt" Intuit: trying to determine minimum start position... doing 'check' fbm scan, [0..11] gave 0 Found anchored substr "L%x{fc}ften Kalt" at offset 0 (rx_origin now 0)... (multiline anchor test skipped) Intuit: Successfully guessed: match at offset 0 Freeing REx: "L%x{fc}ften Kalt" Y Compiling REx "L%x{fc}ften Kalt" ~ tying lastbr EXACT <L\x{fc}ften Kalt> (1) to ender END (5) offset 4 rarest char at 1 Final program: 1: EXACT <L\x{fc}ften Kalt> (5) 5: END (0) anchored "L%x{fc}ften Kalt" at 0..0 (checking anchored isall) minlen 11 Matching REx "L%x{fc}ften Kalt" against "" Regex match can't succeed, so not even tried Freeing REx: "L%x{fc}ften Kalt" n Compiling REx "L%x{fc}ften Kalt" ~ tying lastbr EXACT <L\x{fc}ften Kalt> (1) to ender END (5) offset 4 rarest char at 1 Final program: 1: EXACT <L\x{fc}ften Kalt> (5) 5: END (0) anchored "L%x{fc}ften Kalt" at 0..0 (checking anchored isall) minlen 11 Matching REx "L%x{fc}ften Kalt" against "L%x{fc}ften Kalt" Intuit: trying to determine minimum start position... doing 'check' fbm scan, [0..11] gave 0 Found anchored substr "L%x{fc}ften Kalt" at offset 0 (rx_origin now 0)... (multiline anchor test skipped) Intuit: Successfully guessed: match at offset 0 Freeing REx: "L%x{fc}ften Kalt" Y Compiling REx "L%x{fc}ften Kalt" ~ tying lastbr EXACT <L\x{fc}ften Kalt> (1) to ender END (5) offset 4 rarest char at 1 Final program: 1: EXACT <L\x{fc}ften Kalt> (5) 5: END (0) anchored "L%x{fc}ften Kalt" at 0..0 (checking anchored isall) minlen 11 Matching REx "L%x{fc}ften Kalt" against "" Regex match can't succeed, so not even tried Freeing REx: "L%x{fc}ften Kalt" n This does not seem to be a regression - everything I sampled back to 5.6.0 shows the same Y/n/Y/n behaviour. Nicholas Clark [Please do not change anything below this line] ----------------------------------------------------------------- --- Flags: category=core severity=low --- Site configuration information for perl 5.29.7: Configured by nicholas at Wed Jan 9 14:11:11 CET 2019. Summary of my perl5 (revision 5 version 29 subversion 7) configuration: Commit id: 5203d63deea0ef134714a48c272a928fbbe64ce1 Platform: osname=linux osvers=2.6.32-358.el6.x86_64 archname=x86_64-linux uname='linux dromedary-001.ams6.corp.booking.com 2.6.32-358.el6.x86_64 #1 smp fri feb 22 00:31:26 utc 2013 x86_64 x86_64 x86_64 gnulinux ' config_args='-Dusedevel -Dcc=ccache /usr/local/gcc49/bin/gcc -Wl,-rpath=/usr/local/gcc49/lib64 -Dld=/usr/local/gcc49/bin/gcc -Wl,-rpath=/usr/local/gcc49/lib64 -Dcf_email=nick@ccl4.org -Dperladmin=nick@ccl4.org -Dinc_version_list= -Dinc_version_list_init=0 -Accflags=-DDEBUGGING -g -Doptimize=-Og -Uusethreads -Uuselongdouble -Uuse64bitall -Dprefix=~/Sandpit/snap-v5.29.6-94-g5203d63dee -Uusevendorprefix -Uvendorprefix=~/Sandpit/snap-v5.29.6-94-g5203d63dee -de' hint=recommended useposix=true d_sigaction=define useithreads=undef usemultiplicity=undef use64bitint=define use64bitall=undef uselongdouble=undef usemymalloc=n default_inc_excludes_dot=define bincompat5005=undef Compiler: cc='/usr/local/gcc49/bin/gcc -Wl,-rpath=/usr/local/gcc49/lib64' ccflags ='-DDEBUGGING -g -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2' optimize='-Og' cppflags='-DDEBUGGING -g -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include' ccversion='' gccversion='4.9.0' gccosandvers='' intsize=4 longsize=8 ptrsize=8 doublesize=8 byteorder=12345678 doublekind=3 d_longlong=define longlongsize=8 d_longdbl=define longdblsize=16 longdblkind=3 ivtype='long' ivsize=8 nvtype='double' nvsize=8 Off_t='off_t' lseeksize=8 alignbytes=8 prototype=define Linker and Libraries: ld='/usr/local/gcc49/bin/gcc -Wl,-rpath=/usr/local/gcc49/lib64' ldflags =' -fstack-protector-strong -L/usr/local/lib' libpth=/usr/local/lib /usr/local/gcc49/lib /usr/local/gcc49/lib/gcc/x86_64-unknown-linux-gnu/4.9.0/include-fixed /usr/lib /lib/../lib64 /usr/lib/../lib64 /lib /lib64 /usr/lib64 /usr/local/lib64 libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc libc=libc-2.12.so so=so useshrplib=false libperl=libperl.a gnulibc_version='2.12' Dynamic Linking: dlsrc=dl_dlopen.xs dlext=so d_dlsymun=undef ccdlflags='-Wl,-E' cccdlflags='-fPIC' lddlflags='-shared -Og -L/usr/local/lib -fstack-protector-strong' --- @INC for perl 5.29.7: lib /home/nicholas/Sandpit/snap-v5.29.6-94-g5203d63dee/lib/perl5/site_perl/5.29.7/x86_64-linux /home/nicholas/Sandpit/snap-v5.29.6-94-g5203d63dee/lib/perl5/site_perl/5.29.7 /home/nicholas/Sandpit/snap-v5.29.6-94-g5203d63dee/lib/perl5/5.29.7/x86_64-linux /home/nicholas/Sandpit/snap-v5.29.6-94-g5203d63dee/lib/perl5/5.29.7 --- Environment for perl 5.29.7: HOME=/home/nicholas LANG (unset) LANGUAGE (unset) LC_ALL=en_GB.utf8 LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/nicholas/bin:/opt/local/bin:/opt/local/sbin:/usr/lib64/ccache:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/sbin:/sbin:/usr/sbin PERL_BADLANG (unset) SHELL=/bin/bash
From: Karl Williamson <public [...] khwilliamson.com>
To: perl5-porters [...] perl.org
Date: Wed, 9 Jan 2019 09:49:59 -0700
Subject: Re: [perl #133756] //g flag on regex with UTF-8 source causes regex optimiser to wrongly reject a match
Download (untitled) / with headers
text/plain 2.3k
On 1/9/19 7:11 AM, Nicholas Clark (via RT) wrote: Show quoted text
> # New Ticket Created by Nicholas Clark > # Please include the string: [perl #133756] > # in the subject line of all future correspondence about this issue. > # <URL: https://rt.perl.org/Ticket/Display.html?id=133756 > > > > This is a bug report for perl from nick@ccl4.org, > generated with the help of perlbug 1.41 running under perl 5.29.7. > > > ----------------------------------------------------------------- > [Please describe your issue here] > > For the case where the eval'd source code contains Ā in a code comment this > doesn't match. Where the code comment is ÿ it does: > > nicholas@dromedary-001 perl6$ cat /tmp/rule.pl > use strict; > use warnings; > > my $mr = "L\xFCften Kalt"; > my $text = "L\xFCften Kalt"; > > # Culprit here is the //g flag: > for my $char ("\xFF", "\x100", "\xFF", "\x100") { > my $got = eval "\$text =~ /$mr/g; # $char"; > > if ($got) { > print "Y\n"; > } elsif ($@) { > print "\$\@: $@\n"; > } else { > print "n\n"; > } > } > > __END__ > nicholas@dromedary-001 perl6$ ./perl -Ilib /tmp/rule.pl > Y > n > Y > n > > > > If one removes the //g flag, it does: > > nicholas@dromedary-001 perl6$ cat /tmp/rule.pl-no-g > use strict; > use warnings; > > my $mr = "L\xFCften Kalt"; > my $text = "L\xFCften Kalt"; > > # Culprit here is the //g flag: > for my $char ("\xFF", "\x100", "\xFF", "\x100") { > my $got = eval "\$text =~ /$mr/; # $char"; > > if ($got) { > print "Y\n"; > } elsif ($@) { > print "\$\@: $@\n"; > } else { > print "n\n"; > } > } > > __END__ > nicholas@dromedary-001 perl6$ ./perl -Ilib /tmp/rule.pl-no-g > Y > Y > Y > Y > > > > I would expect that it should match always, independent of whether //g is > present, or whether the source code is encoded as UTF-8 or octets. > > Running with -Dr suggests that the culprit is the regex optimiser, > "Regex match can't succeed, so not even tried": > > [snip] >
My gvim syntax highlighter immediately showed that \x100 is \x10 followed by a "0". Without that, I would have expected that $char contained a single character: \x{100}. The /g would cause the second character, the "0" (U+0030) to be attempted to be matched. I haven't investigated further, because my guess is that is what is going on here. If you say there is more to it, then I'll investigate further.
Date: Wed, 9 Jan 2019 17:14:21 +0000
CC: perl5-porters [...] perl.org
To: Karl Williamson <public [...] khwilliamson.com>
From: Nicholas Clark <nick [...] ccl4.org>
Subject: Re: [perl #133756] //g flag on regex with UTF-8 source causes regex optimiser to wrongly reject a match
Download (untitled) / with headers
text/plain 871b
On Wed, Jan 09, 2019 at 09:49:59AM -0700, Karl Williamson wrote: Show quoted text
> My gvim syntax highlighter immediately showed that \x100 is \x10 followed by > a "0". Without that, I would have expected that $char contained a single > character: \x{100}. The /g would cause the second character, the "0" > (U+0030) to be attempted to be matched. I haven't investigated further, > because my guess is that is what is going on here. If you say there is more > to it, then I'll investigate further.
Thanks for the rapid response. I think that you might be right, but will investigate further tomorrow at work with a fresh head. This would mean that I've failed to correctly reduce the original problem to a representitive test case. (The problem might *still* be PEBKAC, but the original discrepency in the much larger input and generated regex looked like a bug.) Nicholas Clark
To: Nicholas Clark <nick [...] ccl4.org>
From: Karl Williamson <public [...] khwilliamson.com>
Date: Wed, 9 Jan 2019 10:25:42 -0700
CC: perl5-porters [...] perl.org
Subject: Re: [perl #133756] //g flag on regex with UTF-8 source causes regex optimiser to wrongly reject a match
Download (untitled) / with headers
text/plain 1.1k
On 1/9/19 10:14 AM, Nicholas Clark wrote: Show quoted text
> On Wed, Jan 09, 2019 at 09:49:59AM -0700, Karl Williamson wrote: >
>> My gvim syntax highlighter immediately showed that \x100 is \x10 followed by >> a "0". Without that, I would have expected that $char contained a single >> character: \x{100}. The /g would cause the second character, the "0" >> (U+0030) to be attempted to be matched. I haven't investigated further, >> because my guess is that is what is going on here. If you say there is more >> to it, then I'll investigate further.
> > Thanks for the rapid response. I think that you might be right, but will > investigate further tomorrow at work with a fresh head. > > This would mean that I've failed to correctly reduce the original problem > to a representitive test case. (The problem might *still* be PEBKAC, but > the original discrepency in the much larger input and generated regex looked > like a bug.)
I looked at it a little more, and it seems weird that a comment would cause this variance in behavior. OTOH, eval of UTF-8 is buggy, and that's why we have the unicode_eval feature. Show quoted text
> > Nicholas Clark >
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 681b
You're going to facepalm yourself. On Wed, 09 Jan 2019 06:11:06 -0800, nicholas wrote: Show quoted text
> For the case where the eval'd source code contains Ā in a code comment > this > doesn't match. Where the code comment is ÿ it does: > > nicholas@dromedary-001 perl6$ cat /tmp/rule.pl > use strict; > use warnings; > > my $mr = "L\xFCften Kalt"; > my $text = "L\xFCften Kalt"; > > # Culprit here is the //g flag: > for my $char ("\xFF", "\x100", "\xFF", "\x100") {
Add: pos($text) = 0; here and they all pass. Show quoted text
> my $got = eval "\$text =~ /$mr/g; # $char"; > > if ($got) { > print "Y\n"; > } elsif ($@) { > print "\$\@: $@\n"; > } else { > print "n\n"; > } > }
Tony
Date: Thu, 10 Jan 2019 09:05:43 +0000
To: Karl Williamson <public [...] khwilliamson.com>, Tony Cook via RT <perlbug-followup [...] perl.org>
From: Nicholas Clark <nick [...] ccl4.org>
Subject: Re: [perl #133756] //g flag on regex with UTF-8 source causes regex optimiser to wrongly reject a match
Download (untitled) / with headers
text/plain 6.8k
On Wed, Jan 09, 2019 at 04:27:35PM -0800, Tony Cook via RT wrote: Show quoted text
> You're going to facepalm yourself.
Yes. :-) Show quoted text
> On Wed, 09 Jan 2019 06:11:06 -0800, nicholas wrote:
> > For the case where the eval'd source code contains Ā in a code comment > > this > > doesn't match. Where the code comment is ÿ it does: > > > > nicholas@dromedary-001 perl6$ cat /tmp/rule.pl > > use strict; > > use warnings; > > > > my $mr = "L\xFCften Kalt"; > > my $text = "L\xFCften Kalt"; > > > > # Culprit here is the //g flag: > > for my $char ("\xFF", "\x100", "\xFF", "\x100") {
> > Add: > > pos($text) = 0; > > here and they all pass.
Yes. I totally failed to figure out the proper test case. I now have a fresh head, the office coffee machine bootstrapped, and... On Wed, Jan 09, 2019 at 10:25:42AM -0700, Karl Williamson wrote: Show quoted text
> I looked at it a little more, and it seems weird that a comment would cause > this variance in behavior. OTOH, eval of UTF-8 is buggy, and that's why we > have the unicode_eval feature.
OK, *here* is the test case. It's very sensitive to string length: nc@I34 perl2$ cat /tmp/133756.pl #!perl my $mr = "Programme: Automatic plus, Baumwolle, Pflegeleicht, Synthetic, Finish Wolle, Express, Jeans, Schongl\344tten, L\374ften Warm, L\374ften Kalt"; my $text = $mr; if (shift) { utf8::downgrade($mr); } else { utf8::upgrade($mr); } # use Devel::Peek; Dump($mr); my $got = eval "\$text =~ /$mr/i;"; if ($got) { print "Y\n"; exit 1; } elsif ($@) { print "\$\@: $@\n"; die; } else { print "n\n"; exit 0; } __END__ nc@I34 perl2$ perl /tmp/133756.pl 1 Y nc@I34 perl2$ perl /tmp/133756.pl 0 n It's "fixed" in blead: 792b461851b63487dbcd63fc1948671db8f1dbe5 is the first bad commit commit 792b461851b63487dbcd63fc1948671db8f1dbe5 Author: Karl Williamson <khw@cpan.org> Date: Sat Feb 3 10:25:31 2018 -0700 regcomp.c: Pack EXACTish nodes more fully Prior to this commit, nodes that are to match a string exactly, or possibly case insensitively used only half the potential space available (that being limited by the length field which is a U8). (The optimizer might later pack some together to make a larger node.) Talking it over with Yves, we suspect that this is a relic of an earlier time. It makes more sense to have longer nodes when possible to lower overhead in the matching engine. :100644 100644 6dbfed52ab7d78f0909fce42f000127a13295155 74c8eb2e21b622e7c40dd726933f457353bfa9ee M regcomp.c bisect run success That took 491 seconds. and that commit is just a refactoring - no bug fixes, no new tests: commit 792b461851b63487dbcd63fc1948671db8f1dbe5 Author: Karl Williamson <khw@cpan.org> Date: Sat Feb 3 10:25:31 2018 -0700 regcomp.c: Pack EXACTish nodes more fully Prior to this commit, nodes that are to match a string exactly, or possibly case insensitively used only half the potential space available (that being limited by the length field which is a U8). (The optimizer might later pack some together to make a larger node.) Talking it over with Yves, we suspect that this is a relic of an earlier time. It makes more sense to have longer nodes when possible to lower overhead in the matching engine. --- regcomp.c | 33 ++++++++++++++------------------- 1 file changed, 14 insertions(+), 19 deletions(-) diff --git a/regcomp.c b/regcomp.c index 6dbfed52ab..74c8eb2e21 100644 --- a/regcomp.c +++ b/regcomp.c @@ -13287,8 +13287,12 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth) UV ender = 0; char *p; char *s; -#define MAX_NODE_STRING_SIZE 127 + +/* This allows us to fill a node with just enough spare so that if the final + * character folds, its expansion is guaranteed to fit */ +#define MAX_NODE_STRING_SIZE (255-UTF8_MAXBYTES_CASE) char foldbuf[MAX_NODE_STRING_SIZE+UTF8_MAXBYTES_CASE+1]; + char *s0; U8 upper_parse = MAX_NODE_STRING_SIZE; U8 node_type = compute_EXACTish(pRExC_state); @@ -13332,24 +13336,15 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth) * use a pseudo regnode like 'EXACT_ORIG_FOLD' */ maybe_exact = FOLD && PASS2; - /* XXX The node can hold up to 255 bytes, yet this only goes to - * 127. I (khw) do not know why. Keeping it somewhat less than - * 255 allows us to not have to worry about overflow due to - * converting to utf8 and fold expansion, but that value is - * 255-UTF8_MAXBYTES_CASE. join_exact() may join adjacent nodes - * split up by this limit into a single one using the real max of - * 255. Even at 127, this breaks under rare circumstances. If - * folding, we do not want to split a node at a character that is a - * non-final in a multi-char fold, as an input string could just - * happen to want to match across the node boundary. The join - * would solve that problem if the join actually happens. But a - * series of more than two nodes in a row each of 127 would cause - * the first join to succeed to get to 254, but then there wouldn't - * be room for the next one, which could at be one of those split - * multi-char folds. I don't know of any fool-proof solution. One - * could back off to end with only a code point that isn't such a - * non-final, but it is possible for there not to be any in the - * entire node. */ + /* This breaks under rare circumstances. If folding, we do not + * want to split a node at a character that is a non-final in a + * multi-char fold, as an input string could just happen to want to + * match across the node boundary. The code at the end of the loop + * looks for this, and backs off until it finds not such a + * character, but it is possible (though extremely, extremely + * unlikely) for all characters in the node to be non-final fold + * ones, in which case we just leave the node fully filled, and + * hope that it doesn't match the string in just the wrong place */ assert( ! UTF /* Is at the beginning of a character */ || UTF8_IS_INVARIANT(UCHARAT(RExC_parse)) So I think that we still have a real live bug here, now just better hidden. And sorry for the distraction of the poor test case. Given that that commit just changes the value of MAX_NODE_STRING_SIZE I figure that I *ought* to be able to tweak my test script to use a different sized string and still expose a problem, but I'm failing to make it (not) work. The original problem was revealed as exactly 2 lines of difference in a pair of 1.2G files, which are produced from something like 7M of generated code, when I tested a refactoring of the how that code is generated. So whatever it is, it's obscure. Nicholas Clark
Date: Thu, 10 Jan 2019 22:00:45 -0700
Subject: Re: [perl #133756] //g flag on regex with UTF-8 source causes regex optimiser to wrongly reject a match
From: Karl Williamson <public [...] khwilliamson.com>
To: Nicholas Clark <nick [...] ccl4.org>, Tony Cook via RT <perlbug-followup [...] perl.org>
Download (untitled) / with headers
text/plain 11.6k
On 1/10/19 2:05 AM, Nicholas Clark wrote: Show quoted text
> > On Wed, Jan 09, 2019 at 04:27:35PM -0800, Tony Cook via RT wrote:
>> You're going to facepalm yourself.
> > Yes. :-) >
>> On Wed, 09 Jan 2019 06:11:06 -0800, nicholas wrote:
>>> For the case where the eval'd source code contains Ā in a code comment >>> this >>> doesn't match. Where the code comment is ÿ it does: >>> >>> nicholas@dromedary-001 perl6$ cat /tmp/rule.pl >>> use strict; >>> use warnings; >>> >>> my $mr = "L\xFCften Kalt"; >>> my $text = "L\xFCften Kalt"; >>> >>> # Culprit here is the //g flag: >>> for my $char ("\xFF", "\x100", "\xFF", "\x100") {
>> >> Add: >> >> pos($text) = 0; >> >> here and they all pass.
> > > Yes. I totally failed to figure out the proper test case. I now have a fresh > head, the office coffee machine bootstrapped, and... > > On Wed, Jan 09, 2019 at 10:25:42AM -0700, Karl Williamson wrote: >
>> I looked at it a little more, and it seems weird that a comment would cause >> this variance in behavior. OTOH, eval of UTF-8 is buggy, and that's why we >> have the unicode_eval feature.
> > > OK, *here* is the test case. It's very sensitive to string length: > > nc@I34 perl2$ cat /tmp/133756.pl > #!perl > > my $mr = "Programme: Automatic plus, Baumwolle, Pflegeleicht, Synthetic, Finish Wolle, Express, Jeans, Schongl\344tten, L\374ften Warm, L\374ften Kalt"; > my $text = $mr; > > if (shift) { > utf8::downgrade($mr); > } else { > utf8::upgrade($mr); > } > > # use Devel::Peek; Dump($mr); > > my $got = eval "\$text =~ /$mr/i;"; > > if ($got) { > print "Y\n"; > exit 1; > } elsif ($@) { > print "\$\@: $@\n"; > die; > } else { > print "n\n"; > exit 0; > } > > __END__ > nc@I34 perl2$ perl /tmp/133756.pl 1 > Y > nc@I34 perl2$ perl /tmp/133756.pl 0 > n > > > > It's "fixed" in blead: > > > 792b461851b63487dbcd63fc1948671db8f1dbe5 is the first bad commit > commit 792b461851b63487dbcd63fc1948671db8f1dbe5 > Author: Karl Williamson <khw@cpan.org> > Date: Sat Feb 3 10:25:31 2018 -0700 > > regcomp.c: Pack EXACTish nodes more fully > > Prior to this commit, nodes that are to match a string exactly, or > possibly case insensitively used only half the potential space available > (that being limited by the length field which is a U8). (The optimizer > might later pack some together to make a larger node.) Talking it over > with Yves, we suspect that this is a relic of an earlier time. It makes > more sense to have longer nodes when possible to lower overhead in > the matching engine. > > :100644 100644 6dbfed52ab7d78f0909fce42f000127a13295155 74c8eb2e21b622e7c40dd726933f457353bfa9ee M regcomp.c > bisect run success > That took 491 seconds. > > > and that commit is just a refactoring - no bug fixes, no new tests: > > commit 792b461851b63487dbcd63fc1948671db8f1dbe5 > Author: Karl Williamson <khw@cpan.org> > Date: Sat Feb 3 10:25:31 2018 -0700 > > regcomp.c: Pack EXACTish nodes more fully > > Prior to this commit, nodes that are to match a string exactly, or > possibly case insensitively used only half the potential space available > (that being limited by the length field which is a U8). (The optimizer > might later pack some together to make a larger node.) Talking it over > with Yves, we suspect that this is a relic of an earlier time. It makes > more sense to have longer nodes when possible to lower overhead in > the matching engine. > --- > regcomp.c | 33 ++++++++++++++------------------- > 1 file changed, 14 insertions(+), 19 deletions(-) > > diff --git a/regcomp.c b/regcomp.c > index 6dbfed52ab..74c8eb2e21 100644 > --- a/regcomp.c > +++ b/regcomp.c > @@ -13287,8 +13287,12 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth) > UV ender = 0; > char *p; > char *s; > -#define MAX_NODE_STRING_SIZE 127 > + > +/* This allows us to fill a node with just enough spare so that if the final > + * character folds, its expansion is guaranteed to fit */ > +#define MAX_NODE_STRING_SIZE (255-UTF8_MAXBYTES_CASE) > char foldbuf[MAX_NODE_STRING_SIZE+UTF8_MAXBYTES_CASE+1]; > + > char *s0; > U8 upper_parse = MAX_NODE_STRING_SIZE; > U8 node_type = compute_EXACTish(pRExC_state); > @@ -13332,24 +13336,15 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth) > * use a pseudo regnode like 'EXACT_ORIG_FOLD' */ > maybe_exact = FOLD && PASS2; > > - /* XXX The node can hold up to 255 bytes, yet this only goes to > - * 127. I (khw) do not know why. Keeping it somewhat less than > - * 255 allows us to not have to worry about overflow due to > - * converting to utf8 and fold expansion, but that value is > - * 255-UTF8_MAXBYTES_CASE. join_exact() may join adjacent nodes > - * split up by this limit into a single one using the real max of > - * 255. Even at 127, this breaks under rare circumstances. If > - * folding, we do not want to split a node at a character that is a > - * non-final in a multi-char fold, as an input string could just > - * happen to want to match across the node boundary. The join > - * would solve that problem if the join actually happens. But a > - * series of more than two nodes in a row each of 127 would cause > - * the first join to succeed to get to 254, but then there wouldn't > - * be room for the next one, which could at be one of those split > - * multi-char folds. I don't know of any fool-proof solution. One > - * could back off to end with only a code point that isn't such a > - * non-final, but it is possible for there not to be any in the > - * entire node. */ > + /* This breaks under rare circumstances. If folding, we do not > + * want to split a node at a character that is a non-final in a > + * multi-char fold, as an input string could just happen to want to > + * match across the node boundary. The code at the end of the loop > + * looks for this, and backs off until it finds not such a > + * character, but it is possible (though extremely, extremely > + * unlikely) for all characters in the node to be non-final fold > + * ones, in which case we just leave the node fully filled, and > + * hope that it doesn't match the string in just the wrong place */ > > assert( ! UTF /* Is at the beginning of a character */ > || UTF8_IS_INVARIANT(UCHARAT(RExC_parse)) > > > > > So I think that we still have a real live bug here, now just better hidden. > And sorry for the distraction of the poor test case. > > > Given that that commit just changes the value of MAX_NODE_STRING_SIZE > I figure that I *ought* to be able to tweak my test script to use a > different sized string and still expose a problem, but I'm failing to > make it (not) work. > > The original problem was revealed as exactly 2 lines of difference in a pair > of 1.2G files, which are produced from something like 7M of generated code, > when I tested a refactoring of the how that code is generated. > > So whatever it is, it's obscure. > > Nicholas Clark >
Thank you for reporting this. And your surmisal was correct. There is a real bug, as hopefully explained by this commit message, now in blead: From 88c2ae302de11db0b66ff2269e2c4e9bf46efd69 Mon Sep 17 00:00:00 2001 From: Karl Williamson <khw@cpan.org> Date: Thu, 10 Jan 2019 20:59:31 -0700 Subject: [PATCH 2/2] PATCH: [perl #133756] Failure to match properly This was caused by a counting error. An EXACTFish regnode has a finite length it can hold for the string being matched. If that length is exceeded, a 2nd node is used for the next segment of the string, for as many regnodes as are needed. A problem occurs if a regnode ends with one of the 22 characters in Unicode 11 that occur in non-final positions of a multi-character fold. The design of the pattern matching engine doesn't allow matches across regnodes. Consider, for example if a node ended in the letter 'f' and the next node begins with the letter 'i'. That sequence should match, under /i, the ligature "fi" (U+FB01). But it wouldn't because the pattern splits them across nodes. The solution I adopted was to forbid a node to end with one of those 22 characters if there is another string node that follows it. This is not fool proof, for example, if the entire node consisted of only these characters, one would have to split it at some position (In that case, we just take as much of the string as will fit.) But for real life applications, it is good enough. What happens if a node ends with one of the 22, is that the node is shortened so that those are instead placed at the beginning of the following node. When the code encounters this situation, it backs off until it finds a character that isn't a non-final fold one, and closes the node with that one. A /i node is filled with the fold of the input, for several reasons. The most obvious is that it saves time, you can skip folding the pattern at runtime. But there are reasons based on the design of the optimzer as well, which I won't go into here, but are documented in regcomp.c. When we back out the final characters in a node, we also have to back out the corresponding unfolded characters in the input, so that those can be (folded) into the following node. Since the number of characters in the fold may not be the same as unfolded, there is not an easily discernable correspondence between the input and the folded output. That means that generally, what has to be done is that the input is reparsed from the beginning of the node, but the permitted length has been shortened (we know precisely how much to shorten it to) so that it will end with something other than the 22. But, the code saves the previous input character's position (for other reasons), so if we only have to backup one character, we can just use that and not have to reparse. This bug was that the code thought a two character backup was really a one character one, and did not reparse the node, creating an off-by-one error, and a character was simply omitted in the pattern (that should have started the following node). And the input had two of the 22 characters adjacent to each other in just the right positions that the node was split. The bisect showed that when the node size was changed the bug went away, at least for this particular input string. But a different, longer, string would have triggered the bug, and this commit fixes that. This bug is actually very unlikely to occur in most real world applications. That is because other changes in the regex compiler have caused nodes to be split so that things that don't particpate in folds at all are separated out into EXACT nodes. (The reason for that is it allows the optimizer things to grab on to under /i that it wouldn't otherwise have known about.) That means that anything like this string would never cause the bug to happen because blanks and commas, etc. would be in separate nodes, and so no node would ever get large enough to fill the 238 available byte slots in a node (235 on EBCDIC). Only a long string without punctuation would trigger it. I have artificially constructed such a string in the tests added by this commit. One of the 22 characters is 't', so long strings of DNA "ACTG" could trigger this bug. I find it somewhat amusing that this is something like a DNA transcription error, which occurs in nature at very low rates, but selection, it is believed, will make sure the error rate is above zero.
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 841b
On Fri, 11 Jan 2019 05:01:12 GMT, public@khwilliamson.com wrote: [snip] Show quoted text
> > So whatever it is, it's obscure. > > > > Nicholas Clark > >
> > Thank you for reporting this. > > And your surmisal was correct. There is a real bug, as hopefully > explained by this commit message, now in blead: > > From 88c2ae302de11db0b66ff2269e2c4e9bf46efd69 Mon Sep 17 00:00:00 2001
I don't think the line above is correct. When I update blead the commit appears to be: ##### commit 739762602079d68c47cbde3b29275e844d0c92ef (HEAD -> blead, origin/blead, origin/HEAD) Author: Karl Williamson <khw@cpan.org> AuthorDate: Thu Jan 10 22:59:31 2019 Commit: Karl Williamson <khw@cpan.org> CommitDate: Thu Jan 10 23:56:24 2019 PATCH: [perl #133756] Failure to match properly ... ##### Thank you very much. -- James E Keenan (jkeenan@cpan.org)
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 1012b
On Fri, 11 Jan 2019 05:48:17 -0800, jkeenan wrote: Show quoted text
> On Fri, 11 Jan 2019 05:01:12 GMT, public@khwilliamson.com wrote: > > [snip] >
> > > So whatever it is, it's obscure. > > > > > > Nicholas Clark > > >
> > > > Thank you for reporting this. > > > > And your surmisal was correct. There is a real bug, as hopefully > > explained by this commit message, now in blead: > > > > From 88c2ae302de11db0b66ff2269e2c4e9bf46efd69 Mon Sep 17 00:00:00 > > 2001
> > I don't think the line above is correct. When I update blead the > commit appears to be: > > ##### > commit 739762602079d68c47cbde3b29275e844d0c92ef (HEAD -> blead, > origin/blead, origin/HEAD) > Author: Karl Williamson <khw@cpan.org> > AuthorDate: Thu Jan 10 22:59:31 2019 > Commit: Karl Williamson <khw@cpan.org> > CommitDate: Thu Jan 10 23:56:24 2019 > > PATCH: [perl #133756] Failure to match properly >
This is the correct commit number for the fix. Now closing this ticket. Show quoted text
> ... > ##### > > Thank you very much.
-- Karl Williamson
Download (untitled) / with headers
text/plain 313b
Thank you for filing this report. You have helped make Perl better. With the release today of Perl 5.30.0, this and 160 other issues have been resolved. Perl 5.30.0 may be downloaded via: https://metacpan.org/release/XSAWYERX/perl-5.30.0 If you find that the problem persists, feel free to reopen this ticket.


This service is sponsored and maintained by Best Practical Solutions and runs on Perl.org infrastructure.

For issues related to this RT instance (aka "perlbug"), please contact perlbug-admin at perl.org