Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CVE-2018-6797] heap-buffer-overflow (WRITE of size 1) in S_regatom (regcomp.c) #16185

Closed
p5pRT opened this issue Oct 6, 2017 · 49 comments
Closed

Comments

@p5pRT
Copy link

p5pRT commented Oct 6, 2017

Migrated from rt.perl.org#132227 (status was 'resolved')

Searchable as RT132227$

@p5pRT
Copy link
Author

p5pRT commented Oct 6, 2017

From @geeknik

Triggered while fuzzing Perl v5.27.4-29-gdc41635.

od -tx1 ./test514
0000000 2f 30 30 5c 4e 7b 55 2b 30 7d df df df df df df
0000020 df 30 30 30 df df 30 2f 69
0000031

==28186==ERROR​: AddressSanitizer​: heap-buffer-overflow on address
0x60700000ac58 at pc 0x000000846c2d bp 0x7ffe716bc7f0 sp 0x7ffe716bc7e0
WRITE of size 1 at 0x60700000ac58 thread T0
  #0 0x846c2c in S_regatom /root/perl/regcomp.c​:13652
  #1 0x8587f6 in S_regpiece /root/perl/regcomp.c​:11708
  #2 0x8587f6 in S_regbranch /root/perl/regcomp.c​:11633
  #3 0x88830a in S_reg /root/perl/regcomp.c​:11371
  #4 0x8c90dc in Perl_re_op_compile /root/perl/regcomp.c​:7363
  #5 0x5297d0 in Perl_pmruntime /root/perl/op.c​:5888
  #6 0x74d853 in Perl_yyparse /root/perl/perly.y​:1210
  #7 0x58b9b8 in S_parse_body /root/perl/perl.c​:2450
  #8 0x593622 in perl_parse /root/perl/perl.c​:1753
  #9 0x42eb7d in main /root/perl/perlmain.c​:121
  #10 0x7fba4cebe82f in __libc_start_main
(/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
  #11 0x42fe18 in _start (/root/perl/perl+0x42fe18)

0x60700000ac58 is located 0 bytes to the right of 72-byte region
[0x60700000ac10,0x60700000ac58)
allocated by thread T0 here​:
  #0 0x7fba4dc62602 in malloc
(/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
  #1 0x92dfd4 in Perl_safesysmalloc /root/perl/util.c​:153
  #2 0x8c6cbe in Perl_re_op_compile /root/perl/regcomp.c​:7209
  #3 0x5297d0 in Perl_pmruntime /root/perl/op.c​:5888
  #4 0x74d853 in Perl_yyparse /root/perl/perly.y​:1210
  #5 0x58b9b8 in S_parse_body /root/perl/perl.c​:2450
  #6 0x593622 in perl_parse /root/perl/perl.c​:1753
  #7 0x42eb7d in main /root/perl/perlmain.c​:121
  #8 0x7fba4cebe82f in __libc_start_main
(/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

SUMMARY​: AddressSanitizer​: heap-buffer-overflow /root/perl/regcomp.c​:13652
S_regatom

When tested against Perl 5.22.1 under Valgrind, the following occurs​:

==5420== Invalid write of size 1
==5420== at 0x52F178​: Perl__to_fold_latin1 (in /usr/bin/perl)
==5420== by 0x532904​: Perl__to_uni_fold_flags (in /usr/bin/perl)
==5420== by 0x4826E7​: ??? (in /usr/bin/perl)
==5420== by 0x48479C​: ??? (in /usr/bin/perl)
==5420== by 0x4798EA​: ??? (in /usr/bin/perl)
==5420== by 0x48E942​: Perl_re_op_compile (in /usr/bin/perl)
==5420== by 0x436377​: Perl_pmruntime (in /usr/bin/perl)
==5420== by 0x46CB15​: Perl_yyparse (in /usr/bin/perl)
==5420== by 0x442FB4​: perl_parse (in /usr/bin/perl)
==5420== by 0x41CB28​: main (in /usr/bin/perl)
==5420== Address 0x5b9dd88 is 0 bytes after a block of size 72 alloc'd
==5420== at 0x4C2DB8F​: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==5420== by 0x498241​: Perl_safesysmalloc (in /usr/bin/perl)
==5420== by 0x48E5B4​: Perl_re_op_compile (in /usr/bin/perl)
==5420== by 0x436377​: Perl_pmruntime (in /usr/bin/perl)
==5420== by 0x46CB15​: Perl_yyparse (in /usr/bin/perl)
==5420== by 0x442FB4​: perl_parse (in /usr/bin/perl)
==5420== by 0x41CB28​: main (in /usr/bin/perl)
==5420==
==5420== Invalid write of size 1
==5420== at 0x52F17B​: Perl__to_fold_latin1 (in /usr/bin/perl)
==5420== by 0x532904​: Perl__to_uni_fold_flags (in /usr/bin/perl)
==5420== by 0x4826E7​: ??? (in /usr/bin/perl)
==5420== by 0x48479C​: ??? (in /usr/bin/perl)
==5420== by 0x4798EA​: ??? (in /usr/bin/perl)
==5420== by 0x48E942​: Perl_re_op_compile (in /usr/bin/perl)
==5420== by 0x436377​: Perl_pmruntime (in /usr/bin/perl)
==5420== by 0x46CB15​: Perl_yyparse (in /usr/bin/perl)
==5420== by 0x442FB4​: perl_parse (in /usr/bin/perl)
==5420== by 0x41CB28​: main (in /usr/bin/perl)
==5420== Address 0x5b9dd89 is 1 bytes after a block of size 72 alloc'd
==5420== at 0x4C2DB8F​: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==5420== by 0x498241​: Perl_safesysmalloc (in /usr/bin/perl)
==5420== by 0x48E5B4​: Perl_re_op_compile (in /usr/bin/perl)
==5420== by 0x436377​: Perl_pmruntime (in /usr/bin/perl)
==5420== by 0x46CB15​: Perl_yyparse (in /usr/bin/perl)
==5420== by 0x442FB4​: perl_parse (in /usr/bin/perl)
==5420== by 0x41CB28​: main (in /usr/bin/perl)
==5420==
==5420== Invalid write of size 1
==5420== at 0x482311​: ??? (in /usr/bin/perl)
==5420== by 0x48479C​: ??? (in /usr/bin/perl)
==5420== by 0x4798EA​: ??? (in /usr/bin/perl)
==5420== by 0x48E942​: Perl_re_op_compile (in /usr/bin/perl)
==5420== by 0x436377​: Perl_pmruntime (in /usr/bin/perl)
==5420== by 0x46CB15​: Perl_yyparse (in /usr/bin/perl)
==5420== by 0x442FB4​: perl_parse (in /usr/bin/perl)
==5420== by 0x41CB28​: main (in /usr/bin/perl)
==5420== Address 0x5b9dd8c is 4 bytes after a block of size 72 alloc'd
==5420== at 0x4C2DB8F​: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==5420== by 0x498241​: Perl_safesysmalloc (in /usr/bin/perl)
==5420== by 0x48E5B4​: Perl_re_op_compile (in /usr/bin/perl)
==5420== by 0x436377​: Perl_pmruntime (in /usr/bin/perl)
==5420== by 0x46CB15​: Perl_yyparse (in /usr/bin/perl)
==5420== by 0x442FB4​: perl_parse (in /usr/bin/perl)
==5420== by 0x41CB28​: main (in /usr/bin/perl)
==5420==
panic​: reg_node overrun trying to emit 0, 5b9dd90>=5b9dd88 at test514 line 1

@p5pRT
Copy link
Author

p5pRT commented Oct 6, 2017

From @geeknik

test514.gz

@p5pRT
Copy link
Author

p5pRT commented Oct 6, 2017

From @khwilliamson

I'll look at this when I get around to reworking this portion of the code later in 5.27
--
Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Oct 6, 2017

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Jan 31, 2018

From @tonycoz

On Fri, 06 Oct 2017 16​:46​:50 -0700, khw wrote​:

I'll look at this when I get around to reworking this portion of the
code later in 5.27

Simplifies to the attached, which doesn't assert under valgrind etc, but does cause a panic​:

$ ./perl ../132227b.pl
panic​: reg_node overrun trying to emit 0, 5584354b4bf8>=5584354b4bf8 at ../132227b.pl line 1.

Tony

@p5pRT
Copy link
Author

p5pRT commented Jan 31, 2018

From @tonycoz

132227b.pl

@p5pRT
Copy link
Author

p5pRT commented Feb 1, 2018

From @tonycoz

On Tue, 30 Jan 2018 20​:53​:29 -0800, tonyc wrote​:

On Fri, 06 Oct 2017 16​:46​:50 -0700, khw wrote​:

I'll look at this when I get around to reworking this portion of the
code later in 5.27

Simplifies to the attached, which doesn't assert under valgrind etc,
but does cause a panic​:

$ ./perl ../132227b.pl
panic​: reg_node overrun trying to emit 0, 5584354b4bf8>=5584354b4bf8
at ../132227b.pl line 1.

What's happening is the first pass is emitting the double-s as a single byte, but the second pass emits "ss", and there isn't enough room for it.

If you have a larger number of ß this can push the following bytes emitted well beyond the end of the allocated block.

At minimum this has the potential to cause a segmentation fault, or to corrupt the arena, leading to a crash further on.

Depending on the heap implementation it may be possible to perform a nastier exploit - an attacker has almost complete control over the bytes written.

I do believe this is a security issue.

Tony

@p5pRT
Copy link
Author

p5pRT commented Feb 1, 2018

From @khwilliamson

On 01/31/2018 08​:44 PM, Tony Cook via RT wrote​:

On Tue, 30 Jan 2018 20​:53​:29 -0800, tonyc wrote​:

On Fri, 06 Oct 2017 16​:46​:50 -0700, khw wrote​:

I'll look at this when I get around to reworking this portion of the
code later in 5.27

Simplifies to the attached, which doesn't assert under valgrind etc,
but does cause a panic​:

$ ./perl ../132227b.pl
panic​: reg_node overrun trying to emit 0, 5584354b4bf8>=5584354b4bf8
at ../132227b.pl line 1.

What's happening is the first pass is emitting the double-s as a single byte, but the second pass emits "ss", and there isn't enough room for it.

If you have a larger number of ß this can push the following bytes emitted well beyond the end of the allocated block.

At minimum this has the potential to cause a segmentation fault, or to corrupt the arena, leading to a crash further on.

Depending on the heap implementation it may be possible to perform a nastier exploit - an attacker has almost complete control over the bytes written.

I do believe this is a security issue.

Tony

Since you've done so much analysis, do you have a patch, or know the
line of code in error? I've fixed several similar ones involving β in
the past, but missed this one.

@p5pRT
Copy link
Author

p5pRT commented Feb 2, 2018

From @demerphq

FWIW, this looks like a reparse bug. The pattern starts out latin-1.
We should trigger "reparse as utf8" when we see the \N{U+0}, which
then should do the correct sizing.

So I would look into what is happening with \N{U+0} and why it isnt
triggering a reparse of the pattern.

Yves

On 1 February 2018 at 18​:27, Karl Williamson <public@​khwilliamson.com> wrote​:

On 01/31/2018 08​:44 PM, Tony Cook via RT wrote​:

On Tue, 30 Jan 2018 20​:53​:29 -0800, tonyc wrote​:

On Fri, 06 Oct 2017 16​:46​:50 -0700, khw wrote​:

I'll look at this when I get around to reworking this portion of the
code later in 5.27

Simplifies to the attached, which doesn't assert under valgrind etc,
but does cause a panic​:

$ ./perl ../132227b.pl
panic​: reg_node overrun trying to emit 0, 5584354b4bf8>=5584354b4bf8
at ../132227b.pl line 1.

What's happening is the first pass is emitting the double-s as a single
byte, but the second pass emits "ss", and there isn't enough room for it.

If you have a larger number of ß this can push the following bytes emitted
well beyond the end of the allocated block.

At minimum this has the potential to cause a segmentation fault, or to
corrupt the arena, leading to a crash further on.

Depending on the heap implementation it may be possible to perform a
nastier exploit - an attacker has almost complete control over the bytes
written.

I do believe this is a security issue.

Tony

Since you've done so much analysis, do you have a patch, or know the line of
code in error? I've fixed several similar ones involving β in the past, but
missed this one.

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Feb 2, 2018

From @demerphq

On 2 February 2018 at 13​:21, demerphq <demerphq@​gmail.com> wrote​:

FWIW, this looks like a reparse bug. The pattern starts out latin-1.
We should trigger "reparse as utf8" when we see the \N{U+0}, which
then should do the correct sizing.

So I would look into what is happening with \N{U+0} and why it isnt
triggering a reparse of the pattern.

Yep, that is it, or more or less, although we ARE trying to trigger
reparse, but that logic has been "optimized" such that it does not do
the reparse if it has not seen \xDF first, and later when it DOES see
the \xDF it also does not realize it needs to upgrade the pattern. The
following patch fixes it, although its not the right patch.

$ git diff

Inline Patch
diff --git a/regcomp.c b/regcomp.c
index 6dbfed5..2668df5 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -352,7 +352,7 @@ struct RExC_state_t {
                 assert(PASS1);                                              \
                 set_regex_charset(&RExC_flags, REGEX_UNICODE_CHARSET);      \
                 RExC_uni_semantics = 1;                                     \
-                if (RExC_seen_unfolded_sharp_s) {                           \
+                if (!UTF || RExC_seen_unfolded_sharp_s) {                   \
                     *flagp |= RESTART_PASS1;                                \
                     return restart_retval;                                  \
                 }

You have evolved some of this handling beyond my current understanding of the subtleties\, so it will take me some time to work a better patch\.

With my patch​:

$ ./perl -Ilib -Mre=Debug,ALL -le'/00\N{U+2}\xDF/i'
Assembling pattern from 1 elements
Compiling REx "00\N{U+2}\xDF"
Starting first pass (sizing)
<00\N{U+2}\xDF> | 1| reg
  | | brnc
  | | piec
  | | atom
Need to redo pass 1
Starting first pass (sizing)
<00\N{U+2}\xDF> | 1| reg
  | | brnc
  | | piec
  | | atom
Required size 4 nodes
Starting second pass (creation)
<00\N{U+2}\xDF> | 1| reg
  | | brnc
  | | piec
  | | atom
<> | 5| lsbr~ tying lastbr EXACTFU <00\2ss> (1) to
ender END (4) offset 3
  | | tail~ EXACTFU <00\2ss> (1) -> END
Starting post parse optimization
first​:> 1​: EXACTFU <00\2ss> (4)
first at 1
RExC_seen​:
study_chunk stopparen=-1 recursed_count=1 depth=0 recursed_depth=0
scan=1118124 last=1118134
Peep​: Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' 0​:0/0 *Fixed​: '' @​
0/0 Float​: '' @​ 0/0
Peep> 1​: EXACTFU <00\2ss> (4) [ SCF_DO_SUBSTR SCF_WHILEM_VISITED_POS ]
  join> 1​: EXACTFU <00\2ss> (4)
commit​: Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 *Fixed​:
'' @​ 0/0 Float​: '' @​ 0/0
pre-fin​: Pos​:4/1 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 Fixed​:
'' @​ 0/0 *Float​: '' @​ 0/0
post-fin​: Pos​:4/1 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 Fixed​:
'' @​ 0/0 *Float​: '' @​ 0/0
commit​: Pos​:4/1 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 Fixed​: ''
@​ 0/0 *Float​: '' @​ 0/0
minlen​: 4 r->minlen​:0 maxlen​:5
study_chunk_recursed_count​: 1
RExC_seen​:
Final program​:
  1​: EXACTFU_SS <00\2ss> (4)
  4​: END (0)
stclass EXACTFU_SS <00\2ss> minlen 4
r->extflags​: FOLD UNICODE
r->intflags​: [none-set]
Freeing REx​: "00\N{U+2}\xDF"

---------->8--------------->8-------------

Without​:

$ ./perl -Ilib -Mre=Debug,ALL -le'/00\N{U+2}\xDF/i'
Assembling pattern from 1 elements
Compiling REx "00\N{U+2}\xDF"
Starting first pass (sizing)
<00\N{U+2}\xDF> | 1| reg
  | | brnc
  | | piec
  | | atom
Required size 3 nodes
Starting second pass (creation)
<00\N{U+2}\xDF> | 1| reg
  | | brnc
  | | piec
  | | atom
panic​: reg_node overrun trying to emit 0, 2133840>=2133840 at -e line 1.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Feb 2, 2018

From @demerphq

On 2 February 2018 at 17​:40, demerphq <demerphq@​gmail.com> wrote​:

On 2 February 2018 at 13​:21, demerphq <demerphq@​gmail.com> wrote​:

FWIW, this looks like a reparse bug. The pattern starts out latin-1.
We should trigger "reparse as utf8" when we see the \N{U+0}, which
then should do the correct sizing.

So I would look into what is happening with \N{U+0} and why it isnt
triggering a reparse of the pattern.

Yep, that is it, or more or less, although we ARE trying to trigger
reparse, but that logic has been "optimized" such that it does not do
the reparse if it has not seen \xDF first, and later when it DOES see
the \xDF it also does not realize it needs to upgrade the pattern. The
following patch fixes it, although its not the right patch.

$ git diff
diff --git a/regcomp.c b/regcomp.c
index 6dbfed5..2668df5 100644
--- a/regcomp.c
+++ b/regcomp.c
@​@​ -352,7 +352,7 @​@​ struct RExC_state_t {
assert(PASS1); \
set_regex_charset(&RExC_flags, REGEX_UNICODE_CHARSET); \
RExC_uni_semantics = 1; \
- if (RExC_seen_unfolded_sharp_s) { \
+ if (!UTF || RExC_seen_unfolded_sharp_s) { \
*flagp |= RESTART_PASS1; \
return restart_retval; \
}

You have evolved some of this handling beyond my current understanding
of the subtleties, so it will take me some time to work a better
patch.

With my patch​:

$ ./perl -Ilib -Mre=Debug,ALL -le'/00\N{U+2}\xDF/i'
Assembling pattern from 1 elements
Compiling REx "00\N{U+2}\xDF"
Starting first pass (sizing)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
Need to redo pass 1
Starting first pass (sizing)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
Required size 4 nodes
Starting second pass (creation)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
<> | 5| lsbr~ tying lastbr EXACTFU <00\2ss> (1) to
ender END (4) offset 3
| | tail~ EXACTFU <00\2ss> (1) -> END
Starting post parse optimization
first​:> 1​: EXACTFU <00\2ss> (4)
first at 1
RExC_seen​:
study_chunk stopparen=-1 recursed_count=1 depth=0 recursed_depth=0
scan=1118124 last=1118134
Peep​: Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' 0​:0/0 *Fixed​: '' @​
0/0 Float​: '' @​ 0/0
Peep> 1​: EXACTFU <00\2ss> (4) [ SCF_DO_SUBSTR SCF_WHILEM_VISITED_POS ]
join> 1​: EXACTFU <00\2ss> (4)
commit​: Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 *Fixed​:
'' @​ 0/0 Float​: '' @​ 0/0
pre-fin​: Pos​:4/1 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 Fixed​:
'' @​ 0/0 *Float​: '' @​ 0/0
post-fin​: Pos​:4/1 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 Fixed​:
'' @​ 0/0 *Float​: '' @​ 0/0
commit​: Pos​:4/1 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 Fixed​: ''
@​ 0/0 *Float​: '' @​ 0/0
minlen​: 4 r->minlen​:0 maxlen​:5
study_chunk_recursed_count​: 1
RExC_seen​:
Final program​:
1​: EXACTFU_SS <00\2ss> (4)
4​: END (0)
stclass EXACTFU_SS <00\2ss> minlen 4
r->extflags​: FOLD UNICODE
r->intflags​: [none-set]
Freeing REx​: "00\N{U+2}\xDF"

---------->8--------------->8-------------

Without​:

$ ./perl -Ilib -Mre=Debug,ALL -le'/00\N{U+2}\xDF/i'
Assembling pattern from 1 elements
Compiling REx "00\N{U+2}\xDF"
Starting first pass (sizing)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
Required size 3 nodes
Starting second pass (creation)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
panic​: reg_node overrun trying to emit 0, 2133840>=2133840 at -e line 1.

I think the attached patch is better.

Basically we only need to reparse if we see a \xDF and we are in
unicode semantics.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Feb 2, 2018

From @demerphq

rt_132227.patch
commit d194576035c4a6d99fada0278b2247baf0fcf6a6
Author: Yves Orton <demerphq@gmail.com>
Date:   Fri Feb 2 22:20:57 2018 +0100

    fixup /00\N{U+0}\xDF/i

diff --git a/regcomp.c b/regcomp.c
index 6dbfed5..801f857 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13757,6 +13757,8 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                          * 'ss' */
                         if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
                             RExC_seen_unfolded_sharp_s = 1;
+                            if (RExC_uni_semantics)
+                                REQUIRE_UTF8(flagp);
                             maybe_exactfu = FALSE;
                         }
                         else if (maybe_exactfu

@p5pRT
Copy link
Author

p5pRT commented Feb 2, 2018

From @khwilliamson

On 02/02/2018 02​:26 PM, demerphq wrote​:

On 2 February 2018 at 17​:40, demerphq <demerphq@​gmail.com> wrote​:

On 2 February 2018 at 13​:21, demerphq <demerphq@​gmail.com> wrote​:

FWIW, this looks like a reparse bug. The pattern starts out latin-1.
We should trigger "reparse as utf8" when we see the \N{U+0}, which
then should do the correct sizing.

So I would look into what is happening with \N{U+0} and why it isnt
triggering a reparse of the pattern.

Yep, that is it, or more or less, although we ARE trying to trigger
reparse, but that logic has been "optimized" such that it does not do
the reparse if it has not seen \xDF first, and later when it DOES see
the \xDF it also does not realize it needs to upgrade the pattern. The
following patch fixes it, although its not the right patch.

$ git diff
diff --git a/regcomp.c b/regcomp.c
index 6dbfed5..2668df5 100644
--- a/regcomp.c
+++ b/regcomp.c
@​@​ -352,7 +352,7 @​@​ struct RExC_state_t {
assert(PASS1); \
set_regex_charset(&RExC_flags, REGEX_UNICODE_CHARSET); \
RExC_uni_semantics = 1; \
- if (RExC_seen_unfolded_sharp_s) { \
+ if (!UTF || RExC_seen_unfolded_sharp_s) { \
*flagp |= RESTART_PASS1; \
return restart_retval; \
}

You have evolved some of this handling beyond my current understanding
of the subtleties, so it will take me some time to work a better
patch.

With my patch​:

$ ./perl -Ilib -Mre=Debug,ALL -le'/00\N{U+2}\xDF/i'
Assembling pattern from 1 elements
Compiling REx "00\N{U+2}\xDF"
Starting first pass (sizing)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
Need to redo pass 1
Starting first pass (sizing)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
Required size 4 nodes
Starting second pass (creation)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
<> | 5| lsbr~ tying lastbr EXACTFU <00\2ss> (1) to
ender END (4) offset 3
| | tail~ EXACTFU <00\2ss> (1) -> END
Starting post parse optimization
first​:> 1​: EXACTFU <00\2ss> (4)
first at 1
RExC_seen​:
study_chunk stopparen=-1 recursed_count=1 depth=0 recursed_depth=0
scan=1118124 last=1118134
Peep​: Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' 0​:0/0 *Fixed​: '' @​
0/0 Float​: '' @​ 0/0
Peep> 1​: EXACTFU <00\2ss> (4) [ SCF_DO_SUBSTR SCF_WHILEM_VISITED_POS ]
join> 1​: EXACTFU <00\2ss> (4)
commit​: Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 *Fixed​:
'' @​ 0/0 Float​: '' @​ 0/0
pre-fin​: Pos​:4/1 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 Fixed​:
'' @​ 0/0 *Float​: '' @​ 0/0
post-fin​: Pos​:4/1 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 Fixed​:
'' @​ 0/0 *Float​: '' @​ 0/0
commit​: Pos​:4/1 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 Fixed​: ''
@​ 0/0 *Float​: '' @​ 0/0
minlen​: 4 r->minlen​:0 maxlen​:5
study_chunk_recursed_count​: 1
RExC_seen​:
Final program​:
1​: EXACTFU_SS <00\2ss> (4)
4​: END (0)
stclass EXACTFU_SS <00\2ss> minlen 4
r->extflags​: FOLD UNICODE
r->intflags​: [none-set]
Freeing REx​: "00\N{U+2}\xDF"

---------->8--------------->8-------------

Without​:

$ ./perl -Ilib -Mre=Debug,ALL -le'/00\N{U+2}\xDF/i'
Assembling pattern from 1 elements
Compiling REx "00\N{U+2}\xDF"
Starting first pass (sizing)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
Required size 3 nodes
Starting second pass (creation)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
panic​: reg_node overrun trying to emit 0, 2133840>=2133840 at -e line 1.

I think the attached patch is better.

Basically we only need to reparse if we see a \xDF and we are in
unicode semantics.

Yves

Yves' patch unnecessarily upgrades to UTF-8. And on irc, he indicated
he was conflating in his mind upgrading and reparsing.

Attached is my patch.

The problem occurs when we are under /d semantics. An EXACTF node is
generated, which means it may have different meanings if the eventual
matched target string is in UTF-8. Then the \N{} is seen, which changes
the rules to unicode, as \N is a unicode construct. Then the \xDF comes
along and sees that it is an EXACTF node and so doesn't do anything
special besides setting a flag.

What really should happen in this situation is addressed in my patch.
It restarts parsing, having set it to be using unicode rules. Then
instead of an EXACTF node, it creates an EXACTFU. In these nodes, the
\xDF gets folded in pass1 and you don't have this problem

Yves wanted to know why it works now if the \xDF comes before the \N{}.
The reason is that flag. The \N{} comes along after the flag is set,
and is smart enough to look at the flag and restart the parsing with
unicode rules, so that the EXACTFU is generated before the \xDF is seen.

@p5pRT
Copy link
Author

p5pRT commented Feb 2, 2018

From @khwilliamson

0010-132227.patch
From 03d81af254f2dc298aa2e8925eaba73d00dcb440 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Fri, 2 Feb 2018 15:14:27 -0700
Subject: [PATCH 10/10] 132227

---
 regcomp.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/regcomp.c b/regcomp.c
index 6dbfed52ab..fa04aa3f41 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13756,6 +13756,18 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                          * /u.  This includes the multi-char fold SHARP S to
                          * 'ss' */
                         if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
+
+                            /* If the node started out having uni rules, we
+                             * wouldn't have gotten here.  So this means
+                             * something in the middle has changed it, but
+                             * didn't think it needed to reparse.  But this
+                             * sharp s now does indicate the need for
+                             * reparsing. */
+                            if (RExC_uni_semantics) {
+                                *flagp |= RESTART_PASS1;
+                                return NULL;
+                            }
+
                             RExC_seen_unfolded_sharp_s = 1;
                             maybe_exactfu = FALSE;
                         }
-- 
2.11.0

@p5pRT
Copy link
Author

p5pRT commented Feb 3, 2018

From @demerphq

On 2 February 2018 at 23​:39, Karl Williamson <public@​khwilliamson.com> wrote​:

On 02/02/2018 02​:26 PM, demerphq wrote​:

On 2 February 2018 at 17​:40, demerphq <demerphq@​gmail.com> wrote​:

On 2 February 2018 at 13​:21, demerphq <demerphq@​gmail.com> wrote​:

FWIW, this looks like a reparse bug. The pattern starts out latin-1.
We should trigger "reparse as utf8" when we see the \N{U+0}, which
then should do the correct sizing.

So I would look into what is happening with \N{U+0} and why it isnt
triggering a reparse of the pattern.

Yep, that is it, or more or less, although we ARE trying to trigger
reparse, but that logic has been "optimized" such that it does not do
the reparse if it has not seen \xDF first, and later when it DOES see
the \xDF it also does not realize it needs to upgrade the pattern. The
following patch fixes it, although its not the right patch.

$ git diff
diff --git a/regcomp.c b/regcomp.c
index 6dbfed5..2668df5 100644
--- a/regcomp.c
+++ b/regcomp.c
@​@​ -352,7 +352,7 @​@​ struct RExC_state_t {
assert(PASS1);
\
set_regex_charset(&RExC_flags, REGEX_UNICODE_CHARSET);
\
RExC_uni_semantics = 1;
\
- if (RExC_seen_unfolded_sharp_s) {
\
+ if (!UTF || RExC_seen_unfolded_sharp_s) {
\
*flagp |= RESTART_PASS1;
\
return restart_retval;
\
}

You have evolved some of this handling beyond my current understanding
of the subtleties, so it will take me some time to work a better
patch.

With my patch​:

$ ./perl -Ilib -Mre=Debug,ALL -le'/00\N{U+2}\xDF/i'
Assembling pattern from 1 elements
Compiling REx "00\N{U+2}\xDF"
Starting first pass (sizing)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
Need to redo pass 1
Starting first pass (sizing)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
Required size 4 nodes
Starting second pass (creation)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
<> | 5| lsbr~ tying lastbr EXACTFU <00\2ss> (1) to
ender END (4) offset 3
| | tail~ EXACTFU <00\2ss> (1) -> END
Starting post parse optimization
first​:> 1​: EXACTFU <00\2ss> (4)
first at 1
RExC_seen​:
study_chunk stopparen=-1 recursed_count=1 depth=0 recursed_depth=0
scan=1118124 last=1118134
Peep​: Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' 0​:0/0 *Fixed​: '' @​
0/0 Float​: '' @​ 0/0
Peep> 1​: EXACTFU <00\2ss> (4) [ SCF_DO_SUBSTR SCF_WHILEM_VISITED_POS ]
join> 1​: EXACTFU <00\2ss> (4)
commit​: Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 *Fixed​:
'' @​ 0/0 Float​: '' @​ 0/0
pre-fin​: Pos​:4/1 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 Fixed​:
'' @​ 0/0 *Float​: '' @​ 0/0
post-fin​: Pos​:4/1 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 Fixed​:
'' @​ 0/0 *Float​: '' @​ 0/0
commit​: Pos​:4/1 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 Fixed​: ''
@​ 0/0 *Float​: '' @​ 0/0
minlen​: 4 r->minlen​:0 maxlen​:5
study_chunk_recursed_count​: 1
RExC_seen​:
Final program​:
1​: EXACTFU_SS <00\2ss> (4)
4​: END (0)
stclass EXACTFU_SS <00\2ss> minlen 4
r->extflags​: FOLD UNICODE
r->intflags​: [none-set]
Freeing REx​: "00\N{U+2}\xDF"

---------->8--------------->8-------------

Without​:

$ ./perl -Ilib -Mre=Debug,ALL -le'/00\N{U+2}\xDF/i'
Assembling pattern from 1 elements
Compiling REx "00\N{U+2}\xDF"
Starting first pass (sizing)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
Required size 3 nodes
Starting second pass (creation)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
panic​: reg_node overrun trying to emit 0, 2133840>=2133840 at -e line 1.

I think the attached patch is better.

Basically we only need to reparse if we see a \xDF and we are in
unicode semantics.

Yves

Yves' patch unnecessarily upgrades to UTF-8. And on irc, he indicated he
was conflating in his mind upgrading and reparsing.

Yes, im a bit behind the times.

Attached is my patch.

Yep, makes sense to me. Should we wrap this logic in a well named
macro like we do with the other related macros?

The problem occurs when we are under /d semantics. An EXACTF node is
generated, which means it may have different meanings if the eventual
matched target string is in UTF-8. Then the \N{} is seen, which changes the
rules to unicode, as \N is a unicode construct. Then the \xDF comes along
and sees that it is an EXACTF node and so doesn't do anything special
besides setting a flag.

What really should happen in this situation is addressed in my patch. It
restarts parsing, having set it to be using unicode rules. Then instead of
an EXACTF node, it creates an EXACTFU. In these nodes, the \xDF gets folded
in pass1 and you don't have this problem

Yves wanted to know why it works now if the \xDF comes before the \N{}. The
reason is that flag. The \N{} comes along after the flag is set, and is
smart enough to look at the flag and restart the parsing with unicode rules,
so that the EXACTFU is generated before the \xDF is seen.

No, that wasn't my point :-),

I was conflating reparse and upgrade in some of my communications and
thinking, since upgrade implies reparse, and when I added the reparse
logic it always upgraded.

So when you were saying "it shouldnt upgrade", and I was looking at
the DEBUG 'ALL' output, I was seeing reparse messages in the
non-failing case and assumed that meant upgrade.

Once the fact that you can reparse without upgrading penetrated my
skull things became clear. I think we should add diagnostic messages
that shows what is going on.

All I was concerned with was that /00\N{U+0}\xDF/ and /00\xDF\N{U+0}/
should reparse (and also upgrade) the same number of times, be it 0 or
1, for each possibility.

FWIW, I would like to push a different patch to blead, and we can use
your patch for the maint-releases.

I wrapped all the macros that are used for RESTART_PASS1 (including
your patch) and gave them names, etc. I think we can silently fix this
in blead in such a patch without worrying about it revealing anything
about the CVE aspect of this. But we need to get back patches done,
and then do a mega-cve release along with the other issue we have
pending. ;-)

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Feb 3, 2018

From @demerphq

wrap_restart.patch
commit 45fa60a82457aeca7f3e1cf2f486005e0bb94f22
Author: Yves Orton <demerphq@gmail.com>
Date:   Sat Feb 3 14:02:17 2018 +0100

    wrap RESTART_PASS1 logic in a macro layer for clarity and reduction of duplicate code

diff --git a/regcomp.c b/regcomp.c
index 6dbfed5..c139ae8 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -359,6 +359,33 @@ struct RExC_state_t {
             }                                                               \
     } STMT_END
 
+#define REQUIRE_LATIN_RULES(flagp)                                          \
+    STMT_START {                                                            \
+            if (RExC_uni_semantics) {                                       \
+                *flagp |= RESTART_PASS1;                                    \
+                return NULL;                                                \
+            }                                                               \
+    } STMT_END
+
+
+#define RETURN_NULL_ON_RESTART_OR_FLAGS(flags,flagp,extra)                  \
+    STMT_START {                                                            \
+            if ((flags) & (RESTART_PASS1|NEED_UTF8|(extra))) {              \
+                *(flagp) = (flags) & (RESTART_PASS1|NEED_UTF8|(extra));     \
+                return NULL;                                                \
+            }                                                               \
+    } STMT_END
+
+
+#define RETURN_NULL_ON_RESTART_FLAGP_OR_FLAGS(flagp,extra)  \
+            if (*(flagp) & (RESTART_PASS1|(extra))) return NULL
+
+#define MUST_RESTART(flags) ((flags) & (RESTART_PASS1))
+
+
+#define RETURN_NULL_ON_RESTART(flags,flagp) RETURN_NULL_ON_RESTART_OR_FLAGS(flags,flagp,0)
+#define RETURN_NULL_ON_RESTART_FLAGP(flagp) RETURN_NULL_ON_RESTART_FLAGP_OR_FLAGS(flagp,0)
+
 /* This converts the named class defined in regcomp.h to its equivalent class
  * number defined in handy.h. */
 #define namedclass_to_classnum(class)  ((int) ((class) / 2))
@@ -7204,14 +7231,14 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
         at least some part of the pattern, and therefore must convert the whole
         thing.
         -- dmq */
-        if (flags & RESTART_PASS1) {
+        if (MUST_RESTART(flags)) {
             if (flags & NEED_UTF8) {
                 S_pat_upgrade_to_utf8(aTHX_ pRExC_state, &exp, &plen,
                 pRExC_state->code_blocks ? pRExC_state->code_blocks->count : 0);
+                DEBUG_PARSE_r(Perl_re_printf( aTHX_ "Need to redo pass 1 after upgrade\n"));
             }
             else {
-                DEBUG_PARSE_r(Perl_re_printf( aTHX_
-                "Need to redo pass 1\n"));
+                DEBUG_PARSE_r(Perl_re_printf( aTHX_ "Need to redo pass 1\n"));
             }
 
             goto redo_first_pass;
@@ -10864,10 +10891,7 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
                 }
 
                 ret = reg(pRExC_state, 's', &flags, depth+1);
-                if (flags & (RESTART_PASS1|NEED_UTF8)) {
-                    *flagp = flags & (RESTART_PASS1|NEED_UTF8);
-                    return NULL;
-                }
+                RETURN_NULL_ON_RESTART(flags,flagp);
 
                 return ret;
             }
@@ -11249,10 +11273,7 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
 			    ret->flags = 1;
 
                         tail = reg(pRExC_state, 1, &flag, depth+1);
-                        if (flag & (RESTART_PASS1|NEED_UTF8)) {
-                            *flagp = flag & (RESTART_PASS1|NEED_UTF8);
-                            return NULL;
-                        }
+                        RETURN_NULL_ON_RESTART(flag,flagp);
                         REGTAIL(pRExC_state, ret, tail);
 			goto insert_if;
 		    }
@@ -11361,10 +11382,7 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
                     REGTAIL(pRExC_state, ret, reganode(pRExC_state, IFTHEN, 0));
                     br = regbranch(pRExC_state, &flags, 1,depth+1);
 		    if (br == NULL) {
-                        if (flags & (RESTART_PASS1|NEED_UTF8)) {
-                            *flagp = flags & (RESTART_PASS1|NEED_UTF8);
-                            return NULL;
-                        }
+                        RETURN_NULL_ON_RESTART(flags,flagp);
                         FAIL2("panic: regbranch returned NULL, flags=%#" UVxf,
                               (UV) flags);
                     } else
@@ -11382,10 +11400,7 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
                         lastbr = reganode(pRExC_state, IFTHEN, 0);
 
                         if (!regbranch(pRExC_state, &flags, 1,depth+1)) {
-                            if (flags & (RESTART_PASS1|NEED_UTF8)) {
-                                *flagp = flags & (RESTART_PASS1|NEED_UTF8);
-                                return NULL;
-                            }
+                            RETURN_NULL_ON_RESTART(flags,flagp);
                             FAIL2("panic: regbranch returned NULL, flags=%#" UVxf,
                                   (UV) flags);
                         }
@@ -11490,10 +11505,7 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
     /*     branch_len = (paren != 0); */
 
     if (br == NULL) {
-        if (flags & (RESTART_PASS1|NEED_UTF8)) {
-            *flagp = flags & (RESTART_PASS1|NEED_UTF8);
-            return NULL;
-        }
+        RETURN_NULL_ON_RESTART(flags,flagp);
         FAIL2("panic: regbranch returned NULL, flags=%#" UVxf, (UV) flags);
     }
     if (*RExC_parse == '|') {
@@ -11537,10 +11549,7 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
         br = regbranch(pRExC_state, &flags, 0, depth+1);
 
 	if (br == NULL) {
-            if (flags & (RESTART_PASS1|NEED_UTF8)) {
-                *flagp = flags & (RESTART_PASS1|NEED_UTF8);
-                return NULL;
-            }
+            RETURN_NULL_ON_RESTART(flags,flagp);
             FAIL2("panic: regbranch returned NULL, flags=%#" UVxf, (UV) flags);
         }
         REGTAIL(pRExC_state, lastbr, br);               /* BRANCH -> BRANCH. */
@@ -11755,10 +11764,7 @@ S_regbranch(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, I32 first, U32 depth)
 	if (latest == NULL) {
 	    if (flags & TRYAGAIN)
 		continue;
-            if (flags & (RESTART_PASS1|NEED_UTF8)) {
-                *flagp = flags & (RESTART_PASS1|NEED_UTF8);
-                return NULL;
-            }
+            RETURN_NULL_ON_RESTART(flags,flagp);
             FAIL2("panic: regpiece returned NULL, flags=%#" UVxf, (UV) flags);
 	}
 	else if (ret == NULL)
@@ -11828,11 +11834,8 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 
     ret = regatom(pRExC_state, &flags,depth+1);
     if (ret == NULL) {
-	if (flags & (TRYAGAIN|RESTART_PASS1|NEED_UTF8))
-	    *flagp |= flags & (TRYAGAIN|RESTART_PASS1|NEED_UTF8);
-        else
-            FAIL2("panic: regatom returned NULL, flags=%#" UVxf, (UV) flags);
-	return(NULL);
+        RETURN_NULL_ON_RESTART_OR_FLAGS(flags,flagp,TRYAGAIN);
+        FAIL2("panic: regatom returned NULL, flags=%#" UVxf, (UV) flags);
     }
 
     op = *RExC_parse;
@@ -12349,10 +12352,7 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state,
         SvREFCNT_dec_NN(substitute_parse);
 
         if (! *node_p) {
-            if (flags & (RESTART_PASS1|NEED_UTF8)) {
-                *flagp = flags & (RESTART_PASS1|NEED_UTF8);
-                return FALSE;
-            }
+            RETURN_NULL_ON_RESTART(flags,flagp);
             FAIL2("panic: reg returned NULL to grok_bslash_N, flags=%#" UVxf,
                 (UV) flags);
         }
@@ -12752,8 +12752,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                        NULL,
                        NULL);
         if (ret == NULL) {
-            if (*flagp & (RESTART_PASS1|NEED_UTF8))
-                return NULL;
+            RETURN_NULL_ON_RESTART_FLAGP_OR_FLAGS(flagp,NEED_UTF8);
             FAIL2("panic: regclass returned NULL to regatom, flags=%#" UVxf,
                   (UV) *flagp);
         }
@@ -12777,10 +12776,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 		    }
 		    goto tryagain;
 		}
-                if (flags & (RESTART_PASS1|NEED_UTF8)) {
-                    *flagp = flags & (RESTART_PASS1|NEED_UTF8);
-                    return NULL;
-                }
+                RETURN_NULL_ON_RESTART(flags,flagp);
                 FAIL2("panic: reg returned NULL to regatom, flags=%#" UVxf,
                                                                  (UV) flags);
 	}
@@ -13065,8 +13061,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                            TRUE, /* Allow an optimized regnode result */
                            NULL,
                            NULL);
-            if (*flagp & RESTART_PASS1)
-                return NULL;
+            RETURN_NULL_ON_RESTART_FLAGP(flagp);
             /* regclass() can only return RESTART_PASS1 and NEED_UTF8 if
              * multi-char folds are allowed.  */
             if (!ret)
@@ -13105,8 +13100,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                 break;
             }
 
-            if (*flagp & RESTART_PASS1)
-                return NULL;
+            RETURN_NULL_ON_RESTART_FLAGP(flagp);
 
             /* Here, evaluates to a single code point.  Go get that */
             RExC_parse = parse_start;
@@ -13434,8 +13428,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                         ) {
                             if (*flagp & NEED_UTF8)
                                 FAIL("panic: grok_bslash_N set NEED_UTF8");
-                            if (*flagp & RESTART_PASS1)
-                                return NULL;
+                            RETURN_NULL_ON_RESTART_FLAGP(flagp);
 
                             /* Here, it wasn't a single code point.  Go close
                              * up this EXACTish node.  The switch() prior to
@@ -13756,6 +13749,15 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                          * /u.  This includes the multi-char fold SHARP S to
                          * 'ss' */
                         if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
+
+                            /* If the node started out having uni rules, we
+                             * wouldn't have gotten here.  So this means
+                             * something in the middle has changed it, but
+                             * didn't think it needed to reparse.  But this
+                             * sharp s now does indicate the need for
+                             * reparsing. */
+                            REQUIRE_LATIN_RULES(flagp);
+
                             RExC_seen_unfolded_sharp_s = 1;
                             maybe_exactfu = FALSE;
                         }
@@ -16383,8 +16385,8 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth,
 
                         if (*flagp & NEED_UTF8)
                             FAIL("panic: grok_bslash_N set NEED_UTF8");
-                        if (*flagp & RESTART_PASS1)
-                            return NULL;
+
+                        RETURN_NULL_ON_RESTART_FLAGP(flagp);
 
                         if (cp_count < 0) {
                             vFAIL("\\N in a character class must be a named character: \\N{...}");
@@ -17353,7 +17355,7 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth,
 
 	ret = reg(pRExC_state, 1, &reg_flags, depth+1);
 
-	*flagp |= reg_flags&(HASWIDTH|SIMPLE|SPSTART|POSTPONED|RESTART_PASS1|NEED_UTF8);
+        *flagp |= reg_flags & (HASWIDTH|SIMPLE|SPSTART|POSTPONED|RESTART_PASS1|NEED_UTF8);
 
         /* And restore so can parse the rest of the pattern */
         RExC_parse = save_parse;

@p5pRT
Copy link
Author

p5pRT commented Feb 4, 2018

From @khwilliamson

On 02/03/2018 06​:32 AM, demerphq wrote​:

On 2 February 2018 at 23​:39, Karl Williamson <public@​khwilliamson.com> wrote​:

On 02/02/2018 02​:26 PM, demerphq wrote​:

On 2 February 2018 at 17​:40, demerphq <demerphq@​gmail.com> wrote​:

On 2 February 2018 at 13​:21, demerphq <demerphq@​gmail.com> wrote​:

FWIW, this looks like a reparse bug. The pattern starts out latin-1.
We should trigger "reparse as utf8" when we see the \N{U+0}, which
then should do the correct sizing.

So I would look into what is happening with \N{U+0} and why it isnt
triggering a reparse of the pattern.

Yep, that is it, or more or less, although we ARE trying to trigger
reparse, but that logic has been "optimized" such that it does not do
the reparse if it has not seen \xDF first, and later when it DOES see
the \xDF it also does not realize it needs to upgrade the pattern. The
following patch fixes it, although its not the right patch.

$ git diff
diff --git a/regcomp.c b/regcomp.c
index 6dbfed5..2668df5 100644
--- a/regcomp.c
+++ b/regcomp.c
@​@​ -352,7 +352,7 @​@​ struct RExC_state_t {
assert(PASS1);
\
set_regex_charset(&RExC_flags, REGEX_UNICODE_CHARSET);
\
RExC_uni_semantics = 1;
\
- if (RExC_seen_unfolded_sharp_s) {
\
+ if (!UTF || RExC_seen_unfolded_sharp_s) {
\
*flagp |= RESTART_PASS1;
\
return restart_retval;
\
}

You have evolved some of this handling beyond my current understanding
of the subtleties, so it will take me some time to work a better
patch.

With my patch​:

$ ./perl -Ilib -Mre=Debug,ALL -le'/00\N{U+2}\xDF/i'
Assembling pattern from 1 elements
Compiling REx "00\N{U+2}\xDF"
Starting first pass (sizing)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
Need to redo pass 1
Starting first pass (sizing)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
Required size 4 nodes
Starting second pass (creation)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
<> | 5| lsbr~ tying lastbr EXACTFU <00\2ss> (1) to
ender END (4) offset 3
| | tail~ EXACTFU <00\2ss> (1) -> END
Starting post parse optimization
first​:> 1​: EXACTFU <00\2ss> (4)
first at 1
RExC_seen​:
study_chunk stopparen=-1 recursed_count=1 depth=0 recursed_depth=0
scan=1118124 last=1118134
Peep​: Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' 0​:0/0 *Fixed​: '' @​
0/0 Float​: '' @​ 0/0
Peep> 1​: EXACTFU <00\2ss> (4) [ SCF_DO_SUBSTR SCF_WHILEM_VISITED_POS ]
join> 1​: EXACTFU <00\2ss> (4)
commit​: Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 *Fixed​:
'' @​ 0/0 Float​: '' @​ 0/0
pre-fin​: Pos​:4/1 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 Fixed​:
'' @​ 0/0 *Float​: '' @​ 0/0
post-fin​: Pos​:4/1 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 Fixed​:
'' @​ 0/0 *Float​: '' @​ 0/0
commit​: Pos​:4/1 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 Fixed​: ''
@​ 0/0 *Float​: '' @​ 0/0
minlen​: 4 r->minlen​:0 maxlen​:5
study_chunk_recursed_count​: 1
RExC_seen​:
Final program​:
1​: EXACTFU_SS <00\2ss> (4)
4​: END (0)
stclass EXACTFU_SS <00\2ss> minlen 4
r->extflags​: FOLD UNICODE
r->intflags​: [none-set]
Freeing REx​: "00\N{U+2}\xDF"

---------->8--------------->8-------------

Without​:

$ ./perl -Ilib -Mre=Debug,ALL -le'/00\N{U+2}\xDF/i'
Assembling pattern from 1 elements
Compiling REx "00\N{U+2}\xDF"
Starting first pass (sizing)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
Required size 3 nodes
Starting second pass (creation)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
panic​: reg_node overrun trying to emit 0, 2133840>=2133840 at -e line 1.

I think the attached patch is better.

Basically we only need to reparse if we see a \xDF and we are in
unicode semantics.

Yves

Yves' patch unnecessarily upgrades to UTF-8. And on irc, he indicated he
was conflating in his mind upgrading and reparsing.

Yes, im a bit behind the times.

Attached is my patch.

Yep, makes sense to me. Should we wrap this logic in a well named
macro like we do with the other related macros?

The problem occurs when we are under /d semantics. An EXACTF node is
generated, which means it may have different meanings if the eventual
matched target string is in UTF-8. Then the \N{} is seen, which changes the
rules to unicode, as \N is a unicode construct. Then the \xDF comes along
and sees that it is an EXACTF node and so doesn't do anything special
besides setting a flag.

What really should happen in this situation is addressed in my patch. It
restarts parsing, having set it to be using unicode rules. Then instead of
an EXACTF node, it creates an EXACTFU. In these nodes, the \xDF gets folded
in pass1 and you don't have this problem

Yves wanted to know why it works now if the \xDF comes before the \N{}. The
reason is that flag. The \N{} comes along after the flag is set, and is
smart enough to look at the flag and restart the parsing with unicode rules,
so that the EXACTFU is generated before the \xDF is seen.

No, that wasn't my point :-),

I was conflating reparse and upgrade in some of my communications and
thinking, since upgrade implies reparse, and when I added the reparse
logic it always upgraded.

So when you were saying "it shouldnt upgrade", and I was looking at
the DEBUG 'ALL' output, I was seeing reparse messages in the
non-failing case and assumed that meant upgrade.

Once the fact that you can reparse without upgrading penetrated my
skull things became clear. I think we should add diagnostic messages
that shows what is going on.

All I was concerned with was that /00\N{U+0}\xDF/ and /00\xDF\N{U+0}/
should reparse (and also upgrade) the same number of times, be it 0 or
1, for each possibility.

FWIW, I would like to push a different patch to blead, and we can use
your patch for the maint-releases.

I wrapped all the macros that are used for RESTART_PASS1 (including
your patch) and gave them names, etc. I think we can silently fix this
in blead in such a patch without worrying about it revealing anything
about the CVE aspect of this. But we need to get back patches done,
and then do a mega-cve release along with the other issue we have
pending. ;-)

cheers,
Yves

7 months ago I made a proposal, fully explained here, and in followup posts​:
http​://nntp.perl.org/group/perl.perl5.porters/245342

But briefly, under /i matching there may very well be characters
included in the pattern that only match themselves even under /i.
Spaces and punctuation are two prominent members of this class.

One of the things the optimizer does is to look for sequences of bytes
that have to exist in the target string in order for it to match. It
can then search the target using boyer moore to quickly find the
possible places in the string to start the match, or to rule out a match
immediately.

But if the whole pattern is /i, there is no such sequence of bytes that
the optimizer can grab onto.

What I proposed is for the regex parser to split the EXACTish nodes into
two types, one that consists solely of characters that can match a
different character than themselves under /i, and nodes that consist
solely of characters that match only themselves even under /i.

That would mean the optimizer would very likely be able to find a fixed
sequence in just about any /i pattern longer than a few bytes.

I chatted with Yves about this yesterday, and he liked it; which he also
did last July.

I now have a branch that implements this. The reason it is relevant to
this security ticket, is that implementing this causes this bug to go
away with no special effort required. Thus if we were to put it in
blead, no one would notice that it is fixing a bug.

I therefore propose that we do this, and leave my basic patch to include
in the CVE that must happen for other reasons, like Yves suggested.

In the proposal of July, I suggested that a new compile-time only node
be used to mark those nodes which are under /i but contain no characters
that match other than themselves under /i. Having talked with Yves, and
recently looked at the pattern matching code, I'm not so certain about
this. The reason to do it would be to join these nodes back in, after
the optimizer is done with them, so that the results are packed into
fewer nodes.

The reason for doing that is that there is overhead in the engine
whenever we switch to the next node. But the reason for not doing it is
that it is slower to match /i nodes than ones that an memEQ can be used on.

I suppose we could join short nodes, and leave longer ones separate.

And finally, I have realized some more things related to this ticket.
The bottom line cause is that we start parsing and we assume /d rules,
we find the \N{} which tells us we need to use Unicode rules, and then
we find the \xDF which expands to ss under Unicode rules but remains a
single byte otherwise. But the code doesn't change the rules in mid
parse, so the DF is parsed still assuming that /d is in effect so it
doesn't expand, until pass2 when the space has already been calculated.
The reason my new code works is that it will change the rules in
midparse, so we know immediately that the DF needs two bytes.

If the order in the ticket's pattern were reversed. and the DF came
first, it currently sets a flag, which the \N{} code sees and restarts
the parse. That would continue under my new scheme. Except, that we
don't actually need to restart the parse here. We need only restart
parsing the current EXACTish node. That is far less work.

In general, we do need to restart the parse, because the \N{} could be a
long ways away in the pattern, in a different node, and then we have to
got back and restart the whole parse. But a variable locale to the loop
that parses this node could be used to avoid restarting the whole parse.

@p5pRT
Copy link
Author

p5pRT commented Feb 4, 2018

From @demerphq

On 4 February 2018 at 01​:11, Karl Williamson <public@​khwilliamson.com> wrote​:

On 02/03/2018 06​:32 AM, demerphq wrote​:

[snip]

I wrapped all the macros that are used for RESTART_PASS1 (including
your patch) and gave them names, etc. I think we can silently fix this
in blead in such a patch without worrying about it revealing anything
about the CVE aspect of this. But we need to get back patches done,
and then do a mega-cve release along with the other issue we have
pending. ;-)

[snip]

7 months ago I made a proposal, fully explained here, and in followup posts​:
http​://nntp.perl.org/group/perl.perl5.porters/245342

[snip]

I now have a branch that implements this. The reason it is relevant to this
security ticket, is that implementing this causes this bug to go away with
no special effort required. Thus if we were to put it in blead, no one
would notice that it is fixing a bug.

I don't follow that part entirely. I would like to see the branch. But
I am fine with the plan. It sounds like a good thing regardless.

I think we should do the following at this point for /blead/​:

I will push the patch i posted in my previous reply, with the cleanup
of the restart logic that includes Karls original patch. It removes a
ton of replicated code, and makes things easier to understand, and
fixes the original bug. Because the fix is inside of a much larger
set of "trivial" changes the impact is going to be non-obvious, it
isnt going to set of some wave of exploits.

Then afterwards Karl can apply his changes to EXACT joining and typing
at different phases on top of them.

For backports we will do the minimal patch proposed by Karl, and
release along with the Trie issue, (132063) which is MUCH more serious
IMO.

Then we fix both with one back-release.

Personally I think the Trie issue justifies posting fixes back to 5.22

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Feb 5, 2018

From @tonycoz

On Fri, 02 Feb 2018 14​:40​:35 -0800, public@​khwilliamson.com wrote​:

On 02/02/2018 02​:26 PM, demerphq wrote​:

On 2 February 2018 at 17​:40, demerphq <demerphq@​gmail.com> wrote​:

On 2 February 2018 at 13​:21, demerphq <demerphq@​gmail.com> wrote​:

FWIW, this looks like a reparse bug. The pattern starts out latin-
1.
We should trigger "reparse as utf8" when we see the \N{U+0}, which
then should do the correct sizing.

So I would look into what is happening with \N{U+0} and why it isnt
triggering a reparse of the pattern.

Yep, that is it, or more or less, although we ARE trying to trigger
reparse, but that logic has been "optimized" such that it does not
do
the reparse if it has not seen \xDF first, and later when it DOES
see
the \xDF it also does not realize it needs to upgrade the pattern.
The
following patch fixes it, although its not the right patch.

$ git diff
diff --git a/regcomp.c b/regcomp.c
index 6dbfed5..2668df5 100644
--- a/regcomp.c
+++ b/regcomp.c
@​@​ -352,7 +352,7 @​@​ struct RExC_state_t {
assert(PASS1);
\
set_regex_charset(&RExC_flags,
REGEX_UNICODE_CHARSET); \
RExC_uni_semantics = 1;
\
- if (RExC_seen_unfolded_sharp_s) {
\
+ if (!UTF || RExC_seen_unfolded_sharp_s) {
\
*flagp |= RESTART_PASS1;
\
return restart_retval;
\
}

You have evolved some of this handling beyond my current
understanding
of the subtleties, so it will take me some time to work a better
patch.

With my patch​:

$ ./perl -Ilib -Mre=Debug,ALL -le'/00\N{U+2}\xDF/i'
Assembling pattern from 1 elements
Compiling REx "00\N{U+2}\xDF"
Starting first pass (sizing)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
Need to redo pass 1
Starting first pass (sizing)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
Required size 4 nodes
Starting second pass (creation)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
<> | 5| lsbr~ tying lastbr EXACTFU <00\2ss> (1) to
ender END (4) offset 3
| | tail~ EXACTFU <00\2ss> (1) -> END
Starting post parse optimization
first​:> 1​: EXACTFU <00\2ss> (4)
first at 1
RExC_seen​:
study_chunk stopparen=-1 recursed_count=1 depth=0 recursed_depth=0
scan=1118124 last=1118134
Peep​: Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' 0​:0/0 *Fixed​: ''
@​
0/0 Float​: '' @​ 0/0
Peep> 1​: EXACTFU <00\2ss> (4) [ SCF_DO_SUBSTR SCF_WHILEM_VISITED_POS
Peep> ]
join> 1​: EXACTFU <00\2ss> (4)
commit​: Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 *Fixed​:
'' @​ 0/0 Float​: '' @​ 0/0
pre-fin​: Pos​:4/1 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 Fixed​:
'' @​ 0/0 *Float​: '' @​ 0/0
post-fin​: Pos​:4/1 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0
Fixed​:
'' @​ 0/0 *Float​: '' @​ 0/0
commit​: Pos​:4/1 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' -1​:0/0 Fixed​:
''
@​ 0/0 *Float​: '' @​ 0/0
minlen​: 4 r->minlen​:0 maxlen​:5
study_chunk_recursed_count​: 1
RExC_seen​:
Final program​:
1​: EXACTFU_SS <00\2ss> (4)
4​: END (0)
stclass EXACTFU_SS <00\2ss> minlen 4
r->extflags​: FOLD UNICODE
r->intflags​: [none-set]
Freeing REx​: "00\N{U+2}\xDF"

---------->8--------------->8-------------

Without​:

$ ./perl -Ilib -Mre=Debug,ALL -le'/00\N{U+2}\xDF/i'
Assembling pattern from 1 elements
Compiling REx "00\N{U+2}\xDF"
Starting first pass (sizing)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
Required size 3 nodes
Starting second pass (creation)
<00\N{U+2}\xDF> | 1| reg
| | brnc
| | piec
| | atom
panic​: reg_node overrun trying to emit 0, 2133840>=2133840 at -e
line 1.

I think the attached patch is better.

Basically we only need to reparse if we see a \xDF and we are in
unicode semantics.

Yves

Yves' patch unnecessarily upgrades to UTF-8. And on irc, he indicated
he was conflating in his mind upgrading and reparsing.

Attached is my patch.

The problem occurs when we are under /d semantics. An EXACTF node is
generated, which means it may have different meanings if the eventual
matched target string is in UTF-8. Then the \N{} is seen, which
changes
the rules to unicode, as \N is a unicode construct. Then the \xDF
comes
along and sees that it is an EXACTF node and so doesn't do anything
special besides setting a flag.

What really should happen in this situation is addressed in my patch.
It restarts parsing, having set it to be using unicode rules. Then
instead of an EXACTF node, it creates an EXACTFU. In these nodes, the
\xDF gets folded in pass1 and you don't have this problem

Yves wanted to know why it works now if the \xDF comes before the
\N{}.
The reason is that flag. The \N{} comes along after the flag is set,
and is smart enough to look at the flag and restart the parsing with
unicode rules, so that the EXACTFU is generated before the \xDF is
seen.

The patch makes sense to me, and prevents the failures in both my simple test case and the original.

I was confused about the meaning of seen_unfolded_sharp_s, since we were seeing the \xDF in a /i regexp, thinking "hey this is folded", but I assume it means something like "we've emitted a \xDF, but because we weren't in unicode rules, we didn't fold it".

The original test case causes overflows back to 5.18, but the patch applies back to through the perls we support, 5.24 and 5.26.

The CVE form requests the versions an issue applies to, while the specific code that's failing here only exists in 5.24 and later, I suspect the issue was inherited from those earlier versions when the code was re-worked.

So I expect I'll put versions 5.18 through 5.26 on the CVE request.

The initial request won't include enough detail to reproduce the issue, here's what I expect to update the form with once the issue is public​:

Vulnerability type​:
  Buffer Overflow

Vendor of the product​:
  Perl5 Porters

Product​:
  perl

Version​:
  5.18 through 5.26

Has vendor confirmed or acknowledged the vulnerability?
  Yes

Attack vector​:
  Context dependent

Impact​:
  Denial of service (it can crash perl)

- the other choices for Impact are​:
  Code Execution - this may be possible, but we don't have a confirmed exploit
  Information Disclosure - no direct disclosure
  Escalation of Privilege - nope
  Other - not that I can think of

Affected Components​:
  regcomp.c, S_regatom()

Attack vector
 
An attacker supplies a regular expression containing one or more \xDF characters after an escape putting the regexp into unicode matching mode, such as a \N{} escape. Each \xDF character adds one byte of overflow, and any other text in the regular expression is written in order, providing the attacker control over the bytes written to the overflowed region.

Suggested description of the vulnerability for use in the CVE

  A crafted regular expression can cause a heap buffer overflow, with control over the bytes written

Discoverer(s)/Credits

  Brian Carpenter <brian.carpenter@​gmail.com>

Reference(s)

  https://rt.perl.org/Public/Bug/Display.html?id=132227

Additional Information

(blank)

I'll only fill in Vulnerability Type through version and the description when initially requesting it (tomorrow) and fill the rest in when it goes public.

To see the form, visit​:

  https://cveform.mitre.org/

and select "Request a CVE ID".

Tony

@p5pRT
Copy link
Author

p5pRT commented Feb 5, 2018

From @tonycoz

On Sun, 04 Feb 2018 16​:17​:13 -0800, tonyc wrote​:

Suggested description of the vulnerability for use in the CVE

A crafted regular expression can cause a heap buffer overflow, with
control over the bytes written

A crafted regular expression can cause a heap buffer write overflow, with control over the bytes written

Tony

@p5pRT
Copy link
Author

p5pRT commented Feb 5, 2018

From @khwilliamson

On 02/04/2018 02​:10 AM, demerphq wrote​:

I now have a branch that implements this. The reason it is relevant to this
security ticket, is that implementing this causes this bug to go away with
no special effort required. Thus if we were to put it in blead, no one
would notice that it is fixing a bug.
I don't follow that part entirely. I would like to see the branch. But
I am fine with the plan. It sounds like a good thing regardless.

I think we should do the following at this point for/blead/​:

I will push the patch i posted in my previous reply, with the cleanup
of the restart logic that includes Karls original patch. It removes a
ton of replicated code, and makes things easier to understand, and
fixes the original bug. Because the fix is inside of a much larger
set of "trivial" changes the impact is going to be non-obvious, it
isnt going to set of some wave of exploits.

I think it would be better to to do my (attached) patch first, but I'm
not wedded to the idea.

Basically, the problem goes away, in passing, because a new algorithm is
used that completely side steps the issue. Thus, there is nothing to
hide that we hope people won't notice. It's simply a different way of
doing things that happens to avoid the current pitfalls.

The algorithm is to always start out with an EXACT node. As long as
only non-folding characters are found, they get put into that node. At
the first one found, the old node is closed up, and a new one started
consisting of only folding characters, until a non-folding one is found....

The reason blead fails is the node type is computed before the \N{} is
processed. It sees that there has been no sharp s so far so continues
on with the node (now with the wrong type in Pass1), so that the sharp s
is later processed (only in pass 1) under the wrong type. The new
algorithm changes the node type as we go along, so that the sharp s is
processed with the correct type even in pass 1.

@p5pRT
Copy link
Author

p5pRT commented Feb 5, 2018

From @khwilliamson

0003-regcomp.c-Under-i-segregate-folding-vs-non-folding-c.patch
From 084d1c50e1e4072257acd6a855a26796204eeb91 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Sun, 4 Feb 2018 19:43:00 -0700
Subject: [PATCH 3/3] regcomp.c: Under/i segregate folding vs non-folding
 characters

For matching sequences of characters, the regex compiler generates
various flavors of EXACT-type nodes.  The optimizer uses those nodes to
look for sequences that must be in the matched string.  In this way, the
pattern matching engine may be able to quickly rule out any possible
match altogether, or to narrow down the places in the target string that
might match.

Under /i matching, this generally has not been possible, because there
is no fixed string that the optimizer can grab onto, as something can
match, say, either 'A' or 'a', etc.  However, in many patterns that
contain text, there are characters that are fixed even under /i.  Things
like tabs, space, and punctuation, for example.

This commit segregates such folding vs non-folding characters into
separate nodes.  I proposed this 7 months ago:

http://nntp.perl.org/group/perl.perl5.porters/245342

and in talking with Yves recently, decided to go ahead with it.

In the proposal of July, I suggested that a new node type be used to
mark those nodes which are under /i but contain no characters that match
other than themselves under /i.  These nodes could be joined with either
a plain EXACT node, or an EXACTFish one to create longer nodes.

The reason for joining is that there is overhead in the engine whenever
we switch to the next node.  But the reason for not doing it is that it
is slower to match /i nodes than ones that an memEQ can be used on.

I suppose we could join short nodes, and leave longer ones separate, but
that decision can be deferred based on real-world experience.

This patch also consolidates into one place the handling of the Latin
Sharp S, in order to avoid extra #ifdefs, and cause the logic to be
linearly shown.
---
 regcomp.c | 246 ++++++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 158 insertions(+), 88 deletions(-)

diff --git a/regcomp.c b/regcomp.c
index f5b2ddb871..928a70bc65 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -3695,10 +3695,7 @@ S_construct_ahocorasick_from_trie(pTHX_ RExC_state_t *pRExC_state, regnode *sour
  * XXX khw thinks this should be enhanced to fill EXACT (at least) nodes as full
  * as possible, even if that means splitting an existing node so that its first
  * part is moved to the preceeding node.  This would maximise the efficiency of
- * memEQ during matching.  Elsewhere in this file, khw proposes splitting
- * EXACTFish nodes into portions that don't change under folding vs those that
- * do.  Those portions that don't change may be the only things in the pattern that
- * could be used to find fixed and floating strings.
+ * memEQ during matching.
  *
  * If a node is to match under /i (folded), the number of characters it matches
  * can be different than its character length if it contains a multi-character
@@ -13297,7 +13294,18 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 
 	    char *s0;
 	    U8 upper_parse = MAX_NODE_STRING_SIZE;
-            U8 node_type = compute_EXACTish(pRExC_state);
+
+            /* We start out as an EXACT node, even if under /i, until we find a
+             * character which is in a fold.  The algorithm now segregates into
+             * separate nodes, characters that fold from those that don't under
+             * /i.  (This hopefull will create nodes that are fixed strings
+             * even under /i, giving the optimizer something to grab onto to.)
+             * So, if a node has something in it and the next character is in
+             * the opposite category, that node is closed up, and the function
+             * returns.  Then regatom is called again, and a new node is
+             * created for the new category. */
+            U8 node_type = EXACT;
+
             bool next_is_quantifier;
             char * oldp = NULL;
 
@@ -13311,14 +13319,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
              * which don't participate in folds with Latin1-range characters,
              * as the latter's folds aren't known until runtime.  (We don't
              * need to figure this out until pass 2) */
-            bool maybe_exactfu = PASS2
-                               && (node_type == EXACTF || node_type == EXACTFL);
-
-            /* If a folding node contains only code points that don't
-             * participate in folds, it can be changed into an EXACT node,
-             * which allows the optimizer more things to look for, and is
-             * faster to match */
-            bool maybe_exact;
+            bool maybe_exactfu = PASS2;
 
             /* The node_type may change below, but since the size of the node
              * doesn't change, it works */
@@ -13332,15 +13333,6 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 
 	  reparse:
 
-            /* We look for the EXACTFish to EXACT node optimizaton only if
-             * folding.  (And we don't need to figure this out until pass 2).
-             * XXX It might actually make sense to split the node into portions
-             * that are exact and ones that aren't, so that we could later use
-             * the exact ones to find the longest fixed and floating strings.
-             * One would want to join them back into a larger node.  One could
-             * use a pseudo regnode like 'EXACT_ORIG_FOLD' */
-            maybe_exact = FOLD && PASS2;
-
             /* This breaks under rare circumstances.  If folding, we do not
              * want to split a node at a character that is a non-final in a
              * multi-char fold, as an input string could just happen to want to
@@ -13357,7 +13349,9 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 
             /* Here, we have a literal character.  Find the maximal string of
              * them in the input that we can fit into a single EXACTish node.
-             * We quit at the first non-literal or when the node gets full */
+             * We quit at the first non-literal or when the node gets full, or
+             * under /i the categorization of folding/non-folding character
+             * changes */
 	    for (p = RExC_parse; len < upper_parse && p < RExC_end; ) {
 
                 /* In most cases each iteration adds one byte to the output.
@@ -13709,8 +13703,17 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                 else if (LOC && is_PROBLEMATIC_LOCALE_FOLD_cp(ender)) {
 
                     /* Here are folding under /l, and the code point is
-                     * problematic.  First, we know we can't simplify things */
-                    maybe_exact = FALSE;
+                     * problematic.  If this is the first character in the
+                     * node, change the node type to folding.   Otherwise, if
+                     * this is the first problematic character, close up the
+                     * existing node, so can start a new node with this one */
+                    if (! len) {
+                        node_type = EXACTFL;
+                    }
+                    else if (node_type == EXACT) {
+                        p = oldp;
+                        goto loopdone;
+                    }
 
                     /* This code point means we can't simplify things */
                     maybe_exactfu = FALSE;
@@ -13730,44 +13733,74 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                      * do for both passes is the PASS2 code for non-folding */
                     goto not_fold_common;
                 }
-                else /* A regular FOLD code point */
-                    if (! (   UTF
-#if    UNICODE_MAJOR_VERSION > 3 /* no multifolds in early Unicode */   \
-   || (UNICODE_MAJOR_VERSION == 3 && (   UNICODE_DOT_VERSION > 0)       \
-                                      || UNICODE_DOT_DOT_VERSION > 0)
-                            /* See comments for join_exact() as to why we fold
-                             * this non-UTF at compile time */
-                            || (   node_type == EXACTFU
-                                && ender == LATIN_SMALL_LETTER_SHARP_S)
-#endif
-                )) {
+                else                /* A regular FOLD code point */
+                     if (! UTF)
+                {
                     /* Here, are folding and are not UTF-8 encoded; therefore
                      * the character must be in the range 0-255, and is not /l.
                      * (Not /l because we already handled these under /l in
                      * is_PROBLEMATIC_LOCALE_FOLD_cp) */
-                    if (IS_IN_SOME_FOLD_L1(ender)) {
-                        maybe_exact = FALSE;
+                    if (! IS_IN_SOME_FOLD_L1(ender)) {
 
-                        /* See if the character's fold differs between /d and
-                         * /u.  This includes the multi-char fold SHARP S to
-                         * 'ss' */
-                        if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
-                            RExC_seen_unfolded_sharp_s = 1;
-                            maybe_exactfu = FALSE;
+                        /* Start a new node for this non-folding character if
+                         * previous ones in the node were folded */
+                        if (len && node_type != EXACT) {
+                            p = oldp;
+                            goto loopdone;
                         }
-                        else if (maybe_exactfu
-                            && (PL_fold[ender] != PL_fold_latin1[ender]
+
+                        *(s++) = (char) ender;
+                    }
+                    else {  /* Here, does participate in some fold */
+
+                        /* if this is the first character in the node, change
+                         * its type to folding.  Otherwise, if this is the
+                         * first folding character in the node, close up the
+                         * existing node, so can start a new node with this
+                         * one.  */
+                        if (! len) {
+                            node_type = compute_EXACTish(pRExC_state);
+                        }
+                        else if (node_type == EXACT) {
+                            p = oldp;
+                            goto loopdone;
+                        }
+
+                        /* See if the character's fold differs between /d and
+                         * /u.  On non-ancient Unicode versions, his includes
+                         * the multi-char fold SHARP S to 'ss' */
+
 #if    UNICODE_MAJOR_VERSION > 3 /* no multifolds in early Unicode */   \
    || (UNICODE_MAJOR_VERSION == 3 && (   UNICODE_DOT_VERSION > 0)       \
                                       || UNICODE_DOT_DOT_VERSION > 0)
-                                || (   len > 0
-                                    && isALPHA_FOLD_EQ(ender, 's')
-                                    && isALPHA_FOLD_EQ(*(s-1), 's'))
+
+                        if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
+                            /* See comments for join_exact() as to why we fold
+                             * this non-UTF at compile time */
+                            if (node_type == EXACTFU) {
+                                *(s++) = 's';
+
+                                /* Let the code below add in the extra 's' */
+                                ender = 's';
+                                added_len = 2;
+                            }
+                            else {
+                                RExC_seen_unfolded_sharp_s = 1;
+                                maybe_exactfu = FALSE;
+                            }
+                        }
+                        else if (   len
+                                 && isALPHA_FOLD_EQ(ender, 's')
+                                 && isALPHA_FOLD_EQ(*(s-1), 's'))
+                        {
+                            maybe_exactfu = FALSE;
+                        }
+                        else
 #endif
-                        )) {
+
+                        if (PL_fold[ender] != PL_fold_latin1[ender]) {
                             maybe_exactfu = FALSE;
                         }
-                    }
 
                         /* Even when folding, we store just the input
                          * character, as we have an array that finds its fold
@@ -13775,7 +13808,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                         *(s++) = (char) ender;
                     }
                 }
-                else {  /* FOLD, and UTF (or sharp s) */
+                else {  /* FOLD, and UTF */
                     /* Unlike the non-fold case, we do actually have to
                      * calculate the fold in pass 1.  This is for two reasons,
                      * the folded length may be longer than the unfolded, and
@@ -13784,41 +13817,72 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                      * of a potential multi-char fold, and have to back off
                      * accordingly.  */
 
-                    UV folded;
                     if (isASCII_uni(ender)) {
-                        folded = toFOLD(ender);
-                        *(s)++ = (U8) folded;
+
+                        /* As above, we close up and start a new node if the
+                         * previous characters don't match the fold/non-fold
+                         * state of this one.  And if this is the first
+                         * character in the node, and it folds, we change the
+                         * node away from being EXACT */
+                        if (! IS_IN_SOME_FOLD_L1(ender)) {
+                            if (len && node_type != EXACT) {
+                                p = oldp;
+                                goto loopdone;
+                            }
+
+                            *(s)++ = (U8) ender;
+                        }
+                        else {  /* Is in a fold */
+
+                            if (! len) {
+                                node_type = compute_EXACTish(pRExC_state);
+                            }
+                            else if (node_type == EXACT) {
+                                p = oldp;
+                                goto loopdone;
+                            }
+
+                            *(s)++ = (U8) toFOLD(ender);
+                        }
                     }
                     else {  /* Not ASCII */
                         STRLEN foldlen;
 
-                        folded = _to_uni_fold_flags(
+                        /* As above, we close up and start a new node if the
+                         * previous characters don't match the fold/non-fold
+                         * state of this one.  And if this is the first
+                         * character in the node, and it folds, we change the
+                         * node away from being EXACT */
+                        if (! _invlist_contains_cp(PL_utf8_foldable, ender)) {
+                            if (len && node_type != EXACT) {
+                                p = oldp;
+                                goto loopdone;
+                            }
+
+                            s = (char *) uvchr_to_utf8((U8 *) s, ender);
+                            added_len = UVCHR_SKIP(ender);
+                        }
+                        else {
+
+                            if (! len) {
+                                node_type = compute_EXACTish(pRExC_state);
+                            }
+                            else if (node_type == EXACT) {
+                                p = oldp;
+                                goto loopdone;
+                            }
+
+                            ender = _to_uni_fold_flags(
                                      ender,
                                      (U8 *) s,
                                      &foldlen,
                                      FOLD_FLAGS_FULL | ((ASCII_FOLD_RESTRICTED)
                                                         ? FOLD_FLAGS_NOMIX_ASCII
                                                         : 0));
-                        s += foldlen;
-                        added_len = foldlen;
-                    }
-                    /* If this node only contains non-folding code points so
-                     * far, see if this new one is also non-folding */
-                    if (maybe_exact) {
-                        if (folded != ender) {
-                            maybe_exact = FALSE;
-                        }
-                        else {
-                            /* Here the fold is the original; we have to check
-                             * further to see if anything folds to it */
-                            if (_invlist_contains_cp(PL_utf8_foldable,
-                                                        ender))
-                            {
-                                maybe_exact = FALSE;
-                            }
+                            s += foldlen;
+                            added_len = foldlen;
                         }
                     }
-                    ender = folded;
 		}
 
                 len += added_len;
@@ -14004,21 +14068,27 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                 OP(ret) = NOTHING;
             }
             else {
-                if (FOLD) {
-                    /* If 'maybe_exact' is still set here, means there are no
-                     * code points in the node that participate in folds;
-                     * similarly for 'maybe_exactfu' and code points that match
-                     * differently depending on UTF8ness of the target string
-                     * (for /u), or depending on locale for /l */
-                    if (maybe_exact) {
-                        OP(ret) = (LOC)
-                                  ? EXACTL
-                                  : EXACT;
+                OP(ret) = node_type;
+
+                /* If the node type is EXACT here, check to see if it
+                 * should be EXACTL. */
+                if (node_type == EXACT) {
+                    if (LOC) {
+                        OP(ret) = EXACTL;
                     }
-                    else if (maybe_exactfu) {
-                        OP(ret) = (LOC)
-                                  ? EXACTFLU8
-                                  : EXACTFU;
+                }
+
+                if (FOLD) {
+                    /* If 'maybe_exactfu' is set, then there are no code points
+                     * that match differently depending on UTF8ness of the
+                     * target string (for /u), or depending on locale for /l */
+                    if (maybe_exactfu) {
+                        if (node_type == EXACTF) {
+                            OP(ret) = EXACTFU;
+                        }
+                        else if (node_type == EXACTFL) {
+                            OP(ret) = EXACTFLU8;
+                        }
                     }
                 }
 
-- 
2.11.0

@p5pRT
Copy link
Author

p5pRT commented Feb 5, 2018

From @khwilliamson

That patch doesn't apply to blead; the very slightly revised one here does

@p5pRT
Copy link
Author

p5pRT commented Feb 5, 2018

From @khwilliamson

0001-regcomp.c-Under-i-segregate-folding-vs-non-folding-c.patch
From fc2ba75031cf1477d32b3eff3115e2c4ce4f4d43 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Sun, 4 Feb 2018 19:43:00 -0700
Subject: [PATCH] regcomp.c: Under/i segregate folding vs non-folding
 characters

For matching sequences of characters, the regex compiler generates
various flavors of EXACT-type nodes.  The optimizer uses those nodes to
look for sequences that must be in the matched string.  In this way, the
pattern matching engine may be able to quickly rule out any possible
match altogether, or to narrow down the places in the target string that
might match.

Under /i matching, this generally has not been possible, because there
is no fixed string that the optimizer can grab onto, as something can
match, say, either 'A' or 'a', etc.  However, in many patterns that
contain text, there are characters that are fixed even under /i.  Things
like tabs, space, and punctuation, for example.

This commit segregates such folding vs non-folding characters into
separate nodes.  I proposed this 7 months ago:

http://nntp.perl.org/group/perl.perl5.porters/245342

and in talking with Yves recently, decided to go ahead with it.

In the proposal of July, I suggested that a new node type be used to
mark those nodes which are under /i but contain no characters that match
other than themselves under /i.  These nodes could be joined with either
a plain EXACT node, or an EXACTFish one to create longer nodes.

The reason for joining is that there is overhead in the engine whenever
we switch to the next node.  But the reason for not doing it is that it
is slower to match /i nodes than ones that an memEQ can be used on.

I suppose we could join short nodes, and leave longer ones separate, but
that decision can be deferred based on real-world experience.

This patch also consolidates into one place the handling of the Latin
Sharp S, in order to avoid extra #ifdefs, and cause the logic to be
linearly shown.
---
 regcomp.c | 246 ++++++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 158 insertions(+), 88 deletions(-)

diff --git a/regcomp.c b/regcomp.c
index 6f89a8ec30..dbdfd7b694 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -3695,10 +3695,7 @@ S_construct_ahocorasick_from_trie(pTHX_ RExC_state_t *pRExC_state, regnode *sour
  * XXX khw thinks this should be enhanced to fill EXACT (at least) nodes as full
  * as possible, even if that means splitting an existing node so that its first
  * part is moved to the preceeding node.  This would maximise the efficiency of
- * memEQ during matching.  Elsewhere in this file, khw proposes splitting
- * EXACTFish nodes into portions that don't change under folding vs those that
- * do.  Those portions that don't change may be the only things in the pattern that
- * could be used to find fixed and floating strings.
+ * memEQ during matching.
  *
  * If a node is to match under /i (folded), the number of characters it matches
  * can be different than its character length if it contains a multi-character
@@ -13297,7 +13294,18 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 
 	    char *s0;
 	    U8 upper_parse = MAX_NODE_STRING_SIZE;
-            U8 node_type = compute_EXACTish(pRExC_state);
+
+            /* We start out as an EXACT node, even if under /i, until we find a
+             * character which is in a fold.  The algorithm now segregates into
+             * separate nodes, characters that fold from those that don't under
+             * /i.  (This hopefull will create nodes that are fixed strings
+             * even under /i, giving the optimizer something to grab onto to.)
+             * So, if a node has something in it and the next character is in
+             * the opposite category, that node is closed up, and the function
+             * returns.  Then regatom is called again, and a new node is
+             * created for the new category. */
+            U8 node_type = EXACT;
+
             bool next_is_quantifier;
             char * oldp = NULL;
 
@@ -13311,14 +13319,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
              * which don't participate in folds with Latin1-range characters,
              * as the latter's folds aren't known until runtime.  (We don't
              * need to figure this out until pass 2) */
-            bool maybe_exactfu = PASS2
-                               && (node_type == EXACTF || node_type == EXACTFL);
-
-            /* If a folding node contains only code points that don't
-             * participate in folds, it can be changed into an EXACT node,
-             * which allows the optimizer more things to look for, and is
-             * faster to match */
-            bool maybe_exact;
+            bool maybe_exactfu = PASS2;
 
             /* The node_type may change below, but since the size of the node
              * doesn't change, it works */
@@ -13332,15 +13333,6 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 
 	  reparse:
 
-            /* We look for the EXACTFish to EXACT node optimizaton only if
-             * folding.  (And we don't need to figure this out until pass 2).
-             * XXX It might actually make sense to split the node into portions
-             * that are exact and ones that aren't, so that we could later use
-             * the exact ones to find the longest fixed and floating strings.
-             * One would want to join them back into a larger node.  One could
-             * use a pseudo regnode like 'EXACT_ORIG_FOLD' */
-            maybe_exact = FOLD && PASS2;
-
             /* This breaks under rare circumstances.  If folding, we do not
              * want to split a node at a character that is a non-final in a
              * multi-char fold, as an input string could just happen to want to
@@ -13357,7 +13349,9 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 
             /* Here, we have a literal character.  Find the maximal string of
              * them in the input that we can fit into a single EXACTish node.
-             * We quit at the first non-literal or when the node gets full */
+             * We quit at the first non-literal or when the node gets full, or
+             * under /i the categorization of folding/non-folding character
+             * changes */
 	    for (p = RExC_parse; len < upper_parse && p < RExC_end; ) {
 
                 /* In most cases each iteration adds one byte to the output.
@@ -13709,8 +13703,17 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                 else if (LOC && is_PROBLEMATIC_LOCALE_FOLD_cp(ender)) {
 
                     /* Here are folding under /l, and the code point is
-                     * problematic.  First, we know we can't simplify things */
-                    maybe_exact = FALSE;
+                     * problematic.  If this is the first character in the
+                     * node, change the node type to folding.   Otherwise, if
+                     * this is the first problematic character, close up the
+                     * existing node, so can start a new node with this one */
+                    if (! len) {
+                        node_type = EXACTFL;
+                    }
+                    else if (node_type == EXACT) {
+                        p = oldp;
+                        goto loopdone;
+                    }
 
                     /* This code point means we can't simplify things */
                     maybe_exactfu = FALSE;
@@ -13730,51 +13733,81 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                      * do for both passes is the PASS2 code for non-folding */
                     goto not_fold_common;
                 }
-                else /* A regular FOLD code point */
-                    if (! (   UTF
-#if    UNICODE_MAJOR_VERSION > 3 /* no multifolds in early Unicode */   \
-   || (UNICODE_MAJOR_VERSION == 3 && (   UNICODE_DOT_VERSION > 0)       \
-                                      || UNICODE_DOT_DOT_VERSION > 0)
-                            /* See comments for join_exact() as to why we fold
-                             * this non-UTF at compile time */
-                            || (   node_type == EXACTFU
-                                && ender == LATIN_SMALL_LETTER_SHARP_S)
-#endif
-                )) {
+                else                /* A regular FOLD code point */
+                     if (! UTF)
+                {
                     /* Here, are folding and are not UTF-8 encoded; therefore
                      * the character must be in the range 0-255, and is not /l.
                      * (Not /l because we already handled these under /l in
                      * is_PROBLEMATIC_LOCALE_FOLD_cp) */
-                    if (IS_IN_SOME_FOLD_L1(ender)) {
-                        maybe_exact = FALSE;
+                    if (! IS_IN_SOME_FOLD_L1(ender)) {
 
-                        /* See if the character's fold differs between /d and
-                         * /u.  This includes the multi-char fold SHARP S to
-                         * 'ss' */
-                        if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
-                            RExC_seen_unfolded_sharp_s = 1;
-                            maybe_exactfu = FALSE;
+                        /* Start a new node for this non-folding character if
+                         * previous ones in the node were folded */
+                        if (len && node_type != EXACT) {
+                            p = oldp;
+                            goto loopdone;
                         }
-                        else if (maybe_exactfu
-                            && (PL_fold[ender] != PL_fold_latin1[ender]
+
+                        *(s++) = (char) ender;
+                    }
+                    else {  /* Here, does participate in some fold */
+
+                        /* if this is the first character in the node, change
+                         * its type to folding.  Otherwise, if this is the
+                         * first folding character in the node, close up the
+                         * existing node, so can start a new node with this
+                         * one.  */
+                        if (! len) {
+                            node_type = compute_EXACTish(pRExC_state);
+                        }
+                        else if (node_type == EXACT) {
+                            p = oldp;
+                            goto loopdone;
+                        }
+
+                        /* See if the character's fold differs between /d and
+                         * /u.  On non-ancient Unicode versions, his includes
+                         * the multi-char fold SHARP S to 'ss' */
+
 #if    UNICODE_MAJOR_VERSION > 3 /* no multifolds in early Unicode */   \
    || (UNICODE_MAJOR_VERSION == 3 && (   UNICODE_DOT_VERSION > 0)       \
                                       || UNICODE_DOT_DOT_VERSION > 0)
-                                || (   len > 0
-                                    && isALPHA_FOLD_EQ(ender, 's')
-                                    && isALPHA_FOLD_EQ(*(s-1), 's'))
+
+                        if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
+                            /* See comments for join_exact() as to why we fold
+                             * this non-UTF at compile time */
+                            if (node_type == EXACTFU) {
+                                *(s++) = 's';
+
+                                /* Let the code below add in the extra 's' */
+                                ender = 's';
+                                added_len = 2;
+                            }
+                            else {
+                                RExC_seen_unfolded_sharp_s = 1;
+                                maybe_exactfu = FALSE;
+                            }
+                        }
+                        else if (   len
+                                 && isALPHA_FOLD_EQ(ender, 's')
+                                 && isALPHA_FOLD_EQ(*(s-1), 's'))
+                        {
+                            maybe_exactfu = FALSE;
+                        }
+                        else
 #endif
-                        )) {
+
+                        if (PL_fold[ender] != PL_fold_latin1[ender]) {
                             maybe_exactfu = FALSE;
                         }
-                    }
 
                         /* Even when folding, we store just the input
                          * character, as we have an array that finds its fold
                          * quickly */
                         *(s++) = (char) ender;
                 }
-                else {  /* FOLD, and UTF (or sharp s) */
+                else {  /* FOLD, and UTF */
                     /* Unlike the non-fold case, we do actually have to
                      * calculate the fold in pass 1.  This is for two reasons,
                      * the folded length may be longer than the unfolded, and
@@ -13783,41 +13816,72 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                      * of a potential multi-char fold, and have to back off
                      * accordingly.  */
 
-                    UV folded;
                     if (isASCII_uni(ender)) {
-                        folded = toFOLD(ender);
-                        *(s)++ = (U8) folded;
+
+                        /* As above, we close up and start a new node if the
+                         * previous characters don't match the fold/non-fold
+                         * state of this one.  And if this is the first
+                         * character in the node, and it folds, we change the
+                         * node away from being EXACT */
+                        if (! IS_IN_SOME_FOLD_L1(ender)) {
+                            if (len && node_type != EXACT) {
+                                p = oldp;
+                                goto loopdone;
+                            }
+
+                            *(s)++ = (U8) ender;
+                        }
+                        else {  /* Is in a fold */
+
+                            if (! len) {
+                                node_type = compute_EXACTish(pRExC_state);
+                            }
+                            else if (node_type == EXACT) {
+                                p = oldp;
+                                goto loopdone;
+                            }
+
+                            *(s)++ = (U8) toFOLD(ender);
+                        }
                     }
                     else {  /* Not ASCII */
                         STRLEN foldlen;
 
-                        folded = _to_uni_fold_flags(
+                        /* As above, we close up and start a new node if the
+                         * previous characters don't match the fold/non-fold
+                         * state of this one.  And if this is the first
+                         * character in the node, and it folds, we change the
+                         * node away from being EXACT */
+                        if (! _invlist_contains_cp(PL_utf8_foldable, ender)) {
+                            if (len && node_type != EXACT) {
+                                p = oldp;
+                                goto loopdone;
+                            }
+
+                            s = (char *) uvchr_to_utf8((U8 *) s, ender);
+                            added_len = UVCHR_SKIP(ender);
+                        }
+                        else {
+
+                            if (! len) {
+                                node_type = compute_EXACTish(pRExC_state);
+                            }
+                            else if (node_type == EXACT) {
+                                p = oldp;
+                                goto loopdone;
+                            }
+
+                            ender = _to_uni_fold_flags(
                                      ender,
                                      (U8 *) s,
                                      &foldlen,
                                      FOLD_FLAGS_FULL | ((ASCII_FOLD_RESTRICTED)
                                                         ? FOLD_FLAGS_NOMIX_ASCII
                                                         : 0));
-                        s += foldlen;
-                        added_len = foldlen;
-                    }
-                    /* If this node only contains non-folding code points so
-                     * far, see if this new one is also non-folding */
-                    if (maybe_exact) {
-                        if (folded != ender) {
-                            maybe_exact = FALSE;
-                        }
-                        else {
-                            /* Here the fold is the original; we have to check
-                             * further to see if anything folds to it */
-                            if (_invlist_contains_cp(PL_utf8_foldable,
-                                                        ender))
-                            {
-                                maybe_exact = FALSE;
-                            }
+                            s += foldlen;
+                            added_len = foldlen;
                         }
                     }
-                    ender = folded;
 		}
 
                 len += added_len;
@@ -14003,21 +14067,27 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                 OP(ret) = NOTHING;
             }
             else {
-                if (FOLD) {
-                    /* If 'maybe_exact' is still set here, means there are no
-                     * code points in the node that participate in folds;
-                     * similarly for 'maybe_exactfu' and code points that match
-                     * differently depending on UTF8ness of the target string
-                     * (for /u), or depending on locale for /l */
-                    if (maybe_exact) {
-                        OP(ret) = (LOC)
-                                  ? EXACTL
-                                  : EXACT;
+                OP(ret) = node_type;
+
+                /* If the node type is EXACT here, check to see if it
+                 * should be EXACTL. */
+                if (node_type == EXACT) {
+                    if (LOC) {
+                        OP(ret) = EXACTL;
                     }
-                    else if (maybe_exactfu) {
-                        OP(ret) = (LOC)
-                                  ? EXACTFLU8
-                                  : EXACTFU;
+                }
+
+                if (FOLD) {
+                    /* If 'maybe_exactfu' is set, then there are no code points
+                     * that match differently depending on UTF8ness of the
+                     * target string (for /u), or depending on locale for /l */
+                    if (maybe_exactfu) {
+                        if (node_type == EXACTF) {
+                            OP(ret) = EXACTFU;
+                        }
+                        else if (node_type == EXACTFL) {
+                            OP(ret) = EXACTFLU8;
+                        }
                     }
                 }
 
-- 
2.11.0

@p5pRT
Copy link
Author

p5pRT commented Feb 5, 2018

From @khwilliamson

On 02/04/2018 08​:38 PM, Karl Williamson wrote​:

That patch doesn't apply to blead; the very slightly revised one here does

And, in thinking about it overnight, I realized that my patch does not
actually fix the problem. It just means that it's harder to trigger.
You have to have an ASCII folding character before the \N{}, which also
has to an ASCII folding character, then the sharp s. It also did not
trigger with these unless I added more sharp s's. Here's a case where
it triggers​:

qr/0b\N{U+41}\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF/i'

Compiling REx
"0b\N{U+41}\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\x"...
panic​: reg_node overrun trying to emit 34, 560fca5bd924>=560fca5bd8c4 at
-e line 1.
panic​: bad free, ->next->prev=7373737373737373, header=560fca5bdb60,
->prev->next=560fca5bdb60 at (null) line 1 during global destruction.

@p5pRT
Copy link
Author

p5pRT commented Feb 5, 2018

From @khwilliamson

On 02/05/2018 10​:41 AM, Karl Williamson wrote​:

On 02/04/2018 08​:38 PM, Karl Williamson wrote​:

That patch doesn't apply to blead; the very slightly revised one here
does

And, in thinking about it overnight, I realized that my patch does not
actually fix the problem.  It just means that it's harder to trigger.
You have to have an ASCII folding character before the \N{}, which also
has to an ASCII folding character, then the sharp s.  It also did not
trigger with these unless I added more sharp s's.  Here's a case where
it triggers​:

qr/0b\N{U+41}\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF/i'

Compiling REx
"0b\N{U+41}\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\x"...
panic​: reg_node overrun trying to emit 34, 560fca5bd924>=560fca5bd8c4 at
-e line 1.

panic​: bad free, ->next->prev=7373737373737373, header=560fca5bdb60,
->prev->next=560fca5bdb60 at (null) line 1 during global destruction.

The small patch I originally sent turns out to be overkill. We don't
need to restart all the parsing; we just need to restart filling the
node. A patch that does just that is attached.

And that patch is easily incorporated into my other one, which the 2nd
attachment does.

So here's what I think should happen

We need to decide if we're going to sneak this in or wait for the CVE date.

If we sneak it in, do we do it with Yves or my patch? Note that his
patch assumed that we would restart the whole parse, and doesn't make
sense if it's just restarting the current node. For efficiency, we
would want to change at some point to just restarting the node if we
went with his patch. I think we should just do it once and not make any
changes until the CVE is out, which we could do with my patch. Also my
patch is doing a bunch of stuff in the area of the code this affects,
including several times using the same paradigm it does. But this is
all moot if we decide to wait for the CVE. We can apply the portions of
Yves patch and mine that don't specifically address this bug, and wait
for the CVE for the rest.

@p5pRT
Copy link
Author

p5pRT commented Feb 5, 2018

From @khwilliamson

0002-132227.patch
From 65ca3a816676489b315d69c69ae58818ee7e5a89 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Fri, 2 Feb 2018 15:14:27 -0700
Subject: [PATCH 2/2] 132227

---
 regcomp.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/regcomp.c b/regcomp.c
index 6dbfed52ab..67f9c93287 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13756,6 +13756,18 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                          * /u.  This includes the multi-char fold SHARP S to
                          * 'ss' */
                         if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
+
+                            /* If the node started out having uni rules, we
+                             * wouldn't have gotten here.  So this means
+                             * something in the middle has changed it, but
+                             * didn't think it needed to reparse.  But this
+                             * sharp s now does indicate the need for
+                             * reparsing. */
+                            if (RExC_uni_semantics) {
+                                p = oldp;
+                                goto loopdone;
+                            }
+
                             RExC_seen_unfolded_sharp_s = 1;
                             maybe_exactfu = FALSE;
                         }
-- 
2.11.0

@p5pRT
Copy link
Author

p5pRT commented Feb 5, 2018

From @khwilliamson

0001-regcomp.c-Under-i-segregate-folding-vs-non-folding-c.patch
From b7665795ecc38f0b218e3c25cec93e12549e1215 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Sun, 4 Feb 2018 19:43:00 -0700
Subject: [PATCH] regcomp.c: Under/i segregate folding vs non-folding
 characters

For matching sequences of characters, the regex compiler generates
various flavors of EXACT-type nodes.  The optimizer uses those nodes to
look for sequences that must be in the matched string.  In this way, the
pattern matching engine may be able to quickly rule out any possible
match altogether, or to narrow down the places in the target string that
might match.

Under /i matching, this generally has not been possible, because there
is no fixed string that the optimizer can grab onto, as something can
match, say, either 'A' or 'a', etc.  However, in many patterns that
contain text, there are characters that are fixed even under /i.  Things
like tabs, space, and punctuation, for example.

This commit segregates such folding vs non-folding characters into
separate nodes.  I proposed this 7 months ago:

http://nntp.perl.org/group/perl.perl5.porters/245342

and in talking with Yves recently, decided to go ahead with it.

In the proposal of July, I suggested that a new node type be used to
mark those nodes which are under /i but contain no characters that match
other than themselves under /i.  These nodes could be joined with either
a plain EXACT node, or an EXACTFish one to create longer nodes.

The reason for joining is that there is overhead in the engine whenever
we switch to the next node.  But the reason for not doing it is that it
is slower to match /i nodes than ones that an memEQ can be used on.

I suppose we could join short nodes, and leave longer ones separate, but
that decision can be deferred based on real-world experience.

This patch also consolidates into one place the handling of the Latin
Sharp S, in order to avoid extra #ifdefs, and cause the logic to be
linearly shown.
---
 regcomp.c | 252 ++++++++++++++++++++++++++++++++++++++++----------------------
 regexec.c |   5 ++
 2 files changed, 169 insertions(+), 88 deletions(-)

diff --git a/regcomp.c b/regcomp.c
index 6f89a8ec30..934d0ae819 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -3695,10 +3695,7 @@ S_construct_ahocorasick_from_trie(pTHX_ RExC_state_t *pRExC_state, regnode *sour
  * XXX khw thinks this should be enhanced to fill EXACT (at least) nodes as full
  * as possible, even if that means splitting an existing node so that its first
  * part is moved to the preceeding node.  This would maximise the efficiency of
- * memEQ during matching.  Elsewhere in this file, khw proposes splitting
- * EXACTFish nodes into portions that don't change under folding vs those that
- * do.  Those portions that don't change may be the only things in the pattern that
- * could be used to find fixed and floating strings.
+ * memEQ during matching.
  *
  * If a node is to match under /i (folded), the number of characters it matches
  * can be different than its character length if it contains a multi-character
@@ -13297,7 +13294,18 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 
 	    char *s0;
 	    U8 upper_parse = MAX_NODE_STRING_SIZE;
-            U8 node_type = compute_EXACTish(pRExC_state);
+
+            /* We start out as an EXACT node, even if under /i, until we find a
+             * character which is in a fold.  The algorithm now segregates into
+             * separate nodes, characters that fold from those that don't under
+             * /i.  (This hopefull will create nodes that are fixed strings
+             * even under /i, giving the optimizer something to grab onto to.)
+             * So, if a node has something in it and the next character is in
+             * the opposite category, that node is closed up, and the function
+             * returns.  Then regatom is called again, and a new node is
+             * created for the new category. */
+            U8 node_type = EXACT;
+
             bool next_is_quantifier;
             char * oldp = NULL;
 
@@ -13311,14 +13319,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
              * which don't participate in folds with Latin1-range characters,
              * as the latter's folds aren't known until runtime.  (We don't
              * need to figure this out until pass 2) */
-            bool maybe_exactfu = PASS2
-                               && (node_type == EXACTF || node_type == EXACTFL);
-
-            /* If a folding node contains only code points that don't
-             * participate in folds, it can be changed into an EXACT node,
-             * which allows the optimizer more things to look for, and is
-             * faster to match */
-            bool maybe_exact;
+            bool maybe_exactfu = PASS2;
 
             /* The node_type may change below, but since the size of the node
              * doesn't change, it works */
@@ -13332,15 +13333,6 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 
 	  reparse:
 
-            /* We look for the EXACTFish to EXACT node optimizaton only if
-             * folding.  (And we don't need to figure this out until pass 2).
-             * XXX It might actually make sense to split the node into portions
-             * that are exact and ones that aren't, so that we could later use
-             * the exact ones to find the longest fixed and floating strings.
-             * One would want to join them back into a larger node.  One could
-             * use a pseudo regnode like 'EXACT_ORIG_FOLD' */
-            maybe_exact = FOLD && PASS2;
-
             /* This breaks under rare circumstances.  If folding, we do not
              * want to split a node at a character that is a non-final in a
              * multi-char fold, as an input string could just happen to want to
@@ -13357,7 +13349,9 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 
             /* Here, we have a literal character.  Find the maximal string of
              * them in the input that we can fit into a single EXACTish node.
-             * We quit at the first non-literal or when the node gets full */
+             * We quit at the first non-literal or when the node gets full, or
+             * under /i the categorization of folding/non-folding character
+             * changes */
 	    for (p = RExC_parse; len < upper_parse && p < RExC_end; ) {
 
                 /* In most cases each iteration adds one byte to the output.
@@ -13709,8 +13703,17 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                 else if (LOC && is_PROBLEMATIC_LOCALE_FOLD_cp(ender)) {
 
                     /* Here are folding under /l, and the code point is
-                     * problematic.  First, we know we can't simplify things */
-                    maybe_exact = FALSE;
+                     * problematic.  If this is the first character in the
+                     * node, change the node type to folding.   Otherwise, if
+                     * this is the first problematic character, close up the
+                     * existing node, so can start a new node with this one */
+                    if (! len) {
+                        node_type = EXACTFL;
+                    }
+                    else if (node_type == EXACT) {
+                        p = oldp;
+                        goto loopdone;
+                    }
 
                     /* This code point means we can't simplify things */
                     maybe_exactfu = FALSE;
@@ -13730,51 +13733,87 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                      * do for both passes is the PASS2 code for non-folding */
                     goto not_fold_common;
                 }
-                else /* A regular FOLD code point */
-                    if (! (   UTF
-#if    UNICODE_MAJOR_VERSION > 3 /* no multifolds in early Unicode */   \
-   || (UNICODE_MAJOR_VERSION == 3 && (   UNICODE_DOT_VERSION > 0)       \
-                                      || UNICODE_DOT_DOT_VERSION > 0)
-                            /* See comments for join_exact() as to why we fold
-                             * this non-UTF at compile time */
-                            || (   node_type == EXACTFU
-                                && ender == LATIN_SMALL_LETTER_SHARP_S)
-#endif
-                )) {
+                else                /* A regular FOLD code point */
+                     if (! UTF)
+                {
                     /* Here, are folding and are not UTF-8 encoded; therefore
                      * the character must be in the range 0-255, and is not /l.
                      * (Not /l because we already handled these under /l in
                      * is_PROBLEMATIC_LOCALE_FOLD_cp) */
-                    if (IS_IN_SOME_FOLD_L1(ender)) {
-                        maybe_exact = FALSE;
+                    if (! IS_IN_SOME_FOLD_L1(ender)) {
 
-                        /* See if the character's fold differs between /d and
-                         * /u.  This includes the multi-char fold SHARP S to
-                         * 'ss' */
-                        if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
-                            RExC_seen_unfolded_sharp_s = 1;
-                            maybe_exactfu = FALSE;
+                        /* Start a new node for this non-folding character if
+                         * previous ones in the node were folded */
+                        if (len && node_type != EXACT) {
+                            p = oldp;
+                            goto loopdone;
                         }
-                        else if (maybe_exactfu
-                            && (PL_fold[ender] != PL_fold_latin1[ender]
+
+                        *(s++) = (char) ender;
+                    }
+                    else {  /* Here, does participate in some fold */
+
+                        /* if this is the first character in the node, change
+                         * its type to folding.  Otherwise, if this is the
+                         * first folding character in the node, close up the
+                         * existing node, so can start a new node with this
+                         * one.  */
+                        if (! len) {
+                            node_type = compute_EXACTish(pRExC_state);
+                        }
+                        else if (node_type == EXACT) {
+                            p = oldp;
+                            goto loopdone;
+                        }
+
+                        /* See if the character's fold differs between /d and
+                         * /u.  On non-ancient Unicode versions, his includes
+                         * the multi-char fold SHARP S to 'ss' */
+
 #if    UNICODE_MAJOR_VERSION > 3 /* no multifolds in early Unicode */   \
    || (UNICODE_MAJOR_VERSION == 3 && (   UNICODE_DOT_VERSION > 0)       \
                                       || UNICODE_DOT_DOT_VERSION > 0)
-                                || (   len > 0
-                                    && isALPHA_FOLD_EQ(ender, 's')
-                                    && isALPHA_FOLD_EQ(*(s-1), 's'))
+
+                        if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
+                            if (node_type == EXACTF && RExC_uni_semantics) {
+                                p = oldp;
+                                goto loopdone;
+                            }
+
+                            /* See comments for join_exact() as to why we fold
+                             * this non-UTF at compile time */
+                            if (node_type == EXACTFU) {
+                                *(s++) = 's';
+
+                                /* Let the code below add in the extra 's' */
+                                ender = 's';
+                                added_len = 2;
+                            }
+                            else {
+                                RExC_seen_unfolded_sharp_s = 1;
+                                maybe_exactfu = FALSE;
+                            }
+                        }
+                        else if (   len
+                                 && isALPHA_FOLD_EQ(ender, 's')
+                                 && isALPHA_FOLD_EQ(*(s-1), 's'))
+                        {
+                            maybe_exactfu = FALSE;
+                        }
+                        else
 #endif
-                        )) {
+
+                        if (PL_fold[ender] != PL_fold_latin1[ender]) {
                             maybe_exactfu = FALSE;
                         }
-                    }
 
                         /* Even when folding, we store just the input
                          * character, as we have an array that finds its fold
                          * quickly */
                         *(s++) = (char) ender;
+                    }
                 }
-                else {  /* FOLD, and UTF (or sharp s) */
+                else {  /* FOLD, and UTF */
                     /* Unlike the non-fold case, we do actually have to
                      * calculate the fold in pass 1.  This is for two reasons,
                      * the folded length may be longer than the unfolded, and
@@ -13783,41 +13822,72 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                      * of a potential multi-char fold, and have to back off
                      * accordingly.  */
 
-                    UV folded;
                     if (isASCII_uni(ender)) {
-                        folded = toFOLD(ender);
-                        *(s)++ = (U8) folded;
+
+                        /* As above, we close up and start a new node if the
+                         * previous characters don't match the fold/non-fold
+                         * state of this one.  And if this is the first
+                         * character in the node, and it folds, we change the
+                         * node away from being EXACT */
+                        if (! IS_IN_SOME_FOLD_L1(ender)) {
+                            if (len && node_type != EXACT) {
+                                p = oldp;
+                                goto loopdone;
+                            }
+
+                            *(s)++ = (U8) ender;
+                        }
+                        else {  /* Is in a fold */
+
+                            if (! len) {
+                                node_type = compute_EXACTish(pRExC_state);
+                            }
+                            else if (node_type == EXACT) {
+                                p = oldp;
+                                goto loopdone;
+                            }
+
+                            *(s)++ = (U8) toFOLD(ender);
+                        }
                     }
                     else {  /* Not ASCII */
                         STRLEN foldlen;
 
-                        folded = _to_uni_fold_flags(
+                        /* As above, we close up and start a new node if the
+                         * previous characters don't match the fold/non-fold
+                         * state of this one.  And if this is the first
+                         * character in the node, and it folds, we change the
+                         * node away from being EXACT */
+                        if (! _invlist_contains_cp(PL_utf8_foldable, ender)) {
+                            if (len && node_type != EXACT) {
+                                p = oldp;
+                                goto loopdone;
+                            }
+
+                            s = (char *) uvchr_to_utf8((U8 *) s, ender);
+                            added_len = UVCHR_SKIP(ender);
+                        }
+                        else {
+
+                            if (! len) {
+                                node_type = compute_EXACTish(pRExC_state);
+                            }
+                            else if (node_type == EXACT) {
+                                p = oldp;
+                                goto loopdone;
+                            }
+
+                            ender = _to_uni_fold_flags(
                                      ender,
                                      (U8 *) s,
                                      &foldlen,
                                      FOLD_FLAGS_FULL | ((ASCII_FOLD_RESTRICTED)
                                                         ? FOLD_FLAGS_NOMIX_ASCII
                                                         : 0));
-                        s += foldlen;
-                        added_len = foldlen;
-                    }
-                    /* If this node only contains non-folding code points so
-                     * far, see if this new one is also non-folding */
-                    if (maybe_exact) {
-                        if (folded != ender) {
-                            maybe_exact = FALSE;
-                        }
-                        else {
-                            /* Here the fold is the original; we have to check
-                             * further to see if anything folds to it */
-                            if (_invlist_contains_cp(PL_utf8_foldable,
-                                                        ender))
-                            {
-                                maybe_exact = FALSE;
-                            }
+                            s += foldlen;
+                            added_len = foldlen;
                         }
                     }
-                    ender = folded;
 		}
 
                 len += added_len;
@@ -14003,21 +14073,27 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                 OP(ret) = NOTHING;
             }
             else {
-                if (FOLD) {
-                    /* If 'maybe_exact' is still set here, means there are no
-                     * code points in the node that participate in folds;
-                     * similarly for 'maybe_exactfu' and code points that match
-                     * differently depending on UTF8ness of the target string
-                     * (for /u), or depending on locale for /l */
-                    if (maybe_exact) {
-                        OP(ret) = (LOC)
-                                  ? EXACTL
-                                  : EXACT;
+                OP(ret) = node_type;
+
+                /* If the node type is EXACT here, check to see if it
+                 * should be EXACTL. */
+                if (node_type == EXACT) {
+                    if (LOC) {
+                        OP(ret) = EXACTL;
                     }
-                    else if (maybe_exactfu) {
-                        OP(ret) = (LOC)
-                                  ? EXACTFLU8
-                                  : EXACTFU;
+                }
+
+                if (FOLD) {
+                    /* If 'maybe_exactfu' is set, then there are no code points
+                     * that match differently depending on UTF8ness of the
+                     * target string (for /u), or depending on locale for /l */
+                    if (maybe_exactfu) {
+                        if (node_type == EXACTF) {
+                            OP(ret) = EXACTFU;
+                        }
+                        else if (node_type == EXACTFL) {
+                            OP(ret) = EXACTFLU8;
+                        }
                     }
                 }
 
diff --git a/regexec.c b/regexec.c
index 08bf7134cc..8e4ba1700f 100644
--- a/regexec.c
+++ b/regexec.c
@@ -1,4 +1,7 @@
 /*    regexec.c
+
+    valgrind ./perl -Ilib -DU -le "use re qw(Debug EXTRA); 'xyxy' =~ m'(0|\D*0?x)*?\1';" 2>&1 | more
+ *    
  */
 
 /*
@@ -5855,6 +5858,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
 	state_num = OP(scan);
 
       reenter_switch:
+        DEBUG_U(PerlIO_printf( Perl_debug_log, "%s:%d: locinput=%p, strend=%p, nextchr=%d\n", __FILE__, __LINE__, locinput , reginfo->strend, UCHARAT(locinput)));
         DEBUG_EXECUTE_r(
             if (state_num <= REGNODE_MAX) {
                 SV * const prop = sv_newmortal();
@@ -5875,6 +5879,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
         to_complement = 0;
 
         SET_nextchr;
+        DEBUG_U(PerlIO_printf( Perl_debug_log, "%s:%d: locinput=%p, strend=%p, nextchr=%d\n", __FILE__, __LINE__, locinput , reginfo->strend, nextchr));
         assert(nextchr < 256 && (nextchr >= 0 || nextchr == NEXTCHR_EOS));
 
 	switch (state_num) {
-- 
2.11.0

@p5pRT
Copy link
Author

p5pRT commented Feb 6, 2018

From @tonycoz

On Mon, 05 Feb 2018 12​:39​:26 -0800, public@​khwilliamson.com wrote​:

On 02/05/2018 10​:41 AM, Karl Williamson wrote​:

On 02/04/2018 08​:38 PM, Karl Williamson wrote​:

That patch doesn't apply to blead; the very slightly revised one
here
does

And, in thinking about it overnight, I realized that my patch does
not
actually fix the problem.  It just means that it's harder to trigger.
You have to have an ASCII folding character before the \N{}, which
also
has to an ASCII folding character, then the sharp s.  It also did not
trigger with these unless I added more sharp s's.  Here's a case
where
it triggers​:

qr/0b\N{U+41}\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF/i'

Compiling REx
"0b\N{U+41}\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\x"...
panic​: reg_node overrun trying to emit 34, 560fca5bd924>=560fca5bd8c4
at
-e line 1.

panic​: bad free, ->next->prev=7373737373737373, header=560fca5bdb60,
->prev->next=560fca5bdb60 at (null) line 1 during global destruction.

The small patch I originally sent turns out to be overkill. We don't
need to restart all the parsing; we just need to restart filling the
node. A patch that does just that is attached.

Thanks.

And that patch is easily incorporated into my other one, which the 2nd
attachment does.

So here's what I think should happen

We need to decide if we're going to sneak this in or wait for the CVE
date.

If we sneak it in, do we do it with Yves or my patch? Note that his
patch assumed that we would restart the whole parse, and doesn't make
sense if it's just restarting the current node. For efficiency, we
would want to change at some point to just restarting the node if we
went with his patch. I think we should just do it once and not make
any
changes until the CVE is out, which we could do with my patch. Also
my
patch is doing a bunch of stuff in the area of the code this affects,
including several times using the same paradigm it does. But this is
all moot if we decide to wait for the CVE. We can apply the portions
of
Yves patch and mine that don't specifically address this bug, and wait
for the CVE for the rest.

I'd prefer to wait until the issue is public before updating blead, but it's up to you.

Could you please provide a version of your most recent patch for maint-5.26/maint-5.24 with a better note than "132227"?

This will need to go downstream for vendors to update their perl packages.

I've requested a CVE ID for this issue.

Tony

@p5pRT
Copy link
Author

p5pRT commented Feb 6, 2018

From @xsawyerx

I will contact the vendors. I just need a short description, a fix,
and a test case. I'll take the description from the CVE request, but
which patch do I take? Also, do we have different patches for
different versions? I would need one for each version.

On 6 February 2018 at 07​:44, Tony Cook via RT
<perl5-security-report-followup@​perl.org> wrote​:

On Mon, 05 Feb 2018 12​:39​:26 -0800, public@​khwilliamson.com wrote​:

On 02/05/2018 10​:41 AM, Karl Williamson wrote​:

On 02/04/2018 08​:38 PM, Karl Williamson wrote​:

That patch doesn't apply to blead; the very slightly revised one
here
does

And, in thinking about it overnight, I realized that my patch does
not
actually fix the problem. It just means that it's harder to trigger.
You have to have an ASCII folding character before the \N{}, which
also
has to an ASCII folding character, then the sharp s. It also did not
trigger with these unless I added more sharp s's. Here's a case
where
it triggers​:

qr/0b\N{U+41}\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF/i'

Compiling REx
"0b\N{U+41}\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\x"...
panic​: reg_node overrun trying to emit 34, 560fca5bd924>=560fca5bd8c4
at
-e line 1.

panic​: bad free, ->next->prev=7373737373737373, header=560fca5bdb60,
->prev->next=560fca5bdb60 at (null) line 1 during global destruction.

The small patch I originally sent turns out to be overkill. We don't
need to restart all the parsing; we just need to restart filling the
node. A patch that does just that is attached.

Thanks.

And that patch is easily incorporated into my other one, which the 2nd
attachment does.

So here's what I think should happen

We need to decide if we're going to sneak this in or wait for the CVE
date.

If we sneak it in, do we do it with Yves or my patch? Note that his
patch assumed that we would restart the whole parse, and doesn't make
sense if it's just restarting the current node. For efficiency, we
would want to change at some point to just restarting the node if we
went with his patch. I think we should just do it once and not make
any
changes until the CVE is out, which we could do with my patch. Also
my
patch is doing a bunch of stuff in the area of the code this affects,
including several times using the same paradigm it does. But this is
all moot if we decide to wait for the CVE. We can apply the portions
of
Yves patch and mine that don't specifically address this bug, and wait
for the CVE for the rest.

I'd prefer to wait until the issue is public before updating blead, but it's up to you.

Could you please provide a version of your most recent patch for maint-5.26/maint-5.24 with a better note than "132227"?

This will need to go downstream for vendors to update their perl packages.

I've requested a CVE ID for this issue.

Tony

@p5pRT
Copy link
Author

p5pRT commented Feb 7, 2018

From @tonycoz

On Mon, 05 Feb 2018 21​:44​:23 -0800, tonyc wrote​:

I've requested a CVE ID for this issue.

This issue is CVE-2018-6797.

Tony

@p5pRT
Copy link
Author

p5pRT commented Feb 20, 2018

From @tonycoz

On Tue, 06 Feb 2018 21​:01​:35 -0800, tonyc wrote​:

On Mon, 05 Feb 2018 21​:44​:23 -0800, tonyc wrote​:

I've requested a CVE ID for this issue.

This issue is CVE-2018-6797.

Patches for 5.24 and 5.26, merged and de-conflicted (and with a more useful subject than "132227")

Karl fixed this in blead in 141513f.

Tony

@p5pRT
Copy link
Author

p5pRT commented Feb 20, 2018

From @tonycoz

5.24-0005-perl-132227-restart-a-node-if-we-change-to-uni-rules.patch
From e02d7478ebfc399a9d10ba0df60eee217aa7ab8f Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Fri, 2 Feb 2018 15:14:27 -0700
Subject: (perl #132227) restart a node if we change to uni rules within the
 node and encounter a sharp S

This could lead to a buffer overflow.
---
 regcomp.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/regcomp.c b/regcomp.c
index c6c7cb4925..d79bd191c9 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13323,6 +13323,18 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                          * /u.  This includes the multi-char fold SHARP S to
                          * 'ss' */
                         if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
+
+                            /* If the node started out having uni rules, we
+                             * wouldn't have gotten here.  So this means
+                             * something in the middle has changed it, but
+                             * didn't think it needed to reparse.  But this
+                             * sharp s now does indicate the need for
+                             * reparsing. */
+                            if (RExC_uni_semantics) {
+                                p = oldp;
+                                goto loopdone;
+                            }
+
                             RExC_seen_unfolded_sharp_s = 1;
                             maybe_exactfu = FALSE;
                         }
-- 
2.11.0

@p5pRT
Copy link
Author

p5pRT commented Feb 20, 2018

From @tonycoz

5.26-0006-perl-132227-restart-a-node-if-we-change-to-uni-rules.patch
From 6041b59a21326bd9c2d58034943f07233d5ac1ec Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Fri, 2 Feb 2018 15:14:27 -0700
Subject: (perl #132227) restart a node if we change to uni rules within the
 node and encounter a sharp S

This could lead to a buffer overflow.
---
 regcomp.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/regcomp.c b/regcomp.c
index b2de7f05e1..943029b154 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13538,6 +13538,18 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                          * /u.  This includes the multi-char fold SHARP S to
                          * 'ss' */
                         if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
+
+                            /* If the node started out having uni rules, we
+                             * wouldn't have gotten here.  So this means
+                             * something in the middle has changed it, but
+                             * didn't think it needed to reparse.  But this
+                             * sharp s now does indicate the need for
+                             * reparsing. */
+                            if (RExC_uni_semantics) {
+                                p = oldp;
+                                goto loopdone;
+                            }
+
                             RExC_seen_unfolded_sharp_s = 1;
                             maybe_exactfu = FALSE;
                         }
-- 
2.11.0

@p5pRT
Copy link
Author

p5pRT commented Feb 20, 2018

From @khwilliamson

On 02/19/2018 08​:31 PM, Tony Cook via RT wrote​:

On Tue, 06 Feb 2018 21​:01​:35 -0800, tonyc wrote​:

On Mon, 05 Feb 2018 21​:44​:23 -0800, tonyc wrote​:

I've requested a CVE ID for this issue.

This issue is CVE-2018-6797.

Patches for 5.24 and 5.26, merged and de-conflicted (and with a more useful subject than "132227")

Karl fixed this in blead in 141513f.

Tony

Actually, Tony asked me not to fix it in blead, and so I didn't.
141513f has the side effect of causing the test case that was
submitted in this ticket to not trigger this bug. But the bug remains;
it just needs a somewhat different set of circumstances to get triggered.

I wrote earlier in the thread that the following pattern still triggers
it, and I just verified that it still fails in blead

$ blead -le
'qr/0b\N{U+41}\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF/i'

panic​: reg_node overrun trying to emit 34, 55ba06753904>=55ba067538a4 at
-e line 1.
panic​: free from wrong pool, 7373737373737373!=55ba0670dc20 at (null)
line 1 during global destruction.

@p5pRT
Copy link
Author

p5pRT commented Feb 21, 2018

From @tonycoz

On Mon, 19 Feb 2018 20​:07​:30 -0800, public@​khwilliamson.com wrote​:

On 02/19/2018 08​:31 PM, Tony Cook via RT wrote​:

On Tue, 06 Feb 2018 21​:01​:35 -0800, tonyc wrote​:

On Mon, 05 Feb 2018 21​:44​:23 -0800, tonyc wrote​:

I've requested a CVE ID for this issue.

This issue is CVE-2018-6797.

Patches for 5.24 and 5.26, merged and de-conflicted (and with a more
useful subject than "132227")

Karl fixed this in blead in 141513f.

Tony

Actually, Tony asked me not to fix it in blead, and so I didn't.
141513f has the side effect of causing the test case that was
submitted in this ticket to not trigger this bug. But the bug
remains;
it just needs a somewhat different set of circumstances to get
triggered.

I wrote earlier in the thread that the following pattern still
triggers
it, and I just verified that it still fails in blead

$ blead -le
'qr/0b\N{U+41}\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF/i'

panic​: reg_node overrun trying to emit 34, 55ba06753904>=55ba067538a4
at
-e line 1.
panic​: free from wrong pool, 7373737373737373!=55ba0670dc20 at (null)
line 1 during global destruction.

Ok, thanks.

Do you have a patch that applies to blead to fix the problem?

Tony

@p5pRT
Copy link
Author

p5pRT commented Feb 23, 2018

From @khwilliamson

On 02/19/2018 09​:07 PM, Karl Williamson wrote​:

On 02/19/2018 08​:31 PM, Tony Cook via RT wrote​:

On Tue, 06 Feb 2018 21​:01​:35 -0800, tonyc wrote​:

On Mon, 05 Feb 2018 21​:44​:23 -0800, tonyc wrote​:

I've requested a CVE ID for this issue.

This issue is CVE-2018-6797.

Patches for 5.24 and 5.26, merged and de-conflicted (and with a more
useful subject than "132227")

Karl fixed this in blead in 141513f.

Tony

Actually, Tony asked me not to fix it in blead, and so I didn't.
141513f has the side effect of causing the test case that was
submitted in this ticket to not trigger this bug.  But the bug remains;
it just needs a somewhat different set of circumstances to get triggered.

I wrote earlier in the thread that the following pattern still triggers
it, and I just verified that it still fails in blead

$ blead -le
'qr/0b\N{U+41}\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF/i'

panic​: reg_node overrun trying to emit 34, 55ba06753904>=55ba067538a4 at
-e line 1.
panic​: free from wrong pool, 7373737373737373!=55ba0670dc20 at (null)
line 1 during global destruction.

Blead patch attached.

@p5pRT
Copy link
Author

p5pRT commented Feb 23, 2018

From @khwilliamson

0005-perl-132227-restart-node-if-change-to-uni-rules.patch
From c76970a0a5101dea085fe22ec93fdb1ed95b655e Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Thu, 22 Feb 2018 22:45:19 -0700
Subject: [PATCH 5/5] (perl #132227) restart node if change to uni rules

This is only necessary if within the node we encounter a sharp S.
Otherwise this could lead to a buffer overflow.
---
 regcomp.c           | 18 ++++++++++++++++++
 t/re/pat_advanced.t |  4 ++++
 2 files changed, 22 insertions(+)

diff --git a/regcomp.c b/regcomp.c
index 34ac9169f2..a78df8b8cb 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13968,6 +13968,24 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                                 ender = 's';
                                 added_len = 2;
                             }
+                            else if (RExC_uni_semantics) {
+
+                                /* Here, we are supossed to be using Unicode
+                                 * rules, but this folding node is not.  This
+                                 * happens during pass 1 when the node started
+                                 * out not under Unicode rules, but a \N{} was
+                                 * encountered during the processing of it,
+                                 * causing Unicode rules to be switched into.
+                                 * Pass 1 continues uninteruppted, as by the
+                                 * time we get to pass 2, we will know enough
+                                 * to generate the correct folds.  Except in
+                                 * this one case, we need to restart the node,
+                                 * because the fold of the sharp s requires 2
+                                 * characters, and the sizing needs to account
+                                 * for that. */
+                                p = oldp;
+                                goto loopdone;
+                            }
                             else {
                                 RExC_seen_unfolded_sharp_s = 1;
                                 maybe_exactfu = FALSE;
diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t
index d90ceeb5bd..d9c4773d2d 100644
--- a/t/re/pat_advanced.t
+++ b/t/re/pat_advanced.t
@@ -2332,6 +2332,10 @@ EOF
         unlike("s\N{U+DF}", qr/^\x{00DF}/i, "\"s\\N{U+DF}\", qr/^\\x{00DF}/i");
     }
 
+    {   # Bug #132227, caused failed assertion
+        ok(qr/0b\N{U+41}\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xDF/i, "No seqgfault [perl #132227]");
+    }
+
     # User-defined Unicode properties to match above-Unicode code points
     sub Is_31_Bit_Super { return "110000\t7FFFFFFF\n" }
     sub Is_Portable_Super { return '!utf8::Any' }   # Matches beyond 32 bits
-- 
2.11.0

@p5pRT
Copy link
Author

p5pRT commented Feb 24, 2018

From @xsawyerx

The public disclosure date is set for March 15th.

@p5pRT
Copy link
Author

p5pRT commented Mar 16, 2018

From @geeknik

Is it okay to disclose this bug now?

@p5pRT
Copy link
Author

p5pRT commented Mar 16, 2018

From @tonycoz

On Thu, Mar 15, 2018 at 11​:38​:58PM -0500, Brian C. wrote​:

Is it okay to disclose this bug now?

Sorry, no, downstream has delayed disclosure.

The actual disclosure date is under discussion AFAIK.

Tony

@p5pRT
Copy link
Author

p5pRT commented Mar 16, 2018

From @geeknik

We are long past the industry standard of 90 days now.

@p5pRT
Copy link
Author

p5pRT commented Mar 16, 2018

From @xsawyerx

Industry standard is also to respect downstream vendors concerns and we
have always done this, sometimes to far longer than 90 days. Please be
patient. We will update shortly.

On Fri, Mar 16, 2018, 20​:26 Brian C. <brian.carpenter@​gmail.com> wrote​:

We are long past the industry standard of 90 days now.

@p5pRT
Copy link
Author

p5pRT commented Mar 19, 2018

From @xsawyerx

The disclosure date has been postponed officially to April 14th.

On 16 March 2018 at 20​:28, Sawyer X <xsawyerx@​gmail.com> wrote​:

Industry standard is also to respect downstream vendors concerns and we have
always done this, sometimes to far longer than 90 days. Please be patient.
We will update shortly.

On Fri, Mar 16, 2018, 20​:26 Brian C. <brian.carpenter@​gmail.com> wrote​:

We are long past the industry standard of 90 days now.

@p5pRT
Copy link
Author

p5pRT commented Apr 14, 2018

@xsawyerx - Status changed from 'open' to 'pending release'

@p5pRT
Copy link
Author

p5pRT commented Apr 14, 2018

From @xsawyerx

On Mon, 19 Mar 2018 15​:27​:19 -0700, xsawyerx@​gmail.com wrote​:

The disclosure date has been postponed officially to April 14th.

Moved to public queue.

@p5pRT
Copy link
Author

p5pRT commented Apr 15, 2018

From @khwilliamson

Blead commit that fixed this is 2407a17
--
Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Jun 23, 2018

From @khwilliamson

Thank you for filing this report. You have helped make Perl better.

With the release yesterday of Perl 5.28.0, this and 185 other issues have been
resolved.

Perl 5.28.0 may be downloaded via​:
https://metacpan.org/release/XSAWYERX/perl-5.28.0

If you find that the problem persists, feel free to reopen this ticket.

@p5pRT p5pRT closed this as completed Jun 23, 2018
@p5pRT
Copy link
Author

p5pRT commented Jun 23, 2018

@khwilliamson - Status changed from 'pending release' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant