toke.c:3813: char S_scan_const(char ): Assertion Failed. #15790

p5pRT · 2017-01-03T18:57:21Z

Migrated from rt.perl.org#130495 (status was 'resolved')

Searchable as RT130495$

p5pRT · 2017-01-03T18:57:21Z

From @geeknik

Triggered with Perl v5.25.8-132-gc10193e while fuzzing with AFL.

od -tx1 test097
0000000 6d 00 20 00 30 30 30 6b 23 00 30 5c 30 30 78 00
0000020

./perl test097
perl: toke.c:3813: char *S_scan_const(char *): Assertion
`(__builtin_expect(((((U8 *) send) <= ((U8 *) s)) ? (_Bool)1 :
(_Bool)0),(0)) ? 0 : (((unsigned long)(((*(U8 *) s) | 0) | 0) < 128)) ? 1 :
__builtin_expect((((((U8 *) send) - ((U8 *) s)) < PL_utf8skip[*(const
U8*)((U8 *) s)]) ? (_Bool)1 : (_Bool)0),(0)) ? 0 : __builtin_expect(((((((
(sizeof(*(U8 *) s) == 1) || !(((unsigned long)((*(U8 *) s) | 0)) & ~0xFF)))
? (void) (0) : __assert_fail ("( (sizeof(*(U8 *) s) == 1) || !(((unsigned
long)((*(U8 *) s) | 0)) & ~0xFF))", "toke.c", 3813, __PRETTY_FUNCTION__)),
((U8) (*(U8 *) s))) <= 0xF7) ? (_Bool)1 : (_Bool)0),(1)) ? ( ( 0xC2 <=
((const U8*)(U8 *) s)[0] && ((const U8*)(U8 *) s)[0] <= 0xDF ) ? (
__builtin_expect(((( ((const U8*)(U8 *) s)[1] & 0xC0 ) == 0x80) ? (_Bool)1
: (_Bool)0),(1)) ? 2 : 0 ) : ( 0xE0 == ((const U8*)(U8 *) s)[0] ) ? (
__builtin_expect(((( ( ((const U8*)(U8 *) s)[1] & 0xE0 ) == 0xA0 ) && ( (
((const U8*)(U8 *) s)[2] & 0xC0 ) == 0x80 )) ? (_Bool)1 : (_Bool)0),(1)) ?
3 : 0 ): ( 0xE1 <= ((const U8*)(U8 *) s)[0] && ((const U8*)(U8 *) s)[0] <=
0xEF ) ? ( __builtin_expect(((( ( ((const U8*)(U8 *) s)[1] & 0xC0 ) == 0x80
) && ( ( ((const U8*)(U8 *) s)[2] & 0xC0 ) == 0x80 )) ? (_Bool)1 :
(_Bool)0),(1)) ? 3 : 0 ): ( 0xF0 == ((const U8*)(U8 *) s)[0] ) ? (
__builtin_expect(((( ( 0x90 <= ((const U8*)(U8 *) s)[1] && ((const U8*)(U8
*) s)[1] <= 0xBF ) && ( ( ((const U8*)(U8 *) s)[2] & 0xC0 ) == 0x80 ) ) &&
( ( ((const U8*)(U8 *) s)[3] & 0xC0 ) == 0x80 )) ? (_Bool)1 :
(_Bool)0),(1)) ? 4 : 0 ): ( ( ( ( 0xF1 <= ((const U8*)(U8 *) s)[0] &&
((const U8*)(U8 *) s)[0] <= 0xF7 ) && __builtin_expect(((( ((const U8*)(U8
*) s)[1] & 0xC0 ) == 0x80) ? (_Bool)1 : (_Bool)0),(1)) ) &&
__builtin_expect(((( ((const U8*)(U8 *) s)[2] & 0xC0 ) == 0x80) ? (_Bool)1
: (_Bool)0),(1)) ) && __builtin_expect(((( ((const U8*)(U8 *) s)[3] & 0xC0
) == 0x80) ? (_Bool)1 : (_Bool)0),(1)) ) ? 4 : 0 ) :
Perl__is_utf8_char_helper((U8 *) s, (U8 *) send, 0))' failed.
Aborted

p5pRT · 2017-01-03T18:57:21Z

From @geeknik

test097.gz

p5pRT · 2017-01-04T13:00:20Z

From @hvds

On Tue, 03 Jan 2017 10:57:21 -0800, brian.carpenter@gmail.com wrote:

Triggered with Perl v5.25.8-132-gc10193e while fuzzing with AFL.

od -tx1 test097
0000000 6d 00 20 00 30 30 30 6b 23 00 30 5c 30 30 78 00
0000020

./perl test097
perl: toke.c:3813: char *S_scan_const(char *): Assertion
`[...snip...]' failed.
Aborted

I'm slightly confused here; I think first we see "m\x{0} \x{0}" which triggers UTF-16LE detection, so the input gets converted into "m \x{3030}\x{6b30}#\x{5c30}\x{3030}x".

Next we parse that as a pattern, with \x{3030} as the delimiter, and the //x flag.

Next we see the '#', and attempt to scan past the //x-enabled comment with this at toke.c:3260:
while (s+1 < send && *s != '\n')
*d++ = *s++;

The string it's trying to scan past at that point is "#\x{5c30}" encoded as {23 e5 b0 b0}. Since send points past the end of our pattern, this does one byte too little, leaving s pointing at the trailing \xb0. We then try to process that remaining byte, and think we've seen invalid utf8.

I'm confused though that if I try to bypass the UTF-16LE conversion I don't get the same assert:
perl -e 'print "m \x{3030}\x{6b30}#\x{5c30}\x{3030}x"' | perl -CI
.. so I think I'm still missing something - and indeed, if I replace 's+1' with 's' in the above check I get test failures.

I'll try to dig further, but it may need more eyes.

Hugo

p5pRT · 2017-01-04T13:00:20Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2017-01-04T13:28:35Z

From @hvds

On Wed, 04 Jan 2017 05:00:20 -0800, hv wrote:

On Tue, 03 Jan 2017 10:57:21 -0800, brian.carpenter@gmail.com wrote:

Triggered with Perl v5.25.8-132-gc10193e while fuzzing with AFL.

od -tx1 test097
0000000 6d 00 20 00 30 30 30 6b 23 00 30 5c 30 30 78 00
0000020

./perl test097
perl: toke.c:3813: char *S_scan_const(char *): Assertion
`[...snip...]' failed.
Aborted

I'm slightly confused here; I think first we see "m\x{0} \x{0}" which
triggers UTF-16LE detection, so the input gets converted into "m
\x{3030}\x{6b30}#\x{5c30}\x{3030}x".

Next we parse that as a pattern, with \x{3030} as the delimiter, and
the //x flag.

Next we see the '#', and attempt to scan past the //x-enabled comment
with this at toke.c:3260:
while (s+1 < send && *s != '\n')
*d++ = *s++;

The string it's trying to scan past at that point is "#\x{5c30}"
encoded as {23 e5 b0 b0}. Since send points past the end of our
pattern, this does one byte too little, leaving s pointing at the
trailing \xb0. We then try to process that remaining byte, and think
we've seen invalid utf8.

I'm confused though that if I try to bypass the UTF-16LE conversion I
don't get the same assert:
perl -e 'print "m \x{3030}\x{6b30}#\x{5c30}\x{3030}x"' | perl -CI
.. so I think I'm still missing something - and indeed, if I replace
's+1' with 's' in the above check I get test failures.

I'll try to dig further, but it may need more eyes.

Hugo

Oh, I think I understand - two wrongs make a right, so failing to skip all the characters it should meant it was ok for there to be no s==send check at the end of the if/else chain.

So I think the fix should actually look like the below. I'll do some more testing.

Hugo
--- a/toke.c
+++ b/toke.c
@@ -3257,7 +3257,7 @@ S_scan_const(pTHX_ char *start)
&& !in_charclass
&& ((PMOP*)PL_lex_inpat)->op_pmflags & RXf_PMf_EXTENDED)
{
- while (s+1 < send && *s != '\n')
+ while (s < send && *s != '\n')
*d++ = *s++;
}

@@ -3298,6 +3298,11 @@ S_scan_const(pTHX_ char *start)

/* End of else if chain - OP_TRANS rejoin rest */

+ if (UNLIKELY(s >= send)) {
+ assert(s == send);
+ break;
+ }
+
/* backslashes */
if (*s == '\\' && s+1 < send) {
char* e; /* Can be used for ending '}', etc. */

p5pRT · 2017-01-04T15:03:42Z

From @hvds

On Wed, 04 Jan 2017 05:28:35 -0800, hv wrote:
[...]

So I think the fix should actually look like the below. I'll do some
more testing.

Hugo
--- a/toke.c
+++ b/toke.c
@@ -3257,7 +3257,7 @@ S_scan_const(pTHX_ char *start)
&& !in_charclass
&& ((PMOP*)PL_lex_inpat)->op_pmflags &
RXf_PMf_EXTENDED)
{
- while (s+1 < send && *s != '\n')
+ while (s < send && *s != '\n')
*d++ = *s++;
}

@@ -3298,6 +3298,11 @@ S_scan_const(pTHX_ char *start)

/* End of else if chain - OP_TRANS rejoin rest */

+ if (UNLIKELY(s >= send)) {
+ assert(s == send);
+ break;
+ }
+
/* backslashes */
if (*s == '\\' && s+1 < send) {
char* e; /* Can be used for ending '}', etc. */

Should be fixed by 8faf4f3:
[perl #130495] /x comment skipping stops a byte short

If that byte was part of a utf-8 character, this caused inappropriate
"malformed utf8" warnings or assertions.

In principle this should also skip the newline, but failing to do so
is safe.

Hugo

p5pRT · 2017-01-04T15:03:43Z

@hvds - Status changed from 'open' to 'pending release'

p5pRT · 2017-01-06T22:41:29Z

From @craigberry

On Wed, Jan 4, 2017 at 9:03 AM, Hugo van der Sanden via RT
<perlbug-followup@perl.org> wrote:

On Wed, 04 Jan 2017 05:28:35 -0800, hv wrote:
[...]

So I think the fix should actually look like the below. I'll do some
more testing.

Hugo
--- a/toke.c
+++ b/toke.c
@@ -3257,7 +3257,7 @@ S_scan_const(pTHX_ char *start)
&& !in_charclass
&& ((PMOP*)PL_lex_inpat)->op_pmflags &
RXf_PMf_EXTENDED)
{
- while (s+1 < send && *s != '\n')
+ while (s < send && *s != '\n')
*d++ = *s++;
}

@@ -3298,6 +3298,11 @@ S_scan_const(pTHX_ char *start)

/* End of else if chain - OP_TRANS rejoin rest */

+ if (UNLIKELY(s >= send)) {
+ assert(s == send);
+ break;
+ }
+
/* backslashes */
if (*s == '\\' && s+1 < send) {
char* e; /* Can be used for ending '}', etc. */

Should be fixed by 8faf4f3:
[perl #130495] /x comment skipping stops a byte short
If that byte was part of a utf\-8 character\, this caused inappropriate
"malformed utf8" warnings or assertions\.

The new test in t/re/pat.t throws a warning:

ok 827 - empty pattern in regex codeblock: produced the right exception message
Wide character in print at ./test.pl line 982.
ok 828 - [perl \#130495] utf-8 character at end of /x comment should
not misparse

That's in fresh_perl() in t/test.pl where we write out the test
program to a file like so:

open TEST, '>', $tmpfile or die "Cannot open $tmpfile: $!";
print TEST $prog;
close TEST or die "Cannot close $tmpfile: $!";

Surely we have other places where there are UTF-8 characters embedded
in a test program? Opening the file with '>:utf8' fixes this test but
breaks a lot of other things and I'm not sure it's safe to do that
within the primitive confines of test.pl anyway. Suggestions?

p5pRT · 2017-01-07T01:32:27Z

From @hvds

On Fri, 06 Jan 2017 14:41:29 -0800, craig.a.berry@gmail.com wrote:

The new test in t/re/pat.t throws a warning:

ok 827 - empty pattern in regex codeblock: produced the right
exception message
Wide character in print at ./test.pl line 982.
ok 828 - [perl \#130495] utf-8 character at end of /x comment should
not misparse

That's in fresh_perl() in t/test.pl where we write out the test
program to a file like so:

open TEST, '>', $tmpfile or die "Cannot open $tmpfile: $!";
print TEST $prog;
close TEST or die "Cannot close $tmpfile: $!";

Surely we have other places where there are UTF-8 characters embedded
in a test program? Opening the file with '>:utf8' fixes this test but
breaks a lot of other things and I'm not sure it's safe to do that
within the primitive confines of test.pl anyway. Suggestions?

I propose the patch below; I think it should be safe but would welcome a second opinion.

Hugo

commit ef6b07d107b3e73a8c13536ba41cc386f2bf075c
Author: Hugo van der Sanden <hv@crypt.org>
Date: Sat Jan 7 01:27:50 2017 +0000

[perl #130495] add fresh_perl() option for prog with embedded utf8

Support { wide_chars => 1 } in the args hash to fresh_perl*() functions,
and use it for the recently added test in re/pat.t

Inline Patch

diff --git a/t/re/pat.t b/t/re/pat.t
index a72989f..c5de2cd 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -1864,7 +1864,7 @@ EOF_CODE
         # [perl #130495] /x comment skipping stopped a byte short, leading
         # to assertion failure or 'malformed utf-8 character" warning
         fresh_perl_is(
-            "use utf8; m{a#\x{124}}x", '', {},
+            "use utf8; m{a#\x{124}}x", '', {wide_chars => 1},
             '[perl #130495] utf-8 character at end of /x comment should not misparse',
         );
     }
diff --git a/t/test.pl b/t/test.pl
index 32d0d64..d86f633 100644
--- a/t/test.pl
+++ b/t/test.pl
@@ -979,6 +979,7 @@ sub fresh_perl {
     $runperl_args->{stderr}     = 1 unless exists $runperl_args->{stderr};
 
     open TEST, '>', $tmpfile or die "Cannot open $tmpfile: $!";
+    binmode TEST, ':utf8' if $runperl_args->{wide_chars};
     print TEST $prog;
     close TEST or die "Cannot close $tmpfile: $!";

p5pRT · 2017-01-07T21:37:53Z

From @craigberry

On Fri, Jan 6, 2017 at 7:32 PM, Hugo van der Sanden via RT
<perlbug-followup@perl.org> wrote:

On Fri, 06 Jan 2017 14:41:29 -0800, craig.a.berry@gmail.com wrote:

The new test in t/re/pat.t throws a warning:

ok 827 - empty pattern in regex codeblock: produced the right
exception message
Wide character in print at ./test.pl line 982.
ok 828 - [perl \#130495] utf-8 character at end of /x comment should
not misparse

That's in fresh_perl() in t/test.pl where we write out the test
program to a file like so:

open TEST, '>', $tmpfile or die "Cannot open $tmpfile: $!";
print TEST $prog;
close TEST or die "Cannot close $tmpfile: $!";

Surely we have other places where there are UTF-8 characters embedded
in a test program? Opening the file with '>:utf8' fixes this test but
breaks a lot of other things and I'm not sure it's safe to do that
within the primitive confines of test.pl anyway. Suggestions?

I propose the patch below; I think it should be safe but would welcome a second opinion.

Hugo

commit ef6b07d107b3e73a8c13536ba41cc386f2bf075c
Author: Hugo van der Sanden <hv@crypt.org>
Date: Sat Jan 7 01:27:50 2017 +0000
\[perl \#130495\] add fresh\_perl option for prog with embedded utf8

Support \{ wide\_chars => 1 \} in the args hash to fresh\_perl\* functions\,
and use it for the recently added test in re/pat\.t
diff --git a/t/re/pat.t b/t/re/pat.t
index a72989f..c5de2cd 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -1864,7 +1864,7 @@ EOF_CODE
# [perl #130495] /x comment skipping stopped a byte short, leading
# to assertion failure or 'malformed utf-8 character" warning
fresh_perl_is(
- "use utf8; m{a#\x{124}}x", '', {},
+ "use utf8; m{a#\x{124}}x", '', {wide_chars => 1},
'[perl #130495] utf-8 character at end of /x comment should not misparse',
);
}
diff --git a/t/test.pl b/t/test.pl
index 32d0d64..d86f633 100644
--- a/t/test.pl
+++ b/t/test.pl
@@ -979,6 +979,7 @@ sub fresh_perl {
$runperl_args->{stderr} = 1 unless exists $runperl_args->{stderr};
 open TEST\, '>'\, $tmpfile or die "Cannot open $tmpfile&#8203;: $\!";
+ binmode TEST, ':utf8' if $runperl_args->{wide_chars};
print TEST $prog;
close TEST or die "Cannot close $tmpfile: $!";

I like the opt-in idea so only tests that specify this get it. Looks
like a good solution to me.

p5pRT · 2017-01-08T11:12:31Z

From @hvds

On Sat, 07 Jan 2017 13:37:53 -0800, craig.a.berry@gmail.com wrote:

I like the opt-in idea so only tests that specify this get it. Looks
like a good solution to me.

Ok, I checked stderr of a full test run and saw no output relating to this, so now pushed as 98c155b.

Hugo

p5pRT · 2017-05-30T20:16:44Z

From @khwilliamson

Thank you for filing this report. You have helped make Perl better.

With the release today of Perl 5.26.0, this and 210 other issues have been
resolved.

Perl 5.26.0 may be downloaded via:
https://metacpan.org/release/XSAWYERX/perl-5.26.0

If you find that the problem persists, feel free to reopen this ticket.

p5pRT · 2017-05-30T20:16:45Z

@khwilliamson - Status changed from 'pending release' to 'resolved'

p5pRT closed this as completed May 30, 2017

p5pRT added the Severity Low label Oct 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

toke.c:3813: char S_scan_const(char ): Assertion Failed. #15790

toke.c:3813: char S_scan_const(char ): Assertion Failed. #15790

p5pRT commented Jan 3, 2017

p5pRT commented Jan 3, 2017

p5pRT commented Jan 3, 2017

p5pRT commented Jan 4, 2017

p5pRT commented Jan 4, 2017

p5pRT commented Jan 4, 2017

p5pRT commented Jan 4, 2017

p5pRT commented Jan 4, 2017

p5pRT commented Jan 6, 2017

p5pRT commented Jan 7, 2017

p5pRT commented Jan 7, 2017

p5pRT commented Jan 8, 2017

p5pRT commented May 30, 2017

p5pRT commented May 30, 2017

toke.c:3813: char *S_scan_const(char *): Assertion Failed. #15790

toke.c:3813: char *S_scan_const(char *): Assertion Failed. #15790

Comments

p5pRT commented Jan 3, 2017

p5pRT commented Jan 3, 2017

From @geeknik

p5pRT commented Jan 3, 2017

From @geeknik

p5pRT commented Jan 4, 2017

From @hvds

p5pRT commented Jan 4, 2017

p5pRT commented Jan 4, 2017

From @hvds

p5pRT commented Jan 4, 2017

From @hvds

p5pRT commented Jan 4, 2017

p5pRT commented Jan 6, 2017

From @craigberry

p5pRT commented Jan 7, 2017

From @hvds

p5pRT commented Jan 7, 2017

From @craigberry

p5pRT commented Jan 8, 2017

From @hvds

p5pRT commented May 30, 2017

From @khwilliamson

p5pRT commented May 30, 2017

toke.c:3813: char S_scan_const(char ): Assertion Failed. #15790

toke.c:3813: char S_scan_const(char ): Assertion Failed. #15790