heap-buffer-overflow Perl_utf8_hop (utf8.c:1132) #15532

p5pRT · 2016-08-19T21:15:21Z

Migrated from rt.perl.org#129000 (status was 'resolved')

Searchable as RT129000$

p5pRT · 2016-08-19T21:15:21Z

From @geeknik

The attached test case triggers a heap-buffer-overflow in Perl_utf8_hop
(utf8.c:1132). This was found with AFL, ASAN and libdislocator.so
affects v5.25.4 (v5.25.3-305-g8c6b0c7). Perl 5.20.2 throws a few errors,
but doesn't crash.

==18525==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x60600000e65f at pc 0x000000bb9089 bp 0x7ffca4ff1440 sp 0x7ffca4ff1438
READ of size 1 at 0x60600000e65f thread T0
#0 0xbb9088 in Perl_utf8_hop /root/perl/utf8.c:1132:6
#1 0x66e12e in Perl_yylex /root/perl/toke.c:4906:32
#2 0x6ac9d5 in Perl_yyparse /root/perl/perly.c:334:19
#3 0x59c4a1 in S_parse_body /root/perl/perl.c:2372:9
#4 0x59283c in perl_parse /root/perl/perl.c:1688:2
#5 0x4de835 in main /root/perl/perlmain.c:121:18
#6 0x7effffbe5b44 in __libc_start_main
/build/glibc-uPj9cH/glibc-2.19/csu/libc-start.c:287
#7 0x4de4cc in _start (/root/perl/perl+0x4de4cc)

0x60600000e65f is located 1 bytes to the left of 56-byte region
[0x60600000e660,0x60600000e698)
allocated by thread T0 here:
#0 0x4c113e in realloc (/root/perl/perl+0x4c113e)
#1 0x7f62c6 in Perl_safesysrealloc /root/perl/util.c:274:18

SUMMARY: AddressSanitizer: heap-buffer-overflow /root/perl/utf8.c:1132
Perl_utf8_hop
Shadow bytes around the buggy address:
0x0c0c7fff9c70: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c0c7fff9c80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c0c7fff9c90: fd fd fd fd fd fd fd fd fa fa fa fa fd fd fd fd
0x0c0c7fff9ca0: fd fd fd fd fa fa fa fa fd fd fd fd fd fd fd fd
0x0c0c7fff9cb0: fa fa fa fa 00 00 00 00 00 00 00 00 fa fa fa fa
=>0x0c0c7fff9cc0: fd fd fd fd fd fd fd fd fa fa fa[fa]00 00 00 00
0x0c0c7fff9cd0: 00 00 00 fa fa fa fa fa 00 00 00 00 00 00 00 06
0x0c0c7fff9ce0: fa fa fa fa 00 00 00 00 00 00 06 fa fa fa fa fa
0x0c0c7fff9cf0: fd fd fd fd fd fd fd fd fa fa fa fa 00 00 00 00
0x0c0c7fff9d00: 00 00 00 00 fa fa fa fa 00 00 00 00 00 00 00 00
0x0c0c7fff9d10: fa fa fa fa 00 00 00 00 00 00 00 00 fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Heap right redzone: fb
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack partial redzone: f4
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
ASan internal: fe
==18525==ABORTING

p5pRT · 2016-08-19T21:15:21Z

From @geeknik

test189.gz

p5pRT · 2016-08-23T01:19:40Z

From @tonycoz

On Fri Aug 19 14:15:21 2016, brian.carpenter@gmail.com wrote:

The attached test case triggers a heap-buffer-overflow in Perl_utf8_hop
(utf8.c:1132). This was found with AFL, ASAN and libdislocator.so
affects v5.25.4 (v5.25.3-305-g8c6b0c7). Perl 5.20.2 throws a few errors,
but doesn't crash.

A shorter test case attached.

00000000 42 45 47 49 4e 7b 24 5e 48 3d 30 78 38 30 30 30 |BEGIN{$^H=0x8000|
0000000 B E G I N { $ ^ H = 0 x 8 0 0 0
00000010 30 30 7d 0a 20 20 20 30 6d 20 30 b5 b5 30 30 b5 |00}. 0m 0..00.|
0000010 0 0 } \n 0 m 0 � � 0 0 �
0000020

Also detected by valgrind:

tony@mars:.../git/perl$ valgrind ./perl ../129000.pl
==14117== Memcheck, a memory error detector
==14117== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==14117== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==14117== Command: ./perl ../129000.pl
==14117==
Passing malformed UTF-8 to "XPosixWord" is deprecated at ../129000.pl line 2.
Passing malformed UTF-8 to "_Perl_IDStart" is deprecated at ../129000.pl line 2.
Malformed UTF-8 character (unexpected continuation byte 0xb5, with no preceding start byte) at ../129000.pl line 2.
==14117== Invalid read of size 1
==14117== at 0x6EE319: Perl_utf8_hop (utf8.c:1132)
==14117== by 0x4A719D: Perl_yylex (toke.c:4906)
==14117== by 0x4D318C: Perl_yyparse (perly.c:334)
==14117== by 0x461D5C: S_parse_body (perl.c:2373)
==14117== by 0x46023E: perl_parse (perl.c:1689)
==14117== by 0x41EFC5: main (perlmain.c:121)
==14117== Address 0x5f8198f is 1 bytes before a block of size 40 alloc'd
==14117== at 0x4C2AF2E: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14117== by 0x55A97A: Perl_safesysrealloc (util.c:274)
==14117== by 0x5C5448: Perl_sv_grow (sv.c:1602)
==14117== by 0x5FCFED: Perl_sv_gets (sv.c:8630)
==14117== by 0x4A3E6A: S_filter_gets (toke.c:4347)
==14117== by 0x4946B6: Perl_lex_next_chunk (toke.c:1309)
==14117== by 0x4A7CC2: Perl_yylex (toke.c:5020)
==14117== by 0x4D318C: Perl_yyparse (perly.c:334)
==14117== by 0x461D5C: S_parse_body (perl.c:2373)
==14117== by 0x46023E: perl_parse (perl.c:1689)
==14117== by 0x41EFC5: main (perlmain.c:121)
==14117==

As this requires feeding code to the perl parser I don't think it's security
issue (if you can feed code you can attack much more directly.)

It's still a bug, I suspect we're calling functions that are intended to
work on valid UTF-8 on strings with known invalid UTF-8.

Tony

p5pRT · 2016-08-23T01:19:40Z

From @tonycoz

129000b.pl.gz

p5pRT · 2016-08-23T01:19:40Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2016-09-15T06:03:31Z

From @geeknik

The attached test case triggers a heap-buffer-overflow in Perl_utf8_hop
(inline.h:497). This was found with AFL, ASAN and libdislocator. Perl
5.20.2 throws errors about malformed UTF8.

od -tx1 test86
0000000 42 45 47 49 4e 7b 24 5e 48 3d 32 2a 2a 34 30 30
0000020 7d 0a 22 30 30 22 22 30 30 30 81 81 22 81
0000036

==8516==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x60300000ea0f at pc 0x000000614c1f bp 0x7ffc1a618970 sp 0x7ffc1a618968
READ of size 1 at 0x60300000ea0f thread T0
#0 0x614c1e in Perl_utf8_hop /root/perl/./inline.h:497:6
#1 0x614c1e in Perl_yylex /root/perl/toke.c:4906
#2 0x6add7e in Perl_yyparse /root/perl/perly.c:334:19
#3 0x59c451 in S_parse_body /root/perl/perl.c:2373:9
#4 0x5927fc in perl_parse /root/perl/perl.c:1689:2
#5 0x4de5d5 in main /root/perl/perlmain.c:121:18
#6 0x7f2154d02b44 in __libc_start_main
/build/glibc-uPj9cH/glibc-2.19/csu/libc-start.c:287
#7 0x4de26c in _start (/root/perl/perl+0x4de26c)

0x60300000ea0f is located 1 bytes to the left of 32-byte region
[0x60300000ea10,0x60300000ea30)
allocated by thread T0 here:
#0 0x4c0ede in realloc (/root/perl/perl+0x4c0ede)
#1 0x7f8ac6 in Perl_safesysrealloc /root/perl/util.c:274:18

SUMMARY: AddressSanitizer: heap-buffer-overflow /root/perl/./inline.h:497
Perl_utf8_hop
Shadow bytes around the buggy address:
0x0c067fff9cf0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c067fff9d00: fa fa fa fa fa fa fa fa fa fa fa fa fd fd fd fd
0x0c067fff9d10: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd
0x0c067fff9d20: fd fd fa fa fd fd fd fd fa fa fd fd fd fa fa fa
0x0c067fff9d30: 00 00 00 00 fa fa fd fd fd fd fa fa fd fd fd fa
=>0x0c067fff9d40: fa[fa]00 00 00 00 fa fa 00 00 00 04 fa fa 00 00
0x0c067fff9d50: 00 01 fa fa 00 00 05 fa fa fa 00 00 00 06 fa fa
0x0c067fff9d60: 00 00 00 00 fa fa 00 00 00 00 fa fa 00 00 00 00
0x0c067fff9d70: fa fa 00 00 00 00 fa fa 00 00 00 00 fa fa 00 00
0x0c067fff9d80: 00 00 fa fa 00 00 00 00 fa fa fd fd fd fd fa fa
0x0c067fff9d90: 00 00 00 00 fa fa 00 00 00 00 fa fa fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Heap right redzone: fb
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack partial redzone: f4
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
ASan internal: fe
==8516==ABORTING

p5pRT · 2016-09-15T06:03:31Z

From @geeknik

test86.gz

p5pRT · 2016-10-06T14:34:29Z

From @hvds

The test program sets HINT_UTF8 via $^H, but the program itself is not valid utf8. Given perlvar says of $^H "WARNING: This variable is strictly for internal use only", I'm not sure how much we care about this case.

In any case, I don't believe it represents a security issue, just a way to shoot yourself in the foot.

Hugo

p5pRT · 2016-10-06T14:34:29Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2016-10-18T22:52:57Z

From @tonycoz

On Mon Aug 22 18:19:40 2016, tonyc wrote:

As this requires feeding code to the perl parser I don't think it's
security
issue (if you can feed code you can attack much more directly.)

It's still a bug, I suspect we're calling functions that are intended
to
work on valid UTF-8 on strings with known invalid UTF-8.

Since this isn't a security issue, I've moved it to the public queue.

I'm working on a fix.

Tony

p5pRT · 2016-10-19T03:51:49Z

From @tonycoz

On Tue Oct 18 15:52:57 2016, tonyc wrote:

I'm working on a fix.

Attached.

Tony

p5pRT · 2016-10-19T03:51:49Z

From @tonycoz

0001-perl-129000-create-a-safer-utf8_hop.patch

From b81f1e952df79b6b643768051d2d5a394d41cf81 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Wed, 19 Oct 2016 13:48:58 +1100
Subject: (perl #129000) create a safer utf8_hop()

Unlike utf8_hop(), utf8_hop_safe() won't navigate before the
beginning or after the end of the supplied buffer.
---
 embed.fnc                 |  1 +
 embed.h                   |  1 +
 ext/XS-APItest/APItest.xs | 13 +++++++++++++
 ext/XS-APItest/t/utf8.t   | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 inline.h                  | 44 ++++++++++++++++++++++++++++++++++++++++++++
 proto.h                   |  6 ++++++
 6 files changed, 111 insertions(+)

diff --git a/embed.fnc b/embed.fnc
index 46426b6..51aaa1e 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -1725,6 +1725,7 @@ Ap	|U8*	|utf16_to_utf8_reversed|NN U8* p|NN U8 *d|I32 bytelen|NN I32 *newlen
 AdpPR	|STRLEN	|utf8_length	|NN const U8* s|NN const U8 *e
 AipdPR	|IV	|utf8_distance	|NN const U8 *a|NN const U8 *b
 AipdPRn	|U8*	|utf8_hop	|NN const U8 *s|SSize_t off
+AipdPRn	|U8*	|utf8_hop_safe	|NN const U8 *s|SSize_t off|NN const U8 *start|NN const U8 *end
 ApMd	|U8*	|utf8_to_bytes	|NN U8 *s|NN STRLEN *len
 Apd	|int	|bytes_cmp_utf8	|NN const U8 *b|STRLEN blen|NN const U8 *u \
 				|STRLEN ulen
diff --git a/embed.h b/embed.h
index 5df381c..0013ca8 100644
--- a/embed.h
+++ b/embed.h
@@ -732,6 +732,7 @@
 #define utf16_to_utf8_reversed(a,b,c,d)	Perl_utf16_to_utf8_reversed(aTHX_ a,b,c,d)
 #define utf8_distance(a,b)	Perl_utf8_distance(aTHX_ a,b)
 #define utf8_hop		Perl_utf8_hop
+#define utf8_hop_safe		Perl_utf8_hop_safe
 #define utf8_length(a,b)	Perl_utf8_length(aTHX_ a,b)
 #define utf8_to_bytes(a,b)	Perl_utf8_to_bytes(aTHX_ a,b)
 #define utf8_to_uvchr(a,b)	Perl_utf8_to_uvchr(aTHX_ a,b)
diff --git a/ext/XS-APItest/APItest.xs b/ext/XS-APItest/APItest.xs
index 6b6f45f..17788ca 100644
--- a/ext/XS-APItest/APItest.xs
+++ b/ext/XS-APItest/APItest.xs
@@ -5538,6 +5538,19 @@ test_is_utf8_fixed_width_buf_loclen_flags(char *s, STRLEN len, U32 flags)
     OUTPUT:
         RETVAL
 
+IV
+test_utf8_hop_safe(SV *s_sv, STRLEN s_off, IV off)
+    PREINIT:
+        STRLEN len;
+        U8 *p;
+        U8 *r;
+    CODE:
+        p = (U8 *)SvPV(s_sv, len);
+        r = utf8_hop_safe(p + s_off, off, p, p + len);
+        RETVAL = r - p;
+    OUTPUT:
+        RETVAL
+
 UV
 test_toLOWER(UV ord)
     CODE:
diff --git a/ext/XS-APItest/t/utf8.t b/ext/XS-APItest/t/utf8.t
index 0f2d9ee..3dd044e 100644
--- a/ext/XS-APItest/t/utf8.t
+++ b/ext/XS-APItest/t/utf8.t
@@ -2375,4 +2375,50 @@ foreach my $test (@tests) {
     }
 }
 
+SKIP:
+{
+    isASCII
+      or skip "These tests probably break on non-ASCII", 1;
+    my $simple = join "", "A" .. "J";
+    my $utf_ch = "\x{7fffffff}";
+    utf8::encode($utf_ch);
+    my $utf_ch_len = length $utf_ch;
+    note "utf_ch_len $utf_ch_len";
+    my $utf = $utf_ch x 10;
+    my $bad_start = substr($utf, 1);
+    # $bad_end ends with a start byte and a single continuation
+    my $bad_end = substr($utf, 0, length($utf)-$utf_ch_len+2);
+
+    # WARNING: all offsets are *byte* offsets
+    my @hop_tests =
+      (
+       # string      s                off        expected         name
+       [ $simple,    0,               5,         5,               "simple in range, forward" ],
+       [ $simple,    10,              -5,        5,               "simple in range, backward" ],
+       [ $simple,    5,               10,        10,              "simple out of range, forward" ],
+       [ $simple,    5,               -10,       0,               "simple out of range, backward" ],
+       [ $utf,       $utf_ch_len * 5, 5,         length($utf),    "utf in range, forward" ],
+       [ $utf,       $utf_ch_len * 5, -5,        0,               "utf in range, backward" ],
+       [ $utf,       $utf_ch_len * 5, 4,         $utf_ch_len * 9, "utf in range b, forward" ],
+       [ $utf,       $utf_ch_len * 5, -4,        $utf_ch_len,     "utf in range b, backward" ],
+       [ $utf,       $utf_ch_len * 5, 6,         length($utf),    "utf out of range, forward" ],
+       [ $utf,       $utf_ch_len * 5, -6,        0,               "utf out of range, backward"  ],
+       [ $bad_start, 0,               1,         1,               "bad start, forward 1 from 0" ],
+       [ $bad_start, 0,               $utf_ch_len-1, $utf_ch_len-1, "bad start, forward ch_len-1 from 0" ],
+       [ $bad_start, 0,               $utf_ch_len, $utf_ch_len*2-1, "bad start, forward ch_len from 0" ],
+       [ $bad_start, $utf_ch_len-1,   -1,        0,                "bad start, back 1 from first start byte" ],
+       [ $bad_start, $utf_ch_len-2,   -1,        0,                "bad start, back 1 from before first start byte" ],
+       [ $bad_start, 0,               -1,        0,                "bad start, back 1 from 0" ],
+       [ $bad_start, length $bad_start, -10,     0,                "bad start, back 10 from end" ],
+       [ $bad_end,   0,               10,        length $bad_end, "bad end, forward 10 from 0" ],
+       [ $bad_end,   length($bad_end)-1, 10,     length $bad_end, "bad end, forward 1 from end-1" ],
+       );
+       
+    for my $test (@hop_tests) {
+        my ($str, $s_off, $off, $want, $name) = @$test;
+        my $result = test_utf8_hop_safe($str, $s_off, $off);
+        is($result, $want, "utf8_hop_safe: $name");
+    }
+}
+
 done_testing;
diff --git a/inline.h b/inline.h
index 66ba348..32ed0aa 100644
--- a/inline.h
+++ b/inline.h
@@ -920,6 +920,50 @@ Perl_utf8_hop(const U8 *s, SSize_t off)
 }
 
 /*
+=for apidoc utf8_hop_safe
+
+Return the UTF-8 pointer C<s> displaced by up to C<off> characters,
+either forward or backward.
+
+When moving backward it will not move before C<start>.
+
+When moving forward it will not move beyond C<end>.
+
+Will not exceed those limits even if the string is not valid "UTF-8".
+
+=cut
+*/
+
+PERL_STATIC_INLINE U8 *
+Perl_utf8_hop_safe(const U8 *s, SSize_t off, const U8 *start, const U8 *end)
+{
+    PERL_ARGS_ASSERT_UTF8_HOP_SAFE;
+
+    /* Note: cannot use UTF8_IS_...() too eagerly here since e.g
+     * the bitops (especially ~) can create illegal UTF-8.
+     * In other words: in Perl UTF-8 is not just for Unicode. */
+
+    assert(start <= s && s <= end);
+
+    if (off >= 0) {
+	while (off--) {
+            STRLEN skip = UTF8SKIP(s);
+            if ((STRLEN)(end - s) <= skip)
+                return (U8 *)end;
+	    s += skip;
+        }
+    }
+    else {
+	while (off++ && s > start) {
+	    s--;
+	    while (UTF8_IS_CONTINUATION(*s) && s > start)
+		s--;
+	}
+    }
+    return (U8 *)s;
+}
+
+/*
 
 =for apidoc is_utf8_valid_partial_char
 
diff --git a/proto.h b/proto.h
index 701dc9e..2c67126 100644
--- a/proto.h
+++ b/proto.h
@@ -3509,6 +3509,12 @@ PERL_STATIC_INLINE U8*	Perl_utf8_hop(const U8 *s, SSize_t off)
 #define PERL_ARGS_ASSERT_UTF8_HOP	\
 	assert(s)
 
+PERL_STATIC_INLINE U8*	Perl_utf8_hop_safe(const U8 *s, SSize_t off, const U8 *start, const U8 *end)
+			__attribute__warn_unused_result__
+			__attribute__pure__;
+#define PERL_ARGS_ASSERT_UTF8_HOP_SAFE	\
+	assert(s); assert(start); assert(end)
+
 PERL_CALLCONV STRLEN	Perl_utf8_length(pTHX_ const U8* s, const U8 *e)
 			__attribute__warn_unused_result__
 			__attribute__pure__;
-- 
2.1.4

p5pRT · 2016-10-19T03:51:49Z

From @tonycoz

0002-perl-129000-use-the-new-utf8_hop_safe.patch

From 48b8f39309cc65e0c3445079a83e3e9633d596de Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Wed, 19 Oct 2016 14:20:53 +1100
Subject: (perl #129000) use the new utf8_hop_safe()

when reporting unrecognized characters in UTF mode.

This could start the display before the beginning of the line if the
line was badly encoded.
---
 t/op/lex.t | 8 +++++++-
 toke.c     | 3 ++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/t/op/lex.t b/t/op/lex.t
index f3cb510..3d9eec7 100644
--- a/t/op/lex.t
+++ b/t/op/lex.t
@@ -7,7 +7,7 @@ use warnings;
 
 BEGIN { chdir 't' if -d 't'; require './test.pl'; }
 
-plan(tests => 33);
+plan(tests => 34);
 
 {
     no warnings 'deprecated';
@@ -261,3 +261,9 @@ fresh_perl_like(
     {},
     '[perl #128997] - buffer read overflow'
 );
+fresh_perl_like(
+    qq(BEGIN{\$^H=0x800000}\n   0m 0\xB5\xB500\xB5\0),
+    qr/Unrecognized character \\x\{0\}; marked by <-- HERE after    0m.*<-- HERE near column 12 at - line 2./,
+    {},
+    '[perl #129000] read before buffer'
+);
diff --git a/toke.c b/toke.c
index 041996f..32a2f43 100644
--- a/toke.c
+++ b/toke.c
@@ -4910,7 +4910,8 @@ Perl_yylex(pTHX)
         }
         len = UTF ? Perl_utf8_length(aTHX_ (U8 *) PL_linestart, (U8 *) s) : (STRLEN) (s - PL_linestart);
         if (len > UNRECOGNIZED_PRECEDE_COUNT) {
-            d = UTF ? (char *) utf8_hop((U8 *) s, -UNRECOGNIZED_PRECEDE_COUNT) : s - UNRECOGNIZED_PRECEDE_COUNT;
+            d = UTF ? (char *) utf8_hop_safe((U8 *) s, -UNRECOGNIZED_PRECEDE_COUNT, (U8 *)PL_linestart, (U8 *)PL_bufend)
+                : s - UNRECOGNIZED_PRECEDE_COUNT;
         } else {
             d = PL_linestart;
         }
-- 
2.1.4

p5pRT · 2016-10-31T04:02:22Z

From @tonycoz

On Tue Oct 18 20:51:49 2016, tonyc wrote:

On Tue Oct 18 15:52:57 2016, tonyc wrote:

I'm working on a fix.

Attached.

Here's an alternate version.

In most cases we know which direction we're going at the call
site.

Tony

p5pRT · 2016-10-31T04:02:22Z

From @tonycoz

0001-perl-129000-create-a-safer-utf8_hop.patch

From d32880fb910f2d3e0d67256d4eceabaa7ac9b414 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Mon, 31 Oct 2016 14:28:34 +1100
Subject: (perl #129000) create a safer utf8_hop()

Unlike utf8_hop(), utf8_hop_safe() won't navigate before the
beginning or after the end of the supplied buffer.

The original version of this put all of the logic into
utf8_hop_safe(), but in many cases a caller specifically
needs to go forward or backward, and supplying the other limit
made the function less usable, so I split the function
into forward and backward cases.

This split may also make inlining these functions more efficient
or more likely.
---
 embed.fnc                 |   3 ++
 embed.h                   |   3 ++
 ext/XS-APItest/APItest.pm |   2 +-
 ext/XS-APItest/APItest.xs |  13 ++++++
 ext/XS-APItest/t/utf8.t   |  46 +++++++++++++++++++
 inline.h                  | 111 ++++++++++++++++++++++++++++++++++++++++++++++
 proto.h                   |  18 ++++++++
 7 files changed, 195 insertions(+), 1 deletion(-)

diff --git a/embed.fnc b/embed.fnc
index 5cc73b7..f806db0 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -1733,6 +1733,9 @@ Ap	|U8*	|utf16_to_utf8_reversed|NN U8* p|NN U8 *d|I32 bytelen|NN I32 *newlen
 AdpPR	|STRLEN	|utf8_length	|NN const U8* s|NN const U8 *e
 AipdPR	|IV	|utf8_distance	|NN const U8 *a|NN const U8 *b
 AipdPRn	|U8*	|utf8_hop	|NN const U8 *s|SSize_t off
+AipdPRn	|U8*	|utf8_hop_back|NN const U8 *s|SSize_t off|NN const U8 *start
+AipdPRn	|U8*	|utf8_hop_forward|NN const U8 *s|SSize_t off|NN const U8 *end
+AipdPRn	|U8*	|utf8_hop_safe	|NN const U8 *s|SSize_t off|NN const U8 *start|NN const U8 *end
 ApMd	|U8*	|utf8_to_bytes	|NN U8 *s|NN STRLEN *len
 Apd	|int	|bytes_cmp_utf8	|NN const U8 *b|STRLEN blen|NN const U8 *u \
 				|STRLEN ulen
diff --git a/embed.h b/embed.h
index 1af2917..2fdbff6 100644
--- a/embed.h
+++ b/embed.h
@@ -733,6 +733,9 @@
 #define utf16_to_utf8_reversed(a,b,c,d)	Perl_utf16_to_utf8_reversed(aTHX_ a,b,c,d)
 #define utf8_distance(a,b)	Perl_utf8_distance(aTHX_ a,b)
 #define utf8_hop		Perl_utf8_hop
+#define utf8_hop_back		Perl_utf8_hop_back
+#define utf8_hop_forward	Perl_utf8_hop_forward
+#define utf8_hop_safe		Perl_utf8_hop_safe
 #define utf8_length(a,b)	Perl_utf8_length(aTHX_ a,b)
 #define utf8_to_bytes(a,b)	Perl_utf8_to_bytes(aTHX_ a,b)
 #define utf8_to_uvchr(a,b)	Perl_utf8_to_uvchr(aTHX_ a,b)
diff --git a/ext/XS-APItest/APItest.pm b/ext/XS-APItest/APItest.pm
index 64a25f1..473d4a3 100644
--- a/ext/XS-APItest/APItest.pm
+++ b/ext/XS-APItest/APItest.pm
@@ -5,7 +5,7 @@ use strict;
 use warnings;
 use Carp;
 
-our $VERSION = '0.86';
+our $VERSION = '0.87';
 
 require XSLoader;
 
diff --git a/ext/XS-APItest/APItest.xs b/ext/XS-APItest/APItest.xs
index 6dbb297..a2a31ec 100644
--- a/ext/XS-APItest/APItest.xs
+++ b/ext/XS-APItest/APItest.xs
@@ -5538,6 +5538,19 @@ test_is_utf8_fixed_width_buf_loclen_flags(char *s, STRLEN len, U32 flags)
     OUTPUT:
         RETVAL
 
+IV
+test_utf8_hop_safe(SV *s_sv, STRLEN s_off, IV off)
+    PREINIT:
+        STRLEN len;
+        U8 *p;
+        U8 *r;
+    CODE:
+        p = (U8 *)SvPV(s_sv, len);
+        r = utf8_hop_safe(p + s_off, off, p, p + len);
+        RETVAL = r - p;
+    OUTPUT:
+        RETVAL
+
 UV
 test_toLOWER(UV ord)
     CODE:
diff --git a/ext/XS-APItest/t/utf8.t b/ext/XS-APItest/t/utf8.t
index fc04dfc..9e7685e 100644
--- a/ext/XS-APItest/t/utf8.t
+++ b/ext/XS-APItest/t/utf8.t
@@ -2379,4 +2379,50 @@ foreach my $test (@tests) {
     }
 }
 
+SKIP:
+{
+    isASCII
+      or skip "These tests probably break on non-ASCII", 1;
+    my $simple = join "", "A" .. "J";
+    my $utf_ch = "\x{7fffffff}";
+    utf8::encode($utf_ch);
+    my $utf_ch_len = length $utf_ch;
+    note "utf_ch_len $utf_ch_len";
+    my $utf = $utf_ch x 10;
+    my $bad_start = substr($utf, 1);
+    # $bad_end ends with a start byte and a single continuation
+    my $bad_end = substr($utf, 0, length($utf)-$utf_ch_len+2);
+
+    # WARNING: all offsets are *byte* offsets
+    my @hop_tests =
+      (
+       # string      s                off        expected         name
+       [ $simple,    0,               5,         5,               "simple in range, forward" ],
+       [ $simple,    10,              -5,        5,               "simple in range, backward" ],
+       [ $simple,    5,               10,        10,              "simple out of range, forward" ],
+       [ $simple,    5,               -10,       0,               "simple out of range, backward" ],
+       [ $utf,       $utf_ch_len * 5, 5,         length($utf),    "utf in range, forward" ],
+       [ $utf,       $utf_ch_len * 5, -5,        0,               "utf in range, backward" ],
+       [ $utf,       $utf_ch_len * 5, 4,         $utf_ch_len * 9, "utf in range b, forward" ],
+       [ $utf,       $utf_ch_len * 5, -4,        $utf_ch_len,     "utf in range b, backward" ],
+       [ $utf,       $utf_ch_len * 5, 6,         length($utf),    "utf out of range, forward" ],
+       [ $utf,       $utf_ch_len * 5, -6,        0,               "utf out of range, backward"  ],
+       [ $bad_start, 0,               1,         1,               "bad start, forward 1 from 0" ],
+       [ $bad_start, 0,               $utf_ch_len-1, $utf_ch_len-1, "bad start, forward ch_len-1 from 0" ],
+       [ $bad_start, 0,               $utf_ch_len, $utf_ch_len*2-1, "bad start, forward ch_len from 0" ],
+       [ $bad_start, $utf_ch_len-1,   -1,        0,                "bad start, back 1 from first start byte" ],
+       [ $bad_start, $utf_ch_len-2,   -1,        0,                "bad start, back 1 from before first start byte" ],
+       [ $bad_start, 0,               -1,        0,                "bad start, back 1 from 0" ],
+       [ $bad_start, length $bad_start, -10,     0,                "bad start, back 10 from end" ],
+       [ $bad_end,   0,               10,        length $bad_end, "bad end, forward 10 from 0" ],
+       [ $bad_end,   length($bad_end)-1, 10,     length $bad_end, "bad end, forward 1 from end-1" ],
+       );
+
+    for my $test (@hop_tests) {
+        my ($str, $s_off, $off, $want, $name) = @$test;
+        my $result = test_utf8_hop_safe($str, $s_off, $off);
+        is($result, $want, "utf8_hop_safe: $name");
+    }
+}
+
 done_testing;
diff --git a/inline.h b/inline.h
index 66ba348..adcd85d 100644
--- a/inline.h
+++ b/inline.h
@@ -920,6 +920,117 @@ Perl_utf8_hop(const U8 *s, SSize_t off)
 }
 
 /*
+=for apidoc utf8_hop_forward
+
+Return the UTF-8 pointer C<s> displaced by up to C<off> characters,
+forward.
+
+C<off> must be non-negative.
+
+C<s> must be before or equal to C<end>.
+
+When moving forward it will not move beyond C<end>.
+
+Will not exceed this limit even if the string is not valid "UTF-8".
+
+=cut
+*/
+
+PERL_STATIC_INLINE U8 *
+Perl_utf8_hop_forward(const U8 *s, SSize_t off, const U8 *end)
+{
+    PERL_ARGS_ASSERT_UTF8_HOP_FORWARD;
+
+    /* Note: cannot use UTF8_IS_...() too eagerly here since e.g
+     * the bitops (especially ~) can create illegal UTF-8.
+     * In other words: in Perl UTF-8 is not just for Unicode. */
+
+    assert(s <= end);
+    assert(off >= 0);
+
+    while (off--) {
+        STRLEN skip = UTF8SKIP(s);
+        if ((STRLEN)(end - s) <= skip)
+            return (U8 *)end;
+        s += skip;
+    }
+
+    return (U8 *)s;
+}
+
+/*
+=for apidoc utf8_hop_back
+
+Return the UTF-8 pointer C<s> displaced by up to C<off> characters,
+backward.
+
+C<off> must be non-positive.
+
+C<s> must be after or equal to C<start>.
+
+When moving backward it will not move before C<start>.
+
+Will not exceed this limit even if the string is not valid "UTF-8".
+
+=cut
+*/
+
+PERL_STATIC_INLINE U8 *
+Perl_utf8_hop_back(const U8 *s, SSize_t off, const U8 *start)
+{
+    PERL_ARGS_ASSERT_UTF8_HOP_BACK;
+
+    /* Note: cannot use UTF8_IS_...() too eagerly here since e.g
+     * the bitops (especially ~) can create illegal UTF-8.
+     * In other words: in Perl UTF-8 is not just for Unicode. */
+
+    assert(start <= s);
+    assert(off <= 0);
+
+    while (off++ && s > start) {
+        s--;
+        while (UTF8_IS_CONTINUATION(*s) && s > start)
+            s--;
+    }
+    
+    return (U8 *)s;
+}
+
+/*
+=for apidoc utf8_hop_safe
+
+Return the UTF-8 pointer C<s> displaced by up to C<off> characters,
+either forward or backward.
+
+When moving backward it will not move before C<start>.
+
+When moving forward it will not move beyond C<end>.
+
+Will not exceed those limits even if the string is not valid "UTF-8".
+
+=cut
+*/
+
+PERL_STATIC_INLINE U8 *
+Perl_utf8_hop_safe(const U8 *s, SSize_t off, const U8 *start, const U8 *end)
+{
+    PERL_ARGS_ASSERT_UTF8_HOP_SAFE;
+
+    /* Note: cannot use UTF8_IS_...() too eagerly here since e.g
+     * the bitops (especially ~) can create illegal UTF-8.
+     * In other words: in Perl UTF-8 is not just for Unicode. */
+
+    assert(start <= s && s <= end);
+
+    if (off >= 0) {
+        return utf8_hop_forward(s, off, end);
+    }
+    else {
+        return utf8_hop_back(s, off, start);
+    }
+}
+
+/*
 
 =for apidoc is_utf8_valid_partial_char
 
diff --git a/proto.h b/proto.h
index 1d79c46..ce7a92c 100644
--- a/proto.h
+++ b/proto.h
@@ -3512,6 +3512,24 @@ PERL_STATIC_INLINE U8*	Perl_utf8_hop(const U8 *s, SSize_t off)
 #define PERL_ARGS_ASSERT_UTF8_HOP	\
 	assert(s)
 
+PERL_STATIC_INLINE U8*	Perl_utf8_hop_back(const U8 *s, SSize_t off, const U8 *start)
+			__attribute__warn_unused_result__
+			__attribute__pure__;
+#define PERL_ARGS_ASSERT_UTF8_HOP_BACK	\
+	assert(s); assert(start)
+
+PERL_STATIC_INLINE U8*	Perl_utf8_hop_forward(const U8 *s, SSize_t off, const U8 *end)
+			__attribute__warn_unused_result__
+			__attribute__pure__;
+#define PERL_ARGS_ASSERT_UTF8_HOP_FORWARD	\
+	assert(s); assert(end)
+
+PERL_STATIC_INLINE U8*	Perl_utf8_hop_safe(const U8 *s, SSize_t off, const U8 *start, const U8 *end)
+			__attribute__warn_unused_result__
+			__attribute__pure__;
+#define PERL_ARGS_ASSERT_UTF8_HOP_SAFE	\
+	assert(s); assert(start); assert(end)
+
 PERL_CALLCONV STRLEN	Perl_utf8_length(pTHX_ const U8* s, const U8 *e)
 			__attribute__warn_unused_result__
 			__attribute__pure__;
-- 
2.1.4

p5pRT · 2016-10-31T04:02:22Z

From @tonycoz

0002-perl-129000-use-the-new-utf8_hop_safe.patch

From d8f9ff90f4f432b887308b04139ba6ba39887876 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Mon, 31 Oct 2016 14:49:17 +1100
Subject: (perl #129000) use the new utf8_hop_safe()

when reporting unrecognized characters in UTF mode.
---
 t/op/lex.t | 8 +++++++-
 toke.c     | 2 +-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/t/op/lex.t b/t/op/lex.t
index db0cf3a..a0ed5cb 100644
--- a/t/op/lex.t
+++ b/t/op/lex.t
@@ -7,7 +7,7 @@ use warnings;
 
 BEGIN { chdir 't' if -d 't'; require './test.pl'; }
 
-plan(tests => 32);
+plan(tests => 33);
 
 {
     no warnings 'deprecated';
@@ -255,3 +255,9 @@ fresh_perl_is(
     {},
     '[perl #128996] - use of PL_op after op is freed'
 );
+fresh_perl_like(
+    qq(BEGIN{\$^H=0x800000}\n   0m 0\xB5\xB500\xB5\0),
+    qr/Unrecognized character \\x\{0\}; marked by <-- HERE after    0m.*<-- HERE near column 12 at - line 2./,
+    {},
+    '[perl #129000] read before buffer'
+);
diff --git a/toke.c b/toke.c
index b2d9209..261764b 100644
--- a/toke.c
+++ b/toke.c
@@ -4908,7 +4908,7 @@ Perl_yylex(pTHX)
                             : Perl_form(aTHX_ "\\x%02X", (unsigned char)*s);
         len = UTF ? Perl_utf8_length(aTHX_ (U8 *) PL_linestart, (U8 *) s) : (STRLEN) (s - PL_linestart);
         if (len > UNRECOGNIZED_PRECEDE_COUNT) {
-            d = UTF ? (char *) utf8_hop((U8 *) s, -UNRECOGNIZED_PRECEDE_COUNT) : s - UNRECOGNIZED_PRECEDE_COUNT;
+            d = UTF ? (char *) utf8_hop_back((U8 *) s, -UNRECOGNIZED_PRECEDE_COUNT, (U8 *)PL_linestart) : s - UNRECOGNIZED_PRECEDE_COUNT;
         } else {
             d = PL_linestart;
         }
-- 
2.1.4

p5pRT · 2016-11-09T03:44:38Z

From @tonycoz

On Sun, 30 Oct 2016 21:02:22 -0700, tonyc wrote:

On Tue Oct 18 20:51:49 2016, tonyc wrote:

On Tue Oct 18 15:52:57 2016, tonyc wrote:

I'm working on a fix.

Attached.

Here's an alternate version.

In most cases we know which direction we're going at the call
site.

Applied as 65df57a and cbe6b21.

Also added a skip to skip the new test and a few others that depend on UTF-8
vs UTF-EBCIDIC.

Tony

p5pRT · 2016-11-09T03:44:39Z

@tonycoz - Status changed from 'open' to 'pending release'

p5pRT · 2016-11-29T04:28:24Z

From @tonycoz

On Thu, 06 Oct 2016 07:34:29 -0700, hv wrote:

The test program sets HINT_UTF8 via $^H, but the program itself is not
valid utf8. Given perlvar says of $^H "WARNING: This variable is
strictly for internal use only", I'm not sure how much we care about
this case.

In any case, I don't believe it represents a security issue, just a
way to shoot yourself in the foot.

This is a duplicate of 129000 which was fixed in cbe6b21.

Merging into 129000.

Tony

p5pRT · 2017-05-30T20:14:46Z

From @khwilliamson

Thank you for filing this report. You have helped make Perl better.

With the release today of Perl 5.26.0, this and 210 other issues have been
resolved.

Perl 5.26.0 may be downloaded via:
https://metacpan.org/release/XSAWYERX/perl-5.26.0

If you find that the problem persists, feel free to reopen this ticket.

p5pRT · 2017-05-30T20:14:47Z

@khwilliamson - Status changed from 'pending release' to 'resolved'

p5pRT closed this as completed May 30, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

heap-buffer-overflow Perl_utf8_hop (utf8.c:1132) #15532

heap-buffer-overflow Perl_utf8_hop (utf8.c:1132) #15532

p5pRT commented Aug 19, 2016

p5pRT commented Aug 19, 2016

p5pRT commented Aug 19, 2016

p5pRT commented Aug 23, 2016

p5pRT commented Aug 23, 2016

p5pRT commented Aug 23, 2016

p5pRT commented Sep 15, 2016

p5pRT commented Sep 15, 2016

p5pRT commented Oct 6, 2016

p5pRT commented Oct 6, 2016

p5pRT commented Oct 18, 2016

p5pRT commented Oct 19, 2016

p5pRT commented Oct 19, 2016

p5pRT commented Oct 19, 2016

p5pRT commented Oct 31, 2016

p5pRT commented Oct 31, 2016

p5pRT commented Oct 31, 2016

p5pRT commented Nov 9, 2016

p5pRT commented Nov 9, 2016

p5pRT commented Nov 29, 2016

p5pRT commented May 30, 2017

p5pRT commented May 30, 2017

heap-buffer-overflow Perl_utf8_hop (utf8.c:1132) #15532

heap-buffer-overflow Perl_utf8_hop (utf8.c:1132) #15532

Comments

p5pRT commented Aug 19, 2016

p5pRT commented Aug 19, 2016

From @geeknik

p5pRT commented Aug 19, 2016

From @geeknik

p5pRT commented Aug 23, 2016

From @tonycoz

p5pRT commented Aug 23, 2016

From @tonycoz

p5pRT commented Aug 23, 2016

p5pRT commented Sep 15, 2016

From @geeknik

p5pRT commented Sep 15, 2016

From @geeknik

p5pRT commented Oct 6, 2016

From @hvds

p5pRT commented Oct 6, 2016

p5pRT commented Oct 18, 2016

From @tonycoz

p5pRT commented Oct 19, 2016

From @tonycoz

p5pRT commented Oct 19, 2016

From @tonycoz

p5pRT commented Oct 19, 2016

From @tonycoz

p5pRT commented Oct 31, 2016

From @tonycoz

p5pRT commented Oct 31, 2016

From @tonycoz

p5pRT commented Oct 31, 2016

From @tonycoz

p5pRT commented Nov 9, 2016

From @tonycoz

p5pRT commented Nov 9, 2016

p5pRT commented Nov 29, 2016

From @tonycoz

p5pRT commented May 30, 2017

From @khwilliamson

p5pRT commented May 30, 2017