Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

utf8-c8 is not reversable: encode/decode mismatch #5330

Closed
p6rt opened this issue May 19, 2016 · 11 comments
Closed

utf8-c8 is not reversable: encode/decode mismatch #5330

p6rt opened this issue May 19, 2016 · 11 comments
Labels

Comments

@p6rt
Copy link

p6rt commented May 19, 2016

Migrated from rt.perl.org#128184 (status was 'resolved')

Searchable as RT128184$

@p6rt
Copy link
Author

p6rt commented May 19, 2016

From @Tux

Two issues

I use this to test​:
--8<---
#!perl6

use v6;

use Test;

for ^32 {

  say "";
  say $_;

  my @​data = ^20 .map({ 256.rand.Int }).list;
  @​data.unshift​: 61;

  #dd @​data;

  my $b = Buf.new(@​data);

  ok((my Str $u = $b.decode("utf8-c8")), "decode");

  my @​back = $u.encode("utf8-c8").list;

  #dd @​back;

  my $n = Buf.new(@​back);

  is-deeply($n, $b, "Data");
  }
-->8---

First issue is that the buffer returns something longer than the original (a \0 is added)​:

# expected​: Buf.new(61,29,61,200,30,99,107,150,71,11,253,134,110,27,35,227,88,140,180,158,209)
# got​: Buf.new(61,29,61,200,30,99,107,150,71,11,253,134,110,27,35,227,88,140,180,158,209,0)

# expected​: Buf.new(61,2,71,91,58,252,6,247,88,58,121,32,124,129,191,126,36,222,185,109,213)
# got​: Buf.new(61,2,71,91,58,252,6,247,88,58,121,32,124,129,191,126,36,222,185,109,213,0)

The second issue is more fun, pairs are swapped​:

# expected​: Buf.new(61,147,135,8,82,78,208,66,205,164,204,162,140,97,175,37,108,194,27,192,119)
# got​: Buf.new(61,147,135,8,82,78,208,66,204,162,205,164,140,97,175,37,108,194,27,192,119)

205,164,204,162 => 204,162,205,164

@p6rt
Copy link
Author

p6rt commented May 24, 2016

From @Tux

# expected​: Buf.new(61,10,0,56,143,36,56,119,182,81,88,70,88,139,28,119,142,151,108,12,215)
# got​: Buf.new(61,10,0,56,143,36,56,119,182,81,88,70,88,139,28,119,142,151,108,12,215,127)

@p6rt
Copy link
Author

p6rt commented May 25, 2016

From @Tux

# expected​: Buf.new(61,93,12,110,139,89,42,134,251,165,68,32,104,225,44,112,194,178,75,64,243)
# got​: Buf.new(61,93,12,110,139,89,42,134,251,165,68,32,104,225,44,112,194,178,75,64,243,10)

@p6rt
Copy link
Author

p6rt commented Jun 27, 2016

From @jdv

This change at or around line 281 of MoarVM/src/strings/utf8_c8.c​:

- } while (++last_accept_utf8 != utf8);
+ } while (++last_accept_utf8 != utf8 - 1);

seems to clear up the additional trailing
char issue. The bug only seems to happen when the last byte is in the range of 194-244.

@p6rt
Copy link
Author

p6rt commented Jun 27, 2016

From @jdv

The other issue with swapped pairs seems to be the normalizer getting confused. I don't know enough to even suggest a fix but if one
comments out the call to canonical_sort inside of MVM_unicode_normalizer_eof in MoarVM/src/strings/normalize.c then a simple
example run using Buf.new(205,164,204,150)
will then work out ok.

@p6rt
Copy link
Author

p6rt commented Jun 30, 2016

From @Tux

# expected​: Buf.new(61,185,242,97,170,122,52,182,62,236,186,222,213,63,189,203,241,176,1,149,233)
# got​: Buf.new(61,185,242,97,170,122,52,182,62,236,186,222,213,63,189,203,241,176,1,149,233,127)

@p6rt
Copy link
Author

p6rt commented Jul 16, 2016

From @zoffixznet

♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥
🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁

TODO-fudged tests added in Raku/roast@16bcd2693d

🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁🏁
♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥

@p6rt
Copy link
Author

p6rt commented Jul 16, 2016

The RT System itself - Status changed from 'new' to 'open'

@p6rt
Copy link
Author

p6rt commented Jul 17, 2016

From @nwc10

On Sat Jul 16 06​:03​:19 2016, cpan@​zoffix.com wrote​:

TODO-fudged tests added in Raku/roast@16bcd2693d

All 6 tests trigger ASAN aborts. For example​:

$ ./perl6-m -Ilib 128184
1..6

==2192==
ERROR​: AddressSanitizer​: heap-buffer-overflow on address 0x60300048aa95 at pc 0x7fad84a21974 bp 0x7fffc99206f0 sp 0x7fffc99206e8
READ of size 1 at 0x60300048aa95 thread T0
  #​0 0x7fad84a21973 in MVM_string_utf8_c8_decode src/strings/utf8_c8.c​:280
  #​1 0x7fad84a2ea2e in MVM_string_decode src/strings/ops.c​:823
  #​2 0x7fad84a2f4b9 in MVM_string_decode_from_buf src/strings/ops.c​:927
  #​3 0x7fad84797908 in MVM_interp_run src/core/interp.c​:1653
  #​4 0x7fad84a83704 in MVM_vm_run_file src/moar.c​:304
  #​5 0x401a4f in main src/main.c​:191
  #​6 0x7fad83fc4d5c in __libc_start_main (/lib64/libc.so.6+0x1ed5c)
  #​7 0x401058 (/home/nicholas/Sandpit/moar-san/bin/moar+0x401058)

0x60300048aa95 is located 0 bytes to the right of 21-byte region [0x60300048aa80,0x60300048aa95)
allocated by thread T0 here​:
  #​0 0x7fad853a662f in __interceptor_malloc ../../.././libsanitizer/asan/asan_malloc_linux.cc​:72
  #​1 0x7fad848d8d39 in MVM_malloc src/core/alloc.h​:2
  #​2 0x7fad848dbb9e in set_size_internal src/6model/reprs/MVMArray.c​:334
  #​3 0x7fad848dc9d6 in set_elems src/6model/reprs/MVMArray.c​:431
  #​4 0x7fad847a5a6d in MVM_interp_run src/core/interp.c​:2263
  #​5 0x7fad84a83704 in MVM_vm_run_file src/moar.c​:304
  #​6 0x401a4f in main src/main.c​:191
  #​7 0x7fad83fc4d5c in __libc_start_main (/lib64/libc.so.6+0x1ed5c)

SUMMARY​: AddressSanitizer​: heap-buffer-overflow src/strings/utf8_c8.c​:280 MVM_string_utf8_c8_decode
Shadow bytes around the buggy address​:
  0x0c0680089500​: fa fa 00 00 04 fa fa fa 00 00 06 fa fa fa fd fd
  0x0c0680089510​: fd fa fa fa fd fd fd fd fa fa fd fd fd fd fa fa
  0x0c0680089520​: fd fd fd fd fa fa fd fd fd fd fa fa fd fd fd fd
  0x0c0680089530​: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd
  0x0c0680089540​: fd fd fa fa 00 00 00 00 fa fa 00 00 04 fa fa fa
=>0x0c0680089550​: 00 00[05]fa fa fa 00 00 00 fa fa fa 00 00 04 fa
  0x0c0680089560​: fa fa 00 00 00 fa fa fa 00 00 00 fa fa fa 00 00
  0x0c0680089570​: 00 00 fa fa 00 00 00 00 fa fa 00 00 00 fa fa fa
  0x0c0680089580​: 00 00 04 fa fa fa 00 00 06 fa fa fa 00 00 00 00
  0x0c0680089590​: fa fa fd fd fd fa fa fa fd fd fd fa fa fa 00 00
  0x0c06800895a0​: 00 00 fa fa fd fd fd fd fa fa fd fd fd fd fa fa
Shadow byte legend (one shadow byte represents 8 application bytes)​:
  Addressable​: 00
  Partially addressable​: 01 02 03 04 05 06 07
  Heap left redzone​: fa
  Heap right redzone​: fb
  Freed heap region​: fd
  Stack left redzone​: f1
  Stack mid redzone​: f2
  Stack right redzone​: f3
  Stack partial redzone​: f4
  Stack after return​: f5
  Stack use after scope​: f8
  Global redzone​: f9
  Global init order​: f6
  Poisoned by user​: f7
  Contiguous container OOB​:fc
  ASan internal​: fe
==2192==ABORTING

A mininal test case seems to be something like this​:

$ ./perl6-m -Ilib -e 'Buf.new(61,61,61,61,61,61,61,194).decode("utf8-c8")'

==4442==ERROR​: AddressSanitizer​: heap-buffer-overflow on address 0x60200017b7d8 at pc 0x7f6226675974 bp 0x7fffa5689a60 sp 0x7fffa5689a58
READ of size 1 at 0x60200017b7d8 thread T0
  #​0 0x7f6226675973 in MVM_string_utf8_c8_decode src/strings/utf8_c8.c​:280
  #​1 0x7f6226682a2e in MVM_string_decode src/strings/ops.c​:823
  #​2 0x7f62266834b9 in MVM_string_decode_from_buf src/strings/ops.c​:927
  #​3 0x7f62263eb908 in MVM_interp_run src/core/interp.c​:1653
  #​4 0x7f62266d7704 in MVM_vm_run_file src/moar.c​:304
  #​5 0x401a4f in main src/main.c​:191
  #​6 0x7f6225c18d5c in __libc_start_main (/lib64/libc.so.6+0x1ed5c)
  #​7 0x401058 (/home/nicholas/Sandpit/moar-san/bin/moar+0x401058)

0x60200017b7d8 is located 0 bytes to the right of 8-byte region [0x60200017b7d0,0x60200017b7d8)
allocated by thread T0 here​:
  #​0 0x7f6226ffa62f in __interceptor_malloc ../../.././libsanitizer/asan/asan_malloc_linux.cc​:72
  #​1 0x7f622652cd39 in MVM_malloc src/core/alloc.h​:2
  #​2 0x7f622652fb9e in set_size_internal src/6model/reprs/MVMArray.c​:334
  #​3 0x7f62265309d6 in set_elems src/6model/reprs/MVMArray.c​:431
  #​4 0x7f62263f9a6d in MVM_interp_run src/core/interp.c​:2263
  #​5 0x7f62266d7704 in MVM_vm_run_file src/moar.c​:304
  #​6 0x401a4f in main src/main.c​:191
  #​7 0x7f6225c18d5c in __libc_start_main (/lib64/libc.so.6+0x1ed5c)

SUMMARY​: AddressSanitizer​: heap-buffer-overflow src/strings/utf8_c8.c​:280 MVM_string_utf8_c8_decode
Shadow bytes around the buggy address​:
  0x0c04800276a0​: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c04800276b0​: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c04800276c0​: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c04800276d0​: fa fa fa fa fa fa fa fa fa fa 02 fa fa fa 00 00
  0x0c04800276e0​: fa fa 00 04 fa fa 00 00 fa fa 04 fa fa fa 00 00
=>0x0c04800276f0​: fa fa 00 fa fa fa 00 00 fa fa 00[fa]fa fa 05 fa
  0x0c0480027700​: fa fa 00 02 fa fa 02 fa fa fa 00 00 fa fa 00 00
  0x0c0480027710​: fa fa 04 fa fa fa 00 00 fa fa fd fa fa fa 00 02
  0x0c0480027720​: fa fa 03 fa fa fa 06 fa fa fa 03 fa fa fa 06 fa
  0x0c0480027730​: fa fa 01 fa fa fa 01 fa fa fa 00 00 fa fa 00 03
  0x0c0480027740​: fa fa 00 fa fa fa 00 00 fa fa 00 00 fa fa 00 00
Shadow byte legend (one shadow byte represents 8 application bytes)​:
  Addressable​: 00
  Partially addressable​: 01 02 03 04 05 06 07
  Heap left redzone​: fa
  Heap right redzone​: fb
  Freed heap region​: fd
  Stack left redzone​: f1
  Stack mid redzone​: f2
  Stack right redzone​: f3
  Stack partial redzone​: f4
  Stack after return​: f5
  Stack use after scope​: f8
  Global redzone​: f9
  Global init order​: f6
  Poisoned by user​: f7
  Contiguous container OOB​:fc
  ASan internal​: fe
==4442==ABORTING

Note that removing any more "61"s makes it happy, as does using 193 instead of 194. I have no idea why that threshold is 0xC2, ie the lowest valued legitimate octet in a valid UTF-8 sequence. Curiously also 244, the highest valid UTF-8 start octet triggers ASAN, whereas 245 (and 255) do not.

Nicholas Clark

@p6rt
Copy link
Author

p6rt commented Jan 2, 2017

From @jnthn

On Thu, 19 May 2016 05​:32​:10 -0700, hmbrand wrote​:

Two issues

I use this to test​:
--8<---
#!perl6

use v6;

use Test;

for ^32 {

say "";
say $_;

my @​data = ^20 .map({ 256.rand.Int }).list;
@​data.unshift​: 61;

#dd @​data;

my $b = Buf.new(@​data);

ok((my Str $u = $b.decode("utf8-c8")), "decode");

my @​back = $u.encode("utf8-c8").list;

#dd @​back;

my $n = Buf.new(@​back);

is-deeply($n, $b, "Data");
}
-->8---

First issue is that the buffer returns something longer than the
original (a \0 is added)​:

# expected​:
Buf.new(61,29,61,200,30,99,107,150,71,11,253,134,110,27,35,227,88,140,180,158,209)
# got​:
Buf.new(61,29,61,200,30,99,107,150,71,11,253,134,110,27,35,227,88,140,180,158,209,0)

# expected​:
Buf.new(61,2,71,91,58,252,6,247,88,58,121,32,124,129,191,126,36,222,185,109,213)
# got​:
Buf.new(61,2,71,91,58,252,6,247,88,58,121,32,124,129,191,126,36,222,185,109,213,0)

The second issue is more fun, pairs are swapped​:

# expected​:
Buf.new(61,147,135,8,82,78,208,66,205,164,204,162,140,97,175,37,108,194,27,192,119)
# got​:
Buf.new(61,147,135,8,82,78,208,66,204,162,205,164,140,97,175,37,108,194,27,192,119)

205,164,204,162 => 204,162,205,164

I've done a total re-write of the UTF8-C8 decoder. The original approach turned out to be a lot too fragile, so I took a different approach. Along the way, I got it to properly handle the cases where normalization would re-order, fixing all of the examples above. It also fixes the various things that ASAN/Valgrind tripped over in the failing tests, and the tests - plus a number of new ones I've added - now come out clean under both.

So far as I'm aware, this deals with the outstanding issues in utf8-c8. The tests are unfudged, though I moved them to S32-str/utf8-c8.t to make fudging of the added test cases easier (we fudge the whole file for JVM at present).

/jnthn

@p6rt
Copy link
Author

p6rt commented Jan 2, 2017

@jnthn - Status changed from 'open' to 'resolved'

@p6rt p6rt closed this as completed Jan 2, 2017
@p6rt p6rt added the uni label Jan 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant