read from file containing 1-char of malformed UTF returns empty string #6278

p6rt · 2017-05-27T01:32:07Z

Migrated from rt.perl.org#131379 (status was 'new')

Searchable as RT131379$

p6rt · 2017-05-27T01:32:07Z

From @zoffixznet

This seem to have existed for a while and isn't due to encoding refactor.

01:24 c: 2017.04 with '/tmp/foo999'.IO { .spurt: Buf.new: 200; dd .open.slurp }
01:24 committable6 Zoffix, ¦2017.04: «""»
01:24 Zoffix m: with '/tmp/foo999'.IO { .spurt: Buf.new: 200; dd .open.slurp }
01:24 camelia rakudo-moar 094e77: OUTPUT: «""␤»

Note that for file with two of such chars, the behaviour is what I'd expect it to be: throwing malformed
UTF8 error, same as you get when decoding Buf.new(200):

01:27 m: with '/tmp/foo999'.IO { .spurt: Buf.new: 200, 200; dd .open.slurp }
01:27 camelia rakudo-moar 094e77: OUTPUT: «Malformed UTF-8␤ in block <unit> at <tmp> line 1␤␤»
01:25 m: Buf.new(200).decode
01:25 camelia rakudo-moar 094e77: OUTPUT: «Malformed termination of UTF-8 string␤ in block <unit> at <tmp> line 1␤␤»

usev6 · 2020-09-27T21:22:11Z

I've stumbled upon this while working on the streaming decoder for the JVM backend. If I'm not mistaken, the problem can be shown with plain nqp (in the style that tests in t/nqp/116-streaming-decoder.t are written):

my class VMDecoder is repr("Decoder") {};
my $buf_type := nqp::newtype(nqp::knowhow(), "VMArray");
nqp::composetype($buf_type, nqp::hash("array", nqp::hash("type", uint8)));
my $dec := nqp::create(VMDecoder);
nqp::decoderconfigure($dec, "utf8", nqp::hash());
my $testbuf1 := nqp::create($buf_type);
nqp::bindpos_i($testbuf1, 0, 0xD0);
nqp::decoderaddbytes($dec, $testbuf1);

is(nqp::decodertakeavailablechars($dec), '');
is(nqp::decodertakeallchars($dec), '');

Both tests are passing. As far as I understand, it's totally fine for nqp::decodertakeavailablechars to accept the byte as a possibly incomplete code point.

But I'd expect nqp::decodertakeallchars to behave differently. (Both from the documentation in https://github.com/Raku/nqp/blob/master/docs/ops.markdown#decodertakeallchars, and from the point of view that it's unclear why we need a second op if it behaves like the first one.)

This avoids the problem reported in Raku/old-issue-tracker#6278.

usev6 · 2020-09-28T06:35:33Z

OOC, I tried to pass eof to MVM_string_utf8_decodestream and add an extra check at the end of the function and with that the test passes on MoarVM. But be aware, that I don't understand all details that happen there.

diff --git a/src/strings/decode_stream.c b/src/strings/decode_stream.c
index 5deb34c8b..f02f8e9cc 100644
--- a/src/strings/decode_stream.c
+++ b/src/strings/decode_stream.c
@@ -118,7 +118,7 @@ static MVMuint32 run_decode(MVMThreadContext *tc, MVMDecodeStream *ds, const MVM
     MVMuint32 reached_stopper;
     switch (ds->encoding) {
     case MVM_encoding_type_utf8:
-        reached_stopper = MVM_string_utf8_decodestream(tc, ds, stopper_chars, sep_spec);
+        reached_stopper = MVM_string_utf8_decodestream(tc, ds, stopper_chars, sep_spec, eof);
         break;
     case MVM_encoding_type_ascii:
         reached_stopper = MVM_string_ascii_decodestream(tc, ds, stopper_chars, sep_spec);
diff --git a/src/strings/utf8.c b/src/strings/utf8.c
index f86d01523..ffe1169b3 100644
--- a/src/strings/utf8.c
+++ b/src/strings/utf8.c
@@ -343,7 +343,8 @@ static void encoding_error(MVMThreadContext *tc, MVMuint8 *bytes, int error_pos)
  * buffers, or until a stopper is reached. */
 MVMuint32 MVM_string_utf8_decodestream(MVMThreadContext *tc, MVMDecodeStream *ds,
                                   const MVMuint32 *stopper_chars,
-                                  MVMDecodeStreamSeparators *seps) {
+                                  MVMDecodeStreamSeparators *seps,
+                                  MVMint32 eof) {
     MVMuint32 count = 0, total = 0;
     MVMint32 state = 0;
     MVMCodepoint codepoint = 0;
@@ -535,6 +536,10 @@ MVMuint32 MVM_string_utf8_decodestream(MVMThreadContext *tc, MVMDecodeStream *ds
         cur_bytes = cur_bytes->next;
     }
   done:
+    if (eof && state != UTF8_ACCEPT) {
+        MVM_free(buffer);
+        MVM_exception_throw_adhoc(tc, "Malformed termination of UTF-8 string");
+    }
 
     /* Attach what we successfully parsed as a result buffer, and trim away
      * what we chewed through. */
diff --git a/src/strings/utf8.h b/src/strings/utf8.h
index 6d8f45d37..48b8b1c46 100644
--- a/src/strings/utf8.h
+++ b/src/strings/utf8.h
@@ -1,6 +1,6 @@
 MVM_PUBLIC MVMString * MVM_string_utf8_decode(MVMThreadContext *tc, const MVMObject *result_type, const char *utf8, size_t bytes);
 MVM_PUBLIC MVMString * MVM_string_utf8_decode_strip_bom(MVMThreadContext *tc, const MVMObject *result_type, const char *utf8, size_t bytes);
-MVM_PUBLIC MVMuint32 MVM_string_utf8_decodestream(MVMThreadContext *tc, MVMDecodeStream *ds, const MVMuint32 *stopper_chars, MVMDecodeStreamSeparators *seps);
+MVM_PUBLIC MVMuint32 MVM_string_utf8_decodestream(MVMThreadContext *tc, MVMDecodeStream *ds, const MVMuint32 *stopper_chars, MVMDecodeStreamSeparators *seps, MVMint32 eof);
 MVM_PUBLIC char * MVM_string_utf8_encode_substr(MVMThreadContext *tc,
         MVMString *str, MVMuint64 *output_size, MVMint64 start, MVMint64 length, MVMString *replacement, MVMint32 translate_newlines);
 MVM_PUBLIC char * MVM_string_utf8_encode(MVMThreadContext *tc, MVMString *str, MVMuint64 *output_size, MVMint32 translate_newlines);

usev6 · 2020-10-03T09:04:33Z

For the record: With the above patch, NQP's and Rakudo's tests are passing. Spectest look ok, too (there were these failures, which look like flappers: running the test files in isolation didn't show an error most of the time):

t/spec/S16-io/eof.t                                             (Wstat: 256 Tests: 5 Failed: 1)
  Failed test:  3
  Non-zero exit status: 1
t/spec/S16-io/watch.t                                           (Wstat: 256 Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
t/spec/S17-promise/basic.t                                      (Wstat: 512 Tests: 41 Failed: 2)
  Failed tests:  33, 35
  Non-zero exit status: 2

p6rt added the IO label Jan 5, 2020

usev6 mentioned this issue Sep 27, 2020

Please do not merge: Add more tests for streaming decoder and fixes for JVM Raku/nqp#673

Closed

usev6 added a commit to usev6/nqp that referenced this issue Sep 28, 2020

[JVM] Throw for incomplete char at end of input

a210b3c

This avoids the problem reported in Raku/old-issue-tracker#6278.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

read from file containing 1-char of malformed UTF returns empty string #6278

read from file containing 1-char of malformed UTF returns empty string #6278

p6rt commented May 27, 2017

p6rt commented May 27, 2017

usev6 commented Sep 27, 2020

usev6 commented Sep 28, 2020

usev6 commented Oct 3, 2020

read from file containing 1-char of malformed UTF returns empty string #6278

read from file containing 1-char of malformed UTF returns empty string #6278

Comments

p6rt commented May 27, 2017

p6rt commented May 27, 2017

From @zoffixznet

usev6 commented Sep 27, 2020

usev6 commented Sep 28, 2020

usev6 commented Oct 3, 2020