Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read from file containing 1-char of malformed UTF returns empty string #6278

Open
p6rt opened this issue May 27, 2017 · 4 comments
Open

read from file containing 1-char of malformed UTF returns empty string #6278

p6rt opened this issue May 27, 2017 · 4 comments
Labels

Comments

@p6rt
Copy link

p6rt commented May 27, 2017

Migrated from rt.perl.org#131379 (status was 'new')

Searchable as RT131379$

@p6rt
Copy link
Author

p6rt commented May 27, 2017

From @zoffixznet

This seem to have existed for a while and isn't due to encoding refactor.

01​:24 c​: 2017.04 with '/tmp/foo999'.IO { .spurt​: Buf.new​: 200; dd .open.slurp }
01​:24 committable6 Zoffix, ¦2017.04​: «""»
01​:24 Zoffix m​: with '/tmp/foo999'.IO { .spurt​: Buf.new​: 200; dd .open.slurp }
01​:24 camelia rakudo-moar 094e77​: OUTPUT​: «""␤»

Note that for file with two of such chars, the behaviour is what I'd expect it to be​: throwing malformed
UTF8 error, same as you get when decoding Buf.new(200)​:

01​:27 m​: with '/tmp/foo999'.IO { .spurt​: Buf.new​: 200, 200; dd .open.slurp }
01​:27 camelia rakudo-moar 094e77​: OUTPUT​: «Malformed UTF-8␤ in block <unit> at <tmp> line 1␤␤»
01​:25 m​: Buf.new(200).decode
01​:25 camelia rakudo-moar 094e77​: OUTPUT​: «Malformed termination of UTF-8 string␤ in block <unit> at <tmp> line 1␤␤»

@usev6
Copy link

usev6 commented Sep 27, 2020

I've stumbled upon this while working on the streaming decoder for the JVM backend. If I'm not mistaken, the problem can be shown with plain nqp (in the style that tests in t/nqp/116-streaming-decoder.t are written):

my class VMDecoder is repr("Decoder") {};
my $buf_type := nqp::newtype(nqp::knowhow(), "VMArray");
nqp::composetype($buf_type, nqp::hash("array", nqp::hash("type", uint8)));
my $dec := nqp::create(VMDecoder);
nqp::decoderconfigure($dec, "utf8", nqp::hash());
my $testbuf1 := nqp::create($buf_type);
nqp::bindpos_i($testbuf1, 0, 0xD0);
nqp::decoderaddbytes($dec, $testbuf1);

is(nqp::decodertakeavailablechars($dec), '');
is(nqp::decodertakeallchars($dec), '');

Both tests are passing. As far as I understand, it's totally fine for nqp::decodertakeavailablechars to accept the byte as a possibly incomplete code point.

But I'd expect nqp::decodertakeallchars to behave differently. (Both from the documentation in https://github.com/Raku/nqp/blob/master/docs/ops.markdown#decodertakeallchars, and from the point of view that it's unclear why we need a second op if it behaves like the first one.)

usev6 added a commit to usev6/nqp that referenced this issue Sep 28, 2020
@usev6
Copy link

usev6 commented Sep 28, 2020

OOC, I tried to pass eof to MVM_string_utf8_decodestream and add an extra check at the end of the function and with that the test passes on MoarVM. But be aware, that I don't understand all details that happen there.

diff --git a/src/strings/decode_stream.c b/src/strings/decode_stream.c
index 5deb34c8b..f02f8e9cc 100644
--- a/src/strings/decode_stream.c
+++ b/src/strings/decode_stream.c
@@ -118,7 +118,7 @@ static MVMuint32 run_decode(MVMThreadContext *tc, MVMDecodeStream *ds, const MVM
     MVMuint32 reached_stopper;
     switch (ds->encoding) {
     case MVM_encoding_type_utf8:
-        reached_stopper = MVM_string_utf8_decodestream(tc, ds, stopper_chars, sep_spec);
+        reached_stopper = MVM_string_utf8_decodestream(tc, ds, stopper_chars, sep_spec, eof);
         break;
     case MVM_encoding_type_ascii:
         reached_stopper = MVM_string_ascii_decodestream(tc, ds, stopper_chars, sep_spec);
diff --git a/src/strings/utf8.c b/src/strings/utf8.c
index f86d01523..ffe1169b3 100644
--- a/src/strings/utf8.c
+++ b/src/strings/utf8.c
@@ -343,7 +343,8 @@ static void encoding_error(MVMThreadContext *tc, MVMuint8 *bytes, int error_pos)
  * buffers, or until a stopper is reached. */
 MVMuint32 MVM_string_utf8_decodestream(MVMThreadContext *tc, MVMDecodeStream *ds,
                                   const MVMuint32 *stopper_chars,
-                                  MVMDecodeStreamSeparators *seps) {
+                                  MVMDecodeStreamSeparators *seps,
+                                  MVMint32 eof) {
     MVMuint32 count = 0, total = 0;
     MVMint32 state = 0;
     MVMCodepoint codepoint = 0;
@@ -535,6 +536,10 @@ MVMuint32 MVM_string_utf8_decodestream(MVMThreadContext *tc, MVMDecodeStream *ds
         cur_bytes = cur_bytes->next;
     }
   done:
+    if (eof && state != UTF8_ACCEPT) {
+        MVM_free(buffer);
+        MVM_exception_throw_adhoc(tc, "Malformed termination of UTF-8 string");
+    }
 
     /* Attach what we successfully parsed as a result buffer, and trim away
      * what we chewed through. */
diff --git a/src/strings/utf8.h b/src/strings/utf8.h
index 6d8f45d37..48b8b1c46 100644
--- a/src/strings/utf8.h
+++ b/src/strings/utf8.h
@@ -1,6 +1,6 @@
 MVM_PUBLIC MVMString * MVM_string_utf8_decode(MVMThreadContext *tc, const MVMObject *result_type, const char *utf8, size_t bytes);
 MVM_PUBLIC MVMString * MVM_string_utf8_decode_strip_bom(MVMThreadContext *tc, const MVMObject *result_type, const char *utf8, size_t bytes);
-MVM_PUBLIC MVMuint32 MVM_string_utf8_decodestream(MVMThreadContext *tc, MVMDecodeStream *ds, const MVMuint32 *stopper_chars, MVMDecodeStreamSeparators *seps);
+MVM_PUBLIC MVMuint32 MVM_string_utf8_decodestream(MVMThreadContext *tc, MVMDecodeStream *ds, const MVMuint32 *stopper_chars, MVMDecodeStreamSeparators *seps, MVMint32 eof);
 MVM_PUBLIC char * MVM_string_utf8_encode_substr(MVMThreadContext *tc,
         MVMString *str, MVMuint64 *output_size, MVMint64 start, MVMint64 length, MVMString *replacement, MVMint32 translate_newlines);
 MVM_PUBLIC char * MVM_string_utf8_encode(MVMThreadContext *tc, MVMString *str, MVMuint64 *output_size, MVMint32 translate_newlines);

@usev6
Copy link

usev6 commented Oct 3, 2020

For the record: With the above patch, NQP's and Rakudo's tests are passing. Spectest look ok, too (there were these failures, which look like flappers: running the test files in isolation didn't show an error most of the time):

t/spec/S16-io/eof.t                                             (Wstat: 256 Tests: 5 Failed: 1)
  Failed test:  3
  Non-zero exit status: 1
t/spec/S16-io/watch.t                                           (Wstat: 256 Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
t/spec/S17-promise/basic.t                                      (Wstat: 512 Tests: 41 Failed: 2)
  Failed tests:  33, 35
  Non-zero exit status: 2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants