PerlIO::scalar does not handle UTF-8 #11938

Closed
p5pRT opened this issue Feb 4, 2012 · 76 comments

Comments

p5pRT commented Feb 4, 2012

Migrated from rt.perl.org#109828 (status was 'resolved')

Searchable as RT109828$

p5pRT commented Feb 4, 2012

From @dgl

If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.

I think this makes sense for output, although there may be other ramifications.

Here's a todo test​:

diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t
index a02107b..59b65ad 100644
--- a/ext/PerlIO-scalar/t/scalar.t
+++ b/ext/PerlIO-scalar/t/scalar.t
@@ -16,7 +16,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); # Not 0, 1, 2 everywhere.
 
 $| = 1;
 
-use Test::More tests => 79;
+use Test::More tests => 80;
 
 my $fh;
 my $var = "aaa\n";
@@ -360,3 +360,11 @@ SKIP: {
     ok has_trailing_nul $memfile,
         'write appends null when growing string after seek past end';
 }
+
+# [perl #xxxx]
+{
+  local $TODO = "UTF-8 support";
+  my $string = "\x{ffe}";
+  open my $fh, "> :encoding(UTF-8)", \(my $out);
+  ok $string eq $out;
+}

p5pRT commented Feb 4, 2012

From @Leont

On Sat, Feb 4, 2012 at 6​:10 PM, David Leadbeater
<perlbug-followup@​perl.org> wrote​:

> If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.
>
> I think this makes sense for output, although there may be other ramifications.
>
> Here's a todo test:
>
> diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t
> index a02107b..59b65ad 100644
> --- a/ext/PerlIO-scalar/t/scalar.t
> +++ b/ext/PerlIO-scalar/t/scalar.t
> @@ -16,7 +16,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); # Not 0, 1, 2 everywhere.
>
>  $| = 1;
>
> -use Test::More tests => 79;
> +use Test::More tests => 80;
>
>  my $fh;
>  my $var = "aaa\n";
> @@ -360,3 +360,11 @@ SKIP: {
>      ok has_trailing_nul $memfile,
>          'write appends null when growing string after seek past end';
>  }
> +
> +# [perl #xxxx]
> +{
> +  local $TODO = "UTF-8 support";
> +  my $string = "\x{ffe}";
> +  open my $fh, "> :encoding(UTF-8)", \(my $out);
> +  ok $string eq $out;
> +}

PerlIO does bytes, always. Its utf8 support is literally a one-bit
flag that promises the bytes will be validly encoded utf8. There's no
easy way for lower layers to know what the upper layers do with regard
to utf8. Nor am I sure that really should trickle down.

The other direction would seem to be more important. When opening a
utf8 scalar, it should automatically be a utf8 handle. Anything else
is plain buggy and potentially dangerous.

Leon

p5pRT commented Feb 4, 2012

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Feb 4, 2012

From tchrist@perl.com

David Leadbeater (via RT) <perlbug-followup@​perl.org> wrote
  on Sat, 04 Feb 2012 09​:10​:41 PST​:

> +{
> + local $TODO = "UTF-8 support";
> + my $string = "\x{ffe}";

Why don't you use an assigned Unicode code point there, please?

> + open my $fh, "> :encoding(UTF-8)", \(my $out);

Why are you involving the Encode module? Why isn't that simply:

  open(my $fh, "> :utf8", \my $out) || die $!;

> + ok $string eq $out;
> +}

I absolutely gave up on this. It was too unreliable. Even if you are
careful about decoding your string, now and then (about 1 in 10) it gets
double-encoded no matter what you do. It is not even deterministic in any
fashion I can see to make work.

--tom

p5pRT commented Feb 4, 2012

From @ikegami

On Sat, Feb 4, 2012 at 12:10 PM, David Leadbeater <perlbug-followup@perl.org> wrote:

> +# [perl #xxxx]
> +{
> + local $TODO = "UTF-8 support";
> + my $string = "\x{ffe}";
> + open my $fh, "> :encoding(UTF-8)", \(my $out);
> + ok $string eq $out;
> +}

Files can only contain bytes. This makes no sense to me.

- Eric

p5pRT commented Feb 4, 2012

From @ikegami

On Sat, Feb 4, 2012 at 5​:49 PM, Eric Brine <ikegami@​adaelis.com> wrote​:

> On Sat, Feb 4, 2012 at 12:10 PM, David Leadbeater <perlbug-followup@perl.org> wrote:
>
> > +# [perl #xxxx]
> > +{
> > + local $TODO = "UTF-8 support";
> > + my $string = "\x{ffe}";
> > + open my $fh, "> :encoding(UTF-8)", \(my $out);
> > + ok $string eq $out;
> > +}
>
> Files can only contain bytes. This makes no sense to me.

... especially since you specifically ask to encode whatever you print.
encode "UTF-8" cannot possibly produce something that contains 0xFFE.

And your patch is buggy​: You forgot to actually print to $fh.
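
Taken together, the review comments so far (use an assigned code point, actually print to the handle, and expect encoded bytes in the buffer) suggest a test along the following lines. This is only a sketch of what such a check could look like, not the test that was committed; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $string = "\x{20ac}";    # EURO SIGN, an assigned code point

open my $fh, '>:encoding(UTF-8)', \my $out or die "open failed: $!";
print {$fh} $string;        # the original patch never wrote to the handle
close $fh or die "close failed: $!";

# The in-memory "file" holds encoded bytes, as a disk file would.
my $byte_count = length $out;

# Decoding those bytes recovers the original character string.
my $round_trip = decode('UTF-8', $out);

printf "%d bytes in buffer, round trip %s\n",
    $byte_count, ($round_trip eq $string ? 'ok' : 'FAILED');
```

Comparing `$string eq $out` directly, as the patch did, can only succeed if the buffer is treated as characters rather than bytes, which is the point under dispute.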

p5pRT commented Feb 5, 2012

From @xdg

On Sat, Feb 4, 2012 at 12​:10 PM, David Leadbeater
<perlbug-followup@​perl.org> wrote​:

> If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.

I think that one should expect PerlIO​::scalar to provide a black box
-- it's an in-memory substitution for bytes on disk with no associated
encoding, just like a file on disk has no associated encoding.

If the referenced string already has the utf8 flag set, I think it's
sufficient to warn rather than try to guess the correct behavior.

David

p5pRT commented Feb 5, 2012

From @Tux

On Sat, 4 Feb 2012 18​:55​:27 +0100, Leon Timmermans <fawaka@​gmail.com>
wrote​:

> On Sat, Feb 4, 2012 at 6:10 PM, David Leadbeater
> <perlbug-followup@perl.org> wrote:
>
> > If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.
> >
> > I think this makes sense for output, although there may be other ramifications.
> >
> > Here's a todo test:
> >
> > diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t
> > index a02107b..59b65ad 100644
> > --- a/ext/PerlIO-scalar/t/scalar.t
> > +++ b/ext/PerlIO-scalar/t/scalar.t
> > @@ -16,7 +16,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); # Not 0, 1, 2 everywhere.
> >
> >  $| = 1;
> >
> > -use Test::More tests => 79;
> > +use Test::More tests => 80;
> >
> >  my $fh;
> >  my $var = "aaa\n";
> > @@ -360,3 +360,11 @@ SKIP: {
> >      ok has_trailing_nul $memfile,
> >          'write appends null when growing string after seek past end';
> >  }
> > +
> > +# [perl #xxxx]
> > +{
> > +  local $TODO = "UTF-8 support";
> > +  my $string = "\x{ffe}";
> > +  open my $fh, "> :encoding(UTF-8)", \(my $out);
> > +  ok $string eq $out;
> > +}
>
> PerlIO does bytes, always. Its utf8 support is literally a one-bit
> flag that promises the bytes will be validly encoded utf8. There's no
> easy way for lower layers to know what the upper layers do with regard
> to utf8. Nor am I sure that really should trickle down.
>
> The other direction would seem to be more important. When opening a
> utf8 scalar, it should automatically be a utf8 handle. Anything else
> is plain buggy and potentially dangerous.

Including pragmas?

use open OUT => "encoding(utf16)";
open my $fh, ">", \my $x;
print { $fh } "The \x{20ac} is \x{a71c} again}\n";
close $fh;

--
H.Merijn Brand http​://tux.nl Perl Monger http​://amsterdam.pm.org/
using perl5.00307 .. 5.14 porting perl5 on HP-UX, AIX, and openSUSE
http​://mirrors.develooper.com/hpux/ http​://www.test-smoke.org/
http​://qa.perl.org http​://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented Feb 6, 2012

From @nwc10

On Sat, Feb 04, 2012 at 08​:12​:30PM -0500, David Golden wrote​:

> On Sat, Feb 4, 2012 at 12:10 PM, David Leadbeater
> <perlbug-followup@perl.org> wrote:
>
> > If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.
>
> I think that one should expect PerlIO::scalar to provide a black box
> -- it's an in-memory substitution for bytes on disk with no associated
> encoding, just like a file on disk has no associated encoding.
>
> If the referenced string already has the utf8 flag set, I think it's
> sufficient to warn rather than try to guess the correct behavior.

Whoa. I don't think you mean "has the utf8 flag set". That's how 5.6.0
would have exposed it. SvUTF8() shouldn't be visible as a proxy for
"characters vs bytes" (yes, I know there are still holes in that).

I *think* it needs to be strictly bytes-only (just like any real file handle)
and refuse to open an existing string that doesn't meet that constraint.
(With the inevitable ambiguity that if you only shove characters in the
range 0-255 into your string, you're not going to realise that your code
is buggy.)

Nicholas Clark

p5pRT commented Feb 6, 2012

From @nwc10

On Sat, Feb 04, 2012 at 06​:55​:27PM +0100, Leon Timmermans wrote​:

> On Sat, Feb 4, 2012 at 6:10 PM, David Leadbeater
> <perlbug-followup@perl.org> wrote:
>
> > If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.
> >
> > I think this makes sense for output, although there may be other ramifications.
>
> PerlIO does bytes, always. Its utf8 support is literally a one-bit
> flag that promises the bytes will be validly encoded utf8. There's no
> easy way for lower layers to know what the upper layers do with regard
> to utf8. Nor am I sure that really should trickle down.
>
> The other direction would seem to be more important. When opening a
> utf8 scalar, it should automatically be a utf8 handle. Anything else
> is plain buggy and potentially dangerous.

No, as I replied elsewhere, I think it should refuse to open any scalar
that isn't bytes.

Or, at least, the user's code needs to be different to say "I want to open a
byte buffer as if it's a file handle" and "I'm expecting characters here".
That way allows symmetry between opening for reading and opening for writing.

Having open for reading have some sort of "did they mean characters or bytes?
I'll guess for them" ends up with the same mess that unpack is in, whereby
it's a runtime decision based *implicitly* on the *parameters* as to whether
it's doing a bytes => characters conversion or a characters => characters
mapping. Sure, it's not as *bad* as unpack, which can attempt to do both
in the same statement, but trying to have open "DWIM" is in the same
ball-park of design misfeature.

Nicholas Clark

p5pRT commented Feb 6, 2012

From @Leont

On Mon, Feb 6, 2012 at 12​:05 PM, Nicholas Clark <nick@​ccl4.org> wrote​:

> No, as I replied elsewhere, I think it should refuse to open any scalar
> that isn't bytes.
>
> Or, at least, the user's code needs to be different to say "I want to open a
> byte buffer as if it's a file handle" and "I'm expecting characters here".
> That way allows symmetry between opening for reading and opening for writing.
>
> Having open for reading have some sort of "did they mean characters or bytes?
> I'll guess for them" ends up with the same mess that unpack is in, whereby
> it's a runtime decision based *implicitly* on the *parameters* as to whether
> it's doing a bytes => characters conversion or a characters => characters
> mapping. Sure, it's not as *bad* as unpack, which can attempt to do both
> in the same statement, but trying to have open "DWIM" is in the same
> ball-park of design misfeature.

Yeah, that is a good point. How about making things explicit? E.g.
«open my $fh, '+<:scalar(utf8)', \my $scalar». I suspect the current
PerlIO/PerlIO::scalar can't easily support that, though.

Leon

p5pRT commented Feb 6, 2012

From @xdg

On Mon, Feb 6, 2012 at 9​:17 AM, Leon Timmermans <fawaka@​gmail.com> wrote​:

> Yeah, that is a good point, how about making things explicit? E.G
> «open my $fh, '+<:scalar(utf8)', \my $scalar». I suspect the current
> PerlIO/PerlIO::scalar can't easily support that though.

Isn't that just C<open my $fh, "+<:utf8", \my $scalar>?

If you *know* that you have UTF-8 characters in a string, it's not
different than knowing you have UTF-8 characters in a disk file. The
*user* needs to be clear what they expect the bytes to be.

Or is the question about what Perl should do about returning bytes
from a string that coincidentally happens to be a character string?
I.e. how should Perl mimic an on-disk file using its internal string
data structure?

Assume that Perl's internal character encoding is a black box. Maybe
it's UTF-8, maybe not (maybe it changes in the future). Whatever.
It's an internal implementation detail and nothing external should
rely on it.

Then when something wants to use that string as a source of bytes,
should Perl (a) just dump out whatever bytes it uses internally for
its implementation? Or (b) should it convert the internal
representation to some standard representation? Or (c) should it blow
up?

I don't like (a) or (c). (b) is tempting. (Coincidentally, it's
easy, since the internal encoding is utf8.) My naive inclination is
to amend the documentation to clarify that the bytes returned are
either raw bytes or utf8 encoded if the string already contains
characters. And then I'd *still* leave it up to the user to know
what's in the "file" (i.e. string) and set the correct encoding layer
on it, just as if they were using a disk file.

-- David

p5pRT commented Feb 6, 2012

From @nwc10

On Mon, Feb 06, 2012 at 10​:18​:39AM -0500, David Golden wrote​:

> Or is the question about what Perl should do about returning bytes
> from a string that coincidentally happens to be a character string?
> I.e. how should Perl mimic an on-disk file using its internal string
> data structure?

That was what I thought the question was.

> Assume that Perl's internal character encoding is a black box. Maybe
> it's UTF-8, maybe not (maybe it changes in the future). Whatever.
> It's an internal implementation detail and nothing external should
> rely on it.

Agree

> Then when something wants to use that string as a source of bytes,
> should Perl (a) just dump out whatever bytes it uses internally for
> its implementation? Or (b) should it convert the internal
> representation to some standard representation? Or (c) should it blow
> up?
>
> I don't like (a) or (c). (b) is tempting. (Coincidentally, it's
> easy, since the internal encoding is utf8.) My naive inclination is
> to amend the documentation to clarify that the bytes returned are
> either raw bytes or utf8 encoded if the string already contains
> characters. And then I'd *still* leave it up to the user to know

How do you know that the string contains characters?

> what's in the "file" (i.e. string) and set the correct encoding layer
> on it, just as if they were using a disk file.

Nicholas Clark

p5pRT commented Feb 6, 2012

From @xdg

On Mon, Feb 6, 2012 at 10​:24 AM, Nicholas Clark <nick@​ccl4.org> wrote​:

> > I don't like (a) or (c). (b) is tempting. (Coincidentally, it's
> > easy, since the internal encoding is utf8.) My naive inclination is
> > to amend the documentation to clarify that the bytes returned are
> > either raw bytes or utf8 encoded if the string already contains
> > characters. And then I'd *still* leave it up to the user to know
>
> How do you know that the string contains characters?

Which "you" do you mean? The user? How does a user know that *any*
file contains characters? Generally, by knowing what was written
there originally or by analyzing the file in some way to guess an
encoding, I'd think. (E.g. read it as bytes and then use
Encode​::Guess?)

None of that is the interpreter's concern.

-- David

p5pRT commented Feb 6, 2012

From @Leont

On Mon, Feb 6, 2012 at 4​:18 PM, David Golden <xdaveg@​gmail.com> wrote​:

> Isn't that just C<open my $fh, "+<:utf8", \my $scalar>?

Not at all. :utf8 means «assume the bytestream is utf8 encoded». It
does not mean «store as characters» (though doing the latter without
the former doesn't make sense).
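
To illustrate the distinction: a decoding layer describes the byte stream coming out of the buffer, not how the scalar itself stores its contents. The sketch below reads the same three-byte buffer twice, once raw and once through `:encoding(UTF-8)`; the buffer contents and variable names are just examples.

```perl
use strict;
use warnings;

# Three bytes: the UTF-8 encoding of U+20AC (EURO SIGN).
my $buffer = "\xE2\x82\xAC";

# Raw read: the handle returns the bytes unchanged.
open my $raw, '<', \$buffer or die "open failed: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Decoding read: the layer turns the byte stream into characters.
open my $dec, '<:encoding(UTF-8)', \$buffer or die "open failed: $!";
my $chars = do { local $/; <$dec> };
close $dec;

printf "raw: %d element(s); decoded: %d element(s), U+%04X\n",
    length($bytes), length($chars), ord($chars);
```

Neither open says anything about a buffer that already holds decoded characters, which is the case the proposed `:scalar(utf8)` syntax was meant to cover.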

> Or is the question about what Perl should do about returning bytes
> from a string that coincidentally happens to be a character string?
> I.e. how should Perl mimic an on-disk file using its internal string
> data structure?

Yeah, pretty much.

> Then when something wants to use that string as a source of bytes,
> should Perl (a) just dump out whatever bytes it uses internally for
> its implementation? Or (b) should it convert the internal
> representation to some standard representation? Or (c) should it blow
> up?

(a) Is what we're doing right now, and I think it's just plain wrong,
and possibly dangerous.
(b) Maybe, but for reasons Nicholas explained, guesswork may be rather suboptimal.
(c) Is sane, unlike (a) and some versions of (b).

Leon

p5pRT commented Feb 6, 2012

From @nwc10

On Mon, Feb 06, 2012 at 11​:02​:36AM -0500, David Golden wrote​:

> On Mon, Feb 6, 2012 at 10:24 AM, Nicholas Clark <nick@ccl4.org> wrote:
>
> > > I don't like (a) or (c). (b) is tempting. (Coincidentally, it's
> > > easy, since the internal encoding is utf8.) My naive inclination is
> > > to amend the documentation to clarify that the bytes returned are
> > > either raw bytes or utf8 encoded if the string already contains
> > > characters. And then I'd *still* leave it up to the user to know
> >
> > How do you know that the string contains characters?
>
> Which "you" do you mean? The user? How does a user know that *any*
> file contains characters? Generally, by knowing what was written
> there originally or by analyzing the file in some way to guess an
> encoding, I'd think. (E.g. read it as bytes and then use
> Encode::Guess?)
>
> None of that is the interpreter's concern.

OK, which means that the interpreter can't *do* option (b) (or (a) for that
matter)​:

On Mon, Feb 06, 2012 at 03​:24​:04PM +0000, Nicholas Clark wrote​:

> On Mon, Feb 06, 2012 at 10:18:39AM -0500, David Golden wrote:
>
> > Or is the question about what Perl should do about returning bytes
> > from a string that coincidentally happens to be a character string?
> > I.e. how should Perl mimic an on-disk file using its internal string
> > data structure?
> >
> > Then when something wants to use that string as a source of bytes,
> > should Perl (a) just dump out whatever bytes it uses internally for
> > its implementation? Or (b) should it convert the internal
> > representation to some standard representation? Or (c) should it blow
> > up?

because you've just stated that the interpreter can't make a determination
as to whether a string contains characters or bytes (for the ambiguous
case of a string containing one or more code points in the range 128-255,
but no code points outside the range 0-255)

Nicholas Clark

p5pRT commented Feb 6, 2012

From @ikegami

On Mon, Feb 6, 2012 at 10​:18 AM, David Golden <xdaveg@​gmail.com> wrote​:

> On Mon, Feb 6, 2012 at 9:17 AM, Leon Timmermans <fawaka@gmail.com> wrote:
>
> > Yeah, that is a good point, how about making things explicit? E.G
> > «open my $fh, '+<:scalar(utf8)', \my $scalar». I suspect the current
> > PerlIO/PerlIO::scalar can't easily support that though.
>
> Isn't that just C<open my $fh, "+<:utf8", \my $scalar>?

No, that means "decode the input on read". The question is about a buffer
that contains decoded data, so what's needed is a layer or some such that
indicates "the underlying data is already decoded". That's his intent for
:scalar(utf8).

p5pRT commented Feb 6, 2012

From @xdg

On Mon, Feb 6, 2012 at 11​:09 AM, Nicholas Clark <nick@​ccl4.org> wrote​:

> because you've just stated that the interpreter can't make a determination
> as to whether a string contains characters or bytes (for the ambiguous
> case of a string containing one or more code points in the range 128-255,
> but no code points outside the range 0-255)

You're right. I was being imprecise. I think if the string contains
no wide characters, it should be "read" by PerlIO​::scalar as bytes. If
the string does contain wide characters, PerlIO​::scalar should either
fail or should encode them in some "standard" way and return them as
bytes in encoded form.

The whole idea is to provide an in-memory abstraction of a *file*,
which means returning a sequence of bytes.
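
As a rough illustration of that policy (taking the "encode" branch rather than the "fail" branch), the bytes exposed by the in-memory file could be derived as below. `file_bytes_for` is a hypothetical helper written in plain Perl for this sketch; it is not part of PerlIO::scalar, and choosing UTF-8 as the "standard" encoding is an assumption.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the bytes an in-memory "file" would expose for $string.
sub file_bytes_for {
    my ($string) = @_;    # a copy, so the caller's scalar is untouched

    # Any code point above 0xFF cannot be a byte, so encode the whole string.
    return encode('UTF-8', $string) if $string =~ /[^\x00-\xFF]/;

    # Otherwise each character is one byte, whatever the internal storage.
    utf8::downgrade($string);
    return $string;
}

my $latin = "\xE9";
utf8::upgrade($latin);                          # UTF8 flag on, still one character
my $latin_bytes = file_bytes_for($latin);       # one byte
my $wide_bytes  = file_bytes_for("\x{20ac}");   # UTF-8 encoded bytes

printf "%d byte(s) and %d byte(s)\n", length($latin_bytes), length($wide_bytes);
```

Note that under this policy the byte form of U+00E9 changes from one byte to two as soon as the string also contains a wide character, which is the kind of data-dependent behaviour Nicholas objected to earlier in the thread.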

David

p5pRT commented Feb 12, 2012

From @cpansprout

On Mon Feb 06 07​:19​:37 2012, xdaveg@​gmail.com wrote​:

> Then when something wants to use that string as a source of bytes,
> should Perl (a) just dump out whatever bytes it uses internally for
> its implementation? Or (b) should it convert the internal
> representation to some standard representation? Or (c) should it blow
> up?

(a) is what Perl currently does, as Leon Timmermans said.

By (b) I presume you mean to treat \xff as \xff regardless of how it is
stored internally, which makes sense.

But what happens if I open a reading handle to a scalar containing
\x{100}? Here we have a choice between (b) and (c).

An in-memory scalar could be considered a byte stream. Or it could just
be considered a string of characters.

The latter does make some sense. If I print \xff to an in-memory file
with no layers applied, I simply get \xff in my scalar. So if I print
\x{100}, it would make sense to get \x{100} in my scalar, no? But if
the scalar is considered byte-sized, I should get \x{100} utf8-encoded,
accompanied by a wide character warning; and reading a scalar with
\x{100} would croak.
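
The byte-sized outcome described above matches what `print` already does with a wide character on any handle that has no encoding layer. The following sketch captures the warning and dumps the resulting buffer; it assumes a perl where the freshly created buffer is written as plain bytes.

```perl
use strict;
use warnings;

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

open my $fh, '>', \my $buffer or die "open failed: $!";
print {$fh} "\x{100}";     # a character that does not fit in one byte
close $fh or die "close failed: $!";

# Show the buffer as hex bytes, followed by any warnings that were raised.
printf "%d byte(s): %s\n", length($buffer),
    join(' ', map sprintf('%02X', ord), split //, $buffer);
print "warning: $_" for @warnings;
```

Under the character-string model, the same print would instead leave a single `\x{100}` in the scalar and raise no warning.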

That it is currently buggy is not being questioned. But which model
should be followed in fixing it is debatable. Would it be reasonable to
implement the byte-sized version for now and upgrade it later?

--

Father Chrysostomos

p5pRT commented Feb 13, 2012

From @xdg

On Sun, Feb 12, 2012 at 5​:02 PM, Father Chrysostomos via RT
<perlbug-followup@​perl.org> wrote​:

> On Mon Feb 06 07:19:37 2012, xdaveg@gmail.com wrote:
>
> > Then when something wants to use that string as a source of bytes,
> > should Perl (a) just dump out whatever bytes it uses internally for
> > its implementation? Or (b) should it convert the internal
> > representation to some standard representation? Or (c) should it blow
> > up?
>
> (a) is what Perl currently does, as Leon Timmermans said.
>
> By (b) I presume you mean to treat \xff as \xff regardless of how it is
> stored internally, which makes sense.

Sort of. What I meant is that (a) is "whatever we do" and (b) is "a
specific encoding". Those are likely to be similar, but one is vague
and mutable and the other specific and fixed. Such a promise would
persist under the usual back-compatibility rules even if we changed
the internal representation in the future for some reason. It could
also mean that we could choose to give UTF-8 and not "utf8" (i.e. lax,
internal encoding) -- and would croak if we can't translate from the
internal to UTF-8.

For example, for a string with wide characters used as an in-memory
file, we could promise to translate from the internal encoding to
UTF-8 when the handle is read. That would make it resemble a disk
file encoded in UTF-8, requiring the "​:encoding(UTF-8)" flag and so
on. Thus some function that is passed a handle to read shouldn't know
or care whether it's an in memory string or an on-disk file -- though
the *programmer* would need to know what encoding they expect to
receive given their particular application.

> An in-memory scalar could be considered a byte stream. Or it could just
> be considered a string of characters.

My bias is strongly that it should be a byte-stream, which is why I'm
only considering how we choose to take a string of (wide) characters
and make it into a byte stream in some standard way​: (a) "whatever"
(b) "a promise" and (c) "boom!"

-- David

p5pRT commented Feb 13, 2012

From @ap

* Nicholas Clark <nick@​ccl4.org> [2012-02-06 12​:00]​:

> On Sat, Feb 04, 2012 at 08:12:30PM -0500, David Golden wrote:
>
> > If the referenced string already has the utf8 flag set, I think it's
> > sufficient to warn rather than try to guess the correct behavior.
>
> Whoa. I don't think you mean "has the utf8 flag set". That's how 5.6.0
> would have exposed it. SvUTF8() shouldn't be visible as a proxy for
> "characters vs bytes" (yes, I know there are still holes in that).

This. Thank you. I was despairing as I read the thread, waiting for
someone to interject with it.

As far as the user is concerned, there is never to be any difference
between a string with UTF8 on vs a string with UTF8 off as long as
$utf8on eq $utf8off.
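
A small sketch of that invariant, using the core `utf8::upgrade` and `utf8::is_utf8` functions to produce and inspect two internal representations of the same one-character string:

```perl
use strict;
use warnings;

my $utf8off = "\xE9";            # one character, stored as one byte
my $utf8on  = $utf8off;
utf8::upgrade($utf8on);          # same character, stored as two bytes internally

my $flags_differ  = utf8::is_utf8($utf8on) && !utf8::is_utf8($utf8off);
my $strings_equal = $utf8on eq $utf8off;

printf "flags differ: %s\n",  $flags_differ  ? 'yes' : 'no';
printf "strings equal: %s\n", $strings_equal ? 'yes' : 'no';
printf "lengths: %d and %d\n", length($utf8on), length($utf8off);
```

Any behaviour that distinguishes these two scalars, including reading them through an in-memory handle, exposes the internal flag to user code.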

> I *think* it needs to be strictly bytes-only (just like any real file
> handle) and refuse to open an existing string that doesn't meet that
> constraint. (With the inevitable ambiguity that if you only shove
> characters in the range 0-255 into your string, you're not going to
> realise that your code is buggy.)

What it should do on input is treat each character as a byte, throwing
an error if there are any characters > 0xFF in the string, i.e. the
moral equivalent of downgrading the input string and croaking if that
fails. That’s it.
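
Expressed in Perl rather than in the layer's C code, the "downgrade or croak" rule could be sketched as follows. `bytes_or_croak` is a hypothetical name and the error text is invented for the example; only `utf8::downgrade` and its FAIL_OK argument are real.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper: return the string as bytes, or die if it cannot be bytes.
sub bytes_or_croak {
    my ($string) = @_;               # a copy; the caller's scalar is untouched
    utf8::downgrade($string, 1)      # FAIL_OK: return false instead of dying
        or croak 'in-memory file contains a character above 0xFF';
    return $string;
}

my $upgraded = "\xE9";
utf8::upgrade($upgraded);                      # UTF8 flag on, all characters <= 0xFF
my $bytes = bytes_or_croak($upgraded);         # succeeds: one byte, 0xE9

my $error = eval { bytes_or_croak("\x{100}"); 1 } ? '' : $@;
print "refused: $error" if $error;
```

The decision depends only on the characters in the string, never on whether the UTF8 flag happens to be set.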

Regards,
--
Aristotle Pagaltzis // <http​://plasmasturm.org/>

p5pRT commented Feb 13, 2012

From @Tux

On Mon, 6 Feb 2012 10​:58​:02 +0000, Nicholas Clark <nick@​ccl4.org> wrote​:

> On Sat, Feb 04, 2012 at 08:12:30PM -0500, David Golden wrote:
>
> > On Sat, Feb 4, 2012 at 12:10 PM, David Leadbeater
> > <perlbug-followup@perl.org> wrote:
> >
> > > If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.
> >
> > I think that one should expect PerlIO::scalar to provide a black box
> > -- it's an in-memory substitution for bytes on disk with no associated
> > encoding, just like a file on disk has no associated encoding.
> >
> > If the referenced string already has the utf8 flag set, I think it's
> > sufficient to warn rather than try to guess the correct behavior.
>
> Whoa. I don't think you mean "has the utf8 flag set". That's how 5.6.0
> would have exposed it. SvUTF8() shouldn't be visible as a proxy for
> "characters vs bytes" (yes, I know there are still holes in that).
>
> I *think* it needs to be strictly bytes-only (just like any real file handle)
> and refuse to open an existing string that doesn't meet that constraint.
> (With the inevitable ambiguity that if you only shove characters in the
> range 0-255 into your string, you're not going to realise that your code
> is buggy.)
>
> Nicholas Clark

Personally, I see no harm in doing a decode on close when opened for
writing as utf-8

--8<---
use v5.12;
use warnings;

binmode STDOUT, ":utf8";

my $data = "";

{ open my $fh, ">:encoding(utf-8)", \$data;
  print { $fh } "\x{20ac}\n";
  close $fh;
  }

{ open my $fh, "<:encoding(utf-8)", \$data;
  print <$fh>;
  close $fh;
  }

print $data;
utf8::decode ($data);
print $data;

{ open my $fh, "<:encoding(utf-8)", \$data;
  print <$fh>;
  close $fh;
  }

{ use open OUT => ":encoding(utf-8)";
  open my $fh, ">", \$data;
  print { $fh } "\x{20ac}\n";
  close $fh;
  }

{ use open IN => ":encoding(utf-8)";
  open my $fh, "<", \$data;
  print <$fh>;
  close $fh;
  }

print $data;
utf8::decode ($data);
print $data;

{ use open IN => ":encoding(utf-8)";
  open my $fh, "<", \$data;
  print <$fh>;
  close $fh;
  }
-->8---

$ perl test.pl

â¬






--
H.Merijn Brand http​://tux.nl Perl Monger http​://amsterdam.pm.org/
using perl5.00307 .. 5.14 porting perl5 on HP-UX, AIX, and openSUSE
http​://mirrors.develooper.com/hpux/ http​://www.test-smoke.org/
http​://qa.perl.org http​://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented Feb 14, 2012

From @ikegami

On Sun, Feb 12, 2012 at 5​:02 PM, Father Chrysostomos via RT <
perlbug-followup@​perl.org> wrote​:

> That it is currently buggy is not being questioned.

And the following test will detect regressions once it's fixed.

=====

use strict;
use warnings;

use Test::More tests => 1;

sub read_from_scalar {
  my ($file, $perlio) = @_;
  $perlio //= '';
  open my $fh, "<$perlio", \$file or die $!;
  local $/;
  return <$fh>;
}

sub hexify { join ' ', map sprintf('%02X', ord), split //, $_[0] }

{
  my $s = chr(0xE9);
  utf8::upgrade( my $u = $s );
  utf8::downgrade( my $d = $s );
  is( hexify(read_from_scalar($u)), hexify(read_from_scalar($d)),
      'Unicode bug in :scalar read' );
}

1;

p5pRT commented Dec 28, 2012

From @ribasushi

Is there any word on this issue? I just hit this bug in reverse[1] and
while there is ample discussion about it being a problem I see the same
behavior under current blead. Is there a chance *at least* for a warning
to be added so that it lands in 5.18?

Cheers

[1] http://www.perlmonks.org/?node_id=1010601

p5pRT commented Dec 28, 2012

From @Leont

On Fri, Dec 28, 2012 at 8​:34 AM, Peter Rabbitson <rabbit+p5p@​rabbit.us> wrote​:

> Is there any word on this issue? I just hit this bug in reverse[1] and
> while there is ample discussion about it being a problem I see the same
> behavior under current blead. Is there a chance *at least* for a warning
> to be added so that it lands in 5.18?

The process kind of fizzled somewhere. I'm in favor of a warning in
5.18, see attachment.

Leon

p5pRT commented Dec 28, 2012

From @Leont

0001-Warn-when-opening-utf8-string-into-handle.patch
From d486949439c66bd1a6e76af94468c374a58590f8 Mon Sep 17 00:00:00 2001
From: Leon Timmermans <fawaka@gmail.com>
Date: Fri, 28 Dec 2012 16:41:53 +0100
Subject: [PATCH] Warn when opening utf8 string into handle

---
 ext/PerlIO-scalar/scalar.pm  |    2 +-
 ext/PerlIO-scalar/scalar.xs  |    2 ++
 ext/PerlIO-scalar/t/scalar.t |   11 ++++++++++-
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/ext/PerlIO-scalar/scalar.pm b/ext/PerlIO-scalar/scalar.pm
index 813f5e6..64ecc22 100644
--- a/ext/PerlIO-scalar/scalar.pm
+++ b/ext/PerlIO-scalar/scalar.pm
@@ -1,5 +1,5 @@
 package PerlIO::scalar;
-our $VERSION = '0.15';
+our $VERSION = '0.16';
 require XSLoader;
 XSLoader::load();
 1;
diff --git a/ext/PerlIO-scalar/scalar.xs b/ext/PerlIO-scalar/scalar.xs
index d7b8828..48dbd32 100644
--- a/ext/PerlIO-scalar/scalar.xs
+++ b/ext/PerlIO-scalar/scalar.xs
@@ -41,6 +41,8 @@ PerlIOScalar_pushed(pTHX_ PerlIO * f, const char *mode, SV * arg,
 		SvREFCNT_inc(perl_get_sv
 			     (SvPV_nolen(arg), GV_ADD | GV_ADDMULTI));
 	}
+	if (SvUTF8(s->var) && ckWARN(WARN_UTF8))
+	    Perl_warner(aTHX_ packWARN(WARN_UTF8), "Should only map byte strings into in-memory filehandles\n");
     }
     else {
 	s->var = newSVpvn("", 0);
diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t
index d255a05..d96e2db 100644
--- a/ext/PerlIO-scalar/t/scalar.t
+++ b/ext/PerlIO-scalar/t/scalar.t
@@ -16,7 +16,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); # Not 0, 1, 2 everywhere.
 
 $| = 1;
 
-use Test::More tests => 82;
+use Test::More tests => 83;
 
 my $fh;
 my $var = "aaa\n";
@@ -384,3 +384,12 @@ SKIP: {
   close FILE;
   is $content, "Foo-Bar\n", 'duping via >&=';
 }
+
+{
+  use warnings;
+  my $content = "\x{100}";
+  my @warnings;
+  local $SIG{__WARN__} = sub { push @warnings, $_[0] };
+  open my $fh, '<', \$content;
+  is($warnings[0], "Should only map byte strings into in-memory filehandles\n", 'Trying to open a character string warns');
+}
-- 
1.7.6.1

p5pRT commented Dec 28, 2012

From @tonycoz

On Fri Dec 28 07​:45​:25 2012, LeonT wrote​:

> On Fri, Dec 28, 2012 at 8:34 AM, Peter Rabbitson
> <rabbit+p5p@rabbit.us> wrote:
>
> > Is there any word on this issue? I just hit this bug in reverse[1]
> > and while there is ample discussion about it being a problem I see
> > the same behavior under current blead. Is there a chance *at least*
> > for a warning to be added so that it lands in 5.18?
>
> The process kind of fizzled somewhere. I'm in favor of a warning in
> 5.18, see attachment.

It should fail to open. If you open a UTF8 flagged string for append
and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8
string.

Your patch as written ignores the principle that the SvUTF8() flag only
controls the internal encoding, not other behaviour. If the SV contains
only code point 0xFF or lower we should downgrade it and work with that
rather than failing (or producing a warning).
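
Until the layer enforces this itself, calling code can sidestep the problem by never handing it a character string in the first place. The sketch below is a user-side workaround under that assumption, not a description of the eventual fix: it encodes explicitly and then declares the encoding when opening the handle. The sample text is illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "caf\x{e9} costs 3\x{20ac}\n";   # a character string

# Encode explicitly, so the buffer is unambiguously bytes...
my $buffer = encode('UTF-8', $text);

# ...then say what those bytes mean when opening the handle.
open my $fh, '<:encoding(UTF-8)', \$buffer or die "open failed: $!";
my $line = <$fh>;
close $fh;

print $line eq $text ? "round trip ok\n" : "round trip FAILED\n";
```

Because the buffer contains only bytes, the result does not depend on how perl happens to store the original string internally.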

This should also be done for _read() and _write(), since the SV can be
modified between I/O operations.

There's an unrelated problem that _pushed() checks flags on both arg and
SvRV(arg) without calling SvGETMAGIC().

I'll take a look at these issues when I get home.

Tony
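
A minimal sketch of the principle referred to above, that the UTF8 flag only selects the internal encoding. It uses only the core `utf8::upgrade`, `utf8::downgrade` and `utf8::is_utf8` functions; the variable names are illustrative and not taken from the patch.

```perl
use strict;
use warnings;

# The same three-character string, stored two ways.
my $downgraded = "a\xC5b";      # one byte per character internally
my $upgraded   = "a\xC5b";
utf8::upgrade($upgraded);       # UTF-8 encoded internally, UTF8 flag on

# Perl-level semantics do not change with the flag.
my $same_string = $downgraded eq $upgraded;
my $same_length = length($downgraded) == length($upgraded);

# A code point above 0xFF cannot be downgraded. The second argument
# (FAIL_OK) makes utf8::downgrade return false instead of dying.
my $wide               = "\x{100}";
my $can_downgrade_wide = utf8::downgrade($wide, 1);

# The upgraded copy holds only code points <= 0xFF, so it downgrades cleanly.
my $can_downgrade_upgraded = utf8::downgrade($upgraded, 1);

print "same string\n"              if $same_string && $same_length;
print "wide string stays as is\n"  unless $can_downgrade_wide;
print "upgraded copy downgraded\n" if $can_downgrade_upgraded;
```

Under this view, an in-memory handle could downgrade on open and only has a real problem when the downgrade fails.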

@p5pRT
Author

p5pRT commented Dec 28, 2012

From @Leont

On Fri, Dec 28, 2012 at 11​:06 PM, Tony Cook via RT
<perlbug-followup@​perl.org> wrote​:

It should fail to open. If you open a UTF8 flagged string for append
and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8
string.

Your patch as written ignores the principle that the SvUTF8() flag only
controls the internal encoding, not other behaviour. If the SV contains
only code point 0xFF or lower we should downgrade it and work with that
rather than failing (or producing a warning).

I didn't see enough consensus to change it that much, but I would be in favor.

This should also be done for _read() and _write(), since the SV can be
modified between I/O operations.

There's an unrelated problem that _pushed() checks flags on both arg and
SvRV(arg) without calling SvGETMAGIC().

It should just stop peeking and poking into the SV altogether, and use
the proper APIs (sv_insert and friends). For that matter, I sometimes
feel like it should be rewritten from scratch to actually make sense.
Pretty much all of it is problematic.

Leon

@p5pRT
Author

p5pRT commented Dec 29, 2012

From @ribasushi

On Fri, Dec 28, 2012 at 11​:16​:36PM +0100, Leon Timmermans wrote​:

On Fri, Dec 28, 2012 at 11​:06 PM, Tony Cook via RT
<perlbug-followup@​perl.org> wrote​:

It should fail to open. If you open a UTF8 flagged string for append
and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8
string.

Your patch as written ignores the principle that the SvUTF8() flag only
controls the internal encoding, not other behaviour. If the SV contains
only code point 0xFF or lower we should downgrade it and work with that
rather than failing (or producing a warning).

I didn't see enough consensus to change it that much, but I would be in favor.

This should also be done for _read() and _write(), since the SV can be
modified between I/O operations.

There's an unrelated problem that _pushed() checks flags on both arg and
SvRV(arg) without calling SvGETMAGIC().

It should just stop peeking and poking into the SV altogether, and use
the proper APIs (sv_insert and friends). For that matter, I sometimes
feel like it should be rewritten from scratch to actually make sense.
Pretty much all of it is problematic.

This particular bit risks derailing the simpler yet more urgent bugfix.
Focus please ;)

Cheers

@p5pRT
Author

p5pRT commented Dec 31, 2012

From @tonycoz

On Fri, Dec 28, 2012 at 11​:16​:36PM +0100, Leon Timmermans wrote​:

On Fri, Dec 28, 2012 at 11​:06 PM, Tony Cook via RT
<perlbug-followup@​perl.org> wrote​:

It should fail to open. If you open a UTF8 flagged string for append
and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8
string.

Your patch as written ignores the principle that the SvUTF8() flag only
controls the internal encoding, not other behaviour. If the SV contains
only code point 0xFF or lower we should downgrade it and work with that
rather than failing (or producing a warning).

I didn't see enough consensus to change it that much, but I would be in favor.

This should also be done for _read() and _write(), since the SV can be
modified between I/O operations.

There's an unrelated problem that _pushed() checks flags on both arg and
SvRV(arg) without calling SvGETMAGIC().

It should just stop peeking and poking into the SV altogether, and use
the proper APIs (sv_insert and friends). For that matter, I sometimes
feel like it should be rewritten from scratch to actually make sense.
Pretty much all of it is problematic.

I've attached my suggested changes (in several parts), also available
on perl5.git.perl.org/perl.git as tonyc/perlio-scalar-sanity.

Reasons for failing instead of warning​:

1) reading - to follow the "SVf_UTF8 is only representation"
principle, we'd need to downgrade where possible, so a \xA1 (for
example) in the stream is always treated as that byte, but this means
we have an inconsistency when the scalar cannot be downgraded - the
first bytes of the character sequences "\xA1\x40" and "\xA1\x{101}"
would be different.

2) writing - if the SV is flagged UTF8, and the user of the handle
doesn't write correct UTF8 data at the correct offsets, the SV will no
longer be properly formed utf-8, which I believe we're trying to
maintain. One of my tests produced a warning about invalid UTF-8
before the fix was applied.

It's possible this could be avoided if we always treat the written
bytes as code points and upgrade them when writing to a UTF8 string,
but then we run into a consistency issue vs reading - what happens
when a read on a UTF8 string reaches a code point > 0xFF?

As written I think the warning message could be improved and the
documentation of the warning could be improved.

Tony
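
The reading inconsistency in point 1 can be sketched as follows. The helper `exposed_bytes` is hypothetical: it models a "downgrade where possible, otherwise expose the internal UTF-8" policy and is not how PerlIO::scalar is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: the bytes a handle would expose under a
# "downgrade if possible, else fall back to UTF-8 bytes" policy.
sub exposed_bytes {
    my ($string) = @_;
    my $copy = $string;
    return $copy if utf8::downgrade($copy, 1);   # all code points <= 0xFF
    utf8::encode($copy);                         # characters -> UTF-8 octets
    return $copy;
}

my $narrow = "\xA1\x40";       # downgradable
my $wide   = "\xA1\x{101}";    # not downgradable

# Both strings start with the same character, U+00A1 ...
my $first_narrow = ord substr(exposed_bytes($narrow), 0, 1);
my $first_wide   = ord substr(exposed_bytes($wide),   0, 1);

# ... but the first byte read from the handle would differ.
printf "narrow: %02X, wide: %02X\n", $first_narrow, $first_wide;
```

The first byte is A1 in one case and C2 in the other, even though both strings begin with the same character, which is the argument for failing the open rather than guessing.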

@p5pRT
Author

p5pRT commented Feb 1, 2013

From @Leont

On Fri, Feb 1, 2013 at 8​:16 AM, Aristotle Pagaltzis <pagaltzis@​gmx.de> wrote​:

1. You can’t. The string is downgraded far earlier, by the parser.

$ perl -MDevel​::Peek -e 'use utf8; Dump qq[aÅb]'
SV = PV(0x7fad5b801090) at 0x7fad5b8267a8
REFCNT = 1
FLAGS = (POK,READONLY,pPOK,UTF8)
PV = 0x10f10f850 "a\303\205b"\0 [UTF8 "a\x{c5}b"]
CUR = 4
LEN = 16

That's not downgraded at all, it has the utf8 flag.

2. It doesn’t matter if the byte value C5 is spelled C5 in the buffer and the
UTF8 flag is off, or C3 85 and UTF8 is on – both mean the same thing. If
it is downgradable, then it very well should be downgraded and accepted
silently. (I just realised some of my previous mail was a red herring, due
to this point.)

That abstraction leaks when it comes into contact with PerlIO. No
getting around that. Question is only​: how does it leak

Leon

@p5pRT
Author

p5pRT commented Feb 1, 2013

From @nwc10

On Fri, Feb 01, 2013 at 03​:54​:37PM +0100, Leon Timmermans wrote​:

On Fri, Feb 1, 2013 at 8​:16 AM, Aristotle Pagaltzis <pagaltzis@​gmx.de> wrote​:

2. It doesn't matter if the byte value C5 is spelled C5 in the buffer and the
UTF8 flag is off, or C3 85 and UTF8 is on - both mean the same thing. If
it is downgradable, then it very well should be downgraded and accepted
silently. (I just realised some of my previous mail was a red herring, due
to this point.)

That abstraction leaks when it comes into contact with PerlIO. No
getting around that. Question is only​: how does it leak

I feel that I'm asking a stupid question here, but why/how does it leak?
Is it leaking for the same reason as eval "leaks"? There, source code from
disk is in bytes, which needs an encoding layered atop it to map to
characters (even if it's a 1​:1 mapping). So "obviously", that's what the
parser expects. Stuff in the range 0-255, which might be a variable-width
encoding. But eval takes strings, and Perl-code has generated strings
of *characters* to feed to the parser. Stuff in the range 0-0x1FFFF (ish),
abstract representation (as far as Perl-space is concerned)

So, here, some code wants to think in terms of using file-like operations
on a sequence of octets held in a scalar (which were "obviously" octets
because that was what it was dealing with when it assigned to that scalar.)

Whereas other code wants to think in terms of using file-like operations on
a sequence of characters. (which were "obviously" characters because that
was what it was dealing with when it assigned to that scalar.)

And it's the same syntax to open either.

Is that the leakage you mean? That by the time the code comes to open the
scalar, it simply isn't clear whether the scalar is supposed to be holding
sequences of octets, or sequences of characters, and so the opening code
*can't* get the semantics of the open correct.

Or have I misunderstood?

Nicholas Clark

@p5pRT
Author

p5pRT commented Feb 1, 2013

From @ap

* Leon Timmermans <fawaka@​gmail.com> [2013-02-01 16​:00]​:

On Fri, Feb 1, 2013 at 8​:16 AM, Aristotle Pagaltzis <pagaltzis@​gmx.de> wrote​:

1. You can’t. The string is downgraded far earlier, by the parser.

$ perl -MDevel​::Peek -e 'use utf8; Dump qq[aÅb]'
SV = PV(0x7fad5b801090) at 0x7fad5b8267a8
REFCNT = 1
FLAGS = (POK,READONLY,pPOK,UTF8)
PV = 0x10f10f850 "a\303\205b"\0 [UTF8 "a\x{c5}b"]
CUR = 4
LEN = 16

That's not downgraded at all, it has the utf8 flag.

Err. I looked at the output thrice and always saw it clearly absent.
This is where I would blame lack of coffee, were I a coffee drinker.

Sorry for the noise.

2. It doesn’t matter if the byte value C5 is spelled C5 in the buffer
and the UTF8 flag is off, or C3 85 and UTF8 is on – both mean the
same thing. If it is downgradable, then it very well should be
downgraded and accepted silently. (I just realised some of my
previous mail was a red herring, due to this point.)

That abstraction leaks when it comes into contact with PerlIO. No
getting around that. Question is only​: how does it leak

There is no leak​: if you stick an `encode('UTF-8', ...)` in there, then
the code will be correct, regardless of whether the string is downgraded
or upgraded. As long as all operations on the string treat it as 3 units
long, the middle one of which has the value 0xC5, it is water-tight.

This is a matter not of leaky abstraction but of a missing affordance,
which leaves the programmer to carefully keep track of intent by hand
with no aid from the interpreter.

Regards,
--
Aristotle Pagaltzis // <http​://plasmasturm.org/>
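
A minimal sketch of the explicit `encode` approach described above, assuming the core Encode module. The variable names are illustrative; the point is that the scalar handed to `open` holds octets, so the internal representation of the original character string does not matter.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars = "a\x{C5}b";                 # three characters
utf8::upgrade($chars);                  # internal representation is irrelevant here

my $octets = encode('UTF-8', $chars);   # four octets: 61 C3 85 62

# The in-memory "file" contains octets; the layer decodes them on read.
open my $fh, '<:encoding(UTF-8)', \$octets
    or die "Cannot open in-memory handle: $!";
my $read_back = <$fh>;
close $fh;

print "round trip ok\n" if $read_back eq $chars;
```

Only the encoding given to `encode` and the one given to `open` have to agree; the encoding of the source file plays no part.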

@p5pRT
Author

p5pRT commented Feb 1, 2013

From @Leont

On Fri, Feb 1, 2013 at 4​:08 PM, Nicholas Clark <nick@​ccl4.org> wrote​:

On Fri, Feb 01, 2013 at 03​:54​:37PM +0100, Leon Timmermans wrote​:

On Fri, Feb 1, 2013 at 8​:16 AM, Aristotle Pagaltzis <pagaltzis@​gmx.de> wrote​:

2. It doesn't matter if the byte value C5 is spelled C5 in the buffer and the
UTF8 flag is off, or C3 85 and UTF8 is on - both mean the same thing. If
it is downgradable, then it very well should be downgraded and accepted
silently. (I just realised some of my previous mail was a red herring, due
to this point.)

That abstraction leaks when it comes into contact with PerlIO. No
getting around that. Question is only​: how does it leak

I feel that I'm asking a stupid question here, but why/how does it leak?
Is it leaking for the same reason as eval "leaks"? There, source code from
disk is in bytes, which needs an encoding layered atop it to map to
characters (even if it's a 1​:1 mapping). So "obviously", that's what the
parser expects. Stuff in the range 0-255, which might be a variable-width
encoding. But eval takes strings, and Perl-code has generated strings
of *characters* to feed to the parser. Stuff in the range 0-0x1FFFF (ish),
abstract representation (as far as Perl-space is concerned)

So, here, some code wants to think in terms of using file-like operations
on a sequence of octets held in a scalar (which were "obviously" octets
because that was what it was dealing with when it assigned to that scalar.)

Whereas other code wants to think in terms of using file-like operations on
a sequence of characters. (which were "obviously" characters because that
was what it was dealing with when it assigned to that scalar.)

And it's the same syntax to open either.

Is that the leakage you mean? That by the time the code comes to open the
scalar, it simply isn't clear whether the scalar is supposed to be holding
sequences of octets, or sequences of characters, and so the opening code
*can't* get the semantics of the open correct.

Or have I misunderstood?

Nicholas Clark

Yeah, it's that problem.

The old behavior was to leak the internal encoding.

The new behavior is to always expose something as Latin-1, even when
it originally wasn't, which I consider leaky too.

The third option is to reject any character string. This is obviously
leaky, but I'd still consider it the sanest of these methods.

A fourth would be for scalar to be more utf8 aware, I'm not sure
that's a good idea conceptually though.

Leon

@p5pRT
Author

p5pRT commented Feb 1, 2013

From @ikegami

All that's needed to make it sane is a "Wide character" warning/error when
the input has chars>255 to remind users they are missing an encode().
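
As a rough illustration of that suggestion at the Perl level, the check could look like the sketch below. `open_octets_for_reading` is a hypothetical helper written for this example, not an existing API, and the error text is made up.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper: refuse strings that cannot be octets, accept the rest
# regardless of how they happen to be stored internally.
sub open_octets_for_reading {
    my ($string_ref) = @_;
    croak "Wide character in in-memory file (missing encode()?)"
        if $$string_ref =~ /[^\x00-\xFF]/;

    my $octets = $$string_ref;
    utf8::downgrade($octets);    # cannot fail: no code point is above 0xFF
    open my $fh, '<', \$octets
        or croak "Cannot open in-memory handle: $!";
    return $fh;
}

# Characters 128-255 are accepted whether or not the UTF8 flag is set.
my $flagged = "caf\xE9\n";
utf8::upgrade($flagged);
my $fh   = open_octets_for_reading(\$flagged);
my $line = <$fh>;

# A character above 255 is rejected.
my $wide_ok = eval { open_octets_for_reading(\"\x{263A}"); 1 };
my $error   = $@;
```

Working on a downgraded copy keeps the caller's scalar untouched, at the cost of not reflecting later changes to it.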

@p5pRT
Author

p5pRT commented Feb 1, 2013

From @Leont

On Fri, Feb 1, 2013 at 7​:13 PM, Eric Brine <ikegami@​adaelis.com> wrote​:

All that's needed to make it sane is a "Wide character" warning/error when
the input has chars>255 to remind users they are missing an encode().

But what about the chars 127-255?

Leon

@p5pRT
Author

p5pRT commented Feb 2, 2013

From @ap

* Leon Timmermans <fawaka@​gmail.com> [2013-02-01 17​:15]​:

The new behavior is to always expose something as Latin-1

It only looks like Latin-1, because it’s exposing the U+0000..U+00FF
range of Unicode characters as faux bytes, and that happens to coincide
with Latin-1. But you can’t tell they are characters rather than bytes,
because there is no way to mark the string as containing one or the
other.

even when it originally wasn't, which I consider leaky too.

How was it not? The file contained the bytes C3 85 and a declaration to
`use utf8`. Taken together those mean that the bytes in the quote-like
literal are to automatically be decoded to a single string element with
ordinal value C5. Which is what happens.

If you’re saying the decoding is undesirable here then you’re saying
that the programmer forgot to say `do { no utf8; qq[aÅb] }` so he would
get the undecoded UTF-8 bytes.**

This still doesn’t track programmer intent. It’s impossible to tell by
looking at either the undecoded or decoded strings whether the intent
was for them to be octet or character strings.

** That is the symmetric alternate to an explicit `encode('UTF-8',...)`
  somewhere. But note that the encoding declared to the `open` is then
  dependent on the encoding of the source file. With explicit `encode`,
  only the encodings given to the `encode` and `open` must be identical
  but the source file encoding becomes irrelevant. Therefore I like
  that approach better.

The third option is to reject any character string. This is obviously
leaky, but I'd still consider it the sanest of these methods.

Except of course you cannot implement this in Perl as she is because no
way of marking strings as characters exists.

But leaky how? Intent would be clear here.

A fourth would be for scalar to be more utf8 aware, I'm not sure
that's a good idea conceptually though.

How do you mean?

Regards,
--
Aristotle Pagaltzis // <http​://plasmasturm.org/>

@p5pRT
Author

p5pRT commented Feb 2, 2013

From @Leont

On Sat, Feb 2, 2013 at 5​:29 AM, Aristotle Pagaltzis <pagaltzis@​gmx.de> wrote​:

* Leon Timmermans <fawaka@​gmail.com> [2013-02-01 17​:15]​:

The new behavior is to always expose something as Latin-1

It only looks like Latin-1, because it’s exposing the U+0000..U+00FF
range of Unicode characters as faux bytes, and that happens to coincide
with Latin-1. But you can’t tell they are characters rather than bytes,
because there is no way to mark the string as containing one or the
other.

True, I phrased it poorly. How about​: The new behavior is to always
expose character strings as Latin-1, or failure if that's impossible.

even when it originally wasn't, which I consider leaky too.

How was it not? The file contained the bytes C3 85 and a declaration to
`use utf8`. Taken together those mean that the bytes in the quote-like
literal are to automatically be decoded to a single string element with
ordinal value C5. Which is what happens.

Well, instead of the current encoding you have to know if it can be
encoded as Latin-1. I don't think that's really better.

** That is the symmetric alternate to an explicit `encode('UTF-8',...)`
somewhere. But note that the encoding declared to the `open` is then
dependent on the encoding of the source file. With explicit `encode`,
only the encodings given to the `encode` and `open` must be identical
but the source file encoding becomes irrelevant. Therefore I like
that approach better.

I like that approach better too

The third option is to reject any character string. This is obviously
leaky, but I'd still consider it the sanest of these methods.

Except of course you cannot implement this in Perl as she is because no
way of marking strings as characters exists.

But leaky how? Intent would be clear here.

You mostly answered your own question. You can't get closer than
reject utf-8 strings and assume anything else is a bytestring.

A fourth would be for scalar to be more utf8 aware, I'm not sure
that's a good idea conceptually though.

How do you mean?

I meant the :scalar layer

Leon

@p5pRT
Author

p5pRT commented Feb 3, 2013

From @ikegami

On Fri, Feb 1, 2013 at 1​:16 PM, Leon Timmermans <fawaka@​gmail.com> wrote​:

On Fri, Feb 1, 2013 at 7​:13 PM, Eric Brine <ikegami@​adaelis.com> wrote​:

All that's needed to make it sane is a "Wide character" warning/error
when
the input has chars>255 to remind users they are missing an encode().

But what about the chars 127-255?

What about them? Char 127 is char 127, no matter how it's stored.

@p5pRT
Author

p5pRT commented Feb 4, 2013

From @tonycoz

On Wed, Jan 30, 2013 at 03​:26​:32PM -0800, Karl Williamson via RT wrote​:

I don't know what I pressed to cause it to send while typing the
message, but send it did. So hopefully this will work better.

On Wed Jan 30 15​:20​:46 2013, khw wrote​:

Commit 02c3c86 changed the behavior so
one could not successfully open a scalar with code points above 0xFF.

But this test case shows an issue with this​:

use utf8;
my $string = qq[a�b];
my $fh = IO​::File->new();
$fh->open(\$string, '<​:encoding(UTF-8)');

The problem is that the character in the string (which is showing up
incorrectly encoded here, but is a U+00C5) is in Latin 1. Since the
string is encodable in Latin1, the open succeeds, while silently
downgrading from UTF-8 to Latin1, but the :encoding(UTF-8) doesn't play
well with that, with the result that this silently breaks.

As others have said, you're expecting $string to behave as if it
contains more than \x61\xC5\x62, which it doesn't.

But the silent change in behaviour (especially this late) could be
dangerous.

Some possible changes we could make​:

a) none - leave it alone and hopefully people will notice their broken
code and fix it.

b) keep the new behaviour, but only when

  use feature 'unicode_strings';

is in scope where the "file" is opened, since unicode_strings includes
related behaviour - treating \xC5 as that code-point whether the
string is SVf_UTF8 flagged or not.

Of course, that's not exactly the same concept as the other behaviour
changes for unicode_string, and would end up enabled wherever

c) keep the new behaviour, but only when

  use feature 'sane_perlio_scalar';

is in scope when the "file" is opened, since unicode_strings covers
different types of behaviour and has been the default under use
<someversion> since 5.12.

'sane_perlio_scalar' would be part of the 5.20 feature bundle.

d) just remove the change and revert to 5.16 behaviour

e) revert the change and warn if SvUTF8 is on

f) revert the change and fail the open if SvUTF8 is on

g) revert the change and provide a deprecation notice, and reapply the
change immediately after 5.18 is released.

and combinations of g) and e) or f).

At this point I'd favour c).

Tony
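
The mismatch in the test case above can be sketched without a filehandle, assuming the core Encode module. The two octet strings are what the :encoding(UTF-8) layer would be asked to decode under the old behaviour (internal UTF-8 exposed) and the new one (string downgraded first).

```perl
use strict;
use warnings;
use Encode qw(decode);

my $expected = "a\x{C5}b";

# Old behaviour: the handle exposed the internal UTF-8 encoding of U+00C5.
my $utf8_octets = "a\xC3\x85b";

# New behaviour: the string is downgraded, so the handle yields one octet, C5.
my $latin1_octets = "a\xC5b";

my $from_utf8   = decode('UTF-8', $utf8_octets);
my $from_latin1 = decode('UTF-8', $latin1_octets);   # lone C5 is malformed UTF-8

print "old behaviour decodes to the expected string\n"
    if $from_utf8 eq $expected;
print "new behaviour does not\n"
    if $from_latin1 ne $expected;
```

Code that relied on the old behaviour was in effect reading the scalar's internal bytes, which is why the change breaks it silently rather than with an error.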

@p5pRT
Author

p5pRT commented Feb 4, 2013

From @ribasushi

On Sat, Feb 02, 2013 at 10​:11​:30AM +0100, Leon Timmermans wrote​:

On Sat, Feb 2, 2013 at 5​:29 AM, Aristotle Pagaltzis <pagaltzis@​gmx.de> wrote​:

Except of course you cannot implement this in Perl as she is because no
way of marking strings as characters exists.

But leaky how? Intent would be clear here.

You mostly answered your own question. You can't get closer than
reject utf-8 strings and assume anything else is a bytestring.

You could go one step further though, when you are adamant about
operating in bytes, as seen in this if-block​: [1]

Cheers

[1] https://metacpan.org/source/RIBASUSHI/Devel-PeekPoke-0.03/lib/Devel/PeekPoke/PP.pm#L100

@p5pRT
Author

p5pRT commented Feb 4, 2013

From @ribasushi

On Mon, Feb 04, 2013 at 09​:50​:59PM +1100, Tony Cook wrote​:

On Wed, Jan 30, 2013 at 03​:26​:32PM -0800, Karl Williamson via RT wrote​:

I don't know what I pressed to cause it to send while typing the
message, but send it did. So hopefully this will work better.

On Wed Jan 30 15​:20​:46 2013, khw wrote​:

Commit 02c3c86 changed the behavior so
one could not successfully open a scalar with code points above 0xFF.

But this test case shows an issue with this​:

use utf8;
my $string = qq[a�b];
my $fh = IO​::File->new();
$fh->open(\$string, '<​:encoding(UTF-8)');

The problem is that the character in the string (which is showing up
incorrectly encoded here, but is a U+00C5) is in Latin 1. Since the
string is encodable in Latin1, the open succeeds, while silently
downgrading from UTF-8 to Latin1, but the :encoding(UTF-8) doesn't play
well with that, with the result that this silently breaks.

As others have said, you're expecting $string to behave as if it
contains more than \x61\xC5\x62, which it doesn't.

But the silent change in behaviour (especially this late) could be
dangerous.

Some possible changes we could make​:

a) none - leave it alone and hopefully people will notice their broken
code and fix it.

b) keep the new behaviour, but only when

use feature 'unicode_strings';

is in scope where the "file" is opened, since unicode_strings includes
related behaviour - treating \xC5 as that code-point whether the
string is SVf_UTF8 flagged or not.

Of course, that's not exactly the same concept as the other behaviour
changes for unicode_string, and would end up enabled wherever

c) keep the new behaviour, but only when

use feature 'sane_perlio_scalar';

is in scope when the "file" is opened, since unicode_strings covers
different types of behaviour and has been the default under use
<someversion> since 5.12.

'sane_perlio_scalar' would be part of the 5.20 feature bundle.

d) just remove the change and revert to 5.16 behaviour

e) revert the change and warn if SvUTF8 is on

f) revert the change and fail the open if SvUTF8 is on

g) revert the change and provide a deprecation notice, and reapply the
change immediately after 5.18 is released.

and combinations of g) and e) or f).

I vote for a minimum of E), preferably G) or (best case scenario,
closest to fulfilling the least surprise principle) F)

At this point I'd favour c).

That would be rather unfortunate, since it is no different from D) for
the overwhelmingly vast majority of perl programmers in the field.

Cheers

@p5pRT
Author

p5pRT commented Feb 4, 2013

From @doy

On Mon, Feb 04, 2013 at 11​:02​:35PM +1100, Peter Rabbitson wrote​:

On Mon, Feb 04, 2013 at 09​:50​:59PM +1100, Tony Cook wrote​:

On Wed, Jan 30, 2013 at 03​:26​:32PM -0800, Karl Williamson via RT wrote​:

I don't know what I pressed to cause it to send while typing the
message, but send it did. So hopefully this will work better.

On Wed Jan 30 15​:20​:46 2013, khw wrote​:

Commit 02c3c86 changed the behavior so
one could not successfully open a scalar with code points above 0xFF.

But this test case shows an issue with this​:

use utf8;
my $string = qq[a�b];
my $fh = IO​::File->new();
$fh->open(\$string, '<​:encoding(UTF-8)');

The problem is that the character in the string (which is showing up
incorrectly encoded here, but is a U+00C5) is in Latin 1. Since the
string is encodable in Latin1, the open succeeds, while silently
downgrading from UTF-8 to Latin1, but the :encoding(UTF-8) doesn't play
well with that, with the result that this silently breaks.

As others have said, you're expecting $string to behave as if it
contains more than \x61\xC5\x62, which it doesn't.

But the silent change in behaviour (especially this late) could be
dangerous.

Some possible changes we could make​:

a) none - leave it alone and hopefully people will notice their broken
code and fix it.

b) keep the new behaviour, but only when

use feature 'unicode_strings';

is in scope where the "file" is opened, since unicode_strings includes
related behaviour - treating \xC5 as that code-point whether the
string is SVf_UTF8 flagged or not.

Of course, that's not exactly the same concept as the other behaviour
changes for unicode_string, and would end up enabled wherever

c) keep the new behaviour, but only when

use feature 'sane_perlio_scalar';

is in scope when the "file" is opened, since unicode_strings covers
different types of behaviour and has been the default under use
<someversion> since 5.12.

'sane_perlio_scalar' would be part of the 5.20 feature bundle.

d) just remove the change and revert to 5.16 behaviour

e) revert the change and warn if SvUTF8 is on

f) revert the change and fail the open if SvUTF8 is on

g) revert the change and provide a deprecation notice, and reapply the
change immediately after 5.18 is released.

and combinations of g) and e) or f).

I vote for a minimum of E), preferably G) or (best case scenario,
closest to fulfilling the least surprise principle) F)

At this point I'd favour c).

That would be rather unfortunate, since it is no different from D) for
the overwhelmingly vast majority of perl programmers in the field.

The SVf_UTF8 flag itself should have no effect on the behavior of a
program (assuming it results in the string having an equivalent list of
logical characters). It is entirely an implementation detail, and
relying on it for anything other than that is broken. It is true that
people do do this, but it is wrong, and not something we should
encourage, as it just leads to more confusion about how Unicode things
work.

-doy

@p5pRT
Author

p5pRT commented Feb 4, 2013

From @ribasushi

On Mon, Feb 04, 2013 at 09​:54​:01AM -0600, Jesse Luehrs wrote​:

On Mon, Feb 04, 2013 at 11​:02​:35PM +1100, Peter Rabbitson wrote​:

On Mon, Feb 04, 2013 at 09​:50​:59PM +1100, Tony Cook wrote​:

On Wed, Jan 30, 2013 at 03​:26​:32PM -0800, Karl Williamson via RT wrote​:

I don't know what I pressed to cause it to send while typing the
message, but send it did. So hopefully this will work better.

On Wed Jan 30 15​:20​:46 2013, khw wrote​:

Commit 02c3c86 changed the behavior so
one could not successfully open a scalar with code points above 0xFF.

But this test case shows an issue with this​:

use utf8;
my $string = qq[a�b];
my $fh = IO​::File->new();
$fh->open(\$string, '<​:encoding(UTF-8)');

The problem is that the character in the string (which is showing up
incorrectly encoded here, but is a U+00C5) is in Latin 1. Since the
string is encodable in Latin1, the open succeeds, while silently
downgrading from UTF-8 to Latin1, but the :encoding(UTF-8) doesn't play
well with that, with the result that this silently breaks.

As others have said, you're expecting $string to behave as if it
contains more than \x61\xC5\x62, which it doesn't.

But the silent change in behaviour (especially this late) could be
dangerous.

Some possible changes we could make​:

a) none - leave it alone and hopefully people will notice their broken
code and fix it.

b) keep the new behaviour, but only when

use feature 'unicode_strings';

is in scope where the "file" is opened, since unicode_strings includes
related behaviour - treating \xC5 as that code-point whether the
string is SVf_UTF8 flagged or not.

Of course, that's not exactly the same concept as the other behaviour
changes for unicode_string, and would end up enabled wherever

c) keep the new behaviour, but only when

use feature 'sane_perlio_scalar';

is in scope when the "file" is opened, since unicode_strings covers
different types of behaviour and has been the default under use
<someversion> since 5.12.

'sane_perlio_scalar' would be part of the 5.20 feature bundle.

d) just remove the change and revert to 5.16 behaviour

e) revert the change and warn if SvUTF8 is on

f) revert the change and fail the open if SvUTF8 is on

g) revert the change and provide a deprecation notice, and reapply the
change immediately after 5.18 is released.

and combinations of g) and e) or f).

I vote for a minimum of E), preferably G) or (best case scenario,
closest to fulfilling the least surprise principle) F)

At this point I'd favour c).

That would be rather unfortunate, since it is no different from D) for
the overwhelmingly vast majority of perl programmers in the field.

The SVf_UTF8 flag itself should have no effect on the behavior of a
program (assuming it results in the string having an equivalent list of
logical characters). It is entirely an implementation detail, and
relying on it for anything other than that is broken. It is true that
people do do this, but it is wrong, and not something we should
encourage, as it just leads to more confusion about how Unicode things
work.

Just to clarify - which part of my answer were you replying to...? Your
reply reads like you share my view that option D) is a bad idea. Yet you
seem to support C). Do you find them different as far as the experience
of most users is concerned?

Cheers

@p5pRT
Author

p5pRT commented Feb 4, 2013

From @doy

On Tue, Feb 05, 2013 at 04​:24​:50AM +1100, Peter Rabbitson wrote​:

On Mon, Feb 04, 2013 at 09​:54​:01AM -0600, Jesse Luehrs wrote​:

On Mon, Feb 04, 2013 at 11​:02​:35PM +1100, Peter Rabbitson wrote​:

On Mon, Feb 04, 2013 at 09​:50​:59PM +1100, Tony Cook wrote​:

On Wed, Jan 30, 2013 at 03​:26​:32PM -0800, Karl Williamson via RT wrote​:

I don't know what I pressed to cause it to send while typing the
message, but send it did. So hopefully this will work better.

On Wed Jan 30 15​:20​:46 2013, khw wrote​:

Commit 02c3c86 changed the behavior so
one could not successfully open a scalar with code points above 0xFF.

But this test case shows an issue with this​:

use utf8;
my $string = qq[a�b];
my $fh = IO​::File->new();
$fh->open(\$string, '<​:encoding(UTF-8)');

The problem is that the character in the string (which is showing up
incorrectly encoded here, but is a U+00C5) is in Latin 1. Since the
string is encodable in Latin1, the open succeeds, while silently
downgrading from UTF-8 to Latin1, but the :encoding(UTF-8) doesn't play
well with that, with the result that this silently breaks.

As others have said, you're expecting $string to behave as if it
contains more than \x61\xC5\x62, which it doesn't.

But the silent change in behaviour (especially this late) could be
dangerous.

Some possible changes we could make​:

a) none - leave it alone and hopefully people will notice their broken
code and fix it.

b) keep the new behaviour, but only when

use feature 'unicode_strings';

is in scope where the "file" is opened, since unicode_strings includes
related behaviour - treating \xC5 as that code-point whether the
string is SVf_UTF8 flagged or not.

Of course, that's not exactly the same concept as the other behaviour
changes for unicode_string, and would end up enabled wherever

c) keep the new behaviour, but only when

use feature 'sane_perlio_scalar';

is in scope when the "file" is opened, since unicode_strings covers
different types of behaviour and has been the default under use
<someversion> since 5.12.

'sane_perlio_scalar' would be part of the 5.20 feature bundle.

d) just remove the change and revert to 5.16 behaviour

e) revert the change and warn if SvUTF8 is on

f) revert the change and fail the open if SvUTF8 is on

g) revert the change and provide a deprecation notice, and reapply the
change immediately after 5.18 is released.

and combinations of g) and e) or f).

I vote for a minimum of E), preferably G) or (best case scenario,
closest to fulfilling the least surprise principle) F)

At this point I'd favour c).

That would be rather unfortunate, since it is no different from D) for
the overwhelmingly vast majority of perl programmers in the field.

The SVf_UTF8 flag itself should have no effect on the behavior of a
program (assuming it results in the string having an equivalent list of
logical characters). It is entirely an implementation detail, and
relying on it for anything other than that is broken. It is true that
people do do this, but it is wrong, and not something we should
encourage, as it just leads to more confusion about how Unicode things
work.

Just to clarify - which part of my answer were you replying to...? Your
reply reads like you share my view that option D) is a bad idea. Yet you
seem to support C). Do you find them different as far as the experience
of most users is concerned?

My only point was that I am opposed to E and F. I don't really have much
of an opinion otherwise.

-doy

p5pRT commented Feb 4, 2013

From @Leont

On Mon, Feb 4, 2013 at 4​:54 PM, Jesse Luehrs <doy@​tozt.net> wrote​:

The SVf_UTF8 flag itself should have no effect on the behavior of a
program (assuming it results in the string having an equivalent list of
logical characters).

I don't think Tony listed any option for which that is really true.
Some leak on open, and others leak on reading/writing to them. Some
leak depending on content. Some leak differently in different
circumstances. But all of them leak. *All*.

It is entirely an implementation detail, and
relying on it for anything other than that is broken. It is true that
people do do this, but it is wrong, and not something we should
encourage, as it just leads to more confusion about how Unicode things
work.

That may be true for strings, but it isn't true for IO. Problem is
that those two worlds interact inside-out here.

Leon

p5pRT commented Feb 4, 2013

From @doy

On Mon, Feb 04, 2013 at 08​:06​:08PM +0100, Leon Timmermans wrote​:

On Mon, Feb 4, 2013 at 4​:54 PM, Jesse Luehrs <doy@​tozt.net> wrote​:

The SVf_UTF8 flag itself should have no effect on the behavior of a
program (assuming it results in the string having an equivalent list of
logical characters).

I don't think Tony listed any option for which that is really true.
Some leak on open, and others leak on reading/writing to them. Some
leak depending on content. Some leak differently in different
circumstances. But all of them leak. *All*.

It is entirely an implementation detail, and
relying on it for anything other than that is broken. It is true that
people do do this, but it is wrong, and not something we should
encourage, as it just leads to more confusion about how Unicode things
work.

That may be true for strings, but it isn't true for IO. Problem is
that those two worlds interact inside-out here.

When is it not true for IO?

  $ perl -wE'my $str = "\xce"; utf8::upgrade($str); say $str' | hexdump
  0000000 0ace
  0000002

  $ perl -wE'my $str = "\xce"; say $str' | hexdump
  0000000 0ace
  0000002

Or am I missing something?
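The same comparison can be made without a shell pipeline. This is a sketch under the assumption that a temporary file is an acceptable stand-in for STDOUT; `octets_written` is a hypothetical helper introduced only for this illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a string to a fresh temporary file with no encoding layer and
# return the raw octets that ended up on disk.
sub octets_written {
    my ($str) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode $fh;
    print {$fh} $str;
    close $fh or die "close failed: $!";

    open my $in, '<:raw', $path or die "open failed: $!";
    local $/;                      # slurp mode
    my $octets = <$in>;
    close $in;
    return $octets;
}

my $plain   = "\xce";              # stored as a single byte
my $flagged = "\xce";
utf8::upgrade($flagged);           # same character, stored as UTF-8 internally

my $from_plain   = octets_written($plain);
my $from_flagged = octets_written($flagged);

printf "plain:   %s\n", unpack('H*', $from_plain);
printf "flagged: %s\n", unpack('H*', $from_flagged);
```

Both calls write the single octet 0xCE, matching the hexdump output above (minus the newline that `say` appends).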

-doy

p5pRT commented Feb 4, 2013

From @Leont

On Mon, Feb 4, 2013 at 8​:19 PM, Jesse Luehrs <doy@​tozt.net> wrote​:

When is it not true for IO?

It's true on the inside, but not on the outside (or at least it gives
a warning when that expectation is broken).

The problem here is that the string is used as the outside.

Leon

p5pRT commented Feb 4, 2013

From @Leont

On Mon, Feb 4, 2013 at 8​:41 PM, Leon Timmermans <fawaka@​gmail.com> wrote​:

It's true on the inside, but not on the outside (or at least it gives
a warning when that expectation is broken).

The problem here is that the string is used as the outside.

Better said, any read after «open my $fh, '<', \$scalar» leaks the
representation of $scalar. Always. The previous behavior was to expose
the byte value of either. The current behavior is to force Latin-1, and
fail if that isn't possible. Both leaky in their own way.

Leon

p5pRT commented Feb 4, 2013

From @doy

On Mon, Feb 04, 2013 at 09​:03​:43PM +0100, Leon Timmermans wrote​:

On Mon, Feb 4, 2013 at 8​:41 PM, Leon Timmermans <fawaka@​gmail.com> wrote​:

It's true on the inside, but not on the outside (or at least it gives
a warning when that expectation is broken).

The problem here is that the string is used as the outside.

Better said, any read after «open my $fh, '<', \$scalar» leaks the
representation of $scalar. Always. The previous behavior was to expose
the byte value of either. The current behavior is to force Latin-1, and
fail if that isn't possible. Both leaky in their own way.

Sure, but that leak is a bug (the bug that is the topic of this ticket).
I'm just talking about what the correct behavior should be, not what the
current implementation does. Looking at the SVf_UTF8 flag is not going
to be correct regardless of the situation, because you can end up with
two strings that look identical from perl space, but do different things
when opened. It might make certain cases do the right thing, but it's
never going to be 100% correct. Only allowing codepoints less than 256
is probably the only reasonable option here.

-doy

p5pRT commented Feb 5, 2013

From @khwilliamson

On 02/04/2013 03​:50 AM, Tony Cook wrote​:

As others have said, you're expecting $string to behave as if it
contains more than \x61\xC5\x62, which it doesn't.

A slight tangent​:

I did not originate this example; it came from another ticket
https://rt-archive.perl.org/perl5/Ticket/Display.html?id=116322
and the relevant line from it reads​:
  my $string = qq[aÅb]; # use utf8 makes $string UTF-8.

My mental model of how things should work bought this without thinking.
  It's clear that that poster also thought it was obvious that things
should work the way he thought.

Looking just now at the documentation for 'use utf8', I don't see
anything that makes it clear that our mental models (and I bet those of
a lot of other people) are wrong. And that tells me that whatever the
outcome of #109828, utf8.pm's pod needs to be significantly clarified.

p5pRT commented Feb 5, 2013

From @nwc10

On Mon, Feb 04, 2013 at 02​:17​:53PM -0600, Jesse Luehrs wrote​:

On Mon, Feb 04, 2013 at 09​:03​:43PM +0100, Leon Timmermans wrote​:

On Mon, Feb 4, 2013 at 8​:41 PM, Leon Timmermans <fawaka@​gmail.com> wrote​:

It's true on the inside, but not on the outside (or at least it gives
a warning when that expectation is broken).

The problem here is that the string is used as the outside.

Better said, any read after «open my $fh, '<', \$scalar» leaks the
representation of $scalar. Always. The previous behavior was to expose
the byte value of either. The current behavior is to force Latin-1, and
fail if that isn't possible. Both leaky in their own way.

Not sure if the word "leak" (in the sense of information leak) is the right
word here.

In that, to my mind, the design problem is that there are two possible
(conflicting) actions the programmer wanted to mean by
«open my $fh, '<', \$scalar», which the interpreter has no way of
distinguishing.

1) treat scalar as "the outside", in which case it has to be in octets, to
  be meaningful. (And the programmer knows that he/she needs to deal with
  encoding issues, as this is a boundary where that is necessary)

2) treat scalar as "inside", in which case it has to be in characters, to be
  meaningful.

Both make sense.

But we have an internal representation which is ambiguous between "octets
from the outside" and "characters from the inside". So we can't tell.

Sure, but that leak is a bug (the bug that is the topic of this ticket).
I'm just talking about what the correct behavior should be, not what the
current implementation does. Looking at the SVf_UTF8 flag is not going
to be correct regardless of the situation, because you can end up with
two strings that look identical from perl space, but do different things
when opened. It might make certain cases do the right thing, but it's
never going to be 100% correct. Only allowing codepoints less than 256
is probably the only reasonable option here.

Given that the model we're trying to converge on is that SvUTF8()
is representation, not semantics*, I think this ends up as the only
consistent answer, even though it rules out use case (2)

Really, we need two versions of the syntax.
Given that we have polymorphic strings, we need monomorphic operators.

Nicholas Clark

* And if that's not the case, why am I allowed to concatenate a string having
  SvUTF8() true with one having SvUTF8() false?

p5pRT commented Feb 5, 2013

From @Leont

On Tue, Feb 5, 2013 at 2​:47 PM, Nicholas Clark <nick@​ccl4.org> wrote​:

Not sure if the word "leak" (in the sense of information leak) is the right
word here.

In that, to my mind, the design problem is that there are two possible
(conflicting) actions the programmer wanted to mean by
«open my $fh, '<', \$scalar», which the interpreter has no way of
distinguishing.

1) treat scalar as "the outside", in which case it has to be in octets, to
be meaningful. (And the programmer knows that he/she needs to deal with
encoding issues, as this is a boundary where that is necessary)

2) treat scalar as "inside", in which case it has to be in characters, to be
meaningful.

Both make sense.

But we have an internal representation which is ambiguous between "octets
from the outside" and "characters from the inside". So we can't tell.

Agreed.

Sure, but that leak is a bug (the bug that is the topic of this ticket).
I'm just talking about what the correct behavior should be, not what the
current implementation does. Looking at the SVf_UTF8 flag is not going
to be correct regardless of the situation, because you can end up with
two strings that look identical from perl space, but do different things
when opened. It might make certain cases do the right thing, but it's
never going to be 100% correct. Only allowing codepoints less than 256
is probably the only reasonable option here.

Given that the model we're trying to converge on is that SvUTF8()
is representation, not semantics*, I think this ends up as the only
consistent answer, even though it rules out use case (2)

Really, we need two versions of the syntax.
Given that we have polymorphic strings, we need monomorphic operators.

open my $fh, '<:scalar(characters)', \$buffer ?

Leon

p5pRT commented Feb 5, 2013

From @rjbs

* Karl Williamson via RT <perlbug-followup@​perl.org> [2013-01-30T18​:26​:32]

use utf8;
my $string = qq[aÅb];
my $fh = IO::File->new();
$fh->open(\$string, '<:encoding(UTF-8)');

The problem is that the character in the string (which is showing up
incorrectly encoded here, but is a U+00C5) is in Latin 1. Since the
string is encodable in Latin1, the open succeeds, while silently
downgrading from UTF-8 to Latin1, but the :encoding(UTF-8) doesn't play
well with that, with the result that this silently breaks.

Okay, I have not read the whole huge thread that has sprung up, so forgive me
if I'm replaying it. The downgrade seems to be the error here.

When we say 'open' we have to act like we're opening a bytestream, and that the
layers mediate the byte/maybe-character boundary. Saying "open my $fh,
\$charstring" is like saying "5 + q{hello}" -- we're using a byteish operator
on a textish string. Too bad scalar types are too lax, in this case, huh?

At any rate, the thing we *can* say about strings is that they are sequences of
non-negative integer values. If we use them in a byteish context, we should
treat them like bytes. The fact that the byte \xC5 might be represented in
memory by two bytes with a flag on the SV saying "decode this before giving the
value back" is at a different level of abstraction.

If someone tries to open the string qq[aÅb] with a UTF-8 decoding layer, it
should not be readable, because the three values to be read from the stream are
not valid UTF-8. Barf at decode time.

If someone tries to open the string qq[aĥb] (three codepoints, the second being
0x0125) we can probably fail at open time, which is, I believe, the intent of
Tony's patch. We could issue a wide character warning, instead, but I think
that this is excessively permissive.
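A small probe for the wide-character case might look like the sketch below. It assumes the behaviour described in this thread for blead (open fails when the scalar holds a code point above 0xFF); older releases are expected to let the open succeed, so the script reports whichever outcome it observes rather than assuming one.

```perl
use strict;
use warnings;
no warnings 'utf8';    # silence the diagnostic newer perls emit on failure

# Three code points; the second (U+0125) does not fit in a byte.
my $wide = "a\x{125}b";

my $opened = open(my $fh, '<', \$wide) ? 1 : 0;
my $errno  = $opened ? '' : "$!";

if ($opened) {
    # Older behaviour: the handle exposes the internal byte representation.
    print "open succeeded\n";
    close $fh;
}
else {
    # Behaviour after the change discussed here: fail at open time.
    print "open failed: $errno\n";
}
```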

Quite possibly I have missed some subtleties. I hope someone will point them
out, if so.

--
rjbs

p5pRT commented Feb 6, 2013

From @xdg

On Tue, Feb 5, 2013 at 6​:52 PM, Ricardo Signes
<perl.p5p@​rjbs.manxome.org> wrote​:

If someone tries to open the string qq[aÅb] with a UTF-8 decoding layer, it
should not be readable, because the three values to be read from the stream are
not valid UTF-8. Barf at decode time.

If someone tries to open the string qq[aĥb] (three codepoints, the second being
0x0125) we can probably fail at open time, which is, I believe, the intent of
Tony's patch. We could issue a wide character warning, instead, but I think
that this is excessively permissive.

+1

To the fullest extent possible, opening a scalar should be opening a
bytestream and the UTF8 flag should be off. Fatality is fine, but
perhaps the error should hint about "maybe utf8::encode($string)
first" for people who want to create a character string (e.g. for
testing) and then read :encoding(UTF-8) from that.
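The "encode first" approach could be sketched as follows. The string and variable names are made up for illustration; the calls used (`utf8::encode`, `open` on a scalar reference, the `:encoding(UTF-8)` layer) are standard.

```perl
use strict;
use warnings;

# A character string, for example test input, including one code point
# above 0xFF.
my $chars = "a\x{C5}b\x{125}";

# Encode a copy into UTF-8 octets; the copy is now a plain byte string,
# which is what an in-memory file is meant to hold.
my $octets = $chars;
utf8::encode($octets);

# The layer decodes the octets back into characters while reading.
open my $fh, '<:encoding(UTF-8)', \$octets or die "open failed: $!";
my $read_back = do { local $/; <$fh> };
close $fh;

print "round trip ok\n" if $read_back eq $chars;
printf "%d characters from %d octets\n", length($read_back), length($octets);
```

Encoding a copy keeps the original character string untouched, so the test can still compare against it afterwards.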

People who expect to edit the string in the middle of reading or
writing are asking for trouble and we should note that. (Not unlike,
say, having multiple handles editing a file on disk.)

David

--
David Golden <xdg@​xdg.me>
Take back your inbox! → http​://www.bunchmail.com/
Twitter/IRC​: @​xdg

p5pRT commented Feb 8, 2013

From @tonycoz

From the discussion here and in #p5p, I don't plan to make any
changes to the behaviour of PerlIO​::scalar from where it is in
blead.

Of course, this change in behaviour doesn't directly address the
behaviour originally requested - that the :utf8 layer mark the
output scalar as containing UTF-8, and at least one respondent[1]
thought there was no harm in it.

But I believe the original request was incorrect, for pretty much
the same reason others expressed - files contain bytes, and our
scalar mirror of a file should behave the same.

Unless someone objects, I'll close this ticket in the next couple
of days.

[1] https://rt-archive.perl.org/perl5/Ticket/Display.html?id=109828#txn-1084484
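For the write side, the "files contain bytes" model can be illustrated with the character from the TODO test at the top of this ticket. This is a sketch of the behaviour as described here, not a statement about every release: the scalar ends up holding encoded octets, and decoding is left as an explicit step for the caller.

```perl
use strict;
use warnings;

my $string = "\x{ffe}";    # the character used in the ticket's TODO test
my $out    = '';

open my $fh, '>:encoding(UTF-8)', \$out or die "open failed: $!";
print {$fh} $string;
close $fh or die "close failed: $!";

# $out now holds the UTF-8 octets, just as a file on disk would.
printf "octets: %s\n", unpack('H*', $out);

# Getting characters back requires an explicit decode of a copy.
my $decoded = $out;
utf8::decode($decoded) or die "not valid UTF-8";
print "decoded copy matches original\n" if $decoded eq $string;
```

This is why `$string eq $out` in the TODO test does not hold: one side is a character, the other is its three-octet encoding.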

p5pRT commented Feb 14, 2013

From @tonycoz

On Fri Feb 08 01​:20​:52 2013, tonyc wrote​:

But I believe the original request was incorrect, for pretty much
the same reason others expressed - files contain bytes, and our
scalar mirror of a file should behave the same.

Unless someone objects, I'll close this ticket in the next couple
of days.

And hence closed.

Tony

@p5pRT p5pRT closed this as completed Feb 14, 2013

p5pRT commented Feb 14, 2013

@tonycoz - Status changed from 'open' to 'resolved'
