Report information
Id: 109828
Status: resolved
Priority: 0/
Queue: perl5

Owner: tonyc <tony [at] develop-help.com>
Requestors: dgl <dgl [at] dgl.cx>
Cc:
AdminCc:

Operating System: (no value)
PatchStatus: (no value)
Severity: low
Type: unknown
Perl Version: (no value)
Fixed In: (no value)

Attachments
0001-TODO-tests-for-opening-upgraded-scalars.patch
0001-Warn-when-opening-utf8-string-into-handle.patch
0002-fail-to-open-scalars-containing-characters-that-don-.patch
0002-Suggestion-for-rewording-message-and-pod-for-PerlIO-.patch
0003-document-the-new-warning.patch
0004-bump-PerlIO-scalar-s-version.patch
0005-TODO-tests-for-reads-from-a-scalar-changed-to-upgrad.patch
0006-handle-reading-from-a-SVf_UTF8-scalar.patch
0007-TODO-tests-for-writing-to-a-SVf_UTF8-scalar.patch
0008-warn-and-fail-on-writes-to-SVf_UTF8-SVs.patch
signature.asc



Subject: PerlIO::scalar does not handle UTF-8
Date: Sat, 4 Feb 2012 18:10:26 +0100
To: perlbug [...] perl.org
From: David Leadbeater <dgl [...] dgl.cx>
If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.

I think this makes sense for output, although there may be other ramifications.

Here's a todo test:

diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t
index a02107b..59b65ad 100644
--- a/ext/PerlIO-scalar/t/scalar.t
+++ b/ext/PerlIO-scalar/t/scalar.t
@@ -16,7 +16,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); # Not 0, 1, 2 everywhere.
 
 $| = 1;
 
-use Test::More tests => 79;
+use Test::More tests => 80;
 
 my $fh;
 my $var = "aaa\n";
@@ -360,3 +360,11 @@ SKIP: {
     ok has_trailing_nul $memfile,
         'write appends null when growing string after seek past end';
 }
+
+# [perl #xxxx]
+{
+  local $TODO = "UTF-8 support";
+  my $string = "\x{ffe}";
+  open my $fh, "> :encoding(UTF-8)", \(my $out);
+  ok $string eq $out;
+}
CC: bugs-bitbucket [...] rt.perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Sat, 4 Feb 2012 18:55:27 +0100
To: perl5-porters [...] perl.org
From: Leon Timmermans <fawaka [...] gmail.com>
On Sat, Feb 4, 2012 at 6:10 PM, David Leadbeater <perlbug-followup@perl.org> wrote:
> If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.
>
> I think this makes sense for output, although there may be other ramifications.
>
> Here's a todo test:
>
> diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t
> index a02107b..59b65ad 100644
> --- a/ext/PerlIO-scalar/t/scalar.t
> +++ b/ext/PerlIO-scalar/t/scalar.t
> @@ -16,7 +16,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); # Not 0, 1, 2 everywhere.
>
>  $| = 1;
>
> -use Test::More tests => 79;
> +use Test::More tests => 80;
>
>  my $fh;
>  my $var = "aaa\n";
> @@ -360,3 +360,11 @@ SKIP: {
>     ok has_trailing_nul $memfile,
>         'write appends null when growing string after seek past end';
>  }
> +
> +# [perl #xxxx]
> +{
> +  local $TODO = "UTF-8 support";
> +  my $string = "\x{ffe}";
> +  open my $fh, "> :encoding(UTF-8)", \(my $out);
> +  ok $string eq $out;
> +}

PerlIO does bytes, always. Its utf8 support is literally a one-bit flag that promises the bytes will be validly encoded utf8. There's no easy way for lower layers to know what the upper layers do with regard to utf8. Nor am I sure that really should trickle down.

The other direction would seem to be more important. When opening a utf8 scalar, it should automatically be a utf8 handle. Anything else is plain buggy and potentially dangerous.

Leon
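
To illustrate the "PerlIO does bytes" point, here is a minimal sketch (not code from the thread) that prints one character through an `:encoding(UTF-8)` layer into an in-memory file and then inspects what the scalar holds. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Write one character (U+20AC, EURO SIGN) through an encoding layer
# into an in-memory file.
my $out = '';
open my $fh, '>:encoding(UTF-8)', \$out or die "open failed: $!";
print {$fh} "\x{20ac}";
close $fh or die "close failed: $!";

# The scalar now holds the three encoded bytes, not one character.
my $byte_count = length $out;
my $hex        = join ' ', map { sprintf '%02X', ord } split //, $out;

# Decoding a copy of the bytes recovers the original character.
my $copy    = $out;
my $decoded = decode('UTF-8', $copy);

printf "%d bytes: %s\n", $byte_count, $hex;    # prints "3 bytes: E2 82 AC"
```

The layer encodes on the way down, so the scalar behaves like a file on disk: to get the character back, the bytes have to be decoded again.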
CC: bugs-bitbucket [...] rt.perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Sat, 04 Feb 2012 11:03:48 -0700
To: perl5-porters [...] perl.org
From: Tom Christiansen <tchrist [...] perl.com>
David Leadbeater (via RT) <perlbug-followup@perl.org> wrote on Sat, 04 Feb 2012 09:10:41 PST:

>+{
>+ local $TODO = "UTF-8 support";
>+ my $string = "\x{ffe}";

Why don't you use an assigned Unicode code point there, please?

>+ open my $fh, "> :encoding(UTF-8)", \(my $out);

Why are you involving the Encode module? Why isn't that simply:

    open(my $fh, "> :utf8", \my $out) || die $!;

>+ ok $string eq $out;
>+}

I absolutely gave up on this. It was too unreliable. Even if you are careful about decoding your string, now and then (about 1 in 10) it gets double-encoded no matter what you do. It is not even deterministic in any fashion I can see to make work.

--tom
CC: bugs-bitbucket [...] rt.perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Sat, 4 Feb 2012 17:49:11 -0500
To: perl5-porters [...] perl.org
From: Eric Brine <ikegami [...] adaelis.com>
On Sat, Feb 4, 2012 at 12:10 PM, David Leadbeater <perlbug-followup@perl.org> wrote:
+# [perl #xxxx]
+{
+  local $TODO = "UTF-8 support";
+  my $string = "\x{ffe}";
+  open my $fh, "> :encoding(UTF-8)", \(my $out);
+  ok $string eq $out;
+}

Files can only contain bytes. This makes no sense to me.

- Eric

CC: bugs-bitbucket [...] rt.perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Sat, 4 Feb 2012 17:52:11 -0500
To: perl5-porters [...] perl.org
From: Eric Brine <ikegami [...] adaelis.com>


On Sat, Feb 4, 2012 at 5:49 PM, Eric Brine <ikegami@adaelis.com> wrote:
On Sat, Feb 4, 2012 at 12:10 PM, David Leadbeater <perlbug-followup@perl.org> wrote:
+# [perl #xxxx]
+{
+  local $TODO = "UTF-8 support";
+  my $string = "\x{ffe}";
+  open my $fh, "> :encoding(UTF-8)", \(my $out);
+  ok $string eq $out;
+}

Files can only contain bytes. This makes no sense to me.
... especially since you specifically ask to encode whatever you print. encode "UTF-8" cannot possibly produce something that contains 0xFFE.

And your patch is buggy: You forgot to actually print to $fh.

Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Sat, 4 Feb 2012 20:12:30 -0500
To: p5p <perl5-porters [...] perl.org>
From: David Golden <xdaveg [...] gmail.com>
On Sat, Feb 4, 2012 at 12:10 PM, David Leadbeater <perlbug-followup@perl.org> wrote:
> If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.

I think that one should expect PerlIO::scalar to provide a black box -- it's an in-memory substitution for bytes on disk with no associated encoding, just like a file on disk has no associated encoding.

If the referenced string already has the utf8 flag set, I think it's sufficient to warn rather than try to guess the correct behavior.

David
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Sun, 5 Feb 2012 15:47:43 +0100
To: perl5-porters [...] perl.org
From: "H.Merijn Brand" <h.m.brand [...] xs4all.nl>
On Sat, 4 Feb 2012 18:55:27 +0100, Leon Timmermans <fawaka@gmail.com> wrote:
> On Sat, Feb 4, 2012 at 6:10 PM, David Leadbeater
> <perlbug-followup@perl.org> wrote:
> > If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.
> >
> > I think this makes sense for output, although there may be other ramifications.
> >
> > Here's a todo test:
> >
> > diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t
> > index a02107b..59b65ad 100644
> > --- a/ext/PerlIO-scalar/t/scalar.t
> > +++ b/ext/PerlIO-scalar/t/scalar.t
> > @@ -16,7 +16,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); # Not 0, 1, 2 everywhere.
> >
> >  $| = 1;
> >
> > -use Test::More tests => 79;
> > +use Test::More tests => 80;
> >
> >  my $fh;
> >  my $var = "aaa\n";
> > @@ -360,3 +360,11 @@ SKIP: {
> >     ok has_trailing_nul $memfile,
> >         'write appends null when growing string after seek past end';
> >  }
> > +
> > +# [perl #xxxx]
> > +{
> > +  local $TODO = "UTF-8 support";
> > +  my $string = "\x{ffe}";
> > +  open my $fh, "> :encoding(UTF-8)", \(my $out);
> > +  ok $string eq $out;
> > +}
>
> PerlIO does bytes, always. Its utf8 support is literally a one-bit
> flag that promises the bytes will be validly encoded utf8. There's no
> easy way for lower layers to know what the upper layers do with regard
> to utf8. Nor am I sure that really should trickle down.
>
> The other direction would seem to be more important. When opening a
> utf8 scalar, it should automatically be a utf8 handle. Anything else
> is plain buggy and potentially dangerous.

Including pragmas?

use open OUT => "encoding(utf16)";
open my $fh, ">", \my $x;
print { $fh } "The \x{20ac} is \x{a71c} again}\n";
close $fh;

-- 
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.14 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 6 Feb 2012 10:58:02 +0000
To: perl5-porters [...] perl.org
From: Nicholas Clark <nick [...] ccl4.org>
On Sat, Feb 04, 2012 at 08:12:30PM -0500, David Golden wrote:
> On Sat, Feb 4, 2012 at 12:10 PM, David Leadbeater
> <perlbug-followup@perl.org> wrote:
> >
> > If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.
>
> I think that one should expect PerlIO::scalar to provide a black box
> -- it's an in-memory substitution for bytes on disk with no associated
> encoding, just like a file on disk has no associated encoding.
>
> If the referenced string already has the utf8 flag set, I think it's
> sufficient to warn rather than try to guess the correct behavior.

Whoa. I don't think you mean "has the utf8 flag set". That's how 5.6.0 would have exposed it. SvUTF8() shouldn't be visible as a proxy for "characters vs bytes" (yes, I know there are still holes in that).

I *think* it needs to be strictly bytes-only (just like any real file handle) and refuse to open an existing string that doesn't meet that constraint. (With the inevitable ambiguity that if you only shove characters in the range 0-255 into your string, you're not going to realise that your code is buggy.)

Nicholas Clark
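
A minimal sketch of why the flag is a poor proxy for "characters vs bytes" (illustrative only, not code from the thread): the same one-character string can be stored either way internally, and Perl-level comparisons cannot tell the two apart.

```perl
use strict;
use warnings;

# Two copies of the same one-character string, U+00E9.
my $s          = chr(0xE9);
my $upgraded   = $s;
my $downgraded = $s;
utf8::upgrade($upgraded);        # stored internally as two UTF-8 bytes
utf8::downgrade($downgraded);    # stored internally as one byte

# At the Perl level the two are the same string...
my $same_string = $upgraded eq $downgraded ? 1 : 0;
my $same_length = length($upgraded) == length($downgraded) ? 1 : 0;

# ...and only the internal flag tells them apart.
my $flag_upgraded   = utf8::is_utf8($upgraded)   ? 1 : 0;
my $flag_downgraded = utf8::is_utf8($downgraded) ? 1 : 0;

printf "eq: %d, same length: %d, flags: %d vs %d\n",
    $same_string, $same_length, $flag_upgraded, $flag_downgraded;
# prints "eq: 1, same length: 1, flags: 1 vs 0"
```

Any behaviour keyed on the flag alone would therefore treat these two equal strings differently.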
CC: perl5-porters [...] perl.org, bugs-bitbucket [...] rt.perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 6 Feb 2012 11:05:38 +0000
To: Leon Timmermans <fawaka [...] gmail.com>
From: Nicholas Clark <nick [...] ccl4.org>
On Sat, Feb 04, 2012 at 06:55:27PM +0100, Leon Timmermans wrote:
> On Sat, Feb 4, 2012 at 6:10 PM, David Leadbeater
> <perlbug-followup@perl.org> wrote:
> > If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.
> >
> > I think this makes sense for output, although there may be other ramifications.

> PerlIO does bytes, always. Its utf8 support is literally a one-bit
> flag that promises the bytes will be validly encoded utf8. There's no
> easy way for lower layers to know what the upper layers do with regard
> to utf8. Nor am I sure that really should trickle down.
>
> The other direction would seem to be more important. When opening a
> utf8 scalar, it should automatically be a utf8 handle. Anything else
> is plain buggy and potentially dangerous.

No, as I replied elsewhere, I think it should refuse to open any scalar that isn't bytes.

Or, at least, the user's code needs to be different to say "I want to open a byte buffer as if it's a file handle" and "I'm expecting characters here". That way allows symmetry between opening for reading and opening for writing.

Having open for reading have some sort of "did they mean characters or bytes? I'll guess for them" ends up with the same mess that unpack is in, whereby it's a runtime decision based *implicitly* on the *parameters* as to whether it's doing a bytes => characters conversion or a characters => characters mapping. Sure, it's not as *bad* as unpack, which can attempt to do both in the same statement, but trying to have open "DWIM" is in the same ball-park of design misfeature.

Nicholas Clark
CC: perl5-porters [...] perl.org, bugs-bitbucket [...] rt.perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 6 Feb 2012 15:17:04 +0100
To: Nicholas Clark <nick [...] ccl4.org>
From: Leon Timmermans <fawaka [...] gmail.com>
On Mon, Feb 6, 2012 at 12:05 PM, Nicholas Clark <nick@ccl4.org> wrote:
> No, as I replied elsewhere, I think it should refuse to open any scalar
> that isn't bytes.
>
> Or, at least, the user's code needs to be different to say "I want to open a
> byte buffer as if it's a file handle" and "I'm expecting characters here".
> That way allows symmetry between opening for reading and opening for writing.
>
> Having open for reading have some sort of "did they mean characters or bytes?
> I'll guess for them" ends up with the same mess that unpack is in, whereby
> it's a runtime decision based *implicitly* on the *parameters* as to whether
> it's doing a bytes => characters conversion or a characters => characters
> mapping. Sure, it's not as *bad* as unpack, which can attempt to do both
> in the same statement, but trying to have open "DWIM" is in the same
> ball-park of design misfeature.

Yeah, that is a good point, how about making things explicit? E.g. «open my $fh, '+<:scalar(utf8)', \my $scalar». I suspect the current PerlIO/PerlIO::scalar can't easily support that though.

Leon
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 6 Feb 2012 10:18:39 -0500
To: p5p <perl5-porters [...] perl.org>
From: David Golden <xdaveg [...] gmail.com>
On Mon, Feb 6, 2012 at 9:17 AM, Leon Timmermans <fawaka@gmail.com> wrote:
> Yeah, that is a good point, how about making things explicit? E.G
> «open my $fh, '+<:scalar(utf8)', \my $scalar». I suspect the current
> PerlIO/PerlIO::scalar can't easily support that though.

Isn't that just C<open my $fh, "+<:utf8", \my $scalar>? If you *know* that you have UTF-8 characters in a string, it's no different than knowing you have UTF-8 characters in a disk file. The *user* needs to be clear what they expect the bytes to be.

Or is the question about what Perl should do about returning bytes from a string that coincidentally happens to be a character string? I.e. how should Perl mimic an on-disk file using its internal string data structure?

Assume that Perl's internal character encoding is a black box. Maybe it's UTF-8, maybe not (maybe it changes in the future). Whatever. It's an internal implementation detail and nothing external should rely on it.

Then when something wants to use that string as a source of bytes, should Perl (a) just dump out whatever bytes it uses internally for its implementation? Or (b) should it convert the internal representation to some standard representation? Or (c) should it blow up?

I don't like (a) or (c). (b) is tempting. (Coincidentally, it's easy, since the internal encoding is utf8.) My naive inclination is to amend the documentation to clarify that the bytes returned are either raw bytes or utf8 encoded if the string already contains characters. And then I'd *still* leave it up to the user to know what's in the "file" (i.e. string) and set the correct encoding layer on it, just as if they were using a disk file.

-- David
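
For comparison, option (b) can already be approximated in user code by making the encoding step explicit. The following is only a sketch under that assumption; `open_chars_as_utf8_file` is a hypothetical helper, not part of PerlIO::scalar, and it says nothing about what the interpreter itself should do.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a character string into a UTF-8 encoded
# in-memory "file" and return a handle that decodes it on read.
sub open_chars_as_utf8_file {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);    # explicit, fixed encoding
    open my $fh, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
    return $fh;
}

my $text = "The \x{20ac} sign\n";
my $fh   = open_chars_as_utf8_file($text);
my $line = <$fh>;
close $fh;

print $line eq $text ? "round trip ok\n" : "round trip failed\n";
# prints "round trip ok"
```

Here the in-memory file holds bytes in a known encoding and the reader names that encoding in the open mode, just as it would for a file on disk.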
CC: p5p <perl5-porters [...] perl.org>
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 6 Feb 2012 15:24:04 +0000
To: David Golden <xdaveg [...] gmail.com>
From: Nicholas Clark <nick [...] ccl4.org>
On Mon, Feb 06, 2012 at 10:18:39AM -0500, David Golden wrote:
> Or is the question about what Perl should do about returning bytes
> from a string that coincidentally happens to be a character string?
> I.e. how should Perl mimic an on-disk file using its internal string
> data structure?

That was what I thought the question was.

> Assume that Perl's internal character encoding is a black box. Maybe
> it's UTF-8, maybe not (maybe it changes in the future). Whatever.
> It's an internal implementation detail and nothing external should
> rely on it.

Agree

> Then when something wants to use that string as a source of bytes,
> should Perl (a) just dump out whatever bytes it uses internally for
> its implementation? Or (b) should it convert the internal
> representation to some standard representation? Or (c) should it blow
> up?
>
> I don't like (a) or (c). (b) is tempting. (Coincidentally, it's
> easy, since the internal encoding is utf8.) My naive inclination is
> to amend the documentation to clarify that the bytes returned are
> either raw bytes or utf8 encoded if the string already contains
> characters. And then I'd *still* leave it up to the user to know

How do you know that the string contains characters?

> what's in the "file" (i.e. string) and set the correct encoding layer
> on it, just as if they were using a disk file.

Nicholas Clark
CC: p5p <perl5-porters [...] perl.org>
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 6 Feb 2012 11:02:36 -0500
To: Nicholas Clark <nick [...] ccl4.org>
From: David Golden <xdaveg [...] gmail.com>
On Mon, Feb 6, 2012 at 10:24 AM, Nicholas Clark <nick@ccl4.org> wrote:
>> I don't like (a) or (c).  (b) is tempting.  (Coincidentally, it's
>> easy, since the internal encoding is utf8.)  My naive inclination is
>> to amend the documentation to clarify that the bytes returned are
>> either raw bytes or utf8 encoded if the string already contains
>> characters.  And then I'd *still* leave it up to the user to know
>
> How do you know that the string contains characters?

Which "you" do you mean? The user? How does a user know that *any* file contains characters? Generally, by knowing what was written there originally or by analyzing the file in some way to guess an encoding, I'd think. (E.g. read it as bytes and then use Encode::Guess?)

None of that is the interpreter's concern.

-- David
CC: p5p <perl5-porters [...] perl.org>
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 6 Feb 2012 17:04:13 +0100
To: David Golden <xdaveg [...] gmail.com>
From: Leon Timmermans <fawaka [...] gmail.com>
On Mon, Feb 6, 2012 at 4:18 PM, David Golden <xdaveg@gmail.com> wrote:
> Isn't that just C<open my $fh, "+<:utf8", \my $scalar>?

Not at all. :utf8 means «assume the bytestream is utf8 encoded». It does not mean «store as characters» (though doing the latter without the former doesn't make sense).

> Or is the question about what Perl should do about returning bytes
> from a string that coincidentally happens to be a character string?
> I.e. how should Perl mimic an on-disk file using its internal string
> data structure?

Yeah, pretty much.

> Then when something wants to use that string as a source of bytes,
> should Perl (a) just dump out whatever bytes it uses internally for
> its implementation?  Or (b) should it convert the internal
> representation to some standard representation?  Or (c) should it blow
> up?

(a) Is what we're doing right now, and I think it's just plain wrong, and possibly dangerous.
(b) Maybe, but for reasons Nicholas explained guesswork may be rather suboptimal
(c) Is sane, unlike (a) and some versions of (b).

Leon
CC: p5p <perl5-porters [...] perl.org>
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 6 Feb 2012 16:09:39 +0000
To: David Golden <xdaveg [...] gmail.com>
From: Nicholas Clark <nick [...] ccl4.org>
On Mon, Feb 06, 2012 at 11:02:36AM -0500, David Golden wrote:
> On Mon, Feb 6, 2012 at 10:24 AM, Nicholas Clark <nick@ccl4.org> wrote:
> >> I don't like (a) or (c).  (b) is tempting.  (Coincidentally, it's
> >> easy, since the internal encoding is utf8.)  My naive inclination is
> >> to amend the documentation to clarify that the bytes returned are
> >> either raw bytes or utf8 encoded if the string already contains
> >> characters.  And then I'd *still* leave it up to the user to know
> >
> > How do you know that the string contains characters?
>
> Which "you" do you mean? The user? How does a user know that *any*
> file contains characters? Generally, by knowing what was written
> there originally or by analyzing the file in some way to guess an
> encoding, I'd think. (E.g. read it as bytes and then use
> Encode::Guess?)
>
> None of that is the interpreter's concern.

OK, which means that the interpreter can't *do* option (b) (or (a) for that matter):

On Mon, Feb 06, 2012 at 03:24:04PM +0000, Nicholas Clark wrote:
> On Mon, Feb 06, 2012 at 10:18:39AM -0500, David Golden wrote:
>
> > Or is the question about what Perl should do about returning bytes
> > from a string that coincidentally happens to be a character string?
> > I.e. how should Perl mimic an on-disk file using its internal string
> > data structure?

> > Then when something wants to use that string as a source of bytes,
> > should Perl (a) just dump out whatever bytes it uses internally for
> > its implementation? Or (b) should it convert the internal
> > representation to some standard representation? Or (c) should it blow
> > up?

because you've just stated that the interpreter can't make a determination as to whether a string contains characters or bytes (for the ambiguous case of a string containing one or more code points in the range 128-255, but no code points outside the range 0-255)

Nicholas Clark
CC: p5p <perl5-porters [...] perl.org>
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 6 Feb 2012 11:27:57 -0500
To: David Golden <xdaveg [...] gmail.com>
From: Eric Brine <ikegami [...] adaelis.com>
On Mon, Feb 6, 2012 at 10:18 AM, David Golden <xdaveg@gmail.com> wrote:
On Mon, Feb 6, 2012 at 9:17 AM, Leon Timmermans <fawaka@gmail.com> wrote:
> Yeah, that is a good point, how about making things explicit? E.G
> «open my $fh, '+<:scalar(utf8)', \my $scalar». I suspect the current
> PerlIO/PerlIO::scalar can't easily support that though.

Isn't that just C<open my $fh, "+<:utf8", \my $scalar>?

No, that means "decode the input on read". The question is about a buffer that contains decoded data, so what's needed is a layer or some such that indicates "the underlying data is already decoded". That's his intent for :scalar(utf8).
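
A minimal sketch of "decode the input on read" (illustrative only; `slurp_scalar` is a hypothetical helper): the same byte buffer is read once with no layer and once through a decoding layer.

```perl
use strict;
use warnings;

# Three bytes: the UTF-8 encoding of U+20AC (EURO SIGN).
my $buffer = "\xE2\x82\xAC";

# Hypothetical helper: read everything from an in-memory file opened
# with the given mode.
sub slurp_scalar {
    my ($mode, $ref) = @_;
    open my $fh, $mode, $ref or die "open failed: $!";
    local $/;
    my $content = <$fh>;
    close $fh;
    return $content;
}

my $raw     = slurp_scalar('<',                 \$buffer);
my $decoded = slurp_scalar('<:encoding(UTF-8)', \$buffer);

printf "no layer: %d characters; with layer: %d character, U+%04X\n",
    length($raw), length($decoded), ord($decoded);
# prints "no layer: 3 characters; with layer: 1 character, U+20AC"
```

The layer describes how to interpret the bytes in the buffer; it does not declare that the buffer already holds decoded characters, which is the case the proposed `:scalar(utf8)` was meant to cover.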

CC: p5p <perl5-porters [...] perl.org>
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 6 Feb 2012 11:38:39 -0500
To: Nicholas Clark <nick [...] ccl4.org>
From: David Golden <xdaveg [...] gmail.com>
On Mon, Feb 6, 2012 at 11:09 AM, Nicholas Clark <nick@ccl4.org> wrote:
> because you've just stated that the interpreter can't make a determination
> as to whether a string contains characters or bytes (for the ambiguous
> case of a string containing one or more code points in the range 128-255,
> but no code points outside the range 0-255)

You're right. I was being imprecise. I think if the string contains no wide characters, it should be "read" by PerlIO::scalar as bytes. If the string does contain wide characters, PerlIO::scalar should either fail or should encode them in some "standard" way and return them as bytes in encoded form.

The whole idea is to provide an in-memory abstraction of a *file*, which means returning a sequence of bytes.

David
RT-Send-CC: perl5-porters [...] perl.org
On Mon Feb 06 07:19:37 2012, xdaveg@gmail.com wrote:
> Then when something wants to use that string as a source of bytes,
> should Perl (a) just dump out whatever bytes it uses internally for
> its implementation? Or (b) should it convert the internal
> representation to some standard representation? Or (c) should it blow
> up?

(a) is what Perl currently does, as Leon Timmermans said.

By (b) I presume you mean to treat \xff as \xff regardless of how it is stored internally, which makes sense.

But what happens if I open a reading handle to a scalar containing \x{100}? Here we have a choice between (b) and (c).

An in-memory scalar could be considered a byte stream. Or it could just be considered a string of characters.

The latter does make some sense. If I print \xff to an in-memory file with no layers applied, I simply get \xff in my scalar. So if I print \x{100}, it would make sense to get \x{100} in my scalar, no?

But if the scalar is considered byte-sized, I should get \x{100} utf8-encoded, accompanied by a wide character warning; and reading a scalar with \x{100} would croak.

That it is currently buggy is not being questioned. But which model should be followed in fixing it is debatable.

Would it be reasonable to implement the byte-sized version for now and upgrade it later?

-- 
Father Chrysostomos
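
The byte-sized model for output can be observed with a short sketch (illustrative only, not code from the thread): printing \x{100} to an in-memory file with no layers applied, while capturing any warning.

```perl
use strict;
use warnings;

my $out = '';
my @warnings;
{
    # Capture warnings instead of letting them reach STDERR.
    local $SIG{__WARN__} = sub { push @warnings, @_ };

    open my $fh, '>', \$out or die "open failed: $!";
    print {$fh} chr(0x100);    # a character that does not fit in one byte
    close $fh or die "close failed: $!";
}

my $hex = join ' ', map { sprintf '%02X', ord } split //, $out;
print "bytes written: $hex\n";
print "warning: $_" for @warnings;
```

As with any handle that has no encoding layer, the character should arrive as its two UTF-8 bytes (C4 80) together with a "Wide character in print" warning, which matches the byte-sized model described above.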
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Sun, 12 Feb 2012 21:29:59 -0500
To: p5p <perl5-porters [...] perl.org>
From: David Golden <xdaveg [...] gmail.com>
On Sun, Feb 12, 2012 at 5:02 PM, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:
> On Mon Feb 06 07:19:37 2012, xdaveg@gmail.com wrote:
>> Then when something wants to use that string as a source of bytes,
>> should Perl (a) just dump out whatever bytes it uses internally for
>> its implementation?  Or (b) should it convert the internal
>> representation to some standard representation?  Or (c) should it blow
>> up?
>
> (a) is what Perl currently does, as Leon Timmermans said.
>
> By (b) I presume you mean to treat \xff as \xff regardless of how it is
> stored internally, which makes sense.

Sort of. What I meant is that (a) is "whatever we do" and (b) is "a specific encoding". Those are likely to be similar, but one is vague and mutable and the other specific and fixed. Such a promise would persist under the usual back-compatibility rules even if we changed the internal representation in the future for some reason. It could also mean that we could choose to give UTF-8 and not "utf8" (i.e. lax, internal encoding) -- and would croak if we can't translate from the internal to UTF-8.

For example, for a string with wide characters used as an in-memory file, we could promise to translate from the internal encoding to UTF-8 when the handle is read. That would make it resemble a disk file encoded in UTF-8, requiring the ":encoding(UTF-8)" flag and so on.

Thus some function that is passed a handle to read shouldn't know or care whether it's an in-memory string or an on-disk file -- though the *programmer* would need to know what encoding they expect to receive given their particular application.

> An in-memory scalar could be considered a byte stream.  Or it could just
> be considered a string of characters.

My bias is strongly that it should be a byte-stream, which is why I'm only considering how we choose to take a string of (wide) characters and make it into a byte stream in some standard way: (a) "whatever" (b) "a promise" and (c) "boom!"

-- David
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 13 Feb 2012 14:13:03 +0100
To: perl5-porters [...] perl.org
From: Aristotle Pagaltzis <pagaltzis [...] gmx.de>
* Nicholas Clark <nick@ccl4.org> [2012-02-06 12:00]:
> On Sat, Feb 04, 2012 at 08:12:30PM -0500, David Golden wrote:
> > If the referenced string already has the utf8 flag set, I think it's
> > sufficient to warn rather than try to guess the correct behavior.
>
> Whoa. I don't think you mean "has the utf8 flag set". That's how 5.6.0
> would have exposed it. SvUTF8() shouldn't be visible as a proxy for
> "characters vs bytes" (yes, I know there are still holes in that).

This. Thank you. I was despairing as I read the thread, waiting for someone to interject with it.

As far as the user is concerned, there is never to be any difference between a string with UTF8 on vs a string with UTF8 off as long as $utf8on eq $utf8off.

> I *think* it needs to be strictly bytes-only (just like any real file
> handle) and refuse to open an existing string that doesn't meet that
> constraint. (With the inevitable ambiguity that if you only shove
> characters in the range 0-255 into your string, you're not going to
> realise that your code is buggy.)

What it should do on input is treat each character as a byte, throwing an error if there are any characters > 0xFF in the string, i.e. the moral equivalent of downgrading the input string and croaking if that fails.

That’s it.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
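
The "downgrade or croak" rule can be sketched in user code as follows. This is only an illustration of the proposed semantics; `open_bytes_or_croak` is a hypothetical helper, not the eventual PerlIO::scalar implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: open a read handle on a copy of $string, treating
# each character as one byte, and die if any character is above 0xFF.
sub open_bytes_or_croak {
    my ($string) = @_;
    my $bytes = $string;
    # With a true second argument, utf8::downgrade returns false instead
    # of dying when the string cannot be represented as bytes.
    utf8::downgrade($bytes, 1)
        or die "in-memory file contains characters above 0xFF\n";
    open my $fh, '<', \$bytes or die "open failed: $!";
    return $fh;
}

# U+00E9 fits in a byte, even when stored upgraded internally.
my $latin1 = chr(0xE9);
utf8::upgrade($latin1);
my $fh   = open_bytes_or_croak($latin1);
my $data = do { local $/; <$fh> };

# U+0100 does not fit, so the helper refuses.
my $ok    = eval { open_bytes_or_croak(chr(0x100)); 1 };
my $error = $ok ? '' : $@;

printf "read %d byte(s), first is 0x%02X; wide string rejected: %s\n",
    length($data), ord($data), $ok ? 'no' : 'yes';
```

Because the check depends only on the characters in the string and not on the internal flag, an upgraded and a downgraded copy of the same string behave identically.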
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 13 Feb 2012 14:32:38 +0100
To: perl5-porters [...] perl.org
From: "H.Merijn Brand" <h.m.brand [...] xs4all.nl>
On Mon, 6 Feb 2012 10:58:02 +0000, Nicholas Clark <nick@ccl4.org> wrote:
> On Sat, Feb 04, 2012 at 08:12:30PM -0500, David Golden wrote:
> > On Sat, Feb 4, 2012 at 12:10 PM, David Leadbeater
> > <perlbug-followup@perl.org> wrote:
> > >
> > > If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on.
> >
> > I think that one should expect PerlIO::scalar to provide a black box
> > -- it's an in-memory substitution for bytes on disk with no associated
> > encoding, just like a file on disk has no associated encoding.
> >
> > If the referenced string already has the utf8 flag set, I think it's
> > sufficient to warn rather than try to guess the correct behavior.
>
> Whoa. I don't think you mean "has the utf8 flag set". That's how 5.6.0
> would have exposed it. SvUTF8() shouldn't be visible as a proxy for
> "characters vs bytes" (yes, I know there are still holes in that).
>
> I *think* it needs to be strictly bytes-only (just like any real file handle)
> and refuse to open an existing string that doesn't meet that constraint.
> (With the inevitable ambiguity that if you only shove characters in the
> range 0-255 into your string, you're not going to realise that your code
> is buggy.)
>
>
> Nicholas Clark

Personally, I see no harm in doing a decode on close when opened for writing as utf-8

--8<---
use v5.12;
use warnings;

binmode STDOUT, ":utf8";

my $data = "";
{
    open my $fh, ">:encoding(utf-8)", \$data;
    print { $fh } "\x{20ac}\n";
    close $fh;
}
{
    open my $fh, "<:encoding(utf-8)", \$data;
    print <$fh>;
    close $fh;
}
print $data;
utf8::decode ($data);
print $data;
{
    open my $fh, "<:encoding(utf-8)", \$data;
    print <$fh>;
    close $fh;
}

{
    use open OUT => ":encoding(utf-8)";
    open my $fh, ">", \$data;
    print { $fh } "\x{20ac}\n";
    close $fh;
}
{
    use open IN => ":encoding(utf-8)";
    open my $fh, "<", \$data;
    print <$fh>;
    close $fh;
}
print $data;
utf8::decode ($data);
print $data;
{
    use open IN => ":encoding(utf-8)";
    open my $fh, "<", \$data;
    print <$fh>;
    close $fh;
}
-->8---

$ perl test.pl
€ ⬠€ € € € € € €

-- 
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.14 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
CC: perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Tue, 14 Feb 2012 12:42:12 -0500
To: perlbug-followup [...] perl.org
From: Eric Brine <ikegami [...] adaelis.com>
On Sun, Feb 12, 2012 at 5:02 PM, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:
That it is currently buggy is not being questioned.

And the following test will detect regressions once it's fixed.

=====

use strict;
use warnings;

use Test::More tests => 1;

sub read_from_scalar {
    my ($file, $perlio) = @_;
    $perlio //= '';
    open my $fh, "<$perlio", \$file or die $!;
    local $/;
    return <$fh>;
}

sub hexify { join ' ', map sprintf('%02X', ord), split //, $_[0] }

{
    my $s = chr(0xE9);
    utf8::upgrade(   my $u = $s );
    utf8::downgrade( my $d = $s );
    is( hexify(read_from_scalar($u)), hexify(read_from_scalar($d)), 'Unicode bug in :scalar read' );
}

1;

Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Fri, 28 Dec 2012 18:34:21 +1100
To: Father Chrysostomos via RT <perlbug-followup [...] perl.org>
From: Peter Rabbitson <rabbit+p5p [...] rabbit.us>
Is there any word on this issue? I just hit this bug in reverse[1] and while there is ample discussion about it being a problem I see the same behavior under current blead. Is there a chance *at least* for a warning to be added so that it lands in 5.18? Cheers [1] http://www.perlmonks.org/?node_id=1010601
CC: perlbug followup <perlbug-followup [...] perl.org>
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Fri, 28 Dec 2012 16:44:13 +0100
To: Peter Rabbitson <rabbit+p5p [...] rabbit.us>
From: Leon Timmermans <fawaka [...] gmail.com>
On Fri, Dec 28, 2012 at 8:34 AM, Peter Rabbitson <rabbit+p5p@rabbit.us> wrote:
> Is there any word on this issue? I just hit this bug in reverse[1] and > while there is ample discussion about it being a problem I see the same > behavior under current blead. Is there a chance *at least* for a warning > to be added so that it lands in 5.18?
The process kind of fizzled somewhere. I'm in favor of a warning in 5.18, see attachment. Leon

Message body is not shown because sender requested not to inline it.

RT-Send-CC: perl5-porters [...] perl.org
On Fri Dec 28 07:45:25 2012, LeonT wrote:
> On Fri, Dec 28, 2012 at 8:34 AM, Peter Rabbitson > <rabbit+p5p@rabbit.us> wrote:
> > Is there any word on this issue? I just hit this bug in reverse[1]
> and
> > while there is ample discussion about it being a problem I see the
> same
> > behavior under current blead. Is there a chance *at least* for a
> warning
> > to be added so that it lands in 5.18?
> > The process kind of fizzled somewhere. I'm in favor of a warning in > 5.18, see attachment.
It should fail to open. If you open a UTF8 flagged string for append and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8 string. Your patch as written ignores the principle that the SvUTF8() flag only controls the internal encoding, not other behaviour. If the SV contains only code point 0xFF or lower we should downgrade it and work with that rather than failing (or producing a warning). This should also be done for _read() and _write(), since the SV can be modified between I/O operations. There's an unrelated problem that _pushed() checks flags on both arg and SvRV(arg) without calling SvGETMAGIC(). I'll take a look at these issues when I get home. Tony
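As a rough illustration of the "downgrade if every code point is 0xFF or lower, otherwise fail" rule described above, the following sketch uses the core utf8::downgrade() function with its fail-ok argument. The helper name as_octets is made up for this example and is not part of PerlIO::scalar; the real layer does this work in C.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PerlIO::scalar): return a byte-string
# copy of $str, or undef if it holds a code point above 0xFF.
sub as_octets {
    my ($str) = @_;
    my $copy = $str;
    # A true second argument makes utf8::downgrade return false on
    # failure instead of dying.
    return utf8::downgrade($copy, 1) ? $copy : undef;
}

my $latin = "caf\x{e9}";
utf8::upgrade($latin);      # UTF8 flag on, but all code points <= 0xFF
my $wide  = "\x{20ac}";     # EURO SIGN, cannot be downgraded

my $ok  = as_octets($latin);
my $bad = as_octets($wide);

printf "latin: %s\n", defined $ok  ? "usable as octets" : "rejected";
printf "wide:  %s\n", defined $bad ? "usable as octets" : "rejected";
```

The flag on $latin plays no part in the outcome; only the presence of a code point above 0xFF does.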
CC: perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Fri, 28 Dec 2012 23:16:36 +0100
To: perlbug-followup [...] perl.org
From: Leon Timmermans <fawaka [...] gmail.com>
On Fri, Dec 28, 2012 at 11:06 PM, Tony Cook via RT <perlbug-followup@perl.org> wrote:
> It should fail to open. If you open a UTF8 flagged string for append > and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8 > string. > > Your patch as written ignores the principle that the SvUTF8() flag only > controls the internal encoding, not other behaviour. If the SV contains > only code point 0xFF or lower we should downgrade it and work with that > rather than failing (or producing a warning).
I didn't see enough consensus to change it that much, but I would be in favor.
> This should also be done for _read() and _write(), since the SV can be > modified between I/O operations. > > There's an unrelated problem that _pushed() checks flags on both arg and > SvRV(arg) without calling SvGETMAGIC().
It should just stop peeking and poking into the SV altogether, and use the proper APIs (sv_insert and friends). For that matter, I sometimes feel like it should be rewritten from scratch to actually make sense. Pretty much all of it is problematic. Leon
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Sun, 30 Dec 2012 01:43:37 +1100
To: perl5-porters [...] perl.org
From: Peter Rabbitson <rabbit+p5p [...] rabbit.us>
On Fri, Dec 28, 2012 at 11:16:36PM +0100, Leon Timmermans wrote:
> On Fri, Dec 28, 2012 at 11:06 PM, Tony Cook via RT > <perlbug-followup@perl.org> wrote:
> > It should fail to open. If you open a UTF8 flagged string for append > > and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8 > > string. > > > > Your patch as written ignores the principle that the SvUTF8() flag only > > controls the internal encoding, not other behaviour. If the SV contains > > only code point 0xFF or lower we should downgrade it and work with that > > rather than failing (or producing a warning).
> > I didn't see enough consensus to change it that much, but I would be in favor. >
> > This should also be done for _read() and _write(), since the SV can be > > modified between I/O operations. > > > > There's an unrelated problem that _pushed() checks flags on both arg and > > SvRV(arg) without calling SvGETMAGIC().
> > It should just stop peeking and poking into the SV altogether, and use > the proper APIs (sv_insert and friends). For that matter, I sometimes > feel like it should be rewritten from scratch to actually make sense. > Pretty much all of it is problematic.
This particular bit risks derailing the simpler yet more urgent bugfix. Focus please ;) Cheers
CC: perlbug-followup [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 31 Dec 2012 19:00:45 +1100
To: perl5-porters [...] perl.org
From: Tony Cook <tony [...] develop-help.com>
On Fri, Dec 28, 2012 at 11:16:36PM +0100, Leon Timmermans wrote:
> On Fri, Dec 28, 2012 at 11:06 PM, Tony Cook via RT > <perlbug-followup@perl.org> wrote:
> > It should fail to open. If you open a UTF8 flagged string for append > > and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8 > > string. > > > > Your patch as written ignores the principle that the SvUTF8() flag only > > controls the internal encoding, not other behaviour. If the SV contains > > only code point 0xFF or lower we should downgrade it and work with that > > rather than failing (or producing a warning).
> > I didn't see enough consensus to change it that much, but I would be in favor. >
> > This should also be done for _read() and _write(), since the SV can be > > modified between I/O operations. > > > > There's an unrelated problem that _pushed() checks flags on both arg and > > SvRV(arg) without calling SvGETMAGIC().
> > It should just stop peeking and poking into the SV altogether, and use > the proper APIs (sv_insert and friends). For that matter, I sometimes > feel like it should be rewritten from scratch to actually make sense. > Pretty much all of it is problematic.
I've attached my suggested changes (in several parts), also available on perl5.git.perl.org/perl.git as tonyc/perlio-scalar-sanity.

Reasons for failing instead of warning:

1) reading - to follow the "SVf_UTF8 is only representation" principle, we'd need to download where possible, so a \xA1 (for example) in the stream is always treated as that byte, but this means we have an inconsistency when the scalar cannot be downgraded - the first bytes of the character sequences "\xA1\x40" and "\xA1\x{101}" would be different.

2) writing - if the SV is flagged UTF8, and the user of the handle doesn't write correct UTF8 data at the correct offsets, the SV will no longer be properly formed utf-8, which I believe we're trying to maintain. One of my tests produced a warning about invalid UTF-8 before the fix was applied.

It's possible this could be avoided if we always treat the written bytes as code points and upgrade them when writing to a UTF8 string, but then we run into a consistency issue vs reading - what happens when a read on a UTF8 string reaches a code point > 0xFF?

As written I think the warning message could be improved and the documentation of the warning could be improved.

Tony
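The inconsistency in point 1 can be seen with the two example strings from the message. This is only a sketch: first_stream_byte is a hypothetical helper that models a reader which downgrades when it can and otherwise exposes the UTF-8 bytes held in an upgraded scalar's buffer.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of PerlIO::scalar: report the first byte a
# "downgrade where possible" reader would see, as two hex digits.
sub first_stream_byte {
    my ($str) = @_;
    my $bytes = $str;
    # If the copy downgrades, the stream is the code points themselves.
    # Otherwise fall back to the UTF-8 encoding, which is what the buffer
    # of an upgraded scalar contains.
    utf8::encode($bytes) unless utf8::downgrade($bytes, 1);
    return sprintf '%02X', ord substr($bytes, 0, 1);
}

my $narrow = first_stream_byte("\xA1\x40");
my $wide   = first_stream_byte("\xA1\x{101}");

print "\\xA1\\x40    starts with byte $narrow\n";
print "\\xA1\\x{101} starts with byte $wide\n";
```

Both strings begin with the same character, yet the first byte seen by the reader differs (A1 versus C2), which is the argument for failing rather than guessing.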

Message body is not shown because sender requested not to inline it.

CC: perlbug-followup [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Thu, 3 Jan 2013 08:31:19 +1100
To: perl5-porters [...] perl.org
From: Tony Cook <tony [...] develop-help.com>
On Mon, Dec 31, 2012 at 07:00:45PM +1100, Tony Cook wrote:
> 1) reading - to follow the "SVf_UTF8 is only representation" > principle, we'd need to *download* where possible, so a \xA1 (for
Urr, downgrade.
> As written I think the warning message could be improved and the > documentation of the warning could be improved.
Suggestions welcome. Tony
CC: perl5-porters [...] perl.org, perlbug-followup [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Sun, 13 Jan 2013 20:37:37 -0700
To: Tony Cook <tony [...] develop-help.com>
From: Karl Williamson <public [...] khwilliamson.com>
On 12/31/2012 01:00 AM, Tony Cook wrote:
> On Fri, Dec 28, 2012 at 11:16:36PM +0100, Leon Timmermans wrote:
>> On Fri, Dec 28, 2012 at 11:06 PM, Tony Cook via RT >> <perlbug-followup@perl.org> wrote:
>>> It should fail to open. If you open a UTF8 flagged string for append >>> and write non-UTF8 bytes you will produce an invalidly encoded SvUTF8 >>> string. >>> >>> Your patch as written ignores the principle that the SvUTF8() flag only >>> controls the internal encoding, not other behaviour. If the SV contains >>> only code point 0xFF or lower we should downgrade it and work with that >>> rather than failing (or producing a warning).
>> >> I didn't see enough consensus to change it that much, but I would be in favor. >>
>>> This should also be done for _read() and _write(), since the SV can be >>> modified between I/O operations. >>> >>> There's an unrelated problem that _pushed() checks flags on both arg and >>> SvRV(arg) without calling SvGETMAGIC().
>> >> It should just stop peeking and poking into the SV altogether, and use >> the proper APIs (sv_insert and friends). For that matter, I sometimes >> feel like it should be rewritten from scratch to actually make sense. >> Pretty much all of it is problematic.
> > I've attached my suggested changes (in several parts), also available > on perl5.git.perl.org/perl.git as tonyc/perlio-scalar-sanity. > > Reasons for failing instead of warning: > > 1) reading - to follow the "SVf_UTF8 is only representation" > principle, we'd need to download where possible, so a \xA1 (for > example) in the stream is always treated as that byte, but this means > we have an inconsistency when the scalar cannot be downgraded - the > first bytes of the character sequences "\xA1\x40" and "\xA1\x{101}" > would be different. > > 2) writing - if the SV is flagged UTF8, and the user of the handle > doesn't write correct UTF8 data at the correct offsets, the SV will no > longer be properly formed utf-8, which I believe we're trying to > maintain. One of my tests produced a warning about invalid UTF-8 > during before the fix was applied. > > It's possible could be avoided if we always treat the written bytes as > code points and upgrade them when writing to a UTF8 string, but then > we run into a consitency issue vs reading - what happens when a read > on a UTF8 string reaches a code point > 0xFF? > > As written I think the warning message could be improved and the > documentation of the warning could be improved. > > Tony >
Attached are some suggestions for wording changes. I've never liked our distinction between bytes and character semantics. It makes no sense to me. Everything is ultimately a byte.

Message body is not shown because sender requested not to inline it.

RT-Send-CC: perl5-porters [...] perl.org
I am in favor of your proposed changes Tony, thanks.
RT-Send-CC: perl5-porters [...] perl.org
Commit 02c3c86bb8fe791df9608437f0844f9a8017e3b6 changed the behavior so one could not successfully open a scalar with code points above 0xFF.

But this test case shows an issue with this:

use utf8;
my $string = qq[aÅb];
my $fh = IO::File->new();
$fh->open(\$string, '<:encoding(UTF-8)');

-- Karl Williamson
RT-Send-CC: perl5-porters [...] perl.org
On Wed Jan 23 19:08:08 2013, rjbs wrote:
> I am in favor of your proposed changes Tony, thanks.
-- Karl Williamson
RT-Send-CC: perl5-porters [...] perl.org
I don't know what I pressed to cause it to send while typing the message, but send it did. So hopefully this will work better.
On Wed Jan 30 15:20:46 2013, khw wrote:
> Commit 02c3c86bb8fe791df9608437f0844f9a8017e3b6 changed the behavior so > one could not successfully open a scalar with code points above 0xFF. > > But this test case shows an issue with this: > > use utf8; > my $string = qq[a�b]; > my $fh = IO::File->new(); > $fh->open(\$string, '<:encoding(UTF-8)');
The problem is that the character in the string (which is showing up incorrectly encoded here, but is a U+00C5) is in Latin 1. Since the string is encodable in Latin1, the open succeeds, while silently downgrading from UTF-8 to Latin1, but the :encoding(UTF-8) doesn't play well with that, with the result that this silently breaks. -- Karl Williamson
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Thu, 31 Jan 2013 08:32:22 +0100
To: perl5-porters [...] perl.org
From: Aristotle Pagaltzis <pagaltzis [...] gmx.de>
* Karl Williamson via RT <perlbug-followup@perl.org> [2013-01-31 00:30]:
> On Wed Jan 30 15:20:46 2013, khw wrote:
> >Commit 02c3c86bb8fe791df9608437f0844f9a8017e3b6 changed the behavior > >so one could not successfully open a scalar with code points above > >0xFF. > > > >But this test case shows an issue with this: > > > >use utf8; > >my $string = qq[aÅb]; > >my $fh = IO::File->new(); > >$fh->open(\$string, '<:encoding(UTF-8)');
> > The problem is that the character in the string (which is showing up > incorrectly encoded here, but is a U+00C5) is in Latin 1. Since the > string is encodable in Latin1, the open succeeds, while silently > downgrading from UTF-8 to Latin1, but the :encoding(UTF-8) doesn't > play well with that, with the result that this silently breaks.
Well the code as written is broken. Whether the utf8 pragma ultimately leaves the string downgraded or upgraded, the code is wrong either way. Whichever is the case, the code needs an `encode("UTF8", ...)` in there somewhere before the `open` in order to be correct. So the fact that this breaks means things are working as designed. That it breaks silently is not so great. But how could that be detected? You could argue for changing the parser to leave literals encoded and with their UTF8 flag on. But that would break other code – granted, only code that is already wrong. But the dictate of backcompat demands to try not to needlessly expose their brokenness if so far it wasn’t. The only way to satisfy both requirements would be if there was a way to mark strings as character strings, independently of whether their UTF8 flag is turned on. Then the utf8 pragma could turn that flag on for all literals, even if it leaves them with UTF8 flags off, and `open` could check for that flag instead of the UTF8 flag. The current scans for codepoints > 0xFF are a proximate facsimile of such a flag – presence of such codepoints is sufficient evidence for the string being a character string. But in the converse case, absence of evidence not being evidence of absence applies. Regards, -- Aristotle Pagaltzis // <http://plasmasturm.org/>
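A minimal sketch of the fix suggested above, assuming the intent of the test case was to read characters back through the encoding layer: encode the character string to octets first, then open the octets. It uses Encode's strict 'UTF-8' name rather than the lax "UTF8" spelling in the message, and opens with the builtin open instead of IO::File.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars  = "a\x{c5}b";               # three characters, middle one U+00C5
my $octets = encode('UTF-8', $chars);  # four octets: 61 C3 85 62

# The in-memory file now holds octets, so the decoding layer has
# something well defined to decode.
open my $fh, '<:encoding(UTF-8)', \$octets or die "open failed: $!";
my $read = do { local $/; <$fh> };
close $fh;

printf "octets in scalar: %d, characters read back: %d\n",
    length($octets), length($read);
```

With the explicit encode, the result no longer depends on whether $chars happened to be stored upgraded or downgraded.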
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Thu, 31 Jan 2013 14:28:01 +0100
To: perl5-porters [...] perl.org
From: Leon Timmermans <fawaka [...] gmail.com>
On Thu, Jan 31, 2013 at 8:32 AM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
> Well the code as written is broken. Whether the utf8 pragma ultimately > leaves the string downgraded or upgraded, the code is wrong either way. > Whichever is the case, the code needs an `encode("UTF8", ...)` in there > somewhere before the `open` in order to be correct. So the fact that > this breaks means things are working as designed. > > That it breaks silently is not so great. > > But how could that be detected?
We can return an error instead of trying to downgrade. It would still break this code, but at least it would do so loudly (and at least it would be sane). Leon
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Fri, 1 Feb 2013 08:16:53 +0100
To: perl5-porters [...] perl.org
From: Aristotle Pagaltzis <pagaltzis [...] gmx.de>
* Leon Timmermans <fawaka@gmail.com> [2013-01-31 14:30]:
> On Thu, Jan 31, 2013 at 8:32 AM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
> > Well the code as written is broken. Whether the utf8 pragma > > ultimately leaves the string downgraded or upgraded, the code is > > wrong either way. Whichever is the case, the code needs an > > `encode("UTF8", ...)` in there somewhere before the `open` in order > > to be correct. So the fact that this breaks means things are working > > as designed. > > > > That it breaks silently is not so great. > > > > But how could that be detected?
> > We can return an error instead of trying to downgrade. It would still > break this code, but at least it would do so loudly (and at least it > would be sane).
1. You can’t. The string is downgraded far earlier, by the parser.

    $ perl -MDevel::Peek -e 'use utf8; Dump qq[aÅb]'
    SV = PV(0x7fad5b801090) at 0x7fad5b8267a8
      REFCNT = 1
      FLAGS = (POK,READONLY,pPOK,UTF8)
      PV = 0x10f10f850 "a\303\205b"\0 [UTF8 "a\x{c5}b"]
      CUR = 4
      LEN = 16

2. It doesn’t matter if the byte value C5 is spelled C5 in the buffer and the UTF8 flag is off, or C3 85 and UTF8 is on – both mean the same thing. If it is downgradable, then it very well should be downgraded and accepted silently. (I just realised some of my previous mail was a red herring, due to this point.)

If you’re opening the string as an octet stream, then you need a string that contains an octet stream. Not characters. Regardless of the UTF8 flag value.

-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Fri, 1 Feb 2013 15:54:37 +0100
To: perl5-porters [...] perl.org
From: Leon Timmermans <fawaka [...] gmail.com>
On Fri, Feb 1, 2013 at 8:16 AM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
> 1. You can’t. The string is downgraded far earlier, by the parser. > > $ perl -MDevel::Peek -e 'use utf8; Dump qq[aÅb]' > SV = PV(0x7fad5b801090) at 0x7fad5b8267a8 > REFCNT = 1 > FLAGS = (POK,READONLY,pPOK,UTF8) > PV = 0x10f10f850 "a\303\205b"\0 [UTF8 "a\x{c5}b"] > CUR = 4 > LEN = 16
That's not downgraded at all, it has the utf8 flag.
> 2. It doesn’t matter if the byte value C5 is spelled C5 in the buffer and the > UTF8 flag is off, or C3 85 and UTF8 is on – both mean the same thing. If > it is downgradable, then it very well should be downgraded and accepted > silently. (I just realised some of my previous mail was a red herring, due > to this point.)
That abstraction leaks when it comes into contact with PerlIO. No getting around that. Question is only: how does it leak Leon
CC: perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Fri, 1 Feb 2013 15:08:25 +0000
To: Leon Timmermans <fawaka [...] gmail.com>
From: Nicholas Clark <nick [...] ccl4.org>
On Fri, Feb 01, 2013 at 03:54:37PM +0100, Leon Timmermans wrote:
> On Fri, Feb 1, 2013 at 8:16 AM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
> > 2. It doesn't matter if the byte value C5 is spelled C5 in the buffer and the > > UTF8 flag is off, or C3 85 and UTF8 is on - both mean the same thing. If > > it is downgradable, then it very well should be downgraded and accepted > > silently. (I just realised some of my previous mail was a red herring, due > > to this point.)
> > That abstraction leaks when it comes into contact with PerlIO. No > getting around that. Question is only: how does it leak
I feel that I'm asking a stupid question here, but why/how does it leak? Is it leaking for the same reason as eval "leaks"? There, source code from disk is in bytes, which needs an encoding layered atop it to map to characters (even if it's a 1:1 mapping). So "obviously", that's what the parser expects. Stuff in the range 0-255, which might be a variable-width encoding. But eval takes strings, and Perl-code has generated strings of *characters* to feed to the parser. Stuff in the range 0-0x1FFFF (ish), abstract representation (as far as Perl-space is concerned) So, here, some code wants to think in terms of using file-like operations on a sequence of octets held in a scalar (which were "obviously" octets because that was what it was dealing with when it assigned to that scalar.) Whereas other code wants to think in terms of using file-like operations on a sequence of characters. (which were "obviously" characters because that was what it was dealing with when it assigned to that scalar.) And it's the same syntax to open either. Is that the leakage you mean? That by the time the code comes to open the scalar, it simply isn't clear whether the scalar is supposed to be holding sequences of octets, or sequences of characters, and so the opening code *can't* get the semantics of the open correct. Or have I misunderstood? Nicholas Clark
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Fri, 1 Feb 2013 16:41:14 +0100
To: perl5-porters [...] perl.org
From: Aristotle Pagaltzis <pagaltzis [...] gmx.de>
* Leon Timmermans <fawaka@gmail.com> [2013-02-01 16:00]:
> On Fri, Feb 1, 2013 at 8:16 AM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
> >1. You can’t. The string is downgraded far earlier, by the parser. > > > > $ perl -MDevel::Peek -e 'use utf8; Dump qq[aÅb]' > > SV = PV(0x7fad5b801090) at 0x7fad5b8267a8 > > REFCNT = 1 > > FLAGS = (POK,READONLY,pPOK,UTF8) > > PV = 0x10f10f850 "a\303\205b"\0 [UTF8 "a\x{c5}b"] > > CUR = 4 > > LEN = 16
> > That's not downgraded at all, it has the utf8 flag.
Err. I looked at the output thrice and always saw it clearly absent. This is where I would blame lack of coffee, were I a coffee drinker. Sorry for the noise.
> >2. It doesn’t matter if the byte value C5 is spelled C5 in the buffer > > and the UTF8 flag is off, or C3 85 and UTF8 is on – both mean the > > same thing. If it is downgradable, then it very well should be > > downgraded and accepted silently. (I just realised some of my > > previous mail was a red herring, due to this point.)
> > That abstraction leaks when it comes into contact with PerlIO. No > getting around that. Question is only: how does it leak
There is no leak: if you stick an `encode('UTF-8', ...)` in there, then the code will be correct, regardless of whether the string is downgraded or upgraded. As long as all operations on the string treat it as 3 units long, the middle one of which has the value 0xC5, it is water-tight. This is a matter not of leaky abstraction but of a missing affordance, which leaves the programmer to carefully keep track of intent by hand with no aid from the interpreter. Regards, -- Aristotle Pagaltzis // <http://plasmasturm.org/>
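The claim that the string stays "3 units long, the middle one of which has the value 0xC5" whichever way it is stored can be checked from Perl space. A small sketch using only core utf8:: functions; the variable names are just for illustration:

```perl
use strict;
use warnings;

my $down = "a\x{c5}b";
utf8::downgrade($down);    # one byte per character, UTF8 flag off
my $up = $down;
utf8::upgrade($up);        # C5 stored internally as C3 85, UTF8 flag on

# Perl-level operations see the same string either way.
my $same = ($up eq $down)
    && length($up) == 3 && length($down) == 3
    && ord(substr($up, 1, 1))   == 0xC5
    && ord(substr($down, 1, 1)) == 0xC5;

my $flags_differ = utf8::is_utf8($up) && !utf8::is_utf8($down);

print "same string:  ", ($same ? "yes" : "no"), "\n";
print "flags differ: ", ($flags_differ ? "yes" : "no"), "\n";
```

Only code that looks at the internal buffer, as the :scalar layer did before the fix, can tell the two apart.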
CC: perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Fri, 1 Feb 2013 17:12:41 +0100
To: Nicholas Clark <nick [...] ccl4.org>
From: Leon Timmermans <fawaka [...] gmail.com>
On Fri, Feb 1, 2013 at 4:08 PM, Nicholas Clark <nick@ccl4.org> wrote:
> On Fri, Feb 01, 2013 at 03:54:37PM +0100, Leon Timmermans wrote:
>> On Fri, Feb 1, 2013 at 8:16 AM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
>
>> > 2. It doesn't matter if the byte value C5 is spelled C5 in the buffer and the >> > UTF8 flag is off, or C3 85 and UTF8 is on - both mean the same thing. If >> > it is downgradable, then it very well should be downgraded and accepted >> > silently. (I just realised some of my previous mail was a red herring, due >> > to this point.)
>> >> That abstraction leaks when it comes into contact with PerlIO. No >> getting around that. Question is only: how does it leak
> > I feel that I'm asking a stupid question here, but why/how does it leak? > Is it leaking for the same reason as eval "leaks"? There, source code from > disk is in bytes, which needs an encoding layered atop it to map to > characters (even if it's a 1:1 mapping). So "obviously", that's what the > parser expects. Stuff in the range 0-255, which might be a variable-width > encoding. But eval takes strings, and Perl-code has generated strings > of *characters* to feed to the parser. Stuff in the range 0-0x1FFFF (ish), > abstract representation (as far as Perl-space is concerned) > > > So, here, some code wants to think in terms of using file-like operations > on a sequence of octets held in a scalar (which were "obviously" octets > because that was what it was dealing with when it assigned to that scalar.) > > Whereas other code wants to think in terms of using file-like operations on > a sequence of characters. (which were "obviously" characters because that > was what it was dealing with when it assigned to that scalar.) > > And it's the same syntax to open either. > > Is that the leakage you mean? That by the time the code comes to open the > scalar, it simply isn't clear whether the scalar is supposed to be holding > sequences of octets, or sequences of characters, and so the opening code > *can't* get the semantics of the open correct. > > Or have I misunderstood? > > Nicholas Clark
Yeah, it's that problem. The old behavior was to leak the internal encoding. The new behavior is to always expose something as Latin-1, even when it originally wasn't, which I consider leaky too. The third option is to reject any character string. This is obviously leaky, but I'd still consider it the sanest of these methods. A fourth would be for scalar to be more utf8 aware, I'm not sure that's a good idea conceptually though. Leon
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Fri, 1 Feb 2013 13:13:13 -0500
To: perl5-porters [...] perl.org
From: Eric Brine <ikegami [...] adaelis.com>
All that's needed to make it sane is a "Wide character" warning/error when the input has chars>255 to remind users they are missing an encode().
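The check being proposed amounts to asking whether any character in the scalar is above 255. A sketch of that test in Perl space, with has_wide_chars as a made-up name; the real check would live in the layer's C code:

```perl
use strict;
use warnings;

# Hypothetical check: 1 if $str holds any code point above 0xFF, else 0.
# The character class works on code points, so the answer does not
# depend on the UTF8 flag.
sub has_wide_chars {
    my ($str) = @_;
    return $str =~ /[^\x00-\xFF]/ ? 1 : 0;
}

my $euro    = "\x{20ac}";
my $e_acute = "\xE9";
my $e_up    = $e_acute;
utf8::upgrade($e_up);

printf "U+20AC:             %d\n", has_wide_chars($euro);
printf "U+00E9 (downgraded): %d\n", has_wide_chars($e_acute);
printf "U+00E9 (upgraded):   %d\n", has_wide_chars($e_up);
```

Such a check only catches strings that cannot possibly be octets; it says nothing about strings whose characters all fit in one byte.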

CC: perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Fri, 1 Feb 2013 19:16:17 +0100
To: Eric Brine <ikegami [...] adaelis.com>
From: Leon Timmermans <fawaka [...] gmail.com>
On Fri, Feb 1, 2013 at 7:13 PM, Eric Brine <ikegami@adaelis.com> wrote:
> All that's needed to make it sane is a "Wide character" warning/error when > the input has chars>255 to remind users they are missing an encode().
But what about the chars 127-255? Leon
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Sat, 2 Feb 2013 05:29:05 +0100
To: perl5-porters [...] perl.org
From: Aristotle Pagaltzis <pagaltzis [...] gmx.de>
* Leon Timmermans <fawaka@gmail.com> [2013-02-01 17:15]:
> The new behavior is to always expose something as Latin-1
It only looks like Latin-1, because it’s exposing the U+0000..U+00FF range of Unicode characters as faux bytes, and that happens to coincide with Latin-1. But you can’t tell they are characters rather than bytes, because there is no way to mark the string as containing one or the other.
> even when it originally wasn't, which I consider leaky too.
How was it not? The file contained the bytes C3 85 and a declaration to `use utf8`. Taken together those mean that the bytes in the quote-like literal are to automatically be decoded to a single string element with ordinal value C5. Which is what happens.

If you’re saying the decoding is undesirable here then you’re saying that the programmer forgot to say `do { no utf8; qq[aÅb] }` so he would get the undecoded UTF-8 bytes.**

This still doesn’t track programmer intent. It’s impossible to tell by looking at either the undecoded or decoded strings whether the intent was for them to be octet or character strings.

** That is the symmetric alternate to an explicit `encode('UTF-8',...)` somewhere. But note that the encoding declared to the `open` is then dependent on the encoding of the source file. With explicit `encode`, only the encodings given to the `encode` and `open` must be identical but the source file encoding becomes irrelevant. Therefore I like that approach better.
> The third option is to reject any character string. This is obviously > leaky, but I'd still consider it the sanest of these methods.
Except of course you cannot implement this in Perl as she is because no way of marking strings as characters exists. But leaky how? Intent would be clear here.
> A fourth would be for scalar to be more utf8 aware, I'm not sure > that's a good idea conceptually though.
How do you mean? Regards, -- Aristotle Pagaltzis // <http://plasmasturm.org/>
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Sat, 2 Feb 2013 10:11:30 +0100
To: perl5-porters [...] perl.org
From: Leon Timmermans <fawaka [...] gmail.com>
On Sat, Feb 2, 2013 at 5:29 AM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
> * Leon Timmermans <fawaka@gmail.com> [2013-02-01 17:15]:
>> The new behavior is to always expose something as Latin-1
>
> It only looks like Latin-1, because it’s exposing the U+0000..U+00FF range of Unicode characters as faux bytes, and that happens to coincide with Latin-1. But you can’t tell they are characters rather than bytes, because there is no way to mark the string as containing one or the other.

True, I phrased it poorly. How about: the new behavior is to always expose character strings as Latin-1, or to fail if that's impossible.

>> even when it originally wasn't, which I consider leaky too.
>
> How was it not? The file contained the bytes C3 85 and a declaration to `use utf8`. Taken together those mean that the bytes in the quote-like literal are to automatically be decoded to a single string element with ordinal value C5. Which is what happens.

Well, instead of the current encoding you have to know if it can be encoded as Latin-1. I don't think that's really better.

> ** That is the symmetric alternate to an explicit `encode('UTF-8',...)` somewhere. But note that the encoding declared to the `open` is then dependent on the encoding of the source file. With explicit `encode`, only the encodings given to the `encode` and `open` must be identical but the source file encoding becomes irrelevant. Therefore I like that approach better.

I like that approach better too.
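The explicit `encode` approach both posters prefer can be sketched as follows. This is only an illustration under stated assumptions, not code from the ticket: the variable names are made up, and the `\x{C5}` escape stands in for the literal so that the source file encoding plays no part.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Character string: a, U+00C5, b. Written with an escape so the
# encoding of this source file is irrelevant.
my $chars = "a\x{C5}b";

# Make the octets explicit. The encoding named here must match the
# one given to open() below.
my $octets = encode('UTF-8', $chars);

# The in-memory file now holds UTF-8 octets, which is what the
# :encoding(UTF-8) layer expects to decode.
open my $fh, '<:encoding(UTF-8)', \$octets or die "open failed: $!";
my $line = <$fh>;
close $fh;

printf "octets in buffer: %d, characters read back: %d\n",
    length($octets), length($line);
print "round trip ", ($line eq $chars ? "ok" : "FAILED"), "\n";
```

The point of the design is that only two encodings have to agree, the one passed to `encode` and the one passed to `open`.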
>> The third option is to reject any character string. This is obviously leaky, but I'd still consider it the sanest of these methods.
>
> Except of course you cannot implement this in Perl as she is because no way of marking strings as characters exists.
>
> But leaky how? Intent would be clear here.

You mostly answered your own question. You can't get closer than reject utf-8 strings and assume anything else is a bytestring.

>> A fourth would be for scalar to be more utf8 aware, I'm not sure that's a good idea conceptually though.
>
> How do you mean?

I meant the :scalar layer.

Leon
CC: perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Sun, 3 Feb 2013 08:51:08 -0500
To: Leon Timmermans <fawaka [...] gmail.com>
From: Eric Brine <ikegami [...] adaelis.com>
On Fri, Feb 1, 2013 at 1:16 PM, Leon Timmermans <fawaka@gmail.com> wrote:
> On Fri, Feb 1, 2013 at 7:13 PM, Eric Brine <ikegami@adaelis.com> wrote:
> > All that's needed to make it sane is a "Wide character" warning/error when
> > the input has chars>255 to remind users they are missing an encode().
>
> But what about the chars 127-255?

What about them? Char 127 is char 127, no matter how it's stored.

CC: perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 4 Feb 2013 21:50:59 +1100
To: Karl Williamson via RT <perlbug-followup [...] perl.org>
From: Tony Cook <tony [...] develop-help.com>
On Wed, Jan 30, 2013 at 03:26:32PM -0800, Karl Williamson via RT wrote:
> I don't know what I pressed to cause it to send while typing the message, but send it did. So hopefully this will work better.
>
> On Wed Jan 30 15:20:46 2013, khw wrote:
> > Commit 02c3c86bb8fe791df9608437f0844f9a8017e3b6 changed the behavior so one could not successfully open a scalar with code points above 0xFF.
> >
> > But this test case shows an issue with this:
> >
> > use utf8;
> > my $string = qq[a�b];
> > my $fh = IO::File->new();
> > $fh->open(\$string, '<:encoding(UTF-8)');
>
> The problem is that the character in the string (which is showing up incorrectly encoded here, but is a U+00C5) is in Latin 1. Since the string is encodable in Latin1, the open succeeds, while silently downgrading from UTF-8 to Latin1, but the :encoding(UTF-8) doesn't play well with that, with the result that this silently breaks.

As others have said, you're expecting $string to behave as if it contains more than \x61\xC5\x62, which it doesn't.

But the silent change in behaviour (especially this late) could be dangerous.

Some possible changes we could make:

a) none - leave it alone and hopefully people will notice their broken code and fix it.

b) keep the new behaviour, but only when

  use feature 'unicode_strings';

is in scope where the "file" is opened, since unicode_strings includes related behaviour - treating \xC5 as that code-point whether the string is SVf_UTF8 flagged or not.

Of course, that's not exactly the same concept as the other behaviour changes for unicode_string, and would end up enabled wherever

c) keep the new behaviour, but only when

  use feature 'sane_perlio_scalar';

is in scope when the "file" is opened, since unicode_strings covers different types of behaviour and has been the default under use <someversion> since 5.12.

'sane_perlio_scalar' would be part of the 5.20 feature bundle.

d) just remove the change and revert to 5.16 behaviour

e) revert the change and warn if SvUTF8 is on

f) revert the change and fail the open if SvUTF8 is on

g) revert the change and provide a deprecation notice, and reapply the change immediately after 5.18 is released.

and combinations of g) and e) or f).

At this point I'd favour c).

Tony
CC: perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 4 Feb 2013 22:57:02 +1100
To: Leon Timmermans <fawaka [...] gmail.com>
From: Peter Rabbitson <rabbit-p5p [...] rabbit.us>
On Sat, Feb 02, 2013 at 10:11:30AM +0100, Leon Timmermans wrote:
> On Sat, Feb 2, 2013 at 5:29 AM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
> >
> > Except of course you cannot implement this in Perl as she is because no way of marking strings as characters exists.
> >
> > But leaky how? Intent would be clear here.
>
> You mostly answered your own question. You can't get closer than reject utf-8 strings and assume anything else is a bytestring.

You could go one step further though, when you are adamant about operating in bytes, as seen in this if-block: [1]

Cheers

[1] https://metacpan.org/source/RIBASUSHI/Devel-PeekPoke-0.03/lib/Devel/PeekPoke/PP.pm#L100
CC: Karl Williamson via RT <perlbug-followup [...] perl.org>, perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 4 Feb 2013 23:02:35 +1100
To: Tony Cook <tony [...] develop-help.com>
From: Peter Rabbitson <rabbit-p5p [...] rabbit.us>
On Mon, Feb 04, 2013 at 09:50:59PM +1100, Tony Cook wrote:
> As others have said, you're expecting $string to behave as if it contains more than \x61\xC5\x62, which it doesn't.
>
> But the silent change in behaviour (especially this late) could be dangerous.
>
> Some possible changes we could make:
>
> a) none - leave it alone and hopefully people will notice their broken code and fix it.
>
> b) keep the new behaviour, but only when
>
>   use feature 'unicode_strings';
>
> is in scope where the "file" is opened, since unicode_strings includes related behaviour - treating \xC5 as that code-point whether the string is SVf_UTF8 flagged or not.
>
> Of course, that's not exactly the same concept as the other behaviour changes for unicode_string, and would end up enabled wherever
>
> c) keep the new behaviour, but only when
>
>   use feature 'sane_perlio_scalar';
>
> is in scope when the "file" is opened, since unicode_strings covers different types of behaviour and has been the default under use <someversion> since 5.12.
>
> 'sane_perlio_scalar' would be part of the 5.20 feature bundle.
>
> d) just remove the change and revert to 5.16 behaviour
>
> e) revert the change and warn if SvUTF8 is on
>
> f) revert the change and fail the open if SvUTF8 is on
>
> g) revert the change and provide a deprecation notice, and reapply the change immediately after 5.18 is released.
>
> and combinations of g) and e) or f).
I vote for a minimum of E), preferably G) or (best case scenario, closest to fulfilling the least surprise principle) F)

> At this point I'd favour c).

That would be rather unfortunate, since it is no different from D) for the overwhelmingly vast majority of perl programmers in the field.

Cheers
CC: Tony Cook <tony [...] develop-help.com>, Karl Williamson via RT <perlbug-followup [...] perl.org>, perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 4 Feb 2013 09:54:01 -0600
To: Peter Rabbitson <rabbit-p5p [...] rabbit.us>
From: Jesse Luehrs <doy [...] tozt.net>
On Mon, Feb 04, 2013 at 11:02:35PM +1100, Peter Rabbitson wrote:
> I vote for a minimum of E), preferably G) or (best case scenario, closest to fulfilling the least surprise principle) F)
>
> > At this point I'd favour c).
>
> That would be rather unfortunate, since it is no different from D) for the overwhelmingly vast majority of perl programmers in the field.

The SVf_UTF8 flag itself should have no effect on the behavior of a program (assuming it results in the string having an equivalent list of logical characters). It is entirely an implementation detail, and relying on it for anything other than that is broken. It is true that people do do this, but it is wrong, and not something we should encourage, as it just leads to more confusion about how Unicode things work.

-doy
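The claim that the flag is representation rather than meaning can be checked with a small script. This is a minimal sketch with made-up variable names; it uses only the built-in `utf8::upgrade` and `utf8::is_utf8` functions to show that two strings can be equal character for character while differing in the SVf_UTF8 flag.

```perl
use strict;
use warnings;

my $bytes    = "a\xC5b";     # stored one byte per character
my $upgraded = $bytes;
utf8::upgrade($upgraded);    # same characters, now stored internally as UTF-8

# From Perl space the two strings are indistinguishable...
print "equal:         ", ($bytes eq $upgraded ? "yes" : "no"), "\n";
print "same length:   ", (length($bytes) == length($upgraded) ? "yes" : "no"), "\n";

# ...and only the internal flag differs.
print "flag bytes:    ", (utf8::is_utf8($bytes)    ? 1 : 0), "\n";
print "flag upgraded: ", (utf8::is_utf8($upgraded) ? 1 : 0), "\n";
```

Any behaviour keyed on the flag alone would therefore treat these two equal strings differently.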
CC: Tony Cook <tony [...] develop-help.com>, Karl Williamson via RT <perlbug-followup [...] perl.org>, perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Tue, 5 Feb 2013 04:24:50 +1100
To: Jesse Luehrs <doy [...] tozt.net>
From: Peter Rabbitson <rabbit-p5p [...] rabbit.us>
On Mon, Feb 04, 2013 at 09:54:01AM -0600, Jesse Luehrs wrote:
> The SVf_UTF8 flag itself should have no effect on the behavior of a program (assuming it results in the string having an equivalent list of logical characters). It is entirely an implementation detail, and relying on it for anything other than that is broken. It is true that people do do this, but it is wrong, and not something we should encourage, as it just leads to more confusion about how Unicode things work.

Just to clarify - which part of my answer were you replying to...? Your reply reads like you share my view that option D) is a bad idea. Yet you seem to support C). Do you find them different as far as the experience of most users is concerned?

Cheers
CC: Tony Cook <tony [...] develop-help.com>, Karl Williamson via RT <perlbug-followup [...] perl.org>, perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 4 Feb 2013 11:48:27 -0600
To: Peter Rabbitson <rabbit-p5p [...] rabbit.us>
From: Jesse Luehrs <doy [...] tozt.net>
On Tue, Feb 05, 2013 at 04:24:50AM +1100, Peter Rabbitson wrote:
> Just to clarify - which part of my answer were you replying to...? Your reply reads like you share my view that option D) is a bad idea. Yet you seem to support C). Do you find them different as far as the experience of most users is concerned?

My only point was that I am opposed to E and F. I don't really have much of an opinion otherwise.

-doy
CC: Peter Rabbitson <rabbit-p5p [...] rabbit.us>, Tony Cook <tony [...] develop-help.com>, Karl Williamson via RT <perlbug-followup [...] perl.org>, perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 4 Feb 2013 20:06:08 +0100
To: Jesse Luehrs <doy [...] tozt.net>
From: Leon Timmermans <fawaka [...] gmail.com>
On Mon, Feb 4, 2013 at 4:54 PM, Jesse Luehrs <doy@tozt.net> wrote:
> The SVf_UTF8 flag itself should have no effect on the behavior of a program (assuming it results in the string having an equivalent list of logical characters).

I don't think Tony listed any option for which that is really true. Some leak on open, and others leak on reading/writing to them. Some leak depending on content. Some leak differently in different circumstances. But all of them leak. *All*.

> It is entirely an implementation detail, and relying on it for anything other than that is broken. It is true that people do do this, but it is wrong, and not something we should encourage, as it just leads to more confusion about how Unicode things work.

That may be true for strings, but it isn't true for IO. The problem is that those two worlds interact inside-out here.

Leon
CC: Peter Rabbitson <rabbit-p5p [...] rabbit.us>, Tony Cook <tony [...] develop-help.com>, Karl Williamson via RT <perlbug-followup [...] perl.org>, perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 4 Feb 2013 13:19:21 -0600
To: Leon Timmermans <fawaka [...] gmail.com>
From: Jesse Luehrs <doy [...] tozt.net>
On Mon, Feb 04, 2013 at 08:06:08PM +0100, Leon Timmermans wrote:
> On Mon, Feb 4, 2013 at 4:54 PM, Jesse Luehrs <doy@tozt.net> wrote:
> > The SVf_UTF8 flag itself should have no effect on the behavior of a program (assuming it results in the string having an equivalent list of logical characters).
>
> I don't think Tony listed any option for which that is really true. Some leak on open, and others leak on reading/writing to them. Some leak depending on content. Some leak differently in different circumstances. But all of them leak. *All*.
>
> > It is entirely an implementation detail, and relying on it for anything other than that is broken. It is true that people do do this, but it is wrong, and not something we should encourage, as it just leads to more confusion about how Unicode things work.
>
> That may be true for strings, but it isn't true for IO. Problem is that those two worlds interact here inside-out here.

When is it not true for IO?

  $ perl -wE'my $str = "\xce"; utf8::upgrade($str); say $str' | hexdump
  0000000 0ace
  0000002
  $ perl -wE'my $str = "\xce"; say $str' | hexdump
  0000000 0ace
  0000002

Or am I missing something?

-doy
CC: Peter Rabbitson <rabbit-p5p [...] rabbit.us>, Tony Cook <tony [...] develop-help.com>, Karl Williamson via RT <perlbug-followup [...] perl.org>, perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 4 Feb 2013 20:41:46 +0100
To: Jesse Luehrs <doy [...] tozt.net>
From: Leon Timmermans <fawaka [...] gmail.com>
On Mon, Feb 4, 2013 at 8:19 PM, Jesse Luehrs <doy@tozt.net> wrote:
> When is it not true for IO?

It's true on the inside, but not on the outside (or at least it gives a warning when that expectation is broken).

The problem here is that the string is used as the outside.

Leon
CC: Peter Rabbitson <rabbit-p5p [...] rabbit.us>, Tony Cook <tony [...] develop-help.com>, Karl Williamson via RT <perlbug-followup [...] perl.org>, perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 4 Feb 2013 21:03:43 +0100
To: Jesse Luehrs <doy [...] tozt.net>
From: Leon Timmermans <fawaka [...] gmail.com>
On Mon, Feb 4, 2013 at 8:41 PM, Leon Timmermans <fawaka@gmail.com> wrote:
> It's true on the inside, but not on the outside (or at least it gives a warning when that expectation is broken).
>
> The problem here is that the string is used as the outside.

Better said, any read after «open my $fh, '<', \$scalar» leaks the representation of $scalar. Always. The previous behavior was to expose the byte values of either representation. The current behavior is to force Latin-1, and fail if that isn't possible. Both are leaky in their own way.

Leon
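Which of the two behaviours a given perl exhibits can be probed with a short script. This is a sketch only: `read_in_memory` is a hypothetical helper, and no expected output is shown because the result for the upgraded and wide strings depends on the perl version, which is exactly the difference described above.

```perl
use strict;
use warnings;

# Hypothetical helper: open a copy of $string as an in-memory file,
# slurp it through a plain '<' handle, and return what was read.
# Returns undef if the open itself fails.
sub read_in_memory {
    my ($string) = @_;
    open my $fh, '<', \$string or return;
    local $/;                  # slurp mode
    my $data = <$fh>;
    close $fh;
    return $data;
}

my $plain    = "a\xC5b";
my $upgraded = $plain;
utf8::upgrade($upgraded);      # same characters, different storage

for my $case (['plain', $plain], ['upgraded', $upgraded], ['wide', "a\x{100}b"]) {
    my ($label, $string) = @$case;
    my $got = read_in_memory($string);
    printf "%-9s => %s\n", $label,
        defined $got
            ? join(' ', map { sprintf '%02X', ord } split //, $got)
            : 'open failed';
}
```

If the 'plain' and 'upgraded' lines differ, the perl in use exposes the internal representation; if the 'wide' line reports a failed open, it refuses strings that cannot be expressed as bytes.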
CC: Peter Rabbitson <rabbit-p5p [...] rabbit.us>, Tony Cook <tony [...] develop-help.com>, Karl Williamson via RT <perlbug-followup [...] perl.org>, perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 4 Feb 2013 14:17:53 -0600
To: Leon Timmermans <fawaka [...] gmail.com>
From: Jesse Luehrs <doy [...] tozt.net>
On Mon, Feb 04, 2013 at 09:03:43PM +0100, Leon Timmermans wrote:
> On Mon, Feb 4, 2013 at 8:41 PM, Leon Timmermans <fawaka@gmail.com> wrote:
> > It's true on the inside, but not on the outside (or at least it gives a warning when that expectation is broken).
> >
> > The problem here is that the string is used as the outside.
>
> Better said, any read after «open my $fh, '<', \$scalar» leaks the representation of $scalar. Always. The previous behavior was to expose the bytevalue of either. The current behavior is to force Latin-1, and fail if that isn't possible. Both leaky in their own way.

Sure, but that leak is a bug (the bug that is the topic of this ticket). I'm just talking about what the correct behavior should be, not what the current implementation does. Looking at the SVf_UTF8 flag is not going to be correct regardless of the situation, because you can end up with two strings that look identical from perl space, but do different things when opened. It might make certain cases do the right thing, but it's never going to be 100% correct. Only allowing codepoints less than 256 is probably the only reasonable option here.

-doy
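A check for "only codepoints less than 256" that ignores the flag can be written in Perl space. The sketch below is an illustration, not the PerlIO::scalar implementation; `fits_in_octets` is a hypothetical name, and it relies on `utf8::downgrade` with a true second argument returning false instead of dying.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every code point in $string is below 256,
# however perl happens to store the string internally.
sub fits_in_octets {
    my ($string) = @_;   # a copy, since utf8::downgrade works in place
    return utf8::downgrade($string, 1) ? 1 : 0;
}

my $upgraded = "a\xC5b";
utf8::upgrade($upgraded);   # flag on, but still only code points below 256

for my $case (['plain', "a\xC5b"], ['upgraded', $upgraded], ['wide', "a\x{100}b"]) {
    my ($label, $string) = @$case;
    printf "%-9s %s\n", $label,
        fits_in_octets($string) ? 'fits in octets' : 'does not fit';
}
```

Because the test depends on content rather than on the flag, the plain and upgraded strings get the same answer and only the string containing U+0100 is rejected.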
CC: Karl Williamson via RT <perlbug-followup [...] perl.org>, perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Mon, 04 Feb 2013 22:16:19 -0700
To: Tony Cook <tony [...] develop-help.com>
From: Karl Williamson <public [...] khwilliamson.com>
On 02/04/2013 03:50 AM, Tony Cook wrote:
> As others have said, you're expecting $string to behave as if it contains more than \x61\xC5\x62, which it doesn't.

A slight tangent: I did not originate this example; it came from another ticket
https://rt.perl.org/rt3/Ticket/Display.html?id=116322
and the relevant line from it reads:

  my $string = qq[aÅb]; # use utf8 makes $string UTF-8.

My mental model of how things should work bought this without thinking. It's clear that that poster also thought it was obvious that things should work the way he thought.

Looking just now at the documentation for 'use utf8', I don't see anything that makes it clear that our mental models (and I bet those of a lot of other people) are wrong. And that tells me that whatever the outcome of #109828, utf8.pm's pod needs to be significantly clarified.
CC: Leon Timmermans <fawaka [...] gmail.com>, Peter Rabbitson <rabbit-p5p [...] rabbit.us>, Tony Cook <tony [...] develop-help.com>, Karl Williamson via RT <perlbug-followup [...] perl.org>, perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Tue, 5 Feb 2013 13:47:16 +0000
To: Jesse Luehrs <doy [...] tozt.net>
From: Nicholas Clark <nick [...] ccl4.org>
On Mon, Feb 04, 2013 at 02:17:53PM -0600, Jesse Luehrs wrote:
> On Mon, Feb 04, 2013 at 09:03:43PM +0100, Leon Timmermans wrote:
> > On Mon, Feb 4, 2013 at 8:41 PM, Leon Timmermans <fawaka@gmail.com> wrote:
> > > It's true on the inside, but not on the outside (or at least it gives a warning when that expectation is broken).
> > >
> > > The problem here is that the string is used as the outside.
> >
> > Better said, any read after «open my $fh, '<', \$scalar» leaks the representation of $scalar. Always. The previous behavior was to expose the bytevalue of either. The current behavior is to force Latin-1, and fail if that isn't possible. Both leaky in their own way.

Not sure if the word "leak" (in the sense of information leak) is the right word here.

In that, to my mind, the design problem is that there are two possible (conflicting) actions the programmer wanted to mean by «open my $fh, '<', \$scalar», which the interpreter has no way of distinguishing.

1) treat scalar as "the outside", in which case it has to be in octets, to be meaningful. (And the programmer knows that he/she needs to deal with encoding issues, as this is a boundary where that is necessary)

2) treat scalar as "inside", in which case it has to be in characters, to be meaningful.

Both make sense.

But we have an internal representation which is ambiguous between "octets from the outside" and "characters from the inside". So we can't tell.

> Sure, but that leak is a bug (the bug that is the topic of this ticket). I'm just talking about what the correct behavior should be, not what the current implementation does. Looking at the SVf_UTF8 flag is not going to be correct regardless of the situation, because you can end up with two strings that look identical from perl space, but do different things when opened. It might make certain cases do the right thing, but it's never going to be 100% correct. Only allowing codepoints less than 256 is probably the only reasonable option here.

Given that the model we're trying to converge on is that SvUTF8() is representation, not semantics*, I think this ends up as the only consistent answer, even though it rules out use case (2).

Really, we need two versions of the syntax. Given that we have polymorphic strings, we need monomorphic operators.

Nicholas Clark

* And if that's not the case, why am I allowed to concatenate a string having SvUTF8() true with one having SvUTF8() false?
CC: Jesse Luehrs <doy [...] tozt.net>, Peter Rabbitson <rabbit-p5p [...] rabbit.us>, Tony Cook <tony [...] develop-help.com>, Karl Williamson via RT <perlbug-followup [...] perl.org>, perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Tue, 5 Feb 2013 17:38:13 +0100
To: Nicholas Clark <nick [...] ccl4.org>
From: Leon Timmermans <fawaka [...] gmail.com>
On Tue, Feb 5, 2013 at 2:47 PM, Nicholas Clark <nick@ccl4.org> wrote:
> Not sure if the word "leak" (in the sense of information leak) is the right
> word here.
>
> In that, to my mind, the design problem is that there are two possible
> (conflicting) actions the programmer wanted to mean by
> «open my $fh, '<', \$scalar», which the interpreter has no way of
> distinguishing.
>
> 1) treat scalar as "the outside", in which case it has to be in octets, to
> be meaningful. (And the programmer knows that he/she needs to deal with
> encoding issues, as this is a boundary where that is necessary)
>
> 2) treat scalar as "inside", in which case it has to be in characters, to be
> meaningful.
>
> Both make sense.
>
> But we have an internal representation which is ambiguous between "octets
> from the outside" and "characters from the inside". So we can't tell.

Agreed.

>> Sure, but that leak is a bug (the bug that is the topic of this ticket).
>> I'm just talking about what the correct behavior should be, not what the
>> current implementation does. Looking at the SVf_UTF8 flag is not going
>> to be correct regardless of the situation, because you can end up with
>> two strings that look identical from perl space, but do different things
>> when opened. It might make certain cases do the right thing, but it's
>> never going to be 100% correct. Only allowing codepoints less than 256
>> is probably the only reasonable option here.
>
>
> Given that the model we're trying to converge on is that SvUTF8()
> is representation, not semantics*, I think this ends up as the only
> consistent answer, even though it rules out use case (2)
>
> Really, we need two versions of the syntax.
> Given that we have polymorphic strings, we need monomorphic operators.

open my $fh, '<:scalar(characters)', \$buffer ?

Leon
CC: perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Tue, 5 Feb 2013 18:52:01 -0500
To: Karl Williamson via RT <perlbug-followup [...] perl.org>
From: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
* Karl Williamson via RT <perlbug-followup@perl.org> [2013-01-30T18:26:32]
> > use utf8;
> > my $string = qq[aÅb];
> > my $fh = IO::File->new();
> > $fh->open(\$string, '<:encoding(UTF-8)');
>
> The problem is that the character in the string (which is showing up
> incorrectly encoded here, but is a U+00C5) is in Latin 1. Since the
> string is encodable in Latin1, the open succeeds, while silently
> downgrading from UTF-8 to Latin1, but the :encoding(UTF-8) doesn't play
> well with that, with the result that this silently breaks.

Okay, I have not read the whole huge thread that has sprung up, so forgive me if I'm replaying it.

The downgrade seems to be the error here. When we say 'open' we have to act like we're opening a bytestream, and that the layers mediate the byte/maybe-character boundary. Saying "open my $fh, \$charstring" is like saying "5 + q{hello}" -- we're using a byteish operator on a textish string. Too bad scalar types are too lax, in this case, huh?

At any rate, the thing we *can* say about strings is that they are sequences of non-negative integer values. If we use them in a byteish context, we should treat them like bytes. The fact that the byte \xC5 might be represented in memory by two bytes with a flag on the SV saying "decode this before giving the value back" is at a different level of abstraction.

If someone tries to open the string qq[aÅb] with a UTF-8 decoding layer, it should not be readable, because the three values to be read from the stream are not valid UTF-8. Barf at decode time.

If someone tries to open the string qq[aĥb] (three codepoints, the second being 0x0125) we can probably fail at open time, which is, I believe, the intent of Tony's patch. We could issue a wide character warning, instead, but I think that this is excessively permissive.

Quite possibly I have missed some subtleties. I hope someone will point them out, if so.

--
rjbs
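The open-time failure described above can be probed with a small script. This is only a sketch: the outcome depends on whether the perl in use includes the change discussed in this ticket (it was in blead at the time of this thread), so the script reports what happens rather than assuming a result. Older perls are expected to open the scalar successfully. The name `$wide` is chosen for the example, and the `utf8` warning category used to silence the accompanying warning is an assumption.

```perl
use strict;
use warnings;

# Three code points; the second (U+0125) does not fit in a byte.
my $wide = "a\x{125}b";

my ($ok, $err);
{
    # Assumption: the warning issued for this case lives in the 'utf8'
    # category. It is silenced here because the outcome is reported below.
    no warnings 'utf8';
    $ok  = open(my $fh, '<', \$wide);
    $err = "$!";    # capture the error text before anything can reset it
}

print $ok ? "open succeeded (older behaviour)\n"
          : "open failed: $err\n";
```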
CC: Karl Williamson via RT <perlbug-followup [...] perl.org>, perl5-porters [...] perl.org
Subject: Re: [perl #109828] PerlIO::scalar does not handle UTF-8
Date: Wed, 6 Feb 2013 13:10:23 -0500
To: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
From: David Golden <xdg [...] xdg.me>
On Tue, Feb 5, 2013 at 6:52 PM, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>
> If someone tries to open the string qq[aÅb] with a UTF-8 decoding layer, it
> should not be readable, because the three values to be read from the stream are
> not valid UTF-8. Barf at decode time.
>
> If someone tries to open the string qq[aĥb] (three codepoints, the second being
> 0x0125) we can probably fail at open time, which is, I believe, the intent of
> Tony's patch. We could issue a wide character warning, instead, but I think
> that this is excessively permissive.

+1

To the fullest extent possible, opening a scalar should be opening a bytestream and the UTF8 flag should be off. Fatality is fine, but perhaps the error should hint about "maybe utf8::encode($string) first" for people who want to create a character string (e.g. for testing) and then read :encoding(UTF-8) from that.

People who expect to edit the string in the middle of reading or writing are asking for trouble and we should note that. (Not unlike, say, having multiple handles editing a file on disk.)

David

--
David Golden <xdg@xdg.me>
Take back your inbox! → http://www.bunchmail.com/
Twitter/IRC: @xdg
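The "utf8::encode($string) first" suggestion might look like the following in practice. This is a minimal sketch for the testing use case mentioned above; the variable names are invented for the example. The character string is encoded to octets before being opened, and the :encoding(UTF-8) layer turns those octets back into characters on read:

```perl
use strict;
use warnings;

my $chars = "a\x{125}b";     # three characters, one above 0xFF
my $bytes = $chars;
utf8::encode($bytes);        # in place: now four UTF-8 octets, flag off

# The scalar now plays the role of a file: it holds bytes, and the
# layer mediates the byte/character boundary.
open(my $fh, '<:encoding(UTF-8)', \$bytes) or die "open failed: $!";
my $line = <$fh>;
close $fh;

print "octets in buffer: ", length($bytes), "\n";   # 4
print "characters read:  ", length($line),  "\n";   # 3
print "round trip ok\n" if $line eq $chars;
```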
RT-Send-CC: perl5-porters [...] perl.org
From the discussion here and in #p5p, I don't plan to make any changes to the behaviour of PerlIO::scalar from where it is in blead.

Of course, this change in behaviour doesn't directly address the behaviour originally requested - that the :utf8 layer mark the output scalar as containing UTF-8, and at least one respondent[1] thought there was no harm in it.

But I believe the original request was incorrect, for pretty much the same reason others expressed - files contain bytes, and our scalar mirror of a file should behave the same.

Unless someone objects, I'll close this ticket in the next couple of days.

[1] https://rt.perl.org/rt3/Ticket/Display.html?id=109828#txn-1084484
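To make the "files contain bytes" conclusion concrete, here is a sketch based on the test case from the original report (the explicit decode step and the hex dump are additions for illustration). Writing through an :encoding(UTF-8) layer leaves encoded octets in the scalar, exactly as it would in a file on disk, and the caller decodes them to get characters back:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $string = "\x{ffe}";      # one character, as in the original report

# Write the character through an encoding layer into an in-memory "file".
open(my $fh, '>:encoding(UTF-8)', \(my $out)) or die "open failed: $!";
print {$fh} $string;
close $fh;

# The scalar mirrors a file: it holds the three UTF-8 octets, flag off.
print "octets: ", join(' ', map { sprintf '%02X', ord } split //, $out), "\n"; # E0 BF BE
print "flag:   ", (utf8::is_utf8($out) ? 1 : 0), "\n";                         # 0

# Decoding, as one would after reading a file in raw mode, restores the character.
my $decoded = decode('UTF-8', $out);
print "decoded matches\n" if $decoded eq $string;
```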
RT-Send-CC: perl5-porters [...] perl.org
On Fri Feb 08 01:20:52 2013, tonyc wrote:
> But I believe the original request was incorrect, for pretty much
> the same reason others expressed - files contain bytes, and our
> scalar mirror of a file should behave the same.
>
> Unless someone objects, I'll close this ticket in the next couple
> of days.

And hence closed.

Tony

