PerlIO::scalar does not handle UTF-8 #11938
From @dgl
If a UTF-8 output layer is specified the resulting scalar does not have the UTF-8 flag on. I think this makes sense for output, although there may be other ramifications. Here's a todo test:
diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t
index a02107b..59b65ad 100644
--- a/ext/PerlIO-scalar/t/scalar.t
+++ b/ext/PerlIO-scalar/t/scalar.t
@@ -16,7 +16,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); # Not 0, 1, 2 everywhere.
$| = 1;
-use Test::More tests => 79;
+use Test::More tests => 80;
my $fh;
my $var = "aaa\n";
@@ -360,3 +360,11 @@ SKIP: {
ok has_trailing_nul $memfile,
'write appends null when growing string after seek past end';
}
+
+# [perl #xxxx]
+{
+ local $TODO = "UTF-8 support";
+ my $string = "\x{ffe}";
+ open my $fh, "> :encoding(UTF-8)", \(my $out);
+ ok $string eq $out;
+} |
From @Leont
On Sat, Feb 4, 2012 at 6:10 PM, David Leadbeater
PerlIO does bytes, always. Its utf8 support is literally a one bit The other direction would seem to be more important. When opening a Leon |
The RT System itself - Status changed from 'new' to 'open' |
From tchrist@perl.com
David Leadbeater (via RT) <perlbug-followup@perl.org> wrote
Why don't you use an assigned Unicode code point there, please?
Why are you involving the Encode module? Why isn't that simply: open(my $fh, "> :utf8", \my
I absolutely gave up on this. It was too unreliable. Even if you are --tom |
From @ikegami
On Sat, Feb 4, 2012 at 12:10 PM, David Leadbeater <perlbug-followup@perl.org
Files can only contain bytes. This makes no sense to me. - Eric |
From @ikegami
On Sat, Feb 4, 2012 at 5:49 PM, Eric Brine <ikegami@adaelis.com> wrote:
... especially since you specially ask for encode whatever you print. And your patch is buggy: You forgot to actually print to $fh. |
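For illustration, a version of the todo test that actually prints to the handle might look like the sketch below. It is not the patch from the ticket: the code point U+263A is chosen here only because it is assigned, and the sketch assumes the expected content of the scalar is the UTF-8 encoded bytes rather than the original character string.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $string = "\x{263A}";   # one character, an assigned code point
my $out    = '';

# The encoding layer turns characters into UTF-8 bytes on the way out.
open my $fh, '>:encoding(UTF-8)', \$out or die "open failed: $!";
print {$fh} $string;
close $fh or die "close failed: $!";

# $out holds bytes, not characters, so it is decoded before comparing.
my $round_trip = decode('UTF-8', $out);
printf "bytes written: %d\n", length $out;
print $round_trip eq $string ? "round trip ok\n" : "round trip differs\n";
```

Under this reading, a test of the form `ok $string eq $out` would compare characters with bytes, which is the mismatch the replies above object to.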
From @xdg
On Sat, Feb 4, 2012 at 12:10 PM, David Leadbeater
I think that one should expect PerlIO::scalar to provide a black box If the referenced string already has the utf8 flag set, I think it's David |
From @Tux
On Sat, 4 Feb 2012 18:55:27 +0100, Leon Timmermans <fawaka@gmail.com>
including pragmas? use open OUT => "encoding(utf16)"; -- |
From @nwc10
On Sat, Feb 04, 2012 at 08:12:30PM -0500, David Golden wrote:
Whoa. I don't think you mean "has the utf8 flag set". That's how 5.6.0 I *think* it needs to be strictly bytes-only (just like any real file handle) Nicholas Clark |
From @nwc10
On Sat, Feb 04, 2012 at 06:55:27PM +0100, Leon Timmermans wrote:
No, as I replied elsewhere, I think it should refuse to open any scalar Or, at least, the user's code needs to be different to say "I want to open a Having open for reading have some sort of "did they mean characters or bytes? Nicholas Clark |
From @Leont
On Mon, Feb 6, 2012 at 12:05 PM, Nicholas Clark <nick@ccl4.org> wrote:
Yeah, that is a good point, how about making things explicit? E.G Leon |
From @xdg
On Mon, Feb 6, 2012 at 9:17 AM, Leon Timmermans <fawaka@gmail.com> wrote:
Isn't that just C<open my $fh, "+<:utf8", \my $scalar>? If you *know* that you have UTF-8 characters in a string, it's not Or is the question about what Perl should do about returning bytes Assume that Perl's internal character encoding is a black box. Maybe Then when something wants to use that string as a source of bytes, I don't like (a) or (c). (b) is tempting. (Coincidentally, it's -- David |
From @nwc10
On Mon, Feb 06, 2012 at 10:18:39AM -0500, David Golden wrote:
That was what I thought the question was.
Agree
How do you know that the string contains characters?
Nicholas Clark |
From @xdg
On Mon, Feb 6, 2012 at 10:24 AM, Nicholas Clark <nick@ccl4.org> wrote:
Which "you" do you mean? The user? How does a user know that *any* None of that is the interpreter's concern. -- David |
From @Leont
On Mon, Feb 6, 2012 at 4:18 PM, David Golden <xdaveg@gmail.com> wrote:
Not at all. :utf8 means «assume the bytestream is utf8 encoded». It
Yeah, pretty much.
(a) Is what we're doing right now, and I think it's just plain wrong, Leon |
From @nwc10
On Mon, Feb 06, 2012 at 11:02:36AM -0500, David Golden wrote:
OK, which means that the interpreter can't *do* option (b) (or (a) for that On Mon, Feb 06, 2012 at 03:24:04PM +0000, Nicholas Clark wrote:
because you've just stated that the interpreter can't make a determination Nicholas Clark |
From @ikegami
On Mon, Feb 6, 2012 at 10:18 AM, David Golden <xdaveg@gmail.com> wrote:
No, that means "decode the input on read". The question is about a buffer |
From @xdg
On Mon, Feb 6, 2012 at 11:09 AM, Nicholas Clark <nick@ccl4.org> wrote:
You're right. I was being imprecise. I think if the string contains The whole idea is to provide an in-memory abstraction of a *file*, David |
From @cpansprout
On Mon Feb 06 07:19:37 2012, xdaveg@gmail.com wrote:
(a) is what Perl currently does, as Leon Timmermans said. By (b) I presume you mean to treat \xff as \xff regardless of how it is But what happens if I open a reading handle to a scalar containing An in-memory scalar could be considered a byte stream. Or it could just The latter does make some sense. If I print \xff to an in-memory file That it is currently buggy is not being questioned. But which model -- Father Chrysostomos |
From @xdg
On Sun, Feb 12, 2012 at 5:02 PM, Father Chrysostomos via RT
Sort of. What I meant is that (a) is "whatever we do" and (b) is "a For example, for a string with wide characters used as an in-memory
My bias is strongly that it should be a byte-stream, which is why I'm -- David |
From @ap
* Nicholas Clark <nick@ccl4.org> [2012-02-06 12:00]:
This. Thank you. I was despairing as I read the thread, waiting for As far as the user is concerned, there is never to be any difference
What it should do on input is treat each character as a byte, throwing Regards, |
From @Tux
On Mon, 6 Feb 2012 10:58:02 +0000, Nicholas Clark <nick@ccl4.org> wrote:
Personally, I see no harm in doing a decode on close when opened for --8<--- binmode STDOUT, ":utf8"; my $data = ""; { open my $fh, ">:encoding(utf-8)", \$data; { open my $fh, "<:encoding(utf-8)", \$data; print $data; { open my $fh, "<:encoding(utf-8)", \$data; { use open OUT => ":encoding(utf-8)"; { use open IN => ":encoding(utf-8)"; print $data; { use open IN => ":encoding(utf-8)"; $ perl test.pl -- |
From @ikegami
On Sun, Feb 12, 2012 at 5:02 PM, Father Chrysostomos via RT <
And the following test will detect regressions once it's fixed. ===== use strict; use Test::More tests => 1; sub read_from_scalar { sub hexify { join ' ', map sprintf('%02X', ord), split //, $_[0] } { 1; |
From @ribasushi
Is there any word on this issue? I just hit this bug in reverse[1] and Cheers [1] http://www.perlmonks.org/?node_id=1010601 |
From @Leont
On Fri, Dec 28, 2012 at 8:34 AM, Peter Rabbitson <rabbit+p5p@rabbit.us> wrote:
The process kind of fizzled somewhere. I'm in favor of a warning in Leon |
From @Leont
0001-Warn-when-opening-utf8-string-into-handle.patch
From d486949439c66bd1a6e76af94468c374a58590f8 Mon Sep 17 00:00:00 2001
From: Leon Timmermans <fawaka@gmail.com>
Date: Fri, 28 Dec 2012 16:41:53 +0100
Subject: [PATCH] Warn when opening utf8 string into handle
---
ext/PerlIO-scalar/scalar.pm | 2 +-
ext/PerlIO-scalar/scalar.xs | 2 ++
ext/PerlIO-scalar/t/scalar.t | 11 ++++++++++-
3 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/ext/PerlIO-scalar/scalar.pm b/ext/PerlIO-scalar/scalar.pm
index 813f5e6..64ecc22 100644
--- a/ext/PerlIO-scalar/scalar.pm
+++ b/ext/PerlIO-scalar/scalar.pm
@@ -1,5 +1,5 @@
package PerlIO::scalar;
-our $VERSION = '0.15';
+our $VERSION = '0.16';
require XSLoader;
XSLoader::load();
1;
diff --git a/ext/PerlIO-scalar/scalar.xs b/ext/PerlIO-scalar/scalar.xs
index d7b8828..48dbd32 100644
--- a/ext/PerlIO-scalar/scalar.xs
+++ b/ext/PerlIO-scalar/scalar.xs
@@ -41,6 +41,8 @@ PerlIOScalar_pushed(pTHX_ PerlIO * f, const char *mode, SV * arg,
SvREFCNT_inc(perl_get_sv
(SvPV_nolen(arg), GV_ADD | GV_ADDMULTI));
}
+ if (SvUTF8(s->var) && ckWARN(WARN_UTF8))
+ Perl_warner(aTHX_ packWARN(WARN_UTF8), "Should only map byte strings into in-memory filehandles\n");
}
else {
s->var = newSVpvn("", 0);
diff --git a/ext/PerlIO-scalar/t/scalar.t b/ext/PerlIO-scalar/t/scalar.t
index d255a05..d96e2db 100644
--- a/ext/PerlIO-scalar/t/scalar.t
+++ b/ext/PerlIO-scalar/t/scalar.t
@@ -16,7 +16,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); # Not 0, 1, 2 everywhere.
$| = 1;
-use Test::More tests => 82;
+use Test::More tests => 83;
my $fh;
my $var = "aaa\n";
@@ -384,3 +384,12 @@ SKIP: {
close FILE;
is $content, "Foo-Bar\n", 'duping via >&=';
}
+
+{
+ use warnings;
+ my $content = "\x{100}";
+ my @warnings;
+ local $SIG{__WARN__} = sub { push @warnings, $_[0] };
+ open my $fh, '<', \$content;
+ is($warnings[0], "Should only map byte strings into in-memory filehandles\n", 'Trying to open a character string warns');
+}
--
1.7.6.1
|
From @tonycoz
On Fri Dec 28 07:45:25 2012, LeonT wrote:
It should fail to open. If you open a UTF8 flagged string for append Your patch as written ignores the principle that the SvUTF8() flag only This should also be done for _read() and _write(), since the SV can be There's an unrelated problem that _pushed() checks flags on both arg and I'll take a look at these issues when I get home. Tony |
From @Leont
On Fri, Dec 28, 2012 at 11:06 PM, Tony Cook via RT
I didn't see enough consensus to change it that much, but I would be in favor.
It should just stop peeking and poking into the SV altogether, and use Leon |
From @ribasushi
On Fri, Dec 28, 2012 at 11:16:36PM +0100, Leon Timmermans wrote:
This particular bit risks derailing the simpler yet more urgent bugfix. Cheers |
From @tonycoz
On Fri, Dec 28, 2012 at 11:16:36PM +0100, Leon Timmermans wrote:
I've attached my suggested changes (in several parts), also available Reasons for failing instead of warning: 1) reading - to follow the "SVf_UTF8 is only representation" 2) writing - if the SV is flagged UTF8, and the user of the handle It's possible could be avoided if we always treat the written bytes as As written I think the warning message could be improved and the Tony |
From @Leont
On Fri, Feb 1, 2013 at 8:16 AM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
That's not downgraded at all, it has the utf8 flag.
That abstraction leaks when it comes into contact with PerlIO. No Leon |
From @nwc10
On Fri, Feb 01, 2013 at 03:54:37PM +0100, Leon Timmermans wrote:
I feel that I'm asking a stupid question here, but why/how does it leak? So, here, some code wants to think in terms of using file-like operations Whereas other code wants to think in terms of using file-like operations on And it's the same syntax to open either. Is that the leakage you mean? That by the time the code comes to open the Or have I misunderstood? Nicholas Clark |
From @ap
* Leon Timmermans <fawaka@gmail.com> [2013-02-01 16:00]:
Err. I looked at the output thrice and always saw it clearly absent. Sorry for the noise.
There is no leak: if you stick an `encode('UTF-8', ...)` in there, then This is a matter not of leaky abstraction but of a missing affordance, Regards, |
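The explicit-encode approach mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the sample string and the variable names are invented here, and the in-memory file is treated as holding bytes that a decoding layer then turns back into characters.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text  = "a\x{C5}b";               # three characters: a, U+00C5, b
my $bytes = encode('UTF-8', $text);   # four bytes: 61 C3 85 62

# The scalar handed to open() contains bytes; the layer decodes on read.
open my $in, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
my $read = <$in>;
close $in;

print length($bytes), " bytes in, ", length($read), " characters out\n";
```

With the encode step spelled out, the intent ("this scalar is a byte buffer") no longer depends on how the original string happened to be stored internally.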
From @Leont
On Fri, Feb 1, 2013 at 4:08 PM, Nicholas Clark <nick@ccl4.org> wrote:
Yeah, it's that problem. The old behavior was to leak the internal encoding. The new behavior is to always expose something as Latin-1, even when The third option is to reject any character string. This is obviously A fourth would be for scalar to be more utf8 aware, I'm not sure Leon |
From @ikegami
All that's needed to make it sane is a "Wide character" warning/error when |
From @Leont
On Fri, Feb 1, 2013 at 7:13 PM, Eric Brine <ikegami@adaelis.com> wrote:
But what about the chars 127-255? Leon |
From @ap
* Leon Timmermans <fawaka@gmail.com> [2013-02-01 17:15]:
It only looks like Latin-1, because it’s exposing the U+0000..U+00FF
How was it not? The file contained the bytes C3 85 and a declaration to If you’re saying the decoding is undesirable here then you’re saying This still doesn’t track programmer intent. It’s impossible to tell by ** That is the symmetric alternate to an explicit `encode('UTF-8',...)`
Except of course you cannot implement this in Perl as she is because no But leaky how? Intent would be clear here.
How do you mean? Regards, |
From @Leont
On Sat, Feb 2, 2013 at 5:29 AM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
True, I phrased it poorly. How about: The new behavior is to always
Well, instead of the current encoding you have to know if it can be
I like that approach better too
You mostly answered your own question. You can't get closer than
I meant the :scalar layer Leon |
From @ikegami
On Fri, Feb 1, 2013 at 1:16 PM, Leon Timmermans <fawaka@gmail.com> wrote:
What about them? Char 127 is char 127, no matter how it's stored. |
From @tonycoz
On Wed, Jan 30, 2013 at 03:26:32PM -0800, Karl Williamson via RT wrote:
As others have said, you're expecting $string to behave as if it But the silent change in behaviour (especially this late) could be Some possible changes we could make: a) none - leave it alone and hopefully people will notice their broken b) keep the new behaviour, but only when use feature 'unicode_strings'; is in scope where the "file" is opened, since unicode_strings includes Of course, that's not exactly the same concept as the other behaviour c) keep the new behaviour, but only when use feature 'sane_perlio_scalar'; is in scope when the "file" is opened, since unicode_strings covers 'sane_perlio_scalar' would be part of the 5.20 feature bundle. d) just remove the change and revert to 5.16 behaviour e) revert the change and warn if SvUTF8 is on f) revert the change and fail the open if SvUTF8 is on g) revert the change and provide a deprecation notice, and reapply the and combinations of g) and e) or f). At this point I'd favour c). Tony |
From @ribasushi
On Sat, Feb 02, 2013 at 10:11:30AM +0100, Leon Timmermans wrote:
You could go one step further though, when you are adamant about Cheers [1] https://metacpan.org/source/RIBASUSHI/Devel-PeekPoke-0.03/lib/Devel/PeekPoke/PP.pm#L100 |
From @ribasushi
On Mon, Feb 04, 2013 at 09:50:59PM +1100, Tony Cook wrote:
I vote for a minimum of E), preferably G) or (best case scenario,
That would be rather unfortunate, since it is no different from D) for Cheers |
From @doy
On Mon, Feb 04, 2013 at 11:02:35PM +1100, Peter Rabbitson wrote:
The SVf_UTF8 flag itself should have no effect on the behavior of a -doy |
From @ribasushi
On Mon, Feb 04, 2013 at 09:54:01AM -0600, Jesse Luehrs wrote:
Just to clarify - which part of my answer were you replying to...? Your Cheers |
From @doy
On Tue, Feb 05, 2013 at 04:24:50AM +1100, Peter Rabbitson wrote:
My only point was that I am opposed to E and F. I don't really have much -doy |
From @Leont
On Mon, Feb 4, 2013 at 4:54 PM, Jesse Luehrs <doy@tozt.net> wrote:
I don't think Tony listed any option for which that is really true.
That may be true for strings, but it isn't true for IO. Problem is Leon |
From @doy
On Mon, Feb 04, 2013 at 08:06:08PM +0100, Leon Timmermans wrote:
When is it not true for IO?
$ perl -wE'my $str = "\xce"; utf8::upgrade($str); say $str' | hexdump
$ perl -wE'my $str = "\xce"; say $str' | hexdump
Or am I missing something? -doy |
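The comparison in the two one-liners above can also be made inside a single script. The sketch below is an assumption-laden restatement, not code from the ticket: `bytes_written` is a hypothetical helper, and it captures output through an in-memory handle with an explicit `:raw` layer instead of piping to `hexdump`.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bytes that print() emits for a string
# when the handle has no encoding layer.
sub bytes_written {
    my ($str) = @_;
    my $buf = '';
    open my $fh, '>:raw', \$buf or die "open failed: $!";
    print {$fh} $str;
    close $fh or die "close failed: $!";
    return $buf;
}

my $plain    = "\xce";
my $upgraded = "\xce";
utf8::upgrade($upgraded);    # changes the internal storage only

printf "same string: %s\n", $plain eq $upgraded ? 'yes' : 'no';
printf "same output: %s\n",
    bytes_written($plain) eq bytes_written($upgraded) ? 'yes' : 'no';
```

Both strings hold the single character 0xCE, so output through a layer-free handle is expected to be the same one byte whichever internal representation is in use.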
From @Leont
On Mon, Feb 4, 2013 at 8:19 PM, Jesse Luehrs <doy@tozt.net> wrote:
It's true on the inside, but not on the outside (or at least it gives The problem here is that the string is used as the outside. Leon |
From @Leont
On Mon, Feb 4, 2013 at 8:41 PM, Leon Timmermans <fawaka@gmail.com> wrote:
Better said, any read after «open my $fh, '<', \$scalar» leaks the Leon |
From @doy
On Mon, Feb 04, 2013 at 09:03:43PM +0100, Leon Timmermans wrote:
Sure, but that leak is a bug (the bug that is the topic of this ticket). -doy |
From @khwilliamson
On 02/04/2013 03:50 AM, Tony Cook wrote:
A slight tangent: I did not originate this example; it came from another ticket My mental model of how things should work bought this without thinking. Looking just now at the documentation for 'use utf8', I don't see |
From @nwc10
On Mon, Feb 04, 2013 at 02:17:53PM -0600, Jesse Luehrs wrote:
Not sure if the word "leak" (in the sense of information leak) is the right In that, to my mind, the design problem is that there are two possible 1) treat scalar as "the outside", in which case it has to be in octets, to 2) treat scalar as "inside", in which case it has to be in characters, to be Both make sense. But we have an internal representation which is ambiguous between "octets
Given that the model we're trying to converge on is that SvUTF8() Really, we need two versions of the syntax. Nicholas Clark * And if that's not the case, why am I allowed to concatenate a string having |
From @Leont
On Tue, Feb 5, 2013 at 2:47 PM, Nicholas Clark <nick@ccl4.org> wrote:
Agreed.
open my $fh, '<:scalar(characters)', \$buffer ? Leon |
From @rjbs
* Karl Williamson via RT <perlbug-followup@perl.org> [2013-01-30T18:26:32]
Okay, I have not read the whole huge thread that has sprung up, so forgive me When we say 'open' we have to act like we're opening a bytestream, and that the At any rate, the thing we *can* say about strings is that they are sequences of If someone tries to open the string qq[aÅb] with a UTF-8 decoding layer, it If someone tries to open the string qq[aĥb] (three codepoints, the second being Quite possibly I have missed some subtleties. I hope someone will point them -- |
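One way user code could make a bytes-only rule explicit, independent of what a given perl does on `open`, is sketched below. The helper name `open_bytes_for_read` and the sample strings are hypothetical; the only core facility relied on is `utf8::downgrade` with a true second argument, which reports failure instead of dying.

```perl
use strict;
use warnings;

# Hypothetical helper: open a read handle on a copy of $$ref, but only if
# every character fits in one byte. Returns undef otherwise.
sub open_bytes_for_read {
    my ($ref) = @_;
    my $copy = $$ref;

    # With a true second argument, utf8::downgrade returns false instead of
    # dying when the string contains code points above 0xFF.
    return unless utf8::downgrade($copy, 1);

    open my $fh, '<', \$copy or return;
    return $fh;
}

my $narrow = "a\x{C5}b";     # all code points <= 0xFF
my $wide   = "a\x{125}b";    # U+0125 cannot be stored in one byte

my $fh_narrow = open_bytes_for_read(\$narrow);
my $fh_wide   = open_bytes_for_read(\$wide);

print "narrow: ", (defined $fh_narrow ? "opened" : "refused"), "\n";
print "wide: ",   (defined $fh_wide   ? "opened" : "refused"), "\n";
```

Working on a copy keeps the caller's scalar untouched; whether refusing, warning, or something else is the right policy is exactly what the thread debates, so the refusal here is a design choice of the sketch.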
From @xdg
On Tue, Feb 5, 2013 at 6:52 PM, Ricardo Signes
+1 To the fullest extent possible, opening a scalar should be opening a People who expect to edit the string in the middle of reading or David -- |
From @tonycoz
From the discussion here and in #p5p, I don't plan to make any Of course, this change in behaviour doesn't directly address the But I believe the original request was incorrect, for pretty much Unless someone objects, I'll close this ticket in the next couple [1] https://rt-archive.perl.org/perl5/Ticket/Display.html?id=109828#txn-1084484 |
From @tonycoz
On Fri Feb 08 01:20:52 2013, tonyc wrote:
And hence closed. Tony |
@tonycoz - Status changed from 'open' to 'resolved' |
Migrated from rt.perl.org#109828 (status was 'resolved')
Searchable as RT109828