Perl leaves broken UTF-8 in SVs whose UTF8 is set #11671
Comments
From tchrist@perl.com
Remembering how setting $/ to an int ref can cause Perl to erroneously leave
% perl -C0 -le 'print "\xC0\x81"' | perl -CS -nle 'printf "U+%v04X\n", $_'
% perl -C0 -le 'print "\xC1\x81"' | perl -CS -nle 'print for length, defined, ord'
Surely this is an error?? We are actually storing invalid UTF-8
% perl -C0 -le 'print "\xC1\x81"' | perl -MDevel::Peek -CS -nle 'Dump($_)'
% perl -C0 -le 'print "bad\xC1\x81stuff"' | perl -MDevel::Peek -CS -nle 'Dump($_)'
% perl -C0 -le 'print "bad\xC1\x88stuff"' | perl -MDevel::Peek -CS -nle 'Dump($_)'
The UTF8 flag is on, but that is not UTF8. I can't see how this isn't a bug, but am willing to be enlightened.
--tom
Summary of my perl5 (revision 5 version 14 subversion 0) configuration:
Characteristics of this binary (from libperl):
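For comparison, here is a minimal sketch of what a strict check on the same byte sequences could look like, using the Encode module's strict "UTF-8" encoding with `FB_CROAK`. The helper name `is_strict_utf8` and the fourth sample are made up for illustration and appear nowhere in the ticket; the sketch says nothing about what the :utf8 layer itself does.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $bytes decodes cleanly under Encode's strict
# "UTF-8" encoding, which rejects overlong forms such as C0 81.
sub is_strict_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# The first three samples are the byte strings from the report above;
# the last one is an assumed well-formed control ("caf" plus C3 A9).
for my $sample ("\xC0\x81", "\xC1\x81", "bad\xC1\x88stuff", "caf\xC3\xA9") {
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $sample;
    printf "%-36s %s\n", $hex, is_strict_utf8($sample) ? 'well-formed' : 'malformed';
}
```

The bytes C0 and C1 can only start overlong two-byte encodings, so a strict decoder has to reject every sample that contains them.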
From @cpansprout
On Mon Sep 26 13:19:50 2011, tom christiansen wrote:
I think it was agreed some time ago that that is a bug. The utf8 layer
The RT System itself - Status changed from 'new' to 'open'
From tchrist@perl.com
I do have some mail from Mark Davis explaining why a UTF-8 decoder must
--tom
From @Leont
On Mon, Sep 26, 2011 at 10:25 PM, Father Chrysostomos via RT
The complication of course is that there is no such thing as a utf8
Leon
From tchrist@perl.com
"Leon Timmermans via RT" <perlbug-followup@perl.org> wrote
The description in perlrun of what happens when you use :utf8 in the PERLIO
--tom
From @ikegami
On Mon, Sep 26, 2011 at 4:55 PM, Tom Christiansen <tchrist@perl.com> wrote:
The warning is coming from sprintf, not the layer.
$ perl -C0 -le 'print "\xC0\x81"' | perl -CS -nle1
$ perl -C0 -le 'print "\xC0\x81"' | perl -CS -MDevel::Peek -nle'Dump($_)'
From @khwilliamson
On 09/26/2011 02:27 PM, Tom Christiansen wrote:
This issue keeps coming back up, when I think we have long ago resolved
The default utf8 layer should prohibit malformed utf8, surrogates,
There should be an alternate layer, called something like utf8-lax,
My understanding is that the original reason for not doing the input
I have been waiting for that code to be complete, and then planned to
Having now read Mark's email, I don't think that contradicts anything
From @Leont
On Wed, Sep 28, 2011 at 1:09 AM, Karl Williamson
I would personally prefer it to be one layer with multiple options. I
Leon
From @nwc10
On Tue, Sep 27, 2011 at 05:09:33PM -0600, Karl Williamson wrote:
I had hoped to work on it over last Christmas, but everyone got ill and
Whilst I have a feel for how to do it for UTF-8, I have no idea how to do
I also wasn't sure how to benchmark it properly, to be confident about the
It's also blocking on lack of feedback to bug #79960
Nicholas Clark
From @khwilliamson
On 09/28/2011 05:50 AM, Nicholas Clark wrote:
I believe I have the expertise to take what you do for UTF-8 and extend
I remember seeing the code somewhere, and thinking that it could be
So, here's my comments on that bug.
FWIW, here is a link to what
I have never used $/ set to a fixed length, but reading the pod, it
But why not just return only as many complete characters as will fit in
I do think that the buffer length should only be construed as bytes and
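To make the "only as many complete characters as will fit" idea concrete, here is a rough sketch of that rule applied to an already decoded string. The helper name `fit_chars_in_bytes` and the sample text are assumptions for illustration; this is not code from perl's I/O layer or from either ticket.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the longest prefix of the character string $str
# whose UTF-8 encoding fits in $max_bytes bytes, never splitting a character.
sub fit_chars_in_bytes {
    my ($str, $max_bytes) = @_;
    my ($out, $used) = ('', 0);
    for my $ch (split //, $str) {
        my $len = length(Encode::encode('UTF-8', $ch));    # bytes this character needs
        last if $used + $len > $max_bytes;
        $out  .= $ch;
        $used += $len;
    }
    return $out;
}

my $text = "ab\x{E9}cd";    # 5 characters, 6 bytes when encoded as UTF-8
my $rec  = fit_chars_in_bytes($text, 3);
printf "%d characters fit in 3 bytes\n", length $rec;
```

With a 3-byte budget the two-byte U+00E9 does not fit after "ab", so the record holds two characters and the third byte of the budget goes unused. That short record is the trade-off this approach makes against a strict fixed-length read.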
From tchrist@perl.com
Karl Williamson <public@khwilliamson.com> wrote
Could you please explain why you think that?
Why not have
binmode(FH, ":utf8");
mean
binmode(FH, ":utf8");
I vaguely feel like you should never have byte operation
But maybe I'm wrong.
--tom
From @khwilliamson
On 09/28/2011 05:56 PM, Tom Christiansen wrote:
I found this persuasive (from the original ticket) "Or we could try to
I would also be ok with just croaking when attempting a byte-type
From @nwc10
On Wed, Sep 28, 2011 at 01:43:18AM +0200, Leon Timmermans wrote:
I *think* so, but somewhere I have notes on what made sense, and some
Whilst it would be nice for :utf8 to be the flexible layer, I think it would
Then you run that code on an older perl:
$ echo Works already | perl -we 'open my $fh, "<:utf8(surrogates-ok,nonchars-ok)", "/dev/fd/0"; print <$fh>'
a) No error. No warning that your input isn't subject to paranoia
I guess that one can solve (a) by having :utf8 fault the new arguments
Also, in terms of Jesse's 5.16+ plan, I can't see how layers are anything
Nicholas Clark
From @Leont
On Thu, Sep 29, 2011 at 10:09 AM, Nicholas Clark <nick@ccl4.org> wrote:
If we had aliases in PerlIO, all of this could be handled much more
Leon
From @davidnicol
On Wed, Sep 28, 2011 at 8:00 PM, Karl Williamson <public@khwilliamson.com> wrote:
it may torpedo and sink the original fixed length records for mainframe IO
Is anyone here actually shoehorning UTF8 into fixed-length records, using
How do major commercial databases handle unicode and "CHAR 20" fields?
From tchrist@perl.com
Oh sure.
Java.
--tom
From @Hugmeir
On Fri, Oct 14, 2011 at 5:13 PM, Tom Christiansen <tchrist@perl.com> wrote:
Is their model worth borrowing?
From @cpansprout
On Tue Sep 27 16:11:29 2011, public@khwilliamson.com wrote:
Yes, of course.
That might be going too far.
Indeed, but the only example given where non-characters were a security
(Have I already said this? I have a backlog of messages I wanted to
From @rjbs
I believe that the correct behavior is for $/=\10 to cause getlines to behave as reads with length
The security concern, for now, can be addressed at a different level.
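As background for the comparison with reads of a given length, the following sketch shows two things that can be checked directly: on a byte handle, `$/ = \N` records are counted in bytes, while `read` on a decoding handle counts characters. The sample data and variable names are assumptions for illustration, and the strict `:encoding(UTF-8)` layer is used here instead of `:utf8`. How `<$fh>` with `$/ = \N` behaves on a decoding handle depends on the perl version and is deliberately not shown.

```perl
use strict;
use warnings;

my $bytes = "ab\xC3\xA9cd";    # 5 characters, 6 bytes: C3 A9 is U+00E9

# Byte handle: fixed-length records are counted in bytes, so a 3-byte record
# ends in the middle of the two-byte sequence C3 A9.
open my $raw, '<:raw', \$bytes or die "open: $!";
my @records = do { local $/ = \3; <$raw> };
close $raw;

# Decoding handle: read() counts characters, so U+00E9 stays whole.
open my $dec, '<:encoding(UTF-8)', \$bytes or die "open: $!";
my $got = read($dec, my $buf, 3);
close $dec;

printf "raw records (hex): %s\n", join ' | ', map { unpack 'H*', $_ } @records;
printf "read() returned %d characters, the last is U+%04X\n",
    $got, ord(substr($buf, -1));
```

The first raw record ends with a lone C3 byte, which is exactly the kind of fragment that becomes broken UTF-8 if it is later flagged as character data.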
From @rjbs
On Thu Mar 22 14:53:40 2012, rjbs wrote:
...clearly I'm talking about the $/ bug specifically, which I've marked as not blocking 5.16.
From tchrist@perl.com
Ok.
--tom
From @rjbs
As of cd7e6c8 the lone remaining change to be made is the default-strictness of the :utf8 layer,
The last few commits, however, make utf-8 handling much stricter, including fixing a number of
This ticket will now be removed from blocking.
From @rjbs
On Thu Apr 26 11:11:43 2012, rjbs wrote:
We hoped to see strict utf8 available in 5.17, but it didn't happen for various reasons. I got an
Thanks!
--
Migrated from rt.perl.org#100058 (status was 'open')
Searchable as RT100058$