Documentation of byte I/O #14803
Comments
From the.rob.dixon@gmail.com
This is a bug report for perl from the.rob.dixon@gmail.com,
I recently read "perldoc bytes" and saw
This pragma reflects early attempts to incorporate Unicode into
I think it is a mistake to fail to say /what/ has superseded it, and
In any statement that deprecates "bytes", I think something should
In short, I believe the perluni* should have been
The journey into the promised land of Unicode has been arduous
As you can tell, these are infant musings without any coherent plan.
Or perhaps I am just asking for a better index into the perldoc tomes
Thank you for reading
Flags:
Site configuration information for perl 5.22.0:
Configured by strawberry-perl at Mon Jun 1 20:06:45 2015.
Summary of my perl5 (revision 5 version 22 subversion 0) configuration:
Platform:
@INC for perl 5.22.0:
Environment for perl 5.22.0: |
From @cowens
Would it be correct to say that explicit encoding using the Encode module
expected result of the bytes pragma:
#!/usr/bin/perl
use strict;
use Encode qw/encode/;
my $utf8 = "é";
print "expected result of the bytes pragma:\n";
my $length = length $utf8;
print "$length [@bytes]\n";
print "unexpected results when string is not in the expected encoding\n";
my $length = length $latin1;
print "$length [@bytes]\n";
print "explicit encoding yields the expected results\n";
print "$length [@bytes]\n";
print "even when the string isn't in the expected encoding\n";
print "$length [@bytes]\n";
On Wed, Jul 15, 2015 at 4:20 PM Rob Dixon <perlbug-followup@perl.org> wrote:
|
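The script quoted above survives only in fragments, so the following is a minimal sketch of the comparison it appears to make, not a reconstruction of the original. It assumes a single sample character, U+00E9, held in both of perl's internal storage formats, and compares `bytes::length` (which reports the size of the internal buffer) with explicit encoding through `Encode::encode`. The variable names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/encode/;
use bytes ();    # loads bytes::length without switching on byte semantics

# Hypothetical sample data: the single character U+00E9, stored two ways.
my $downgraded = "\x{e9}";
my $upgraded   = "\x{e9}";
utf8::downgrade($downgraded);    # one byte per character internally
utf8::upgrade($upgraded);        # UTF-8 internally

# The two strings compare equal, yet their internal buffers differ in size.
my $same           = $downgraded eq $upgraded;
my @internal_sizes = (bytes::length($downgraded), bytes::length($upgraded));

# Explicit encoding produces the same bytes whichever way the string is stored.
my @utf8_a   = unpack "C*", encode("UTF-8",      $downgraded);
my @utf8_b   = unpack "C*", encode("UTF-8",      $upgraded);
my @latin1_a = unpack "C*", encode("ISO-8859-1", $downgraded);
my @latin1_b = unpack "C*", encode("ISO-8859-1", $upgraded);

print "strings equal:  ", ($same ? "yes" : "no"), "\n";
print "internal sizes: @internal_sizes\n";
print "UTF-8 bytes:    @utf8_a / @utf8_b\n";
print "Latin-1 bytes:  @latin1_a / @latin1_b\n";
```

The point of the sketch is that the answer from `bytes::length` depends on how the string happens to be stored, while the answer from `encode` depends only on the characters and the encoding that was asked for.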
The RT System itself - Status changed from 'new' to 'open' |
From @ap
* Rob Dixon <perlbug-followup@perl.org> [2015-07-15 22:20]:
It has been superseded by nothing. I/O is in terms of bytes by default. If, however, you are using code that does at some point decode your
Please explain. FWIW, `use utf8` has a *completely* different purpose
Somehow I don’t think the documentation is going to get less confusing
Correct.
That is necessary on Windows only because the :crlf layer is added to
It is also necessary if you want to turn off decoding on a filehandle
Well. The documentation is sprawling and has no coherent organisation,
What this produces is documentation that is great so long as you read
If the documentation is supposed to be more penetrable then it must have
In short, you need an architect.
We don’t have one. In fact we are probably even further away from having
Regards, |
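As a rough illustration of the point that I/O is in terms of bytes by default, here is a minimal sketch of a round trip through a file: characters are encoded explicitly before writing and decoded explicitly after reading. The sample text and the use of `File::Temp` are assumptions made for the example. The `binmode` call and the `:raw` layer only make a difference on platforms such as Windows, where the `:crlf` layer is applied by default.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/encode decode/;
use File::Temp qw/tempfile/;

# Hypothetical character string: "caf", U+00E9, a space, and U+263A.
my $text = "caf\x{e9} \x{263a}";

# Write: turn characters into bytes ourselves, then print the bytes.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                          # no :crlf translation, no decoding
print {$out} encode("UTF-8", $text);
close $out or die "close: $!";

# Read: slurp the raw bytes, then decode them back into characters.
open my $in, '<:raw', $path or die "open: $!";
my $bytes = do { local $/; <$in> };
close $in or die "close: $!";

my $round_trip = decode("UTF-8", $bytes);

printf "%d bytes on disk, %d characters after decoding\n",
    length($bytes), length($round_trip);
```

Nothing here needs the bytes pragma: `length` on the encoded string already counts bytes, because that string consists of bytes.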
From @cowens
I think the following patch addresses the confusion of what to use
Inline Patch
diff --git a/lib/bytes.pm b/lib/bytes.pm
index 6dad41a..77d849d 100644
--- a/lib/bytes.pm
+++ b/lib/bytes.pm
@@ -35,15 +35,24 @@ bytes - Perl pragma to force byte semantics rather
=head1 NOTICE
-This pragma reflects early attempts to incorporate Unicode into perl and
=head1 SYNOPSIS
Chas. Owens |
From @tonycoz
On Mon Aug 03 07:05:54 2015, cowens wrote:
It seems like an improvement to me.
Should it mention utf8::encode()?
Tony |
From @cowens
On Thu, Aug 6, 2015, 01:15 Tony Cook via RT <perlbug-followup@perl.org>
At first, I was going to say no, but then I benchmarked them: utf8::encode |
From @cowens
bytes.patch
diff --git a/lib/bytes.pm b/lib/bytes.pm
index 6dad41a..96a243c 100644
--- a/lib/bytes.pm
+++ b/lib/bytes.pm
@@ -35,15 +35,31 @@ bytes - Perl pragma to force byte semantics rather than character semantics
=head1 NOTICE
-This pragma reflects early attempts to incorporate Unicode into perl and
-has since been superseded. It breaks encapsulation (i.e. it exposes the
-innards of how the perl executable currently happens to store a string),
-and use of this module for anything other than debugging purposes is
-strongly discouraged. If you feel that the functions here within might be
-useful for your application, this possibly indicates a mismatch between
-your mental model of Perl Unicode and the current reality. In that case,
-you may wish to read some of the perl Unicode documentation:
-L<perluniintro>, L<perlunitut>, L<perlunifaq> and L<perlunicode>.
+This pragma reflects early attempts to incorporate Unicode into perl and has
+since been superseded by explicit (rather than this pragma's implict) encoding
+using the L<Encode> module:
+
+ use Encode qw/encode/;
+
+ my $utf8_byte_string = encode "UTF8", $string;
+ my $latin1_byte_string = encode "Latin1", $string;
+
+Or, if performance is needed and you are only interested in the UTF-8
+representation:
+
+ use utf8;
+
+ utf8::encode(my $utf8_byte_string = $string);
+
+Because the bytes pragma breaks encapsulation (i.e. it exposes the innards of
+how the perl executable currently happens to store a string), the byte values
+that result are in an unspecified encoding. Use of this module for anything
+other than debugging purposes is strongly discouraged. If you feel that the
+functions here within might be useful for your application, this possibly
+indicates a mismatch between your mental model of Perl Unicode and the current
+reality. In that case, you may wish to read some of the perl Unicode
+documentation: L<perluniintro>, L<perlunitut>, L<perlunifaq> and
+L<perlunicode>.
=head1 SYNOPSIS
@@ -95,6 +111,6 @@ bytes::substr() does not work as an lvalue().
=head1 SEE ALSO
-L<perluniintro>, L<perlunicode>, L<utf8>
+L<perluniintro>, L<perlunicode>, L<utf8>, L<Encode>
=cut
|
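The two approaches named in the patch can be compared side by side. The sketch below assumes a hypothetical input string and checks that `Encode::encode` and `utf8::encode` yield the same UTF-8 bytes for it. It uses the strict "UTF-8" encoding name rather than the patch's "UTF8"; for ordinary text like this the two give the same result.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/encode/;

# Hypothetical input containing U+00EF and U+2603.
my $string = "na\x{ef}ve \x{2603}";

# Explicit encoding through Encode returns a new byte string.
my $via_encode = encode("UTF-8", $string);

# utf8::encode converts in place, so copy first. It is always available;
# no "use utf8;" is needed to call it.
utf8::encode(my $via_builtin = $string);

# Encode can also target other encodings, as long as the characters fit.
my $latin1 = encode("ISO-8859-1", "na\x{ef}ve");

printf "characters: %d\n", length $string;
printf "UTF-8 bytes: %d (Encode), %d (utf8::encode)\n",
    length $via_encode, length $via_builtin;
printf "Latin-1 bytes: %d\n", length $latin1;
```

The copy-then-encode idiom keeps the original character string intact, which matters because `utf8::encode` would otherwise overwrite it with bytes.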
From @khwilliamson
On 08/06/2015 02:18 AM, Chas. Owens wrote:
Consider doing an approach of doing something like this:
=head1 NAME
bytes - Perl pragma to access the individual bytes of characters
stressing that this is to be mostly confined to temporary debugging |
From @cowens
On Tue, Aug 11, 2015 at 9:35 PM, Karl Williamson
Here is my rewritten version. Everything following this text is the
=head1 NAME
bytes - Perl pragma to access the individual bytes of characters in strings
=head1 NOTICE
This pragma is no longer recommended for anything other than debugging
A better solution is to create a byte string with an explicit encoding using
use Encode qw/encode/;
my $utf8_byte_string = encode "UTF8", $string;
Or, if performance is needed and you are only interested in the UTF-8
use utf8;
utf8::encode(my $utf8_byte_string = $string);
If you feel this pragma might be useful for your application, this possibly
-- |
From lasse.makholm@gmail.com
Hi,
It seems like bytes is not only deprecated and easily misunderstood but
I can't remember ever using bytes for anything except
The bytes docs explicitly state that the byte strings it operates on are,
$x = chr(400);
Yields:
Length is 1
However, using U+00F8 ( chr(248) - "ø" ) instead - not so much:
Length is 1
Ouch.
I'm guessing there's some code somewhere that mistakenly assumes that
I'm slightly baffled as to how I've never noticed this before...
This is perl 5, version 18, subversion 2 (v5.18.2) built for darwin-2level
/L
On 13 August 2015 at 16:10, Chas. Owens <chas.owens@gmail.com> wrote:
|
From @rjbs
* Lasse Makholm <lasse.makholm@gmail.com> [2015-08-13T16:46:05]
The perl runtime stores strings in memory in either Type-A or Type-B format.
In Type-A format, it is an array of chars, and each char is the value of the
In Type-B format, it is an array of chars forming a valid UTF-8 sequence.
To "use bytes" makes things operate on the underlying "array of chars" without
-- |
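The two storage formats described above can be observed directly. This is a minimal debugging sketch using the `chr(400)` and `chr(248)` values from the earlier comment; the `describe` helper is hypothetical, and `utf8::is_utf8` reports only which internal format a string currently uses, not anything about its meaning.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use bytes ();    # for bytes::length only; byte semantics stay off

# Hypothetical helper: summarise how a string is currently stored.
sub describe {
    my ($str) = @_;
    return sprintf "chars=%d internal_bytes=%d utf8_flag=%d",
        length($str), bytes::length($str), utf8::is_utf8($str) ? 1 : 0;
}

my $wide   = chr(400);    # U+0190 cannot fit in one byte per character
my $narrow = chr(248);    # U+00F8 can be stored either way

# Force the second format onto a copy of the narrow string.
utf8::upgrade(my $upgraded = $narrow);

my %seen = (
    wide     => describe($wide),
    narrow   => describe($narrow),
    upgraded => describe($upgraded),
);
my $still_equal = $narrow eq $upgraded;

print "$_: $seen{$_}\n" for sort keys %seen;
print "narrow eq upgraded: ", ($still_equal ? "yes" : "no"), "\n";
```

The character count is 1 in every case; only the internal byte count changes, which is why code that relies on `bytes::length` can give different answers for strings that compare equal.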
From @cowens
The bytes pragma is broken, but not in the way you think. Perl using bytes
#!/usr/bin/perl
use strict;
binmode STDOUT => ":utf8";
my $latin1 = chr(255);
my ($latin1_length, $utf8_length);
print
utf8::encode(my $latin1_byte_string = $latin1);
$latin1_length = length $latin1_byte_string;
print
Also, if you or your framework hasn't been setting the output
$ perl -e 'print chr(255)' | wc -c
On Thu, Aug 13, 2015 at 4:46 PM, Lasse Makholm <lasse.makholm@gmail.com> wrote:
-- |
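The `perl -e 'print chr(255)' | wc -c` check above can be approximated without a shell. The sketch below is an assumption-laden stand-in: it writes to in-memory filehandles instead of STDOUT and counts the bytes that arrive, with and without an encoding layer. The `bytes_written` helper is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: print a string through the given layers into an
# in-memory file and report how many bytes were written.
sub bytes_written {
    my ($layers, $string) = @_;
    my $buffer = '';
    open my $fh, ">$layers", \$buffer or die "open: $!";
    print {$fh} $string;
    close $fh or die "close: $!";
    return length $buffer;
}

my $char = chr(255);
utf8::upgrade(my $upgraded = $char);    # same character, UTF-8 internally

my $no_layer          = bytes_written('', $char);
my $no_layer_upgraded = bytes_written('', $upgraded);
my $utf8_layer        = bytes_written(':encoding(UTF-8)', $char);

print "no layer:          $no_layer byte(s)\n";
print "no layer, upgraded: $no_layer_upgraded byte(s)\n";
print "UTF-8 layer:       $utf8_layer byte(s)\n";
```

What reaches the handle is decided by the layers on the handle, not by the string's internal format, so a response that was sized with `bytes::length` can disagree with what is actually sent.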
From @cowens
I think you are confused by this line: "As an example, when Perl sees
On Thu, Aug 13, 2015 at 5:48 PM, Chas. Owens <chas.owens@gmail.com> wrote:
-- |
From @Leont
On Thu, Aug 13, 2015 at 11:48 PM, Chas. Owens <chas.owens@gmail.com> wrote:
No, it can use utf8 internally even when all of the characters are below
Indeed.
Leon |
From @cowens
On Thu, Aug 13, 2015 at 6:08 PM, Leon Timmermans <fawaka@gmail.com> wrote:
Like I said, I don't fully understand the rules (and I don't have to
#!/usr/bin/perl
use strict;
my $utf8 = "ÿ";
{
-- |
From @cowens
A new version that attempts to address the confusion of how strings
=head1 NAME
bytes - Perl pragma to access the individual bytes of characters in strings
=head1 NOTICE
This pragma is no longer recommended for anything other than debugging of how
# this string can be encoded internally with Latin1,
# but this string has a character above U+00FF and can't be encoded
A better solution is to create a byte string with an explicit encoding using
use Encode qw/encode/;
my $utf8_byte_string = encode "UTF8", $string;
Or, if performance is needed and you are only interested in the UTF-8
use utf8;
utf8::encode(my $utf8_byte_string = $string);
If you feel this pragma might be useful for your application, this possibly
On Thu, Aug 13, 2015 at 6:25 PM, Chas. Owens <chas.owens@gmail.com> wrote:
-- |
From lasse.makholm@gmail.com
On 13 August 2015 at 23:50, Chas. Owens <chas.owens@gmail.com> wrote:
Spot on. Perhaps some stronger/earlier wording is needed in discouraging
Maybe just something more a la the docs for "local" which basically start
/L
|
From lasse.makholm@gmail.com
On 13 August 2015 at 23:48, Chas. Owens <chas.owens@gmail.com> wrote:
You are right, of course. And despite having read through most of the perl
If you have been using the bytes
Thankfully my responses are almost without exception JSON blobs encoded in
The mistake is real though.
/L
|
From lasse.makholm@gmail.com
On 14 August 2015 at 00:30, Chas. Owens <chas.owens@gmail.com> wrote:
FWIW, I like this version. It clearly shows why bytes::length() is not what
The "use utf8;" statement should probably be removed though. According to
Do not use this pragma for anything else than telling Perl that your
And indeed, you can call utf8::encode() without use'ing utf8 first.
/L |
From @ikegami
On Thu, Aug 13, 2015 at 10:10 AM, Chas. Owens <chas.owens@gmail.com> wrote:
Sorry if this was already mentioned, but that should be
utf8::encode(my $utf8_byte_string = $string);
without use utf8;
"use utf8;" is a pragma that indicates the source code is encoded using |
From @tonycoz
On Thu Aug 13 15:31:03 2015, cowens wrote:
If the string has code points over 0xff it's encoded as perl's extended UTF-8, otherwise it could be encoded either way.
For example Encode::decode() always (I'm unaware of any exceptions) returns a UTF-8 encoded string, even if all of the characters are between 0 and 0xff:
tony@mars:.../git/perl$ ./perl -Ilib -MDevel::Peek -MEncode -e '$x = decode("latin1", " "); Dump($x)'
Also some string literals under use utf8:
tony@mars:.../git/perl$ ./perl -Ilib -MDevel::Peek -Mutf8 -e '$x = "ÿ"; Dump($x)'
...
As others have said, the "use utf8;" isn't needed.
Tony |
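The observation about `Encode::decode()` can be checked without a debugging build of perl or `Devel::Peek`. The following sketch assumes the single Latin-1 octet 0xFF as input and compares the decoded result with `chr(255)`; the flag and internal byte counts it prints are implementation details and should be read as debugging output only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/decode/;
use bytes ();    # for bytes::length only

my $octets  = "\xff";                          # one Latin-1 byte
my $decoded = decode("ISO-8859-1", $octets);   # character string from Encode
my $plain   = chr(255);                        # same character, built directly

my %report = (
    equal          => ($decoded eq $plain ? 1 : 0),
    chars          => length $decoded,
    plain_flag     => (utf8::is_utf8($plain)   ? 1 : 0),
    decoded_flag   => (utf8::is_utf8($decoded) ? 1 : 0),
    plain_bytes    => bytes::length($plain),
    decoded_bytes  => bytes::length($decoded),
);

printf "%-14s %s\n", $_, $report{$_} for sort keys %report;
```

If the decoded string reports a different flag or internal byte count from the plain one while still comparing equal, that is the behaviour described above: both are the same one-character string, stored differently.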
From @tonycoz
On Fri, 14 Aug 2015 11:28:28 -0700, ikegami@adaelis.com wrote:
Is there anything else we want to do for this ticket?
The only patch (with some extras and changes) was applied as:
commit 01e331e
Update bytes.pm doc
(though it retained the unneeded C<use utf8;>.)
Tony |
From @khwilliamson
On Sun, 15 Oct 2017 22:09:20 -0700, tonyc wrote:
No one replied to this; I'm unclear if it was sent to all the concerned parties.
The latest Encode running on blead is much faster than before.
I'm fine with leaving the wording as it is now, but before closing I want to make sure that others don't have objections. If I don't hear any within 30 days, I will close the ticket.
-- |
From @iabyn
On Tue, Apr 03, 2018 at 09:38:50AM -0700, Karl Williamson via RT wrote:
With the just-pushed v5.27.10-105-g0d372decae, I've removed the spurious
-- |
From @khwilliamson
The deadline for 5.28 is fast upon us, and since no objections have been raised, I'm now closing this ticket |
@khwilliamson - Status changed from 'open' to 'pending release' |
From @khwilliamson
Thank you for filing this report. You have helped make Perl better.
With the release yesterday of Perl 5.28.0, this and 185 other issues have been
Perl 5.28.0 may be downloaded via:
If you find that the problem persists, feel free to reopen this ticket. |
@khwilliamson - Status changed from 'pending release' to 'resolved' |
Migrated from rt.perl.org#125619 (status was 'resolved')
Searchable as RT125619$