Documentation of byte I/O #14803

Closed
p5pRT opened this issue Jul 15, 2015 · 29 comments

@p5pRT

p5pRT commented Jul 15, 2015

Migrated from rt.perl.org#125619 (status was 'resolved')

Searchable as RT125619$

@p5pRT
Author

p5pRT commented Jul 15, 2015

From the.rob.dixon@gmail.com

This is a bug report for perl from the.rob.dixon@​gmail.com,
generated with the help of perlbug 1.40 running under perl 5.22.0.


I recently read "perldoc bytes" and saw

  This pragma reflects early attempts to incorporate Unicode into
  perl (sic.) and has since been superseded

I think it is a mistake to fail to say /what/ has superseded it, and
the perluni* pods aren't something I should expect to look at if I'm
not using Unicode

In any statement that deprecates "bytes", I think something should
be said about "utf8", which seems to be its dual but is not

In short, I believe the perluni* should have been
"perlcharacterencoding", as there is no "perlascii" or "perllatin1"

The journey into the promised land of Unicode has been arduous
enough as it is, and the perluni* pods are a major achievement, so I
don't really expect all of that to be ripped apart on my whim. But
perhaps we could do with something like my "perlcharacterencoding"
as an entry point to all of the above?

As you can tell, these are infant musings without any coherent plan.
But I have recently been asked how to enable byte semantics on a
Perl input stream, and I was troubled to find that I was neither
certain nor able to locate the documentation that told me. I suspect
the answer is that file handles have byte semantics by default, and
can be opened that way explicitly with a mode of :raw or using
binmode, but I am far from sure

Or perhaps I am just asking for a better index into the perldoc tomes

Thank you for reading



Flags​:
  category=docs
  severity=wishlist


Site configuration information for perl 5.22.0​:

Configured by strawberry-perl at Mon Jun 1 20​:06​:45 2015.

Summary of my perl5 (revision 5 version 22 subversion 0) configuration​:

  Platform​:
  osname=MSWin32, osvers=6.3, archname=MSWin32-x86-multi-thread-64int
  uname='Win32 strawberry-perl 5.22.0.1 #1 Mon Jun 1 20​:04​:50 2015 i386'
  config_args='undef'
  hint=recommended, useposix=true, d_sigaction=undef
  useithreads=define, usemultiplicity=define
  use64bitint=define, use64bitall=undef, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='gcc', ccflags =' -s -O2 -DWIN32 -DPERL_TEXTMODE_SCRIPTS
-DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -fwrapv
-fno-strict-aliasing -mms-bitfields',
  optimize='-s -O2',
  cppflags='-DWIN32'
  ccversion='', gccversion='4.9.2', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8,
byteorder=12345678, doublekind=3
  d_longlong=define, longlongsize=8, d_longdbl=define,
longdblsize=8, longdblkind=3
  ivtype='long long', ivsize=8, nvtype='double', nvsize=8,
Off_t='long long', lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries:
  ld='g++', ldflags ='-s -L"C:\STRAWB~1\perl\lib\CORE" -L"C:\STRAWB~1\c\lib"'
  libpth=C:\STRAWB~1\c\lib C:\STRAWB~1\c\i686-w64-mingw32\lib
C:\STRAWB~1\c\lib\gcc\i686-w64-mingw32\4.9.2
  libs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32
-ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32
-lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
  perllibs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool
-lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid
-lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
  libc=, so=dll, useshrplib=true, libperl=libperl522.a
  gnulibc_version=''
  Dynamic Linking:
  dlsrc=dl_win32.xs, dlext=xs.dll, d_dlsymun=undef, ccdlflags=' '
  cccdlflags=' ', lddlflags='-mdll -s -L"C:\STRAWB~1\perl\lib\CORE"
-L"C:\STRAWB~1\c\lib"'


@​INC for perl 5.22.0​:
  C​:/Strawberry/perl/site/lib/MSWin32-x86-multi-thread-64int
  C​:/Strawberry/perl/site/lib
  C​:/Strawberry/perl/vendor/lib
  C​:/Strawberry/perl/lib
  .


Environment for perl 5.22.0​:
  HOME (unset)
  LANG (unset)
  LANGUAGE (unset)
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=C​:\Python27\;C​:\Python27\Scripts;C​:\Program Files
(x86)\Common Files\Intel\Shared
Files\cpp\bin\Intel64;C​:\Windows\system32;C​:\Windows;C​:\Windows\System32\Wbem;C​:\Windows\System32\WindowsPowerShell\v1.0\;C​:\Strawberry\c\bin;C​:\Strawberry\perl\site\bin;C​:\Strawberry\perl\bin;C​:\ffmpeg\bin;C​:\Program
Files (x86)\Git\cmd;C​:\Program Files (x86)\NVIDIA
Corporation\PhysX\Common;C​:\WINDOWS\system32;C​:\WINDOWS;C​:\WINDOWS\System32\Wbem;C​:\WINDOWS\System32\WindowsPowerShell\v1.0\;E​:\Perl\source;C​:\Program
Files (x86)\EaseUS\Todo Backup\bin\x64\
  PERL_BADLANG (unset)
  SHELL (unset)

@p5pRT
Author

p5pRT commented Jul 16, 2015

From @cowens

Would it be correct to say that explicit encoding using the Encode module
has replaced the bytes pragma's implicit treatment of strings as bytes of
unknown encoding?

The script below prints:

  expected result of the bytes pragma:
  2 [0xc3 0xa9]
  unexpected results when string is not in the expected encoding
  1 [0xe9]
  explicit encoding yields the expected results
  2 [0xc3 0xa9]
  even when the string isn't in the expected encoding
  2 [0xc3 0xa9]

#!/usr/bin/perl

use strict;
use warnings;
use utf8;    # the source file is UTF-8, so "é" below is a single character

use Encode qw/encode/;

my $utf8 = "é";

print "expected result of the bytes pragma:\n";
{
    use bytes;    # length, split and ord now see the internal bytes

    my $length = length $utf8;
    my @bytes  = map { sprintf "0x%02x", ord } split //, $utf8;

    print "$length [@bytes]\n";
}

print "unexpected results when string is not in the expected encoding\n";
my $latin1 = encode("Latin1", $utf8);    # one byte, 0xe9
{
    use bytes;

    my $length = length $latin1;
    my @bytes  = map { sprintf "0x%02x", ord } split //, $latin1;

    print "$length [@bytes]\n";
}

print "explicit encoding yields the expected results\n";
{
    # No "use bytes" here: encode returns a byte string directly.
    my $raw    = encode('UTF-8', $utf8);
    my $length = length $raw;
    my @bytes  = map { sprintf "0x%02x", ord } split //, $raw;

    print "$length [@bytes]\n";
}

print "even when the string isn't in the expected encoding\n";
{
    # The byte 0xe9 is treated as the character U+00E9 and encoded as such.
    my $raw    = encode('UTF-8', $latin1);
    my $length = length $raw;
    my @bytes  = map { sprintf "0x%02x", ord } split //, $raw;

    print "$length [@bytes]\n";
}

On Wed, Jul 15, 2015 at 4​:20 PM Rob Dixon <perlbug-followup@​perl.org> wrote​:

# New Ticket Created by Rob Dixon
# Please include the string​: [perl #125619]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=125619 >


@p5pRT
Author

p5pRT commented Jul 16, 2015

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Author

p5pRT commented Jul 29, 2015

From @ap

* Rob Dixon <perlbug-followup@​perl.org> [2015-07-15 22​:20]​:

> I recently read "perldoc bytes" and saw
>
>   This pragma reflects early attempts to incorporate Unicode into
>   perl (sic.) and has since been superseded
>
> I think it is a mistake to fail to say /what/ has superseded it, and
> the perluni* pods aren't something I should expect to look at if I'm
> not using Unicode

It has been superseded by nothing. I/O is in terms of bytes by default.
If you are not using Unicode, you do not need to do anything special.

If, however, you are using code that does at some point decode your
bytes into text, then you have to re-encode it appropriately yourself;
you never could just say `use bytes` and make its decodedness magically
go away.
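
The decode-then-re-encode round trip described above could look roughly like this. It is only a sketch: it assumes the incoming bytes are UTF-8, and the sample data and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a filehandle with byte semantics
# (assumed here to be UTF-8 encoded text: "caf" followed by U+00E9).
my $bytes_in = "caf\xC3\xA9";

# Some code decodes the bytes into a character string...
my $text = decode('UTF-8', $bytes_in);

# ...and works on characters rather than bytes.
my $upper = uc $text;

# Getting bytes back is an explicit encoding step, not "use bytes".
my $bytes_out = encode('UTF-8', $upper);

printf "%d characters in, %d bytes out\n", length $text, length $bytes_out;
```

The point of the sketch is that the choice of encoding is stated at both ends, so the result does not depend on how perl happens to store the string internally.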

> In any statement that deprecates "bytes", I think something should
> be said about "utf8", which seems to be its dual but is not

Please explain. FWIW, `use utf8` has a *completely* different purpose
than `use bytes`.

> In short, I believe the perluni* should have been
> "perlcharacterencoding", as there is no "perlascii" or "perllatin1"
>
> The journey into the promised land of Unicode has been arduous
> enough as it is, and the perluni* pods are a major achievement, so I
> don't really expect all of that to be ripped apart on my whim. But
> perhaps we could do with something like my "perlcharacterencoding"
> as an entry point to all of the above?

Somehow I don’t think the documentation is going to get less confusing
if we keep adding more documents to it…

> As you can tell, these are infant musings without any coherent plan.
> But I have recently been asked how to enable byte semantics on a
> Perl input stream, and I was troubled to find that I was neither
> certain nor able to locate the documentation that told me. I suspect
> the answer is that file handles have byte semantics by default

Correct.

> and can be opened that way explicitly with a mode of :raw or using
> binmode

That is necessary on Windows only because the :crlf layer is added to
filehandles by default there. (And maybe other platforms, but none of
them are really relevant in practice.)

It is also necessary if you want to turn off decoding on a filehandle
after the fact, but if you ask me, you should change your code to avoid
the need to do that – the semantics of doing it are too murky.
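
For readers who want to see the two spellings side by side, here is a small sketch of explicit byte I/O. It assumes nothing beyond core Perl and File::Temp; the temporary file and the sample bytes are illustrative only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write five raw bytes, including a CR LF pair that a :crlf layer
# would otherwise translate.
my ($out, $path) = tempfile();
binmode $out;    # switch an already-open handle to byte I/O
print {$out} "\xC3\xA9\x0D\x0A\x00";
close $out or die "close failed: $!";

# Read the bytes back, asking for the :raw layer at open time.
open my $in, '<:raw', $path or die "cannot open $path: $!";
my $data = do { local $/; <$in> };    # slurp the whole file
close $in;
unlink $path;

my $byte_count = length $data;    # counts bytes, since nothing was decoded
my @hex = map { sprintf '0x%02x', ord } split //, $data;
print "$byte_count [@hex]\n";
```

On platforms without a default :crlf layer the explicit `binmode` and `:raw` change nothing, but spelling them out keeps the script portable.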

> but I am far from sure
>
> Or perhaps I am just asking for a better index into the perldoc tomes
>
> Thank you for reading

Well. The documentation is sprawling and has no coherent organisation,
basically because it has grown by the same principle as is behind your
proposal​: someone needed (it) to explain a particular topic, so they
wrote a document or two to explain that topic. But coherent docs do not
happen by that method.

What this produces is documentation that is great so long as you read
each document by itself top to bottom, and you take the time out to sit
down and read every last document, one by one.

If the documentation is supposed to be more penetrable then it must have
some topical organisation, which requires editing by someone with a good
idea of what topics are covered where, who can then decide how some new
topic should be broken up to fit into the existing structure and/or how
the structure needs to be shifted around to make room for the new topic.

In short, you need an architect.

We don’t have one. In fact we are probably even further away from having
one for the docs than we are from one for the interpreter.

Regards,
--
Aristotle Pagaltzis // <http​://plasmasturm.org/>

@p5pRT
Author

p5pRT commented Aug 3, 2015

From @cowens

I think the following patch addresses the confusion of what to use
instead of bytes when you feel the need to access the byte level of a
string.

Inline Patch
diff --git a/lib/bytes.pm b/lib/bytes.pm
index 6dad41a..77d849d 100644
--- a/lib/bytes.pm
+++ b/lib/bytes.pm
@@ -35,15 +35,24 @@ bytes - Perl pragma to force byte semantics rather than character semantics
 
 =head1 NOTICE
 
-This pragma reflects early attempts to incorporate Unicode into perl and
-has since been superseded. It breaks encapsulation (i.e. it exposes the
-innards of how the perl executable currently happens to store a string),
-and use of this module for anything other than debugging purposes is
-strongly discouraged. If you feel that the functions here within might be
-useful for your application, this possibly indicates a mismatch between
-your mental model of Perl Unicode and the current reality. In that case,
-you may wish to read some of the perl Unicode documentation:
-L<perluniintro>, L<perlunitut>, L<perlunifaq> and L<perlunicode>.
+This pragma reflects early attempts to incorporate Unicode into perl and has
+since been superseded by explicit (rather than this pragma's implicit) encoding
+using the L<Encode> module:
+
+    use Encode qw/encode/;
+
+    my $utf8_byte_string   = encode "UTF-8",  $string;
+    my $latin1_byte_string = encode "Latin1", $string;
+
+Because this module breaks encapsulation (i.e. it exposes the innards of how
+the perl executable currently happens to store a string), the byte values that
+result are in an unspecified encoding. Use of this module for anything other
+than debugging purposes is strongly discouraged. If you feel that the
+functions here within might be useful for your application, this possibly
+indicates a mismatch between your mental model of Perl Unicode and the current
+reality. In that case, you may wish to read some of the perl Unicode
+documentation: L<perluniintro>, L<perlunitut>, L<perlunifaq> and
+L<perlunicode>.
 
 =head1 SYNOPSIS

Chas. Owens
http​://github.com/cowens
The most important skill a programmer can have is the ability to read.

@p5pRT
Author

p5pRT commented Aug 6, 2015

From @tonycoz


It seems like an improvement to me.

Should it mention utf8​::encode()?

Tony

@p5pRT
Copy link
Author

p5pRT commented Aug 6, 2015

From @cowens


At first, I was going to say no, but then I benchmarked them: utf8::encode
is an order of magnitude faster even with the assignment needed to make it
work like Encode::encode. Even factoring out the call to find_encoding
leaves utf8::encode twice as fast as $obj->encode(). It also handles
malformed UTF-8 like the "UTF8" encoding (i.e. it doesn't replace malformed
characters with the replacement character). Here is a second attempt at a
patch.
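
The benchmark itself was not posted to the ticket. A comparison along these lines could be set up as sketched below; the sample string, the sub names and the `bench` command-line switch are all hypothetical, and no particular timings are claimed.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);
use Benchmark qw(cmpthese);

# Sample character string (an assumption, chosen to contain non-ASCII).
my $string = "\x{e9}\x{190}" x 100;

# utf8::encode converts its argument in place, so assign to a copy first
# to get the same "return a new byte string" behaviour as Encode::encode.
sub via_utf8 { utf8::encode(my $copy = $string); return $copy }

# Encode::encode looks the encoding up by name on every call.
sub via_encode { return encode('UTF-8', $string) }

# Factor the lookup out by holding on to the encoding object.
my $enc = find_encoding('UTF-8');
sub via_object { return $enc->encode($string) }

# All three approaches should yield the same bytes for well-formed input.
my $same = via_utf8() eq via_encode() && via_utf8() eq via_object();
print "identical output: ", ($same ? "yes" : "no"), "\n";

# Run the timing comparison only on request: perl script.pl bench
if (@ARGV && $ARGV[0] eq 'bench') {
    cmpthese(-1, {
        'utf8::encode'   => \&via_utf8,
        'Encode::encode' => \&via_encode,
        '$obj->encode'   => \&via_object,
    });
}
```

Results will vary with the perl and Encode versions and with the input, so it is worth rerunning on the target system before relying on the ratio.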

@p5pRT
Author

p5pRT commented Aug 6, 2015

From @cowens

bytes.patch
diff --git a/lib/bytes.pm b/lib/bytes.pm
index 6dad41a..96a243c 100644
--- a/lib/bytes.pm
+++ b/lib/bytes.pm
@@ -35,15 +35,31 @@ bytes - Perl pragma to force byte semantics rather than character semantics
 
 =head1 NOTICE
 
-This pragma reflects early attempts to incorporate Unicode into perl and
-has since been superseded. It breaks encapsulation (i.e. it exposes the
-innards of how the perl executable currently happens to store a string),
-and use of this module for anything other than debugging purposes is
-strongly discouraged. If you feel that the functions here within might be
-useful for your application, this possibly indicates a mismatch between
-your mental model of Perl Unicode and the current reality. In that case,
-you may wish to read some of the perl Unicode documentation:
-L<perluniintro>, L<perlunitut>, L<perlunifaq> and L<perlunicode>.
+This pragma reflects early attempts to incorporate Unicode into perl and has
+since been superseded by explicit (rather than this pragma's implicit) encoding
+using the L<Encode> module:
+
+    use Encode qw/encode/;
+
+    my $utf8_byte_string   = encode "UTF8",   $string;
+    my $latin1_byte_string = encode "Latin1", $string;
+
+Or, if performance is needed and you are only interested in the UTF-8
+representation:
+
+    use utf8;
+
+    utf8::encode(my $utf8_byte_string = $string);
+
+Because the bytes pragma breaks encapsulation (i.e. it exposes the innards of
+how the perl executable currently happens to store a string), the byte values
+that result are in an unspecified encoding.  Use of this module for anything
+other than debugging purposes is strongly discouraged.  If you feel that the
+functions here within might be useful for your application, this possibly
+indicates a mismatch between your mental model of Perl Unicode and the current
+reality. In that case, you may wish to read some of the perl Unicode
+documentation: L<perluniintro>, L<perlunitut>, L<perlunifaq> and
+L<perlunicode>.
 
 =head1 SYNOPSIS
 
@@ -95,6 +111,6 @@ bytes::substr() does not work as an lvalue().
 
 =head1 SEE ALSO
 
-L<perluniintro>, L<perlunicode>, L<utf8>
+L<perluniintro>, L<perlunicode>, L<utf8>, L<Encode>
 
 =cut

@p5pRT
Author

p5pRT commented Aug 12, 2015

From @khwilliamson


Consider an approach along these lines:

  =head1 NAME

  bytes - Perl pragma to access the individual bytes of characters

stressing that this is to be mostly confined to temporary debugging
code. Then, later in the document, say that this pragma used to be used
for more things, but that doing so is no longer advised, as it has been
found to be broken. I think that text should incorporate Chas.' patch.

@p5pRT
Author

p5pRT commented Aug 13, 2015

From @cowens


Here is my rewritten version. Everything following this text is the
same as it currently is. If people like this version I will make a
patch. Otherwise, please suggest further edits and I will try again.

=head1 NAME

bytes - Perl pragma to access the individual bytes of characters in strings

=head1 NOTICE

This pragma is no longer recommended for anything other than debugging
of how Perl represents a given string internally. Perl strings can be
represented internally in a number of different encodings, and, therefore,
the byte values may not be the ones you are expecting.

A better solution is to create a byte string with an explicit encoding using
the C<encode> function from the L<Encode> module:

  use Encode qw/encode/;

  my $utf8_byte_string   = encode "UTF8",   $string;
  my $latin1_byte_string = encode "Latin1", $string;

Or, if performance is needed and you are only interested in the UTF-8
representation, you can use the C<encode> function from the L<utf8> pragma:

  use utf8;

  utf8::encode(my $utf8_byte_string = $string);

If you feel this pragma might be useful for your application, this possibly
indicates a mismatch between your mental model of Perl Unicode and the
current reality. In that case, you may wish to read some of the perl Unicode
documentation: L<perluniintro>, L<perlunitut>, L<perlunifaq> and
L<perlunicode>.

--
Chas. Owens
http​://github.com/cowens
The most important skill a programmer can have is the ability to read.
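
To make the two alternatives proposed above concrete, here is a minimal sketch that encodes one character string both ways and compares the results. The sample string and the variable names are assumptions chosen for illustration. Note that the `utf8::` functions are always available, so the sketch calls `utf8::encode` without `use utf8`.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string holding "é" (U+00E9); the value is arbitrary.
my $string = "caf\x{e9}";

# Explicit encodings: the result is a byte string in a known encoding.
my $utf8_bytes   = encode("UTF-8",  $string);   # "é" becomes 0xC3 0xA9
my $latin1_bytes = encode("Latin1", $string);   # "é" becomes 0xE9

# utf8::encode works in place, so encode a copy to keep $string intact.
utf8::encode(my $fast_utf8_bytes = $string);

printf "characters:   %d\n", length $string;         # 4
printf "UTF-8 bytes:  %d\n", length $utf8_bytes;     # 5
printf "Latin1 bytes: %d\n", length $latin1_bytes;   # 4
print  "same bytes\n" if $utf8_bytes eq $fast_utf8_bytes;
```

Either route gives a byte count that depends only on the chosen encoding, not on how perl happens to store `$string` internally.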

p5pRT commented Aug 13, 2015

From lasse.makholm@gmail.com

Hi,

It seems like bytes is not only deprecated and easily misunderstood but
also broken... (Unless I'm easily misunderstanding it...)

I can't remember ever using bytes for anything except
bytes::length($string) for calculating HTTP Content-Length headers and
such... Mostly because it's easier to type than length Encode blah blah...

The bytes docs explicitly state that the byte strings it operates on are,
in fact, the bytes that would make up the string in UTF-8 encoding. The
example given works for the character (U+0190) in the example:

  $x = chr(400);
  print "Length is ", length $x, "\n";     # "Length is 1"
  printf "Contents are %vd\n", $x;         # "Contents are 400"
  {
      use bytes; # or "require bytes; bytes::length()"
      print "Length is ", length $x, "\n"; # "Length is 2"
      printf "Contents are %vd\n", $x;     # "Contents are 198.144"
  }

Yields:

Length is 1
Contents are 400
Length is 2
Contents are 198.144

However, using U+00F8 ( chr(248) - "ø" ) instead - not so much:

Length is 1
Contents are 248
Length is 1
Contents are 248

Ouch. I'm guessing there's some code somewhere that mistakenly assumes that
all characters below U+0100 encode as 1 byte in UTF-8... :-(

I'm slightly baffled as to how I've never noticed this before...

This is perl 5, version 18, subversion 2 (v5.18.2) built for darwin-2level

/L

p5pRT commented Aug 13, 2015

From @rjbs

* Lasse Makholm <lasse.makholm@gmail.com> [2015-08-13T16:46:05]

> The bytes docs explicitly state that the byte strings it operates on are,
> in fact, the bytes that would make up the string in UTF-8 encoding. The
> example given works for the character (U+0190) in the example:

The perl runtime stores strings in memory in either Type-A or Type-B format.

In Type-A format, it is an array of chars, and each char is the value of the
character at that position.

In Type-B format, it is an array of chars forming a valid UTF-8 sequence. To
know the character at a given position, you must decode.

"use bytes" makes things operate on the underlying "array of chars" without
reference to whether the storage is Type-A or Type-B, which can be determined
through other means.

--
rjbs
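
The two storage formats described above can be observed from Perl code. The following is a small sketch, assuming a non-EBCDIC build; it uses `utf8::is_utf8` to report which storage a string currently has and `utf8::upgrade` to switch it, with `bytes::length` reporting the size of the underlying array of chars. The variable names are illustrative only.

```perl
use strict;
use warnings;
require bytes;   # makes bytes::length callable without enabling the pragma

my $s = chr(248);   # "ø", a single character

# One char per character: the "Type-A" storage described above.
my @before = (utf8::is_utf8($s) ? 1 : 0, length($s), bytes::length($s));

# Switch the same string to UTF-8 storage: the "Type-B" case.
utf8::upgrade($s);
my @after = (utf8::is_utf8($s) ? 1 : 0, length($s), bytes::length($s));

printf "before: flag=%d chars=%d bytes=%d\n", @before;   # flag=0 chars=1 bytes=1
printf "after:  flag=%d chars=%d bytes=%d\n", @after;    # flag=1 chars=1 bytes=2
```

The character length stays at 1 throughout; only the internal byte count changes, which is exactly what `use bytes` exposes.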

p5pRT commented Aug 13, 2015

From @cowens

The bytes pragma is broken, but not in the way you think. Perl
encodes strings based on a number of rules I don't fully understand,
but if I recall correctly, if all of the characters in the string are
below 255, then it uses Latin1, if there are characters greater than
255, it uses UTF-8. This is why the bytes pragma is broken (it isn't
easy to predict which encoding is being used). You don't necessarily
get the bytes you are expecting. If you have been using the bytes
pragma to get the length for HTTP Content-Length, then you have likely
been sending incorrect sizes.

using bytes
ÿ is length 1
ÿĀ is length 4
using utf8​::encode
ÿ is length 2
ÿĀ is length 4

#!/usr/bin/perl

use strict;
use warnings;
use utf8;

# print the characters themselves as UTF-8
binmode STDOUT => ":utf8";

my $latin1 = chr(255);
my $utf8   = chr(255) . chr(256);

# lengths of the internal representation, as seen by the bytes pragma
my ($latin1_length, $utf8_length);
{
  use bytes;
  $latin1_length = length $latin1;
  $utf8_length   = length $utf8;
}

print
  "using bytes\n",
  "$latin1 is length $latin1_length\n",
  "$utf8 is length $utf8_length\n";

# lengths of explicitly UTF-8 encoded copies
utf8::encode(my $latin1_byte_string = $latin1);
utf8::encode(my $utf8_byte_string = $utf8);

$latin1_length = length $latin1_byte_string;
$utf8_length   = length $utf8_byte_string;

print
  "using utf8::encode\n",
  "$latin1 is length $latin1_length\n",
  "$utf8 is length $utf8_length\n";

Also, if you or your framework hasn't been setting the output
filehandle to UTF-8, then you might have had the right length, but the
wrong encoding:

$ perl -e 'print chr(255)' | wc -c
1
$ perl -C -e 'print chr(255)' | wc -c
2
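
For the Content-Length case discussed above, one way to keep the length and the bytes on the wire in agreement is to encode once and use the same octets for both. This is only a sketch: the body text, the header values and the variable names are hypothetical, and a real application would normally let its HTTP framework do this.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical response body as a character string: "smørrebrød Ā".
my $body = "sm\x{f8}rrebr\x{f8}d \x{100}";

# Encode once, then measure and send the very same octets.
my $octets         = encode("UTF-8", $body);
my $content_length = length $octets;   # 15 bytes for 12 characters

binmode STDOUT;   # write raw bytes, with no encoding layer on the handle
print "Content-Type: text/plain; charset=UTF-8\r\n";
print "Content-Length: $content_length\r\n\r\n";
print $octets;
```

Because `$octets` is already a byte string, `length` returns the number of bytes that will actually be sent, whatever the internal storage of `$body` was.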

p5pRT commented Aug 13, 2015

From @cowens

I think you are confused by this line: "As an example, when Perl sees
$x = chr(400), it encodes the character in UTF-8 and stores it in
$x". That could just as easily read "As an example, when Perl sees
$x = chr(255), it encodes the character in Latin1 and stores it in $x".
It doesn't specify that it always encodes it that way. Maybe we
should clean up that language so it isn't confusing.

p5pRT commented Aug 13, 2015

From @Leont

On Thu, Aug 13, 2015 at 11:48 PM, Chas. Owens <chas.owens@gmail.com> wrote:

> The bytes pragma is broken, but not in the way you think. Perl
> encodes strings based on a number of rules I don't fully understand,
> but if I recall correctly, if all of the characters in the string are
> below 255, then it uses Latin1, if there are characters greater than
> 255, it uses UTF-8.

No, it can use utf8 internally even when all of the characters are below
255. Whichever it is using is rather situational, which is why it's such a
mess. The only sane way to handle this is to ignore the internal encoding
altogether, but consistently either decode/encode everything or keep
everything binary. This is orthogonal to the internal encoding.

> This is why the bytes pragma is broken (it isn't
> easy to predict which encoding is being used). You don't necessarily
> get the bytes you are expecting. If you have been using the bytes
> pragma to get the length for HTTP Content-Length, then you have likely
> been sending incorrect sizes.

Indeed.

Leon
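
The point that the internal encoding is situational can be illustrated with two strings that hold the same single character but were built differently. This is a sketch of behaviour expected on typical (non-EBCDIC) builds; the variable names are made up for the example.

```perl
use strict;
use warnings;
require bytes;   # makes bytes::length callable without enabling the pragma

my $plain = "\x{ff}";            # "ÿ" built on its own

my $chopped = "\x{ff}\x{100}";   # contains a character above U+00FF ...
chop $chopped;                   # ... which is then removed again

my $same_text   = $plain eq $chopped ? 1 : 0;
my $plain_bytes = bytes::length($plain);
my $chop_bytes  = bytes::length($chopped);

print "same text:     $same_text\n";     # 1
print "plain bytes:   $plain_bytes\n";   # 1
print "chopped bytes: $chop_bytes\n";    # 2
```

Both strings compare equal and have a character length of 1, yet their internal byte counts differ because of their history, which is why code should not depend on them.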

p5pRT commented Aug 13, 2015

From @cowens

On Thu, Aug 13, 2015 at 6:08 PM, Leon Timmermans <fawaka@gmail.com> wrote:

> On Thu, Aug 13, 2015 at 11:48 PM, Chas. Owens <chas.owens@gmail.com> wrote:
>
>> The bytes pragma is broken, but not in the way you think. Perl
>> encodes strings based on a number of rules I don't fully understand,
>> but if I recall correctly, if all of the characters in the string are
>> below 255, then it uses Latin1, if there are characters greater than
>> 255, it uses UTF-8.
>
> No, it can use utf8 internally even when all of the characters are below
> 255. Whichever it is using is rather situational, which is why it's such a
> mess. The only sane way to handle this is to ignore the internal encoding
> altogether, but consistently either decode/encode everything or keep
> everything binary. This is orthogonal to the internal encoding.

Like I said, I don't fully understand the rules (and I don't have to
if I don't use the bytes pragma). There are all sorts of edge cases​:

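#!/usr/bin/perl

use strict;
use warnings;
use utf8;    # the source file is UTF-8, so "ÿ" below is one character

my $utf8   = "ÿ";        # literal character in the source
my $latin1 = "\x{ff}";   # the same character written as an escape

{
  use bytes;
  print
    "utf8: ",   length $utf8,   "\n",
    "latin1: ", length $latin1, "\n";
}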

p5pRT commented Aug 13, 2015

From @cowens

A new version that attempts to address the confusion of how strings
are encoded. I am not very happy with it (especially because I don't
know all of the rules for which encoding will be used internally), but
it is a good strawman to draw debate over how to phrase it.

=head1 NAME

bytes - Perl pragma to access the individual bytes of characters in strings

=head1 NOTICE

This pragma is no longer recommended for anything other than debugging of how
Perl represents a given string internally. Perl strings can be represented
internally in a number of different encodings, and, therefore, the byte values
may not be the ones you are expecting. For example,

  # this string can be encoded internally with Latin1,
  # so it is two bytes long on most systems
  my $s1 = "\x{ff}\x{ff}";

  # but this string has a character above U+00FF and can't be encoded
  # with Latin1, so it is encoded with UTF-8 and is four bytes long on
  # most systems
  my $s2 = "\x{ff}\x{100}";

A better solution is to create a byte string with an explicit encoding using
the C<encode> function from the L<Encode> module:

  use Encode qw/encode/;

  my $utf8_byte_string   = encode "UTF8",     $string;
  my $latin1_byte_string = encode "Latin1",   $string;
  my $utf16_byte_string  = encode "UTF-16BE", $string;

Or, if performance is needed and you are only interested in the UTF-8
representation, you can use the C<encode> function from the L<utf8> pragma:

  use utf8;

  utf8::encode(my $utf8_byte_string = $string);

If you feel this pragma might be useful for your application, this possibly
indicates a mismatch between your mental model of how perl handles strings and
the current reality. In that case, you may wish to read some of the Perl
Unicode documentation: L<perluniintro>, L<perlunitut>, L<perlunifaq> and
L<perlunicode>.
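
One practical consequence of choosing an explicit encoding, as the draft above recommends, is that characters the encoding cannot represent have to be dealt with. The sketch below reuses the two-character string from the draft and assumes the standard `Encode` CHECK constants `FB_CROAK` and `LEAVE_SRC`; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "\x{ff}\x{100}";   # "ÿ" plus "Ā"; the latter is not in Latin-1

# UTF-16BE uses two bytes for each of these characters.
my $utf16_bytes = encode("UTF-16BE", $string);

# With no CHECK argument, an unmappable character is replaced by a
# substitution character, so information is silently lost.
my $lossy = encode("Latin1", $string);

# With FB_CROAK the failure is reported instead. LEAVE_SRC keeps encode()
# from modifying $string while it works.
my $strict = eval {
    encode("Latin1", $string, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $error = $@;

printf "UTF-16BE bytes: %d\n", length $utf16_bytes;
printf "lossy Latin1 bytes: %d\n", length $lossy;
print  "strict Latin1 encode failed\n" if $error;
```

Whether to substitute or to fail is a design choice for the application; the point is that the decision becomes visible once the encoding is explicit.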

p5pRT commented Aug 14, 2015

From lasse.makholm@gmail.com

On 13 August 2015 at 23:50, Chas. Owens <chas.owens@gmail.com> wrote:

> I think you are confused by this line: "As an example, when Perl sees
> $x = chr(400), it encodes the character in UTF-8 and stores it in
> $x". That could just as easily read "As an example, when Perl sees
> $x = chr(255), it encodes the character in Latin1 and stores it in $x".
> It doesn't specify that it always encodes it that way. Maybe we
> should clean up that language so it isn't confusing.

Spot on. Perhaps some stronger/earlier wording is needed in discouraging
its use. The real explanation is a bit buried in a long sentence at the end
of the first paragraph.

Maybe just something more a la the docs for "local" which basically start
out by saying "Don't use this. Use my instead."

/L

On Thu, Aug 13, 2015 at 5:48 PM, Chas. Owens <chas.owens@gmail.com> wrote:

The bytes pragma is broken, but not in the way you think. Perl
encodes strings based on a number of rules I don't fully understand,
but if I recall correctly, if all of the characters in the string are
below 255, then it uses Latin1, if there are characters greater than
255, it uses UTF-8. This is why the bytes pragma is broken (it isn't
easy to predict which encoding is being used). You don't necessarily
get the bytes you are expecting. If you have been using the bytes
pragma to get the length for HTTP Content-Length, then you have likely
been sending incorrect sizes.

using bytes
ÿ is length 1
ÿĀ is length 4
using utf8::encode
ÿ is length 2
ÿĀ is length 4

#!/usr/bin/perl

use strict;
use warnings;
use utf8;

binmode STDOUT => ":utf8";

my $latin1 = chr(255);
my $utf8 = chr(255) . chr(256);

my ($latin1_length, $utf8_length);
{
    use bytes;
    $latin1_length = length $latin1;
    $utf8_length = length $utf8;
}

print
    "using bytes\n",
    "$latin1 is length $latin1_length\n",
    "$utf8 is length $utf8_length\n";

utf8::encode(my $latin1_byte_string = $latin1);
utf8::encode(my $utf8_byte_string = $utf8);

$latin1_length = length $latin1_byte_string;
$utf8_length = length $utf8_byte_string;

print
    "using utf8::encode\n",
    "$latin1 is length $latin1_length\n",
    "$utf8 is length $utf8_length\n";

Also, if you or your framework hasn't been setting the output
filehandle to UTF-8, then you might have had the right length, but the
wrong encoding:

$ perl -e 'print chr(255)' | wc -c
1
$ perl -C -e 'print chr(255)' | wc -c
2

On Thu, Aug 13, 2015 at 4:46 PM, Lasse Makholm <lasse.makholm@gmail.com> wrote:

Hi,

It seems like bytes is not only deprecated and easily misunderstood but also
broken... (Unless I'm easily misunderstanding it...)

I can't remember ever using bytes for anything except bytes::length($string)
for calculating HTTP Content-Length headers and such... Mostly because it's
easier to type than length Encode blah blah...

The bytes docs explicitly state that the byte strings it operates on are, in
fact, the bytes that would make up the string in UTF-8 encoding. The example
given works for the character (U+0190) in the example:

$x = chr(400);
print "Length is ", length $x, "\n";     # "Length is 1"
printf "Contents are %vd\n", $x;         # "Contents are 400"
{
    use bytes; # or "require bytes; bytes::length()"
    print "Length is ", length $x, "\n"; # "Length is 2"
    printf "Contents are %vd\n", $x;     # "Contents are 198.144"
}

Yields:

Length is 1
Contents are 400
Length is 2
Contents are 198.144

However, using U+00F8 ( chr(248) - "ø" ) instead - not so much:

Length is 1
Contents are 248
Length is 1
Contents are 248

Ouch. I'm guessing there's some code somewhere that mistakenly assumes that
all characters below U+0100 encode as 1 byte in UTF-8... :-(

I'm slightly baffled as to how I've never noticed this before...

This is perl 5, version 18, subversion 2 (v5.18.2) built for darwin-2level

/L

On 13 August 2015 at 16:10, Chas. Owens <chas.owens@gmail.com> wrote:

On Tue, Aug 11, 2015 at 9:35 PM, Karl Williamson <public@khwilliamson.com> wrote:

On 08/06/2015 02:18 AM, Chas. Owens wrote:

On Thu, Aug 6, 2015, 01:15 Tony Cook via RT <perlbug-followup@perl.org> wrote:

On Mon Aug 03 07:05:54 2015, cowens wrote:
> I think the following patch addresses the confusion of what to use
> instead of bytes when you feel the need to access the byte level of a
> string.
>
> diff --git a/lib/bytes.pm b/lib/bytes.pm
> index 6dad41a..77d849d 100644
> --- a/lib/bytes.pm
> +++ b/lib/bytes.pm
> @@ -35,15 +35,24 @@ bytes - Perl pragma to force byte semantics rather than character semantics
>
> =head1 NOTICE
>
> -This pragma reflects early attempts to incorporate Unicode into perl and
> -has since been superseded. It breaks encapsulation (i.e. it exposes the
> -innards of how the perl executable currently happens to store a string),
> -and use of this module for anything other than debugging purposes is
> -strongly discouraged. If you feel that the functions here within might be
> -useful for your application, this possibly indicates a mismatch between
> -your mental model of Perl Unicode and the current reality. In that case,
> -you may wish to read some of the perl Unicode documentation:
> -L<perluniintro>, L<perlunitut>, L<perlunifaq> and L<perlunicode>.
> +This pragma reflects early attempts to incorporate Unicode into perl and has
> +since been superseded by explicit (rather than this pragma's implicit) encoding
> +using the L<Encode> module:
> +
> + use Encode qw/encode/;
> +
> + my $utf8_byte_string = encode "UTF-8", $string;
> + my $latin1_byte_string = encode "Latin1", $string;
> +
> +Because this module breaks encapsulation (i.e. it exposes the innards of how
> +the perl executable currently happens to store a string), the byte values that
> +result are in an unspecified encoding. Use of this module for anything other
> +than debugging purposes is strongly discouraged. If you feel that the
> +functions here within might be useful for your application, this possibly
> +indicates a mismatch between your mental model of Perl Unicode and the current
> +reality. In that case, you may wish to read some of the perl Unicode
> +documentation: L<perluniintro>, L<perlunitut>, L<perlunifaq> and
> +L<perlunicode>.
>
> =head1 SYNOPSIS

It seems like an improvement to me.

Should it mention utf8::encode()?

Tony

---
via perlbug: queue: perl5 status: open
https://rt-archive.perl.org/perl5/Ticket/Display.html?id=125619

At first, I was going to say no, but then I benchmarked them:
utf8::encode is an order of magnitude faster even with the assignment
needed to make it work like Encode::encode. Even factoring out the call
to find_encoding leaves utf8::encode twice as fast as $obj->encode(). It
also handles malformed UTF-8 like the "UTF8" encoding (i.e. it doesn't
replace malformed characters with the replacement character). Here is a
second attempt at a patch.

Consider an approach like this:

 =head1 NAME

 bytes - Perl pragma to access the individual bytes of characters

stressing that this is to be mostly confined to temporary debugging code.
And then later say that this pragma used to be for more things, but don't do
that anymore, as it has been found to be broken. I think that text should
incorporate Chas.' patch.

Here is my rewritten version. Everything following this text is the
same as it currently is. If people like this version I will make a
patch. Otherwise, please suggest further edits and I will try again.

=head1 NAME

bytes - Perl pragma to access the individual bytes of characters in strings

=head1 NOTICE

This pragma is no longer recommended for anything other than debugging
of how Perl represents a given string internally. Perl strings can be
represented internally in a number of different encodings, and, therefore,
the byte values may not be the ones you are expecting.

A better solution is to create a byte string with an explicit encoding using
the C<encode> function from the L<Encode> module:

    use Encode qw/encode/;

    my $utf8_byte_string   = encode "UTF8",   $string;
    my $latin1_byte_string = encode "Latin1", $string;

Or, if performance is needed and you are only interested in the UTF-8
representation, you can use the C<encode> function from the L<utf8> pragma:

    use utf8;

    utf8::encode(my $utf8_byte_string = $string);

If you feel this pragma might be useful for your application, this possibly
indicates a mismatch between your mental model of Perl Unicode and the
current reality. In that case, you may wish to read some of the perl Unicode
documentation: L<perluniintro>, L<perlunitut>, L<perlunifaq> and
L<perlunicode>.

--
Chas. Owens
http​://github.com/cowens
The most important skill a programmer can have is the ability to read.


@p5pRT

p5pRT commented Aug 14, 2015

From lasse.makholm@gmail.com

On 13 August 2015 at 23:48, Chas. Owens <chas.owens@gmail.com> wrote:

The bytes pragma is broken, but not in the way you think. Perl
encodes strings based on a number of rules I don't fully understand,
but if I recall correctly, if all of the characters in the string are
below 255, then it uses Latin1, if there are characters greater than
255, it uses UTF-8. This is why the bytes pragma is broken (it isn't
easy to predict which encoding is being used). You don't necessarily
get the bytes you are expecting.

You are right, of course. And despite having read through most of the perl
Unicode docs, including about how Perl selectively upgrades strings to
utf8, I still managed to get it wrong. As mentioned elsewhere, I'd love to
see words more akin to "STOP! Don't use this!" in the (beginning of the)
docs.

If you have been using the bytes pragma to get the length for HTTP
Content-Length, then you have likely been sending incorrect sizes.

Thankfully my responses are almost without exception JSON blobs encoded in
ASCII, so I'm not overly concerned... :-)

The mistake is real though.

/L
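
For the Content-Length case specifically, a minimal sketch of the safer pattern follows. It assumes the body is sent as UTF-8, and the variable names are illustrative only. The body is encoded explicitly and the length is taken from the encoded bytes, rather than asking bytes::length() about perl's internal storage.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string ("smørrebrød", 10 characters). How perl stores it
# internally is irrelevant to the code below.
my $body = "sm\x{f8}rrebr\x{f8}d";

# Encode explicitly, then measure the bytes that will actually be sent.
my $payload        = encode('UTF-8', $body);
my $content_length = length($payload);

print "Content-Length: $content_length\n";
# prints "Content-Length: 12" (each of the two "ø" takes two bytes in UTF-8)
```

The value to send on the wire is `$payload`, not `$body`, so the header and the bytes written agree by construction.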

@p5pRT

p5pRT commented Aug 14, 2015

From lasse.makholm@gmail.com

On 14 August 2015 at 00:30, Chas. Owens <chas.owens@gmail.com> wrote:

A new version that attempts to address the confusion of how strings
are encoded. I am not very happy with it (especially because I don't
know all of the rules for which encoding will be used internally), but
it is a good strawman to draw debate over how to phrase it.

FWIW, I like this version. It clearly shows why bytes::length() is not what
you want and also what you should do instead. Detailing exactly when and
why Perl upgrades and downgrades strings won't matter to the vast majority
of users.

The "use utf8;" statement should probably be removed though. According to
the utf8 docs:

Do not use this pragma for anything else than telling Perl that your
script is written in UTF-8. The utility functions described below are
directly usable without "use utf8;".

And indeed, you can call utf8::encode() without use'ing utf8 first.

/L
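
A short sketch of that last point, with illustrative variable names: the script below never says "use utf8;" (its source is plain ASCII), yet utf8::encode() is available. Since utf8::encode() modifies its argument in place, the string is copied first.

```perl
use strict;
use warnings;
# Note: no "use utf8;" here. The utf8:: utility functions are always available.

my $string = chr(255) . chr(256);    # two characters: U+00FF and U+0100

# utf8::encode works in place, so copy first to keep the original intact.
my $utf8_byte_string = $string;
utf8::encode($utf8_byte_string);

printf "%d characters, %d bytes\n", length($string), length($utf8_byte_string);
# prints "2 characters, 4 bytes"
```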

@p5pRT

p5pRT commented Aug 14, 2015

From @ikegami

On Thu, Aug 13, 2015 at 10:10 AM, Chas. Owens <chas.owens@gmail.com> wrote:

Or, if performance is needed and you are only interested in the UTF-8
representation, you can use the C<encode> function from the L<utf8> pragma:

use utf8;

utf8::encode(my $utf8_byte_string = $string);

Sorry if this was already mentioned, but that should be

  utf8::encode(my $utf8_byte_string = $string);

without

  use utf8;

"use utf8;" is a pragma that indicates the source code is encoded using
UTF-8, and does not control access to utf8::*.

@p5pRT

p5pRT commented Nov 11, 2015

From @tonycoz

On Thu Aug 13 15:31:03 2015, cowens wrote:

A new version that attempts to address the confusion of how strings
are encoded. I am not very happy with it (especially because I don't
know all of the rules for which encoding will be used internally), but
it is a good strawman to draw debate over how to phrase it.

If the string has code points over 0xff, it's encoded as perl's extended UTF-8; otherwise it could be encoded either way. For example, Encode::decode() always (I'm unaware of any exceptions) returns a UTF-8 encoded string, even if all of the characters are between 0 and 0xff:

tony@mars:.../git/perl$ ./perl -Ilib -MDevel::Peek -MEncode -e '$x = decode("latin1", " "); Dump($x)'
SV = PV(0x18f6830) at 0x19150e8
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x1a11e70 " "\0 [UTF8 " "]
  CUR = 1
  LEN = 10
tony@mars:.../git/perl$ ./perl -Ilib -MDevel::Peek -MEncode -e '$x = decode("UTF-8", "\303\277"); Dump($x)'
SV = PV(0x1cb8830) at 0x1cd70f8
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x1dd3340 "\303\277"\0 [UTF8 "\x{ff}"]
  CUR = 2
  LEN = 10

Also some string literals under use utf8:

tony@mars:.../git/perl$ ./perl -Ilib -MDevel::Peek -Mutf8 -e '$x = "ÿ"; Dump($x)'
SV = PV(0x28e6d70) at 0x29060d8
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  PV = 0x2904380 "\303\277"\0 [UTF8 "\x{ff}"]
  CUR = 2
  LEN = 10
  COW_REFCNT = 1
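
The same effect can be seen without Devel::Peek. The sketch below (variable names are illustrative) forces one string into each internal form with utf8::downgrade() and utf8::upgrade(). The two strings compare equal and have the same character length; only the internal byte count, which is what the bytes pragma exposes, differs.

```perl
use strict;
use warnings;

my $downgraded = "\x{ff}";
my $upgraded   = "\x{ff}";
utf8::downgrade($downgraded);   # force the one-byte-per-character form
utf8::upgrade($upgraded);       # force perl's internal UTF-8 form

my $same_string = $downgraded eq $upgraded ? 1 : 0;
my $flag_down   = utf8::is_utf8($downgraded) ? 1 : 0;
my $flag_up     = utf8::is_utf8($upgraded)   ? 1 : 0;

require bytes;   # load the bytes:: functions without enabling the pragma
my $raw_down = bytes::length($downgraded);
my $raw_up   = bytes::length($upgraded);

print "equal: $same_string\n";                       # equal: 1
print "UTF8 flag: $flag_down vs $flag_up\n";         # UTF8 flag: 0 vs 1
print "char length: ", length($downgraded), " vs ", length($upgraded), "\n";  # 1 vs 1
print "internal bytes: $raw_down vs $raw_up\n";      # internal bytes: 1 vs 2
```

This is why bytes::length() is only suitable for debugging: its answer describes the storage, not the string.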

...

Or, if performance is needed and you are only interested in the UTF-8
representation, you can use the C<encode> function from the L<utf8>
pragma:

use utf8;

utf8::encode(my $utf8_byte_string = $string);

As others have said, the "use utf8;" isn't needed.

Tony

@p5pRT

p5pRT commented Oct 16, 2017

From @tonycoz

On Fri, 14 Aug 2015 11:28:28 -0700, ikegami@adaelis.com wrote:

On Thu, Aug 13, 2015 at 10:10 AM, Chas. Owens <chas.owens@gmail.com> wrote:

Or, if performance is needed and you are only interested in the UTF-8
representation, you can use the C<encode> function from the L<utf8>
pragma:

use utf8;

utf8::encode(my $utf8_byte_string = $string);

Sorry if this was already mentioned, but that should be

utf8::encode(my $utf8_byte_string = $string);

without

use utf8;

"use utf8;" is a pragma that indicates the source code is encoded using
UTF-8, and does not control access to utf8::*.

Is there anything else we want to do for this ticket?

The only patch (with some extras and changes) was applied as:

commit 01e331e
Author: Karl Williamson <khw@cpan.org>
Date: Sun Aug 9 21:40:21 2015 -0600

  Update bytes.pm doc
 
  The one legitimate use of this pragma is for debugging. This changes to
  say so, and other minor changes.

(though it retained the unneeded C<use utf8;>.)

Tony

@p5pRT

p5pRT commented Apr 3, 2018

From @khwilliamson

No one replied to this; I'm unclear if it was sent to all the concerned parties. The latest Encode running on blead is much faster than before. I'm fine with leaving the wording as it is now, but before closing I want to make sure that others don't have objections. If I don't hear any within 30 days, I will close the ticket.

--
Karl Williamson

@p5pRT

p5pRT commented Apr 4, 2018

From @iabyn

On Tue, Apr 03, 2018 at 09:38:50AM -0700, Karl Williamson via RT wrote:

No one replied to this; I'm unclear if it was sent to all the concerned
parties. The latest Encode running on blead is much faster than before.
I'm fine with leaving the wording as it is now, but before closing I
want to make sure that others don't have objections. If I don't hear
any within 30 days, I will close the ticket.

With the just-pushed v5.27.10-105-g0d372decae, I've removed the spurious
'use utf8' from the example, as suggested earlier in the ticket.
I'm happy for the ticket to be closed.

--
Spock (or Data) is fired from his high-ranking position for not being able
to understand the most basic nuances of about one in three sentences that
anyone says to him.
  -- Things That Never Happen in "Star Trek" #19

@p5pRT

p5pRT commented Apr 15, 2018

From @khwilliamson

The deadline for 5.28 is fast upon us, and since no objections have been raised, I'm now closing this ticket.
--
Karl Williamson

@p5pRT

p5pRT commented Apr 15, 2018

@khwilliamson - Status changed from 'open' to 'pending release'

@p5pRT

p5pRT commented Jun 23, 2018

From @khwilliamson

Thank you for filing this report. You have helped make Perl better.

With the release yesterday of Perl 5.28.0, this and 185 other issues have been
resolved.

Perl 5.28.0 may be downloaded via:
https://metacpan.org/release/XSAWYERX/perl-5.28.0

If you find that the problem persists, feel free to reopen this ticket.

@p5pRT

p5pRT commented Jun 23, 2018

@khwilliamson - Status changed from 'pending release' to 'resolved'

@p5pRT p5pRT closed this as completed Jun 23, 2018