Report information
Id: 125619
Status: pending release
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors: the.rob.dixon [at] gmail.com
Cc:
AdminCc:

Operating System: (no value)
PatchStatus: HasPatch
Severity: low
Type: unknown
Perl Version: (no value)
Fixed In: (no value)



Date: Wed, 15 Jul 2015 21:19:09 +0100
From: Rob Dixon <the.rob.dixon [...] gmail.com>
To: perlbug [...] perl.org
Subject: Documentation of byte I/O
This is a bug report for perl from the.rob.dixon@gmail.com,
generated with the help of perlbug 1.40 running under perl 5.22.0.

-----------------------------------------------------------------

I recently read "perldoc bytes" and saw

  This pragma reflects early attempts to incorporate Unicode into
  perl (sic.) and has since been superseded

I think it is a mistake to fail to say /what/ has superseded it, and
the perluni* pods aren't something I should expect to look at if I'm
not using Unicode

In any statement that deprecates "bytes", I think something should
be said about "utf8", which seems to be its dual but is not

In short, I believe the perluni* should have been
"perlcharacterencoding", as there is no "perlascii" or "perllatin1"

The journey into the promised land of Unicode has been arduous
enough as it is, and the perluni* pods are a major achievement, so I
don't really expect all of that to be ripped apart on my whim. But
perhaps we could do with something like my "perlcharacterencoding"
as an entry point to all of the above?

As you can tell, these are infant musings without any coherent plan.
But I have recently been asked how to enable byte semantics on a
Perl input stream, and I was troubled to find that I was neither
certain nor able to locate the documentation that told me. I suspect
the answer is that file handles have byte semantics by default, and
can be opened that way explicitly with a mode of :raw or using
binmode, but I am far from sure

Or perhaps I am just asking for a better index into the perldoc tomes

Thank you for reading

-----------------------------------------------------------------
---
Flags:
    category=docs
    severity=wishlist
---
Site configuration information for perl 5.22.0:

Configured by strawberry-perl at Mon Jun  1 20:06:45 2015.

Summary of my perl5 (revision 5 version 22 subversion 0) configuration:

  Platform:
    osname=MSWin32, osvers=6.3, archname=MSWin32-x86-multi-thread-64int
    uname='Win32 strawberry-perl 5.22.0.1 #1 Mon Jun  1 20:04:50 2015 i386'
    config_args='undef'
    hint=recommended, useposix=true, d_sigaction=undef
    useithreads=define, usemultiplicity=define
    use64bitint=define, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags =' -s -O2 -DWIN32 -DPERL_TEXTMODE_SCRIPTS -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -fwrapv -fno-strict-aliasing -mms-bitfields',
    optimize='-s -O2',
    cppflags='-DWIN32'
    ccversion='', gccversion='4.9.2', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678, doublekind=3
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=8, longdblkind=3
    ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='long long', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='g++', ldflags ='-s -L"C:\STRAWB~1\perl\lib\CORE" -L"C:\STRAWB~1\c\lib"'
    libpth=C:\STRAWB~1\c\lib C:\STRAWB~1\c\i686-w64-mingw32\lib C:\STRAWB~1\c\lib\gcc\i686-w64-mingw32\4.9.2
    libs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
    perllibs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
    libc=, so=dll, useshrplib=true, libperl=libperl522.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs, dlext=xs.dll, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-mdll -s -L"C:\STRAWB~1\perl\lib\CORE" -L"C:\STRAWB~1\c\lib"'

---
@INC for perl 5.22.0:
    C:/Strawberry/perl/site/lib/MSWin32-x86-multi-thread-64int
    C:/Strawberry/perl/site/lib
    C:/Strawberry/perl/vendor/lib
    C:/Strawberry/perl/lib
    .

---
Environment for perl 5.22.0:
    HOME (unset)
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=C:\Python27\;C:\Python27\Scripts;C:\Program Files (x86)\Common Files\Intel\Shared Files\cpp\bin\Intel64;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Strawberry\c\bin;C:\Strawberry\perl\site\bin;C:\Strawberry\perl\bin;C:\ffmpeg\bin;C:\Program Files (x86)\Git\cmd;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;E:\Perl\source;C:\Program Files (x86)\EaseUS\Todo Backup\bin\x64\
    PERL_BADLANG (unset)
    SHELL (unset)

From: "Chas. Owens" <chas.owens [...] gmail.com>
Subject: Re: [perl #125619] Documentation of byte I/O
To: perl5-porters [...] perl.org, bugs-bitbucket [...] rt.perl.org
Date: Thu, 16 Jul 2015 14:48:32 +0000
Would it be correct to say that explicit encoding using the Encode module has replaced the bytes pragma's implicit treatment of strings as bytes of unknown encoding?

expected result of the bytes pragma:
2 [0xc3 0xa9]
unexpected results when string is not in the expected encoding
1 [0xe9]
explicit encoding yields the expected results
2 [0xc3 0xa9]
even when the string isn't in the expected encoding
2 [0xc3 0xa9]


#!/usr/bin/perl

use strict;
use warnings;
use utf8;

use Encode qw/encode/;

my $utf8 = "é";

print "expected result of the bytes pragma:\n";
{
    use bytes;

    my $length = length $utf8;
    my @bytes  = map { sprintf "0x%02x", ord } split //, $utf8;

    print "$length [@bytes]\n";
}

print "unexpected results when string is not in the expected encoding\n";
my $latin1 = encode("Latin1", $utf8);
{
    use bytes;

    my $length = length $latin1;
    my @bytes  = map { sprintf "0x%02x", ord } split //, $latin1;

    print "$length [@bytes]\n";
}

print "explicit encoding yields the expected results\n";
{
    my $raw    = encode('UTF-8', $utf8);
    my $length = length $raw;
    my @bytes  = map { sprintf "0x%02x", ord } split //, $raw;

    print "$length [@bytes]\n";
}

print "even when the string isn't in the expected encoding\n";
{
    my $raw    = encode('UTF-8', $latin1);
    my $length = length $raw;
    my @bytes  = map { sprintf "0x%02x", ord } split //, $raw;

    print "$length [@bytes]\n";
}



Subject: Re: [perl #125619] Documentation of byte I/O
To: perl5-porters [...] perl.org
From: Aristotle Pagaltzis <pagaltzis [...] gmx.de>
Date: Wed, 29 Jul 2015 13:09:11 +0200
* Rob Dixon <perlbug-followup@perl.org> [2015-07-15 22:20]:
> I recently read "perldoc bytes" and saw
>
>   This pragma reflects early attempts to incorporate Unicode into
>   perl (sic.) and has since been superseded
>
> I think it is a mistake to fail to say /what/ has superseded it, and
> the perluni* pods aren't something I should expect to look at if I'm
> not using Unicode

It has been superseded by nothing. I/O is in terms of bytes by default.
If you are not using Unicode, you do not need to do anything special.
If, however, you are using code that does at some point decode your
bytes into text, then you have to re-encode it appropriately yourself;
you never could just say `use bytes` and make its decodedness magically
go away.

> In any statement that deprecates "bytes", I think something should
> be said about "utf8", which seems to be its dual but is not

Please explain. FWIW, `use utf8` has a *completely* different purpose
than `use bytes`.

> In short, I believe the perluni* should have been
> "perlcharacterencoding", as there is no "perlascii" or "perllatin1"
>
> The journey into the promised land of Unicode has been arduous
> enough as it is, and the perluni* pods are a major achievement, so I
> don't really expect all of that to be ripped apart on my whim. But
> perhaps we could do with something like my "perlcharacterencoding"
> as an entry point to all of the above?

Somehow I don’t think the documentation is going to get less confusing
if we keep adding more documents to it…

> As you can tell, these are infant musings without any coherent plan.
> But I have recently been asked how to enable byte semantics on a
> Perl input stream, and I was troubled to find that I was neither
> certain nor able to locate the documentation that told me. I suspect
> the answer is that file handles have byte semantics by default

Correct.

> and can be opened that way explicitly with a mode of :raw or using
> binmode

That is necessary on Windows only because the :crlf layer is added to
filehandles by default there. (And maybe other platforms, but none of
them are really relevant in practice.) It is also necessary if you want
to turn off decoding on a filehandle after the fact, but if you ask me,
you should change your code to avoid the need to do that – the semantics
of doing it are too murky.

> but I am far from sure
>
> Or perhaps I am just asking for a better index into the perldoc tomes
>
> Thank you for reading

Well. The documentation is sprawling and has no coherent organisation,
basically because it has grown by the same principle as is behind your
proposal: someone needed (it) to explain a particular topic, so they
wrote a document or two to explain that topic. But coherent docs do not
happen by that method.

What this produces is documentation that is great so long as you read
each document by itself top to bottom, and you take the time out to sit
down and read every last document, one by one.

If the documentation is supposed to be more penetrable then it must have
some topical organisation, which requires editing by someone with a good
idea of what topics are covered where, who can then decide how some new
topic should be broken up to fit into the existing structure and/or how
the structure needs to be shifted around to make room for the new topic.

In short, you need an architect. We don’t have one. In fact we are
probably even further away from having one for the docs than we are from
one for the interpreter.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
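
For reference, the points made in this reply (I/O is in bytes by default, :raw or binmode keeps the :crlf layer out of the way, and decoding is a separate explicit step) can be sketched roughly as follows. This is only an illustration: the sample bytes and the temporary file are made up, and nothing here is taken from the perl documentation itself.

```perl
use strict;
use warnings;
use Encode qw/decode/;
use File::Temp qw/tempfile/;

# Hypothetical sample data: the UTF-8 encoding of "é" followed by CRLF.
my $octets = "\xC3\xA9\x0D\x0A";

# Write the bytes untouched; binmode with no layer behaves like ':raw'.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} $octets;
close $out or die "close failed: $!";

# ':raw' keeps the :crlf layer (the default on Windows) off the handle,
# so the four bytes come back exactly as they were written.
open my $in, '<:raw', $path or die "cannot open $path: $!";
my $data = do { local $/; <$in> };
close $in;

printf "read %d bytes\n", length $data;             # read 4 bytes

# Turning bytes into characters is done explicitly, not by the handle.
my $text = decode('UTF-8', $data);
printf "decoded to %d characters\n", length $text;  # decoded to 3 characters
```

Opening with an encoding layer such as '<:encoding(UTF-8)' would instead decode while reading; the sketch keeps the two steps apart because that is the distinction the reply draws.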
Date: Mon, 3 Aug 2015 10:04:49 -0400
From: "Chas. Owens" <chas.owens [...] gmail.com>
To: Perl 5 Porters <perl5-porters [...] perl.org>, bugs-bitbucket [...] rt.perl.org
Subject: Re: [perl #125619] Documentation of byte I/O
I think the following patch addresses the confusion of what to use
instead of bytes when you feel the need to access the byte level of a
string.

diff --git a/lib/bytes.pm b/lib/bytes.pm
index 6dad41a..77d849d 100644
--- a/lib/bytes.pm
+++ b/lib/bytes.pm
@@ -35,15 +35,24 @@ bytes - Perl pragma to force byte semantics rather than character semantics

 =head1 NOTICE

-This pragma reflects early attempts to incorporate Unicode into perl and
-has since been superseded. It breaks encapsulation (i.e. it exposes the
-innards of how the perl executable currently happens to store a string),
-and use of this module for anything other than debugging purposes is
-strongly discouraged. If you feel that the functions here within might be
-useful for your application, this possibly indicates a mismatch between
-your mental model of Perl Unicode and the current reality. In that case,
-you may wish to read some of the perl Unicode documentation:
-L<perluniintro>, L<perlunitut>, L<perlunifaq> and L<perlunicode>.
+This pragma reflects early attempts to incorporate Unicode into perl and has
+since been superseded by explicit (rather than this pragma's implict) encoding
+using the L<Encode> module:
+
+    use Encode qw/encode/;
+
+    my $utf8_byte_string   = encode "UTF-8",  $string;
+    my $latin1_byte_string = encode "Latin1", $string;
+
+Because this module breaks encapsulation (i.e. it exposes the innards of how
+the perl executable currently happens to store a string), the byte values that
+result are in an unspecified encoding.  Use of this module for anything other
+than debugging purposes is strongly discouraged.  If you feel that the
+functions here within might be useful for your application, this possibly
+indicates a mismatch between your mental model of Perl Unicode and the current
+reality. In that case, you may wish to read some of the perl Unicode
+documentation: L<perluniintro>, L<perlunitut>, L<perlunifaq> and
+L<perlunicode>.

 =head1 SYNOPSIS

Chas. Owens
http://github.com/cowens
The most important skill a programmer can have is the ability to read.
RT-Send-CC: perl5-porters [...] perl.org
It seems like an improvement to me.

Should it mention utf8::encode()?

Tony
Date: Thu, 6 Aug 2015 04:18:23 -0400
CC: Perl 5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #125619] Documentation of byte I/O
To: "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>
From: "Chas. Owens" <chas.owens [...] gmail.com>

On Thu, Aug 6, 2015, 01:15 Tony Cook via RT <perlbug-followup@perl.org> wrote:

> It seems like an improvement to me.
>
> Should it mention utf8::encode()?
>
> Tony


At first, I was going to say no, but then I benchmarked them: utf8::encode is an order of magnitude faster even with the assignment needed to make it work like Encode::encode.  Even factoring out the call to find_encoding leaves utf8::encode twice as fast as $obj->encode().  It also handles malformed UTF-8 like the "UTF8" encoding (i.e. it doesn't replace malformed characters with the replacement character).  Here is a second attempt at a patch.
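
To make "the assignment needed to make it work like Encode::encode" concrete, here is a small sketch of the two calls side by side. The sample string is made up for illustration, and the sketch only checks that the results agree for well-formed input; it does not reproduce the benchmark.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample: "é" (U+00E9) and U+0190, written as escapes so the
# example does not depend on the encoding of the source file.
my $string = "\x{E9}\x{190}";

# Encode::encode returns a new byte string and leaves $string alone.
my $via_encode = Encode::encode('UTF-8', $string);

# utf8::encode converts its argument in place, hence the copy made by the
# assignment inside the call.
utf8::encode(my $via_utf8 = $string);

printf "%d bytes, identical: %s\n",
    length $via_utf8, ($via_encode eq $via_utf8 ? 'yes' : 'no');
# 4 bytes, identical: yes
```

Timings will vary with perl version and input; the core Benchmark module is one way to compare the two on a given machine.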




Download bytes.patch
text/plain 2.1k

Message body is not shown because sender requested not to inline it.

From: Karl Williamson <public [...] khwilliamson.com>
To: "Chas. Owens" <chas.owens [...] gmail.com>, "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>
Subject: Re: [perl #125619] Documentation of byte I/O
CC: Perl 5 Porters <perl5-porters [...] perl.org>
Date: Tue, 11 Aug 2015 19:35:22 -0600
Consider taking an approach like this:

=head1 NAME

bytes - Perl pragma to access the individual bytes of characters

stressing that this is to be mostly confined to temporary debugging
code. And then later say that this pragma used to be for more things,
but don't do that anymore, as it has been found to be broken. I think
that text should incorporate Chas.' patch.
From: "Chas. Owens" <chas.owens [...] gmail.com>
To: Karl Williamson <public [...] khwilliamson.com>
CC: "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>, Perl 5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #125619] Documentation of byte I/O
Date: Thu, 13 Aug 2015 10:10:23 -0400
Here is my rewritten version.  Everything following this text is the
same as it currently is.  If people like this version I will make a
patch.  Otherwise, please suggest further edits and I will try again.

=head1 NAME

bytes - Perl pragma to access the individual bytes of characters in strings

=head1 NOTICE

This pragma is no longer recommended for anything other than debugging of how
Perl represents a given string internally.  Perl strings can be represented
internally in a number of different encodings, and, therefore, the byte values
may not be the ones you are expecting.  A better solution is to create a byte
string with an explicit encoding using the C<encode> function from the
L<Encode> module:

    use Encode qw/encode/;

    my $utf8_byte_string   = encode "UTF8",   $string;
    my $latin1_byte_string = encode "Latin1", $string;

Or, if performance is needed and you are only interested in the UTF-8
representation, you can use the C<encode> function from the L<utf8> pragma:

    use utf8;

    utf8::encode(my $utf8_byte_string = $string);

If you feel this pragma might be useful for your application, this possibly
indicates a mismatch between your mental model of Perl Unicode and the current
reality.  In that case, you may wish to read some of the perl Unicode
documentation: L<perluniintro>, L<perlunitut>, L<perlunifaq> and
L<perlunicode>.

-- 
Chas. Owens
http://github.com/cowens
The most important skill a programmer can have is the ability to read.
Date: Thu, 13 Aug 2015 22:46:05 +0200
To: "Chas. Owens" <chas.owens [...] gmail.com>
Subject: Re: [perl #125619] Documentation of byte I/O
CC: Karl Williamson <public [...] khwilliamson.com>, "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>, Perl 5 Porters <perl5-porters [...] perl.org>
From: Lasse Makholm <lasse.makholm [...] gmail.com>
Hi,

It seems like bytes is not only deprecated and easily misunderstood but also broken... (Unless I'm easily misunderstanding it...)

I can't remember ever using bytes for anything except bytes::length($string) for calculating HTTP Content-Length headers and such... Mostly because it's easier to type than length Encode blah blah...

The bytes docs explicitly state that the byte strings it operates on are, in fact, the bytes that would make up the string in UTF-8 encoding. The example given works for the character (U+0190) in the example:

    $x = chr(400);
    print "Length is ", length $x, "\n";     # "Length is 1"
    printf "Contents are %vd\n", $x;         # "Contents are 400"
    {
        use bytes; # or "require bytes; bytes::length()"
        print "Length is ", length $x, "\n"; # "Length is 2"
        printf "Contents are %vd\n", $x;     # "Contents are 198.144"
    }

Yields:

Length is 1
Contents are 400
Length is 2
Contents are 198.144

However, using U+00F8 ( chr(248) - "ø" ) instead - not so much:

Length is 1
Contents are 248
Length is 1
Contents are 248

Ouch. I'm guessing there's some code somewhere that mistakenly assumes that all characters below U+0100 encode as 1 byte in UTF-8... :-(

I'm slightly baffled as to how I've never noticed this before...

This is perl 5, version 18, subversion 2 (v5.18.2) built for darwin-2level

/L



CC: "Chas. Owens" <chas.owens [...] gmail.com>, Karl Williamson <public [...] khwilliamson.com>, "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>, Perl 5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #125619] Documentation of byte I/O
Date: Thu, 13 Aug 2015 16:50:49 -0400
To: Lasse Makholm <lasse.makholm [...] gmail.com>
From: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
* Lasse Makholm <lasse.makholm@gmail.com> [2015-08-13T16:46:05]
> The bytes docs explicitly state that the byte strings it operates on are,
> in fact, the bytes that would make up the string in UTF-8 encoding.  The
> example given works for the character (U+0190) in the example:

The perl runtime stores strings in memory in either Type-A or Type-B format.

In Type-A format, it is an array of chars, and each char is the value of the
character at that position.

In Type-B format, it is an array of chars forming a valid UTF-8 sequence.  To
know the character at a given position, you must decode.

"use bytes" makes things operate on the underlying "array of chars" without
reference to whether the storage is Type-A or Type-B, which can be determined
through other means.

--
rjbs
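The two storage formats described above can be observed from Perl itself. What follows is only a minimal sketch, assuming a stock perl 5 with the core bytes module and the always-available utf8::is_utf8 function; the helper name describe_storage is made up for illustration and is not part of any module.

```perl
use strict;
use warnings;
require bytes;    # makes bytes::length available without enabling the pragma

# Hypothetical helper: report how perl currently stores a string.
sub describe_storage {
    my ($str) = @_;
    return {
        chars   => length($str),                   # logical characters
        octets  => bytes::length($str),            # octets in the internal buffer
        is_utf8 => utf8::is_utf8($str) ? 1 : 0,    # 1 when stored as UTF-8
    };
}

for my $code (248, 400) {
    my $info = describe_storage(chr $code);
    printf "chr(%d): chars=%d octets=%d utf8-flag=%d\n",
        $code, @{$info}{qw(chars octets is_utf8)};
}
```

On a typical build, chr(248) is reported as one octet with the flag off and chr(400) as two octets with the flag on, which matches the behaviour reported earlier in the thread. As with the pragma itself, this is a debugging aid rather than something to build application logic on.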

Date: Thu, 13 Aug 2015 17:48:13 -0400
From: "Chas. Owens" <chas.owens [...] gmail.com>
Subject: Re: [perl #125619] Documentation of byte I/O
CC: Karl Williamson <public [...] khwilliamson.com>, "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>, Perl 5 Porters <perl5-porters [...] perl.org>
To: Lasse Makholm <lasse.makholm [...] gmail.com>
The bytes pragma is broken, but not in the way you think.  Perl
encodes strings based on a number of rules I don't fully understand,
but if I recall correctly, if all of the characters in the string are
below 255, then it uses Latin1, if there are characters greater than
255, it uses UTF-8.  This is why the bytes pragma is broken (it isn't
easy to predict which encoding is being used).  You don't necessarily
get the bytes you are expecting.  If you have been using the bytes
pragma to get the length for HTTP Content-Length, then you have likely
been sending incorrect sizes.

    using bytes
    ÿ is length 1
    ÿĀ is length 4
    using utf8::encode
    ÿ is length 2
    ÿĀ is length 4

    #!/usr/bin/perl

    use strict;
    use warnings;
    use utf8;

    binmode STDOUT => ":utf8";

    my $latin1 = chr(255);
    my $utf8   = chr(255) . chr(256);

    my ($latin1_length, $utf8_length);
    {
        use bytes;
        $latin1_length = length $latin1;
        $utf8_length   = length $utf8;
    }

    print
        "using bytes\n",
        "$latin1 is length $latin1_length\n",
        "$utf8 is length $utf8_length\n";

    utf8::encode(my $latin1_byte_string = $latin1);
    utf8::encode(my $utf8_byte_string   = $utf8);

    $latin1_length = length $latin1_byte_string;
    $utf8_length   = length $utf8_byte_string;

    print
        "using utf8::encode\n",
        "$latin1 is length $latin1_length\n",
        "$utf8 is length $utf8_length\n";

Also, if you or your framework hasn't been setting the output
filehandle to UTF-8, then you might have had the right length, but the
wrong encoding:

    $ perl -e 'print chr(255)' | wc -c
    1
    $ perl -C -e 'print chr(255)' | wc -c
    2

-- Chas. Owens http://github.com/cowens The most important skill a programmer can have is the ability to read.
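
For the Content-Length case discussed above, one way to avoid depending on the internal representation is to encode the body explicitly and measure the resulting octets. The following is a rough sketch, assuming the body is meant to go out as UTF-8; the helper name body_with_length is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a character string for the wire and
# return the octets together with their count.
sub body_with_length {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);    # explicit, predictable encoding
    return ($octets, length $octets);       # length now counts octets
}

my ($octets, $content_length) = body_with_length(chr(255) . chr(256));
print "Content-Length: $content_length\n";
```

The count is taken from the same octet string that would be written to the socket, so the header and the payload cannot disagree, whatever perl happened to do internally with the original string.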
Date: Thu, 13 Aug 2015 17:50:41 -0400
To: Lasse Makholm <lasse.makholm [...] gmail.com>
Subject: Re: [perl #125619] Documentation of byte I/O
CC: Karl Williamson <public [...] khwilliamson.com>, "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>, Perl 5 Porters <perl5-porters [...] perl.org>
From: "Chas. Owens" <chas.owens [...] gmail.com>
I think you are confused by this line: "As an example, when Perl sees
$x = chr(400), it encodes the character in UTF-8 and stores it in $x".
That could just as easily read "As an example, when Perl sees
$x = chr(255), it encodes the character in Latin1 and stores it in $x".
It doesn't specify that it always encodes it that way.  Maybe we should
clean up that language so it isn't confusing.

-- Chas. Owens http://github.com/cowens The most important skill a programmer can have is the ability to read.
Date: Fri, 14 Aug 2015 00:08:42 +0200
From: Leon Timmermans <fawaka [...] gmail.com>
Subject: Re: [perl #125619] Documentation of byte I/O
CC: Lasse Makholm <lasse.makholm [...] gmail.com>, Karl Williamson <public [...] khwilliamson.com>, "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>, Perl 5 Porters <perl5-porters [...] perl.org>
To: "Chas. Owens" <chas.owens [...] gmail.com>
On Thu, Aug 13, 2015 at 11:48 PM, Chas. Owens <chas.owens@gmail.com> wrote:
> The bytes pragma is broken, but not in the way you think.  Perl
> encodes strings based on a number of rules I don't fully understand,
> but if I recall correctly, if all of the characters in the string are
> below 255, then it uses Latin1, if there are characters greater than
> 255, it uses UTF-8.

No, it can use utf8 internally even when all of the characters are below
255. Whichever it is using is rather situational, which is why it's such a
mess. The only sane way to handle this is to ignore the internal encoding
altogether, but consistently either decode/encode everything or keep
everything binary. This is orthogonal to the internal encoding.

> This is why the bytes pragma is broken (it isn't
> easy to predict which encoding is being used).  You don't necessarily
> get the bytes you are expecting.  If you have been using the bytes
> pragma to get the length for HTTP Content-Length, then you have likely
> been sending incorrect sizes.

Indeed.

Leon
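
The "decode/encode everything" discipline mentioned above might look roughly like the sketch below: octets are decoded on the way in, the program works on characters only, and the result is encoded on the way out. This assumes UTF-8 at both boundaries; the function name shout is purely illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical round trip: octets in, characters inside, octets out.
sub shout {
    my ($input_octets) = @_;
    my $text = decode('UTF-8', $input_octets);    # boundary: octets -> characters
    $text = uc $text;                             # work on characters only
    return encode('UTF-8', $text);                # boundary: characters -> octets
}

my $in  = "\xC3\xB8";    # the two octets of UTF-8 encoded U+00F8
my $out = shout($in);
printf "%vd\n", $out;    # the octets of the encoded result
```

At no point does the code ask how perl stores $text internally, which is the point being made: the internal format only matters if something like the bytes pragma looks at it.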

Date: Thu, 13 Aug 2015 18:25:27 -0400
From: "Chas. Owens" <chas.owens [...] gmail.com>
To: Leon Timmermans <fawaka [...] gmail.com>
Subject: Re: [perl #125619] Documentation of byte I/O
CC: Lasse Makholm <lasse.makholm [...] gmail.com>, Karl Williamson <public [...] khwilliamson.com>, "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>, Perl 5 Porters <perl5-porters [...] perl.org>
On Thu, Aug 13, 2015 at 6:08 PM, Leon Timmermans <fawaka@gmail.com> wrote:
> On Thu, Aug 13, 2015 at 11:48 PM, Chas. Owens <chas.owens@gmail.com> wrote:
>>
>> The bytes pragma is broken, but not in the way you think.  Perl
>> encodes strings based on a number of rules I don't fully understand,
>> but if I recall correctly, if all of the characters in the string are
>> below 255, then it uses Latin1, if there are characters greater than
>> 255, it uses UTF-8.
>
> No, it can use utf8 internally even when all of the characters are below
> 255. Whichever it is using is rather situational, which is why it's such a
> mess. The only sane way to handle this is to ignore the internal encoding
> altogether, but consistently either decode/encode everything or keep
> everything binary. This is orthogonal to the internal encoding.
>

Like I said, I don't fully understand the rules (and I don't have to
if I don't use the bytes pragma).  There are all sorts of edge cases:

    #!/usr/bin/perl

    use strict;
    use warnings;
    use utf8;

    my $utf8   = "ÿ";
    my $latin1 = "\x{ff}";

    {
        use bytes;
        print
            "utf8: ", length $utf8, "\n",
            "latin1: ", length $latin1, "\n";
    }

--
Chas. Owens
http://github.com/cowens
The most important skill a programmer can have is the ability to read.
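
The same kind of edge case can be reproduced without relying on how source literals are parsed. This is a small sketch that uses the always-available utf8::upgrade function to change only the internal storage of a string; the helper name internal_octets is an assumption made for the example.

```perl
use strict;
use warnings;
require bytes;    # makes bytes::length available without enabling the pragma

# Hypothetical helper: octets in perl's internal buffer for a string.
sub internal_octets { return bytes::length($_[0]) }

my $downgraded = "\x{ff}";    # stored as a single octet
my $upgraded   = "\x{ff}";
utf8::upgrade($upgraded);     # same character, now stored as UTF-8

print "equal as strings\n" if $downgraded eq $upgraded;
printf "internal octets: %d vs %d\n",
    internal_octets($downgraded), internal_octets($upgraded);
```

The two variables compare equal and have the same character length, yet the byte-level view differs, so any value derived from that view depends on the string's history rather than its content.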
To: Leon Timmermans <fawaka [...] gmail.com>
Subject: Re: [perl #125619] Documentation of byte I/O
CC: Lasse Makholm <lasse.makholm [...] gmail.com>, Karl Williamson <public [...] khwilliamson.com>, "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>, Perl 5 Porters <perl5-porters [...] perl.org>
From: "Chas. Owens" <chas.owens [...] gmail.com>
Date: Thu, 13 Aug 2015 18:30:06 -0400
A new version that attempts to address the confusion of how strings
are encoded. I am not very happy with it (especially because I don't
know all of the rules for which encoding will be used internally), but
it is a good strawman to draw debate over how to phrase it.

=head1 NAME

bytes - Perl pragma to access the individual bytes of characters in strings

=head1 NOTICE

This pragma is no longer recommended for anything other than debugging
of how Perl represents a given string internally. Perl strings can be
represented internally in a number of different encodings, and, therefore,
the byte values may not be the ones you are expecting. For example,

    # this string can be encoded internally with Latin1,
    # so it is two bytes long on most systems
    my $s1 = "\x{ff}\x{ff}";

    # but this string has a character above U+00FF and can't be encoded
    # with Latin1, so it is encoded with UTF-8 and is four bytes long on
    # most systems
    my $s2 = "\x{ff}\x{100}";

A better solution is to create a byte string with an explicit encoding
using the C<encode> function from the L<Encode> module:

    use Encode qw/encode/;

    my $utf8_byte_string = encode "UTF8", $string;
    my $latin1_byte_string = encode "Latin1", $string;
    my $utf16_byte_string = encode "UTF-16BE", $string;

Or, if performance is needed and you are only interested in the UTF-8
representation, you can use the C<encode> function from the L<utf8> pragma:

    use utf8;

    utf8::encode(my $utf8_byte_string = $string);

If you feel this pragma might be useful for your application, this
possibly indicates a mismatch between your mental model of how perl
handles strings and the current reality. In that case, you may wish to
read some of the Perl Unicode documentation: L<perluniintro>,
L<perlunitut>, L<perlunifaq> and L<perlunicode>.

On Thu, Aug 13, 2015 at 6:25 PM, Chas. Owens <chas.owens@gmail.com> wrote:
> On Thu, Aug 13, 2015 at 6:08 PM, Leon Timmermans <fawaka@gmail.com> wrote:
>> On Thu, Aug 13, 2015 at 11:48 PM, Chas. Owens <chas.owens@gmail.com> wrote:
>>> The bytes pragma is broken, but not in the way you think. Perl
>>> encodes strings based on a number of rules I don't fully understand,
>>> but if I recall correctly, if all of the characters in the string are
>>> below 255, then it uses Latin1, if there are characters greater than
>>> 255, it uses UTF-8.
>>
>> No, it can use utf8 internally even when all of the characters are below
>> 255. Whichever it is using is rather situational, which is why it's such a
>> mess. The only sane way to handle this is to ignore the internal encoding
>> altogether, but consistently either decode/encode everything or keep
>> everything binary. This is orthogonal to the internal encoding.
>
> Like I said, I don't fully understand the rules (and I don't have to
> if I don't use the bytes pragma). There are all sorts of edge cases:
>
>     #!/usr/bin/perl
>
>     use strict;
>     use warnings;
>     use utf8;
>
>     my $utf8 = "ÿ";
>     my $latin1 = "\x{ff}";
>
>     {
>         use bytes;
>         print
>             "utf8: ", length $utf8, "\n",
>             "latin1: ", length $latin1, "\n";
>     }
>
> --
> Chas. Owens
> http://github.com/cowens
> The most important skill a programmer can have is the ability to read.

-- 
Chas. Owens
http://github.com/cowens
The most important skill a programmer can have is the ability to read.
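As a rough illustration of the "explicit encoding" advice in the draft above, the following sketch encodes the draft's second example string and prints the resulting bytes in hex. It is not part of the proposed POD. It assumes the stock Encode module and uses the strict "UTF-8" encoding name rather than the draft's "UTF8"; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two characters: U+00FF and U+0100 (the second does not fit in Latin-1).
my $string = "\x{ff}\x{100}";

# Each call returns a new byte string; $string itself is left alone.
my $utf8_bytes  = encode("UTF-8",    $string);
my $utf16_bytes = encode("UTF-16BE", $string);

# unpack "H*" renders a byte string as lowercase hex digits.
my $utf8_hex  = unpack "H*", $utf8_bytes;
my $utf16_hex = unpack "H*", $utf16_bytes;

print "characters: ", length($string), "\n";
print "UTF-8 bytes:    $utf8_hex (", length($utf8_bytes), ")\n";
print "UTF-16BE bytes: $utf16_hex (", length($utf16_bytes), ")\n";
```

Because the encoding is named at the call site, the byte counts here do not depend on how perl happens to hold $string internally, which is the point the draft is trying to make.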
Date: Fri, 14 Aug 2015 13:43:17 +0200
To: "Chas. Owens" <chas.owens [...] gmail.com>
Subject: Re: [perl #125619] Documentation of byte I/O
CC: Karl Williamson <public [...] khwilliamson.com>, "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>, Perl 5 Porters <perl5-porters [...] perl.org>
From: Lasse Makholm <lasse.makholm [...] gmail.com>

Message body is not shown because it is too large.

To: "Chas. Owens" <chas.owens [...] gmail.com>
CC: Karl Williamson <public [...] khwilliamson.com>, "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>, Perl 5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #125619] Documentation of byte I/O
From: Lasse Makholm <lasse.makholm [...] gmail.com>
Date: Fri, 14 Aug 2015 14:14:43 +0200

Message body is not shown because it is too large.

Date: Fri, 14 Aug 2015 14:35:41 +0200
From: Lasse Makholm <lasse.makholm [...] gmail.com>
To: "Chas. Owens" <chas.owens [...] gmail.com>
CC: Leon Timmermans <fawaka [...] gmail.com>, Karl Williamson <public [...] khwilliamson.com>, "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>, Perl 5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #125619] Documentation of byte I/O


On 14 August 2015 at 00:30, Chas. Owens <chas.owens@gmail.com> wrote:
> A new version that attempts to address the confusion of how strings
> are encoded.  I am not very happy with it (especially because I don't
> know all of the rules for which encoding will be used internally), but
> it is a good strawman to draw debate over how to phrase it.

FWIW, I like this version. It clearly shows why bytes::length() is not what you want and also what you should do instead. Detailing exactly when and why Perl upgrades and downgrades strings won't matter to the vast majority of users.

The "use utf8;" statement should probably be removed though. According to the utf8 docs:

 Do not use this pragma for anything else than telling Perl that your
 script is written in UTF-8. The utility functions described below are
 directly usable without "use utf8;".

And indeed, you can call utf8::encode() without use'ing utf8 first.

/L
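A small sketch of the point made above, that utf8::encode() is callable with no "use utf8;" in the file. It assumes nothing beyond core perl; the variable names follow the draft POD where possible and are otherwise illustrative.

```perl
use strict;
use warnings;
# Deliberately no "use utf8;" here: the utf8:: functions are always available.

my $string = "\x{ff}\x{100}";    # two characters

# utf8::encode converts its argument in place, so encode a copy.
utf8::encode(my $utf8_byte_string = $string);

my $chars   = length $string;              # characters in the original
my $bytes   = length $utf8_byte_string;    # bytes in the encoded copy
my $flagged = utf8::is_utf8($utf8_byte_string) ? 1 : 0;

print "characters: $chars\n";
print "encoded bytes: $bytes\n";
print "UTF8 flag on encoded copy: $flagged\n";
```

The encoded copy is an ordinary byte string of length 4 with the UTF8 flag off, so plain length() already gives the byte count and "use bytes" is not needed.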
From: Eric Brine <ikegami [...] adaelis.com>
Subject: Re: [perl #125619] Documentation of byte I/O
CC: Karl Williamson <public [...] khwilliamson.com>, "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>, Perl 5 Porters <perl5-porters [...] perl.org>
To: "Chas. Owens" <chas.owens [...] gmail.com>
Date: Fri, 14 Aug 2015 14:27:53 -0400
On Thu, Aug 13, 2015 at 10:10 AM, Chas. Owens <chas.owens@gmail.com> wrote:
> Or, if performance is needed and you are only interested in the UTF-8
> representation, you can use the C<encode> function from the L<utf8> pragma:
>
>     use utf8;
>
>     utf8::encode(my $utf8_byte_string = $string);

Sorry if this was already mentioned, but that should be

    utf8::encode(my $utf8_byte_string = $string);

without

    use utf8;

"use utf8;" is a pragma that indicates the source code is encoded using UTF-8, and does not control access to utf8::*.


RT-Send-CC: perl5-porters [...] perl.org
On Thu Aug 13 15:31:03 2015, cowens wrote:
> A new version that attempts to address the confusion of how strings
> are encoded. I am not very happy with it (especially because I don't
> know all of the rules for which encoding will be used internally), but
> it is a good strawman to draw debate over how to phrase it.

If the string has code points over 0xff it's encoded as perl's extended UTF-8, otherwise it could be encoded either way.

For example Encode::decode() always (I'm unaware of any exceptions) returns a UTF-8 encoded string, even if all of the characters are between 0 and 0xff:

tony@mars:.../git/perl$ ./perl -Ilib -MDevel::Peek -MEncode -e '$x = decode("latin1", " "); Dump($x)'
SV = PV(0x18f6830) at 0x19150e8
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x1a11e70 " "\0 [UTF8 " "]
  CUR = 1
  LEN = 10
tony@mars:.../git/perl$ ./perl -Ilib -MDevel::Peek -MEncode -e '$x = decode("UTF-8", "\303\277"); Dump($x)'
SV = PV(0x1cb8830) at 0x1cd70f8
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x1dd3340 "\303\277"\0 [UTF8 "\x{ff}"]
  CUR = 2
  LEN = 10

Also some string literals under use utf8:

tony@mars:.../git/perl$ ./perl -Ilib -MDevel::Peek -Mutf8 -e '$x = "ÿ"; Dump($x)'
SV = PV(0x28e6d70) at 0x29060d8
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  PV = 0x2904380 "\303\277"\0 [UTF8 "\x{ff}"]
  CUR = 2
  LEN = 10
  COW_REFCNT = 1

...

> Or, if performance is needed and you are only interested in the UTF-8
> representation, you can use the C<encode> function from the L<utf8>
> pragma:
>
>     use utf8;
>
>     utf8::encode(my $utf8_byte_string = $string);

As others have said, the "use utf8;" isn't needed.

Tony
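The observation about Encode::decode() above can also be checked without Devel::Peek. The sketch below is a hedged illustration, assuming an ASCII platform and the stock Encode module: it compares a plain "\xff" literal with the result of decoding the same byte from ISO-8859-1. The variable names are hypothetical, and the flag result for the decoded string relies on the behaviour described in this message.

```perl
use strict;
use warnings;
use bytes ();               # for bytes::length only, no byte semantics
use Encode qw(decode);

my $literal = "\xff";                          # literal, no "use utf8"
my $decoded = decode("iso-8859-1", "\xff");    # same character via Encode

# Same character content either way.
my $equal = $literal eq $decoded ? 1 : 0;

# The internal representation differs, as the UTF8 flag shows.
my $flag_literal = utf8::is_utf8($literal) ? 1 : 0;
my $flag_decoded = utf8::is_utf8($decoded) ? 1 : 0;

# bytes::length exposes that difference as a different count.
my $raw_literal = bytes::length($literal);
my $raw_decoded = bytes::length($decoded);

print "equal: $equal\n";
print "UTF8 flag: literal=$flag_literal decoded=$flag_decoded\n";
print "internal bytes: literal=$raw_literal decoded=$raw_decoded\n";
```

The two strings are equal as far as Perl code is concerned, so a program that gets different answers for them under "use bytes" is observing an implementation detail rather than anything about the data.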
RT-Send-CC: perl5-porters [...] perl.org
On Fri, 14 Aug 2015 11:28:28 -0700, ikegami@adaelis.com wrote:
> On Thu, Aug 13, 2015 at 10:10 AM, Chas. Owens <chas.owens@gmail.com>
> wrote:
>
> > Or, if performance is needed and you are only interested in the UTF-8
> > representation, you can use the C<encode> function from the L<utf8>
> > pragma:
> >
> >     use utf8;
> >
> >     utf8::encode(my $utf8_byte_string = $string);
>
> Sorry if this was already mentioned, but that should be
>
>     utf8::encode(my $utf8_byte_string = $string);
>
> without
>
>     use utf8;
>
> "use utf8;" is a pragma that indicates the source code is encoded
> using UTF-8, and does not control access to utf8::*.

Is there anything else we want to do for this ticket?

The only patch (with some extras and changes) was applied as:

commit 01e331e519b4ccb213cadc7bfe7c04b8249e3289
Author: Karl Williamson <khw@cpan.org>
Date:   Sun Aug 9 21:40:21 2015 -0600

    Update bytes.pm doc

    The one legitimate use of this pragma is for debugging. This changes to
    say so, and other minor changes.

(though it retained the unneeded C<use utf8;>.)

Tony
RT-Send-CC: perl.p5p [...] rjbs.manxome.org, fawaka [...] gmail.com, pagaltzis [...] gmx.de, perl5-porters [...] perl.org, chas.owens [...] gmail.com, ikegami [...] adaelis.com, lasse.makholm [...] gmail.com, perl5-porters [...] perl.org
On Sun, 15 Oct 2017 22:09:20 -0700, tonyc wrote:
> On Fri, 14 Aug 2015 11:28:28 -0700, ikegami@adaelis.com wrote:
> > On Thu, Aug 13, 2015 at 10:10 AM, Chas. Owens <chas.owens@gmail.com>
> > wrote:
> >
> > > Or, if performance is needed and you are only interested in the UTF-8
> > > representation, you can use the C<encode> function from the L<utf8>
> > > pragma:
> > >
> > >     use utf8;
> > >
> > >     utf8::encode(my $utf8_byte_string = $string);
> >
> > Sorry if this was already mentioned, but that should be
> >
> >     utf8::encode(my $utf8_byte_string = $string);
> >
> > without
> >
> >     use utf8;
> >
> > "use utf8;" is a pragma that indicates the source code is encoded
> > using UTF-8, and does not control access to utf8::*.
>
> Is there anything else we want to do for this ticket?
>
> The only patch (with some extras and changes) was applied as:
>
> commit 01e331e519b4ccb213cadc7bfe7c04b8249e3289
> Author: Karl Williamson <khw@cpan.org>
> Date:   Sun Aug 9 21:40:21 2015 -0600
>
>     Update bytes.pm doc
>
>     The one legitimate use of this pragma is for debugging. This changes to
>     say so, and other minor changes.
>
> (though it retained the unneeded C<use utf8;>.)
>
> Tony

No one replied to this; I'm unclear if it was sent to all the concerned parties.

The latest Encode running on blead is much faster than before. I'm fine with leaving the wording as it is now; but before closing I want to make sure that others don't have objections. If I don't hear any within 30 days, I will close the ticket.

-- 
Karl Williamson
To: Karl Williamson via RT <perlbug-followup [...] perl.org>
From: Dave Mitchell <davem [...] iabyn.com>
Date: Wed, 4 Apr 2018 11:47:54 +0100
Subject: Re: [perl #125619] Documentation of byte I/O
CC: perl.p5p [...] rjbs.manxome.org, fawaka [...] gmail.com, pagaltzis [...] gmx.de, perl5-porters [...] perl.org, chas.owens [...] gmail.com, ikegami [...] adaelis.com, lasse.makholm [...] gmail.com
On Tue, Apr 03, 2018 at 09:38:50AM -0700, Karl Williamson via RT wrote:
> No one replied to this; I'm unclear if it was sent to all the concerned
> parties. The latest Encode running on blead is much faster than before.
> I'm fine with leaving the wording as it is now; but before closing I
> want to make sure that others don't have objections. If I don't hear
> any within 30 days, I will close the ticket.

With the just-pushed v5.27.10-105-g0d372decae, I've removed the spurious
'use utf8' from the example, as suggested earlier in the ticket.

I'm happy for the ticket to be closed.

-- 
Spock (or Data) is fired from his high-ranking position for not being able
to understand the most basic nuances of about one in three sentences that
anyone says to him.
    -- Things That Never Happen in "Star Trek" #19
RT-Send-CC: perl5-porters [...] perl.org
The deadline for 5.28 is fast upon us, and since no objections have been raised, I'm now closing this ticket.

-- 
Karl Williamson

