[PATCH] Update documentation about UTF-8 #15612

p5pRT · 2016-09-18T16:32:21Z

Migrated from rt.perl.org#129298 (status was 'resolved')

Searchable as RT129298$

p5pRT · 2016-09-18T16:32:21Z

From @pali

Created by @pali

The attached patches update the documentation about UTF-8. For data exchange it is better to use the strict UTF-8 encoding rather than Perl's lax utf8. It is also wrong to use the insecure :utf8 PerlIO layer to read arbitrary input files.
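
The distinction between the two spellings can be checked directly with Encode. The following is a minimal sketch, not part of the attached patches; it assumes an Encode version that separates the two encodings (the Encode documentation dates this to version 2.10), and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# Encode resolves the two spellings to different encoding objects.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;

# A code point above U+10FFFF is representable in Perl but is not Unicode.
my $beyond = chr(0x110000);

# LEAVE_SRC keeps encode() from modifying its input when a CHECK is given.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Record whether each encoding accepts the out-of-range code point.
my $lax_ok    = eval { encode('utf8',  $beyond, $check); 1 } ? 1 : 0;
my $strict_ok = eval { encode('UTF-8', $beyond, $check); 1 } ? 1 : 0;

print "'UTF-8' resolves to $strict_name, 'utf8' resolves to $lax_name\n";
print "lax utf8 encodes U+110000:     ", ($lax_ok    ? "yes" : "no"), "\n";
print "strict UTF-8 encodes U+110000: ", ($strict_ok ? "yes" : "no"), "\n";
```

The lax encoding should accept the code point while the strict one croaks, which is why the strict name is the safer choice for data that leaves the program.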

Perl Info


Flags:
    category=core
    severity=medium
    Type=Patch
    PatchStatus=HasPatch

Site configuration information for perl 5.25.5:

Configured by pali at Sun Sep 18 16:45:32 CEST 2016.

Summary of my perl5 (revision 5 version 25 subversion 5) configuration:
  Commit id: e463df90b78a57edd46d5b19a56006b28f5029d6
  Platform:
    osname=linux
    osvers=3.13.0-95-generic
    archname=x86_64-linux
    uname='linux pali 3.13.0-95-generic #142~precise1-ubuntu smp fri aug 12 18:20:15 utc 2016 x86_64 x86_64 x86_64 gnulinux '
    config_args='-des -Dusedevel'
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=undef
    usemultiplicity=undef
    use64bitint=define
    use64bitall=define
    uselongdouble=undef
    usemymalloc=n
    bincompat5005=undef
  Compiler:
    cc='cc'
    ccflags ='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    optimize='-O2'
    cppflags='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion=''
    gccversion='4.6.3'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='cc'
    ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/4.6/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
    libs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.15.so
    so=so
    useshrplib=false
    libperl=libperl.a
    gnulibc_version='2.15'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E'
    cccdlflags='-fPIC'
    lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'



@INC for perl 5.25.5:
    lib
    /usr/lib/perl5/site_perl/5.25.5/x86_64-linux
    /usr/lib/perl5/site_perl/5.25.5
    /usr/lib/perl5/5.25.5/x86_64-linux
    /usr/lib/perl5/5.25.5


Environment for perl 5.25.5:
    HOME=/home/pali
    LANG=C
    LANGUAGE=C
    LD_LIBRARY_PATH=/usr/lib/i386-linux-gnu/
    LOGDIR (unset)
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
    PERLDOC_PAGER (unset)
    PERL_BADLANG (unset)
    SHELL=/bin/bash

p5pRT · 2016-09-18T16:32:21Z

From @pali

0001-pod-Do-not-suggest-to-use-insecure-utf8-PerlIO-layer.patch

From a0cf022c6fb8e9a4673173e91d84217fd086dfe6 Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:16:33 +0200
Subject: [PATCH 01/10] pod: Do not suggest to use insecure :utf8 PerlIO layer
 when reading files

Instead use strict :encoding(UTF-8) PerlIO layer for input files.
---
 pod/perlop.pod      |    9 ++++++---
 pod/perlunicook.pod |   12 ++++++------
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/pod/perlop.pod b/pod/perlop.pod
index d65e911..cf99b88 100644
--- a/pod/perlop.pod
+++ b/pod/perlop.pod
@@ -1262,16 +1262,19 @@ The only operators with lower precedence are the logical operators
 C<"and">, C<"or">, and C<"not">, which may be used to evaluate calls to list
 operators without the need for parentheses:
 
-    open HANDLE, "< :utf8", "filename" or die "Can't open: $!\n";
+    open HANDLE, "< :encoding(UTF-8)", "filename"
+        or die "Can't open: $!\n";
 
 However, some people find that code harder to read than writing
 it with parentheses:
 
-    open(HANDLE, "< :utf8", "filename") or die "Can't open: $!\n";
+    open(HANDLE, "< :encoding(UTF-8)", "filename")
+        or die "Can't open: $!\n";
 
 in which case you might as well just use the more customary C<"||"> operator:
 
-    open(HANDLE, "< :utf8", "filename") || die "Can't open: $!\n";
+    open(HANDLE, "< :encoding(UTF-8)", "filename")
+        || die "Can't open: $!\n";
 
 See also discussion of list operators in L</Terms and List Operators (Leftward)>.
 
diff --git a/pod/perlunicook.pod b/pod/perlunicook.pod
index ac30509..6db6b73 100644
--- a/pod/perlunicook.pod
+++ b/pod/perlunicook.pod
@@ -26,7 +26,7 @@ to work correctly, with the C<#!> adjusted to work on your system:
  use strict;    # quote strings, declare variables
  use warnings;  # on by default
  use warnings  qw(FATAL utf8);    # fatalize encoding glitches
- use open      qw(:std :utf8);    # undeclared streams in UTF-8
+ use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
  use charnames qw(:full :short);  # unneeded in v5.16
 
 This I<does> make even Unix programmers C<binmode> your binary streams,
@@ -255,9 +255,9 @@ call C<binmode> explicitly:
  or
      $ export PERL_UNICODE=S
  or
-     use open qw(:std :utf8);
+     use open qw(:std :encoding(UTF-8));
  or
-     binmode(STDIN,  ":utf8");
+     binmode(STDIN,  ":encoding(UTF-8)");
      binmode(STDOUT, ":utf8");
      binmode(STDERR, ":utf8");
 
@@ -280,7 +280,7 @@ Files opened without an encoding argument will be in UTF-8:
  or
      $ export PERL_UNICODE=D
  or
-     use open qw(:utf8);
+     use open qw(:encoding(UTF-8));
 
 =head2 ℞ 18: Make all I/O and args default to utf8
 
@@ -288,7 +288,7 @@ Files opened without an encoding argument will be in UTF-8:
  or
      $ export PERL_UNICODE=SDA
  or
-     use open qw(:std :utf8);
+     use open qw(:std :encoding(UTF-8));
      use Encode qw(decode_utf8);
      @ARGV = map { decode_utf8($_, 1) } @ARGV;
 
@@ -701,7 +701,7 @@ Here's that program; tested on v5.14.
  use strict;
  use warnings;
  use warnings  qw(FATAL utf8);    # fatalize encoding faults
- use open      qw(:std :utf8);    # undeclared streams in UTF-8
+ use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
  use charnames qw(:full :short);  # unneeded in v5.16
 
  # std modules
-- 
1.7.9.5
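
As a quick illustration of the layer this patch recommends, here is a small round trip through a temporary file. It is only a sketch: the temporary file and the variable names are assumptions made for the example and do not appear in the patch.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write one non-ASCII character through the strict encoding layer.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':encoding(UTF-8)');
print {$out} chr(0x100), "\n";
close($out) or die "Can't close $path: $!\n";

# Reading with :raw shows the bytes that actually landed on disk.
open(my $raw, '<:raw', $path) or die "Can't open $path: $!\n";
my $bytes = <$raw>;
close($raw);

# Reading with :encoding(UTF-8) validates the bytes and decodes them.
open(my $in, '<:encoding(UTF-8)', $path) or die "Can't open $path: $!\n";
my $line = <$in>;
close($in);

chomp($bytes);
chomp($line);
printf "bytes on disk: %d, characters read: %d\n",
    length($bytes), length($line);
```

The same character occupies two bytes in the file and one character once decoded, and the decoding step is where the strict layer gets a chance to reject malformed input.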

p5pRT · 2016-09-18T16:32:21Z

From @pali

0002-pod-Suggest-to-use-strict-encoding-UTF-8-PerlIO-laye.patch

From b66b3bffd136947d78c658b31beccbcc340a0c08 Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:19:59 +0200
Subject: [PATCH 02/10] pod: Suggest to use strict :encoding(UTF-8) PerlIO
 layer over not strict :encoding(utf8)

For data exchange it is better to use strict UTF-8 encoding and not perl's utf8.
---
 lib/PerlIO.pm        |    2 +-
 lib/open.pm          |   10 +++++-----
 pod/perlfunc.pod     |   10 +++++-----
 pod/perlrun.pod      |    2 +-
 pod/perlunicode.pod  |    2 +-
 pod/perluniintro.pod |   10 +++++-----
 6 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/lib/PerlIO.pm b/lib/PerlIO.pm
index 2e27f98..acf2e19 100644
--- a/lib/PerlIO.pm
+++ b/lib/PerlIO.pm
@@ -104,7 +104,7 @@ is chosen to render simple text parts (i.e.  non-accented letters,
 digits and common punctuation) human readable in the encoded file.
 
 (B<CAUTION>: This layer does not validate byte sequences.  For reading input,
-you should instead use C<:encoding(utf8)> instead of bare C<:utf8>.)
+you should instead use C<:encoding(UTF-8)> instead of bare C<:utf8>.)
 
 Here is how to write your native data out using UTF-8 (or UTF-EBCDIC)
 and then read it back in.
diff --git a/lib/open.pm b/lib/open.pm
index fd22e1b..2392ac9 100644
--- a/lib/open.pm
+++ b/lib/open.pm
@@ -153,7 +153,7 @@ open - perl pragma to set default PerlIO layers for input and output
 
     use open IO  => ':locale';
 
-    use open ':encoding(utf8)';
+    use open ':encoding(UTF-8)';
     use open ':locale';
     use open ':encoding(iso-8859-7)';
 
@@ -195,8 +195,8 @@ For example:
 
 These are equivalent
 
-    use open ':encoding(utf8)';
-    use open IO => ':encoding(utf8)';
+    use open ':encoding(UTF-8)';
+    use open IO => ':encoding(UTF-8)';
 
 as are these
 
@@ -221,8 +221,8 @@ The C<:std> subpragma on its own has no effect, but if combined with
 the C<:utf8> or C<:encoding> subpragmas, it converts the standard
 filehandles (STDIN, STDOUT, STDERR) to comply with encoding selected
 for input/output handles.  For example, if both input and out are
-chosen to be C<:encoding(utf8)>, a C<:std> will mean that STDIN, STDOUT,
-and STDERR are also in C<:encoding(utf8)>.  On the other hand, if only
+chosen to be C<:encoding(UTF-8)>, a C<:std> will mean that STDIN, STDOUT,
+and STDERR are also in C<:encoding(UTF-8)>.  On the other hand, if only
 output is chosen to be in C<< :encoding(koi8r) >>, a C<:std> will cause
 only the STDOUT and STDERR to be in C<koi8r>.  The C<:locale> subpragma
 implicitly turns on C<:std>.
diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod
index 36df5c7..8365ff7 100644
--- a/pod/perlfunc.pod
+++ b/pod/perlfunc.pod
@@ -6091,7 +6091,7 @@ Note the I<characters>: depending on the status of the socket, either
 (8-bit) bytes or characters are received.  By default all sockets
 operate on bytes, but for example if the socket has been changed using
 L<C<binmode>|/binmode FILEHANDLE, LAYER> to operate with the
-C<:encoding(utf8)> I/O layer (see the L<open> pragma), the I/O will
+C<:encoding(UTF-8)> I/O layer (see the L<open> pragma), the I/O will
 operate on UTF8-encoded Unicode
 characters, not bytes.  Similarly for the C<:encoding> layer: in that
 case pretty much any characters can be read.
@@ -6651,7 +6651,7 @@ of the file) from the L<Fcntl> module.  Returns C<1> on success, false
 otherwise.
 
 Note the emphasis on bytes: even if the filehandle has been set to operate
-on characters (for example using the C<:encoding(utf8)> I/O layer), the
+on characters (for example using the C<:encoding(UTF-8)> I/O layer), the
 L<C<seek>|/seek FILEHANDLE,POSITION,WHENCE>,
 L<C<tell>|/tell FILEHANDLE>, and
 L<C<sysseek>|/sysseek FILEHANDLE,POSITION,WHENCE>
@@ -6890,7 +6890,7 @@ Note the I<characters>: depending on the status of the socket, either
 (8-bit) bytes or characters are sent.  By default all sockets operate
 on bytes, but for example if the socket has been changed using
 L<C<binmode>|/binmode FILEHANDLE, LAYER> to operate with the
-C<:encoding(utf8)> I/O layer (see L<C<open>|/open FILEHANDLE,EXPR>, or
+C<:encoding(UTF-8)> I/O layer (see L<C<open>|/open FILEHANDLE,EXPR>, or
 the L<open> pragma), the I/O will operate on UTF-8
 encoded Unicode characters, not bytes.  Similarly for the C<:encoding>
 layer: in that case pretty much any characters can be sent.
@@ -8491,7 +8491,7 @@ to the current position plus POSITION; and C<2> to set it to EOF plus
 POSITION, typically negative.
 
 Note the emphasis on bytes: even if the filehandle has been set to operate
-on characters (for example using the C<:encoding(utf8)> I/O layer), the
+on characters (for example using the C<:encoding(UTF-8)> I/O layer), the
 L<C<seek>|/seek FILEHANDLE,POSITION,WHENCE>,
 L<C<tell>|/tell FILEHANDLE>, and
 L<C<sysseek>|/sysseek FILEHANDLE,POSITION,WHENCE>
@@ -8658,7 +8658,7 @@ the actual filehandle.  If FILEHANDLE is omitted, assumes the file
 last read.
 
 Note the emphasis on bytes: even if the filehandle has been set to operate
-on characters (for example using the C<:encoding(utf8)> I/O layer), the
+on characters (for example using the C<:encoding(UTF-8)> I/O layer), the
 L<C<seek>|/seek FILEHANDLE,POSITION,WHENCE>,
 L<C<tell>|/tell FILEHANDLE>, and
 L<C<sysseek>|/sysseek FILEHANDLE,POSITION,WHENCE>
diff --git a/pod/perlrun.pod b/pod/perlrun.pod
index 12cba35..a890215 100644
--- a/pod/perlrun.pod
+++ b/pod/perlrun.pod
@@ -1121,7 +1121,7 @@ A pseudolayer that enables a flag in the layer below to tell Perl
 that output should be in utf8 and that input should be regarded as
 already in valid utf8 form. B<WARNING: It does not check for validity and as such
 should be handled with extreme caution for input, because security violations
-can occur with non-shortest UTF-8 encodings, etc.> Generally C<:encoding(utf8)> is
+can occur with non-shortest UTF-8 encodings, etc.> Generally C<:encoding(UTF-8)> is
 the best option when reading UTF-8 encoded data.
 
 =item :win32
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 152c34b..bc82574 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -1879,7 +1879,7 @@ work under 5.6, so you should be safe to try them out.
 A filehandle that should read or write UTF-8
 
   if ($] > 5.008) {
-    binmode $fh, ":encoding(utf8)";
+    binmode $fh, ":encoding(UTF-8)";
   }
 
 =item *
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index beccd3c..0e3f4bc 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -402,7 +402,7 @@ Unicode or legacy encodings does not magically turn the data into
 Unicode in Perl's eyes.  To do that, specify the appropriate
 layer when opening files
 
-    open(my $fh,'<:encoding(utf8)', 'anything');
+    open(my $fh,'<:encoding(UTF-8)', 'anything');
     my $line_of_unicode = <$fh>;
 
     open(my $fh,'<:encoding(Big5)', 'anything');
@@ -411,8 +411,8 @@ layer when opening files
 The I/O layers can also be specified more flexibly with
 the C<open> pragma.  See L<open>, or look at the following example.
 
-    use open ':encoding(utf8)'; # input/output default encoding will be
-                                # UTF-8
+    use open ':encoding(UTF-8)'; # input/output default encoding will be
+                                 # UTF-8
     open X, ">file";
     print X chr(0x100), "\n";
     close X;
@@ -481,12 +481,12 @@ by repeatedly encoding the data:
     local $/; ## read in the whole file of 8-bit characters
     $t = <F>;
     close F;
-    open F, ">:encoding(utf8)", "file";
+    open F, ">:encoding(UTF-8)", "file";
     print F $t; ## convert to UTF-8 on output
     close F;
 
 If you run this code twice, the contents of the F<file> will be twice
-UTF-8 encoded.  A C<use open ':encoding(utf8)'> would have avoided the
+UTF-8 encoded.  A C<use open ':encoding(UTF-8)'> would have avoided the
 bug, or explicitly opening also the F<file> for input as UTF-8.
 
 B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
-- 
1.7.9.5

p5pRT · 2016-09-18T16:32:21Z

From @pali

0003-pod-Do-not-suggest-to-use-Encode-encode_utf8-for-kno.patch

From 86c5ac3f1abcee81736340cdc3cc49f7ed5eb404 Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:21:54 +0200
Subject: [PATCH 03/10] pod: Do not suggest to use Encode::encode_utf8() for
 knowing the byte length of a string

Encode module could do some additional operations and bytes pragma is
supposed to do that job.
---
 pod/perlfunc.pod     |    4 ++--
 pod/perluniintro.pod |    5 +----
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod
index 8365ff7..8f81049 100644
--- a/pod/perlfunc.pod
+++ b/pod/perlfunc.pod
@@ -3764,8 +3764,8 @@ many elements these have.  For that, use C<scalar @array> and C<scalar keys
 Like all Perl character operations, L<C<length>|/length EXPR> normally
 deals in logical
 characters, not physical bytes.  For how many bytes a string encoded as
-UTF-8 would take up, use C<length(Encode::encode_utf8(EXPR))> (you'll have
-to C<use Encode> first).  See L<Encode> and L<perlunicode>.
+UTF-8 would take up, use C<bytes::length(EXPR)> (you'll have to
+C<use bytes ()> first).  See L<C<use bytes>|bytes> pragma and L<perlunicode>.
 
 =item __LINE__
 X<__LINE__>
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 0e3f4bc..9b6c0da 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -726,14 +726,11 @@ the output string will be UTF-8-encoded C<ab\x80c = \x{100}\n>, but
 C<$a> will stay byte-encoded.
 
 Sometimes you might really need to know the byte length of a string
-instead of the character length. For that use either the
-C<Encode::encode_utf8()> function or the C<bytes> pragma
+instead of the character length. For that use the C<bytes> pragma
 and the C<length()> function:
 
     my $unicode = chr(0x100);
     print length($unicode), "\n"; # will print 1
-    require Encode;
-    print length(Encode::encode_utf8($unicode)),"\n"; # will print 2
     use bytes;
     print length($unicode), "\n"; # will also print 2
                                   # (the 0xC4 0x80 of the UTF-8)
-- 
1.7.9.5
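
One point that may be worth checking when reviewing this patch: `bytes::length` reports the length of Perl's internal storage, which matches the UTF-8 encoded length only when the string is already stored as UTF-8 internally. The sketch below compares the two approaches; the variable names and the hash of results are illustrative assumptions, and the Latin-1 case assumes an ASCII platform.

```perl
use strict;
use warnings;
use bytes ();            # load bytes::length without enabling byte semantics
use Encode qw(encode);

my $wide  = chr(0x100);  # needs the UTF8 flag, so it is stored as UTF-8
my $latin = chr(0xE9);   # fits in one native byte, stored without the flag

my %len = (
    wide_chars     => length($wide),
    wide_internal  => bytes::length($wide),
    wide_utf8      => length(encode('UTF-8', $wide)),
    latin_internal => bytes::length($latin),
    latin_utf8     => length(encode('UTF-8', $latin)),
);

printf "%-15s %d\n", $_, $len{$_} for sort keys %len;
```

For `chr(0x100)` both approaches agree, but for `chr(0xE9)` the internal length is 1 while the UTF-8 encoded length is 2.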

p5pRT · 2016-09-18T16:32:21Z

From @pali

0004-pod-Suggest-to-use-strict-UTF-8-encoding-when-dealin.patch

From 20c8001d5efa4c9a47e4629ad6227109ff0f184f Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:25:48 +0200
Subject: [PATCH 04/10] pod: Suggest to use strict UTF-8 encoding when dealing
 with external data

For data exchange it is not a good idea to use Perl's non-strict, extended
dialect of the utf8 encoding.
---
 pod/perldiag.pod     |    2 +-
 pod/perlpacktut.pod  |    7 ++++---
 pod/perlunicode.pod  |    8 ++++----
 pod/perlunicook.pod  |    8 ++++----
 pod/perlunifaq.pod   |    6 ++++--
 pod/perluniintro.pod |    8 ++++----
 6 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index 6d82cde..58ed165 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -3351,7 +3351,7 @@ encoding rules, even though it had the UTF8 flag on.
 
 One possible cause is that you set the UTF8 flag yourself for data that
 you thought to be in UTF-8 but it wasn't (it was for example legacy
-8-bit data).  To guard against this, you can use Encode::decode_utf8.
+8-bit data).  To guard against this, you can use C<Encode::decode('UTF-8', ...)>.
 
 If you use the C<:encoding(UTF-8)> PerlIO layer for input, invalid byte
 sequences are handled gracefully, but if you use C<:utf8>, the flag is
diff --git a/pod/perlpacktut.pod b/pod/perlpacktut.pod
index f40d1c2..f6a9411 100644
--- a/pod/perlpacktut.pod
+++ b/pod/perlpacktut.pod
@@ -668,9 +668,10 @@ Usually you'll want to pack or unpack UTF-8 strings:
    my @hebrew = unpack( 'U*', $utf );
 
 Please note: in the general case, you're better off using
-Encode::decode_utf8 to decode a UTF-8 encoded byte string to a Perl
-Unicode string, and Encode::encode_utf8 to encode a Perl Unicode string
-to UTF-8 bytes. These functions provide means of handling invalid byte
+L<C<Encode::decode('UTF-8', $utf)>|Encode/decode> to decode a UTF-8
+encoded byte string to a Perl Unicode string, and
+L<C<Encode::encode('UTF-8', $str)>|Encode/encode> to encode a Perl Unicode
+string to UTF-8 bytes. These functions provide means of handling invalid byte
 sequences and generally have a friendlier interface.
 
 =head2 Another Portable Binary Encoding
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index bc82574..229b9f8 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -1894,7 +1894,7 @@ check the documentation to verify if this is still true.
 
   if ($] > 5.008) {
     require Encode;
-    $val = Encode::encode_utf8($val); # make octets
+    $val = Encode::encode("UTF-8", $val); # make octets
   }
 
 =item *
@@ -1906,7 +1906,7 @@ want the UTF8 flag restored:
 
   if ($] > 5.008) {
     require Encode;
-    $val = Encode::decode_utf8($val);
+    $val = Encode::decode("UTF-8", $val);
   }
 
 =item *
@@ -2007,8 +2007,8 @@ Perl's internal representation like so:
     sub my_escape_html ($) {
         my($what) = shift;
         return unless defined $what;
-        Encode::decode_utf8(Foo::Bar::escape_html(
-                                         Encode::encode_utf8($what)));
+        Encode::decode("UTF-8", Foo::Bar::escape_html(
+                                     Encode::encode("UTF-8", $what)));
     }
 
 Sometimes, when the extension does not convert data but just stores
diff --git a/pod/perlunicook.pod b/pod/perlunicook.pod
index 6db6b73..eb395f7 100644
--- a/pod/perlunicook.pod
+++ b/pod/perlunicook.pod
@@ -234,8 +234,8 @@ C<binmode> as described later below.
  or
      $ export PERL_UNICODE=A
  or
-    use Encode qw(decode_utf8);
-    @ARGV = map { decode_utf8($_, 1) } @ARGV;
+    use Encode qw(decode);
+    @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
 
 =head2 ℞ 14: Decode program arguments as locale encoding
 
@@ -289,8 +289,8 @@ Files opened without an encoding argument will be in UTF-8:
      $ export PERL_UNICODE=SDA
  or
      use open qw(:std :encoding(UTF-8));
-     use Encode qw(decode_utf8);
-     @ARGV = map { decode_utf8($_, 1) } @ARGV;
+     use Encode qw(decode);
+     @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
 
 =head2 ℞ 19: Open file with specific encoding
 
diff --git a/pod/perlunifaq.pod b/pod/perlunifaq.pod
index 4135fba..63d1d5d 100644
--- a/pod/perlunifaq.pod
+++ b/pod/perlunifaq.pod
@@ -199,7 +199,9 @@ or by letting automatic decoding and encoding do all the work:
 =head2 What are C<decode_utf8> and C<encode_utf8>?
 
 These are alternate syntaxes for C<decode('utf8', ...)> and C<encode('utf8',
-...)>.
+...)>. Do not use these functions for data exchange. Instead use
+C<decode('UTF-8', ...)> and C<encode('UTF-8', ...)>; see
+L</What's the difference between UTF-8 and utf8?> under.
 
 =head2 What is a "wide character"?
 
@@ -283,7 +285,7 @@ C<UTF-8> is the official standard. C<utf8> is Perl's way of being liberal in
 what it accepts. If you have to communicate with things that aren't so liberal,
 you may want to consider using C<UTF-8>. If you have to communicate with things
 that are too liberal, you may have to use C<utf8>. The full explanation is in
-L<Encode>.
+L<Encode/"UTF-8 vs. utf8 vs. UTF8">.
 
 C<UTF-8> is internally known as C<utf-8-strict>. The tutorial uses UTF-8
 consistently, even where utf8 is actually used internally, because the
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 9b6c0da..7a06741 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -749,12 +749,12 @@ How Do I Detect Data That's Not Valid In a Particular Encoding?
 Use the C<Encode> package to try converting it.
 For example,
 
-    use Encode 'decode_utf8';
+    use Encode 'decode';
 
-    if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
-        # $string is valid utf8
+    if (eval { decode('UTF-8', $string, Encode::FB_CROAK); 1 }) {
+        # $string is valid UTF-8
     } else {
-        # $string is not valid utf8
+        # $string is not valid UTF-8
     }
 
 Or use C<unpack> to try decoding it:
-- 
1.7.9.5

p5pRT · 2016-09-18T16:32:21Z

From @pali

0005-perluniintro-Suggest-to-use-utf8-decode-instead-heav.patch

From 12ca635c60816a7bf2fa4ad36bcb911d9e271908 Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:44:22 +0200
Subject: [PATCH 05/10] perluniintro: Suggest to use utf8::decode() instead of
 heavy Encode when sequence of bytes is valid UTF-8

---
 pod/perluniintro.pod |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 7a06741..ea61443 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -814,8 +814,8 @@ pack/unpack to convert to/from Unicode.
 If you have a sequence of bytes you B<know> is valid UTF-8,
 but Perl doesn't know it yet, you can make Perl a believer, too:
 
-    use Encode 'decode_utf8';
-    $Unicode = decode_utf8($bytes);
+    $Unicode = $bytes;
+    utf8::decode($Unicode);
 
 or:
 
-- 
1.7.9.5
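
For reference, this is how `utf8::decode()` behaves on well-formed and on truncated input. It is a minimal sketch with made-up sample data; note that `utf8::decode()` works in place and follows Perl's extended utf8 rules rather than the strict `UTF-8` encoding from Encode.

```perl
use strict;
use warnings;

my $bytes = "\xC4\x80";           # UTF-8 encoding of U+0100
my $text  = $bytes;               # copy first: utf8::decode() works in place
my $ok    = utf8::decode($text);  # true when the bytes were well-formed

my $broken    = "\xC4";           # truncated sequence, not valid UTF-8
my $broken_ok = utf8::decode($broken);

printf "valid:     ok=%d length=%d ord=0x%X\n",
    ($ok ? 1 : 0), length($text), ord($text);
printf "truncated: ok=%d length=%d\n",
    ($broken_ok ? 1 : 0), length($broken);
```

Checking the return value matters: on failure the string is left untouched, so code that ignores the result keeps working with undecoded bytes.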

p5pRT · 2016-09-18T16:32:21Z

From @pali

0006-perluniintro-Fix-comment-Encode-decode-does-not-have.patch

From 011dd2cf8e095fdf410071b52509cb10877a756c Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:45:57 +0200
Subject: [PATCH 06/10] perluniintro: Fix comment, Encode::decode does not
 have to return string with UTF8 flag set

---
 pod/perluniintro.pod |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index ea61443..f152844 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -354,7 +354,7 @@ The C<Encode> module knows about many encodings and has interfaces
 for doing conversions between those encodings:
 
     use Encode 'decode';
-    $data = decode("iso-8859-3", $data); # convert from legacy to utf-8
+    $data = decode("iso-8859-3", $data); # convert from legacy
 
 =head2 Unicode I/O
 
-- 
1.7.9.5

p5pRT · 2016-09-18T16:32:21Z

From @pali

0007-perluniintro-Use-uppercase-UTF-8-encoding-name.patch

From 66cf660b59d2b351f9c2ed0d808a8185bae4ee93 Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:52:36 +0200
Subject: [PATCH 07/10] perluniintro: Use uppercase UTF-8 encoding name

Reason is consistency with other documentation files.
---
 pod/perluniintro.pod |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index f152844..244c1fd 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -389,7 +389,7 @@ many encodings have several aliases.  Note that the C<:utf8> layer
 must always be specified exactly like that; it is I<not> subject to
 the loose matching of encoding names. Also note that currently C<:utf8> is unsafe for
 input, because it accepts the data without validating that it is indeed valid
-UTF-8; you should instead use C<:encoding(utf-8)> (with or without a
+UTF-8; you should instead use C<:encoding(UTF-8)> (with or without a
 hyphen).
 
 See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
@@ -785,7 +785,7 @@ If you have a raw sequence of bytes that you know should be
 interpreted via a particular encoding, you can use C<Encode>:
 
     use Encode 'from_to';
-    from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8
+    from_to($data, "iso-8859-1", "UTF-8"); # from latin-1 to UTF-8
 
 The call to C<from_to()> changes the bytes in C<$data>, but nothing
 material about the nature of the string has changed as far as Perl is
-- 
1.7.9.5

p5pRT · 2016-09-18T16:32:21Z

From @pali

0008-Encode-In-documentation-examples-show-strict-UTF-8-e.patch

From effa77f0d53a2dd24a60f9a036e3cf7b032ee2d3 Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:57:16 +0200
Subject: [PATCH 08/10] Encode: In documentation examples show strict UTF-8
 encoding

It is better to show examples which produce valid UTF-8 strings instead of
non-strict perl's utf8.
---
 cpan/Encode/Encode.pm |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/cpan/Encode/Encode.pm b/cpan/Encode/Encode.pm
index bda8e1b..65a80dc 100644
--- a/cpan/Encode/Encode.pm
+++ b/cpan/Encode/Encode.pm
@@ -505,11 +505,11 @@ ISO-8859-1, also known as Latin1:
 
   $octets = encode("iso-8859-1", $string);
 
-B<CAVEAT>: When you run C<$octets = encode("utf8", $string)>, then
+B<CAVEAT>: When you run C<$octets = encode("UTF-8", $string)>, then
 $octets I<might not be equal to> $string.  Though both contain the
 same data, the UTF8 flag for $octets is I<always> off.  When you
 encode anything, the UTF8 flag on the result is always off, even when it
-contains a completely valid utf8 string. See L</"The UTF8 flag"> below.
+contains a completely valid UTF-8 string. See L</"The UTF8 flag"> below.
 
 If the $string is C<undef>, then C<undef> is returned.
 
@@ -533,7 +533,7 @@ internal format:
 
   $string = decode("iso-8859-1", $octets);
 
-B<CAVEAT>: When you run C<$string = decode("utf8", $octets)>, then $string
+B<CAVEAT>: When you run C<$string = decode("UTF-8", $octets)>, then $string
 I<might not be equal to> $octets.  Though both contain the same data, the
 UTF8 flag for $string is on.  See L</"The UTF8 flag">
 below.
@@ -599,13 +599,13 @@ and C<undef> on error.
 
 B<CAVEAT>: The following operations may look the same, but are not:
 
-  from_to($data, "iso-8859-1", "utf8"); #1
+  from_to($data, "iso-8859-1", "UTF-8"); #1
   $data = decode("iso-8859-1", $data);  #2
 
 Both #1 and #2 make $data consist of a completely valid UTF-8 string,
 but only #2 turns the UTF8 flag on.  #1 is equivalent to:
 
-  $data = encode("utf8", decode("iso-8859-1", $data));
+  $data = encode("UTF-8", decode("iso-8859-1", $data));
 
 See L</"The UTF8 flag"> below.
 
-- 
1.7.9.5

p5pRT · 2016-09-18T16:32:21Z

From @pali

0009-Encode-In-documentation-examples-use-name-string-for.patch

From ceb1eea37de3de098e996774dca0f43bc5c9a21e Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:57:21 +0200
Subject: [PATCH 09/10] Encode: In documentation examples use name $string for
 return value of decode function

Function decode() does not have to return string with UTF8 flag on,
therefore $utf8 is not good name.
---
 cpan/Encode/Encode.pm |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/cpan/Encode/Encode.pm b/cpan/Encode/Encode.pm
index 65a80dc..93ad353 100644
--- a/cpan/Encode/Encode.pm
+++ b/cpan/Encode/Encode.pm
@@ -548,11 +548,11 @@ Returns the I<encoding object> corresponding to I<ENCODING>.  Returns
 C<undef> if no matching I<ENCODING> is find.  The returned object is
 what does the actual encoding or decoding.
 
-  $utf8 = decode($name, $bytes);
+  $string = decode($name, $bytes);
 
 is in fact
 
-    $utf8 = do {
+    $string = do {
         $obj = find_encoding($name);
         croak qq(encoding "$name" not found) unless ref $obj;
         $obj->decode($bytes);
@@ -564,8 +564,8 @@ You can therefore save time by reusing this object as follows;
 
     my $enc = find_encoding("iso-8859-1");
     while(<>) {
-        my $utf8 = $enc->decode($_);
-        ... # now do something with $utf8;
+        my $string = $enc->decode($_);
+        ... # now do something with $string;
     }
 
 Besides L</decode> and L</encode>, other methods are
@@ -955,9 +955,9 @@ When you I<encode>, the resulting UTF8 flag is always B<off>.
 
 When you I<decode>, the resulting UTF8 flag is B<on>--I<unless> you can
 unambiguously represent data.  Here is what we mean by "unambiguously".
-After C<$utf8 = decode("foo", $octet)>,
+After C<$str = decode("foo", $octet)>,
 
-  When $octet is...   The UTF8 flag in $utf8 is
+  When $octet is...    The UTF8 flag in $str is
   ---------------------------------------------
   In ASCII only (or EBCDIC only)            OFF
   In ISO-8859-1                              ON
-- 
1.7.9.5

p5pRT · 2016-09-18T16:32:21Z

From @pali

0010-Encode-Add-warning-information-about-encode_utf8-dec.patch

From 53a3d92f267d9551a40397a40ddac19de4caccd8 Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:57:24 +0200
Subject: [PATCH 10/10] Encode: Add warning information about
 encode_utf8/decode_utf8 to documentation

---
 cpan/Encode/Encode.pm |   14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/cpan/Encode/Encode.pm b/cpan/Encode/Encode.pm
index 93ad353..a676fbe 100644
--- a/cpan/Encode/Encode.pm
+++ b/cpan/Encode/Encode.pm
@@ -630,7 +630,11 @@ followed by C<encode> as follows:
 Equivalent to C<$octets = encode("utf8", $string)>.  The characters in
 $string are encoded in Perl's internal format, and the result is returned
 as a sequence of octets.  Because all possible characters in Perl have a
-(loose, not strict) UTF-8 representation, this function cannot fail.
+(loose, not strict) utf8 representation, this function cannot fail.
+
+B<WARNING>: do not use this function for data exchange as it can produce
+not strict utf8 $octets! For strictly valid UTF-8 output use
+C<$octets = encode("UTF-8", $string)>.
 
 =head3 decode_utf8
 
@@ -638,11 +642,15 @@ as a sequence of octets.  Because all possible characters in Perl have a
 
 Equivalent to C<$string = decode("utf8", $octets [, CHECK])>.
 The sequence of octets represented by $octets is decoded
-from UTF-8 into a sequence of logical characters.
-Because not all sequences of octets are valid UTF-8,
+from (loose, not strict) utf8 into a sequence of logical characters.
+Because not all sequences of octets are valid not strict utf8,
 it is quite possible for this function to fail.
 For CHECK, see L</"Handling Malformed Data">.
 
+B<WARNING>: do not use this function for data exchange as it can produce
+$string with not strict utf8 representation! For strictly valid UTF-8
+$string representation use C<$string = decode("UTF-8", $octets [, CHECK])>.
+
 B<CAVEAT>: the input I<$octets> might be modified in-place depending on
 what is set in CHECK. See L</LEAVE_SRC> if you want your inputs to be
 left unchanged.
-- 
1.7.9.5
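
For readers following the patch above, here is a minimal sketch of the difference between the lax C<utf8> and strict C<UTF-8> encodings in Encode. It is not part of the submitted patches; the code point 0x110000 and the variable names are chosen only for illustration, and the behaviour shown assumes a reasonably recent Encode.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# "UTF-8" and "utf8" resolve to two different Encode encodings.
my $strict_name = find_encoding("UTF-8")->name;   # 'utf-8-strict'
my $lax_name    = find_encoding("utf8")->name;    # 'utf8'

# A code point above U+10FFFF exists in Perl but not in Unicode
# (illustrative value, not taken from the ticket).
my $string = chr(0x110000);

# Lax utf8 encodes it anyway; LEAVE_SRC keeps $string untouched.
my $lax_octets =
    encode("utf8", $string, Encode::FB_CROAK() | Encode::LEAVE_SRC());

# Strict UTF-8 refuses it when asked to croak on errors.
my $strict_ok = eval {
    encode("UTF-8", $string, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};

printf "%s: %d octets\n", $lax_name, length $lax_octets;
printf "%s: %s\n", $strict_name, $strict_ok ? "encoded" : "rejected";
```

The lax encoding emits octets that a strict UTF-8 consumer would reject, which is the data-exchange concern the warnings in the patch describe.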

p5pRT · 2016-09-18T23:27:48Z

From @jkeenan

On Sun Sep 18 09:32:21 2016, pali@cpan.org wrote:

This is a bug report for perl from pali@cpan.org,
generated with the help of perlbug 1.40 running under perl 5.25.5.

-----------------------------------------------------------------
[Please describe your issue here]

Attached patches update documentation about UTF-8. For data exchange
it is better to use strict UTF-8 encoding and not perl's utf8. Also it
is wrong to use insecure :utf8 PerlIO layer for reading arbitrary
input file.

1. The Encode library is "cpan upstream," i.e., it is primarily maintained on CPAN. Hence, requests for changes in its documentation -- your patches 0008, 0009, 0010 -- should be filed via bug-Encode@rt.cpan.org or via the web interface at https://rt.cpan.org/Dist/Display.html?Name=Encode.

2. Because at least 7 different files are touched by the patches attached to this ticket, I think we should get multiple eyeballs on them. Paging our experts on Unicode and IO layers!

Thank you very much.

--
James E Keenan (jkeenan@cpan.org)

p5pRT · 2016-09-18T23:27:48Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2016-09-19T16:27:54Z

From @pali

On Sunday 18 September 2016 16:27:48 James E Keenan via RT wrote:

1. The Encode library is "cpan upstream," i.e., it is primarily maintained on CPAN. Hence, requests for changes in its documentation -- your patches 0008, 0009, 0010 -- should be filed via bug-Encode@rt.cpan.org or via the web interface at https://rt.cpan.org/Dist/Display.html?Name=Encode.

2. Because at least 7 different files are touched by the patches attached to this ticket, I think we should get multiple eyeballs on them. Paging our experts on Unicode and IO layers!

Ok! Anyway, all changes are only to documentation sections so other
people could look at it too. There is no code change.

And Encode patches are there too as they are referenced by core perl pod
files. So before sending them to cpan upstream it could be great if you
can review them too...

p5pRT · 2016-10-11T05:02:50Z

From @tonycoz

On Mon Sep 19 09:27:54 2016, pali@cpan.org wrote:

On Sunday 18 September 2016 16:27:48 James E Keenan via RT wrote:

1. The Encode library is "cpan upstream," i.e., it is primarily
maintained on CPAN. Hence, requests for changes in its documentation
-- your patches 0008, 0009, 0010 -- should be filed via
bug-Encode@rt.cpan.org or via the web interface at
https://rt.cpan.org/Dist/Display.html?Name=Encode.

2. Because at least 7 different files are touched by the patches
attached to this ticket, I think we should get multiple eyeballs on
them. Paging our experts on Unicode and IO layers!

Ok! Anyway, all changes are only to documentation sections so other
people could look at it too. There is no code change.

And Encode patches are there too as they are referenced by core perl
pod files. So before sending them to cpan upstream it could be great
if you can review them too...

0001:

@@ -280,7 +280,7 @@ Files opened without an encoding argument will be in UTF-8:
or
$ export PERL_UNICODE=D
or
- use open qw(:utf8);
+ use open qw(:encoding(UTF-8));

=head2 ℞ 18: Make all I/O and args default to utf8

Unfortunately this makes the examples no longer equivalent.

0003:

@@ -3764,8 +3764,8 @@ many elements these have. For that, use C<scalar @array> and C<scalar keys
Like all Perl character operations, L<C<length>|/length EXPR> normally
deals in logical
characters, not physical bytes. For how many bytes a string encoded as
-UTF-8 would take up, use C<length(Encode::encode_utf8(EXPR))> (you'll have
-to C<use Encode> first). See L<Encode> and L<perlunicode>.
+UTF-8 would take up, use C<bytes::length(EXPR)> (you'll have to
+C<use bytes ()> first). See L<C<use bytes>|bytes> pragma and L<perlunicode>.

=item __LINE__
X<__LINE__>

This is just plain incorrect. Whether the length returned by bytes::length() is the UTF-8 encoded length depends on the internal encoding of the string:

$ perl -Mbytes -MEncode -le '$x = "\xA0"; print bytes::length $x; print length Encode::encode("UTF-8", $x)'
1
2
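
The one-liner above can be expanded into a minimal sketch showing why C<bytes::length> is not a reliable measure of UTF-8 size. The helper names C<internal_bytes> and C<utf8_bytes> are hypothetical, and the numbers assume an ASCII (non-EBCDIC) platform.

```perl
use strict;
use warnings;
use bytes ();                 # load bytes::length without enabling the pragma
use Encode qw(encode);

# Byte length of the string's *internal* buffer (hypothetical helper).
sub internal_bytes { my ($s) = @_; return bytes::length($s) }

# Byte length of the string once encoded as strict UTF-8 (hypothetical helper).
sub utf8_bytes { my ($s) = @_; return length encode("UTF-8", $s) }

my $native   = "\xA0";        # stored as a single byte internally
my $upgraded = "\xA0";
utf8::upgrade($upgraded);     # same character, now stored as UTF-8 internally

printf "native:   internal=%d encoded=%d\n",
    internal_bytes($native), utf8_bytes($native);
printf "upgraded: internal=%d encoded=%d\n",
    internal_bytes($upgraded), utf8_bytes($upgraded);
```

Both strings compare equal with C<eq>, yet the internal byte count differs (1 versus 2) while the encoded length is 2 in both cases, so only the C<encode("UTF-8", ...)> route answers the question the documentation is asking.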

0004:

+C<decode('UTF-8', ...)> and C<encode('UTF-8', ...)>; see
+L</What's the difference between UTF-8 and utf8?> under.

"under" what? This would normally be "below" instead, I think.

0009:

A string of what is the issue. Maybe C< $characters > instead of
C< $string >, but that's more Dan's decision.

Tony

p5pRT · 2016-10-12T03:04:20Z

From @pali

On Monday 10 October 2016 22:02:51 Tony Cook via RT wrote:

On Mon Sep 19 09:27:54 2016, pali@cpan.org wrote:

On Sunday 18 September 2016 16:27:48 James E Keenan via RT wrote:

1. The Encode library is "cpan upstream," i.e., it is primarily
maintained on CPAN. Hence, requests for changes in its documentation
-- your patches 0008, 0009, 0010 -- should be filed via
bug-Encode@rt.cpan.org or via the web interface at
https://rt.cpan.org/Dist/Display.html?Name=Encode.

2. Because at least 7 different files are touched by the patches
attached to this ticket, I think we should get multiple eyeballs on
them. Paging our experts on Unicode and IO layers!

Ok! Anyway, all changes are only to documentation sections so other
people could look at it too. There is no code change.

And Encode patches are there too as they are referenced by core perl
pod files. So before sending them to cpan upstream it could be great
if you can review them too...

0001:

@@ -280,7 +280,7 @@ Files opened without an encoding argument will be in UTF-8:
or
$ export PERL_UNICODE=D
or
- use open qw(:utf8);
+ use open qw(:encoding(UTF-8));

=head2 ℞ 18: Make all I/O and args default to utf8

Unfortunately this makes the examples no longer equivalent.

When reading an untrusted file of unknown content, it is still better to
use the :encoding layer rather than :utf8 directly. Do you have any other
suggestion?
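
As a minimal sketch of the per-handle form of that advice (not taken from the patches), the following reads octets through the C<:encoding(UTF-8)> layer. The in-memory handle stands in for a real file, and the sample octets are an assumption chosen for illustration.

```perl
use strict;
use warnings;

# Octets as they might arrive from a file: U+0100 encoded as UTF-8,
# followed by a newline (illustrative data).
my $octets = "\xC4\x80\n";

# Open an in-memory handle (a stand-in for a real file) with the
# strict decoding layer applied explicitly.
open(my $fh, '<:encoding(UTF-8)', \$octets) or die "Cannot open: $!";
my $line = <$fh>;
close($fh);

chomp $line;
printf "%d character(s), first is U+%04X\n", length($line), ord($line);
```

This applies the layer to a single handle, whereas C<use open> and C<PERL_UNICODE> change defaults, which is one reason the recipes in the patch are hard to keep equivalent.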

0003:

@@ -3764,8 +3764,8 @@ many elements these have. For that, use C<scalar @array> and C<scalar keys
Like all Perl character operations, L<C<length>|/length EXPR> normally
deals in logical
characters, not physical bytes. For how many bytes a string encoded as
-UTF-8 would take up, use C<length(Encode::encode_utf8(EXPR))> (you'll have
-to C<use Encode> first). See L<Encode> and L<perlunicode>.
+UTF-8 would take up, use C<bytes::length(EXPR)> (you'll have to
+C<use bytes ()> first). See L<C<use bytes>|bytes> pragma and L<perlunicode>.

=item __LINE__
X<__LINE__>

This is just plain incorrect. Whether the length returned by bytes::length() is the UTF-8 encoded length depends on the internal encoding of the string:

$ perl -Mbytes -MEncode -le '$x = "\xA0"; print bytes::length $x; print length Encode::encode("UTF-8", $x)'
1
2

Alright, my change is correct only for strings in perl's internal utf8
encoding...

What about this change?

characters, not physical bytes. For how many bytes a string encoded as
-UTF-8 would take up, use C<length(Encode::encode_utf8(EXPR))> (you'll have
+UTF-8 would take up, use C<length(Encode::encode('UTF-8', EXPR))> (you'll have
to C<use Encode> first). See L<Encode> and L<perlunicode>.

0004:

+C<decode('UTF-8', ...)> and C<encode('UTF-8', ...)>; see
+L</What's the difference between UTF-8 and utf8?> under.

"under" what? This would normally be "below" instead, I think.

Yes, below is correct here.

0009:

A string of what is the issue. Maybe C< $characters > instead of
C< $string >, but that's more Dan's decision.

Right, from Encode https://metacpan.org/pod/Encode#TERMINOLOGY
"characters" is correct name.

Tony

Thank you for review!

p5pRT · 2016-10-23T04:01:43Z

From @pali

On Tuesday 11 October 2016 09:45:39 pali@cpan.org wrote:

On Monday 10 October 2016 22:02:51 Tony Cook via RT wrote:

0009:

A string of what is the issue. Maybe C< $characters > instead of
C< $string >, but that's more Dan's decision.

Right, from Encode https://metacpan.org/pod/Encode#TERMINOLOGY
"characters" is correct name.

Err... no. The Encode documentation defines a string as:
"Perl strings are sequences of characters."

So C< $string > is the correct name in this case.

p5pRT · 2016-10-23T07:17:28Z

From @andk

On Sat, 22 Oct 2016 12:13:41 +0200, pali@cpan.org said:

 my $unicode = chr(0x100);
 print length($unicode), "\n"; # will print 1
- require Encode;
- print length(Encode::encode_utf8($unicode)),"\n"; # will print 2
 use bytes;
 print length($unicode), "\n"; # will also print 2
                               # (the 0xC4 0x80 of the UTF-8)
Now the "also" in line 6 has lost the point of reference.

--
andreas

p5pRT · 2016-10-23T09:20:35Z

From @pali

On Sunday 23 October 2016 09:16:49 Andreas Koenig wrote:

On Sat, 22 Oct 2016 12:13:41 +0200, pali@cpan.org said:
 my $unicode = chr(0x100);
 print length($unicode), "\n"; # will print 1
- require Encode;
- print length(Encode::encode_utf8($unicode)),"\n"; # will print 2
 use bytes;
 print length($unicode), "\n"; # will also print 2
                               # (the 0xC4 0x80 of the UTF-8)

Now the "also" in line 6 has lost the point of reference.

Alright! Anything else?

p5pRT · 2016-11-11T10:31:35Z

From @pali

On Sunday 23 October 2016 09:17:28 (Andreas J. Koenig) via RT wrote:

Now the "also" in line 6 has lost the point of reference.

Fixed, new version (v3) of that patch is attached.

p5pRT · 2016-11-11T10:31:35Z

From @pali

v3-0003-pod-Do-not-suggest-to-use-Encode-encode_utf8-when-yo.patch

From 93f829f9e96d57c62e7523895763beac75e3b98d Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:21:54 +0200
Subject: [PATCH v3 03/10] pod: Do not suggest to use Encode::encode_utf8()
 when you need to know the byte length of a string

Encode module could do some additional operations and bytes pragma is
supposed to do that job.
---
 pod/perluniintro.pod |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 0e3f4bc..a5b7707 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -726,16 +726,13 @@ the output string will be UTF-8-encoded C<ab\x80c = \x{100}\n>, but
 C<$a> will stay byte-encoded.
 
 Sometimes you might really need to know the byte length of a string
-instead of the character length. For that use either the
-C<Encode::encode_utf8()> function or the C<bytes> pragma
+instead of the character length. For that use the C<bytes> pragma
 and the C<length()> function:
 
     my $unicode = chr(0x100);
     print length($unicode), "\n"; # will print 1
-    require Encode;
-    print length(Encode::encode_utf8($unicode)),"\n"; # will print 2
     use bytes;
-    print length($unicode), "\n"; # will also print 2
+    print length($unicode), "\n"; # will print 2
                                   # (the 0xC4 0x80 of the UTF-8)
     no bytes;
 
-- 
1.7.9.5

p5pRT · 2016-11-17T17:47:06Z

From @pali

On Saturday 22 October 2016 12:13:41 pali@cpan.org wrote:

In attachments are new versions (v2) of third and four patches with
Tony's objections.

On Friday 11 November 2016 11:30:41 pali@cpan.org wrote:

On Sunday 23 October 2016 09:17:28 (Andreas J. Koenig) via RT wrote:

Now the "also" in line 6 has lost the point of reference.

Fixed, new version (v3) of that patch is attached.

@jkeenan, @Tony, @andreas: It is OK now or are some other modifications needed?

p5pRT · 2016-11-17T23:53:21Z

From @Leont

On Thu, Nov 17, 2016 at 2:46 PM, <pali@cpan.org> wrote:

On Saturday 22 October 2016 12:13:41 pali@cpan.org wrote:

In attachments are new versions (v2) of third and four patches with
Tony's objections.

On Friday 11 November 2016 11:30:41 pali@cpan.org wrote:

On Sunday 23 October 2016 09:17:28 (Andreas J. Koenig) via RT wrote:

Now the "also" in line 6 has lost the point of reference.

Fixed, new version (v3) of that patch is attached.

@jkeenan, @Tony, @andreas: It is OK now or are some other modifications
needed?

I've recently been working on making :utf8 a lot safer, which does
invalidate some of the things that you're saying here (probably to the
point of merge conflicts because I did update some of the documentation). I
would prefer to hold those patches for the moment because my patches may
land as soon as 5.25.8.

Leon

p5pRT · 2016-11-19T14:39:25Z

From @pali

On Friday 18 November 2016 00:52:22 Leon Timmermans wrote:

I've recently been working on making :utf8 a lot safer, which does
invalidate some of the things that you're saying here (probably to
the point of merge conflicts because I did update some of the
documentation).

Changes for :utf8 are only in the first patch. The other nine patches
update documentation about Encode module usage and the :encoding layer.

I would prefer to hold those patches for the moment
because my patches may land as soon as 5.25.8.

If you are changing only the :utf8 layer, I think the other nine
patches should not create merge conflicts.

Can you try to apply my patches on top of your changes? At least we
would know if there are merge conflicts or not.

p5pRT · 2017-01-14T11:26:11Z

From @pali

On Thursday 17 November 2016 18:46:30 pali@cpan.org wrote:

On Saturday 22 October 2016 12:13:41 pali@cpan.org wrote:

In attachments are new versions (v2) of third and four patches with
Tony's objections.

On Friday 11 November 2016 11:30:41 pali@cpan.org wrote:

On Sunday 23 October 2016 09:17:28 (Andreas J. Koenig) via RT wrote:

Now the "also" in line 6 has lost the point of reference.

Fixed, new version (v3) of that patch is attached.

@jkeenan, @Tony, @andreas: It is OK now or are some other
modifications needed?

Hi! Another two months passed. Anything more needed? Or you can
accept/apply my changes to documentation?

p5pRT · 2017-01-16T05:07:55Z

From @khwilliamson

On Sat, 14 Jan 2017 03:26:11 -0800, pali@cpan.org wrote:

On Thursday 17 November 2016 18:46:30 pali@cpan.org wrote:

On Saturday 22 October 2016 12:13:41 pali@cpan.org wrote:

In attachments are new versions (v2) of third and four patches with
Tony's objections.

On Friday 11 November 2016 11:30:41 pali@cpan.org wrote:

On Sunday 23 October 2016 09:17:28 (Andreas J. Koenig) via RT wrote:

Now the "also" in line 6 has lost the point of reference.

Fixed, new version (v3) of that patch is attached.

@jkeenan, @Tony, @andreas: It is OK now or are some other
modifications needed?

Hi! Another two months passed. Anything more needed? Or you can
accept/apply my changes to documentation?

The main reason these haven't been applied is because the advice given to use :encoding(utf8) may be obsolete in 5.26, if the safe version of :utf8 makes it into 5.26. Next week is the deadline for that. But until then, it is premature to push these patches.
--
Karl Williamson

p5pRT · 2017-02-07T23:32:08Z

@khwilliamson - Status changed from 'open' to 'pending release'

p5pRT · 2017-02-07T23:32:44Z

From @khwilliamson

Thanks,

All the patches relevant to core have now been applied.
--
Karl Williamson

p5pRT · 2017-05-30T20:15:42Z

From @khwilliamson

Thank you for filing this report. You have helped make Perl better.

With the release today of Perl 5.26.0, this and 210 other issues have been
resolved.

Perl 5.26.0 may be downloaded via:
https://metacpan.org/release/XSAWYERX/perl-5.26.0

If you find that the problem persists, feel free to reopen this ticket.

p5pRT · 2017-05-30T20:15:42Z

@khwilliamson - Status changed from 'pending release' to 'resolved'

p5pRT closed this as completed May 30, 2017

p5pRT added Severity Medium type-core labels Oct 19, 2019
