[PATCH] Update documentation about UTF-8 #15612
Comments
From @pali
Created by @pali
The attached patches update the documentation about UTF-8. For data exchange it is better to use the strict UTF-8 encoding rather than perl's utf8. It is also wrong to use the insecure :utf8 PerlIO layer for reading an arbitrary input file.
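The distinction the patches rely on can be sketched as follows. This is a minimal illustration, not part of the attached patches; the helper name `is_strict_utf8` and the sample byte strings are made up for the example, while `find_encoding`, `decode`, `FB_CROAK` and `LEAVE_SRC` come from the Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Encode resolves the two spellings to two different encoding objects.
my $strict_name = find_encoding('UTF-8')->name;   # strict, standard UTF-8
my $lax_name    = find_encoding('utf8')->name;    # perl's liberal dialect

# Hypothetical helper (not from the patches): returns 1 if $octets decode
# cleanly as strict UTF-8, 0 otherwise. LEAVE_SRC keeps the caller's data
# untouched, since a CHECK value may otherwise modify the input in place.
sub is_strict_utf8 {
    my ($octets) = @_;
    my $ok = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

my $valid  = "\xC4\x80";   # the UTF-8 bytes of U+0100
my $legacy = "caf\xE9";    # Latin-1 bytes, not valid UTF-8

print "UTF-8 resolves to: $strict_name\n";
print "utf8  resolves to: $lax_name\n";
print "valid bytes accepted:  ", is_strict_utf8($valid),  "\n";
print "legacy bytes accepted: ", is_strict_utf8($legacy), "\n";
```

The same check with `'utf8'` in place of `'UTF-8'` would use the liberal dialect, which is the behaviour the patches advise against for data exchange.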
|
From @pali
0001-pod-Do-not-suggest-to-use-insecure-utf8-PerlIO-layer.patch
From a0cf022c6fb8e9a4673173e91d84217fd086dfe6 Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:16:33 +0200
Subject: [PATCH 01/10] pod: Do not suggest to use insecure :utf8 PerlIO layer
when reading files
Instead use strict :encoding(UTF-8) PerlIO layer for input files.
---
pod/perlop.pod | 9 ++++++---
pod/perlunicook.pod | 12 ++++++------
2 files changed, 12 insertions(+), 9 deletions(-)
diff --git a/pod/perlop.pod b/pod/perlop.pod
index d65e911..cf99b88 100644
--- a/pod/perlop.pod
+++ b/pod/perlop.pod
@@ -1262,16 +1262,19 @@ The only operators with lower precedence are the logical operators
C<"and">, C<"or">, and C<"not">, which may be used to evaluate calls to list
operators without the need for parentheses:
- open HANDLE, "< :utf8", "filename" or die "Can't open: $!\n";
+ open HANDLE, "< :encoding(UTF-8)", "filename"
+ or die "Can't open: $!\n";
However, some people find that code harder to read than writing
it with parentheses:
- open(HANDLE, "< :utf8", "filename") or die "Can't open: $!\n";
+ open(HANDLE, "< :encoding(UTF-8)", "filename")
+ or die "Can't open: $!\n";
in which case you might as well just use the more customary C<"||"> operator:
- open(HANDLE, "< :utf8", "filename") || die "Can't open: $!\n";
+ open(HANDLE, "< :encoding(UTF-8)", "filename")
+ || die "Can't open: $!\n";
See also discussion of list operators in L</Terms and List Operators (Leftward)>.
diff --git a/pod/perlunicook.pod b/pod/perlunicook.pod
index ac30509..6db6b73 100644
--- a/pod/perlunicook.pod
+++ b/pod/perlunicook.pod
@@ -26,7 +26,7 @@ to work correctly, with the C<#!> adjusted to work on your system:
use strict; # quote strings, declare variables
use warnings; # on by default
use warnings qw(FATAL utf8); # fatalize encoding glitches
- use open qw(:std :utf8); # undeclared streams in UTF-8
+ use open qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
use charnames qw(:full :short); # unneeded in v5.16
This I<does> make even Unix programmers C<binmode> your binary streams,
@@ -255,9 +255,9 @@ call C<binmode> explicitly:
or
$ export PERL_UNICODE=S
or
- use open qw(:std :utf8);
+ use open qw(:std :encoding(UTF-8));
or
- binmode(STDIN, ":utf8");
+ binmode(STDIN, ":encoding(UTF-8)");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");
@@ -280,7 +280,7 @@ Files opened without an encoding argument will be in UTF-8:
or
$ export PERL_UNICODE=D
or
- use open qw(:utf8);
+ use open qw(:encoding(UTF-8));
=head2 ℞ 18: Make all I/O and args default to utf8
@@ -288,7 +288,7 @@ Files opened without an encoding argument will be in UTF-8:
or
$ export PERL_UNICODE=SDA
or
- use open qw(:std :utf8);
+ use open qw(:std :encoding(UTF-8));
use Encode qw(decode_utf8);
@ARGV = map { decode_utf8($_, 1) } @ARGV;
@@ -701,7 +701,7 @@ Here's that program; tested on v5.14.
use strict;
use warnings;
use warnings qw(FATAL utf8); # fatalize encoding faults
- use open qw(:std :utf8); # undeclared streams in UTF-8
+ use open qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
use charnames qw(:full :short); # unneeded in v5.16
# std modules
--
1.7.9.5
|
From @pali
0002-pod-Suggest-to-use-strict-encoding-UTF-8-PerlIO-laye.patch
From b66b3bffd136947d78c658b31beccbcc340a0c08 Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:19:59 +0200
Subject: [PATCH 02/10] pod: Suggest to use strict :encoding(UTF-8) PerlIO
layer over not strict :encoding(utf8)
For data exchange it is better to use strict UTF-8 encoding and not perl's utf8.
---
lib/PerlIO.pm | 2 +-
lib/open.pm | 10 +++++-----
pod/perlfunc.pod | 10 +++++-----
pod/perlrun.pod | 2 +-
pod/perlunicode.pod | 2 +-
pod/perluniintro.pod | 10 +++++-----
6 files changed, 18 insertions(+), 18 deletions(-)
diff --git a/lib/PerlIO.pm b/lib/PerlIO.pm
index 2e27f98..acf2e19 100644
--- a/lib/PerlIO.pm
+++ b/lib/PerlIO.pm
@@ -104,7 +104,7 @@ is chosen to render simple text parts (i.e. non-accented letters,
digits and common punctuation) human readable in the encoded file.
(B<CAUTION>: This layer does not validate byte sequences. For reading input,
-you should instead use C<:encoding(utf8)> instead of bare C<:utf8>.)
+you should instead use C<:encoding(UTF-8)> instead of bare C<:utf8>.)
Here is how to write your native data out using UTF-8 (or UTF-EBCDIC)
and then read it back in.
diff --git a/lib/open.pm b/lib/open.pm
index fd22e1b..2392ac9 100644
--- a/lib/open.pm
+++ b/lib/open.pm
@@ -153,7 +153,7 @@ open - perl pragma to set default PerlIO layers for input and output
use open IO => ':locale';
- use open ':encoding(utf8)';
+ use open ':encoding(UTF-8)';
use open ':locale';
use open ':encoding(iso-8859-7)';
@@ -195,8 +195,8 @@ For example:
These are equivalent
- use open ':encoding(utf8)';
- use open IO => ':encoding(utf8)';
+ use open ':encoding(UTF-8)';
+ use open IO => ':encoding(UTF-8)';
as are these
@@ -221,8 +221,8 @@ The C<:std> subpragma on its own has no effect, but if combined with
the C<:utf8> or C<:encoding> subpragmas, it converts the standard
filehandles (STDIN, STDOUT, STDERR) to comply with encoding selected
for input/output handles. For example, if both input and out are
-chosen to be C<:encoding(utf8)>, a C<:std> will mean that STDIN, STDOUT,
-and STDERR are also in C<:encoding(utf8)>. On the other hand, if only
+chosen to be C<:encoding(UTF-8)>, a C<:std> will mean that STDIN, STDOUT,
+and STDERR are also in C<:encoding(UTF-8)>. On the other hand, if only
output is chosen to be in C<< :encoding(koi8r) >>, a C<:std> will cause
only the STDOUT and STDERR to be in C<koi8r>. The C<:locale> subpragma
implicitly turns on C<:std>.
diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod
index 36df5c7..8365ff7 100644
--- a/pod/perlfunc.pod
+++ b/pod/perlfunc.pod
@@ -6091,7 +6091,7 @@ Note the I<characters>: depending on the status of the socket, either
(8-bit) bytes or characters are received. By default all sockets
operate on bytes, but for example if the socket has been changed using
L<C<binmode>|/binmode FILEHANDLE, LAYER> to operate with the
-C<:encoding(utf8)> I/O layer (see the L<open> pragma), the I/O will
+C<:encoding(UTF-8)> I/O layer (see the L<open> pragma), the I/O will
operate on UTF8-encoded Unicode
characters, not bytes. Similarly for the C<:encoding> layer: in that
case pretty much any characters can be read.
@@ -6651,7 +6651,7 @@ of the file) from the L<Fcntl> module. Returns C<1> on success, false
otherwise.
Note the emphasis on bytes: even if the filehandle has been set to operate
-on characters (for example using the C<:encoding(utf8)> I/O layer), the
+on characters (for example using the C<:encoding(UTF-8)> I/O layer), the
L<C<seek>|/seek FILEHANDLE,POSITION,WHENCE>,
L<C<tell>|/tell FILEHANDLE>, and
L<C<sysseek>|/sysseek FILEHANDLE,POSITION,WHENCE>
@@ -6890,7 +6890,7 @@ Note the I<characters>: depending on the status of the socket, either
(8-bit) bytes or characters are sent. By default all sockets operate
on bytes, but for example if the socket has been changed using
L<C<binmode>|/binmode FILEHANDLE, LAYER> to operate with the
-C<:encoding(utf8)> I/O layer (see L<C<open>|/open FILEHANDLE,EXPR>, or
+C<:encoding(UTF-8)> I/O layer (see L<C<open>|/open FILEHANDLE,EXPR>, or
the L<open> pragma), the I/O will operate on UTF-8
encoded Unicode characters, not bytes. Similarly for the C<:encoding>
layer: in that case pretty much any characters can be sent.
@@ -8491,7 +8491,7 @@ to the current position plus POSITION; and C<2> to set it to EOF plus
POSITION, typically negative.
Note the emphasis on bytes: even if the filehandle has been set to operate
-on characters (for example using the C<:encoding(utf8)> I/O layer), the
+on characters (for example using the C<:encoding(UTF-8)> I/O layer), the
L<C<seek>|/seek FILEHANDLE,POSITION,WHENCE>,
L<C<tell>|/tell FILEHANDLE>, and
L<C<sysseek>|/sysseek FILEHANDLE,POSITION,WHENCE>
@@ -8658,7 +8658,7 @@ the actual filehandle. If FILEHANDLE is omitted, assumes the file
last read.
Note the emphasis on bytes: even if the filehandle has been set to operate
-on characters (for example using the C<:encoding(utf8)> I/O layer), the
+on characters (for example using the C<:encoding(UTF-8)> I/O layer), the
L<C<seek>|/seek FILEHANDLE,POSITION,WHENCE>,
L<C<tell>|/tell FILEHANDLE>, and
L<C<sysseek>|/sysseek FILEHANDLE,POSITION,WHENCE>
diff --git a/pod/perlrun.pod b/pod/perlrun.pod
index 12cba35..a890215 100644
--- a/pod/perlrun.pod
+++ b/pod/perlrun.pod
@@ -1121,7 +1121,7 @@ A pseudolayer that enables a flag in the layer below to tell Perl
that output should be in utf8 and that input should be regarded as
already in valid utf8 form. B<WARNING: It does not check for validity and as such
should be handled with extreme caution for input, because security violations
-can occur with non-shortest UTF-8 encodings, etc.> Generally C<:encoding(utf8)> is
+can occur with non-shortest UTF-8 encodings, etc.> Generally C<:encoding(UTF-8)> is
the best option when reading UTF-8 encoded data.
=item :win32
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 152c34b..bc82574 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -1879,7 +1879,7 @@ work under 5.6, so you should be safe to try them out.
A filehandle that should read or write UTF-8
if ($] > 5.008) {
- binmode $fh, ":encoding(utf8)";
+ binmode $fh, ":encoding(UTF-8)";
}
=item *
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index beccd3c..0e3f4bc 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -402,7 +402,7 @@ Unicode or legacy encodings does not magically turn the data into
Unicode in Perl's eyes. To do that, specify the appropriate
layer when opening files
- open(my $fh,'<:encoding(utf8)', 'anything');
+ open(my $fh,'<:encoding(UTF-8)', 'anything');
my $line_of_unicode = <$fh>;
open(my $fh,'<:encoding(Big5)', 'anything');
@@ -411,8 +411,8 @@ layer when opening files
The I/O layers can also be specified more flexibly with
the C<open> pragma. See L<open>, or look at the following example.
- use open ':encoding(utf8)'; # input/output default encoding will be
- # UTF-8
+ use open ':encoding(UTF-8)'; # input/output default encoding will be
+ # UTF-8
open X, ">file";
print X chr(0x100), "\n";
close X;
@@ -481,12 +481,12 @@ by repeatedly encoding the data:
local $/; ## read in the whole file of 8-bit characters
$t = <F>;
close F;
- open F, ">:encoding(utf8)", "file";
+ open F, ">:encoding(UTF-8)", "file";
print F $t; ## convert to UTF-8 on output
close F;
If you run this code twice, the contents of the F<file> will be twice
-UTF-8 encoded. A C<use open ':encoding(utf8)'> would have avoided the
+UTF-8 encoded. A C<use open ':encoding(UTF-8)'> would have avoided the
bug, or explicitly opening also the F<file> for input as UTF-8.
B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
--
1.7.9.5
|
From @pali
0003-pod-Do-not-suggest-to-use-Encode-encode_utf8-for-kno.patch
From 86c5ac3f1abcee81736340cdc3cc49f7ed5eb404 Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:21:54 +0200
Subject: [PATCH 03/10] pod: Do not suggest to use Encode::encode_utf8() for
knowing know the byte length of a string
Encode module could do some additional operations and bytes pragma is
supposed to do that job.
---
pod/perlfunc.pod | 4 ++--
pod/perluniintro.pod | 5 +----
2 files changed, 3 insertions(+), 6 deletions(-)
diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod
index 8365ff7..8f81049 100644
--- a/pod/perlfunc.pod
+++ b/pod/perlfunc.pod
@@ -3764,8 +3764,8 @@ many elements these have. For that, use C<scalar @array> and C<scalar keys
Like all Perl character operations, L<C<length>|/length EXPR> normally
deals in logical
characters, not physical bytes. For how many bytes a string encoded as
-UTF-8 would take up, use C<length(Encode::encode_utf8(EXPR))> (you'll have
-to C<use Encode> first). See L<Encode> and L<perlunicode>.
+UTF-8 would take up, use C<bytes::length(EXPR)> (you'll have to
+C<use bytes ()> first). See L<C<use bytes>|bytes> pragma and L<perlunicode>.
=item __LINE__
X<__LINE__>
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 0e3f4bc..9b6c0da 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -726,14 +726,11 @@ the output string will be UTF-8-encoded C<ab\x80c = \x{100}\n>, but
C<$a> will stay byte-encoded.
Sometimes you might really need to know the byte length of a string
-instead of the character length. For that use either the
-C<Encode::encode_utf8()> function or the C<bytes> pragma
+instead of the character length. For that use the C<bytes> pragma
and the C<length()> function:
my $unicode = chr(0x100);
print length($unicode), "\n"; # will print 1
- require Encode;
- print length(Encode::encode_utf8($unicode)),"\n"; # will print 2
use bytes;
print length($unicode), "\n"; # will also print 2
# (the 0xC4 0x80 of the UTF-8)
--
1.7.9.5
|
From @pali
0004-pod-Suggest-to-use-strict-UTF-8-encoding-when-dealin.patch
From 20c8001d5efa4c9a47e4629ad6227109ff0f184f Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:25:48 +0200
Subject: [PATCH 04/10] pod: Suggest to use strict UTF-8 encoding when dealing
with external data
For data exchange it is not good idea to use not strict perl's extended
dialect of utf8 encoding.
---
pod/perldiag.pod | 2 +-
pod/perlpacktut.pod | 7 ++++---
pod/perlunicode.pod | 8 ++++----
pod/perlunicook.pod | 8 ++++----
pod/perlunifaq.pod | 6 ++++--
pod/perluniintro.pod | 8 ++++----
6 files changed, 21 insertions(+), 18 deletions(-)
diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index 6d82cde..58ed165 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -3351,7 +3351,7 @@ encoding rules, even though it had the UTF8 flag on.
One possible cause is that you set the UTF8 flag yourself for data that
you thought to be in UTF-8 but it wasn't (it was for example legacy
-8-bit data). To guard against this, you can use Encode::decode_utf8.
+8-bit data). To guard against this, you can use C<Encode::decode('UTF-8', ...)>.
If you use the C<:encoding(UTF-8)> PerlIO layer for input, invalid byte
sequences are handled gracefully, but if you use C<:utf8>, the flag is
diff --git a/pod/perlpacktut.pod b/pod/perlpacktut.pod
index f40d1c2..f6a9411 100644
--- a/pod/perlpacktut.pod
+++ b/pod/perlpacktut.pod
@@ -668,9 +668,10 @@ Usually you'll want to pack or unpack UTF-8 strings:
my @hebrew = unpack( 'U*', $utf );
Please note: in the general case, you're better off using
-Encode::decode_utf8 to decode a UTF-8 encoded byte string to a Perl
-Unicode string, and Encode::encode_utf8 to encode a Perl Unicode string
-to UTF-8 bytes. These functions provide means of handling invalid byte
+L<C<Encode::decode('UTF-8', $utf)>|Encode/decode> to decode a UTF-8
+encoded byte string to a Perl Unicode string, and
+L<C<Encode::encode('UTF-8', $str)>|Encode/encode> to encode a Perl Unicode
+string to UTF-8 bytes. These functions provide means of handling invalid byte
sequences and generally have a friendlier interface.
=head2 Another Portable Binary Encoding
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index bc82574..229b9f8 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -1894,7 +1894,7 @@ check the documentation to verify if this is still true.
if ($] > 5.008) {
require Encode;
- $val = Encode::encode_utf8($val); # make octets
+ $val = Encode::encode("UTF-8", $val); # make octets
}
=item *
@@ -1906,7 +1906,7 @@ want the UTF8 flag restored:
if ($] > 5.008) {
require Encode;
- $val = Encode::decode_utf8($val);
+ $val = Encode::decode("UTF-8", $val);
}
=item *
@@ -2007,8 +2007,8 @@ Perl's internal representation like so:
sub my_escape_html ($) {
my($what) = shift;
return unless defined $what;
- Encode::decode_utf8(Foo::Bar::escape_html(
- Encode::encode_utf8($what)));
+ Encode::decode("UTF-8", Foo::Bar::escape_html(
+ Encode::encode("UTF-8", $what)));
}
Sometimes, when the extension does not convert data but just stores
diff --git a/pod/perlunicook.pod b/pod/perlunicook.pod
index 6db6b73..eb395f7 100644
--- a/pod/perlunicook.pod
+++ b/pod/perlunicook.pod
@@ -234,8 +234,8 @@ C<binmode> as described later below.
or
$ export PERL_UNICODE=A
or
- use Encode qw(decode_utf8);
- @ARGV = map { decode_utf8($_, 1) } @ARGV;
+ use Encode qw(decode);
+ @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
=head2 ℞ 14: Decode program arguments as locale encoding
@@ -289,8 +289,8 @@ Files opened without an encoding argument will be in UTF-8:
$ export PERL_UNICODE=SDA
or
use open qw(:std :encoding(UTF-8));
- use Encode qw(decode_utf8);
- @ARGV = map { decode_utf8($_, 1) } @ARGV;
+ use Encode qw(decode);
+ @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
=head2 ℞ 19: Open file with specific encoding
diff --git a/pod/perlunifaq.pod b/pod/perlunifaq.pod
index 4135fba..63d1d5d 100644
--- a/pod/perlunifaq.pod
+++ b/pod/perlunifaq.pod
@@ -199,7 +199,9 @@ or by letting automatic decoding and encoding do all the work:
=head2 What are C<decode_utf8> and C<encode_utf8>?
These are alternate syntaxes for C<decode('utf8', ...)> and C<encode('utf8',
-...)>.
+...)>. Do not use these functions for data exchange. Instead use
+C<decode('UTF-8', ...)> and C<encode('UTF-8', ...)>; see
+L</What's the difference between UTF-8 and utf8?> under.
=head2 What is a "wide character"?
@@ -283,7 +285,7 @@ C<UTF-8> is the official standard. C<utf8> is Perl's way of being liberal in
what it accepts. If you have to communicate with things that aren't so liberal,
you may want to consider using C<UTF-8>. If you have to communicate with things
that are too liberal, you may have to use C<utf8>. The full explanation is in
-L<Encode>.
+L<Encode/"UTF-8 vs. utf8 vs. UTF8">.
C<UTF-8> is internally known as C<utf-8-strict>. The tutorial uses UTF-8
consistently, even where utf8 is actually used internally, because the
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 9b6c0da..7a06741 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -749,12 +749,12 @@ How Do I Detect Data That's Not Valid In a Particular Encoding?
Use the C<Encode> package to try converting it.
For example,
- use Encode 'decode_utf8';
+ use Encode 'decode';
- if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
- # $string is valid utf8
+ if (eval { decode('UTF-8', $string, Encode::FB_CROAK); 1 }) {
+ # $string is valid UTF-8
} else {
- # $string is not valid utf8
+ # $string is not valid UTF-8
}
Or use C<unpack> to try decoding it:
--
1.7.9.5
|
From @pali
0005-perluniintro-Suggest-to-use-utf8-decode-instead-heav.patch
From 12ca635c60816a7bf2fa4ad36bcb911d9e271908 Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:44:22 +0200
Subject: [PATCH 05/10] perluniintro: Suggest to use utf8::decode() instead
heavy Encode when sequence of bytes is valid UTF-8
---
pod/perluniintro.pod | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 7a06741..ea61443 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -814,8 +814,8 @@ pack/unpack to convert to/from Unicode.
If you have a sequence of bytes you B<know> is valid UTF-8,
but Perl doesn't know it yet, you can make Perl a believer, too:
- use Encode 'decode_utf8';
- $Unicode = decode_utf8($bytes);
+ $Unicode = $bytes;
+ utf8::decode($Unicode);
or:
--
1.7.9.5
|
From @pali
0006-perluniintro-Fix-comment-Encode-decode-does-not-have.patch
From 011dd2cf8e095fdf410071b52509cb10877a756c Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:45:57 +0200
Subject: [PATCH 06/10] perluniintro: Fix comment, Encode::decode does not
have to return string with UTF8 flag set
---
pod/perluniintro.pod | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index ea61443..f152844 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -354,7 +354,7 @@ The C<Encode> module knows about many encodings and has interfaces
for doing conversions between those encodings:
use Encode 'decode';
- $data = decode("iso-8859-3", $data); # convert from legacy to utf-8
+ $data = decode("iso-8859-3", $data); # convert from legacy
=head2 Unicode I/O
--
1.7.9.5
|
From @pali
0007-perluniintro-Use-uppercase-UTF-8-encoding-name.patch
From 66cf660b59d2b351f9c2ed0d808a8185bae4ee93 Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:52:36 +0200
Subject: [PATCH 07/10] perluniintro: Use uppercase UTF-8 encoding name
Reason is consistency with other documentation files.
---
pod/perluniintro.pod | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index f152844..244c1fd 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -389,7 +389,7 @@ many encodings have several aliases. Note that the C<:utf8> layer
must always be specified exactly like that; it is I<not> subject to
the loose matching of encoding names. Also note that currently C<:utf8> is unsafe for
input, because it accepts the data without validating that it is indeed valid
-UTF-8; you should instead use C<:encoding(utf-8)> (with or without a
+UTF-8; you should instead use C<:encoding(UTF-8)> (with or without a
hyphen).
See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
@@ -785,7 +785,7 @@ If you have a raw sequence of bytes that you know should be
interpreted via a particular encoding, you can use C<Encode>:
use Encode 'from_to';
- from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8
+ from_to($data, "iso-8859-1", "UTF-8"); # from latin-1 to UTF-8
The call to C<from_to()> changes the bytes in C<$data>, but nothing
material about the nature of the string has changed as far as Perl is
--
1.7.9.5
|
From @pali
0008-Encode-In-documentation-examples-show-strict-UTF-8-e.patch
From effa77f0d53a2dd24a60f9a036e3cf7b032ee2d3 Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:57:16 +0200
Subject: [PATCH 08/10] Encode: In documentation examples show strict UTF-8
encoding
It is better to show examples which produce valid UTF-8 strings instead of
non-strict perl's utf8.
---
cpan/Encode/Encode.pm | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/cpan/Encode/Encode.pm b/cpan/Encode/Encode.pm
index bda8e1b..65a80dc 100644
--- a/cpan/Encode/Encode.pm
+++ b/cpan/Encode/Encode.pm
@@ -505,11 +505,11 @@ ISO-8859-1, also known as Latin1:
$octets = encode("iso-8859-1", $string);
-B<CAVEAT>: When you run C<$octets = encode("utf8", $string)>, then
+B<CAVEAT>: When you run C<$octets = encode("UTF-8", $string)>, then
$octets I<might not be equal to> $string. Though both contain the
same data, the UTF8 flag for $octets is I<always> off. When you
encode anything, the UTF8 flag on the result is always off, even when it
-contains a completely valid utf8 string. See L</"The UTF8 flag"> below.
+contains a completely valid UTF-8 string. See L</"The UTF8 flag"> below.
If the $string is C<undef>, then C<undef> is returned.
@@ -533,7 +533,7 @@ internal format:
$string = decode("iso-8859-1", $octets);
-B<CAVEAT>: When you run C<$string = decode("utf8", $octets)>, then $string
+B<CAVEAT>: When you run C<$string = decode("UTF-8", $octets)>, then $string
I<might not be equal to> $octets. Though both contain the same data, the
UTF8 flag for $string is on. See L</"The UTF8 flag">
below.
@@ -599,13 +599,13 @@ and C<undef> on error.
B<CAVEAT>: The following operations may look the same, but are not:
- from_to($data, "iso-8859-1", "utf8"); #1
+ from_to($data, "iso-8859-1", "UTF-8"); #1
$data = decode("iso-8859-1", $data); #2
Both #1 and #2 make $data consist of a completely valid UTF-8 string,
but only #2 turns the UTF8 flag on. #1 is equivalent to:
- $data = encode("utf8", decode("iso-8859-1", $data));
+ $data = encode("UTF-8", decode("iso-8859-1", $data));
See L</"The UTF8 flag"> below.
--
1.7.9.5
|
From @pali
0009-Encode-In-documentation-examples-use-name-string-for.patch
From ceb1eea37de3de098e996774dca0f43bc5c9a21e Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:57:21 +0200
Subject: [PATCH 09/10] Encode: In documentation examples use name $string for
return value of decode function
Function decode() does not have to return string with UTF8 flag on,
therefore $utf8 is not good name.
---
cpan/Encode/Encode.pm | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/cpan/Encode/Encode.pm b/cpan/Encode/Encode.pm
index 65a80dc..93ad353 100644
--- a/cpan/Encode/Encode.pm
+++ b/cpan/Encode/Encode.pm
@@ -548,11 +548,11 @@ Returns the I<encoding object> corresponding to I<ENCODING>. Returns
C<undef> if no matching I<ENCODING> is find. The returned object is
what does the actual encoding or decoding.
- $utf8 = decode($name, $bytes);
+ $string = decode($name, $bytes);
is in fact
- $utf8 = do {
+ $string = do {
$obj = find_encoding($name);
croak qq(encoding "$name" not found) unless ref $obj;
$obj->decode($bytes);
@@ -564,8 +564,8 @@ You can therefore save time by reusing this object as follows;
my $enc = find_encoding("iso-8859-1");
while(<>) {
- my $utf8 = $enc->decode($_);
- ... # now do something with $utf8;
+ my $string = $enc->decode($_);
+ ... # now do something with $string;
}
Besides L</decode> and L</encode>, other methods are
@@ -955,9 +955,9 @@ When you I<encode>, the resulting UTF8 flag is always B<off>.
When you I<decode>, the resulting UTF8 flag is B<on>--I<unless> you can
unambiguously represent data. Here is what we mean by "unambiguously".
-After C<$utf8 = decode("foo", $octet)>,
+After C<$str = decode("foo", $octet)>,
- When $octet is... The UTF8 flag in $utf8 is
+ When $octet is... The UTF8 flag in $str is
---------------------------------------------
In ASCII only (or EBCDIC only) OFF
In ISO-8859-1 ON
--
1.7.9.5
|
From @pali
0010-Encode-Add-warning-information-about-encode_utf8-dec.patch
From 53a3d92f267d9551a40397a40ddac19de4caccd8 Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:57:24 +0200
Subject: [PATCH 10/10] Encode: Add warning information about
encode_utf8/decode_utf8 to documentation
---
cpan/Encode/Encode.pm | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/cpan/Encode/Encode.pm b/cpan/Encode/Encode.pm
index 93ad353..a676fbe 100644
--- a/cpan/Encode/Encode.pm
+++ b/cpan/Encode/Encode.pm
@@ -630,7 +630,11 @@ followed by C<encode> as follows:
Equivalent to C<$octets = encode("utf8", $string)>. The characters in
$string are encoded in Perl's internal format, and the result is returned
as a sequence of octets. Because all possible characters in Perl have a
-(loose, not strict) UTF-8 representation, this function cannot fail.
+(loose, not strict) utf8 representation, this function cannot fail.
+
+B<WARNING>: do not use this function for data exchange as it can produce
+not strict utf8 $octets! For strictly valid UTF-8 output use
+C<$octets = encode("UTF-8", $string)>.
=head3 decode_utf8
@@ -638,11 +642,15 @@ as a sequence of octets. Because all possible characters in Perl have a
Equivalent to C<$string = decode("utf8", $octets [, CHECK])>.
The sequence of octets represented by $octets is decoded
-from UTF-8 into a sequence of logical characters.
-Because not all sequences of octets are valid UTF-8,
+from (loose, not strict) utf8 into a sequence of logical characters.
+Because not all sequences of octets are valid not strict utf8,
it is quite possible for this function to fail.
For CHECK, see L</"Handling Malformed Data">.
+B<WARNING>: do not use this function for data exchange as it can produce
+$string with not strict utf8 representation! For strictly valid UTF-8
+$string representation use C<$string = decode("UTF-8", $octets [, CHECK])>.
+
B<CAVEAT>: the input I<$octets> might be modified in-place depending on
what is set in CHECK. See L</LEAVE_SRC> if you want your inputs to be
left unchanged.
--
1.7.9.5
|
From @jkeenan
On Sun Sep 18 09:32:21 2016, pali@cpan.org wrote:
1. The Encode library is "cpan upstream," i.e., it is primarily maintained on CPAN. Hence, requests for changes in its documentation -- your patches 0008, 0009, 0010 -- should be filed via bug-Encode@rt.cpan.org or via the web interface at https://rt.cpan.org/Dist/Display.html?Name=Encode.
2. Because at least 7 different files are touched by the patches attached to this ticket, I think we should get multiple eyeballs on them. Paging our experts on Unicode and IO layers!
Thank you very much.
-- |
The RT System itself - Status changed from 'new' to 'open' |
From @pali
On Sunday 18 September 2016 16:27:48 James E Keenan via RT wrote:
Ok! Anyway, all changes are only to documentation sections so other
And Encode patches are there too as they are referenced by core perl pod |
From @tonycoz
On Mon Sep 19 09:27:54 2016, pali@cpan.org wrote:
0001:
@@ -280,7 +280,7 @@ Files opened without an encoding argument will be in UTF-8:
=head2 ℞ 18: Make all I/O and args default to utf8
Unfortunately this makes the examples no longer equivalent.
0003:
@@ -3764,8 +3764,8 @@ many elements these have. For that, use C<scalar @array> and C<scalar keys
=item __LINE__
This is just plain incorrect. Whether the length returned by bytes::length() is the UTF-8 encoded length depends on the internal encoding of the string:
$ perl -Mbytes -MEncode -le '$x = "\xA0"; print bytes::length $x; print length Encode::encode("UTF-8", $x)'
0004:
+C<decode('UTF-8', ...)> and C<encode('UTF-8', ...)>; see
"under" what? This would normally be "below" instead, I think.
0009:
A string of what is the issue. Maybe C< $characters > instead of
Tony |
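The point about patch 0003 can be sketched with a short script. This is an illustration assuming an ASCII-based platform; the helper name `utf8_encoded_length` is made up for the example and does not appear in the patches.

```perl
use strict;
use warnings;
use Encode ();
use bytes ();   # makes bytes::length available without enabling byte semantics

# Hypothetical helper: the number of bytes $string occupies once it is
# encoded as strict UTF-8, independent of how perl stores it internally.
sub utf8_encoded_length {
    my ($string) = @_;
    return length(Encode::encode('UTF-8', $string));
}

my $native   = "\xA0";       # one character, stored internally as one byte
my $upgraded = "\xA0";
utf8::upgrade($upgraded);    # same character, now stored internally as UTF-8

# bytes::length reports the size of the internal representation,
# so the two equal strings give different answers.
print bytes::length($native),   "\n";   # 1
print bytes::length($upgraded), "\n";   # 2

# The encoded length is the same for both.
print utf8_encoded_length($native),   "\n";   # 2
print utf8_encoded_length($upgraded), "\n";   # 2
```

Both variables compare equal with `eq`, which is why `bytes::length` is not a reliable substitute for encoding the string first.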
From @pali
On Monday 10 October 2016 22:02:51 Tony Cook via RT wrote:
When reading untrusted and unknown file, it is still better to use
Alright, my change is correct only for strings in perl's internal utf8
What about this change?
characters, not physical bytes. For how many bytes a string encoded as
Yes, below is correct here.
Right, from Encode https://metacpan.org/pod/Encode#TERMINOLOGY
Thank you for review! |
From @pali
On Tuesday 11 October 2016 09:45:39 pali@cpan.org wrote:
Err... no. In the Encode documentation, string is defined as:
So C< $string > is the correct name in this case. |
From @andk
Now the "also" in line 6 has lost the point of reference. -- |
From @pali
On Sunday 23 October 2016 09:16:49 Andreas Koenig wrote:
Alright! Anything else? |
From @pali
On Sunday 23 October 2016 09:17:28 (Andreas J. Koenig) via RT wrote:
Fixed, new version (v3) of that patch is attached. |
From @pali
v3-0003-pod-Do-not-suggest-to-use-Encode-encode_utf8-when-yo.patch
From 93f829f9e96d57c62e7523895763beac75e3b98d Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sun, 18 Sep 2016 17:21:54 +0200
Subject: [PATCH v3 03/10] pod: Do not suggest to use Encode::encode_utf8()
when you need to know the byte length of a string
Encode module could do some additional operations and bytes pragma is
supposed to do that job.
---
pod/perluniintro.pod | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 0e3f4bc..a5b7707 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -726,16 +726,13 @@ the output string will be UTF-8-encoded C<ab\x80c = \x{100}\n>, but
C<$a> will stay byte-encoded.
Sometimes you might really need to know the byte length of a string
-instead of the character length. For that use either the
-C<Encode::encode_utf8()> function or the C<bytes> pragma
+instead of the character length. For that use the C<bytes> pragma
and the C<length()> function:
my $unicode = chr(0x100);
print length($unicode), "\n"; # will print 1
- require Encode;
- print length(Encode::encode_utf8($unicode)),"\n"; # will print 2
use bytes;
- print length($unicode), "\n"; # will also print 2
+ print length($unicode), "\n"; # will print 2
# (the 0xC4 0x80 of the UTF-8)
no bytes;
--
1.7.9.5
|
From @pali
On Saturday 22 October 2016 12:13:41 pali@cpan.org wrote:
On Friday 11 November 2016 11:30:41 pali@cpan.org wrote:
@jkeenan, @Tony, @andreas: Is it OK now, or are other modifications needed? |
From @Leont
On Thu, Nov 17, 2016 at 2:46 PM, <pali@cpan.org> wrote:
I've recently been working on making :utf8 a lot safer, which does
Leon |
From @pali
On Friday 18 November 2016 00:52:22 Leon Timmermans wrote:
Changes for :utf8 are just in first patch. Other nine patches updates
If you are doing changes only to :utf8 layer, I think that other nine
Can you try to apply my patches on top of your changes? At least we |
From @pali
On Thursday 17 November 2016 18:46:30 pali@cpan.org wrote:
Hi! Another two months passed. Anything more needed? Or you can |
From @khwilliamson
On Sat, 14 Jan 2017 03:26:11 -0800, pali@cpan.org wrote:
The main reason these haven't been applied is because the advice given to use :encoding(utf8) may be obsolete in 5.26, if the safe version of :utf8 makes it into 5.26. Next week is the deadline for that. But until then, it is premature to push these patches. |
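For reference, the layer advice under discussion looks roughly like this in practice. This is a minimal sketch rather than text from the patches; the temporary file and the variable names are assumptions made for the example, and it uses only core modules.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demonstration; removed automatically at exit.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write one character above U+00FF through the strict encoding layer.
open(my $out, '>:encoding(UTF-8)', $path)
    or die "Can't open $path for writing: $!\n";
print {$out} chr(0x100), "\n";
close $out or die "Can't close $path: $!\n";

# Look at what actually landed on disk, without any decoding.
open(my $raw, '<:raw', $path)
    or die "Can't open $path for raw reading: $!\n";
my $bytes = do { local $/; <$raw> };
close $raw;

# Read it back through the same strict layer to get characters again.
open(my $in, '<:encoding(UTF-8)', $path)
    or die "Can't open $path for reading: $!\n";
my $line = <$in>;
close $in;
chomp $line;

printf "first bytes on disk: %vX\n", substr($bytes, 0, 2);
printf "characters read back: %d (U+%04X)\n", length($line), ord($line);
```

The bare `:utf8` layer would read the same file without validating the bytes, which is the behaviour the first patch stops recommending for input.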
@khwilliamson - Status changed from 'open' to 'pending release' |
From @khwilliamson
Thanks. All the patches relevant to core have now been applied |
From @khwilliamson
Thank you for filing this report. You have helped make Perl better.
With the release today of Perl 5.26.0, this and 210 other issues have been
Perl 5.26.0 may be downloaded via:
If you find that the problem persists, feel free to reopen this ticket. |
@khwilliamson - Status changed from 'pending release' to 'resolved' |
Migrated from rt.perl.org#129298 (status was 'resolved')
Searchable as RT129298