Rename utf8::is_utf8() (and other functions) #16060
From @pali
Hi! This is a continuation of the original discussion about renaming
The problem is that several Perl modules use this incorrect code:
use utf8; my $value = func();
In most cases module developers think that utf8::is_utf8() returns true
The reason for this is the poor name of the function utf8::is_utf8() and also poor
The functions utf8::is_utf8(), utf8::upgrade() and utf8::downgrade() change
I'm proposing the following rename of functions:
utf8::is_utf8() --> Internals::uses_string_wide_storage()
Plus adding backward-compatible aliases so that existing code keeps working like
As all those functions should be used only for debugging purposes (e.g.
I'm attaching patches which:
* Add new warning category 'experimental::internal' |
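The misunderstanding described in this report can be illustrated with a minimal sketch. This is an editorial illustration rather than code from the original message, and the variable names are made up. It shows that utf8::is_utf8() reports only which internal storage a scalar currently uses: two strings holding the same characters can give different answers.

```perl
use strict;
use warnings;

# A Latin-1 string literal: kept in native 8-bit storage.
my $native = "caf\xE9";

# Same characters, but forced into the wide (UTF-8) internal storage.
my $wide = $native;
utf8::upgrade($wide);

# The internal flag differs between the two scalars...
my $native_flag = utf8::is_utf8($native) ? 1 : 0;
my $wide_flag   = utf8::is_utf8($wide)   ? 1 : 0;

# ...yet both hold the same sequence of four characters.
my $same = ($native eq $wide) ? 1 : 0;

print "native flag: $native_flag\n";          # 0
print "wide flag:   $wide_flag\n";            # 1
print "equal:       $same\n";                 # 1
print "length:      ", length($wide), "\n";   # 4
```

Because the flag can differ for equal strings, it cannot be used to tell "text" from "binary data" or to decide whether decoding is still needed.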
From @pali
0001-Add-new-warning-category-experimental-internal.patch
From c7b1fcfd26a2500662a10e345691eda3f3f32039 Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sat, 1 Jul 2017 17:33:45 +0200
Subject: [PATCH 1/3] Add new warning category experimental::internal
This category is for internal perl functions which should not be used in
normal perl code, unless dealing with perl internals.
---
lib/warnings.pm | 19 +++++++++++++------
regen/warnings.pl | 2 ++
warnings.h | 4 ++++
3 files changed, 19 insertions(+), 6 deletions(-)
diff --git a/lib/warnings.pm b/lib/warnings.pm
index 2ae1bb4..7b27e4a 100644
--- a/lib/warnings.pm
+++ b/lib/warnings.pm
@@ -96,10 +96,13 @@ our %Offsets = (
# Warnings Categories added in Perl 5.025
'experimental::declared_refs' => 132,
+
+ # Warnings Categories added in Perl 5.028
+ 'experimental::internal' => 134,
);
our %Bits = (
- 'all' => "\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x15", # [0..66]
+ 'all' => "\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55", # [0..67]
'ambiguous' => "\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [29]
'bareword' => "\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [30]
'closed' => "\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [6]
@@ -109,10 +112,11 @@ our %Bits = (
'digit' => "\x00\x00\x00\x00\x00\x00\x00\x40\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [31]
'exec' => "\x00\x40\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [7]
'exiting' => "\x40\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [3]
- 'experimental' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x40\x55\x51\x15\x10", # [51..56,58..62,66]
+ 'experimental' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x40\x55\x51\x15\x50", # [51..56,58..62,66,67]
'experimental::bitwise' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00", # [58]
'experimental::const_attr' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x40\x00\x00", # [59]
'experimental::declared_refs' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10", # [66]
+ 'experimental::internal' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x40", # [67]
'experimental::lexical_subs' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00", # [52]
'experimental::postderef' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x40\x00\x00\x00", # [55]
'experimental::re_strict' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00", # [60]
@@ -169,7 +173,7 @@ our %Bits = (
);
our %DeadBits = (
- 'all' => "\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\x2a", # [0..66]
+ 'all' => "\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa", # [0..67]
'ambiguous' => "\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [29]
'bareword' => "\x00\x00\x00\x00\x00\x00\x00\x20\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [30]
'closed' => "\x00\x20\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [6]
@@ -179,10 +183,11 @@ our %DeadBits = (
'digit' => "\x00\x00\x00\x00\x00\x00\x00\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [31]
'exec' => "\x00\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [7]
'exiting' => "\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [3]
- 'experimental' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\xaa\xa2\x2a\x20", # [51..56,58..62,66]
+ 'experimental' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\xaa\xa2\x2a\xa0", # [51..56,58..62,66,67]
'experimental::bitwise' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x20\x00\x00", # [58]
'experimental::const_attr' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\x00\x00", # [59]
'experimental::declared_refs' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x20", # [66]
+ 'experimental::internal' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80", # [67]
'experimental::lexical_subs' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00", # [52]
'experimental::postderef' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\x00\x00\x00", # [55]
'experimental::re_strict' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00", # [60]
@@ -240,8 +245,8 @@ our %DeadBits = (
# These are used by various things, including our own tests
our $NONE = "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0";
-our $DEFAULT = "\x10\x01\x00\x00\x00\x50\x04\x00\x00\x00\x00\x00\x00\x55\x51\x55\x10", # [2,4,22,23,25,52..56,58..63,66]
-our $LAST_BIT = 134 ;
+our $DEFAULT = "\x10\x01\x00\x00\x00\x50\x04\x00\x00\x00\x00\x00\x00\x55\x51\x55\x50", # [2,4,22,23,25,52..56,58..63,66,67]
+our $LAST_BIT = 136 ;
our $BYTES = 17 ;
our $All = "" ; vec($All, $Offsets{'all'}, 2) = 3 ;
@@ -732,6 +737,8 @@ The current hierarchy is:
| |
| +- experimental::declared_refs
| |
+ | +- experimental::internal
+ | |
| +- experimental::lexical_subs
| |
| +- experimental::postderef
diff --git a/regen/warnings.pl b/regen/warnings.pl
index 5721c17..36ce14b 100644
--- a/regen/warnings.pl
+++ b/regen/warnings.pl
@@ -107,6 +107,8 @@ my $tree = {
[ 5.021, DEFAULT_ON ],
'experimental::declared_refs' =>
[ 5.025, DEFAULT_ON ],
+ 'experimental::internal' =>
+ [ 5.028, DEFAULT_ON ],
}],
'missing' => [ 5.021, DEFAULT_OFF],
diff --git a/warnings.h b/warnings.h
index 0166837..72e27a2 100644
--- a/warnings.h
+++ b/warnings.h
@@ -115,6 +115,10 @@
#define WARN_EXPERIMENTAL__DECLARED_REFS 66
+/* Warnings Categories added in Perl 5.028 */
+
+#define WARN_EXPERIMENTAL__INTERNAL 67
+
#define WARNsize 17
#define WARN_ALLstring "\125\125\125\125\125\125\125\125\125\125\125\125\125\125\125\125\125"
#define WARN_NONEstring "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
--
1.7.9.5
|
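The hard-coded bit strings in this patch follow from the category's index. As a rough sketch (this is not the actual regen/warnings.pl logic, and the variable names are illustrative), assuming each category occupies two adjacent bits, one in %Bits and the next one in %DeadBits, index 67 yields offset 134 and the last-byte values seen in the diff:

```perl
use strict;
use warnings;

my $index  = 67;           # WARN_EXPERIMENTAL__INTERNAL in the patch
my $offset = 2 * $index;   # the %Offsets entry: 134
my $bytes  = 17;           # $BYTES in warnings.pm, WARNsize in warnings.h

# Bit for the category itself (the %Bits entry).
my $bits = "\0" x $bytes;
vec($bits, $offset, 1) = 1;

# The following bit (the %DeadBits entry).
my $dead = "\0" x $bytes;
vec($dead, $offset + 1, 1) = 1;

printf "offset:             %d\n", $offset;                         # 134
printf "Bits last byte:     0x%02x\n", ord(substr($bits, -1));      # 0x40
printf "DeadBits last byte: 0x%02x\n", ord(substr($dead, -1));      # 0x80
```

With 68 categories (indices 0..67) the mask needs 136 bits, which is why $LAST_BIT moves to 136 while $BYTES stays at 17.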
From @pali
0002-Mark-functions-utf8-is_utf8-utf8-upgrade-utf8-downgr.patch
From d763a8a4b85b53ebc5b05ba1b0a64daf9df6c2e2 Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sat, 1 Jul 2017 17:41:15 +0200
Subject: [PATCH 2/3] Mark functions utf8::is_utf8(), utf8::upgrade(),
utf8::downgrade() as Internal
Move all those functions into Internals namespace, throw new warning
experimental::internal warning when used and provide backward compatible
deprecated aliases (for make existing code still work).
In most cases all those functions are incorrectly used due to poor names
and not proper documentation. Those functions are internal and should not
be used unless debugging perl or dealing with broken XS modules.
---
universal.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 55 insertions(+), 10 deletions(-)
diff --git a/universal.c b/universal.c
index be39310..20f1d53 100644
--- a/universal.c
+++ b/universal.c
@@ -422,10 +422,24 @@ XS(XS_UNIVERSAL_DOES)
}
}
-XS(XS_utf8_is_utf8); /* prototype to pass -Wmissing-prototypes */
-XS(XS_utf8_is_utf8)
+XS(XS_Internals_uses_string_wide_storage); /* prototype to pass -Wmissing-prototypes */
+XS(XS_Internals_uses_string_wide_storage)
{
- dXSARGS;
+ dXSARGS;
+ const GV *const gv = CvGV(cv);
+ const HV *const stash = gv ? GvSTASH(gv) : NULL;
+ const char *const hvname = stash ? HvNAME(stash) : NULL;
+
+ if (hvname && strcmp(hvname, "utf8") == 0) {
+ Perl_ck_warner_d(aTHX_
+ packWARN(WARN_DEPRECATED),
+ "utf8::is_utf8() is internal and deprecated function, look into perldoc utf8");
+ } else {
+ Perl_ck_warner_d(aTHX_
+ packWARN(WARN_EXPERIMENTAL__INTERNAL),
+ "Internals::uses_string_wide_storage() is experimental internal perl function");
+ }
+
if (items != 1)
croak_xs_usage(cv, "sv");
else {
@@ -485,10 +499,24 @@ XS(XS_utf8_decode)
XSRETURN(1);
}
-XS(XS_utf8_upgrade); /* prototype to pass -Wmissing-prototypes */
-XS(XS_utf8_upgrade)
+XS(XS_Internals_upgrade_string_to_wide_storage); /* prototype to pass -Wmissing-prototypes */
+XS(XS_Internals_upgrade_string_to_wide_storage)
{
dXSARGS;
+ const GV *const gv = CvGV(cv);
+ const HV *const stash = gv ? GvSTASH(gv) : NULL;
+ const char *const hvname = stash ? HvNAME(stash) : NULL;
+
+ if (hvname && strcmp(hvname, "utf8") == 0) {
+ Perl_ck_warner_d(aTHX_
+ packWARN(WARN_DEPRECATED),
+ "utf8::upgrade() is internal and deprecated function, look into perldoc utf8");
+ } else {
+ Perl_ck_warner_d(aTHX_
+ packWARN(WARN_EXPERIMENTAL__INTERNAL),
+ "Internals::upgrade_string_to_wide_storage() is experimental internal perl function");
+ }
+
if (items != 1)
croak_xs_usage(cv, "sv");
else {
@@ -502,10 +530,24 @@ XS(XS_utf8_upgrade)
XSRETURN(1);
}
-XS(XS_utf8_downgrade); /* prototype to pass -Wmissing-prototypes */
-XS(XS_utf8_downgrade)
+XS(XS_Internals_downgrade_string_from_wide_storage); /* prototype to pass -Wmissing-prototypes */
+XS(XS_Internals_downgrade_string_from_wide_storage)
{
dXSARGS;
+ const GV *const gv = CvGV(cv);
+ const HV *const stash = gv ? GvSTASH(gv) : NULL;
+ const char *const hvname = stash ? HvNAME(stash) : NULL;
+
+ if (hvname && strcmp(hvname, "utf8") == 0) {
+ Perl_ck_warner_d(aTHX_
+ packWARN(WARN_DEPRECATED),
+ "utf8::downgrade() is internal and deprecated function, look into perldoc utf8");
+ } else {
+ Perl_ck_warner_d(aTHX_
+ packWARN(WARN_EXPERIMENTAL__INTERNAL),
+ "Internals::downgrade_string_from_wide_storage() is experimental internal function");
+ }
+
if (items < 1 || items > 2)
croak_xs_usage(cv, "sv, failok=0");
else {
@@ -1000,14 +1042,17 @@ static const struct xsub_details details[] = {
#define VXS_XSUB_DETAILS
#include "vxs.inc"
#undef VXS_XSUB_DETAILS
- {"utf8::is_utf8", XS_utf8_is_utf8, NULL},
+ {"utf8::is_utf8", XS_Internals_uses_string_wide_storage, NULL},
{"utf8::valid", XS_utf8_valid, NULL},
{"utf8::encode", XS_utf8_encode, NULL},
{"utf8::decode", XS_utf8_decode, NULL},
- {"utf8::upgrade", XS_utf8_upgrade, NULL},
- {"utf8::downgrade", XS_utf8_downgrade, NULL},
+ {"utf8::upgrade", XS_Internals_upgrade_string_to_wide_storage, NULL},
+ {"utf8::downgrade", XS_Internals_downgrade_string_from_wide_storage, NULL},
{"utf8::native_to_unicode", XS_utf8_native_to_unicode, NULL},
{"utf8::unicode_to_native", XS_utf8_unicode_to_native, NULL},
+ {"Internals::uses_string_wide_storage", XS_Internals_uses_string_wide_storage, NULL},
+ {"Internals::upgrade_string_to_wide_storage", XS_Internals_upgrade_string_to_wide_storage, NULL},
+ {"Internals::downgrade_string_from_wide_storage", XS_Internals_downgrade_string_from_wide_storage, NULL},
{"Internals::SvREADONLY", XS_Internals_SvREADONLY, "\\[$%@];$"},
{"Internals::SvREFCNT", XS_Internals_SvREFCNT, "\\[$%@];$"},
{"Internals::hv_clear_placeholders", XS_Internals_hv_clear_placehold, "\\%"},
--
1.7.9.5
|
From @pali
0003-Update-documentation-in-perldoc-utf8.patch
From e5b0bbd18075ea178708f5da32beee3570751f0e Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sat, 1 Jul 2017 17:46:05 +0200
Subject: [PATCH 3/3] Update documentation in perldoc utf8
Add information about new internal functions and update documentation for
wide string storage functions.
---
lib/utf8.pm | 93 ++++++++++++++++++++++++++++++++++-------------------------
1 file changed, 54 insertions(+), 39 deletions(-)
diff --git a/lib/utf8.pm b/lib/utf8.pm
index 324cb87..84a96ae 100644
--- a/lib/utf8.pm
+++ b/lib/utf8.pm
@@ -31,14 +31,8 @@ utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code
use utf8;
no utf8;
- # Convert the internal representation of a Perl scalar to/from UTF-8.
-
- $num_octets = utf8::upgrade($string);
- $success = utf8::downgrade($string[, $fail_ok]);
-
# Change each character of a Perl scalar to/from a series of
# characters that represent the UTF-8 bytes of each original character.
-
utf8::encode($string); # "\x{100}" becomes "\xc4\x80"
utf8::decode($string); # "\xc4\x80" becomes "\x{100}"
@@ -51,7 +45,6 @@ utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code
# platforms; 193 on
# EBCDIC
- $flag = utf8::is_utf8($string); # since Perl 5.8.1
$flag = utf8::valid($string);
=head1 DESCRIPTION
@@ -105,39 +98,46 @@ you should not say that unless you really want to have UTF-8 source code.
=item * C<$num_octets = utf8::upgrade($string)>
-(Since Perl v5.8.0)
-Converts in-place the internal representation of the string from an octet
-sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The
-logical character sequence itself is unchanged. If I<$string> is already
-stored as UTF-8, then this is a no-op. Returns the
-number of octets necessary to represent the string as UTF-8. Can be
-used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
-work as Unicode on strings containing non-ASCII characters whose code points
-are below 256.
+[INTERNAL] (Since Perl v5.8.0) Deprecated compatibility-supporting alias of
+C<Internals::upgrade_string_to_wide_storage>
-B<Note that this function does not handle arbitrary encodings>;
-use L<Encode> instead.
+=item * C<$num_octets = Internals::upgrade_string_to_wide_storage($string)>
-=item * C<$success = utf8::downgrade($string[, $fail_ok])>
+[INTERNAL] (Since Perl v5.28.0)
-(Since Perl v5.8.0)
-Converts in-place the internal representation of the string from
-UTF-8 to the equivalent octet sequence in the native encoding (Latin-1
-or EBCDIC). The logical character sequence itself is unchanged. If
-I<$string> is already stored as native 8 bit, then this is a no-op. Can
-be used to
-make sure that the UTF-8 flag is off, e.g. when you want to make sure
-that the substr() or length() function works with the usually faster
-byte algorithm.
-
-Fails if the original UTF-8 sequence cannot be represented in the
-native 8 bit encoding. On failure dies or, if the value of I<$fail_ok> is
-true, returns false.
+Converts in-place the internal representation of the string to wide storage
+(which can store characters above U+0000FF). The logical character sequence
+itself is unchanged. If I<$string> is already stored in wide storage then
+this is a no-op. Returns the number of bytes necessary to represent the
+string in wide storage.
+
+Internal string storage is invisible to pure Perl code, and perl itself calls
+this function automatically when needed. Therefore there is no reason to call
+this function unless you are debugging internal perl C or XS code.
+
+=item * C<$num_octets = Internals::downgrade_string_from_wide_storage($string[, $fail_ok])>
+
+[INTERNAL] (Since Perl v5.28.0)
+
+Converts in-place the internal representation of the string from wide storage
+(which can store characters above U+0000FF) to small non-wide 8 bit storage
+(which can store only 8 bit characters). The logical character sequence
+itself is unchanged. If I<$string> is already stored in non-wide 8 bit storage,
+then this is a no-op.
+
+Fails if the original I<$string> cannot be represented in the native 8 bit
+encoding. On failure dies or, if the value of I<$fail_ok> is true, returns false.
Returns true on success.
-B<Note that this function does not handle arbitrary encodings>;
-use L<Encode> instead.
+Internal string storage is invisible to pure Perl code, and perl itself calls
+this function automatically when needed. Therefore there is no reason to call
+this function unless you are debugging internal perl C or XS code.
+
+=item * C<$success = utf8::downgrade($string[, $fail_ok])>
+
+[INTERNAL] (Since Perl v5.8.0) Deprecated compatibility-supporting alias of
+C<Internals::downgrade_string_from_wide_storage>
=item * C<utf8::encode($string)>
@@ -207,17 +207,32 @@ platforms, so there is no performance hit in using it there.
=item * C<$flag = utf8::is_utf8($string)>
-(Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in
-UTF-8. Functionally the same as C<Encode::is_utf8()>.
+[INTERNAL] (Since Perl v5.8.1) Deprecated compatibility-supporting (but poorly
+named) alias of C<Internals::uses_string_wide_storage()>. It does B<not> check
+if string is encoded in UTF-8.
+
+=item * C<$flag = Internals::uses_string_wide_storage($string)>
+
+[INTERNAL] (Since Perl v5.28.0)
+
+Test whether C<$string>'s internal representation storage is wide (which can
+store characters above U+0000FF). Note that C<$string> can, but does not have
+to contain wide characters. It bears no impact on whether that string is
+actually utf8 or not.
+
+Internal string storage is invisible to pure Perl code, and perl itself
+changes the storage automatically when needed. This function should not be used
+unless you are debugging internal perl C or XS code.
=item * C<$flag = utf8::valid($string)>
-[INTERNAL] Test whether I<$string> is in a consistent state regarding
+[INTERNAL] (Since Perl v5.8.0)
+
+Test whether I<$string> is in a consistent state regarding
UTF-8. Will return true if it is well-formed UTF-8 and has the UTF-8 flag
on B<or> if I<$string> is held as bytes (both these states are 'consistent').
Main reason for this routine is to allow Perl's test suite to check
-that operations have left strings in a consistent state. You most
-probably want to use C<utf8::is_utf8()> instead.
+that operations have left strings in a consistent state.
=back
--
1.7.9.5
|
From @Tux
On Sat, 01 Jul 2017 09:03:18 -0700, (via RT)
I am still objecting, as this will also break code that uses those
As these are not XS, Devel::PPPort won't help (assuming authors use
I'd be loath to change/fix every occurrence of code that uses any of these
Then why add new functions in the first place?
No, please. Most correct uses will be in dark distant corners, hidden
-- |
The RT System itself - Status changed from 'new' to 'open' |
From @Leont
On Sat, Jul 1, 2017 at 6:03 PM, via RT <perlbug-followup@perl.org> wrote:
I don't see how this is an option. I'll grant you that something like this
Leon |
From @pali
On Saturday 01 July 2017 19:13:30 you wrote:
Hm? What do you mean by "break"? Existing functions would still work, |
From @pali
On Saturday 01 July 2017 18:54:24 you wrote:
From the discussion it was clear that the current name utf8::is_utf8() is poor |
From @Leont
On Sat, Jul 1, 2017 at 7:45 PM, <pali@cpan.org> wrote:
Then I misunderstood your proposal, "rename" suggested to me that the old
Leon |
From @xsawyerx
On 07/01/2017 01:52 PM, Leon Timmermans wrote:
You could support it with Devel::PPPort. It's a simple addition.
However, the problem remains that if someone were to use these new |
From @tonycoz
On Mon, Jul 03, 2017 at 01:03:37PM -0400, Sawyer X wrote:
These are perl functions (as documented in utf8.pm), not C functions,
The patch retains the old names, so that isn't an issue.
But it does deprecate the old names, which is an issue, I can't
As a side note, the original thread refers to:
https://metacpan.org/source/SHAY/perl-5.24.1/cpan/Archive-Tar/lib/Archive/Tar.pm#L1501
which I could see as correct because of the way perl's unicode support
Tony |
From @Grinnz
On Mon, Jul 3, 2017 at 8:38 PM, Tony Cook <tony@develop-help.com> wrote:
Not entirely correct IMO. If the intent is that filenames be encoded to
-Dan |
From @tonycoz
On Mon, Jul 03, 2017 at 09:35:06PM -0400, Dan Book wrote:
If the caller creates a file using the name they pass in, encoding the
Perl functions such as open and stat currently ignore the UTF-8
The code in Archive::Tar seems a reasonable workaround to me, I don't
Tony |
From @pali
On Monday 03 July 2017 21:35:06 Dan Book wrote:
See bug: https://rt.perl.org/Public/Bug/Display.html?id=130831 |
From @pali
On Tuesday 04 July 2017 10:38:26 Tony Cook wrote:
The warning can be removed from the patch. It is just a question of how you decide.
And for old code this function can be defined easily:
*new_name = *old_name;
The reason for this patch series is: |
From @demerphq
On 4 July 2017 at 09:19, <pali@cpan.org> wrote:
I don't mind adding new aliases for these functions, I object to your
scalar::is_unicode_string()
I don't like the wide-storage thing, (although I admit I think it
cheers, Yves -- |
From @pali
On Tuesday 04 July 2017 01:52:29 yves orton via RT wrote:
I proposed Internals, because that flag is internal for perl and
But this is wrong! SVf_UTF8 does not tell if a scalar string is unicode
The name is_binary_string is misleading in the same way as the current name is_utf8.
If you say that a binary string is one with codes only in the range 0x00-0xFF
Of course it can. Unicode code points 0x80 .. 0xFF (which are Latin1
|
From @demerphq
On 4 July 2017 at 11:03, <pali@cpan.org> wrote:
No. This is a myth. Plain and simply a myth.
People have a hard time accepting it, but the utf8 flag tells parts of
You can see the difference in the following:
"ba\x{DF}"=~/ss/i;
"ba\N{U+DF}"=~/ss/i;
The latter matches because \N{U+DF} produces the unicode code point
Erf, maybe. We need a term for "not-unicode", and "binary" is as good
The SVf_UTF8 flag being off means the string should be treated as
I spoke imprecisely, I should have said ASCII, not latin-1.
cheers, -- |
From @pali
On Tuesday 04 July 2017 11:22:42 demerphq wrote:
$ perl -E 'say "matched" if "ba\x{DF}"=~/ss/i;'
$ perl -E 'say "matched" if "ba\N{U+DF}"=~/ss/i;'
No, both were matched under Perl 5.24.1. |
From @demerphq
On 4 July 2017 at 12:04, <pali@cpan.org> wrote:
-E is not -e. -E is enabling a pragma which changes the default behavior.
However it is *PRAGMA*. It is NOT the normal behavior of Perl.
No, they did not. If \x{DF} magically started matching 'ss' it would
cheers, -- |
From @pali
On Tuesday 04 July 2017 03:12:19 yves orton via RT wrote:
Ah, right. I forgot that -E enables feature unicode_strings which
Default behavior is a bit unpredictable as it is affected by the
my $str1 = "\x{DF}";
"ba$str1" =~ /ss/i;
"ba$str1$str3" =~ /ss/i;
To make it predictable either /aa or /u modifiers should be already
"ba$str1" =~ /ss/aai;
"ba$str1" =~ /ss/ui; |
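The behaviour debated in the last few messages can be checked with a short script. This is an editorial sketch, assuming a perl of roughly 5.14 or later (for the /u and /aa modifiers) and no `use feature 'unicode_strings'` or `use locale` in scope; the variable and hash key names are illustrative.

```perl
use strict;
use warnings;

my $native = "ba\xDF";     # U+00DF held in native 8-bit storage
my $wide   = $native;
utf8::upgrade($wide);      # same characters, wide (UTF-8) storage

my %matched = (
    # Default rules: the result depends on the internal storage.
    'native /i'   => ($native =~ /ss/i   ? 1 : 0),   # 0
    'wide /i'     => ($wide   =~ /ss/i   ? 1 : 0),   # 1
    # /u forces Unicode rules regardless of storage.
    'native /iu'  => ($native =~ /ss/iu  ? 1 : 0),   # 1
    'wide /iu'    => ($wide   =~ /ss/iu  ? 1 : 0),   # 1
    # /aa forbids ASCII "ss" from matching the non-ASCII sharp s.
    'native /iaa' => ($native =~ /ss/iaa ? 1 : 0),   # 0
    'wide /iaa'   => ($wide   =~ /ss/iaa ? 1 : 0),   # 0
);

printf "%-12s %s\n", $_, $matched{$_} ? "match" : "no match"
    for sort keys %matched;
```

Only the first pair gives different answers for equal strings, which is the storage dependence being discussed; an explicit /u or /aa removes it.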
From @demerphq
On 4 July 2017 at 13:14, <pali@cpan.org> wrote:
It is only unpredictable if your model of strings is broken. I happen
cheers, -- |
From @pali
On Tuesday 04 July 2017 13:32:26 demerphq wrote:
I do not know what you mean by a broken model of strings, but once you
I think this discussion has drifted from the original request, which is for better |
From @xsawyerx
On 07/04/2017 07:38 AM, pali@cpan.org wrote:
It is "broken" in that sense for probably more people than we would
Agree. For now we seem to have two points we agree on:
As long as the second clause does not break the third, I think we should
Yves mentioned that the "Internals" namespace is an undesirable place for it
Does anyone have any comments on this? Tony, Dave, Zefram? *Karl*? :)
Thanks! |
From zefram@fysh.org
demerphq wrote:
Those are bugs. In some cases they are bugs that we've decided we can't
If a flag to distinguish between character strings and binary strings
-zefram |
From zefram@fysh.org
Sawyer X wrote:
I didn't want to add to a mostly bikeshedding discussion, but OK.
In fact you can see all my preferred names in my CPAN module
I think the names for these functions should be reasonably concise,
I don't have any strong opinion about which package any new names for
-zefram |
From @khwilliamson
On 07/10/2017 02:13 PM, Zefram wrote:
My view is that the current names could be improved, and that there
I don't know what namespace is best. At first blush Internals seems
$foo & ""
which trying to find out if $foo is a string or just a number. I don't
I have never liked upgrade and downgrade. When you upgrade something |
From @cpansprout
On Mon, 10 Jul 2017 19:53:42 -0700, public@khwilliamson.com wrote:
Adding new public functions to the Internals namespace would completely change its meaning. It contains functions that exist mainly for perl’s own functionality (for built-in modules like Hash::Util to use) and for testing perl itself. Users are not supposed to know about them. That the cat is out of the bag and we cannot remove them is unfortunate. Since we already use ‘utf8’ to refer to Perl’s Unicode support, why not continue to use that namespace? -- Father Chrysostomos |
From @cpansprout
On Mon, 10 Jul 2017 19:53:42 -0700, public@khwilliamson.com wrote:
Well, er, that is exactly what you get. You can stretch your legs beyond CLV.*
I think that is one of the best arguments in favour of ‘upgrade’. It is just like upgrading most commercial software!
* That is a Roman numeral. -- Father Chrysostomos |
From @Tux
On Tue, 11 Jul 2017 10:41:37 -0700, "Father Chrysostomos via RT"
Count me in: three. I like the way Dave has written down my feelings :) -- |
From @khwilliamson
On 07/12/2017 12:36 AM, H.Merijn Brand wrote:
I guess we have a fundamental disagreement about language design and the
The point of adding synonyms for deceptively-named functions and macros
Unless Perl is close to death, the number of people who are going to
Specifically about av_top_index, I don't believe that it is so poorly
It came about not because of AvFILL, but because of the already-existing
Using av_len is a bug waiting to happen. It is a foreseeable problem.
Writing code using deceptively named things or with poor API's is slower
Code reviews also are affected. It is just too easy to read the thing
In researching the issue back when av_top_index was created, I found
But all this could be avoided by the code using a non-deceptive name.
It is foreseeable that av_len is going to cause problems. It would be
No one was really happy with "av_top_index" as a name. So AvFILL was
Writing good APIs is hard. I have flattered myself at times into
From @Tux
On Wed, 12 Jul 2017 22:53:57 -0600, Karl Williamson
The problem with av_top_index is that it has not (yet) been ported to
$ ack av_top_index
I know I didn't quote all of your message and I understand your
For the utf8 functions, the scope is WAY wider: it is used from
To be honest, I do not see an easy way out of that dilemma. If you have
-- |
From @pali
On Wednesday 12 July 2017 23:44:39 H. Merijn Brand via RT wrote:
Devel::PPPort is probably unmaintained... It has a couple of open bugs
Because of those problems, I have no motivation to prepare any other
The problem is that people very often use the construct which I wrote in the first
And all this happens just because of the wrong name, from which can be
If we did not add better aliases, then broken code would still be
As utf8::is_utf8() is not needed too often, backward compatibility can
*NEW_NAME = *utf8::is_utf8;
I think this is a good compromise. If you think that upgrade and |
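The glob-assignment idea mentioned above can be sketched as follows. This is an illustration only: the package My::Compat and the name uses_wide_storage are hypothetical and not part of the proposal. It also aliases just the code slot (`\&`) rather than the whole glob, which is a slightly narrower variant of `*NEW_NAME = *utf8::is_utf8;`.

```perl
use strict;
use warnings;

package My::Compat;   # hypothetical helper package

# Install a more descriptive (hypothetical) name for the existing
# built-in. Both names then refer to the very same function.
*uses_wide_storage = \&utf8::is_utf8;

package main;

my $string = "\x{100}";   # a code point above 0xFF needs wide storage
my $flag   = My::Compat::uses_wide_storage($string) ? 1 : 0;

print "wide storage: $flag\n";   # 1
```

Because utf8::is_utf8() is defined by the interpreter itself, code like this would run unchanged on perls that never gain a new name.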
From @cpansprout
On Wed, 12 Jul 2017 21:55:03 -0700, public@khwilliamson.com wrote:
I agree the disagreement is unfortunate.
When you first put forward this argument (specifically with regard to av_len), it made sense to me, and I had no objection to it. Later, people wrote to p5p complaining that the new situation was more confusing; in addition, *I* started to get confused. That was when I started to have second thoughts. I think Damian Conway was right when he wrote in PBP that one should not use English (the module). Since other people use punctuation variables, you are going to have to learn them anyway. Using the English names just forces others reading your code to look up the names that you are using. It just creates more cognitive burden. I think the same applies even to poorly named functions. You just have to learn the gotcha once, and then you can use the function and read code that uses it. (And if you use functions without reading either the documentation or the source, then you are coming close to what I would call autopodotoxy.)
But the gotchas never get removed. You just end up with a larger pile of functions for people to sift through. Not to mention a lot of existing (and correct) code that they cannot read without learning the discouraged parlance. So they have to learn the different forms anyway. My personal experience is that what you are arguing for, while it sounds good, does not work in practice. -- Father Chrysostomos |
From @demerphq
On 14 July 2017 at 04:28, Father Chrysostomos via RT
I think the reason that it sounds good is because it does make sense
But with something like Perl we can't just get rid of things, if we
Despite this I think sometimes these things *can* be justified and
yves -- |
From @xsawyerx
[Top-posted]
I have mixed thoughts about this.
I'm sympathetic to both considerations: Having properly-named functions
A few ways to make such a situation easier:
* Document utf8::is_utf8() to prevent this confusion:
This is by far the
(Since Perl 5.8.1) Test whether $string is marked internally as
This is confusing, to say the least. "Marked internally" is the words
[INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/.
As of Perl 5.8.1, utf8 <https://metacpan.org/pod/utf8> also has the
I like this wording better for several reasons: It is under the title
If the document on both was better, then we could have possibly left
* Provide different functions and document all functions in all other
If we decide to have better named functions, we will have additional
I mix those when it comes to English.pm. I use
One additional point about English is that, unlike what we're suggesting
I see value in adding proper names, but then we would need to take care
* Move all known usages in core to new functions
Another way to improve this new cognitive load is by reducing it in the
* Automated policies for improving CPAN code quality
This is beyond the scope of core, but I think it's worthwhile taking
Still, it is worthwhile keeping in mind.
Overall, I'm still undecided. Maybe we could start with improving the
[1] Using "is_upgraded" as an example different name.
On 07/13/2017 06:53 AM, Karl Williamson wrote:
|
From @hvds
On Mon, 17 Jul 2017 01:47:32 -0700, xsawyerx@gmail.com wrote:
Me too.
I just want to reach a bit deeper into "likely this will also happen early" - in my limited experience, most people working in a perl shop tend to read lots of code in their local codebase, but rarely read code outside of it (not even for the CPAN modules they're using). So wherever the local codebase gets cleaned up, it may not happen that early for a good proportion of new developers. Maybe you had in mind primarily historical threads googled up from perlmonks, stackoverflow and the like; I agree those are less likely to get cleaned up. I cannot guess what proportion of newish developers would hit those; I can't even guess what proportion of me would hit those had I been born 30 years later. Hugo |
From @tonycoz
On Mon, Jul 17, 2017 at 10:46:59AM +0200, Sawyer X wrote:
utf8::is_utf8() doesn't accept the second parameter and does no
Perhaps something like:
=item * C<$flag = utf8::is_utf8($string)>
(Since Perl 5.8.1) Test whether I<$string> is marked internally as
If you need to force Unicode semantics for code that needs to be
Using this flag to decide whether a string should be treated as
If you're accepting bytes:
utf8::downgrade($string); # throws an exception if code point over 0xFF
utf8::downgrade($string, 1) # our own error handling
or if you're accepting characters and need encoded bytes:
utf8::encode($string); # unconditionally
The only exception is if you're dealing with filenames, since perl
<<
Are there any other cases someone might be tempted to call
Tony |
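The two cases in this cheat sheet can be sketched as small helpers. This is an illustration under stated assumptions: the helper names require_bytes and chars_to_utf8_bytes are hypothetical, and both work on a copy so the caller's scalar is left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: the caller must hand us bytes (code points 0..0xFF).
sub require_bytes {
    my ($string) = @_;              # work on a copy
    utf8::downgrade($string, 1)     # with a true second argument: returns false instead of dying
        or die "string contains a code point above 0xFF\n";
    return $string;
}

# Hypothetical helper: the caller hands us characters, we return UTF-8 bytes.
sub chars_to_utf8_bytes {
    my ($string) = @_;              # work on a copy
    utf8::encode($string);          # in place, unconditionally
    return $string;
}

my $encoded  = chars_to_utf8_bytes("\x{100}");
my $accepted = eval { require_bytes("caf\xE9"); 1 } ? 1 : 0;
my $rejected = eval { require_bytes("\x{100}"); 1 } ? 0 : 1;

printf "encoded length: %d\n", length($encoded);   # 2
print  "accepted: $accepted\n";                    # 1
print  "rejected: $rejected\n";                    # 1
```

Neither helper needs to inspect utf8::is_utf8(): the decision is made from what the interface promises (bytes or characters), not from the internal flag.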
From @Tux
On Tue, 18 Jul 2017 10:53:53 +1000, Tony Cook <tony@develop-help.com>
I like this. What I miss here is a small example of how to guarantee
--
From @Grinnz
On Tue, Jul 18, 2017 at 3:04 AM, H.Merijn Brand <h.m.brand@xs4all.nl> wrote:
This isn't something that you can guarantee. It always depends on knowing
From @Tux
On Tue, 18 Jul 2017 03:13:40 -0400, Dan Book <grinnz@gmail.com> wrote:
My point exactly. Just have a piece of text that tells the user why it
--
From @xsawyerx
On 07/17/2017 10:09 PM, Hugo van der Sanden via RT wrote:
I meant people who will start hacking on Perl core.
From @tonycoz
On Tue, Jul 18, 2017 at 10:53:53AM +1000, Tony Cook wrote:
Thinking about it further, I'm pretty sure this doesn't all belong L<perlunifaq/What is "the UTF8 flag"?> provides a good description of perlunicook largely works at a higher level than the functions in One thing from the above that doesn't seem to be discussed well[1] is
which could perhaps use some expansion in perlunicode. I'm not sure where the cheat sheet following belongs, though

Tony

[1] perlunifaq briefly mentions some of the issues under "What about
From @xsawyerx
On 07/19/2017 08:58 AM, Tony Cook wrote:
+1 on the suggested text. I think this addition is useful, even if it is also covered in more
From @tonycoz
On Tue, 18 Jul 2017 23:58:39 -0700, tonyc wrote:
perlunitut covers this reasonably well.
Attached is a series of patches (as a single file), the first three

The fourth re-works the documentation in utf8.pm, taking bits from my little cheat sheet and hopefully putting them in the right places.

Tony
From @tonycoz
131685-various-changes.patch
From bb94b5c97eb772aabac478a997537696cf953b39 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Wed, 19 Jul 2017 10:30:56 +1000
Subject: use utf8; doesn't force unicode semantics on all strings in scope
eg.
$ perl -Mutf8 -le 'chr(0xdf) =~ /ss/i and print "match" or print "no match"'
no match
perhaps this should be removed, or completely re-worded, it's worded
similarly to the next point which behaves differently.
---
pod/perlunicode.pod | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index ef02b0a..d3ccf44 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -233,7 +233,7 @@ Unicode:
Within the scope of S<C<use utf8>>
If the whole program is Unicode (signified by using 8-bit B<U>nicode
-B<T>ransformation B<F>ormat), then all strings within it must be
+B<T>ransformation B<F>ormat), then all literal strings within it must be
Unicode.
=item *
--
2.1.4
From b8e048092606e8ab230e0915896cd44a1c900597 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Wed, 19 Jul 2017 10:45:33 +1000
Subject: encoding.pm no longer works
---
pod/perlunicode.pod | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index d3ccf44..24102bf 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -60,10 +60,11 @@ filenames.
Use the C<:encoding(...)> layer to read from and write to
filehandles using the specified encoding. (See L<open>.)
-=item You should convert your non-ASCII, non-UTF-8 Perl scripts to be
+=item You must convert your non-ASCII, non-UTF-8 Perl scripts to be
UTF-8.
-See L<encoding>.
+The L<encoding> module has been deprecated since perl 5.18 and the
+perl internals it requires have been removed with perl 5.26.
=item C<use utf8> still needed to enable L<UTF-8|/Unicode Encodings> in scripts
--
2.1.4
From b997306c58fa50d12a10a92b73ecc075100c8518 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Wed, 19 Jul 2017 15:42:18 +1000
Subject: unfortunately sysread() tries to read characters
---
pod/perluniintro.pod | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 0ad9dda..5e263b4 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -473,8 +473,11 @@ standardisation organisations are recognised; for a more detailed
list see L<Encode::Supported>.
C<read()> reads characters and returns the number of characters.
-C<seek()> and C<tell()> operate on byte counts, as do C<sysread()>
-and C<sysseek()>.
+C<seek()> and C<tell()> operate on byte counts, as does C<sysseek()>.
+
+C<sysread()> and C<syswrite()> should not be used on file handles with
+character encoding layers, they behave badly, and that behaviour has
+been deprecated since perl 5.24.
Notice that because of the default behaviour of not doing any
conversion upon input if there is no default layer,
--
2.1.4
From fb22d08dd9f174ddc4007c8ca6ef0e379fe34874 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Thu, 20 Jul 2017 15:44:49 +1000
Subject: (perl #131685) improve utf8::* function documentation
Splits the little cheat sheet I posted as a comment into pieces
and puts them closer to where they belong
- better document why you'd want to use utf8::upgrade()
- similarly for utf8::downgrade()
- try hard to convince people not to use utf8::is_utf8()
- no, utf8::is_utf8() isn't what you want instead of utf8::valid()
---
lib/utf8.pm | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 52 insertions(+), 9 deletions(-)
diff --git a/lib/utf8.pm b/lib/utf8.pm
index 324cb87..9abbd06 100644
--- a/lib/utf8.pm
+++ b/lib/utf8.pm
@@ -2,7 +2,7 @@ package utf8;
$utf8::hint_bits = 0x00800000;
-our $VERSION = '1.19';
+our $VERSION = '1.20';
sub import {
$^H |= $utf8::hint_bits;
@@ -109,11 +109,26 @@ you should not say that unless you really want to have UTF-8 source code.
Converts in-place the internal representation of the string from an octet
sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The
logical character sequence itself is unchanged. If I<$string> is already
-stored as UTF-8, then this is a no-op. Returns the
-number of octets necessary to represent the string as UTF-8. Can be
-used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
-work as Unicode on strings containing non-ASCII characters whose code points
-are below 256.
+upgraded, then this is a no-op. Returns the
+number of octets necessary to represent the string as UTF-8.
+
+If your code needs to be compatible with versions of perl without
+C<use feature 'unicode_strings';>, you can force Unicode semantics on
+a given string:
+
+ # force unicode semantics for $string without the
+ # "unicode_strings" feature
+ utf8::upgrade($string);
+
+For example:
+
+ # without explicit or implicit use feature 'unicode_strings'
+ my $x = "\xDF"; # LATIN SMALL LETTER SHARP S
+ /ss/i; # won't match
+ my $y = uc($x); # won't comvert
+ utf8::upgrade($x);
+ /ss/i; # matches
+ my $z = uc($x); # converts to "SS"
B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.
@@ -136,6 +151,15 @@ true, returns false.
Returns true on success.
+If your code expects an octet sequence this can be used to validate
+that you've received one:
+
+ # throw an exception if not representable as octets
+ utf8::downgrade($string)
+
+ # or do your own error handling
+ utf8::downgrade($string, 1) or die "string must be octets";
+
B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.
@@ -153,6 +177,11 @@ Returns nothing.
# ASCII platforms) 0xc4 and 0x80. On EBCDIC
# 1047, this would instead be 0x8C and 0x41.
+Similar to:
+
+ use Encode;
+ $a = Encode::encode("utf8", $a);
+
B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.
@@ -208,7 +237,22 @@ platforms, so there is no performance hit in using it there.
=item * C<$flag = utf8::is_utf8($string)>
(Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in
-UTF-8. Functionally the same as C<Encode::is_utf8()>.
+UTF-8. Functionally the same as C<Encode::is_utf8($string)>.
+
+Typically only necessary for debugging and testing, if you need to
+dump the internals of an SV, L<Devel::Peek's|Devel::Peek> Dump()
+provides more detail in a compact form.
+
+If you still think you need this outside of debugging, testing or
+dealing with filenames, you should probably read L<perlunitut> and
+L<perlunifaq/What is "the UTF8 flag"?>.
+
+Don't use this flag as a marker to distinguish character and binary
+data, that should be decided for each variable when you write your
+code.
+
+To force unicode semantics in code portable to perl 5.8 and 5.10, call
+C<utf8::upgrade($string)> unconditionally.
=item * C<$flag = utf8::valid($string)>
@@ -216,8 +260,7 @@ UTF-8. Functionally the same as C<Encode::is_utf8()>.
UTF-8. Will return true if it is well-formed UTF-8 and has the UTF-8 flag
on B<or> if I<$string> is held as bytes (both these states are 'consistent').
Main reason for this routine is to allow Perl's test suite to check
-that operations have left strings in a consistent state. You most
-probably want to use C<utf8::is_utf8()> instead.
+that operations have left strings in a consistent state.
=back
--
2.1.4
From @xsawyerx
On 07/20/2017 07:50 AM, Tony Cook via RT wrote:
Thank you, Tony. I have only two small nit-pickings on the patch: There's a typo for
From @xsawyerx
On 07/20/2017 09:23 AM, Sawyer X wrote:
For what it's worth, this received an offline +1 from rgs. :)
From @tonycoz
On Thu, Jul 20, 2017 at 09:23:44AM +0200, Sawyer X wrote:
Updated patch attached.

Any opinions on whether the reference to C<use utf8;> modified by the

It's still misleading ("abc" in the scope of use utf8; isn't SVf_UTF8

Tony
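The remark about "abc" under C<use utf8;> can be checked with a few lines. This is a minimal sketch rather than part of the patch. The first result is not annotated because it relies on the statement above that a pure-ASCII literal is not marked SVf_UTF8; the last two follow from the documented behaviour of utf8::upgrade().

```perl
use strict;
use warnings;
use utf8;    # this source file is declared to be UTF-8

my $literal = "abc";    # pure-ASCII literal inside the scope of use utf8
print utf8::is_utf8($literal) ? "flag on\n" : "flag off\n";

my $copy = $literal;
utf8::upgrade($copy);   # changes the internal storage only
print utf8::is_utf8($copy) ? "flag on\n" : "flag off\n";    # flag on
print $copy eq $literal ? "equal\n" : "different\n";        # equal
```

The two scalars compare equal while the flag differs, which is why the flag says nothing about whether a string holds character or binary data.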
From @tonycoz
131685-various-changes.patch
From bb94b5c97eb772aabac478a997537696cf953b39 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Wed, 19 Jul 2017 10:30:56 +1000
Subject: use utf8; doesn't force unicode semantics on all strings in scope
eg.
$ perl -Mutf8 -le 'chr(0xdf) =~ /ss/i and print "match" or print "no match"'
no match
perhaps this should be removed, or completely re-worded, it's worded
similarly to the next point which behaves differently.
---
pod/perlunicode.pod | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index ef02b0a..d3ccf44 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -233,7 +233,7 @@ Unicode:
Within the scope of S<C<use utf8>>
If the whole program is Unicode (signified by using 8-bit B<U>nicode
-B<T>ransformation B<F>ormat), then all strings within it must be
+B<T>ransformation B<F>ormat), then all literal strings within it must be
Unicode.
=item *
--
2.1.4
From b8e048092606e8ab230e0915896cd44a1c900597 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Wed, 19 Jul 2017 10:45:33 +1000
Subject: encoding.pm no longer works
---
pod/perlunicode.pod | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index d3ccf44..24102bf 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -60,10 +60,11 @@ filenames.
Use the C<:encoding(...)> layer to read from and write to
filehandles using the specified encoding. (See L<open>.)
-=item You should convert your non-ASCII, non-UTF-8 Perl scripts to be
+=item You must convert your non-ASCII, non-UTF-8 Perl scripts to be
UTF-8.
-See L<encoding>.
+The L<encoding> module has been deprecated since perl 5.18 and the
+perl internals it requires have been removed with perl 5.26.
=item C<use utf8> still needed to enable L<UTF-8|/Unicode Encodings> in scripts
--
2.1.4
From b997306c58fa50d12a10a92b73ecc075100c8518 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Wed, 19 Jul 2017 15:42:18 +1000
Subject: unfortunately sysread() tries to read characters
---
pod/perluniintro.pod | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 0ad9dda..5e263b4 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -473,8 +473,11 @@ standardisation organisations are recognised; for a more detailed
list see L<Encode::Supported>.
C<read()> reads characters and returns the number of characters.
-C<seek()> and C<tell()> operate on byte counts, as do C<sysread()>
-and C<sysseek()>.
+C<seek()> and C<tell()> operate on byte counts, as does C<sysseek()>.
+
+C<sysread()> and C<syswrite()> should not be used on file handles with
+character encoding layers, they behave badly, and that behaviour has
+been deprecated since perl 5.24.
Notice that because of the default behaviour of not doing any
conversion upon input if there is no default layer,
--
2.1.4
From bba883b879024faf30095f9f19b52ec5ce4d8aac Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Fri, 21 Jul 2017 11:29:39 +1000
Subject: (perl #131685) improve utf8::* function documentation
Splits the little cheat sheet I posted as a comment into pieces
and puts them closer to where they belong
- better document why you'd want to use utf8::upgrade()
- similarly for utf8::downgrade()
- try hard to convince people not to use utf8::is_utf8()
- no, utf8::is_utf8() isn't what you want instead of utf8::valid()
- change some examples to use $x instead of the sort reserved $a
---
lib/utf8.pm | 69 +++++++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 56 insertions(+), 13 deletions(-)
diff --git a/lib/utf8.pm b/lib/utf8.pm
index 324cb87..50a5b20 100644
--- a/lib/utf8.pm
+++ b/lib/utf8.pm
@@ -2,7 +2,7 @@ package utf8;
$utf8::hint_bits = 0x00800000;
-our $VERSION = '1.19';
+our $VERSION = '1.20';
sub import {
$^H |= $utf8::hint_bits;
@@ -109,11 +109,26 @@ you should not say that unless you really want to have UTF-8 source code.
Converts in-place the internal representation of the string from an octet
sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The
logical character sequence itself is unchanged. If I<$string> is already
-stored as UTF-8, then this is a no-op. Returns the
-number of octets necessary to represent the string as UTF-8. Can be
-used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
-work as Unicode on strings containing non-ASCII characters whose code points
-are below 256.
+upgraded, then this is a no-op. Returns the
+number of octets necessary to represent the string as UTF-8.
+
+If your code needs to be compatible with versions of perl without
+C<use feature 'unicode_strings';>, you can force Unicode semantics on
+a given string:
+
+ # force unicode semantics for $string without the
+ # "unicode_strings" feature
+ utf8::upgrade($string);
+
+For example:
+
+ # without explicit or implicit use feature 'unicode_strings'
+ my $x = "\xDF"; # LATIN SMALL LETTER SHARP S
+ /ss/i; # won't match
+ my $y = uc($x); # won't convert
+ utf8::upgrade($x);
+ /ss/i; # matches
+ my $z = uc($x); # converts to "SS"
B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.
@@ -136,6 +151,15 @@ true, returns false.
Returns true on success.
+If your code expects an octet sequence this can be used to validate
+that you've received one:
+
+ # throw an exception if not representable as octets
+ utf8::downgrade($string)
+
+ # or do your own error handling
+ utf8::downgrade($string, 1) or die "string must be octets";
+
B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.
@@ -148,11 +172,16 @@ replaced with a sequence of one or more characters that represent the
individual UTF-8 bytes of the character. The UTF8 flag is turned off.
Returns nothing.
- my $a = "\x{100}"; # $a contains one character, with ord 0x100
- utf8::encode($a); # $a contains two characters, with ords (on
+ my $x = "\x{100}"; # $a contains one character, with ord 0x100
+ utf8::encode($x); # $a contains two characters, with ords (on
# ASCII platforms) 0xc4 and 0x80. On EBCDIC
# 1047, this would instead be 0x8C and 0x41.
+Similar to:
+
+ use Encode;
+ $x = Encode::encode("utf8", $x);
+
B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.
@@ -167,9 +196,9 @@ turned on only if the source string contains multiple-byte UTF-8
characters. If I<$string> is invalid as UTF-8, returns false;
otherwise returns true.
- my $a = "\xc4\x80"; # $a contains two characters, with ords
+ my $x = "\xc4\x80"; # $a contains two characters, with ords
# 0xc4 and 0x80
- utf8::decode($a); # On ASCII platforms, $a contains one char,
+ utf8::decode($x); # On ASCII platforms, $a contains one char,
# with ord 0x100. Since these bytes aren't
# legal UTF-EBCDIC, on EBCDIC platforms, $a is
# unchanged and the function returns FALSE.
@@ -208,7 +237,22 @@ platforms, so there is no performance hit in using it there.
=item * C<$flag = utf8::is_utf8($string)>
(Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in
-UTF-8. Functionally the same as C<Encode::is_utf8()>.
+UTF-8. Functionally the same as C<Encode::is_utf8($string)>.
+
+Typically only necessary for debugging and testing, if you need to
+dump the internals of an SV, L<Devel::Peek's|Devel::Peek> Dump()
+provides more detail in a compact form.
+
+If you still think you need this outside of debugging, testing or
+dealing with filenames, you should probably read L<perlunitut> and
+L<perlunifaq/What is "the UTF8 flag"?>.
+
+Don't use this flag as a marker to distinguish character and binary
+data, that should be decided for each variable when you write your
+code.
+
+To force unicode semantics in code portable to perl 5.8 and 5.10, call
+C<utf8::upgrade($string)> unconditionally.
=item * C<$flag = utf8::valid($string)>
@@ -216,8 +260,7 @@ UTF-8. Functionally the same as C<Encode::is_utf8()>.
UTF-8. Will return true if it is well-formed UTF-8 and has the UTF-8 flag
on B<or> if I<$string> is held as bytes (both these states are 'consistent').
Main reason for this routine is to allow Perl's test suite to check
-that operations have left strings in a consistent state. You most
-probably want to use C<utf8::is_utf8()> instead.
+that operations have left strings in a consistent state.
=back
--
2.1.4
From @xsawyerx
+1 (Except "$a" still appears in the comments next to the lines that now

On 07/21/2017 03:40 AM, Tony Cook wrote:
From @tonycoz
On Fri, 21 Jul 2017 02:02:08 -0700, xsawyerx@gmail.com wrote:
Fixed and applied as e423fa8, 01c3fbb, ee329ae and 0397beb.

Is there anything else we should do to avoid mis-use of these functions?

I previously said:

I'm referring to "I/O flow (the actual 5 minute tutorial)", should this be expanded elsewhere? I don't think it should be expanded in perlunitut.

Tony
From @pali
On Sunday 23 July 2017 18:57:43 Tony Cook via RT wrote:
Just one note:

+Similar to:

Maybe instead of "utf8" we should show "UTF-8" to users/developers in

In commit 8e179dd was replaced usage of
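For context on the "utf8" versus "UTF-8" spelling, the Encode module treats the two names as different encodings: "utf8" is the lax variant and "UTF-8" resolves to the strict one. A small sketch, assuming an ASCII platform and an ordinary code point for which both variants produce the same bytes:

```perl
use strict;
use warnings;
use Encode ();

# The two spellings resolve to different encoding objects.
print Encode::find_encoding("utf8")->name,  "\n";   # utf8
print Encode::find_encoding("UTF-8")->name, "\n";   # utf-8-strict

# For an ordinary character both encode to the same octets.
my $lax    = Encode::encode("utf8",  "\x{100}");
my $strict = Encode::encode("UTF-8", "\x{100}");
print $lax eq $strict ? "same bytes\n" : "different bytes\n";   # same bytes
```

The variants only diverge for input that strict UTF-8 disallows, so showing "UTF-8" in documentation examples steers readers toward the stricter behaviour.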
From @pali
On Sunday 23 July 2017 18:57:43 Tony Cook via RT wrote:
The most useful and legitimate are those functions:

What about moving them "upper" in synopsis and also in description? So

Probably adding "[INTERNAL]" description, like is for utf8::valid could
From @khwilliamson
On 07/13/2017 08:28 PM, Father Chrysostomos via RT wrote:
I searched the archives of p5p for occurrences of av_top_index and I myself am confused by the previous names, and this helps *me*. There
That tells me that the names were not chosen well enough. It is an art,
If you assume that new Perl XS programmers are mostly going to be

My father was good at double-clutching. He used that, the story goes,
Migrated from rt.perl.org#131685 (status was 'open')
Searchable as RT131685$