Rename utf8::is_utf8() (and other functions) #16060

Open
p5pRT opened this issue Jul 1, 2017 · 59 comments

Comments

@p5pRT

p5pRT commented Jul 1, 2017

Migrated from rt.perl.org#131685 (status was 'open')

Searchable as RT131685$

@p5pRT
Author

p5pRT commented Jul 1, 2017

From @pali

Hi!

This is a continuation of the original discussion about renaming
utf8::is_utf8() to utf8::is_upgraded(), which can be found at:
https://www.nntp.perl.org/group/perl.perl5.porters/2017/02/msg243068.html

The problem is that several Perl modules use this incorrect code
pattern:

  use utf8;

  my $value = func();
  if (utf8::is_utf8($value)) {
      utf8::encode($value);
  }

In most cases module developers think that utf8::is_utf8() returns true
when the argument needs to be manually encoded into UTF-8 bytes, which
is of course wrong.

The reason for this is the poor name of utf8::is_utf8() and the poor
documentation of this function.
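
To illustrate why the flag says nothing about the string's content, here is a minimal sketch. The variable names are illustrative only and not taken from any module. It forces the same one-character string into both internal representations and compares what pure Perl code can observe:

```perl
use strict;
use warnings;

# Two copies of the same one-character string, U+00E9.
my $narrow = "\x{e9}";
my $wide   = "\x{e9}";

utf8::downgrade($narrow);   # force 8-bit internal storage
utf8::upgrade($wide);       # force wide (UTF-8) internal storage

# To pure Perl code the two strings are identical ...
my $same_string = ($narrow eq $wide) ? 1 : 0;
my $same_length = (length($narrow) == length($wide)) ? 1 : 0;

# ... yet utf8::is_utf8() gives a different answer for each of them.
my $narrow_flag = utf8::is_utf8($narrow) ? 1 : 0;
my $wide_flag   = utf8::is_utf8($wide)   ? 1 : 0;

print "same string: $same_string, same length: $same_length\n";
print "is_utf8(narrow) = $narrow_flag, is_utf8(wide) = $wide_flag\n";
```

Both strings hold the same character, so a decision based on the flag depends on how the value happened to be stored rather than on what it contains.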

The functions utf8::is_utf8(), utf8::upgrade() and utf8::downgrade()
inspect or change the internal string representation, which is fully
invisible to pure Perl code, and therefore I think all of these
functions should be in the Internals namespace.

I'm proposing the following renames:

utf8::is_utf8() --> Internals::uses_string_wide_storage()
utf8::upgrade() --> Internals::upgrade_string_to_wide_storage()
utf8::downgrade() --> Internals::downgrade_string_from_wide_storage()

Plus adding backward-compatible aliases so that existing code works as
before.

As all these functions should be used only for debugging purposes (e.g.
test cases for XS code) or when dealing with a buggy XS module, I'm
proposing to start emitting a warning (e.g. since v5.28.0) when these
functions are called. Those who are dealing with internals can turn
the warning off with no warnings 'experimental::internal';

I'm attaching patches which:

* Add a new warning category 'experimental::internal'
* Rename the utf8 functions
* Update the perldoc utf8 documentation

@p5pRT
Author

p5pRT commented Jul 1, 2017

From @pali

0001-Add-new-warning-category-experimental-internal.patch
From c7b1fcfd26a2500662a10e345691eda3f3f32039 Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sat, 1 Jul 2017 17:33:45 +0200
Subject: [PATCH 1/3] Add new warning category experimental::internal

This category is for internal perl functions which should not be used in
normal perl code, unless dealing with perl internals.
---
 lib/warnings.pm   |   19 +++++++++++++------
 regen/warnings.pl |    2 ++
 warnings.h        |    4 ++++
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/lib/warnings.pm b/lib/warnings.pm
index 2ae1bb4..7b27e4a 100644
--- a/lib/warnings.pm
+++ b/lib/warnings.pm
@@ -96,10 +96,13 @@ our %Offsets = (
 
     # Warnings Categories added in Perl 5.025
     'experimental::declared_refs'	=> 132,
+
+    # Warnings Categories added in Perl 5.028
+    'experimental::internal'		=> 134,
 );
 
 our %Bits = (
-    'all'				=> "\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x15", # [0..66]
+    'all'				=> "\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55", # [0..67]
     'ambiguous'				=> "\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [29]
     'bareword'				=> "\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [30]
     'closed'				=> "\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [6]
@@ -109,10 +112,11 @@ our %Bits = (
     'digit'				=> "\x00\x00\x00\x00\x00\x00\x00\x40\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [31]
     'exec'				=> "\x00\x40\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [7]
     'exiting'				=> "\x40\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [3]
-    'experimental'			=> "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x40\x55\x51\x15\x10", # [51..56,58..62,66]
+    'experimental'			=> "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x40\x55\x51\x15\x50", # [51..56,58..62,66,67]
     'experimental::bitwise'		=> "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00", # [58]
     'experimental::const_attr'		=> "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x40\x00\x00", # [59]
     'experimental::declared_refs'	=> "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10", # [66]
+    'experimental::internal'		=> "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x40", # [67]
     'experimental::lexical_subs'	=> "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00", # [52]
     'experimental::postderef'		=> "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x40\x00\x00\x00", # [55]
     'experimental::re_strict'		=> "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00", # [60]
@@ -169,7 +173,7 @@ our %Bits = (
 );
 
 our %DeadBits = (
-    'all'				=> "\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\x2a", # [0..66]
+    'all'				=> "\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa", # [0..67]
     'ambiguous'				=> "\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [29]
     'bareword'				=> "\x00\x00\x00\x00\x00\x00\x00\x20\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [30]
     'closed'				=> "\x00\x20\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [6]
@@ -179,10 +183,11 @@ our %DeadBits = (
     'digit'				=> "\x00\x00\x00\x00\x00\x00\x00\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [31]
     'exec'				=> "\x00\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [7]
     'exiting'				=> "\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [3]
-    'experimental'			=> "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\xaa\xa2\x2a\x20", # [51..56,58..62,66]
+    'experimental'			=> "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\xaa\xa2\x2a\xa0", # [51..56,58..62,66,67]
     'experimental::bitwise'		=> "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x20\x00\x00", # [58]
     'experimental::const_attr'		=> "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\x00\x00", # [59]
     'experimental::declared_refs'	=> "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x20", # [66]
+    'experimental::internal'		=> "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80", # [67]
     'experimental::lexical_subs'	=> "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00", # [52]
     'experimental::postderef'		=> "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\x00\x00\x00", # [55]
     'experimental::re_strict'		=> "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00", # [60]
@@ -240,8 +245,8 @@ our %DeadBits = (
 
 # These are used by various things, including our own tests
 our $NONE				=  "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0";
-our $DEFAULT				=  "\x10\x01\x00\x00\x00\x50\x04\x00\x00\x00\x00\x00\x00\x55\x51\x55\x10", # [2,4,22,23,25,52..56,58..63,66]
-our $LAST_BIT				=  134 ;
+our $DEFAULT				=  "\x10\x01\x00\x00\x00\x50\x04\x00\x00\x00\x00\x00\x00\x55\x51\x55\x50", # [2,4,22,23,25,52..56,58..63,66,67]
+our $LAST_BIT				=  136 ;
 our $BYTES				=  17 ;
 
 our $All = "" ; vec($All, $Offsets{'all'}, 2) = 3 ;
@@ -732,6 +737,8 @@ The current hierarchy is:
          |                 |
          |                 +- experimental::declared_refs
          |                 |
+         |                 +- experimental::internal
+         |                 |
          |                 +- experimental::lexical_subs
          |                 |
          |                 +- experimental::postderef
diff --git a/regen/warnings.pl b/regen/warnings.pl
index 5721c17..36ce14b 100644
--- a/regen/warnings.pl
+++ b/regen/warnings.pl
@@ -107,6 +107,8 @@ my $tree = {
                                     [ 5.021, DEFAULT_ON ],
                                 'experimental::declared_refs' =>
                                     [ 5.025, DEFAULT_ON ],
+                                'experimental::internal' =>
+                                    [ 5.028, DEFAULT_ON ],
                         }],
 
         'missing'       => [ 5.021, DEFAULT_OFF],
diff --git a/warnings.h b/warnings.h
index 0166837..72e27a2 100644
--- a/warnings.h
+++ b/warnings.h
@@ -115,6 +115,10 @@
 
 #define WARN_EXPERIMENTAL__DECLARED_REFS 66
 
+/* Warnings Categories added in Perl 5.028 */
+
+#define WARN_EXPERIMENTAL__INTERNAL	 67
+
 #define WARNsize			 17
 #define WARN_ALLstring			 "\125\125\125\125\125\125\125\125\125\125\125\125\125\125\125\125\125"
 #define WARN_NONEstring			 "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
-- 
1.7.9.5

@p5pRT
Author

p5pRT commented Jul 1, 2017

From @pali

0002-Mark-functions-utf8-is_utf8-utf8-upgrade-utf8-downgr.patch
From d763a8a4b85b53ebc5b05ba1b0a64daf9df6c2e2 Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sat, 1 Jul 2017 17:41:15 +0200
Subject: [PATCH 2/3] Mark functions utf8::is_utf8(), utf8::upgrade(),
 utf8::downgrade() as Internal

Move all those functions into the Internals namespace, emit the new
experimental::internal warning when they are used, and provide
backward-compatible deprecated aliases (so that existing code still works).

In most cases these functions are used incorrectly because of their poor
names and inadequate documentation. They are internal and should not be
used unless debugging perl or dealing with broken XS modules.
---
 universal.c |   65 ++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 55 insertions(+), 10 deletions(-)

diff --git a/universal.c b/universal.c
index be39310..20f1d53 100644
--- a/universal.c
+++ b/universal.c
@@ -422,10 +422,24 @@ XS(XS_UNIVERSAL_DOES)
     }
 }
 
-XS(XS_utf8_is_utf8); /* prototype to pass -Wmissing-prototypes */
-XS(XS_utf8_is_utf8)
+XS(XS_Internals_uses_string_wide_storage); /* prototype to pass -Wmissing-prototypes */
+XS(XS_Internals_uses_string_wide_storage)
 {
-     dXSARGS;
+    dXSARGS;
+    const GV *const gv = CvGV(cv);
+    const HV *const stash = gv ? GvSTASH(gv) : NULL;
+    const char *const hvname = stash ? HvNAME(stash) : NULL;
+
+    if (hvname && strcmp(hvname, "utf8") == 0) {
+        Perl_ck_warner_d(aTHX_
+            packWARN(WARN_DEPRECATED),
+            "utf8::is_utf8() is internal and deprecated function, look into perldoc utf8");
+    } else {
+        Perl_ck_warner_d(aTHX_
+            packWARN(WARN_EXPERIMENTAL__INTERNAL),
+            "Internals::uses_string_wide_storage() is experimental internal perl function");
+    }
+
      if (items != 1)
 	 croak_xs_usage(cv, "sv");
      else {
@@ -485,10 +499,24 @@ XS(XS_utf8_decode)
     XSRETURN(1);
 }
 
-XS(XS_utf8_upgrade); /* prototype to pass -Wmissing-prototypes */
-XS(XS_utf8_upgrade)
+XS(XS_Internals_upgrade_string_to_wide_storage); /* prototype to pass -Wmissing-prototypes */
+XS(XS_Internals_upgrade_string_to_wide_storage)
 {
     dXSARGS;
+    const GV *const gv = CvGV(cv);
+    const HV *const stash = gv ? GvSTASH(gv) : NULL;
+    const char *const hvname = stash ? HvNAME(stash) : NULL;
+
+    if (hvname && strcmp(hvname, "utf8") == 0) {
+        Perl_ck_warner_d(aTHX_
+            packWARN(WARN_DEPRECATED),
+            "utf8::upgrade() is internal and deprecated function, look into perldoc utf8");
+    } else {
+        Perl_ck_warner_d(aTHX_
+            packWARN(WARN_EXPERIMENTAL__INTERNAL),
+            "Internals::upgrade_string_to_wide_storage() is experimental internal perl function");
+    }
+
     if (items != 1)
 	croak_xs_usage(cv, "sv");
     else {
@@ -502,10 +530,24 @@ XS(XS_utf8_upgrade)
     XSRETURN(1);
 }
 
-XS(XS_utf8_downgrade); /* prototype to pass -Wmissing-prototypes */
-XS(XS_utf8_downgrade)
+XS(XS_Internals_downgrade_string_from_wide_storage); /* prototype to pass -Wmissing-prototypes */
+XS(XS_Internals_downgrade_string_from_wide_storage)
 {
     dXSARGS;
+    const GV *const gv = CvGV(cv);
+    const HV *const stash = gv ? GvSTASH(gv) : NULL;
+    const char *const hvname = stash ? HvNAME(stash) : NULL;
+
+    if (hvname && strcmp(hvname, "utf8") == 0) {
+        Perl_ck_warner_d(aTHX_
+            packWARN(WARN_DEPRECATED),
+            "utf8::downgrade() is internal and deprecated function, look into perldoc utf8");
+    } else {
+        Perl_ck_warner_d(aTHX_
+            packWARN(WARN_EXPERIMENTAL__INTERNAL),
+            "Internals::downgrade_string_from_wide_storage() is experimental internal function");
+    }
+
     if (items < 1 || items > 2)
 	croak_xs_usage(cv, "sv, failok=0");
     else {
@@ -1000,14 +1042,17 @@ static const struct xsub_details details[] = {
 #define VXS_XSUB_DETAILS
 #include "vxs.inc"
 #undef VXS_XSUB_DETAILS
-    {"utf8::is_utf8", XS_utf8_is_utf8, NULL},
+    {"utf8::is_utf8", XS_Internals_uses_string_wide_storage, NULL},
     {"utf8::valid", XS_utf8_valid, NULL},
     {"utf8::encode", XS_utf8_encode, NULL},
     {"utf8::decode", XS_utf8_decode, NULL},
-    {"utf8::upgrade", XS_utf8_upgrade, NULL},
-    {"utf8::downgrade", XS_utf8_downgrade, NULL},
+    {"utf8::upgrade", XS_Internals_upgrade_string_to_wide_storage, NULL},
+    {"utf8::downgrade", XS_Internals_downgrade_string_from_wide_storage, NULL},
     {"utf8::native_to_unicode", XS_utf8_native_to_unicode, NULL},
     {"utf8::unicode_to_native", XS_utf8_unicode_to_native, NULL},
+    {"Internals::uses_string_wide_storage", XS_Internals_uses_string_wide_storage, NULL},
+    {"Internals::upgrade_string_to_wide_storage", XS_Internals_upgrade_string_to_wide_storage, NULL},
+    {"Internals::downgrade_string_from_wide_storage", XS_Internals_downgrade_string_from_wide_storage, NULL},
     {"Internals::SvREADONLY", XS_Internals_SvREADONLY, "\\[$%@];$"},
     {"Internals::SvREFCNT", XS_Internals_SvREFCNT, "\\[$%@];$"},
     {"Internals::hv_clear_placeholders", XS_Internals_hv_clear_placehold, "\\%"},
-- 
1.7.9.5

@p5pRT
Author

p5pRT commented Jul 1, 2017

From @pali

0003-Update-documentation-in-perldoc-utf8.patch
From e5b0bbd18075ea178708f5da32beee3570751f0e Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Sat, 1 Jul 2017 17:46:05 +0200
Subject: [PATCH 3/3] Update documentation in perldoc utf8

Add information about new internal functions and update documentation for
wide string storage functions.
---
 lib/utf8.pm |   93 ++++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 54 insertions(+), 39 deletions(-)

diff --git a/lib/utf8.pm b/lib/utf8.pm
index 324cb87..84a96ae 100644
--- a/lib/utf8.pm
+++ b/lib/utf8.pm
@@ -31,14 +31,8 @@ utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code
  use utf8;
  no utf8;
 
- # Convert the internal representation of a Perl scalar to/from UTF-8.
-
- $num_octets = utf8::upgrade($string);
- $success    = utf8::downgrade($string[, $fail_ok]);
-
  # Change each character of a Perl scalar to/from a series of
  # characters that represent the UTF-8 bytes of each original character.
-
  utf8::encode($string);  # "\x{100}"  becomes "\xc4\x80"
  utf8::decode($string);  # "\xc4\x80" becomes "\x{100}"
 
@@ -51,7 +45,6 @@ utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code
                                                # platforms; 193 on
                                                # EBCDIC
 
- $flag = utf8::is_utf8($string); # since Perl 5.8.1
  $flag = utf8::valid($string);
 
 =head1 DESCRIPTION
@@ -105,39 +98,46 @@ you should not say that unless you really want to have UTF-8 source code.
 
 =item * C<$num_octets = utf8::upgrade($string)>
 
-(Since Perl v5.8.0)
-Converts in-place the internal representation of the string from an octet
-sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The
-logical character sequence itself is unchanged.  If I<$string> is already
-stored as UTF-8, then this is a no-op. Returns the
-number of octets necessary to represent the string as UTF-8.  Can be
-used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
-work as Unicode on strings containing non-ASCII characters whose code points
-are below 256.
+[INTERNAL] (Since Perl v5.8.0) Deprecated compatibility-supporting alias of
+C<Internals::upgrade_string_to_wide_storage>
 
-B<Note that this function does not handle arbitrary encodings>;
-use L<Encode> instead.
+=item * C<$num_octets = Internals::upgrade_string_to_wide_storage($string)>
 
-=item * C<$success = utf8::downgrade($string[, $fail_ok])>
+[INTERNAL] (Since Perl v5.28.0)
 
-(Since Perl v5.8.0)
-Converts in-place the internal representation of the string from
-UTF-8 to the equivalent octet sequence in the native encoding (Latin-1
-or EBCDIC). The logical character sequence itself is unchanged. If
-I<$string> is already stored as native 8 bit, then this is a no-op.  Can
-be used to
-make sure that the UTF-8 flag is off, e.g. when you want to make sure
-that the substr() or length() function works with the usually faster
-byte algorithm.
-
-Fails if the original UTF-8 sequence cannot be represented in the
-native 8 bit encoding. On failure dies or, if the value of I<$fail_ok> is
-true, returns false. 
+Converts in-place the internal representation of the string to wide storage
+(which can store characters above U+0000FF).  The logical character sequence
+itself is unchanged.  If I<$string> is already stored in wide storage then
+this is a no-op.  Returns the number of bytes necessary to represent the
+string in wide storage.
+
+Internal string storage is invisible to pure Perl code, and perl itself calls
+this function automatically when needed.  Therefore there is no reason to call
+this function unless you are debugging internal perl C or XS code.
+
+=item * C<$num_octets = Internals::downgrade_string_from_wide_storage($string[, $fail_ok])>
+
+[INTERNAL] (Since Perl v5.28.0)
+
+Converts in-place the internal representation of the string from wide storage
+(which can store characters above U+0000FF) to small non-wide 8 bit storage
+(which can store only 8 bit characters).  The logical character sequence
+itself is unchanged.  If I<$string> is already stored in non-wide 8 bit storage,
+then this is a no-op.
+
+Fails if the original I<$string> cannot be represented in the native 8 bit
+encoding.  On failure dies or, if the value of I<$fail_ok> is true, returns false.
 
 Returns true on success.
 
-B<Note that this function does not handle arbitrary encodings>;
-use L<Encode> instead.
+Internal string storage is invisible to pure Perl code, and perl itself calls
+this function automatically when needed.  Therefore there is no reason to call
+this function unless you are debugging internal perl C or XS code.
+
+=item * C<$success = utf8::downgrade($string[, $fail_ok])>
+
+[INTERNAL] (Since Perl v5.8.0)  Deprecated compatibility-supporting alias of
+C<Internals::downgrade_string_from_wide_storage>
 
 =item * C<utf8::encode($string)>
 
@@ -207,17 +207,32 @@ platforms, so there is no performance hit in using it there.
 
 =item * C<$flag = utf8::is_utf8($string)>
 
-(Since Perl 5.8.1)  Test whether I<$string> is marked internally as encoded in
-UTF-8.  Functionally the same as C<Encode::is_utf8()>.
+[INTERNAL] (Since Perl v5.8.1)  Deprecated compatibility-supporting (but poorly
+named) alias of C<Internals::uses_string_wide_storage()>.  It does B<not> check
+whether the string is encoded in UTF-8.
+
+=item * C<$flag = Internals::uses_string_wide_storage($string)>
+
+[INTERNAL] (Since Perl v5.28.0)
+
+Test whether C<$string>'s internal representation storage is wide (which can
+store characters above U+0000FF).  Note that C<$string> can, but does not have
+to, contain wide characters.  It has no bearing on whether that string is
+actually UTF-8 or not.
+
+Internal string storage is invisible to pure Perl code, and perl itself
+changes storage automatically when needed.  This function should not be used
+unless you are debugging internal perl C or XS code.
 
 =item * C<$flag = utf8::valid($string)>
 
-[INTERNAL] Test whether I<$string> is in a consistent state regarding
+[INTERNAL] (Since Perl v5.8.0)
+
+Test whether I<$string> is in a consistent state regarding
 UTF-8.  Will return true if it is well-formed UTF-8 and has the UTF-8 flag
 on B<or> if I<$string> is held as bytes (both these states are 'consistent').
 Main reason for this routine is to allow Perl's test suite to check
-that operations have left strings in a consistent state.  You most
-probably want to use C<utf8::is_utf8()> instead.
+that operations have left strings in a consistent state.
 
 =back
 
-- 
1.7.9.5

@p5pRT
Author

p5pRT commented Jul 1, 2017

From @Tux

On Sat, 01 Jul 2017 09:03:18 -0700, (via RT)
<perlbug-followup@perl.org> wrote:

> # New Ticket Created by
> # Please include the string: [perl #131685]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=131685 >
>
> Hi!
>
> This is continuation from original discussion about renaming
> utf8::is_utf8() to utf8::is_upgraded() which can be found at:
> https://www.nntp.perl.org/group/perl.perl5.porters/2017/02/msg243068.html
>
> Problem is that in more perl modules is used this incorrect code
> pattern:
>
>   use utf8;
>
>   my $value = func();
>   if (utf8::is_utf8($value)) {
>   utf8::encode($value);
>   }
>
> In most cases module developers think that utf8::is_utf8() returns true
> when it is needed to manually encode argument into UTF-8 bytes. Which is
> of course wrong.
>
> Reason for this is poor name of function utf8::is_utf8() and also poor
> documentation about this function.
>
> Functions utf8::is_utf8(), utf8::upgrade() and utf8::downgrade() changes
> internal string representation, which is fully invisible for pure perl
> code, and therefore I think all those functions should be in Internals
> namespace.
>
> I'm proposing following rename of functions:
>
> utf8::is_utf8() --> Internals::uses_string_wide_storage()
> utf8::upgrade() --> Internals::upgrade_string_to_wide_storage()
> utf8::downgrade() --> Internals::downgrade_string_from_wide_storage()

I am still objecting, as this will also break code that uses those
functions as intended and correctly.

As these are not XS, Devel::PPPort won't help (assuming authors use
D::P on XS modules to guarantee backward compat)

I'd loath to change/fix every occurrence of code that uses any of these
three correctly, as that code is brittle to start with and probably
hard to fix when broken.

> Plus adding backward compatible aliases to make existing code works like
> before.

Then why add new functions in the first place?

> As all those functions should be used only for debugging purposes (e.g.
> test cases for XS code) or when dealing with buggy XS module, I'm
> proposing starting to throw warning (e.g. since v5.28.0) when those
> functions are called. For those who are dealing with internals, can turn
> warning off by no warnings 'experimental::internal';

No, please. Most correct uses will be in dark distant corners, hidden
in modules you don't want to touch anyway.

> I'm attaching patches which:
>
> * Add new warning category 'experimental::internal'
> * Rename utf8 functions
> * Update perldoc utf8 documentation

--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.27 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

@p5pRT
Author

p5pRT commented Jul 1, 2017

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Author

p5pRT commented Jul 1, 2017

From @Leont

I don't see how this is an option. I'll grant you that something like this
would have been a better option back then but you're 15 years too late.
"This would have been better" is no excuse to break a decade and a half of
software.

Leon

@p5pRT
Author

p5pRT commented Jul 1, 2017

From @pali

On Saturday 01 July 2017 19:13:30 you wrote:

> to break a decade and a half of software.

Hm? What do you mean by "break"? The existing functions would still work;
there would just also be new functions under new names. Usage of the old
functions is simply removed from the documentation.

@p5pRT
Author

p5pRT commented Jul 1, 2017

From @pali

On Saturday 01 July 2017 18:54:24 you wrote:

>> Plus adding backward compatible aliases to make existing code works
>> like before.
>
> Then why add new functions in the first place?

From the discussion it was clear that the current name utf8::is_utf8() is poor
and is the reason why it is used incorrectly.

@p5pRT
Author

p5pRT commented Jul 1, 2017

From @Leont

On Sat, Jul 1, 2017 at 7:45 PM, <pali@cpan.org> wrote:

> On Saturday 01 July 2017 19:13:30 you wrote:
>
>> to break a decade and a half of software.
>
> Hm? What you mean with to break? Existing functions would still work,
> just there are also new functions under new names. Usage of old
> functions is just removed from documentation.

Then I misunderstood your proposal, "rename" suggested to me that the old
ones disappear. In that case I'm not sure I see the benefit of your
proposal. Why would anyone want to use an interface that won't work on
perls older than 5.28, and could disappear in a future version of perl
(since that's the point of Internals::)? This isn't making sense to me.

Leon

@p5pRT
Author

p5pRT commented Jul 3, 2017

From @xsawyerx

On 07/01/2017 01:52 PM, Leon Timmermans wrote:

> On Sat, Jul 1, 2017 at 7:45 PM, <pali@cpan.org> wrote:
>
>> On Saturday 01 July 2017 19:13:30 you wrote:
>>> to break a decade and a half of software.
>>
>> Hm? What you mean with to break? Existing functions would still work,
>> just there are also new functions under new names. Usage of old
>> functions is just removed from documentation.
>
> Then I misunderstood your proposal, "rename" suggested to me that the
> old ones disappear. In that case I'm not sure I see the benefit of
> your proposal. Why would anyone want to use an interface that won't
> work on perls older than 5.28, and could disappear in a future version
> of perl (since that's the point of Internals::)? This isn't making
> sense to me.

You could support it with Devel::PPPort. It's a simple addition.

However, the problem remains that if someone were to use these new
functions without PPPort, their code would not work on older versions. I
can't see a way around that.
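
For pure Perl callers, one possible workaround is run-time feature detection. The following is only a sketch under assumptions: it takes the function name proposed in this ticket, does not assume that any released perl actually defines it, and falls back to utf8::is_utf8() when it is missing. The variable names are hypothetical.

```perl
use strict;
use warnings;

# Name proposed in this ticket; not assumed to exist on the running perl.
my $proposed = 'Internals::uses_string_wide_storage';

# Use the proposed function when the running perl defines it,
# otherwise fall back to the long-standing utf8::is_utf8().
my $uses_wide_storage = do {
    no strict 'refs';
    defined(&{$proposed}) ? \&{$proposed} : \&utf8::is_utf8;
};

my $s    = "\x{100}";   # U+0100 cannot be held in 8-bit storage
my $flag = $uses_wide_storage->($s) ? 1 : 0;
print "wide storage: $flag\n";
```

This keeps calling code working on older perls, at the cost of every caller carrying the fallback itself.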

@p5pRT
Author

p5pRT commented Jul 4, 2017

From @tonycoz

On Mon, Jul 03, 2017 at 01:03:37PM -0400, Sawyer X wrote:

> On 07/01/2017 01:52 PM, Leon Timmermans wrote:
>
>> On Sat, Jul 1, 2017 at 7:45 PM, <pali@cpan.org> wrote:
>>
>>> On Saturday 01 July 2017 19:13:30 you wrote:
>>>> to break a decade and a half of software.
>>>
>>> Hm? What you mean with to break? Existing functions would still work,
>>> just there are also new functions under new names. Usage of old
>>> functions is just removed from documentation.
>>
>> Then I misunderstood your proposal, "rename" suggested to me that the
>> old ones disappear. In that case I'm not sure I see the benefit of
>> your proposal. Why would anyone want to use an interface that won't
>> work on perls older than 5.28, and could disappear in a future version
>> of perl (since that's the point of Internals::)? This isn't making
>> sense to me.
>
> You could support it with Devel::PPPort. It's a simple addition.
>
> However, the problem remains that if someone were to use these new
> functions without PPPort, their code would not work on older versions. I
> can't see a way around that.

These are perl functions (as documented in utf8.pm), not C functions,
Devel::PPPort does nothing for us.

The patch retains the old names, so that isn't an issue.

But it does deprecate the old names, which is an issue, I can't
imagine us removing these functions.

As a side note, the original thread refers to:

https://metacpan.org/source/SHAY/perl-5.24.1/cpan/Archive-Tar/lib/Archive/Tar.pm#L1501

which I could see as correct because of the way perl's unicode support
(fails to) deal with filenames.

Tony

@p5pRT
Author

p5pRT commented Jul 4, 2017

From @Grinnz

On Mon, Jul 3, 2017 at 8:38 PM, Tony Cook <tony@develop-help.com> wrote:

> As a side note, the original thread refers to:
>
> https://metacpan.org/source/SHAY/perl-5.24.1/cpan/Archive-Tar/lib/Archive/Tar.pm#L1501
>
> which I could see as correct because of the way perl's unicode support
> (fails to) deal with filenames.
>
> Tony

Not entirely correct IMO. If the intent is that filenames be encoded to
UTF-8, this will fail to encode downgraded names with non-ascii characters.
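
A minimal sketch of that failure mode follows. The helper names are hypothetical and this is not the Archive::Tar code. It compares the conditional pattern with unconditional encoding for the same name held in both internal representations:

```perl
use strict;
use warnings;

# Hypothetical helper reproducing the pattern under discussion:
# encode only when utf8::is_utf8() is true.
sub encode_if_flagged {
    my ($str) = @_;
    utf8::encode($str) if utf8::is_utf8($str);
    return $str;
}

# Hypothetical helper that always encodes, whatever the storage.
sub encode_always {
    my ($str) = @_;
    utf8::encode($str);
    return $str;
}

# Render a byte string as space-separated hex for comparison.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

my $downgraded = "caf\x{e9}";
my $upgraded   = "caf\x{e9}";
utf8::downgrade($downgraded);
utf8::upgrade($upgraded);

print "conditional, downgraded: ", hex_bytes(encode_if_flagged($downgraded)), "\n";
print "conditional, upgraded:   ", hex_bytes(encode_if_flagged($upgraded)), "\n";
print "always, downgraded:      ", hex_bytes(encode_always($downgraded)), "\n";
print "always, upgraded:        ", hex_bytes(encode_always($upgraded)), "\n";
```

The conditional helper leaves the downgraded name as the single byte e9, while the same logical name in upgraded form becomes c3 a9. Unconditional encoding yields c3 a9 in both cases.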

-Dan

@p5pRT
Author

p5pRT commented Jul 4, 2017

From @tonycoz

On Mon, Jul 03, 2017 at 09:35:06PM -0400, Dan Book wrote:

> On Mon, Jul 3, 2017 at 8:38 PM, Tony Cook <tony@develop-help.com> wrote:
>
>> As a side note, the original thread refers to:
>>
>> https://metacpan.org/source/SHAY/perl-5.24.1/cpan/Archive-Tar/lib/Archive/Tar.pm#L1501
>>
>> which I could see as correct because of the way perl's unicode support
>> (fails to) deal with filenames.
>>
>> Tony
>
> Not entirely correct IMO. If the intent is that filenames be encoded to
> UTF-8, this will fail to encode downgraded names with non-ascii characters.

If the caller creates a file using the name they pass in, encoding the
name (which might not be utf-8 marked) may make the later -e or -l
check fail.

Perl functions such as open and stat currently ignore the UTF-8
flag, which makes this pretty messy.

The code in Archive​::Tar seems a reasonable workaround to me, I don't
think the author had much choice.

Tony

@p5pRT
Author

p5pRT commented Jul 4, 2017

From @pali

On Monday 03 July 2017 21​:35​:06 Dan Book wrote​:

On Mon, Jul 3, 2017 at 8​:38 PM, Tony Cook <tony@​develop-help.com> wrote​:

As a side note, the original thread refers to​:

https://metacpan.org/source/SHAY/perl-5.24.1/cpan/Archive-Tar/lib/Archive/Tar.pm#L1501

which I could see as correct because of the way perl's unicode support
(fails to) deal with filenames.

Tony

Not entirely correct IMO. If the intent is that filenames be encoded to
UTF-8, this will fail to encode downgraded names with non-ascii characters.

-Dan

See bug​: https://rt.perl.org/Public/Bug/Display.html?id=130831

@p5pRT

p5pRT commented Jul 4, 2017

From @pali

On Tuesday 04 July 2017 10​:38​:26 Tony Cook wrote​:

But it does deprecate the old names, which is an issue, I can't
imagine us removing these functions.

The warning can be removed from the patch. It is just a question of what you decide.
The functions also stay, but we can instruct people via documentation
to use the new functions for new code. Again, it is a question of whether you call
it deprecation or aliasing. In any case the functions are not going to be
deleted, so in the end it does not matter for old code.

And for old code, this alias can be defined easily:

  *new_name = *old_name;
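
As a concrete sketch of such an alias (the package `My::Compat` and the name `is_upgraded` are hypothetical, chosen only for illustration; this variant aliases just the subroutine slot rather than the whole glob):

```perl
use strict;
use warnings;

package My::Compat;    # hypothetical package name

# Alias a clearer (hypothetical) name to the existing core function.
# utf8::is_utf8() is always available, even without "use utf8".
*is_upgraded = \&utf8::is_utf8;

package main;

my $str = "abc";
print My::Compat::is_upgraded($str) ? "upgraded\n" : "downgraded\n";
utf8::upgrade($str);
print My::Compat::is_upgraded($str) ? "upgraded\n" : "downgraded\n";
```

Both names then refer to the same code, so existing callers of the old name are unaffected.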

Reason for this patch series is​:
* document those utf8​:: functions
* allow developers to call those functions via non-cryptic names

@p5pRT

p5pRT commented Jul 4, 2017

From @demerphq

On 4 July 2017 at 09​:19, <pali@​cpan.org> wrote​:

On Tuesday 04 July 2017 10​:38​:26 Tony Cook wrote​:

But it does deprecate the old names, which is an issue, I can't
imagine us removing these functions.

Warning can be removed from patch. It is just question how you decide.
Also functions stay there, but we can instruct people via documentation
to use new functions for a new code... Again it is question if you call
it deprecation or aliasing. In any case functions are not going to be
deleted, so in final case it does not matter for old code.

And for old code can be defined this function easily​:

*new_name = *old_name;

Reason for this patch series is​:
* document those utf8​:: functions
* allow developers to call those functions via non-cryptic names

I don't mind adding new aliases for these functions, I object to your
proposal to put them in Internals however; I think that they should go
in 'scalar', which we decided at the last PerlQA is the designated
place for functions that operate on scalars.

scalar​::is_unicode_string()
scalar​::is_binary_string()

I don't like the wide-storage thing, (although I admit I think it
better than "is_utf8"), a latin1 string in utf8 does not use
wide-storage, and the unicode flag has significance beyond the storage
format; utf8-on strings get unicode semantics in case insensitive
operations.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

p5pRT commented Jul 4, 2017

From @pali

On Tuesday 04 July 2017 01​:52​:29 yves orton via RT wrote​:

On 4 July 2017 at 09​:19, <pali@​cpan.org> wrote​:

On Tuesday 04 July 2017 10​:38​:26 Tony Cook wrote​:

But it does deprecate the old names, which is an issue, I can't
imagine us removing these functions.

Warning can be removed from patch. It is just question how you decide.
Also functions stay there, but we can instruct people via documentation
to use new functions for a new code... Again it is question if you call
it deprecation or aliasing. In any case functions are not going to be
deleted, so in final case it does not matter for old code.

And for old code can be defined this function easily​:

*new_name = *old_name;

Reason for this patch series is​:
* document those utf8​:: functions
* allow developers to call those functions via non-cryptic names

I dont mind adding new aliases for these functions, I object to your
proposal to put them in Internals however; I think that they should go
in 'scalar', which we decided at the last PerlQA is the designated
place for functions that operate on scalars.

I proposed Internals, because that flag is internal for perl and
invisible for pure perl code. But if more people are happy with scalar
namespace, I'm fine with it.

scalar​::is_unicode_string()
scalar​::is_binary_string()

But this is wrong! SVf_UTF8 does not tell whether a scalar string is Unicode
or binary. It only tells the type of internal storage.

The name is_binary_string is misleading in the same way as the current name is_utf8.

If you say that a binary string is one with codes only in the range 0x00-0xFF,
then you can have such a binary string with the SVf_UTF8 flag set too, and your
function "is_binary_string" would return false for your binary
string. Such a name would lead to other problems.

I don't like the wide-storage thing, (although I admit i think it
better than "is_utf8"), a latin1 string in utf8 does not use
wide-storage,

Of course it can. Unicode code points 0x80 .. 0xFF (the Latin-1
extension of ASCII) take two bytes each when encoded in UTF-8 and
are therefore wide in UTF-8 too.
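
A small sketch to check the byte counts (`utf8_byte_length` is a hypothetical helper written for this example):

```perl
use strict;
use warnings;

# Return how many bytes a character string occupies once encoded as UTF-8.
sub utf8_byte_length {
    my ($chars) = @_;
    utf8::encode($chars);    # modifies the local copy only
    return length $chars;
}

printf "U+0041: %d byte(s)\n", utf8_byte_length("\x{41}");     # ASCII
printf "U+00E9: %d byte(s)\n", utf8_byte_length("\x{E9}");     # Latin-1 range
printf "U+0100: %d byte(s)\n", utf8_byte_length("\x{100}");    # beyond Latin-1
```

Only the ASCII character stays at one byte; both of the others take two.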

and the unicode flag has significance beyond the storage
format; utf8-on strings get unicode semantics in case insensitive
operations.

cheers,
Yves

@p5pRT

p5pRT commented Jul 4, 2017

From @demerphq

On 4 July 2017 at 11​:03, <pali@​cpan.org> wrote​:

On Tuesday 04 July 2017 01​:52​:29 yves orton via RT wrote​:

On 4 July 2017 at 09​:19, <pali@​cpan.org> wrote​:

On Tuesday 04 July 2017 10​:38​:26 Tony Cook wrote​:

But it does deprecate the old names, which is an issue, I can't
imagine us removing these functions.

Warning can be removed from patch. It is just question how you decide.
Also functions stay there, but we can instruct people via documentation
to use new functions for a new code... Again it is question if you call
it deprecation or aliasing. In any case functions are not going to be
deleted, so in final case it does not matter for old code.

And for old code can be defined this function easily​:

*new_name = *old_name;

Reason for this patch series is​:
* document those utf8​:: functions
* allow developers to call those functions via non-cryptic names

I dont mind adding new aliases for these functions, I object to your
proposal to put them in Internals however; I think that they should go
in 'scalar', which we decided at the last PerlQA is the designated
place for functions that operate on scalars.

I proposed Internals, because that flag is internal for perl and
invisible for pure perl code. But if more people are happy with scalar
namespace, I'm fine with it.

scalar​::is_unicode_string()
scalar​::is_binary_string()

But this is wrong! SVf_UTF8 does not tell if scalar string is unicode
or binary. It just tell type of internal storage.

No. This is a myth. Plain and simply a myth.

People have a hard time accepting it, but the utf8 flag tells parts of
the internals to use different rules for certain operations. When it is set,
those rules are Unicode. When the flag is not set, the default rules
are derived from ASCII.

You can see the difference in the following​:

"ba\x{DF}"=~/ss/i;

"ba\N{U+DF}"=~/ss/i;

The latter matches because \N{U+DF} produces the unicode code point
DF, and the former does not match, because \x{DF} produces the ASCII
octet DF instead. The former is an ASCII string, and the latter is a
Unicode string.
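
The same effect can be sketched without relying on how each escape is stored, by taking one string and changing only its internal representation with utf8::upgrade(). This assumes the code runs without `use feature 'unicode_strings'` and without `use v5.12` or later, so the pattern is compiled with the legacy default rules:

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" and no "use v5.12" or
# later here, so the regex below is compiled with the legacy default rules.

my $downgraded = "ba\x{DF}";
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);    # changes only the internal storage

my $same     = $downgraded eq $upgraded ? 1 : 0;
my $down_hit = $downgraded =~ /ss/i     ? 1 : 0;
my $up_hit   = $upgraded   =~ /ss/i     ? 1 : 0;

print "same characters:          $same\n";
print "downgraded matches /ss/i: $down_hit\n";
print "upgraded matches /ss/i:   $up_hit\n";
```

The two strings compare equal, yet only the upgraded one matches.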

Name is_binary_string is misleading in same way as current name is_utf8.

Erf, maybe. We need a term for "not-unicode", and "binary" is as good
as any. I don't mind other proposals.

If you say that binary string is one with codes only in range 0x00-0xFF
then you can have that binary string also with SVf_UTF8 flag and your
function name "is_binary_string" would return false for your binary
string. Such name would lead to another problems.

The SVf_UTF8 flag being off means the string should be treated as
ASCII when doing case-insensitive operations, and as binary for other
purposes, and that the data is encoded as a series of discrete octets.
It is not uncommon for people on this list to use the terms unicode
and binary for this reason.

I don't like the wide-storage thing, (although I admit i think it
better than "is_utf8"), a latin1 string in utf8 does not use
wide-storage,

Of course it can. Unicode code points 0x80 .. 0xFF (which are Latin1
extension from ASCII) contains two bytes when encoded in UTF-8 and
therefore are wide in UTF-8 too.

I spoke imprecisely, I should have said ASCII, not latin-1.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

p5pRT commented Jul 4, 2017

From @pali

On Tuesday 04 July 2017 11​:22​:42 demerphq wrote​:

No. This is a myth. Plain and simply a myth.

People have a hard time accepting it, but the utf8 flag tells parts of
the internals to use different rules for certain operations, when set
those rules are Unicode. When the flag is not set the default rules
are derived from ASCII.

You can see the difference in the following​:

"ba\x{DF}"=~/ss/i;

$ perl -E 'say "matched" if "ba\x{DF}"=~/ss/i;'
matched

"ba\N{U+DF}"=~/ss/i;

$ perl -E 'say "matched" if "ba\N{U+DF}"=~/ss/i;'
matched

The latter matches because \N{U+DF} produces the unicode code point
DF, and the former does not match, because \x{DF} produces the ASCII
octet DF instead. The former is an ASCII string, and the later is a
Unicode string.

No, both were matched under Perl 5.24.1.

@p5pRT

p5pRT commented Jul 4, 2017

From @demerphq

On 4 July 2017 at 12​:04, <pali@​cpan.org> wrote​:

On Tuesday 04 July 2017 11​:22​:42 demerphq wrote​:

No. This is a myth. Plain and simply a myth.

People have a hard time accepting it, but the utf8 flag tells parts of
the internals to use different rules for certain operations, when set
those rules are Unicode. When the flag is not set the default rules
are derived from ASCII.

You can see the difference in the following​:

"ba\x{DF}"=~/ss/i;

$ perl -E 'say "matched" if "ba\x{DF}"=~/ss/i;'
matched

"ba\N{U+DF}"=~/ss/i;

$ perl -E 'say "matched" if "ba\N{U+DF}"=~/ss/i;'
matched

-E is not -e.

-E is enabling a pragma which changes the default behavior.

However it is *PRAGMA*. It is NOT the normal behavior of Perl.

The latter matches because \N{U+DF} produces the unicode code point
DF, and the former does not match, because \x{DF} produces the ASCII
octet DF instead. The former is an ASCII string, and the later is a
Unicode string.

No, both were matched under Perl 5.24.1.

No, they did not. If \x{DF} magically started matching 'ss' it would
be a *MASSIVE* regression.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

p5pRT commented Jul 4, 2017

From @pali

On Tuesday 04 July 2017 03​:12​:19 yves orton via RT wrote​:

On 4 July 2017 at 12​:04, <pali@​cpan.org> wrote​:

On Tuesday 04 July 2017 11​:22​:42 demerphq wrote​:

No. This is a myth. Plain and simply a myth.

People have a hard time accepting it, but the utf8 flag tells parts of
the internals to use different rules for certain operations, when set
those rules are Unicode. When the flag is not set the default rules
are derived from ASCII.

You can see the difference in the following​:

"ba\x{DF}"=~/ss/i;

$ perl -E 'say "matched" if "ba\x{DF}"=~/ss/i;'
matched

"ba\N{U+DF}"=~/ss/i;

$ perl -E 'say "matched" if "ba\N{U+DF}"=~/ss/i;'
matched

-E is not -e.

-E is enabling a pragma which changes the default behavior.

However it is *PRAGMA*. It is NOT the normal behavior of Perl.

Ah, right. I forgot that -E enables feature unicode_strings which
basically means that both examples were equivalent.
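
A minimal sketch of that scoping effect, assuming perl 5.14 or later (where the feature also applies to regex matching). The same downgraded string is matched once outside and once inside the lexical scope of the feature:

```perl
use strict;
use warnings;

my $str = "ba\x{DF}";    # downgraded string

# Outside the feature's scope: legacy rules for downgraded strings.
my $legacy = $str =~ /ss/i ? 1 : 0;

my $with_feature;
{
    # One of the features that -E switches on for recent perls.
    use feature 'unicode_strings';
    $with_feature = $str =~ /ss/i ? 1 : 0;
}

print "without unicode_strings: $legacy\n";
print "with unicode_strings:    $with_feature\n";
```

Only the match compiled inside the block succeeds, which is why the -E one-liners behave differently from -e.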

Default behavior is a bit unpredictable as it is affected by the
infamous Unicode Bug.

my $str1 = "\x{DF}";      # downgraded storage
my $str2 = "\N{U+DF}";    # upgraded storage
my $str3 = "\x{100}";     # cannot be stored downgraded

"ba$str1" =~ /ss/i;       # no match
"ba$str2" =~ /ss/i;       # match

"ba$str1$str3" =~ /ss/i;  # match, because $str3 forces upgraded storage

To make it predictable, either the /aa or the /u modifier should always be
used. That prevents these problems:

"ba$str1" =~ /ss/aai;       # no match
"ba$str2" =~ /ss/aai;       # no match
"ba$str1$str3" =~ /ss/aai;  # no match

"ba$str1" =~ /ss/ui;        # match
"ba$str2" =~ /ss/ui;        # match
"ba$str1$str3" =~ /ss/ui;   # match

@p5pRT

p5pRT commented Jul 4, 2017

From @demerphq

On 4 July 2017 at 13​:14, <pali@​cpan.org> wrote​:

On Tuesday 04 July 2017 03​:12​:19 yves orton via RT wrote​:

On 4 July 2017 at 12​:04, <pali@​cpan.org> wrote​:

On Tuesday 04 July 2017 11​:22​:42 demerphq wrote​:

No. This is a myth. Plain and simply a myth.

People have a hard time accepting it, but the utf8 flag tells parts of
the internals to use different rules for certain operations, when set
those rules are Unicode. When the flag is not set the default rules
are derived from ASCII.

You can see the difference in the following​:

"ba\x{DF}"=~/ss/i;

$ perl -E 'say "matched" if "ba\x{DF}"=~/ss/i;'
matched

"ba\N{U+DF}"=~/ss/i;

$ perl -E 'say "matched" if "ba\N{U+DF}"=~/ss/i;'
matched

-E is not -e.

-E is enabling a pragma which changes the default behavior.

However it is *PRAGMA*. It is NOT the normal behavior of Perl.

Ah, right. I forgot that -E enables feature unicode_strings which
basically means that both examples were equivalent.

Default behavior is a bit unpredicable as it is affected by the
infamous Unicode Bug.

It is only unpredictable if your model of strings is broken. I happen
to be very familiar with the internals, and do not find the actual
rules to be that difficult to deal with.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

p5pRT commented Jul 4, 2017

From @pali

On Tuesday 04 July 2017 13​:32​:26 demerphq wrote​:

It is only unpredictable if your model of strings is broken.

I do not know what you mean by a broken model of strings, but once you
start receiving strings from other modules, user input, or any other
external source, and you start combining or concatenating those strings,
you will hit the Unicode Bug. The safe way is therefore to use the /aa or /u
modifier in regex matching, depending on how you want the matching done.

I happen
to be very familiar with the internals, and do not find the actual
rules to be that difficult to deal with.

I think this discussion has drifted from the original request, which is for better
documentation of utf8.pm and a better name for the utf8::is_utf8() function.

@p5pRT

p5pRT commented Jul 10, 2017

From @xsawyerx

On 07/04/2017 07​:38 AM, pali@​cpan.org wrote​:

On Tuesday 04 July 2017 13​:32​:26 demerphq wrote​:

It is only unpredictable if your model of strings is broken.
I do not know what you mean if model of strings is broken,

It is "broken" in that sense for probably more people than we would
like. Do we have any documentation that clarifies this entire issue?
(I know I trip on this frequently and never fully understood this issue
myself.)

[...]

I happen
to be very familiar with the internals, and do not find the actual
rules to be that difficult to deal with.
I think this discussion is out of original request, which is for better
documentation of utf8.pm and better name for utf8​::is_utf8() function.

Agree.

For now we seem to have three points we agree on:
* We want to document these functions
* We want to give them better names
* We want the old behavior to work

As long as the second clause does not break the third, I think we should
seek to move forward.

Yves mentioned that the "Internals" namespace is an undesirable place for it
(which was discussed at P5H, the last core hackathon) and I agree.
"scalar" was the most popular one, IIRC.

Does anyone have any comments on this? Tony, Dave, Zefram? *Karl*? :)

Thanks!

@p5pRT

p5pRT commented Jul 10, 2017

From zefram@fysh.org

demerphq wrote​:

People have a hard time accepting it, but the utf8 flag tells parts of
the internals to use different rules for certain operations,

Those are bugs. In some cases they are bugs that we've decided we can't
just fix because of backcompat, so we add a flag to enable non-buggy
semantics and the bug lives on as default behaviour.

If a flag to distinguish between character strings and binary strings
were an intentional semantic feature, we'd need some rules to say
how the flag is to be set by operations that generate string outputs.
We've never done that.

-zefram

@p5pRT

p5pRT commented Jul 10, 2017

From zefram@fysh.org

Sawyer X wrote​:

Does anyone have any comments on this? Tony, Dave, Zefram? *Karl*? :)

I didn't want to add to a mostly bikeshedding discussion, but OK.
I concur that the existing names are poor, but I'm not much happier with
the names that have been suggested on this thread. I reckon the best
terminology we have for this flag, at the user level, is "upgraded",
and so the name "is_utf8" would be better as "is_upgraded". The existing
names "upgrade" and "downgrade" for the transforming operations are OK,
and the only change I'd potentially like to make to them would be to add
something that explicates their rather unusual in-place side-effecting
nature.

In fact you can see all my preferred names in my CPAN module
Scalar​::String. This module essentially attempts to be the sane version
of utf8.pm, attempting to impart the right mental model through its
function names and documentation. (The "sclstr_" prefix on all the
function names may be omitted if desired; the important part of the name
is that which distinguishes these functions from each other.)

I think the names for these functions should be reasonably concise,
and in particular we should have a single-word adjective for "having
the SvUTF8 flag on" if possible. We should also try to reuse existing
terminology, rather than invent anything new. We should also avoid any
term that implies anything beyond the storage, such as any reference to
characters or Unicode, because such implications are largely inaccurate,
and anywhere they are accurate is a bug. All of this leads me to prefer
"upgraded" over "utf8", "unicode", "uses_wide_storage", and the like.

I don't have any strong opinion about which package any new names for
these functions should appear in. I think on balance we should not
remove the old names, because the trouble that arises from maintaining
them is small compared to the hassle that would arise from requiring
existing correct programs to change. Not removing them implies that
we wouldn't even be deprecating them, as currently defined, but we can
fairly discourage the use of the old names in documentation.

-zefram

@p5pRT

p5pRT commented Jul 11, 2017

From @khwilliamson

On 07/10/2017 02​:13 PM, Zefram wrote​:

Sawyer X wrote​:

Does anyone have any comments on this? Tony, Dave, Zefram? *Karl*? :)

I didn't want to add to a mostly bikeshedding discussion, but OK.
I concur that the existing names are poor, but I'm not much happier with
the names that have been suggested on this thread. I reckon the best
terminology we have for this flag, at the user level, is "upgraded",
and so the name "is_utf8" would be better as "is_upgraded". The existing
names "upgrade" and "downgrade" for the transforming operations are OK,
and the only change I'd potentially like to make to them would be to add
something that explicates their rather unusual in-place side-effecting
nature.

In fact you can see all my preferred names in my CPAN module
Scalar​::String. This module essentially attempts to be the sane version
of utf8.pm, attempting to impart the right mental model through its
function names and documentation. (The "sclstr_" prefix on all the
function names may be omitted if desired; the important part of the name
is that which distinguishes these functions from each other.)

I think the names for these functions should be reasonably concise,
and in particular we should have a single-word adjective for "having
the SvUTF8 flag on" if possible. We should also try to reuse existing
terminology, rather than invent anything new. We should also avoid any
term that implies anything beyond the storage, such as any reference to
characters or Unicode, because such implications are largely inaccurate,
and anywhere they are accurate is a bug. All of this leads me to prefer
"upgraded" over "utf8", "unicode", "uses_wide_storage", and the like.

I don't have any strong opinion about which package any new names for
these functions should appear in. I think on balance we should not
remove the old names, because the trouble that arises from maintaining
them is small compared to the hassle that would arise from requiring
existing correct programs to change. Not removing them implies that
we wouldn't even be deprecating them, as currently defined, but we can
fairly discourage the use of the old names in documentation.

-zefram

My view is that the current names could be improved, and that there
should be no technical nor social problem in creating new names while
retaining the old ones, but changing the docs to stress the new ones.
I've done that a lot.

I don't know what namespace is best. At first blush Internals seems
good to me, for this and other things that people currently have hacks
for, like

  $foo & ""

which tries to find out if $foo is a string or just a number. I don't
fully understand the objection to 'Internals'.
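
For readers who have not met that hack, a minimal sketch of how it works (`stored_as_number` is a hypothetical helper name; the result depends on perl's internal flags, and the trick stops working under `use feature 'bitwise'`, where "&" is always numeric):

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the hack: string "&" yields an empty
# string when both operands are plain strings, while numeric "&" yields
# a number, which stringifies to at least one digit.
sub stored_as_number {
    my ($value) = @_;
    no warnings 'numeric';    # "" is not numeric when "&" goes numeric
    return length($value & "") ? 1 : 0;
}

printf "42   -> %d\n", stored_as_number(42);
printf "'42' -> %d\n", stored_as_number("42");
```

The two calls give different answers for what most Perl code would treat as the same value, which is the kind of internals-peeking being discussed.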

I have never liked upgrade and downgrade. When you upgrade something
you are supposed to get something better, like more legroom. I have
never seen why a PV is better than a number, or a UTF-8 string better
than a non-UTF-8 one (it's far slower, for example, which is a downgrade in my
estimation). The use of upgrade and downgrade is jargon based on the
attitudes of the implementers, which should be avoided. Maybe it's too
baked in to change, but I regret that it's there. UTF-8 itself is an
implementation detail that should never have been exposed to the
outside, but 'use utf8' pretty much does that.

@p5pRT

p5pRT commented Jul 11, 2017

From @cpansprout

On Mon, 10 Jul 2017 19​:53​:42 -0700, public@​khwilliamson.com wrote​:

I don't know what namespace is best. At first blush Internals seems
good to me, for this and other things that people currently have hacks
for, like

$foo & ""

which trying to find out if $foo is a string or just a number. I don't
fully understand the objection to 'Internals'

Adding new public functions to the Internals namespace would completely change its meaning. It contains functions that exist mainly for perl’s own functionality (for built-in modules like Hash​::Util to use) and for testing perl itself. Users are not supposed to know about them. That the cat is out of the bag and we cannot remove them is unfortunate.

Since we already use ‘utf8’ to refer to Perl’s Unicode support, why not continue to use that namespace?

--

Father Chrysostomos

@p5pRT

p5pRT commented Jul 11, 2017

From @cpansprout

On Mon, 10 Jul 2017 19​:53​:42 -0700, public@​khwilliamson.com wrote​:

I have never liked upgrade and downgrade. When you upgrade something
you are supposed to get something better, like more legroom.

Well, er, that is exactly what you get. You can stretch your legs beyond CLV.*

I have
never seen why a PV is better than a number, or a UTF-8 string better
than a non-one (it's far slower, for example,

I think that is one of the best arguments in favour of ‘upgrade’. It is just like upgrading most commercial software!

which is a downgrade in my
estimation). The use of upgrade and downgrade is jargon based on the
attitudes of the implementers, which should be avoided. Maybe it's too
baked in to change, but I regret that it's there. UTF-8 itself is an
implementation detail that should never have been exposed to the
outside, but 'use utf8' pretty much does that.

* That is a Roman numeral.

--

Father Chrysostomos

@p5pRT

p5pRT commented Jul 12, 2017

From @Tux

On Tue, 11 Jul 2017 10​:41​:37 -0700, "Father Chrysostomos via RT"
<perlbug-followup@​perl.org> wrote​:

On Tue, 11 Jul 2017 00​:55​:51 -0700, davem wrote​:

On Mon, Jul 10, 2017 at 12​:45​:48PM -0400, Sawyer X wrote​:

Does anyone have any comments on this? Tony, Dave, Zefram? *Karl*? :)

My opinion on this sort of proposal (and it's an opinion which has gotten
stronger over time (*)) is rarely/never to add a new alias name to an
existing function.

Alias names just increase the cognitive load. If the old names were
confusing, having more names will just increase the confusion.

Before, you would have to remember that a particular function foo() is
badly named and doesn't do what you might expect it to do, based solely on
the name.

Afterwards, you have to remember that there are two functions foo() and
bar(), one is deprecated (which one?), one is badly named (which one?),
but they both do the same thing (Or do they? Sigh. Let's check the
documentation one more time).

Life is now harder.

(*) My opinion firmed over AvFILL(). It was a weird name, but I was used to
it. Now I can never remember what the new alias is called (just looked
it up - av_top_index()). In hindsight, I would have voted against adding
av_top_index.

I agree with everything you have said. I brought up the same
objection when this proposal was first put forward, but I thought I
had lost the debate. Well, at least there are two of us now. :-)

Count me in​: three. I like the way Dave has written down my feelings :)

--
H.Merijn Brand http​://tux.nl Perl Monger http​://amsterdam.pm.org/
using perl5.00307 .. 5.27 porting perl5 on HP-UX, AIX, and openSUSE
http​://mirrors.develooper.com/hpux/ http​://www.test-smoke.org/
http​://qa.perl.org http​://www.goldmark.org/jeff/stupid-disclaimers/

@p5pRT

p5pRT commented Jul 13, 2017

From @khwilliamson

On 07/12/2017 12​:36 AM, H.Merijn Brand wrote​:

On Tue, 11 Jul 2017 10​:41​:37 -0700, "Father Chrysostomos via RT"
<perlbug-followup@​perl.org> wrote​:

On Tue, 11 Jul 2017 00​:55​:51 -0700, davem wrote​:

On Mon, Jul 10, 2017 at 12​:45​:48PM -0400, Sawyer X wrote​:

Does anyone have any comments on this? Tony, Dave, Zefram? *Karl*? :)

My opinion on this sort of proposal (and it's an opinion which has gotten
stronger over time (*)) is rarely/never to add a new alias name to an
existing function.

Alias names just increase the cognitive load. If the old names were
confusing, having more names will just increase the confusion.

Before, you would have to remember that a particular function foo() is
badly named and doesn't do what you might expect it to do, based solely on
the name.

Afterwards, you have to remember that there are two functions foo() and
bar(), one is deprecated (which one?), one is badly named (which one?),
but they both do the same thing (Or do they? Sigh. Let's check the
documentation one more time).

Life is now harder.

(*) My opinion firmed over AvFILL(). It was a weird name, but I was used to
it. Now I can never remember what the new alias is called (just looked
it up - av_top_index()). In hindsight, I would have voted against adding
av_top_index.

I agree with everything you have said. I brought up the same
objection when this proposal was first put forward, but I thought I
had lost the debate. Well, at least there are two of us now. :-)

Count me in​: three. I like the way Dave has written down my feelings :)

I guess we have a fundamental disagreement about language design and the
direction Perl should go, which makes me sad.

The point of adding synonyms for deceptively-named functions and macros
is to make life easier overall. Forbidding new better-named synonyms
for problematically named things forces everyone who comes along to deal
with the gotchas and cognitive load that those people already here have
had to deal with. By creating better named things, those people can
largely avoid these problems. This allows them to work more
efficiently, avoiding traps, and with less cursing Perl.

Unless Perl is close to death, the number of people who are going to
come along before it does die dwarfs the number who are already expert.
Some people are knowledgeable in parts of Perl, but not all. They
also gain if gotchas get removed before they have to deal with them.

Specifically about av_top_index, I don't believe that it is so poorly
named that you have to keep consulting the documentation as to what it does.

It came about not because of AvFILL, but because of the already-existing
synonym, the evilly named "av_len". This name implies it gives a
length, but in fact it is off by one from that. av_top_index, though
cumbersome, accurately indicates what it returns.
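
The distinction can be sketched at the Perl level, where `$#array` is the counterpart of the top index and `scalar @array` is the element count (a Perl analogy only, not the C API itself):

```perl
use strict;
use warnings;

my @array = qw(a b c);

my $top_index = $#array;          # highest index, the value a "top index" call reports
my $count     = scalar @array;    # number of elements

print "top index: $top_index, count: $count\n";

# An empty array has top index -1, not 0.
my @empty;
print "empty top index: $#empty\n";
```

Code that treats the top index as a length is therefore always short by one element.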

Using av_len is a bug waiting to happen. It is a foreseeable problem.
I believe that it would be unethical to not create a non-deceptive
alternative. It's kind of like a safety recall.

Writing code using deceptively named things or with poor APIs is slower
and more error prone. Every time you use one, you have to get out of
your mental pipeline and recall that this is a gotcha, and have to
figure out how it is a gotcha and how you have to compensate. You are
effectively flushing your mental instruction cache. In the case of
av_len, you have to remember which way is the off-by-one problem here.

Code reviews also are affected. It is just too easy to read the thing
and forget that it doesn't do what you would want.

In researching the issue back when av_top_index was created, I found
published modules that used av_len, as its name implies, as a length.
Others undoubtedly had caught the problem earlier, say through their
unit testing.

But all this could be avoided by the code using a non-deceptive name.
Hopefully, the coder won't even be aware that there exist deceptive ones
for hysterical reasons.

It is foreseeable that av_len is going to cause problems. It would be
irresponsible of us to not create a non-deceptive synonym when it is so
easy to do.

No one was really happy with "av_top_index" as a name. So AvFILL was
retained in the core. All occurrences of av_len were removed. If we
could have come up with a short, pithy synonym, we would have replaced
AvFILL as well, and then people looking at the core would have seen that
and gotten used to it, and over time the memory of the less well-named
versions would have faded.

Writing good APIs is hard. I have flattered myself at times into
thinking I'm good at it. Maybe I am actually good, but if so, I'm still
not good enough. And few, if any, are. If we have a poor API in some
area, we should not tie our hands and say tough to all those people who
come along later, and give them more reason to use some other language.

@p5pRT

p5pRT commented Jul 13, 2017

From @Tux

On Wed, 12 Jul 2017 22​:53​:57 -0600, Karl Williamson
<public@​khwilliamson.com> wrote​:

It came about not because of AvFILL, but because of the already-existing
synonym, the evilly named "av_len". This name implies it gives a
length, but in fact it is one-off from that. av_top_index, though
cumbersome, accurately indicates what it returns.

The problem with av_top_index is that it has not (yet) been ported to
Devel​::PPPort, so I cannot change any XS code that uses av_len into
using the new function if that XS is to support 5.16.0 or older

$ ack av_top_index
ppport.h
1225​:av_top_index||5.017009|
$

I know I didn't quote all of your message and I understand your
motivation, but the problem for these misnamed functions is much wider
than the scope of av_top_index, which is *only* available to XS, and XS
is more or less easy to fix by adding stuff to Devel​::PPPort

For the utf8 functions, the scope is WAY wider​: it is used from
pure-perl, and renaming them (with or without aliases) would cause
major brain damage for all authors that use these functions (correct or
incorrect) when their code has to work on a wide range of perl versions.
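
To make that burden concrete, here is a sketch of the fallback such code would need. The name `utf8::is_upgraded` is purely hypothetical and stands in for whatever new name might be chosen; no such core function is assumed to exist:

```perl
use strict;
use warnings;

# "utf8::is_upgraded" is a hypothetical new name used only for this
# sketch. Code that must run on both old and new perls would need a
# fallback like this, picking whichever name is defined.
my $is_upgraded = defined(&utf8::is_upgraded)
    ? \&utf8::is_upgraded
    : \&utf8::is_utf8;

my $str = "abc";
utf8::upgrade($str);
print $is_upgraded->($str) ? "upgraded\n" : "not upgraded\n";
```

Every module supporting a wide range of perl versions would have to carry something like this, or simply keep using the old name.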

To be honest, I do not see an easy way out of that dilemma. If you have
one, I'm open to change for the better.

--
H.Merijn Brand http​://tux.nl Perl Monger http​://amsterdam.pm.org/
using perl5.00307 .. 5.27 porting perl5 on HP-UX, AIX, and openSUSE
http​://mirrors.develooper.com/hpux/ http​://www.test-smoke.org/
http​://qa.perl.org http​://www.goldmark.org/jeff/stupid-disclaimers/

@p5pRT

p5pRT commented Jul 13, 2017

From @pali

On Wednesday 12 July 2017 23​:44​:39 H. Merijn Brand via RT wrote​:

On Wed, 12 Jul 2017 22​:53​:57 -0600, Karl Williamson
<public@​khwilliamson.com> wrote​:

It came about not because of AvFILL, but because of the already-existing
synonym, the evilly named "av_len". This name implies it gives a
length, but in fact it is one-off from that. av_top_index, though
cumbersome, accurately indicates what it returns.

The problem with av_top_index is that it has not (yet) been ported to
Devel::PPPort,

Devel::PPPort is probably unmaintained... It has had a couple of bugs
open since 2015 without any comments, and pull requests have not been
processed since 2016, not even security-related ones like this:
Dual-Life/Devel-PPPort#47

Because of those problems, I have no motivation to prepare any other
patch for Devel::PPPort. For dead/unmaintained modules it is useless.

so I cannot change any XS code that uses av_len into
using the new function if that XS is to support 5.16.0 or older

$ ack av_top_index
ppport.h
1225:av_top_index||5.017009|
$

I know I didn't quote all of your message and I understand your
motivation, but the problem for these misnamed functions is much wider
than the scope of av_top_index, which is *only* available to XS, and XS
is more or less easy to fix by adding stuff to Devel​::PPPort

For the utf8 functions, the scope is WAY wider​: it is used from
pure-perl, and renaming them (with or without aliases) would cause
major brain damage for all authors that use these functions (correct or
incorrect) when their code has to work on a wide range of perl versions.

To be honest, I do not see an easy way out of that dilemma. If you have
one, I'm open to change for the better.

The problem is that people very often use the construct I showed in the
first comment. Or they read "is_utf8" as meaning that the string is UTF-8
encoded and conclude that they need to call utf8::decode() on it.

All this happens just because of the wrong name: many people deduce
from it what the function should do, which leads them to *not*
read the documentation...
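To illustrate why the construct from the first comment misbehaves, here is a minimal sketch. The helper names broken_to_bytes and hex_of are made up for this example; only the utf8::* calls are real. It takes one logical string in two internal storage forms and shows that the flag-guarded encode gives different bytes for each.

```perl
use strict;
use warnings;

# One logical string (a single character, U+00E9) in two storage forms.
my $narrow = "\xE9";
my $wide   = "\xE9";
utf8::upgrade($wide);    # changes internal storage only, not the value

# The pattern under discussion: encode only when the flag is set.
# (Hypothetical helper name, for illustration.)
sub broken_to_bytes {
    my ($value) = @_;
    utf8::encode($value) if utf8::is_utf8($value);
    return $value;
}

# Render a string as space-separated hex ordinals.
sub hex_of {
    return join ' ', map { sprintf '%02X', ord } split //, $_[0];
}

printf "equal as strings: %s\n", ( $narrow eq $wide ? 'yes' : 'no' );
printf "narrow -> %s\n", hex_of( broken_to_bytes($narrow) );
printf "wide   -> %s\n", hex_of( broken_to_bytes($wide) );
```

The two inputs compare equal, yet only the upgraded one comes out as UTF-8 bytes, so the result depends on an internal detail the caller cannot see.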

If we do not add better aliases, broken code will keep being
produced on CPAN.

As utf8::is_utf8() is not needed too often, backward compatibility can
be achieved by:

  *NEW_NAME = *utf8::is_utf8;

I think this is a good compromise. If you think that the upgrade and
downgrade function names are fine, OK, but at least please add a better
name for is_utf8(). In the original email I suggested is_upgraded(), so
the name would be tied to the "upgrade()" function, because it really
checks whether upgrade() was called or not.
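As a sketch of how a module could provide such an alias for itself on any perl, assuming a hypothetical package My::StringInternals and the suggested name is_upgraded (neither is an existing API):

```perl
use strict;
use warnings;

# Hypothetical package and function name, for illustration only.
package My::StringInternals;

# Assigning a code reference to the glob makes the new name call the
# very same built-in sub as utf8::is_utf8.
BEGIN { *is_upgraded = \&utf8::is_utf8 }

package main;

my $s = "abc";
print My::StringInternals::is_upgraded($s) ? "upgraded\n" : "not upgraded\n";

utf8::upgrade($s);
print My::StringInternals::is_upgraded($s) ? "upgraded\n" : "not upgraded\n";
```

This uses a code-reference assignment rather than the whole-glob assignment shown above; for a sub-only alias the effect on callers should be the same.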

@p5pRT

p5pRT commented Jul 14, 2017

From @cpansprout

On Wed, 12 Jul 2017 21​:55​:03 -0700, public@​khwilliamson.com wrote​:

I guess we have a fundamental disagreement about language design and
the
direction Perl should go, which makes me sad.

I agree the disagreement is unfortunate.

The point of adding synonyms for deceptively-named functions and
macros
is to make life easier overall. Forbidding new better-named synonyms
for problematically named things forces everyone who comes along to
deal
with the gotchas and cognitive load that those people already here
have
had to deal with. By creating better named things, those people can
largely avoid these problems. This allows them to work more
efficiently, avoiding traps, and with less cursing Perl.

When you first put forward this argument (specifically with regard to av_len), it made sense to me, and I had no objection to it. Later, people wrote to p5p complaining that the new situation was more confusing; in addition, *I* started to get confused. That was when I started to have second thoughts.

I think Damian Conway was right when he wrote in PBP that one should not use English (the module). Since other people use punctuation variables, you are going to have to learn them anyway. Using the English names just forces others reading your code to look up the names that you are using. It just creates more cognitive burden.

I think the same applies even to poorly named functions. You just have to learn the gotcha once, and then you can use the function and read code that uses it. (And if you use functions without reading either the documentation or the source, then you are coming close to what I would call autopodotoxy.)

Unless Perl is close to death, the number of people who are going to
come along before it does die dwarfs the number who are already
expert.
Some people are knowledgeable in parts of Perl, but not all. They
also gain if gotchas get removed before they have to deal with them.

But the gotchas never get removed. You just end up with a larger pile of functions for people to sift through. Not to mention a lot of existing (and correct) code that they cannot read without learning the discouraged parlance. So they have to learn the different forms anyway.

My personal experience is that what you are arguing for, while it sounds good, does not work in practice.

--

Father Chrysostomos

@p5pRT

p5pRT commented Jul 14, 2017

From @demerphq

On 14 July 2017 at 04​:28, Father Chrysostomos via RT
<perlbug-followup@​perl.org> wrote​:

My personal experience is that what you are arguing for, while it sounds good, does not work in practice.

I think the reason that it sounds good is because it does make sense
at a micro level. If you are working on company code for instance, or
a small code-base, renaming poorly named things means that the old
name is *gone*, and cognitive burden is reduced.

But with something like Perl we can't just get rid of things, if we
want to rename we have to do something for all the older code out
there. So we have to support both in some ways. Which means the
cognitive burden is increased.

Despite this I think sometimes these things *can* be justified and
managed, but we have to be extremely careful about the choices we
make, and have real plans in place to deprecate the older use cases in
some kind of way. So for instance, if we were going to get rid of
Internals, then we could rename the things it contained and bundle an
Internals.pm which does the right thing; people needing back-compat
could add 'use Internals' and get it. So I could see us
considering the ideas in this thread in the context of the proposed
introduction of 'array', 'scalar', etc.

yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

p5pRT commented Jul 17, 2017

From @xsawyerx

[Top-posted]

I have mixed thoughts about this.

I'm sympathetic to both considerations​: Having properly-named functions
to reduce confusion for future developers (we hope to have some, right?)
but not introduce additional cognitive load for existing developers.

A few ways to make such a situation easier​:

* Document utf8​::is_utf8() to prevent this confusion​: This is by far the
first thing that should be done. I have double checked the wording for
utf8​::is_utf8() from my blead (978b185)​:

  (Since Perl 5.8.1) Test whether $string is marked internally as
  encoded in UTF-8. Functionally the same as "Encode​::is_utf8()".

This is confusing, to say the least. "Marked internally" are the words
core hackers are looking for and recognize, but "UTF-8" is what non-core
hackers (those without the cognitive bias in core terms) see and
understand. If we head over to Encode::is_utf8() we see:

  [INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/.
  If /CHECK/ is true, also checks whether /STRING/ contains
  well-formed UTF-8. Returns true if successful, false otherwise.

  As of Perl 5.8.1, utf8 <https://metacpan.org/pod/utf8> also has the
  |utf8​::is_utf8| function.

I like this wording better for several reasons​: It is under the title
"Messing with Perl's Internals"; it notes the "UTF8" flag, and it adds
that it checks for well-formed UTF-8 only if that flag is true. There
are improvements to be made here too. We can note what the flag means
(subtle, complicated, bike-shed-able) or at the very least add a nice
"this isn't the flag you're looking for" warning. We can also suggest
when to use and when not to use the function (otherwise it's left to the
reader, who can easily get it wrong, which is why we're here).

If the documentation for both were better, then we could possibly have
left this as an unfortunate naming error we're carrying with us (along
with "wantarray" for noting whether the context is scalar, list, or void).

* Provide different functions and document all functions in all other
functions

If we decide to have better named functions, we will have additional
cognitive load for both experienced core developers and new developers.
For core developers, it is a muscle memory to undo and two different
sets of code to deal with - those with the old name and those with the
new name. For new developers, it will be simple at first, until you come
in contact with the old name. It is likely this will also happen early,
so you need to learn two names anyway. However, their muscle memory will
be geared towards using a more descriptive name.

I mix those when it comes to English.pm. I use $_, $@, $!, $#, $/, $^X,
$0 and a few more, but I use English.pm for $<, $>, $(, $), $", and a
few more. The reasoning is simple: $_, $@, $!, and $# are so common it
will be built into every muscle memory. On the other hand, for many
developers, if they see $<, they will need to look it up in perlvar
anyway. However, $UID or $REAL_USER_ID is readable right away and no
need to look it up.

One additional point about English is that, unlike what we're suggesting
here, the punctuation variable names are the right names, they're just
not descriptive. is_utf8() is not merely non-descriptive, it is
misleading. It is a misnomer, which makes it an undesired pitfall.

I see value in adding proper names, but then we would need to take care
of at least making all possible names available in the documentation of
all other names. If you're reading utf8.pm, you need to find
"is_upgraded" in "is_utf8" and "is_utf8" in "is_upgraded"[1]. This makes
it easy to quickly find what they mean and differentiate when we see
different names.

* Move all known usages in core to new functions

Another way to improve this new cognitive load is by reducing it in the
codebase. Removing as many instances of the old name will reduce the
mixture of names, thus helping us move towards the new name. This is a
much more intrusive change but has a high potential of helping seasoned
developers to deal with the new name.

* Automated policies for improving CPAN code quality

This is beyond the scope of core, but I think it's worthwhile taking
into account the perspective of the community. Recognizing that
"is_utf8" is misused raises the question of whether and how we could
reduce this problem's scope outside the core. This could be done
with a kwalitee check (CPANTS[2]) that checks for "is_utf8" and
recommends reviewing its use. This is far more complicated since there
is a legitimate (but narrow) use for it, and you might get false
positives. I believe only a human could find the situations in which
it's valuable.

Still, it is worthwhile keeping in mind.
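A rough sketch of what such an automated check could look like follows. It is not an existing CPANTS metric; the function name find_is_utf8_calls is hypothetical, and the matching is deliberately naive (a plain regex, no real parsing), so comments, POD and string literals can produce exactly the false positives mentioned above.

```perl
use strict;
use warnings;

# Hypothetical helper: return [line_number, text] for every line that
# appears to call utf8::is_utf8 or Encode::is_utf8.
sub find_is_utf8_calls {
    my ($source) = @_;
    my @hits;
    my $line_no = 0;
    for my $line ( split /\n/, $source ) {
        $line_no++;
        push @hits, [ $line_no, $line ]
            if $line =~ /\b(?:utf8|Encode)::is_utf8\s*\(/;
    }
    return @hits;
}

# Sample input: the construct from the first comment of this ticket.
my $sample = <<'END';
my $value = func();
if (utf8::is_utf8($value)) {
    utf8::encode($value);
}
END

printf "line %d: %s\n", @$_ for find_is_utf8_calls($sample);
```

A check like this can only point a human at the call site; deciding whether the use is one of the narrow legitimate ones still needs review.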

Overall, I'm still undecided. Maybe we could start with improving the
existing documentation?

[1] Using "is_upgraded" as an example different name.
[2] http​://cpants.cpanauthors.org/
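To make the point about what the flag does and does not mean concrete, here is a small sketch (variable names are illustrative only). It shows that bytes which really are UTF-8 encoded report the flag as off, while a pure-ASCII string can report it as on.

```perl
use strict;
use warnings;

# Two bytes that happen to be the UTF-8 encoding of U+00E9.
my $bytes = "\xC3\xA9";

# The same data decoded in place into a single character.
my $chars = $bytes;
utf8::decode($chars);

# Pure ASCII, forced into the upgraded internal form.
my $ascii = "abc";
utf8::upgrade($ascii);

for my $case (
    [ 'UTF-8 encoded bytes', $bytes ],
    [ 'decoded character',   $chars ],
    [ 'upgraded ASCII',      $ascii ],
) {
    my ( $label, $string ) = @$case;
    printf "%-20s length=%d flag=%s\n",
        $label, length($string), ( utf8::is_utf8($string) ? 'on' : 'off' );
}
```

So the flag describes how perl stores the string, not whether the content is "in UTF-8" in the sense a non-core reader would expect.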


@p5pRT

p5pRT commented Jul 17, 2017

From @hvds

On Mon, 17 Jul 2017 01​:47​:32 -0700, xsawyerx@​gmail.com wrote​:

I have mixed thoughts about this.

Me too.

If we decide to have better named functions [...]
For new developers, it will be simple at first, until you come
in contact with the old name. It is likely this will also happen early,
so you need to learn two names anyway. [...]

I just want to reach a bit deeper into "likely this will also happen early" - in my limited experience, most people working in a perl shop tend to read lots of code in their local codebase, but rarely read code outside of it (not even for the CPAN modules they're using). So wherever the local codebase gets cleaned up, it may not happen that early for a good proportion of new developers.

Maybe you had in mind primarily historical threads googled up from perlmonks, stackoverflow and the like; I agree those are less likely to get cleaned up. I cannot guess what proportion of newish developers would hit those; I can't even guess what proportion of me would hit those had I been born 30 years later.

Hugo

@p5pRT

p5pRT commented Jul 18, 2017

From @tonycoz

On Mon, Jul 17, 2017 at 10​:46​:59AM +0200, Sawyer X wrote​:

[Top-posted]

I have mixed thoughts about this.

I'm sympathetic to both considerations​: Having properly-named functions
to reduce confusion for future developers (we hope to have some, right?)
but not introduce additional cognitive load for existing developers.

A few ways to make such a situation easier​:

* Document utf8​::is_utf8() to prevent this confusion​: This is by far the
first thing that should be done. I have double checked the wording for
utf8​::is_utf8() from my blead (978b185)​:

  (Since Perl 5.8.1) Test whether $string is marked internally as
  encoded in UTF-8. Functionally the same as "Encode::is_utf8()".

This is confusing, to say the least. "Marked internally" is the words
core hackers are looking for and recognize, but "UTF-8" is what non-core
hackers (those without the cognitive bias in core terms) see and
understand. If we head over to Encode​::is_utf8() we see​:

  [INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/.
  If /CHECK/ is true, also checks whether /STRING/ contains
  well-formed UTF-8. Returns true if successful, false otherwise.

  As of Perl 5.8.1, utf8 <https://metacpan.org/pod/utf8> also has the
  |utf8::is_utf8| function.

I like this wording better for several reasons​: It is under the title
"Messing with Perl's Internals"; it notes the "UTF8" flag, and it adds
that it checks for well-formed UTF-8 only if that flag is true. There
are improvements to be made here too. We can note what the flag means
(subtle, complicated, bike-shed-able) or at the very least add a nice
"this isn't the flag you're looking for" warning. We can also suggest
when to use and when not to use the function (otherwise it's left to the
reader, who can easily get it wrong, which is why we're here).

utf8​::is_utf8() doesn't accept the second parameter and does no
validity checks (we have utf8​::valid() for that), despite the note in
utf8.pm.

If the document on both was better, then we could have possibly left
this as unfortunate naming errors we're carrying with us (along with
"wantarray" for noting whether the context is scalar, list, or void).
...
Overall, I'm still undecided. Maybe we could start with improving the
existing documentation?

Perhaps something like​:

=item * C<$flag = utf8​::is_utf8($string)>

(Since Perl 5.8.1) Test whether I<$string> is marked internally as
encoded in UTF-8. Functionally the same as C<Encode​::is_utf8($string)>.
Typically only necessary for debugging.

If you need to force Unicode semantics for code that needs to be
compatible with perls older than 5.12, call C<utf8​::upgrade($string)>
unconditionally.

Using this flag to decide whether a string should be treated as
already encoded bytes or characters is wrong; this should be decided
as part of the interface of your function.

If you're accepting bytes:

  utf8::downgrade($string); # throws an exception if code point over 0xFF

  utf8::downgrade($string, 1) # our own error handling
    or die "\$string must be representable as bytes";

or if you're accepting characters and need encoded bytes:

  utf8::encode($string); # unconditionally

The only exception is if you're dealing with filenames, since perl
uses the internal representation of the string for system calls.

<<

Are there any other cases someone might be tempted to call
utf8​::is_utf8()?

Tony
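As a sketch of what "decided as part of the interface" could look like in practice, assuming two hypothetical functions takes_bytes and takes_characters (names invented for this example), the snippet below applies the proposed calls and checks that both internal storage forms of the same string give the same answer:

```perl
use strict;
use warnings;

# Hypothetical function whose documented interface is "a byte string".
sub takes_bytes {
    my ($string) = @_;
    utf8::downgrade( $string, 1 )    # second argument: fail quietly
        or die "\$string must be representable as bytes\n";
    return length $string;           # number of bytes
}

# Hypothetical function whose documented interface is "a character
# string"; it returns the UTF-8 encoded bytes.
sub takes_characters {
    my ($string) = @_;
    utf8::encode($string);           # unconditionally
    return $string;
}

my $e_acute  = "\xE9";
my $upgraded = $e_acute;
utf8::upgrade($upgraded);            # same value, different storage

printf "byte lengths: %d %d\n",
    takes_bytes($e_acute), takes_bytes($upgraded);
printf "same encoding: %s\n",
    ( takes_characters($e_acute) eq takes_characters($upgraded) ? 'yes' : 'no' );

# A code point above 0xFF cannot be a byte string.
eval { takes_bytes("\x{263A}"); 1 } or print "rejected: $@";
```

Neither function consults utf8::is_utf8(); the caller's contract, not the flag, determines how the argument is interpreted.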

@p5pRT

p5pRT commented Jul 18, 2017

From @Tux

On Tue, 18 Jul 2017 10​:53​:53 +1000, Tony Cook <tony@​develop-help.com>
wrote​:

Are there any other cases someone might be tempted to call
utf8​::is_utf8()?

Tony

I like this. What I miss here is a small example of how to guarantee
preventing double encoding/decoding, as I think that is what this
function is most often (erroneously) used for.

--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.27 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

@p5pRT

p5pRT commented Jul 18, 2017

From @Grinnz

On Tue, Jul 18, 2017 at 3​:04 AM, H.Merijn Brand <h.m.brand@​xs4all.nl> wrote​:

I like this. What I miss here is a small example of how to guarantee
preventing double encoding/decoding, as I think that is what this
function is most often (erroneously) used for.

This isn't something that you can guarantee. It always depends on knowing
how you get your input. When people don't understand this they look for the
magic bullet that is_utf8 appears to be, but it is not.
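A minimal sketch of the "know how you get your input" approach: decode once where bytes enter, encode once where they leave, and note that the flag cannot detect a double encoding afterwards. The sample data and variable names are assumptions made for this example.

```perl
use strict;
use warnings;

# Assumption: this is what was read from a source known to be UTF-8.
my $input_bytes = "caf\xC3\xA9";

# Decode exactly once, at the input boundary.
my $text = $input_bytes;
utf8::decode($text) or die "input was not valid UTF-8\n";

# Encode exactly once, at the output boundary.
my $output = $text;
utf8::encode($output);

# Losing track of what a variable holds: encoding the bytes again.
my $double = $output;
utf8::encode($double);

printf "characters: %d\n",                 length $text;
printf "round trip intact: %s\n",          ( $output eq $input_bytes ? 'yes' : 'no' );
printf "double-encoded length: %d\n",      length $double;
printf "flag on output / double: %s / %s\n",
    ( utf8::is_utf8($output) ? 'on' : 'off' ),
    ( utf8::is_utf8($double) ? 'on' : 'off' );
```

Both the correct output and the double-encoded one are plain byte strings with the flag off, so only the program's own bookkeeping can tell them apart.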

@p5pRT

p5pRT commented Jul 18, 2017

From @Tux

On Tue, 18 Jul 2017 03​:13​:40 -0400, Dan Book <grinnz@​gmail.com> wrote​:

On Tue, Jul 18, 2017 at 3​:04 AM, H.Merijn Brand <h.m.brand@​xs4all.nl> wrote​:

I like this. What I miss here is a small example of how to guarantee
preventing double encoding/decoding, as I think that is what this
function is most often (erroneously) used for.

This isn't something that you can guarantee. It always depends on knowing
how you get your input. When people don't understand this they look for the
magic bullet that is_utf8 appears to be, but it is not.

My point exactly. Just have a piece of text that tells the user why it
isn't and what the best alternative *could* be.

--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.27 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

@p5pRT

p5pRT commented Jul 18, 2017

From @xsawyerx

On 07/17/2017 10​:09 PM, Hugo van der Sanden via RT wrote​:

On Mon, 17 Jul 2017 01​:47​:32 -0700, xsawyerx@​gmail.com wrote​:

I have mixed thoughts about this.

Me too.

If we decide to have better named functions [...]
For new developers, it will be simple at first, until you come
in contact with the old name. It is likely this will also happen early,
so you need to learn two names anyway. [...]

I just want to reach a bit deeper into "likely this will also happen early" - in my limited experience, most people working in a perl shop tend to read lots of code in their local codebase, but rarely read code outside of it (not even for the CPAN modules they're using). So wherever the local codebase gets cleaned up, it may not happen that early for a good proportion of new developers.

Maybe you had in mind primarily historical threads googled up from perlmonks, stackoverflow and the like; I agree those are less likely to get cleaned up. I cannot guess what proportion of newish developers would hit those; I can't even guess what proportion of me would hit those had I been born 30 years later.

I meant people who will start hacking on Perl core.

@p5pRT

p5pRT commented Jul 19, 2017

From @tonycoz

On Tue, Jul 18, 2017 at 10​:53​:53AM +1000, Tony Cook wrote​:

Perhaps something like​:

=item * C<$flag = utf8​::is_utf8($string)>

(Since Perl 5.8.1) Test whether I<$string> is marked internally as
encoded in UTF-8. Functionally the same as C<Encode​::is_utf8($string)>.
Typically only necessary for debugging.

If you need to force Unicode semantics for code that needs to be
compatible with perls older than 5.12, call C<utf8​::upgrade($string)>
unconditionally.

Using this flag to decide whether a string should be treated as
already encoded bytes or characters is wrong, this should be decided
as part of the interface of your function.

If you're accepting bytes​:

  utf8::downgrade($string); # throws an exception if code point over 0xFF

  utf8::downgrade($string, 1) # our own error handling
    or die "\$string must be representable as bytes";

or if you're accepting characters and need encoded bytes​:

utf8​::encode($string); # unconditionally

The only exception is if you're dealing with filenames, since perl
uses the internal representation of the string for system calls.

<<
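
A small sketch of why the flag cannot be used for that decision, and of what "decide it in the interface" could look like. The helper names chars_to_utf8_bytes() and require_bytes() are hypothetical, invented for this example only:

```perl
use strict;
use warnings;

# The same single character (U+00E9) stored two different ways.
my $native   = "\xE9";
my $upgraded = "\xE9";
utf8::upgrade($upgraded);    # changes the internal storage only

# The strings compare equal, yet the internal flag differs, so the flag
# says nothing about "characters versus already-encoded bytes".
my $equal         = $native eq $upgraded     ? 1 : 0;
my $native_flag   = utf8::is_utf8($native)   ? 1 : 0;
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;

# Hypothetical helper: the interface promises "characters in, UTF-8 bytes out".
sub chars_to_utf8_bytes {
    my ($chars) = @_;        # a copy, so the caller's string is untouched
    utf8::encode($chars);    # unconditionally, no is_utf8() test
    return $chars;
}

# Hypothetical helper: the interface promises "bytes in, bytes out".
sub require_bytes {
    my ($bytes) = @_;
    utf8::downgrade($bytes, 1)
        or die "argument must be representable as bytes\n";
    return $bytes;
}

printf "equal=%d native_flag=%d upgraded_flag=%d\n",
    $equal, $native_flag, $upgraded_flag;
```

Both storage forms of the string give the same bytes from chars_to_utf8_bytes(), which is the point: the result depends on the characters, not on the flag.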

Are there any other cases someone might be tempted to call
utf8​::is_utf8()?

Thinking about it further, I'm pretty sure this doesn't all belong
here.

L<perlunifaq/What is "the UTF8 flag"?> provides a good description of
the flag is_utf8() returns, and the whole of perlunifaq covers some of
the things the above tries to cover.

perlunicook largely works at a higher level than the functions in
utf8​::* work at.

One thing from the above that doesn't seem to be discussed well[1] is
what I tried to cover briefly in​:

Using this flag to decide whether a string should be treated as
already encoded bytes or characters is wrong, this should be decided
as part of the interface of your function.

which could perhaps use some expansion in perlunicode.

I'm not sure where the cheat sheet following belongs, though
perlunifaq covers some of it (though using Encode instead of utf8​::*).

Tony

[1] perlunifaq briefly mentions some of the issues under "What about
binary data, like image?" and more detail in "What if I don't decode?"

@p5pRT
Author

p5pRT commented Jul 19, 2017

From @xsawyerx

On 07/19/2017 08​:58 AM, Tony Cook wrote​:

On Tue, Jul 18, 2017 at 10​:53​:53AM +1000, Tony Cook wrote​:

On Mon, Jul 17, 2017 at 10​:46​:59AM +0200, Sawyer X wrote​:

[Top-posted]

I have mixed thoughts about this.

I'm sympathetic to both considerations​: Having properly-named functions
to reduce confusion for future developers (we hope to have some, right?)
but not introduce additional cognitive load for existing developers.

A few ways to make such a situation easier​:

* Document utf8​::is_utf8() to prevent this confusion​: This is by far the
first thing that should be done. I have double checked the wording for
utf8​::is_utf8() from my blead (978b185)​:

    (Since Perl 5.8.1) Test whether $string is marked internally as
    encoded in UTF-8. Functionally the same as "Encode::is_utf8()".

This is confusing, to say the least. "Marked internally" are the words
core hackers are looking for and recognize, but "UTF-8" is what non-core
hackers (those without the cognitive bias in core terms) see and
understand. If we head over to Encode​::is_utf8() we see​:

    [INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/.
    If /CHECK/ is true, also checks whether /STRING/ contains
    well-formed UTF-8. Returns true if successful, false otherwise.

    As of Perl 5.8.1, utf8 <https://metacpan.org/pod/utf8> also has the
    |utf8::is_utf8| function.

I like this wording better for several reasons​: It is under the title
"Messing with Perl's Internals"; it notes the "UTF8" flag, and it adds
that it checks for well-formed UTF-8 only if that flag is true. There
are improvements to be made here too. We can note what the flag means
(subtle, complicated, bike-shed-able) or at the very least add a nice
"this isn't the flag you're looking for" warning. We can also suggest
when to use and when not to use the function (otherwise it's left to the
reader, who can easily get it wrong, which is why we're here).
utf8​::is_utf8() doesn't accept the second parameter and does no
validity checks (we have utf8​::valid() for that), despite the note in
utf8.pm.

If the documentation for both were better, then we could possibly have
left these as unfortunate naming errors we're carrying with us (along with
"wantarray" for noting whether the context is scalar, list, or void).
...
Overall, I'm still undecided. Maybe we could start with improving the
existing documentation?
Perhaps something like​:

=item * C<$flag = utf8​::is_utf8($string)>

(Since Perl 5.8.1) Test whether I<$string> is marked internally as
encoded in UTF-8. Functionally the same as C<Encode​::is_utf8($string)>.
Typically only necessary for debugging.

If you need to force Unicode semantics for code that needs to be
compatible with perls older than 5.12, call C<utf8​::upgrade($string)>
unconditionally.

Using this flag to decide whether a string should be treated as
already encoded bytes or characters is wrong, this should be decided
as part of the interface of your function.

If you're accepting bytes​:

utf8​::downgrade($string); # throws an exception if code point over 0xFF

utf8​::downgrade($string, 1) # our own error handling
or die "\$string must be representable as bytes"

or if you're accepting characters and need encoded bytes​:

utf8​::encode($string); # unconditionally

The only exception is if you're dealing with filenames, since perl
uses the internal representation of the string for system calls.

<<

Are there any other cases someone might be tempted to call
utf8​::is_utf8()?
Thinking about it further, I'm pretty sure this doesn't all belong
here.

L<perlunifaq/What is "the UTF8 flag"?> provides a good description of
the flag is_utf8() returns, and the whole of perlunifaq covers some of
the things the above tries to cover.

perlunicook largely works at a higher level than the functions in
utf8​::* work at.

+1 on the suggested text.

I think this addition is useful, even if it is also covered in more
documents. We could also link to those documents for further learning.

@p5pRT
Author

p5pRT commented Jul 20, 2017

From @tonycoz

On Tue, 18 Jul 2017 23​:58​:39 -0700, tonyc wrote​:

which could perhaps use some expansion in perlunicode.

perlunitut covers this reasonably well.

I'm not sure where the cheat sheet following belongs, though
perlunifaq covers some of it (though using Encode instead of utf8​::*).

Attached is a series of patches (as a single file), the first three
fix some minor problems with the unicode documentation I found when
going through it.

The fourth re-works the documentation in utf8.pm, taking bits from my little cheat sheet and hopefully putting them in the right places.

Tony

@p5pRT
Author

p5pRT commented Jul 20, 2017

From @tonycoz

131685-various-changes.patch
From bb94b5c97eb772aabac478a997537696cf953b39 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Wed, 19 Jul 2017 10:30:56 +1000
Subject: use utf8; doesn't force unicode semantics on all strings in scope

eg.

$ perl -Mutf8 -le 'chr(0xdf) =~ /ss/i and print "match" or print "no match"'
no match

perhaps this should be removed, or completely re-worded, it's worded
similarly to the next point which behaves differently.
---
 pod/perlunicode.pod | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index ef02b0a..d3ccf44 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -233,7 +233,7 @@ Unicode:
 Within the scope of S<C<use utf8>>
 
 If the whole program is Unicode (signified by using 8-bit B<U>nicode
-B<T>ransformation B<F>ormat), then all strings within it must be
+B<T>ransformation B<F>ormat), then all literal strings within it must be
 Unicode.
 
 =item *
-- 
2.1.4


From b8e048092606e8ab230e0915896cd44a1c900597 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Wed, 19 Jul 2017 10:45:33 +1000
Subject: encoding.pm no longer works

---
 pod/perlunicode.pod | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index d3ccf44..24102bf 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -60,10 +60,11 @@ filenames.
 Use the C<:encoding(...)> layer  to read from and write to
 filehandles using the specified encoding.  (See L<open>.)
 
-=item You should convert your non-ASCII, non-UTF-8 Perl scripts to be
+=item You must convert your non-ASCII, non-UTF-8 Perl scripts to be
 UTF-8.
 
-See L<encoding>.
+The L<encoding> module has been deprecated since perl 5.18 and the
+perl internals it requires have been removed with perl 5.26.
 
 =item C<use utf8> still needed to enable L<UTF-8|/Unicode Encodings> in scripts
 
-- 
2.1.4


From b997306c58fa50d12a10a92b73ecc075100c8518 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Wed, 19 Jul 2017 15:42:18 +1000
Subject: unfortunately sysread() tries to read characters

---
 pod/perluniintro.pod | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 0ad9dda..5e263b4 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -473,8 +473,11 @@ standardisation organisations are recognised; for a more detailed
 list see L<Encode::Supported>.
 
 C<read()> reads characters and returns the number of characters.
-C<seek()> and C<tell()> operate on byte counts, as do C<sysread()>
-and C<sysseek()>.
+C<seek()> and C<tell()> operate on byte counts, as does C<sysseek()>.
+
+C<sysread()> and C<syswrite()> should not be used on file handles with
+character encoding layers, they behave badly, and that behaviour has
+been deprecated since perl 5.24.
 
 Notice that because of the default behaviour of not doing any
 conversion upon input if there is no default layer,
-- 
2.1.4


From fb22d08dd9f174ddc4007c8ca6ef0e379fe34874 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Thu, 20 Jul 2017 15:44:49 +1000
Subject: (perl #131685) improve utf8::* function documentation

Splits the little cheat sheet I posted as a comment into pieces
and puts them closer to where they belong

- better document why you'd want to use utf8::upgrade()

- similarly for utf8::downgrade()

- try hard to convince people not to use utf8::is_utf8()

- no, utf8::is_utf8() isn't what you want instead of utf8::valid()
---
 lib/utf8.pm | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 52 insertions(+), 9 deletions(-)

diff --git a/lib/utf8.pm b/lib/utf8.pm
index 324cb87..9abbd06 100644
--- a/lib/utf8.pm
+++ b/lib/utf8.pm
@@ -2,7 +2,7 @@ package utf8;
 
 $utf8::hint_bits = 0x00800000;
 
-our $VERSION = '1.19';
+our $VERSION = '1.20';
 
 sub import {
     $^H |= $utf8::hint_bits;
@@ -109,11 +109,26 @@ you should not say that unless you really want to have UTF-8 source code.
 Converts in-place the internal representation of the string from an octet
 sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The
 logical character sequence itself is unchanged.  If I<$string> is already
-stored as UTF-8, then this is a no-op. Returns the
-number of octets necessary to represent the string as UTF-8.  Can be
-used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
-work as Unicode on strings containing non-ASCII characters whose code points
-are below 256.
+upgraded, then this is a no-op. Returns the
+number of octets necessary to represent the string as UTF-8.
+
+If your code needs to be compatible with versions of perl without
+C<use feature 'unicode_strings';>, you can force Unicode semantics on
+a given string:
+
+  # force unicode semantics for $string without the
+  # "unicode_strings" feature
+  utf8::upgrade($string);
+
+For example:
+
+  # without explicit or implicit use feature 'unicode_strings'
+  my $x = "\xDF"; # LATIN SMALL LETTER SHARP S
+  /ss/i;          # won't match
+  my $y = uc($x); # won't comvert
+  utf8::upgrade($x);
+  /ss/i;          # matches
+  my $z = uc($x); # converts to "SS"
 
 B<Note that this function does not handle arbitrary encodings>;
 use L<Encode> instead.
@@ -136,6 +151,15 @@ true, returns false.
 
 Returns true on success.
 
+If your code expects an octet sequence this can be used to validate
+that you've received one:
+
+  # throw an exception if not representable as octets
+  utf8::downgrade($string)
+
+  # or do your own error handling
+  utf8::downgrade($string, 1) or die "string must be octets";
+
 B<Note that this function does not handle arbitrary encodings>;
 use L<Encode> instead.
 
@@ -153,6 +177,11 @@ Returns nothing.
                     # ASCII platforms) 0xc4 and 0x80.  On EBCDIC
                     # 1047, this would instead be 0x8C and 0x41.
 
+Similar to:
+
+  use Encode;
+  $a = Encode::encode("utf8", $a);
+
 B<Note that this function does not handle arbitrary encodings>;
 use L<Encode> instead.
 
@@ -208,7 +237,22 @@ platforms, so there is no performance hit in using it there.
 =item * C<$flag = utf8::is_utf8($string)>
 
 (Since Perl 5.8.1)  Test whether I<$string> is marked internally as encoded in
-UTF-8.  Functionally the same as C<Encode::is_utf8()>.
+UTF-8.  Functionally the same as C<Encode::is_utf8($string)>.
+
+Typically only necessary for debugging and testing, if you need to
+dump the internals of an SV, L<Devel::Peek's|Devel::Peek> Dump()
+provides more detail in a compact form.
+
+If you still think you need this outside of debugging, testing or
+dealing with filenames, you should probably read L<perlunitut> and
+L<perlunifaq/What is "the UTF8 flag"?>.
+
+Don't use this flag as a marker to distinguish character and binary
+data, that should be decided for each variable when you write your
+code.
+
+To force unicode semantics in code portable to perl 5.8 and 5.10, call
+C<utf8::upgrade($string)> unconditionally.
 
 =item * C<$flag = utf8::valid($string)>
 
@@ -216,8 +260,7 @@ UTF-8.  Functionally the same as C<Encode::is_utf8()>.
 UTF-8.  Will return true if it is well-formed UTF-8 and has the UTF-8 flag
 on B<or> if I<$string> is held as bytes (both these states are 'consistent').
 Main reason for this routine is to allow Perl's test suite to check
-that operations have left strings in a consistent state.  You most
-probably want to use C<utf8::is_utf8()> instead.
+that operations have left strings in a consistent state.
 
 =back
 
-- 
2.1.4

@p5pRT
Author

p5pRT commented Jul 20, 2017

From @xsawyerx

On 07/20/2017 07​:50 AM, Tony Cook via RT wrote​:

On Tue, 18 Jul 2017 23​:58​:39 -0700, tonyc wrote​:

which could perhaps use some expansion in perlunicode.
perlunitut covers this reasonably well.

I'm not sure where the cheat sheet following belongs, though
perlunifaq covers some of it (though using Encode instead of utf8​::*).
Attached is a series of patches (as a single file), the first three
fix some minor problems with the unicode documentation I found when
going through it.

The fourth re-works the documentation in utf8.pm, taking bits from my little cheat sheet and hopefully putting them in the right places.

Thank you, Tony.

I have only two small nit-pickings on the patch​: There's a typo for
"convert" (says "comvert") and it uses "$a" in one of the examples which
I think should be "$x" or some unreserved variable name, to avoid confusion.

@p5pRT
Author

p5pRT commented Jul 20, 2017

From @xsawyerx

On 07/20/2017 09​:23 AM, Sawyer X wrote​:

On 07/20/2017 07​:50 AM, Tony Cook via RT wrote​:

On Tue, 18 Jul 2017 23​:58​:39 -0700, tonyc wrote​:

which could perhaps use some expansion in perlunicode.
perlunitut covers this reasonably well.

I'm not sure where the cheat sheet following belongs, though
perlunifaq covers some of it (though using Encode instead of utf8​::*).
Attached is a series of patches (as a single file), the first three
fix some minor problems with the unicode documentation I found when
going through it.

The fourth re-works the documentation in utf8.pm, taking bits from my little cheat sheet and hopefully putting them in the right places.
Thank you, Tony.

I have only two small nit-pickings on the patch​: There's a typo for
"convert" (says "comvert") and it uses "$a" in one of the examples which
I think should be "$x" or some unreserved variable name, to avoid confusion.

For what it's worth, this received an offline +1 from rgs. :)

@p5pRT
Author

p5pRT commented Jul 21, 2017

From @tonycoz

On Thu, Jul 20, 2017 at 09​:23​:44AM +0200, Sawyer X wrote​:

On 07/20/2017 07​:50 AM, Tony Cook via RT wrote​:

On Tue, 18 Jul 2017 23​:58​:39 -0700, tonyc wrote​:

which could perhaps use some expansion in perlunicode.
perlunitut covers this reasonably well.

I'm not sure where the cheat sheet following belongs, though
perlunifaq covers some of it (though using Encode instead of utf8​::*).
Attached is a series of patches (as a single file), the first three
fix some minor problems with the unicode documentation I found when
going through it.

The fourth re-works the documentation in utf8.pm, taking bits from my little cheat sheet and hopefully putting them in the right places.

Thank you, Tony.

I have only two small nit-pickings on the patch​: There's a typo for
"convert" (says "comvert") and it uses "$a" in one of the examples which
I think should be "$x" or some unreserved variable name, to avoid confusion.

Updated patch attached.

Any opinions on whether the reference to C<use utf8;> modified by the
first patch should be removed?

It's still misleading ("abc" in the scope of use utf8; isn't SVf_UTF8
marked), which isn't a big deal, until we do "abc\xDF" which also
isn't marked.
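
To illustrate (a sketch only; the variable names are made up, and the expected flags follow from the statement above and from the utf8::upgrade() example in the patch):

```perl
use strict;
use warnings;
use utf8;    # source is declared UTF-8; this sketch itself is pure ASCII

my $ascii  = "abc";
my $latin1 = "abc\xDF";       # code point below 0x100, written as an escape
my $wide   = "abc\x{100}";    # code point above 0xFF needs the wide storage

my %flag = (
    ascii  => utf8::is_utf8($ascii)  ? 1 : 0,
    latin1 => utf8::is_utf8($latin1) ? 1 : 0,
    wide   => utf8::is_utf8($wide)   ? 1 : 0,
);

# Without the unicode_strings feature, the unmarked string gets native
# semantics, so SHARP S does not case-fold to "ss" until it is upgraded.
my $before = $latin1 =~ /ss/i ? 1 : 0;
utf8::upgrade($latin1);
my $after  = $latin1 =~ /ss/i ? 1 : 0;

print "flags: ascii=$flag{ascii} latin1=$flag{latin1} wide=$flag{wide}\n";
print "match before upgrade=$before, after upgrade=$after\n";
```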

Tony

@p5pRT
Author

p5pRT commented Jul 21, 2017

From @tonycoz

131685-various-changes.patch
From bb94b5c97eb772aabac478a997537696cf953b39 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Wed, 19 Jul 2017 10:30:56 +1000
Subject: use utf8; doesn't force unicode semantics on all strings in scope

eg.

$ perl -Mutf8 -le 'chr(0xdf) =~ /ss/i and print "match" or print "no match"'
no match

perhaps this should be removed, or completely re-worded, it's worded
similarly to the next point which behaves differently.
---
 pod/perlunicode.pod | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index ef02b0a..d3ccf44 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -233,7 +233,7 @@ Unicode:
 Within the scope of S<C<use utf8>>
 
 If the whole program is Unicode (signified by using 8-bit B<U>nicode
-B<T>ransformation B<F>ormat), then all strings within it must be
+B<T>ransformation B<F>ormat), then all literal strings within it must be
 Unicode.
 
 =item *
-- 
2.1.4


From b8e048092606e8ab230e0915896cd44a1c900597 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Wed, 19 Jul 2017 10:45:33 +1000
Subject: encoding.pm no longer works

---
 pod/perlunicode.pod | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index d3ccf44..24102bf 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -60,10 +60,11 @@ filenames.
 Use the C<:encoding(...)> layer  to read from and write to
 filehandles using the specified encoding.  (See L<open>.)
 
-=item You should convert your non-ASCII, non-UTF-8 Perl scripts to be
+=item You must convert your non-ASCII, non-UTF-8 Perl scripts to be
 UTF-8.
 
-See L<encoding>.
+The L<encoding> module has been deprecated since perl 5.18 and the
+perl internals it requires have been removed with perl 5.26.
 
 =item C<use utf8> still needed to enable L<UTF-8|/Unicode Encodings> in scripts
 
-- 
2.1.4


From b997306c58fa50d12a10a92b73ecc075100c8518 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Wed, 19 Jul 2017 15:42:18 +1000
Subject: unfortunately sysread() tries to read characters

---
 pod/perluniintro.pod | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 0ad9dda..5e263b4 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -473,8 +473,11 @@ standardisation organisations are recognised; for a more detailed
 list see L<Encode::Supported>.
 
 C<read()> reads characters and returns the number of characters.
-C<seek()> and C<tell()> operate on byte counts, as do C<sysread()>
-and C<sysseek()>.
+C<seek()> and C<tell()> operate on byte counts, as does C<sysseek()>.
+
+C<sysread()> and C<syswrite()> should not be used on file handles with
+character encoding layers, they behave badly, and that behaviour has
+been deprecated since perl 5.24.
 
 Notice that because of the default behaviour of not doing any
 conversion upon input if there is no default layer,
-- 
2.1.4


From bba883b879024faf30095f9f19b52ec5ce4d8aac Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Fri, 21 Jul 2017 11:29:39 +1000
Subject: (perl #131685) improve utf8::* function documentation

Splits the little cheat sheet I posted as a comment into pieces
and puts them closer to where they belong

- better document why you'd want to use utf8::upgrade()

- similarly for utf8::downgrade()

- try hard to convince people not to use utf8::is_utf8()

- no, utf8::is_utf8() isn't what you want instead of utf8::valid()

- change some examples to use $x instead of the sort reserved $a
---
 lib/utf8.pm | 69 +++++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 56 insertions(+), 13 deletions(-)

diff --git a/lib/utf8.pm b/lib/utf8.pm
index 324cb87..50a5b20 100644
--- a/lib/utf8.pm
+++ b/lib/utf8.pm
@@ -2,7 +2,7 @@ package utf8;
 
 $utf8::hint_bits = 0x00800000;
 
-our $VERSION = '1.19';
+our $VERSION = '1.20';
 
 sub import {
     $^H |= $utf8::hint_bits;
@@ -109,11 +109,26 @@ you should not say that unless you really want to have UTF-8 source code.
 Converts in-place the internal representation of the string from an octet
 sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The
 logical character sequence itself is unchanged.  If I<$string> is already
-stored as UTF-8, then this is a no-op. Returns the
-number of octets necessary to represent the string as UTF-8.  Can be
-used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
-work as Unicode on strings containing non-ASCII characters whose code points
-are below 256.
+upgraded, then this is a no-op. Returns the
+number of octets necessary to represent the string as UTF-8.
+
+If your code needs to be compatible with versions of perl without
+C<use feature 'unicode_strings';>, you can force Unicode semantics on
+a given string:
+
+  # force unicode semantics for $string without the
+  # "unicode_strings" feature
+  utf8::upgrade($string);
+
+For example:
+
+  # without explicit or implicit use feature 'unicode_strings'
+  my $x = "\xDF"; # LATIN SMALL LETTER SHARP S
+  /ss/i;          # won't match
+  my $y = uc($x); # won't convert
+  utf8::upgrade($x);
+  /ss/i;          # matches
+  my $z = uc($x); # converts to "SS"
 
 B<Note that this function does not handle arbitrary encodings>;
 use L<Encode> instead.
@@ -136,6 +151,15 @@ true, returns false.
 
 Returns true on success.
 
+If your code expects an octet sequence this can be used to validate
+that you've received one:
+
+  # throw an exception if not representable as octets
+  utf8::downgrade($string)
+
+  # or do your own error handling
+  utf8::downgrade($string, 1) or die "string must be octets";
+
 B<Note that this function does not handle arbitrary encodings>;
 use L<Encode> instead.
 
@@ -148,11 +172,16 @@ replaced with a sequence of one or more characters that represent the
 individual UTF-8 bytes of the character.  The UTF8 flag is turned off.
 Returns nothing.
 
- my $a = "\x{100}"; # $a contains one character, with ord 0x100
- utf8::encode($a);  # $a contains two characters, with ords (on
+ my $x = "\x{100}"; # $a contains one character, with ord 0x100
+ utf8::encode($x);  # $a contains two characters, with ords (on
                     # ASCII platforms) 0xc4 and 0x80.  On EBCDIC
                     # 1047, this would instead be 0x8C and 0x41.
 
+Similar to:
+
+  use Encode;
+  $x = Encode::encode("utf8", $x);
+
 B<Note that this function does not handle arbitrary encodings>;
 use L<Encode> instead.
 
@@ -167,9 +196,9 @@ turned on only if the source string contains multiple-byte UTF-8
 characters.  If I<$string> is invalid as UTF-8, returns false;
 otherwise returns true.
 
- my $a = "\xc4\x80"; # $a contains two characters, with ords
+ my $x = "\xc4\x80"; # $a contains two characters, with ords
                      # 0xc4 and 0x80
- utf8::decode($a);   # On ASCII platforms, $a contains one char,
+ utf8::decode($x);   # On ASCII platforms, $a contains one char,
                      # with ord 0x100.   Since these bytes aren't
                      # legal UTF-EBCDIC, on EBCDIC platforms, $a is
                      # unchanged and the function returns FALSE.
@@ -208,7 +237,22 @@ platforms, so there is no performance hit in using it there.
 =item * C<$flag = utf8::is_utf8($string)>
 
 (Since Perl 5.8.1)  Test whether I<$string> is marked internally as encoded in
-UTF-8.  Functionally the same as C<Encode::is_utf8()>.
+UTF-8.  Functionally the same as C<Encode::is_utf8($string)>.
+
+Typically only necessary for debugging and testing, if you need to
+dump the internals of an SV, L<Devel::Peek's|Devel::Peek> Dump()
+provides more detail in a compact form.
+
+If you still think you need this outside of debugging, testing or
+dealing with filenames, you should probably read L<perlunitut> and
+L<perlunifaq/What is "the UTF8 flag"?>.
+
+Don't use this flag as a marker to distinguish character and binary
+data, that should be decided for each variable when you write your
+code.
+
+To force unicode semantics in code portable to perl 5.8 and 5.10, call
+C<utf8::upgrade($string)> unconditionally.
 
 =item * C<$flag = utf8::valid($string)>
 
@@ -216,8 +260,7 @@ UTF-8.  Functionally the same as C<Encode::is_utf8()>.
 UTF-8.  Will return true if it is well-formed UTF-8 and has the UTF-8 flag
 on B<or> if I<$string> is held as bytes (both these states are 'consistent').
 Main reason for this routine is to allow Perl's test suite to check
-that operations have left strings in a consistent state.  You most
-probably want to use C<utf8::is_utf8()> instead.
+that operations have left strings in a consistent state.
 
 =back
 
-- 
2.1.4

@p5pRT
Author

p5pRT commented Jul 21, 2017

From @xsawyerx

+1

(Except "$a" still appears in the comments next to the lines that now
say "$x". Sorry.)

On 07/21/2017 03​:40 AM, Tony Cook wrote​:

On Thu, Jul 20, 2017 at 09​:23​:44AM +0200, Sawyer X wrote​:

On 07/20/2017 07​:50 AM, Tony Cook via RT wrote​:

On Tue, 18 Jul 2017 23​:58​:39 -0700, tonyc wrote​:

which could perhaps use some expansion in perlunicode.
perlunitut covers this reasonably well.

I'm not sure where the cheat sheet following belongs, though
perlunifaq covers some of it (though using Encode instead of utf8​::*).
Attached is a series of patches (as a single file), the first three
fix some minor problems with the unicode documentation I found when
going through it.

The fourth re-works the documentation in utf8.pm, taking bits from my little cheat sheet and hopefully putting them in the right places.
Thank you, Tony.

I have only two small nit-pickings on the patch​: There's a typo for
"convert" (says "comvert") and it uses "$a" in one of the examples which
I think should be "$x" or some unreserved variable name, to avoid confusion.
Updated patch attached.

Any opinions on whether the reference to C<use utf8;> modified by the
first patch should be removed?

It's still misleading ("abc" in the scope of use utf8; isn't SVf_UTF8
marked), which isn't a big deal, until we do "abc\xDF" which also
isn't marked.

Tony

@p5pRT
Author

p5pRT commented Jul 24, 2017

From @tonycoz

On Fri, 21 Jul 2017 02​:02​:08 -0700, xsawyerx@​gmail.com wrote​:

+1

(Except "$a" still appears in the comments next to the lines that now
say "$x". Sorry.)

Fixed and applied as e423fa8, 01c3fbb, ee329ae and 0397beb.

Is there anything else we should do to avoid mis-use of these functions?

I previously said​:

Using this flag to decide whether a string should be treated as
already encoded bytes or characters is wrong, this should be
decided as part of the interface of your function.
which could perhaps use some expansion in perlunicode.
perlunitut covers this reasonably well.

I'm referring to "I/O flow (the actual 5 minute tutorial)", should this be expanded elsewhere?

I don't think it should be expanded in perlunitut.
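
For reference, the flow that section describes (decode on input, work with characters, encode on output) can be sketched like this. The sample input and variable names are made up, and a real program would normally get its bytes from a file or socket rather than a literal:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# 1. Input arrives as bytes: "cafe" with an e-acute, encoded as UTF-8.
my $input_bytes = "caf\xC3\xA9";

# 2. Decode once, at the boundary; from here on we think in characters.
my $text = decode('UTF-8', $input_bytes);

my $byte_count = length $input_bytes;   # counts bytes
my $char_count = length $text;          # counts characters

# 3. Encode once, at the boundary, when the data leaves the program.
my $output_bytes = encode('UTF-8', $text);

print "bytes=$byte_count chars=$char_count\n";
```

At no point does this need to ask whether the UTF8 flag is set; which variables hold bytes and which hold characters is fixed when the code is written.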

Tony

@p5pRT
Author

p5pRT commented Jul 24, 2017

From @pali

On Sunday 23 July 2017 18​:57​:43 Tony Cook via RT wrote​:

On Fri, 21 Jul 2017 02​:02​:08 -0700, xsawyerx@​gmail.com wrote​:

+1

(Except "$a" still appears in the comments next to the lines that now
say "$x". Sorry.)

Fixed and applied as e423fa8, 01c3fbb, ee329ae and 0397beb.

Just one note​:

+Similar to​:
+
+ use Encode;
+ $x = Encode​::encode("utf8", $x);
+

Maybe the examples should show "UTF-8" instead of "utf8", so that
users/developers who use Encode::encode get "correct" (strict) UTF-8
output and not perl's extended utf8.

Commit 8e179dd replaced the use of Encode "utf8" with "UTF-8", as that
is better for people who copy and paste without context.
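
A short sketch of the distinction, using only find_encoding() and encode() from Encode. The sample string is made up; the encoding names are the ones given in the Encode documentation ("UTF-8 vs. utf8 vs. UTF8"):

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# The two spellings resolve to two different encoding objects.
my $strict_name = find_encoding("UTF-8")->name;   # the strict encoding
my $lax_name    = find_encoding("utf8")->name;    # perl's extended utf8

# For ordinary text the output bytes are identical; the two differ only
# in how they treat code points that strict UTF-8 does not allow.
my $text         = "\x{100}";
my $strict_bytes = encode("UTF-8", $text);
my $lax_bytes    = encode("utf8",  $text);

print "UTF-8 -> $strict_name\n";
print "utf8  -> $lax_name\n";
```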

@p5pRT
Author

p5pRT commented Jul 24, 2017

From @pali

On Sunday 23 July 2017 18​:57​:43 Tony Cook via RT wrote​:

Is there anything else we should do to avoid mis-use of these functions?

The most useful and legitimate functions are these:
utf8::encode utf8::decode utf8::native_to_unicode utf8::unicode_to_native

What about moving them higher up in the synopsis and in the description?
That way we first show users the functions they probably want to use in
their code, and only afterwards describe upgrade/downgrade/is_utf8...

Adding an "[INTERNAL]" marker to their descriptions, as is done for
utf8::valid, would probably help too.
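
For context, a minimal sketch of those four functions in use. The byte values assume an ASCII platform (as in the existing utf8.pm examples), and the variable names are made up:

```perl
use strict;
use warnings;

# utf8::decode(): UTF-8 bytes -> characters, in place; returns false
# if the input is not valid UTF-8.
my $chars      = "\xC4\x80";
my $decoded_ok = utf8::decode($chars) ? 1 : 0;   # $chars is now U+0100

# utf8::encode(): characters -> UTF-8 bytes, in place; returns nothing.
my $roundtrip = $chars;
utf8::encode($roundtrip);

# Invalid input is reported rather than silently accepted.
my $bad    = "\xFF";
my $bad_ok = utf8::decode($bad) ? 1 : 0;

# Code point conversion between the native character set and Unicode
# (an identity mapping on ASCII platforms, a real mapping on EBCDIC).
my $unicode_a = utf8::native_to_unicode(ord("A"));
my $native_a  = utf8::unicode_to_native($unicode_a);

printf "decoded_ok=%d length=%d bad_ok=%d unicode_a=%d\n",
    $decoded_ok, length($chars), $bad_ok, $unicode_a;
```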

@p5pRT
Author

p5pRT commented Aug 2, 2017

From @khwilliamson

On 07/13/2017 08​:28 PM, Father Chrysostomos via RT wrote​:

On Wed, 12 Jul 2017 21​:55​:03 -0700, public@​khwilliamson.com wrote​:

I guess we have a fundamental disagreement about language design and
the direction Perl should go, which makes me sad.

I agree the disagreement is unfortunate.

The point of adding synonyms for deceptively-named functions and
macros is to make life easier overall. Forbidding new better-named
synonyms for problematically named things forces everyone who comes
along to deal with the gotchas and cognitive load that those people
already here have had to deal with. By creating better named things,
those people can largely avoid these problems. This allows them to
work more efficiently, avoiding traps, and with less cursing Perl.

When you first put forward this argument (specifically with regard to av_len), it made sense to me, and I had no objection to it. Later, people wrote to p5p complaining that the new situation was more confusing; in addition, *I* started to get confused. That was when I started to have second thoughts.

I searched the archives of p5p for occurrences of av_top_index and
av_tindex. There were two complaints I saw before the recent spate.
One was Marc Lehmann; the other, more recent was Dave Mitchell saying
av_tindex didn't seem natural to him.

I myself am confused by the previous names, and this helps *me*. There
are times when I want to refer to the highest element. And there are
times when the length is the more natural concept. I would like
something for these occasions like 'av_true_len'. Again, if I see
av_len, I realize it's problematic and I have to slow down to think
about how it is. Life is more difficult.

I think Damian Conway was right when he wrote in PBP that one should not use English (the module). Since other people use punctuation variables, you are going to have to learn them anyway. Using the English names just forces others reading your code to look up the names that you are using. It just creates more cognitive burden.

That tells me that the names were not chosen well enough. It is an art,
and few coders are good at it. I still have learned only a few of the
punctuation variables.

I think the same applies even to poorly named functions. You just have to learn the gotcha once, and then you can use the function and read code that uses it. (And if you use functions without reading either the documentation or the source, then you are coming close to what I would call autopodotoxy.)

Unless Perl is close to death, the number of people who are going to
come along before it does die dwarfs the number who are already
expert. Some people are knowledgeable in parts of Perl, but not all.
They also gain if gotchas get removed before they have to deal with
them.

But the gotchas never get removed. You just end up with a larger pile of functions for people to sift through. Not to mention a lot of existing (and correct) code that they cannot read without learning the discouraged parlance. So they have to learn the different forms anyway.

My personal experience is that what you are arguing for, while it sounds good, does not work in practice.

If you assume that new Perl XS programmers are mostly going to be
reading old code that uses these constructs, yes they will have to learn
them at some point. And, encountering those constructs will likely slow
them down each time. But my hope is that there will be plenty of new
Perl programmers programming Perl and XS on new projects, and they
shouldn't have to be burdened by the past.

My father was good at double-clutching. He used that, the story goes,
to save a tourist bus he was driving down a steep slope when its
brakes failed. He tried to teach me that art, and I did it a few times,
but transmissions had gotten better, and I never had to do it, and
couldn't do it now. Nowadays most people don't even know what it is,
nor should they have to be burdened by a skill that technology has made
essentially obsolete.
