Report information
Id: 131685
Status: open
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors: pali [at] cpan.org
Cc:
AdminCc:

Operating System: (no value)
PatchStatus: (no value)
Severity: low
Type: unknown
Perl Version: (no value)
Fixed In: (no value)



Subject: Rename utf8::is_utf8() (and other functions)
To: perlbug [...] perl.org
Date: Sat, 1 Jul 2017 18:02:55 +0200
From: pali [...] cpan.org
Hi!

This is continuation from original discussion about renaming
utf8::is_utf8() to utf8::is_upgraded() which can be found at:
https://www.nntp.perl.org/group/perl.perl5.porters/2017/02/msg243068.html

Problem is that in more perl modules is used this incorrect code
pattern:

  use utf8;

  my $value = func();
  if (utf8::is_utf8($value)) {
    utf8::encode($value);
  }

In most cases module developers think that utf8::is_utf8() returns true
when it is needed to manually encode argument into UTF-8 bytes. Which is
of course wrong.

Reason for this is poor name of function utf8::is_utf8() and also poor
documentation about this function.

Functions utf8::is_utf8(), utf8::upgrade() and utf8::downgrade() changes
internal string representation, which is fully invisible for pure perl
code, and therefore I think all those functions should be in Internals
namespace.

I'm proposing following rename of functions:

utf8::is_utf8() --> Internals::uses_string_wide_storage()
utf8::upgrade() --> Internals::upgrade_string_to_wide_storage()
utf8::downgrade() --> Internals::downgrade_string_from_wide_storage()

Plus adding backward compatible aliases to make existing code works like
before.

As all those functions should be used only for debugging purposes (e.g.
test cases for XS code) or when dealing with buggy XS module, I'm
proposing starting to throw warning (e.g. since v5.28.0) when those
functions are called. For those who are dealing with internals, can turn
warning off by no warnings 'experimental::internal';

I'm attaching patches which:

* Add new warning category 'experimental::internal'
* Rename utf8 functions
* Update perldoc utf8 documentation
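The claim that the flag is invisible to pure Perl code can be illustrated with a minimal sketch. It is not taken from the attached patches; the variable names are illustrative, and it only assumes the documented behaviour of utf8::upgrade(), utf8::downgrade() and utf8::is_utf8().

```perl
use strict;
use warnings;

# Two copies of the same one-character string (U+00E9).
my $down = "\xE9";
my $up   = "\xE9";

utf8::downgrade($down);    # force byte (non-UTF-8) internal storage
utf8::upgrade($up);        # force UTF-8 internal storage

# Ordinary string operations cannot tell the two copies apart ...
my $same_content = ($down eq $up && length($down) == length($up)) ? 1 : 0;

# ... but utf8::is_utf8() reports a different flag for each copy.
my $flag_down = utf8::is_utf8($down) ? 1 : 0;
my $flag_up   = utf8::is_utf8($up)   ? 1 : 0;

print "same content: $same_content, flag(down): $flag_down, flag(up): $flag_up\n";
```

Since both copies hold the same character, a decision such as "does this value still need encoding?" cannot be derived from the flag alone.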

Message body is not shown because sender requested not to inline it.


Date: Sat, 1 Jul 2017 18:53:53 +0200
From: "H.Merijn Brand" <h.m.brand [...] xs4all.nl>
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: perl5-porters [...] perl.org
On Sat, 01 Jul 2017 09:03:18 -0700, (via RT) <perlbug-followup@perl.org> wrote:

> # New Ticket Created by
> # Please include the string: [perl #131685]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt.perl.org/Ticket/Display.html?id=131685 >
>
> Hi!
>
> This is continuation from original discussion about renaming
> utf8::is_utf8() to utf8::is_upgraded() which can be found at:
> https://www.nntp.perl.org/group/perl.perl5.porters/2017/02/msg243068.html
>
> Problem is that in more perl modules is used this incorrect code
> pattern:
>
>   use utf8;
>
>   my $value = func();
>   if (utf8::is_utf8($value)) {
>     utf8::encode($value);
>   }
>
> In most cases module developers think that utf8::is_utf8() returns true
> when it is needed to manually encode argument into UTF-8 bytes. Which is
> of course wrong.
>
> Reason for this is poor name of function utf8::is_utf8() and also poor
> documentation about this function.
>
> Functions utf8::is_utf8(), utf8::upgrade() and utf8::downgrade() changes
> internal string representation, which is fully invisible for pure perl
> code, and therefore I think all those functions should be in Internals
> namespace.
>
> I'm proposing following rename of functions:
>
> utf8::is_utf8() --> Internals::uses_string_wide_storage()
> utf8::upgrade() --> Internals::upgrade_string_to_wide_storage()
> utf8::downgrade() --> Internals::downgrade_string_from_wide_storage()

I am still objecting, as this will also break code that uses those
functions as intended and correctly.

As these are not XS, Devel::PPPort won't help (assuming authors use D::P
on XS modules to guarantee backward compat)

I'd loath to change/fix every occurrence of code that uses any of these
three correctly, as that code is brittle to start with and probably hard
to fix when broken.

> Plus adding backward compatible aliases to make existing code works like
> before.

Then why add new functions in the first place?

> As all those functions should be used only for debugging purposes (e.g.
> test cases for XS code) or when dealing with buggy XS module, I'm
> proposing starting to throw warning (e.g. since v5.28.0) when those
> functions are called. For those who are dealing with internals, can turn
> warning off by no warnings 'experimental::internal';

No, please. Most correct uses will be in dark distant corners, hidden in
modules you don't want to touch anyway.

> I'm attaching patches which:
>
> * Add new warning category 'experimental::internal'
> * Rename utf8 functions
> * Update perldoc utf8 documentation
-- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.27 porting perl5 on HP-UX, AIX, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

To: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
CC: "bugs-bitbucket [...] rt.perl.org" <bugs-bitbucket [...] rt.perl.org>
From: Leon Timmermans <fawaka [...] gmail.com>
Date: Sat, 1 Jul 2017 19:12:28 +0200
On Sat, Jul 1, 2017 at 6:03 PM, via RT <perlbug-followup@perl.org> wrote:
Hi!

This is continuation from original discussion about renaming
utf8::is_utf8() to utf8::is_upgraded() which can be found at:
https://www.nntp.perl.org/group/perl.perl5.porters/2017/02/msg243068.html

Problem is that in more perl modules is used this incorrect code
pattern:

  use utf8;

  my $value = func();
  if (utf8::is_utf8($value)) {
    utf8::encode($value);
  }

In most cases module developers think that utf8::is_utf8() returns true
when it is needed to manually encode argument into UTF-8 bytes. Which is
of course wrong.

Reason for this is poor name of function utf8::is_utf8() and also poor
documentation about this function.

Functions utf8::is_utf8(), utf8::upgrade() and utf8::downgrade() changes
internal string representation, which is fully invisible for pure perl
code, and therefore I think all those functions should be in Internals
namespace.

I'm proposing following rename of functions:

utf8::is_utf8() --> Internals::uses_string_wide_storage()
utf8::upgrade() --> Internals::upgrade_string_to_wide_storage()
utf8::downgrade() --> Internals::downgrade_string_from_wide_storage()

Plus adding backward compatible aliases to make existing code works like
before.

As all those functions should be used only for debugging purposes (e.g.
test cases for XS code) or when dealing with buggy XS module, I'm
proposing starting to throw warning (e.g. since v5.28.0) when those
functions are called. For those who are dealing with internals, can turn
warning off by no warnings 'experimental::internal';

I'm attaching patches which:

* Add new warning category 'experimental::internal'
* Rename utf8 functions
* Update perldoc utf8 documentation

I don't see how this is an option. I'll grant you that something like this would have been a better option back then, but you're 15 years too late. "This would have been better" is no excuse to break a decade and a half of software.

Leon
From: pali [...] cpan.org
Date: Sat, 1 Jul 2017 19:45:02 +0200
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: perlbug-followup [...] perl.org
On Saturday 01 July 2017 19:13:30 you wrote:
> to break a decade and a half of software.
Hm? What you mean with to break? Existing functions would still work, just there are also new functions under new names. Usage of old functions is just removed from documentation.
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: perlbug-followup [...] perl.org
From: pali [...] cpan.org
Date: Sat, 1 Jul 2017 19:13:04 +0200
On Saturday 01 July 2017 18:54:24 you wrote:
> > Plus adding backward compatible aliases to make existing code works > > like before.
> > Then why add new functions in the first place?
From discussion it was clear that current name utf8::is_utf8() is poor and is reason why it is incorrectly used.
Date: Sat, 1 Jul 2017 19:52:52 +0200
From: Leon Timmermans <fawaka [...] gmail.com>
CC: perlbug <perlbug-followup [...] perl.org>
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: pali [...] cpan.org
On Sat, Jul 1, 2017 at 7:45 PM, <pali@cpan.org> wrote:
On Saturday 01 July 2017 19:13:30 you wrote:
> to break a decade and a half of software.

Hm? What you mean with to break? Existing functions would still work,
just there are also new functions under new names. Usage of old
functions is just removed from documentation.

Then I misunderstood your proposal, "rename" suggested to me that the old ones disappear. In that case I'm not sure I see the benefit of your proposal. Why would anyone want to use an interface that won't work on perls older than 5.28, and could disappear in a future version of perl (since that's the point of Internals::)? This isn't making sense to me.

Leon
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: Leon Timmermans <fawaka [...] gmail.com>, pali [...] cpan.org
CC: perlbug <perlbug-followup [...] perl.org>
Date: Mon, 3 Jul 2017 13:03:37 -0400
From: Sawyer X <xsawyerx [...] gmail.com>
On 07/01/2017 01:52 PM, Leon Timmermans wrote:
> On Sat, Jul 1, 2017 at 7:45 PM, <pali@cpan.org <mailto:pali@cpan.org>> > wrote: > > On Saturday 01 July 2017 19:13:30 you wrote:
> > to break a decade and a half of software.
> > Hm? What you mean with to break? Existing functions would still work, > just there are also new functions under new names. Usage of old > functions is just removed from documentation. > > > Then I misunderstood your proposal, "rename" suggested to me that the > old ones disappear. In that case I'm not sure I see the benefit of > your proposal. Why would anyone want to use an interface that won't > work on perls older than 5.28, and could disappear in a future version > of perl (since that's the point of Internals::)? This isn't making > sense to me.
You could support it with Devel::PPPort. It's a simple addition. However, the problem remains that if someone were to use these new functions without PPPort, their code would not work on older versions. I can't see a way around that.
To: Sawyer X <xsawyerx [...] gmail.com>
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
Date: Tue, 4 Jul 2017 10:38:26 +1000
From: Tony Cook <tony [...] develop-help.com>
CC: Leon Timmermans <fawaka [...] gmail.com>, pali [...] cpan.org, perlbug <perlbug-followup [...] perl.org>
On Mon, Jul 03, 2017 at 01:03:37PM -0400, Sawyer X wrote:
> > > On 07/01/2017 01:52 PM, Leon Timmermans wrote:
> > On Sat, Jul 1, 2017 at 7:45 PM, <pali@cpan.org <mailto:pali@cpan.org>> > > wrote: > > > > On Saturday 01 July 2017 19:13:30 you wrote:
> > > to break a decade and a half of software.
> > > > Hm? What you mean with to break? Existing functions would still work, > > just there are also new functions under new names. Usage of old > > functions is just removed from documentation. > > > > > > Then I misunderstood your proposal, "rename" suggested to me that the > > old ones disappear. In that case I'm not sure I see the benefit of > > your proposal. Why would anyone want to use an interface that won't > > work on perls older than 5.28, and could disappear in a future version > > of perl (since that's the point of Internals::)? This isn't making > > sense to me.
> > You could support it with Devel::PPPort. It's a simple addition. > > However, the problem remains that if someone were to use these new > functions without PPPort, their code would not work on older versions. I > can't see a way around that.
These are perl functions (as documented in utf8.pm), not C functions,
Devel::PPPort does nothing for us.

The patch retains the old names, so that isn't an issue.

But it does deprecate the old names, which is an issue, I can't
imagine us removing these functions.

As a side note, the original thread refers to:

https://metacpan.org/source/SHAY/perl-5.24.1/cpan/Archive-Tar/lib/Archive/Tar.pm#L1501

which I could see as correct because of the way perl's unicode support
(fails to) deal with filenames.

Tony
To: Tony Cook <tony [...] develop-help.com>
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
CC: Sawyer X <xsawyerx [...] gmail.com>, Leon Timmermans <fawaka [...] gmail.com>, pali [...] cpan.org, perlbug <perlbug-followup [...] perl.org>
From: Dan Book <grinnz [...] gmail.com>
Date: Mon, 3 Jul 2017 21:35:06 -0400
On Mon, Jul 3, 2017 at 8:38 PM, Tony Cook <tony@develop-help.com> wrote:

As a side note, the original thread refers to:

https://metacpan.org/source/SHAY/perl-5.24.1/cpan/Archive-Tar/lib/Archive/Tar.pm#L1501

which I could see as correct because of the way perl's unicode support
(fails to) deal with filenames.

Tony

Not entirely correct IMO. If the intent is that filenames be encoded to UTF-8, this will fail to encode downgraded names with non-ascii characters.

-Dan
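Dan's objection can be made concrete with a small sketch. The two helper subs are hypothetical and are not the Archive::Tar code; they assume the goal is to turn a character string into UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: the pattern criticised in this ticket.
# It encodes only when the internal flag happens to be set.
sub encode_if_flagged {
    my ($s) = @_;
    utf8::encode($s) if utf8::is_utf8($s);
    return $s;
}

# Hypothetical helper: always encode the character string.
sub encode_always {
    my ($s) = @_;
    utf8::encode($s);
    return $s;
}

# The same four characters in both internal representations.
my $down = "caf\xE9"; utf8::downgrade($down);
my $up   = "caf\xE9"; utf8::upgrade($up);

# Hex dumps of the resulting byte strings.
my $cond_down   = unpack "H*", encode_if_flagged($down);
my $cond_up     = unpack "H*", encode_if_flagged($up);
my $always_down = unpack "H*", encode_always($down);
my $always_up   = unpack "H*", encode_always($up);

print "conditional:   $cond_down vs $cond_up\n";
print "unconditional: $always_down vs $always_up\n";
```

The conditional version yields different bytes for two strings that compare equal, which is the failure Dan describes for downgraded names with non-ASCII characters. Whether unconditional encoding is right for filenames is a separate question, as Tony's reply about open and stat points out.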
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: Dan Book <grinnz [...] gmail.com>
CC: Sawyer X <xsawyerx [...] gmail.com>, Leon Timmermans <fawaka [...] gmail.com>, pali [...] cpan.org, perlbug <perlbug-followup [...] perl.org>
From: Tony Cook <tony [...] develop-help.com>
Date: Tue, 4 Jul 2017 11:59:23 +1000
On Mon, Jul 03, 2017 at 09:35:06PM -0400, Dan Book wrote:
> On Mon, Jul 3, 2017 at 8:38 PM, Tony Cook <tony@develop-help.com> wrote:
> > > > > > As a side note, the original thread refers to: > > > > https://metacpan.org/source/SHAY/perl-5.24.1/cpan/Archive- > > Tar/lib/Archive/Tar.pm#L1501 > > > > which I could see as correct because of the way perl's unicode support > > (fails to) deal with filenames. > > > > Tony > >
> > Not entirely correct IMO. If the intent is that filenames be encoded to > UTF-8, this will fail to encode downgraded names with non-ascii characters.
If the caller creates a file using the name they pass in, encoding the
name (which might not be utf-8 marked) may make the later -e or -l
check fail.

Perl functions such as open and stat currently ignore the UTF-8 flag,
which makes this pretty messy.

The code in Archive::Tar seems a reasonable workaround to me, I don't
think the author had much choice.

Tony
From: pali [...] cpan.org
Date: Tue, 4 Jul 2017 09:11:39 +0200
To: perlbug-followup [...] perl.org
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
On Monday 03 July 2017 21:35:06 Dan Book wrote:
> On Mon, Jul 3, 2017 at 8:38 PM, Tony Cook <tony@develop-help.com> wrote:
> > > > > > As a side note, the original thread refers to: > > > > https://metacpan.org/source/SHAY/perl-5.24.1/cpan/Archive- > > Tar/lib/Archive/Tar.pm#L1501 > > > > which I could see as correct because of the way perl's unicode support > > (fails to) deal with filenames. > > > > Tony > >
> > Not entirely correct IMO. If the intent is that filenames be encoded to > UTF-8, this will fail to encode downgraded names with non-ascii characters. > > -Dan
See bug: https://rt.perl.org/Public/Bug/Display.html?id=130831
Date: Tue, 4 Jul 2017 09:19:38 +0200
From: pali [...] cpan.org
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: perlbug-followup [...] perl.org
On Tuesday 04 July 2017 10:38:26 Tony Cook wrote:
> But it does deprecate the old names, which is an issue, I can't > imagine us removing these functions.
Warning can be removed from patch. It is just question how you decide.
Also functions stay there, but we can instruct people via documentation
to use new functions for a new code... Again it is question if you call
it deprecation or aliasing. In any case functions are not going to be
deleted, so in final case it does not matter for old code.

And for old code can be defined this function easily:

  *new_name = *old_name;

Reason for this patch series is:
* document those utf8:: functions
* allow developers to call those functions via non-cryptic names
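The aliasing idea can be sketched in plain Perl. The package My::StringStorage and the names in it are hypothetical and are not what the patch defines. The sketch assigns code references, so only the sub slot of each glob is aliased, which is slightly narrower than the glob-to-glob assignment shown in the message.

```perl
use strict;
use warnings;

# Hypothetical compatibility package; the names are illustrative only.
package My::StringStorage;

# Assigning a code reference to a glob makes the new name an alias for
# the existing built-in; no wrapper sub is involved.
*is_upgraded = \&utf8::is_utf8;
*upgrade     = \&utf8::upgrade;
*downgrade   = \&utf8::downgrade;

package main;

my $s = "abc";

My::StringStorage::upgrade($s);      # modifies $s in place, like utf8::upgrade
my $after_upgrade = My::StringStorage::is_upgraded($s) ? 1 : 0;

My::StringStorage::downgrade($s);    # back to byte storage
my $after_downgrade = My::StringStorage::is_upgraded($s) ? 1 : 0;

print "after upgrade: $after_upgrade, after downgrade: $after_downgrade\n";
```

An alias of this kind runs on any perl that has the utf8:: functions, so it sidesteps the older-perl concern raised earlier in the thread. It would also sidestep any warning attached to new core names.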
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: pali [...] cpan.org
From: demerphq <demerphq [...] gmail.com>
Date: Tue, 4 Jul 2017 10:52:09 +0200
CC: Perl RT Bug Tracker <perlbug-followup [...] perl.org>
On 4 July 2017 at 09:19, <pali@cpan.org> wrote:
> On Tuesday 04 July 2017 10:38:26 Tony Cook wrote:
>> But it does deprecate the old names, which is an issue, I can't >> imagine us removing these functions.
> > Warning can be removed from patch. It is just question how you decide. > Also functions stay there, but we can instruct people via documentation > to use new functions for a new code... Again it is question if you call > it deprecation or aliasing. In any case functions are not going to be > deleted, so in final case it does not matter for old code. > > And for old code can be defined this function easily: > > *new_name = *old_name; > > Reason for this patch series is: > * document those utf8:: functions > * allow developers to call those functions via non-cryptic names
I don't mind adding new aliases for these functions, I object to your
proposal to put them in Internals however; I think that they should go
in 'scalar', which we decided at the last PerlQA is the designated
place for functions that operate on scalars.

scalar::is_unicode_string()
scalar::is_binary_string()

I don't like the wide-storage thing, (although I admit I think it
better than "is_utf8"), a latin1 string in utf8 does not use
wide-storage, and the unicode flag has significance beyond the storage
format; utf8-on strings get unicode semantics in case insensitive
operations.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
Date: Tue, 4 Jul 2017 11:03:31 +0200
From: pali [...] cpan.org
To: perlbug-followup [...] perl.org
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
On Tuesday 04 July 2017 01:52:29 yves orton via RT wrote:
> On 4 July 2017 at 09:19, <pali@cpan.org> wrote:
> > On Tuesday 04 July 2017 10:38:26 Tony Cook wrote:
> >> But it does deprecate the old names, which is an issue, I can't > >> imagine us removing these functions.
> > > > Warning can be removed from patch. It is just question how you decide. > > Also functions stay there, but we can instruct people via documentation > > to use new functions for a new code... Again it is question if you call > > it deprecation or aliasing. In any case functions are not going to be > > deleted, so in final case it does not matter for old code. > > > > And for old code can be defined this function easily: > > > > *new_name = *old_name; > > > > Reason for this patch series is: > > * document those utf8:: functions > > * allow developers to call those functions via non-cryptic names
> > I dont mind adding new aliases for these functions, I object to your > proposal to put them in Internals however; I think that they should go > in 'scalar', which we decided at the last PerlQA is the designated > place for functions that operate on scalars.
I proposed Internals, because that flag is internal for perl and invisible for pure perl code. But if more people are happy with scalar namespace, I'm fine with it.
> scalar::is_unicode_string() > scalar::is_binary_string()
But this is wrong! SVf_UTF8 does not tell if scalar string is unicode or binary. It just tell type of internal storage. Name is_binary_string is misleading in same way as current name is_utf8. If you say that binary string is one with codes only in range 0x00-0xFF then you can have that binary string also with SVf_UTF8 flag and your function name "is_binary_string" would return false for your binary string. Such name would lead to another problems.
> I don't like the wide-storage thing, (although I admit i think it > better than "is_utf8"), a latin1 string in utf8 does not use > wide-storage,
Of course it can. Unicode code points 0x80 .. 0xFF (which are Latin1 extension from ASCII) contains two bytes when encoded in UTF-8 and therefore are wide in UTF-8 too.
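The byte counts referred to here can be checked with a short sketch. The helper name utf8_byte_length is illustrative; it relies only on utf8::encode() and length().

```perl
use strict;
use warnings;

# Hypothetical helper: number of bytes a character string occupies
# once it has been encoded as UTF-8.
sub utf8_byte_length {
    my ($chars) = @_;
    utf8::encode($chars);    # operates on the local copy only
    return length $chars;
}

my $ascii_len  = utf8_byte_length("A");           # U+0041
my $latin1_len = utf8_byte_length("\xE9");        # U+00E9
my $wide_len   = utf8_byte_length("\x{20AC}");    # U+20AC

print "U+0041: $ascii_len, U+00E9: $latin1_len, U+20AC: $wide_len\n";
```

Only code points up to 0x7F stay at one byte, which is the distinction between ASCII and Latin-1 that the follow-up message concedes.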
> and the unicode flag has significance beyond the storage > format; utf8-on strings get unicode semantics in case insensitive > operations. > > cheers, > Yves
CC: Perl RT Bug Tracker <perlbug-followup [...] perl.org>
From: demerphq <demerphq [...] gmail.com>
Date: Tue, 4 Jul 2017 11:22:42 +0200
To: pali [...] cpan.org
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
On 4 July 2017 at 11:03, <pali@cpan.org> wrote:
> On Tuesday 04 July 2017 01:52:29 yves orton via RT wrote:
>> On 4 July 2017 at 09:19, <pali@cpan.org> wrote:
>> > On Tuesday 04 July 2017 10:38:26 Tony Cook wrote:
>> >> But it does deprecate the old names, which is an issue, I can't >> >> imagine us removing these functions.
>> > >> > Warning can be removed from patch. It is just question how you decide. >> > Also functions stay there, but we can instruct people via documentation >> > to use new functions for a new code... Again it is question if you call >> > it deprecation or aliasing. In any case functions are not going to be >> > deleted, so in final case it does not matter for old code. >> > >> > And for old code can be defined this function easily: >> > >> > *new_name = *old_name; >> > >> > Reason for this patch series is: >> > * document those utf8:: functions >> > * allow developers to call those functions via non-cryptic names
>> >> I dont mind adding new aliases for these functions, I object to your >> proposal to put them in Internals however; I think that they should go >> in 'scalar', which we decided at the last PerlQA is the designated >> place for functions that operate on scalars.
> > I proposed Internals, because that flag is internal for perl and > invisible for pure perl code. But if more people are happy with scalar > namespace, I'm fine with it. >
>> scalar::is_unicode_string() >> scalar::is_binary_string()
> > But this is wrong! SVf_UTF8 does not tell if scalar string is unicode > or binary. It just tell type of internal storage.
No. This is a myth. Plain and simply a myth.

People have a hard time accepting it, but the utf8 flag tells parts of
the internals to use different rules for certain operations, when set
those rules are Unicode. When the flag is not set the default rules
are derived from ASCII.

You can see the difference in the following:

  "ba\x{DF}"=~/ss/i;
  "ba\N{U+DF}"=~/ss/i;

The latter matches because \N{U+DF} produces the unicode code point
DF, and the former does not match, because \x{DF} produces the ASCII
octet DF instead. The former is an ASCII string, and the latter is a
Unicode string.
> Name is_binary_string is misleading in same way as current name is_utf8.
Erf, maybe. We need a term for "not-unicode", and "binary" is as good as any. I don't mind other proposals.
> If you say that binary string is one with codes only in range 0x00-0xFF > then you can have that binary string also with SVf_UTF8 flag and your > function name "is_binary_string" would return false for your binary > string. Such name would lead to another problems.
The SVf_UTF8 flag being off means the string should be treated as ASCII when doing case-insensitive operations, and as binary for other purposes, and that the data is encoded as a series of discrete octets. It is not uncommon for people on this list to use the terms unicode and binary for this reason.
>> I don't like the wide-storage thing, (although I admit i think it >> better than "is_utf8"), a latin1 string in utf8 does not use >> wide-storage,
> > Of course it can. Unicode code points 0x80 .. 0xFF (which are Latin1 > extension from ASCII) contains two bytes when encoded in UTF-8 and > therefore are wide in UTF-8 too.
I spoke imprecisely, I should have said ASCII, not latin-1. cheers, Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
To: perlbug-followup [...] perl.org
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
Date: Tue, 4 Jul 2017 12:04:35 +0200
From: pali [...] cpan.org
On Tuesday 04 July 2017 11:22:42 demerphq wrote:
> No. This is a myth. Plain and simply a myth. > > People have a hard time accepting it, but the utf8 flag tells parts of > the internals to use different rules for certain operations, when set > those rules are Unicode. When the flag is not set the default rules > are derived from ASCII. > > You can see the difference in the following: > > "ba\x{DF}"=~/ss/i;
$ perl -E 'say "matched" if "ba\x{DF}"=~/ss/i;'
matched
> "ba\N{U+DF}"=~/ss/i;
$ perl -E 'say "matched" if "ba\N{U+DF}"=~/ss/i;'
matched
> The latter matches because \N{U+DF} produces the unicode code point > DF, and the former does not match, because \x{DF} produces the ASCII > octet DF instead. The former is an ASCII string, and the later is a > Unicode string.
No, both were matched under Perl 5.24.1.
Date: Tue, 4 Jul 2017 12:11:33 +0200
From: demerphq <demerphq [...] gmail.com>
CC: Perl RT Bug Tracker <perlbug-followup [...] perl.org>
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: pali [...] cpan.org
On 4 July 2017 at 12:04, <pali@cpan.org> wrote:
> On Tuesday 04 July 2017 11:22:42 demerphq wrote:
>> No. This is a myth. Plain and simply a myth. >> >> People have a hard time accepting it, but the utf8 flag tells parts of >> the internals to use different rules for certain operations, when set >> those rules are Unicode. When the flag is not set the default rules >> are derived from ASCII. >> >> You can see the difference in the following: >> >> "ba\x{DF}"=~/ss/i;
> > $ perl -E 'say "matched" if "ba\x{DF}"=~/ss/i;' > matched >
>> "ba\N{U+DF}"=~/ss/i;
> > $ perl -E 'say "matched" if "ba\N{U+DF}"=~/ss/i;' > matched
-E is not -e.

-E is enabling a pragma which changes the default behavior.

However it is *PRAGMA*. It is NOT the normal behavior of Perl.
>> The latter matches because \N{U+DF} produces the unicode code point >> DF, and the former does not match, because \x{DF} produces the ASCII >> octet DF instead. The former is an ASCII string, and the later is a >> Unicode string.
> > No, both were matched under Perl 5.24.1.
No, they did not. If \x{DF} magically started matching 'ss' it would be a *MASSIVE* regression. cheers, Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
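The difference between the two runs can be reproduced in one script by enabling the unicode_strings feature (one of the features -E switches on) in a limited scope. This is a sketch with illustrative variable names; it assumes the rest of the file has neither that feature nor a locale pragma in effect.

```perl
use strict;
use warnings;

# The same three characters in both internal representations.
my $down = "ba\xDF"; utf8::downgrade($down);
my $up   = "ba\xDF"; utf8::upgrade($up);

# Default rules: the outcome depends on the internal storage.
my $default_down = $down =~ /ss/i ? 1 : 0;
my $default_up   = $up   =~ /ss/i ? 1 : 0;

my ($feature_down, $feature_up);
{
    # Lexically scoped; patterns compiled in this block use Unicode rules.
    use feature 'unicode_strings';
    $feature_down = $down =~ /ss/i ? 1 : 0;
    $feature_up   = $up   =~ /ss/i ? 1 : 0;
}

print "default: down=$default_down up=$default_up\n";
print "feature: down=$feature_down up=$feature_up\n";
```

Inside the block both strings are matched by the same rules. Outside it, the result follows the storage of the target string, which is the behaviour Yves describes.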
Date: Tue, 4 Jul 2017 13:14:04 +0200
From: pali [...] cpan.org
To: perlbug-followup [...] perl.org
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
On Tuesday 04 July 2017 03:12:19 yves orton via RT wrote:
> On 4 July 2017 at 12:04, <pali@cpan.org> wrote:
> > On Tuesday 04 July 2017 11:22:42 demerphq wrote:
> >> No. This is a myth. Plain and simply a myth. > >> > >> People have a hard time accepting it, but the utf8 flag tells parts of > >> the internals to use different rules for certain operations, when set > >> those rules are Unicode. When the flag is not set the default rules > >> are derived from ASCII. > >> > >> You can see the difference in the following: > >> > >> "ba\x{DF}"=~/ss/i;
> > > > $ perl -E 'say "matched" if "ba\x{DF}"=~/ss/i;' > > matched > >
> >> "ba\N{U+DF}"=~/ss/i;
> > > > $ perl -E 'say "matched" if "ba\N{U+DF}"=~/ss/i;' > > matched
> > -E is not -e. > > -E is enabling a pragma which changes the default behavior. > > However it is *PRAGMA*. It is NOT the normal behavior of Perl.
Ah, right. I forgot that -E enables feature unicode_strings which
basically means that both examples were equivalent.

Default behavior is a bit unpredictable as it is affected by the
infamous Unicode Bug.

  my $str1 = "\x{DF}";
  my $str2 = "\N{U+DF}";
  my $str3 = "\x{100}";

  "ba$str1" =~ /ss/i;
  "ba$str2" =~ /ss/i;
  "ba$str1$str3" =~ /ss/i;

To make it predictable either /aa or /u modifiers should be already
used... It will prevent problems

  "ba$str1" =~ /ss/aai;
  "ba$str2" =~ /ss/aai;
  "ba$str1$str3" =~ /ss/aai;

  "ba$str1" =~ /ss/ui;
  "ba$str2" =~ /ss/ui;
  "ba$str1$str3" =~ /ss/ui;
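A runnable variant of these snippets, as a sketch: it collects the match results in a hash (the key names are illustrative) so that the effect of concatenating with a wide character can be compared across the default rules, /u and /aa. It assumes no unicode_strings feature or locale pragma is in effect.

```perl
use strict;
use warnings;

my $sharp_s = "\xDF";       # stored as a single byte
my $wide    = "\x{100}";    # needs UTF-8 storage

my $plain  = "ba$sharp_s";         # still byte storage
my $joined = "ba$sharp_s$wide";    # upgraded as a side effect of the join

my %result = (
    default_plain  => ($plain  =~ /ss/i   ? 1 : 0),
    default_joined => ($joined =~ /ss/i   ? 1 : 0),
    u_plain        => ($plain  =~ /ss/ui  ? 1 : 0),
    u_joined       => ($joined =~ /ss/ui  ? 1 : 0),
    aa_plain       => ($plain  =~ /ss/aai ? 1 : 0),
    aa_joined      => ($joined =~ /ss/aai ? 1 : 0),
);

print "$_ = $result{$_}\n" for sort keys %result;
```

Under the default rules the two results differ although the sharp s is the same character in both strings. With /u both are expected to match. With /aa, which forbids ASCII/non-ASCII case folds, neither should. In both cases the result no longer depends on how the string happens to be stored.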
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: pali [...] cpan.org
Date: Tue, 4 Jul 2017 13:32:26 +0200
From: demerphq <demerphq [...] gmail.com>
CC: Perl RT Bug Tracker <perlbug-followup [...] perl.org>
On 4 July 2017 at 13:14, <pali@cpan.org> wrote:
> On Tuesday 04 July 2017 03:12:19 yves orton via RT wrote:
>> On 4 July 2017 at 12:04, <pali@cpan.org> wrote:
>> > On Tuesday 04 July 2017 11:22:42 demerphq wrote:
>> >> No. This is a myth. Plain and simply a myth. >> >> >> >> People have a hard time accepting it, but the utf8 flag tells parts of >> >> the internals to use different rules for certain operations, when set >> >> those rules are Unicode. When the flag is not set the default rules >> >> are derived from ASCII. >> >> >> >> You can see the difference in the following: >> >> >> >> "ba\x{DF}"=~/ss/i;
>> > >> > $ perl -E 'say "matched" if "ba\x{DF}"=~/ss/i;' >> > matched >> >
>> >> "ba\N{U+DF}"=~/ss/i;
>> > >> > $ perl -E 'say "matched" if "ba\N{U+DF}"=~/ss/i;' >> > matched
>> >> -E is not -e. >> >> -E is enabling a pragma which changes the default behavior. >> >> However it is *PRAGMA*. It is NOT the normal behavior of Perl.
> > Ah, right. I forgot that -E enables feature unicode_strings which > basically means that both examples were equivalent. > > Default behavior is a bit unpredicable as it is affected by the > infamous Unicode Bug.
It is only unpredictable if your model of strings is broken. I happen to be very familiar with the internals, and do not find the actual rules to be that difficult to deal with. cheers, Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
From: pali [...] cpan.org
Date: Tue, 4 Jul 2017 13:38:00 +0200
To: perlbug-followup [...] perl.org
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
On Tuesday 04 July 2017 13:32:26 demerphq wrote:
> It is only unpredictable if your model of strings is broken.
I do not know what you mean if model of strings is broken, but once you start receiving strings from other modules, user input or whatever external resource, plus you start combining/concatenating those strings you would hit the unicode bug. Therefore safe way is to use /aa or /u modifiers in regex matching in way how you want to do matching.
> I happen > to be very familiar with the internals, and do not find the actual > rules to be that difficult to deal with.
I think this discussion is outside the scope of the original request, which is for better documentation of utf8.pm and a better name for the utf8::is_utf8() function.
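The suggestion to use explicit /u or /aa modifiers can be illustrated with a small sketch. It assumes perl 5.14 or later (where these modifiers exist) and no `unicode_strings` feature in scope; the hash layout is purely for display.

```perl
use strict;
use warnings;

my $downgraded = "\x{E9}";     # LATIN SMALL LETTER E WITH ACUTE, byte storage
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);      # same character, UTF-8 internal storage

# Record whether \w matches under each set of rules.
my %result;
for my $pair ( [ downgraded => $downgraded ], [ upgraded => $upgraded ] ) {
    my ( $name, $str ) = @$pair;
    $result{$name} = {
        default => ( $str =~ /\w/   ? 1 : 0 ),   # depends on internal storage
        u       => ( $str =~ /\w/u  ? 1 : 0 ),   # Unicode rules regardless of storage
        aa      => ( $str =~ /\w/aa ? 1 : 0 ),   # ASCII-only rules regardless of storage
    };
}

for my $name ( sort keys %result ) {
    printf "%-10s default=%d /u=%d /aa=%d\n",
        $name, @{ $result{$name} }{qw(default u aa)};
}
```

With an explicit modifier, the result no longer varies with how the string happens to be stored, which is the point of the advice above.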
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: pali [...] cpan.org, perlbug-followup [...] perl.org
Date: Mon, 10 Jul 2017 12:45:48 -0400
From: Sawyer X <xsawyerx [...] gmail.com>
On 07/04/2017 07:38 AM, pali@cpan.org wrote:
> On Tuesday 04 July 2017 13:32:26 demerphq wrote:
>> It is only unpredictable if your model of strings is broken.
> I do not know what you mean if model of strings is broken,
It is "broken" in that sense for probably more people than we would like. Do we have any documentation that clarifies this entire issue? (I know I trip on this frequently and never fully understood this issue myself.)
> [...]
>> I happen >> to be very familiar with the internals, and do not find the actual >> rules to be that difficult to deal with.
> I think this discussion is out of original request, which is for better > documentation of utf8.pm and better name for utf8::is_utf8() function.
Agree. For now we seem to have three points we agree on:

* We want to document these functions
* We want to give them better names
* We want the old behavior to work

As long as the second clause does not break the third, I think we should seek to move forward.

Yves mentioned that the "Internals" namespace is an undesirable place for it (this was discussed at P5H, the last core hackathon) and I agree. "scalar" was the most popular one, IIRC.

Does anyone have any comments on this? Tony, Dave, Zefram? *Karl*? :)

Thanks!
From: Zefram <zefram [...] fysh.org>
Date: Mon, 10 Jul 2017 20:51:02 +0100
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: perl5-porters [...] perl.org
demerphq wrote:
>People have a hard time accepting it, but the utf8 flag tells parts of >the internals to use different rules for certain operations,
Those are bugs. In some cases they are bugs that we've decided we can't just fix because of backcompat, so we add a flag to enable non-buggy semantics and the bug lives on as default behaviour. If a flag to distinguish between character strings and binary strings were an intentional semantic feature, we'd need some rules to say how the flag is to be set by operations that generate string outputs. We've never done that. -zefram
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: perl5-porters [...] perl.org
Date: Mon, 10 Jul 2017 21:13:17 +0100
From: Zefram <zefram [...] fysh.org>
Sawyer X wrote:
>Does anyone have any comments on this? Tony, Dave, Zefram? *Karl*? :)
I didn't want to add to a mostly bikeshedding discussion, but OK. I concur that the existing names are poor, but I'm not much happier with the names that have been suggested on this thread. I reckon the best terminology we have for this flag, at the user level, is "upgraded", and so the name "is_utf8" would be better as "is_upgraded". The existing names "upgrade" and "downgrade" for the transforming operations are OK, and the only change I'd potentially like to make to them would be to add something that explicates their rather unusual in-place side-effecting nature.

In fact you can see all my preferred names in my CPAN module Scalar::String. This module essentially attempts to be the sane version of utf8.pm, attempting to impart the right mental model through its function names and documentation. (The "sclstr_" prefix on all the function names may be omitted if desired; the important part of the name is that which distinguishes these functions from each other.)

I think the names for these functions should be reasonably concise, and in particular we should have a single-word adjective for "having the SvUTF8 flag on" if possible. We should also try to reuse existing terminology, rather than invent anything new. We should also avoid any term that implies anything beyond the storage, such as any reference to characters or Unicode, because such implications are largely inaccurate, and anywhere they are accurate is a bug. All of this leads me to prefer "upgraded" over "utf8", "unicode", "uses_wide_storage", and the like.

I don't have any strong opinion about which package any new names for these functions should appear in. I think on balance we should not remove the old names, because the trouble that arises from maintaining them is small compared to the hassle that would arise from requiring existing correct programs to change. Not removing them implies that we wouldn't even be deprecating them, as currently defined, but we can fairly discourage the use of the old names in documentation.

-zefram
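The "in-place side-effecting nature" of utf8::upgrade() and utf8::downgrade(), and the fact that they touch only the storage, can be seen in a short sketch. It uses only the existing utf8:: functions; the @states array and other variable names are illustrative.

```perl
use strict;
use warnings;

my $s      = "caf\x{E9}";
my $before = $s;
my @states;

push @states, utf8::is_utf8($s) ? "upgraded" : "downgraded";

# upgrade() modifies its argument in place; it does not return a new string.
utf8::upgrade($s);
push @states, utf8::is_utf8($s) ? "upgraded" : "downgraded";

# The string value is untouched: same characters, same length.
my $unchanged = ( $s eq $before && length($s) == 4 ) ? 1 : 0;

# downgrade() is also in place and reverses the storage change.
utf8::downgrade($s);
push @states, utf8::is_utf8($s) ? "upgraded" : "downgraded";

# A string containing a code point above 0xFF cannot be downgraded.
# With a true second argument (FAIL_OK) the call returns false
# instead of dying.
my $wide          = "\x{263A}";
my $downgraded_ok = utf8::downgrade( $wide, 1 ) ? 1 : 0;

print "states: @states; value unchanged: $unchanged; ",
      "wide string downgradable: $downgraded_ok\n";
```

Nothing observable through ordinary string operations changes between the three states, which is why the thread argues the name should describe storage and not encoding.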
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: Zefram <zefram [...] fysh.org>, perl5-porters [...] perl.org
Date: Mon, 10 Jul 2017 20:53:04 -0600
From: Karl Williamson <public [...] khwilliamson.com>
On 07/10/2017 02:13 PM, Zefram wrote:
> Sawyer X wrote:
>> Does anyone have any comments on this? Tony, Dave, Zefram? *Karl*? :)
> > I didn't want to add to a mostly bikeshedding discussion, but OK. > I concur that the existing names are poor, but I'm not much happier with > the names that have been suggested on this thread. I reckon the best > terminology we have for this flag, at the user level, is "upgraded", > and so the name "is_utf8" would be better as "is_upgraded". The existing > names "upgrade" and "downgrade" for the transforming operations are OK, > and the only change I'd potentially like to make to them would be to add > something that explicates their rather unusual in-place side-effecting > nature. > > In fact you can see all my preferred names in my CPAN module > Scalar::String. This module essentially attempts to be the sane version > of utf8.pm, attempting to impart the right mental model through its > function names and documentation. (The "sclstr_" prefix on all the > function names may be omitted if desired; the important part of the name > is that which distinguishes these functions from each other.) > > I think the names for these functions should be reasonably concise, > and in particular we should have a single-word adjective for "having > the SvUTF8 flag on" if possible. We should also try to reuse existing > terminology, rather than invent anything new. We should also avoid any > term that implies anything beyond the storage, such as any reference to > characters or Unicode, because such implications are largely inaccurate, > and anywhere they are accurate is a bug. All of this leads me to prefer > "upgraded" over "utf8", "unicode", "uses_wide_storage", and the like. > > I don't have any strong opinion about which package any new names for > these functions should appear in. I think on balance we should not > remove the old names, because the trouble that arises from maintaining > them is small compared to the hassle that would arise from requiring > existing correct programs to change. 
Not removing them implies that > we wouldn't even be deprecating them, as currently defined, but we can > fairly discourage the use of the old names in documentation. > > -zefram >
My view is that the current names could be improved, and that there should be no technical nor social problem in creating new names while retaining the old ones, but changing the docs to stress the new ones. I've done that a lot.

I don't know what namespace is best. At first blush Internals seems good to me, for this and other things that people currently have hacks for, like

    $foo & ""

which tries to find out if $foo is a string or just a number. I don't fully understand the objection to 'Internals'.

I have never liked upgrade and downgrade. When you upgrade something you are supposed to get something better, like more legroom. I have never seen why a PV is better than a number, or a UTF-8 string better than a non-UTF-8 one (it's far slower, for example, which is a downgrade in my estimation). The use of upgrade and downgrade is jargon based on the attitudes of the implementers, which should be avoided. Maybe it's too baked in to change, but I regret that it's there. UTF-8 itself is an implementation detail that should never have been exposed to the outside, but 'use utf8' pretty much does that.
RT-Send-CC: perl5-porters [...] perl.org
On Mon, 10 Jul 2017 19:53:42 -0700, public@khwilliamson.com wrote:
> I don't know what namespace is best. At first blush Internals seems > good to me, for this and other things that people currently have hacks > for, like > > $foo & "" > > which trying to find out if $foo is a string or just a number. I don't > fully understand the objection to 'Internals'
Adding new public functions to the Internals namespace would completely change its meaning. It contains functions that exist mainly for perl’s own functionality (for built-in modules like Hash::Util to use) and for testing perl itself. Users are not supposed to know about them. That the cat is out of the bag and we cannot remove them is unfortunate. Since we already use ‘utf8’ to refer to Perl’s Unicode support, why not continue to use that namespace? -- Father Chrysostomos
RT-Send-CC: perl5-porters [...] perl.org
On Mon, 10 Jul 2017 19:53:42 -0700, public@khwilliamson.com wrote:
> I have never liked upgrade and downgrade. When you upgrade something > you are supposed to get something better, like more legroom.
Well, er, that is exactly what you get. You can stretch your legs beyond CLV.*
> I have > never seen why a PV is better than a number, or a UTF-8 string better > than a non-one (it's far slower, for example,
I think that is one of the best arguments in favour of ‘upgrade’. It is just like upgrading most commercial software!
> which is a downgrade in my > estimation). The use of upgrade and downgrade is jargon based on the > attitudes of the implementers, which should be avoided. Maybe it's too > baked in to change, but I regret that it's there. UTF-8 itself is an > implementation detail that should never have been exposed to the > outside, but 'use utf8' pretty much does that.
* That is a Roman numeral. -- Father Chrysostomos
Date: Tue, 11 Jul 2017 08:54:24 +0100
From: Dave Mitchell <davem [...] iabyn.com>
CC: pali [...] cpan.org, perlbug-followup [...] perl.org
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: Sawyer X <xsawyerx [...] gmail.com>
On Mon, Jul 10, 2017 at 12:45:48PM -0400, Sawyer X wrote:
> Does anyone have any comments on this? Tony, Dave, Zefram? *Karl*? :)
My opinion on this sort of proposal (and it's an opinion which has gotten stronger over time (*)) is rarely/never to add a new alias name to an existing function.

Alias names just increase the cognitive load. If the old names were confusing, having more names will just increase the confusion.

Before, you would have to remember that a particular function foo() is badly named and doesn't do what you might expect it to do, based solely on the name.

Afterwards, you have to remember that there are two functions foo() and bar(), one is deprecated (which one?), one is badly named (which one?), but they both do the same thing (Or do they? Sigh. Let's check the documentation one more time).

Life is now harder.

(*) My opinion firmed over AvFILL(). It was a weird name, but I was used to it. Now I can never remember what the new alias is called (just looked it up - av_top_index()). In hindsight, I would have voted against adding av_top_index.

--
All wight. I will give you one more chance. This time, I want to hear no Wubens. No Weginalds. No Wudolf the wed-nosed weindeers.
    -- Life of Brian
RT-Send-CC: perl5-porters [...] perl.org
On Tue, 11 Jul 2017 00:55:51 -0700, davem wrote:
> On Mon, Jul 10, 2017 at 12:45:48PM -0400, Sawyer X wrote:
> > Does anyone have any comments on this? Tony, Dave, Zefram? *Karl*? :)
> > My opinion on this sort of proposal (and it's an opinion which has gotten > stronger over time (*)) is rarely/never to add a new alias name to an > existing function. > > Alias names just increase the cognitive load. If the old names were > confusing, having more names will just increase the confusion. > > Before, you would have to remember that a particular function foo() is > badly named and doesn't do what you might expect it to do, based solely on > the name. > > Afterwards, you have to remember that that are two functions foo() and > bar(), one is deprecated (which one?), one is badly named (which one?), > but they both do the same thing (Or do they? Sigh. Let's check the > documentation one more time). > > Life is now harder. > > (*) My opinion firmed over AvFILL(). It was a weird name, but I was used to > it. Now I can never remember what the new alias is called (just looked > it up - av_top_index()). In hindsight, I would have voted against adding > av_top_index.
I agree with everything you have said. I brought up the same objection when this proposal was first put forward, but I thought I had lost the debate. Well, at least there are two of us now. :-) -- Father Chrysostomos
RT-Send-CC: perl5-porters [...] perl.org
On Mon, 10 Jul 2017 09:46:48 -0700, xsawyerx@gmail.com wrote:
> Does anyone have any comments on this? Tony, Dave, Zefram? *Karl*? :)
I haven't seen names I prefer over the current names, certainly none that are improved enough that it's worth having two names for the same thing. Tony
To: perl5-porters [...] perl.org
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
CC: perlbug-followup [...] perl.org
From: "H.Merijn Brand" <h.m.brand [...] xs4all.nl>
Date: Wed, 12 Jul 2017 08:36:49 +0200
On Tue, 11 Jul 2017 10:41:37 -0700, "Father Chrysostomos via RT" <perlbug-followup@perl.org> wrote:
> On Tue, 11 Jul 2017 00:55:51 -0700, davem wrote:
> > On Mon, Jul 10, 2017 at 12:45:48PM -0400, Sawyer X wrote:
> > > Does anyone have any comments on this? Tony, Dave, Zefram? *Karl*? :)
> > > > My opinion on this sort of proposal (and it's an opinion which has gotten > > stronger over time (*)) is rarely/never to add a new alias name to an > > existing function. > > > > Alias names just increase the cognitive load. If the old names were > > confusing, having more names will just increase the confusion. > > > > Before, you would have to remember that a particular function foo() is > > badly named and doesn't do what you might expect it to do, based solely on > > the name. > > > > Afterwards, you have to remember that that are two functions foo() and > > bar(), one is deprecated (which one?), one is badly named (which one?), > > but they both do the same thing (Or do they? Sigh. Let's check the > > documentation one more time). > > > > Life is now harder. > > > > (*) My opinion firmed over AvFILL(). It was a weird name, but I was used to > > it. Now I can never remember what the new alias is called (just looked > > it up - av_top_index()). In hindsight, I would have voted against adding > > av_top_index.
> > I agree with everything you have said. I brought up the same > objection when this proposal was first put forward, but I thought I > had lost the debate. Well, at least there are two of us now. :-)
Count me in: three. I like the way Dave has written down my feelings :) -- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.27 porting perl5 on HP-UX, AIX, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: "H.Merijn Brand" <h.m.brand [...] xs4all.nl>, perl5-porters [...] perl.org
Date: Wed, 12 Jul 2017 22:53:57 -0600
From: Karl Williamson <public [...] khwilliamson.com>
CC: perlbug-followup [...] perl.org
On 07/12/2017 12:36 AM, H.Merijn Brand wrote:
> On Tue, 11 Jul 2017 10:41:37 -0700, "Father Chrysostomos via RT" > <perlbug-followup@perl.org> wrote: >
>> On Tue, 11 Jul 2017 00:55:51 -0700, davem wrote:
>>> On Mon, Jul 10, 2017 at 12:45:48PM -0400, Sawyer X wrote:
>>>> Does anyone have any comments on this? Tony, Dave, Zefram? *Karl*? :)
>>> >>> My opinion on this sort of proposal (and it's an opinion which has gotten >>> stronger over time (*)) is rarely/never to add a new alias name to an >>> existing function. >>> >>> Alias names just increase the cognitive load. If the old names were >>> confusing, having more names will just increase the confusion. >>> >>> Before, you would have to remember that a particular function foo() is >>> badly named and doesn't do what you might expect it to do, based solely on >>> the name. >>> >>> Afterwards, you have to remember that that are two functions foo() and >>> bar(), one is deprecated (which one?), one is badly named (which one?), >>> but they both do the same thing (Or do they? Sigh. Let's check the >>> documentation one more time). >>> >>> Life is now harder. >>> >>> (*) My opinion firmed over AvFILL(). It was a weird name, but I was used to >>> it. Now I can never remember what the new alias is called (just looked >>> it up - av_top_index()). In hindsight, I would have voted against adding >>> av_top_index.
>> >> I agree with everything you have said. I brought up the same >> objection when this proposal was first put forward, but I thought I >> had lost the debate. Well, at least there are two of us now. :-)
> > Count me in: three. I like the way Dave has written down my feelings :) >
I guess we have a fundamental disagreement about language design and the direction Perl should go, which makes me sad. The point of adding synonyms for deceptively-named functions and macros is to make life easier overall. Forbidding new better-named synonyms for problematically named things forces everyone who comes along to deal with the gotchas and cognitive load that those people already here have had to deal with. By creating better named things, those people can largely avoid these problems. This allows them to work more efficiently, avoiding traps, and with less cursing Perl. Unless Perl is close to death, the number of people who are going to come along before it does die dwarfs the number who are already expert. Some people are knowledgeable in parts of Perl, but not all. They also gain if gotchas get removed before they have to deal with them. Specifically about av_top_index, I don't believe that it is so poorly named that you have to keep consulting the documentation as to what it does. It came about not because of AvFILL, but because of the already-existing synonym, the evilly named "av_len". This name implies it gives a length, but in fact it is one-off from that. av_top_index, though cumbersome, accurately indicates what it returns. Using av_len is a bug waiting to happen. It is a foreseeable problem. I believe that it would be unethical to not create a non-deceptive alternative. It's kind of like a safety recall. Writing code using deceptively named things or with poor API's is slower and more error prone. Every time you use one, you have to get out of your mental pipeline and recall that this is a gotcha, and have to figure out how it is a gotcha and how you have to compensate. You are effectively flushing your mental instruction cache. In the case of av_len, you have to remember which way is the off-by-one problem here. Code reviews also are affected. It is just too easy to read the thing and forget that it doesn't do what you would want. 
In researching the issue back when av_top_index was created, I found published modules that used av_len, as its name implies, as a length. Others undoubtedly had caught the problem earlier, say through their unit testing. But all this could be avoided by the code using a non-deceptive name. Hopefully, the coder won't even be aware that there exist deceptive ones for hysterical reasons. It is foreseeable that av_len is going to cause problems. It would be irresponsible of us to not create a non-deceptive synonym when it is so easy to do. No one was really happy with "av_top_index" as a name. So AvFILL was retained in the core. All occurrences of av_len were removed. If we could have come up with a short, pithy synonym, we would have replaced AvFILL as well, and then people looking at the core would have seen that and gotten used to it, and over time the memory of the less well-named versions would have faded. Writing good APIs is hard. I have flattered myself at times into thinking I'm good at it. Maybe I am actually good, but if so, I'm still not good enough. And few, if any, are. If we have a poor API in some area, we should not tie our hands and say tough to all those people who come along later, and give them more reason to use some other language
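The off-by-one trap described for av_len has a Perl-level analogue that may make the distinction concrete. This is only an illustration in Perl, not XS code: it assumes, per the description above, that av_len and av_top_index report the highest index, which corresponds to `$#array`, while the length corresponds to `scalar(@array)`.

```perl
use strict;
use warnings;

my @items = qw(a b c);

my $top_index = $#items;          # highest valid index
my $count     = scalar @items;    # number of elements: top index + 1

my @empty;
my $empty_top = $#empty;          # an empty array has top index -1

print "top index: $top_index, count: $count, empty top index: $empty_top\n";
```

Code that treats the top index as a count silently processes one element too few, which is the kind of bug the message above says was found in published modules.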
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: Karl Williamson <public [...] khwilliamson.com>
CC: perl5-porters [...] perl.org, perlbug-followup [...] perl.org
From: "H.Merijn Brand" <h.m.brand [...] xs4all.nl>
Date: Thu, 13 Jul 2017 08:43:48 +0200
On Wed, 12 Jul 2017 22:53:57 -0600, Karl Williamson <public@khwilliamson.com> wrote:
> It came about not because of AvFILL, but because of the already-existing > synonym, the evilly named "av_len". This name implies it gives a > length, but in fact it is one-off from that. av_top_index, though > cumbersome, accurately indicates what it returns.
The problem with av_top_index is that it has not (yet) been ported to Devel::PPPort, so I cannot change any XS code that uses av_len into using the new function if that XS is to support 5.16.0 or older

$ ack av_top_index
ppport.h
1225:av_top_index||5.017009|
$

I know I didn't quote all of your message and I understand your motivation, but the problem for these misnamed functions is much wider than the scope of av_top_index, which is *only* available to XS, and XS is more or less easy to fix by adding stuff to Devel::PPPort

For the utf8 functions, the scope is WAY wider: it is used from pure-perl, and renaming them (with or without aliases) would cause major brain damage for all authors that use these functions (correct or incorrect) when their code has to work on a wide range of perl versions.

To be honest, I do not see an easy way out of that dilemma. If you have one, I'm open to change for the better.

--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.27 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
From: pali [...] cpan.org
Date: Thu, 13 Jul 2017 09:29:26 +0200
To: perlbug-followup [...] perl.org
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
On Wednesday 12 July 2017 23:44:39 H. Merijn Brand via RT wrote:
> On Wed, 12 Jul 2017 22:53:57 -0600, Karl Williamson > <public@khwilliamson.com> wrote: >
> > It came about not because of AvFILL, but because of the already-existing > > synonym, the evilly named "av_len". This name implies it gives a > > length, but in fact it is one-off from that. av_top_index, though > > cumbersome, accurately indicates what it returns.
> > The problem with av_top_index is that it hat not (yet) been ported to > Devel::PPPort,
Devel::PPPort is probably unmaintained... It has had a couple of bugs open since 2015 without any comments, and pull requests have not been processed since 2016, even security-related ones like this: https://github.com/mhx/Devel-PPPort/pull/47 Because of those problems, I have no motivation to prepare any other patch for Devel::PPPort. For dead/unmaintained modules it is useless.
> so I cannot change any XS code that uses av_len into > using the new function if that XS is to support 5.16.0 or older > > $ ack av_top_index > ppport.h > 1225:av_top_index||5.017009| > $ > > I know I didn't quote all of your message and I understand your > motivation, but the problem for these misnamed functions is much wider > than the scope of av_top_index, which is *only* available to XS, and XS > is more or less easy to fix by adding stuff to Devel::PPPort > > For the utf8 functions, the scope is WAY wider: it is used from > pure-perl, and renaming them (with or without aliases) would cause > major brain damage for all authors that use these functions (correct or > incorrect) when their code has to work on a wide range of perl versions. > > To be honest, I do not see an easy way out of that dilemma. If you have > one, I'm open to change for the better.
The problem is that people very often use the construct I showed in the first comment. Or they read "is_utf8" as meaning that the string is UTF-8 encoded and conclude that they need to call utf8::decode() on it. All this happens just because of a wrong name, from which many people deduce what the function should do without reading the documentation...

If we do not add better aliases, broken code will continue to be produced on CPAN.

As utf8::is_utf8() is not needed too often, backward compatibility can be achieved by:

*NEW_NAME = *utf8::is_utf8;

I think this is a good compromise.

If you think that the upgrade and downgrade function names are fine, OK, but at least please add a better name for is_utf8(). In the original email I suggested is_upgraded(), so the name would be bound to the "upgrade()" function, because it really checks whether upgrade() was called or not.
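To make the failure mode and the alias idea concrete, here is a minimal sketch. The name `is_upgraded` is the one proposed in this thread and is assumed here to be absent from core; `encode_if_flagged` and `encode_always` are hypothetical helpers written only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical alias, defined by the calling code itself so that the
# same source also runs on perls that only provide utf8::is_utf8().
*is_upgraded = \&utf8::is_utf8;

# The incorrect pattern from the original report: encode only when
# the internal flag happens to be set.
sub encode_if_flagged {
    my ($s) = @_;
    utf8::encode($s) if is_upgraded($s);
    return $s;
}

# Encoding a character string to UTF-8 octets does not depend on the flag.
sub encode_always {
    my ($s) = @_;
    utf8::encode($s);
    return $s;
}

my $downgraded = "caf\x{E9}";   # four characters, byte storage
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);       # the same four characters, UTF-8 storage

# Octet counts produced by each approach for the same string value.
my %octets = (
    flagged_down => length encode_if_flagged($downgraded),
    flagged_up   => length encode_if_flagged($upgraded),
    always_down  => length encode_always($downgraded),
    always_up    => length encode_always($upgraded),
);

printf "%s=%d\n", $_, $octets{$_} for sort keys %octets;
```

The flag-dependent helper returns different octets for two scalars that compare equal, while the unconditional one is consistent, which is the bug class the rename is meant to discourage.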
RT-Send-CC: perl5-porters [...] perl.org
On Wed, 12 Jul 2017 21:55:03 -0700, public@khwilliamson.com wrote:
> I guess we have a fundamental disagreement about language design and > the > direction Perl should go, which makes me sad.
I agree the disagreement is unfortunate.
> > The point of adding synonyms for deceptively-named functions and > macros > is to make life easier overall. Forbidding new better-named synonyms > for problematically named things forces everyone who comes along to > deal > with the gotchas and cognitive load that those people already here > have > had to deal with. By creating better named things, those people can > largely avoid these problems. This allows them to work more > efficiently, avoiding traps, and with less cursing Perl.
When you first put forward this argument (specifically with regard to av_len), it made sense to me, and I had no objection to it. Later, people wrote to p5p complaining that the new situation was more confusing; in addition, *I* started to get confused. That was when I started to have second thoughts.

I think Damian Conway was right when he wrote in PBP that one should not use English (the module). Since other people use punctuation variables, you are going to have to learn them anyway. Using the English names just forces others reading your code to look up the names that you are using. It just creates more cognitive burden.

I think the same applies even to poorly named functions. You just have to learn the gotcha once, and then you can use the function and read code that uses it. (And if you use functions without reading either the documentation or the source, then you are coming close to what I would call autopodotoxy.)
> Unless Perl is close to death, the number of people who are going to > come along before it does die dwarfs the number who are already > expert. > Some people are knowledgeable in parts of Perl, but not all. They > also gain if gotchas get removed before they have to deal with them.
But the gotchas never get removed. You just end up with a larger pile of functions for people to sift through. Not to mention a lot of existing (and correct) code that they cannot read without learning the discouraged parlance. So they have to learn the different forms anyway. My personal experience is that what you are arguing for, while it sounds good, does not work in practice. -- Father Chrysostomos
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: Perl RT Bug Tracker <perlbug-followup [...] perl.org>
CC: Perl5 Porteros <perl5-porters [...] perl.org>
From: demerphq <demerphq [...] gmail.com>
Date: Fri, 14 Jul 2017 10:05:39 +0200
On 14 July 2017 at 04:28, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:
> On Wed, 12 Jul 2017 21:55:03 -0700, public@khwilliamson.com wrote:
>> I guess we have a fundamental disagreement about language design and >> the >> direction Perl should go, which makes me sad.
> > I agree the disagreement is unfortunate. >
>> >> The point of adding synonyms for deceptively-named functions and >> macros >> is to make life easier overall. Forbidding new better-named synonyms >> for problematically named things forces everyone who comes along to >> deal >> with the gotchas and cognitive load that those people already here >> have >> had to deal with. By creating better named things, those people can >> largely avoid these problems. This allows them to work more >> efficiently, avoiding traps, and with less cursing Perl.
> > When you first put forward this argument (specifically with regard to av_len), it made sense to me, and I had no objection to it. Later, people wrote to p5p complaining that the new situation was more confusing; in addition, *I* started to get confused. That was when I started to have second thoughts. > > I think Damian Conway was right when he wrote in PBP that one should not use English (the module). Since other people use punctuation variables, you are going to have to learn them anyway. Using the English names just forces others reading your code to look up the names that you are using. It just creates more cognitive burden. > > I think the same applies even to poorly named functions. You just have to learn the gotcha once, and then you can use the function and read code that uses it. (And if you use functions without reading either the documentation or the source, then you are coming close to what I would call autopodotoxy.) >
>> Unless Perl is close to death, the number of people who are going to >> come along before it does die dwarfs the number who are already >> expert. >> Some people are knowledgeable in parts of Perl, but not all. They >> also gain if gotchas get removed before they have to deal with them.
> > But the gotchas never get removed. You just end up with a larger pile of functions for people to sift through. Not to mention a lot of existing (and correct) code that they cannot read without learning the discouraged parlance. So they have to learn the different forms anyway. > > My personal experience is that what you are arguing for, while it sounds good, does not work in practice.
I think the reason that it sounds good is because it does make sense at a micro level. If you are working on company code for instance, or a small code-base, renaming poorly named things means that the old name is *gone*, and cognitive burden is reduced. But with something like Perl we can't just get rid of things; if we want to rename we have to do something for all the older code out there. So we have to support both in some ways, which means the cognitive burden is increased.

Despite this I think sometimes these things *can* be justified and managed, but we have to be extremely careful about the choices we make, and have real plans in place to deprecate the older use cases in some kind of way.

So for instance if we were going to get rid of Internals then we can rename the things it contained, and then bundle an Internals.pm which does the right thing; people needing back compat can add 'use Internals' and get the back-compat.

So I could see us considering the ideas in this thread in the context of the proposed introduction of 'array', 'scalar', etc.

yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
To: Karl Williamson <public [...] khwilliamson.com>, "H.Merijn Brand" <h.m.brand [...] xs4all.nl>, perl5-porters [...] perl.org
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
Date: Mon, 17 Jul 2017 10:46:59 +0200
From: Sawyer X <xsawyerx [...] gmail.com>
CC: perlbug-followup [...] perl.org
[Top-posted]

I have mixed thoughts about this.

I'm sympathetic to both considerations: Having properly-named functions to reduce confusion for future developers (we hope to have some, right?) but not introduce additional cognitive load for existing developers.

A few ways to make such a situation easier:

* Document utf8::is_utf8() to prevent this confusion: This is by far the first thing that should be done. I have double checked the wording for utf8::is_utf8() from my blead (978b185):

    (Since Perl 5.8.1) Test whether $string is marked internally as encoded in UTF-8. Functionally the same as "Encode::is_utf8()".

This is confusing, to say the least. "Marked internally" is the words core hackers are looking for and recognize, but "UTF-8" is what non-core hackers (those without the cognitive bias in core terms) see and understand. If we head over to Encode::is_utf8() we see:

    [INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/. If /CHECK/ is true, also checks whether /STRING/ contains well-formed UTF-8. Returns true if successful, false otherwise.

    As of Perl 5.8.1, utf8 <https://metacpan.org/pod/utf8> also has the |utf8::is_utf8| function.

I like this wording better for several reasons: It is under the title "Messing with Perl's Internals"; it notes the "UTF8" flag, and it adds that it checks for well-formed UTF-8 only if that flag is true. There are improvements to be made here too. We can note what the flag means (subtle, complicated, bike-shed-able) or at the very least add a nice "this isn't the flag you're looking for" warning. We can also suggest when to use and when not to use the function (otherwise it's left to the reader, who can easily get it wrong, which is why we're here).

If the document on both was better, then we could have possibly left this as unfortunate naming errors we're carrying with us (along with "wantarray" for noting whether the context is scalar, list, or void).

* Provide different functions and document all functions in all other functions

If we decide to have better named functions, we will have additional cognitive load for both experienced core developers and new developers. For core developers, it is a muscle memory to undo and two different sets of code to deal with - those with the old name and those with the new name. For new developers, it will be simple at first, until you come in contact with the old name. It is likely this will also happen early, so you need to learn two names anyway. However, their muscle memory will be geared towards using a more descriptive name.

I mix those when it comes to English.pm. I use $_, $@, $!, $#, $/, $^X, $0 and a few more, but I use English.pm for $<, $>, $(, $), $", and a few more. The reasoning is simple: $_, $@, $!, and $# are so common it will be built into every muscle memory. On the other hand, for many developers, if they see $<, they will need to look it up in perlvar anyway. However, $UID or $REAL_USER_ID is readable right away and no need to look it up.

One additional point about English is that, unlike what we're suggesting here, the punctuation variable names are the right name, they're just not descriptive. is_utf8() is not about descriptive, but misleading. It is a misnomer. It makes it an undesired pitfall.

I see value in adding proper names, but then we would need to take care of at least making all possible names available in the documentation of all other names. If you're reading utf8.pm, you need to find "is_upgraded" in "is_utf8" and "is_utf8" in "is_upgraded"[1]. This makes it easy to quickly find what they mean and differentiate when we see different names.

* Move all known usages in core to new functions

Another way to improve this new cognitive load is by reducing it in the codebase. Removing as many instances of the old name will reduce the mixture of names, thus helping us move towards the new name. This is a much more intrusive change but has a high potential of helping seasoned developers to deal with the new name.

* Automated policies for improving CPAN code quality

This is beyond the scope of core, but I think it's worthwhile taking into account the perspective of the community. Realizing the misused "is_utf8" brings with it a question of whether and how we could reduce this problem's scope outside the core, and this could have been done with a kwalitee check (CPANTS[2]) that checked for "is_utf8" and recommends reviewing its use.

This is far more complicated since there is a legitimate (but narrow) use for it, and you might get false positives. I believe only a human could find the situations in which it's valuable. Still, it is worthwhile keeping in mind.

Overall, I'm still undecided. Maybe we could start with improving the existing documentation?

[1] Using "is_upgraded" as an example different name.
[2] http://cpants.cpanauthors.org/

On 07/13/2017 06:53 AM, Karl Williamson wrote:
> On 07/12/2017 12:36 AM, H.Merijn Brand wrote:
>> On Tue, 11 Jul 2017 10:41:37 -0700, "Father Chrysostomos via RT" >> <perlbug-followup@perl.org> wrote: >>
>>> On Tue, 11 Jul 2017 00:55:51 -0700, davem wrote:
>>>> On Mon, Jul 10, 2017 at 12:45:48PM -0400, Sawyer X wrote:
>>>>> Does anyone have any comments on this? Tony, Dave, Zefram? *Karl*? :)
>>>> >>>> My opinion on this sort of proposal (and it's an opinion which has >>>> gotten >>>> stronger over time (*)) is rarely/never to add a new alias name to an >>>> existing function. >>>> >>>> Alias names just increase the cognitive load. If the old names were >>>> confusing, having more names will just increase the confusion. >>>> >>>> Before, you would have to remember that a particular function foo() is >>>> badly named and doesn't do what you might expect it to do, based >>>> solely on >>>> the name. >>>> >>>> Afterwards, you have to remember that that are two functions foo() and >>>> bar(), one is deprecated (which one?), one is badly named (which >>>> one?), >>>> but they both do the same thing (Or do they? Sigh. Let's check the >>>> documentation one more time). >>>> >>>> Life is now harder. >>>> >>>> (*) My opinion firmed over AvFILL(). It was a weird name, but I was >>>> used to >>>> it. Now I can never remember what the new alias is called (just looked >>>> it up - av_top_index()). In hindsight, I would have voted against >>>> adding >>>> av_top_index.
>>> >>> I agree with everything you have said. I brought up the same >>> objection when this proposal was first put forward, but I thought I >>> had lost the debate. Well, at least there are two of us now. :-)
>> >> Count me in: three. I like the way Dave has written down my feelings :) >>
> > I guess we have a fundamental disagreement about language design and > the direction Perl should go, which makes me sad. > > The point of adding synonyms for deceptively-named functions and > macros is to make life easier overall. Forbidding new better-named > synonyms for problematically named things forces everyone who comes > along to deal with the gotchas and cognitive load that those people > already here have had to deal with. By creating better-named things, > those people can largely avoid these problems. This allows them to > work more efficiently, avoiding traps, and with less cursing Perl. > > Unless Perl is close to death, the number of people who are going to > come along before it does die dwarfs the number who are already > expert. Some people are knowledgeable in parts of Perl, but not > all. They also gain if gotchas get removed before they have to deal > with them. > > Specifically about av_top_index, I don't believe that it is so poorly > named that you have to keep consulting the documentation as to what it > does. > > It came about not because of AvFILL, but because of the > already-existing synonym, the evilly named "av_len". This name > implies it gives a length, but in fact it is one-off from that. > av_top_index, though cumbersome, accurately indicates what it returns. > > Using av_len is a bug waiting to happen. It is a foreseeable problem. > I believe that it would be unethical to not create a non-deceptive > alternative. It's kind of like a safety recall. > > Writing code using deceptively named things or with poor API's is > slower and more error prone. Every time you use one, you have to get > out of your mental pipeline and recall that this is a gotcha, and have > to figure out how it is a gotcha and how you have to compensate. You > are effectively flushing your mental instruction cache. In the case > of av_len, you have to remember which way is the off-by-one problem here. > > Code reviews also are affected. 
It is just too easy to read the thing > and forget that it doesn't do what you would want. > > In researching the issue back when av_top_index was created, I found > published modules that used av_len, as its name implies, as a length. > Others undoubtedly had caught the problem earlier, say through their > unit testing. > > But all this could be avoided by the code using a non-deceptive name. > Hopefully, the coder won't even be aware that there exist deceptive > ones for hysterical reasons. > > It is foreseeable that av_len is going to cause problems. It would be > irresponsible of us to not create a non-deceptive synonym when it is > so easy to do. > > No one was really happy with "av_top_index" as a name. So AvFILL was > retained in the core. All occurrences of av_len were removed. If we > could have come up with a short, pithy synonym, we would have replaced > AvFILL as well, and then people looking at the core would have seen > that and gotten used to it, and over time the memory of the less > well-named versions would have faded. > > Writing good APIs is hard. I have flattered myself at times into > thinking I'm good at it. Maybe I am actually good, but if so, I'm > still not good enough. And few, if any, are. If we have a poor API > in some area, we should not tie our hands and say tough to all those > people who come along later, and give them more reason to use some > other language
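The off-by-one problem described above for av_len has a rough Perl-level analogue, which may make the argument easier to follow. This is only a sketch: it assumes, as the quoted message says, that av_len returns the top index (one less than the length), much like `$#array` does in Perl code. The array contents are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical data, only for illustration.
my @queue = ('a', 'b', 'c');

my $top_index = $#queue;         # highest valid index
my $length    = scalar @queue;   # number of elements

# Mistaking the top index for a length silently drops the last element.
my @wrong = @queue[0 .. $top_index - 1];
my @right = @queue[0 .. $length - 1];

printf "top index=%d length=%d wrong=%d right=%d\n",
    $top_index, $length, scalar @wrong, scalar @right;
```

The point of the comparison is that the bug produces no error or warning; only the name of the accessor hints at which of the two values you are holding.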
RT-Send-CC: perl5-porters [...] perl.org
On Mon, 17 Jul 2017 01:47:32 -0700, xsawyerx@gmail.com wrote:
> I have mixed thoughts about this.
Me too.
> If we decide to have better named functions [...] > For new developers, it will be simple at first, until you come > in contact with the old name. It is likely this will also happen early, > so you need to learn two names anyway. [...]
I just want to reach a bit deeper into "likely this will also happen early" - in my limited experience, most people working in a perl shop tend to read lots of code in their local codebase, but rarely read code outside of it (not even for the CPAN modules they're using). So wherever the local codebase gets cleaned up, it may not happen that early for a good proportion of new developers.

Maybe you had in mind primarily historical threads googled up from perlmonks, stackoverflow and the like; I agree those are less likely to get cleaned up. I cannot guess what proportion of newish developers would hit those; I can't even guess what proportion of me would hit those had I been born 30 years later.

Hugo
CC: Karl Williamson <public [...] khwilliamson.com>, "H.Merijn Brand" <h.m.brand [...] xs4all.nl>, perl5-porters [...] perl.org, perlbug-followup [...] perl.org
Date: Tue, 18 Jul 2017 10:53:53 +1000
From: Tony Cook <tony [...] develop-help.com>
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: Sawyer X <xsawyerx [...] gmail.com>
On Mon, Jul 17, 2017 at 10:46:59AM +0200, Sawyer X wrote:
> [Top-posted] > > I have mixed thoughts about this. > > I'm sympathetic to both considerations: Having properly-named functions > to reduce confusion for future developers (we hope to have some, right?) > but not introduce additional cognitive load for existing developers. > > A few ways to make such a situation easier: > > * Document utf8::is_utf8() to prevent this confusion: This is by far the > first thing that should be done. I have double checked the wording for > utf8::is_utf8() from my blead (978b185): > > (Since Perl 5.8.1) Test whether $string is marked internally as > encoded in UTF-8. Functionally the same as "Encode::is_utf8()". > > This is confusing, to say the least. "Marked internally" is the words > core hackers are looking for and recognize, but "UTF-8" is what non-core > hackers (those without the cognitive bias in core terms) see and > understand. If we head over to Encode::is_utf8() we see: > > [INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/. > If /CHECK/ is true, also checks whether /STRING/ contains > well-formed UTF-8. Returns true if successful, false otherwise. > > As of Perl 5.8.1, utf8 <https://metacpan.org/pod/utf8> also has the > |utf8::is_utf8| function. > > I like this wording better for several reasons: It is under the title > "Messing with Perl's Internals"; it notes the "UTF8" flag, and it adds > that it checks for well-formed UTF-8 only if that flag is true. There > are improvements to be made here too. We can note what the flag means > (subtle, complicated, bike-shed-able) or at the very least add a nice > "this isn't the flag you're looking for" warning. We can also suggest > when to use and when not to use the function (otherwise it's left to the > reader, who can easily get it wrong, which is why we're here).
utf8::is_utf8() doesn't accept the second parameter and does no validity checks (we have utf8::valid() for that), despite the note in utf8.pm.
> If the document on both was better, then we could have possibly left > this as unfortunate naming errors we're carrying with us (along with > "wantarray" for noting whether the context is scalar, list, or void).
...
> Overall, I'm still undecided. Maybe we could start with improving the > existing documentation?
Perhaps something like:

>>

=item * C<$flag = utf8::is_utf8($string)>

(Since Perl 5.8.1) Test whether I<$string> is marked internally as
encoded in UTF-8. Functionally the same as C<Encode::is_utf8($string)>.
Typically only necessary for debugging.

If you need to force Unicode semantics for code that needs to be
compatible with perls older than 5.12, call C<utf8::upgrade($string)>
unconditionally.

Using this flag to decide whether a string should be treated as
already encoded bytes or characters is wrong, this should be decided
as part of the interface of your function.

If you're accepting bytes:

  utf8::downgrade($string); # throws an exception if code point over 0xFF

  utf8::downgrade($string, 1) # our own error handling
    or die "\$string must be representable as bytes"

or if you're accepting characters and need encoded bytes:

  utf8::encode($string); # unconditionally

The only exception is if you're dealing with filenames, since perl
uses the internal representation of the string for system calls.

<<

Are there any other cases someone might be tempted to call
utf8::is_utf8()?

Tony
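A small sketch may help show why the proposed text calls the flag an internal detail. It assumes nothing beyond the core utf8:: functions named in this thread; the helper `chars_to_utf8_bytes` is a hypothetical name, used only to illustrate an interface that always takes characters and always returns encoded bytes.

```perl
use strict;
use warnings;

# Two copies of the same one-character string, U+00E9.
my $narrow = "\x{e9}";
my $wide   = "\x{e9}";

utf8::downgrade($narrow);   # keep it stored as a single byte
utf8::upgrade($wide);       # store it internally as UTF-8

# Same content as far as pure Perl code can tell ...
my $same_content = ($narrow eq $wide) && (length($narrow) == length($wide));

# ... but the internal flag differs.
my $narrow_flag = utf8::is_utf8($narrow) ? 1 : 0;
my $wide_flag   = utf8::is_utf8($wide)   ? 1 : 0;

# Hypothetical helper: the interface says "characters in, UTF-8 bytes out",
# so it encodes unconditionally and never consults the flag.
sub chars_to_utf8_bytes {
    my ($chars) = @_;
    my $bytes = $chars;      # copy, so the caller's string is untouched
    utf8::encode($bytes);
    return $bytes;
}

my $from_narrow = chars_to_utf8_bytes($narrow);
my $from_wide   = chars_to_utf8_bytes($wide);

printf "same content: %d, flags: %d/%d, same bytes: %d\n",
    $same_content ? 1 : 0, $narrow_flag, $wide_flag,
    $from_narrow eq $from_wide ? 1 : 0;
```

Because the helper ignores the flag, both representations yield the same two bytes, which is the behaviour the proposed documentation asks callers to rely on.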
To: Tony Cook <tony [...] develop-help.com>
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
From: "H.Merijn Brand" <h.m.brand [...] xs4all.nl>
Date: Tue, 18 Jul 2017 09:04:06 +0200
CC: Sawyer X <xsawyerx [...] gmail.com>, Karl Williamson <public [...] khwilliamson.com>, perl5-porters [...] perl.org, perlbug-followup [...] perl.org
On Tue, 18 Jul 2017 10:53:53 +1000, Tony Cook <tony@develop-help.com> wrote:
> On Mon, Jul 17, 2017 at 10:46:59AM +0200, Sawyer X wrote:
> > [Top-posted] > > > > I have mixed thoughts about this. > > > > I'm sympathetic to both considerations: Having properly-named functions > > to reduce confusion for future developers (we hope to have some, right?) > > but not introduce additional cognitive load for existing developers. > > > > A few ways to make such a situation easier: > > > > * Document utf8::is_utf8() to prevent this confusion: This is by far the > > first thing that should be done. I have double checked the wording for > > utf8::is_utf8() from my blead (978b185): > > > > (Since Perl 5.8.1) Test whether $string is marked internally as > > encoded in UTF-8. Functionally the same as "Encode::is_utf8()". > > > > This is confusing, to say the least. "Marked internally" is the words > > core hackers are looking for and recognize, but "UTF-8" is what non-core > > hackers (those without the cognitive bias in core terms) see and > > understand. If we head over to Encode::is_utf8() we see: > > > > [INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/. > > If /CHECK/ is true, also checks whether /STRING/ contains > > well-formed UTF-8. Returns true if successful, false otherwise. > > > > As of Perl 5.8.1, utf8 <https://metacpan.org/pod/utf8> also has the > > |utf8::is_utf8| function. > > > > I like this wording better for several reasons: It is under the title > > "Messing with Perl's Internals"; it notes the "UTF8" flag, and it adds > > that it checks for well-formed UTF-8 only if that flag is true. There > > are improvements to be made here too. We can note what the flag means > > (subtle, complicated, bike-shed-able) or at the very least add a nice > > "this isn't the flag you're looking for" warning. We can also suggest > > when to use and when not to use the function (otherwise it's left to the > > reader, who can easily get it wrong, which is why we're here).
> > utf8::is_utf8() doesn't accept the second parameter and does no > validity checks (we have utf8::valid() for that), despite the note in > utf8.pm. >
> > If the document on both was better, then we could have possibly left > > this as unfortunate naming errors we're carrying with us (along with > > "wantarray" for noting whether the context is scalar, list, or void).
> ...
> > Overall, I'm still undecided. Maybe we could start with improving the > > existing documentation?
> > Perhaps something like: >
> >>
> > =item * C<$flag = utf8::is_utf8($string)> > > (Since Perl 5.8.1) Test whether I<$string> is marked internally as > encoded in UTF-8. Functionally the same as C<Encode::is_utf8($string)>. > Typically only necessary for debugging. > > If you need to force Unicode semantics for code that needs to be > compatible with perls older than 5.12, call C<utf8::upgrade($string)> > unconditionally. > > Using this flag to decide whether a string should be treated as > already encoded bytes or characters is wrong, this should be decided > as part of the interface of your function. > > If you're accepting bytes: > > utf8::downgrade($string); # throws an exception if code point over 0xFF > > utf8::downgrade($string, 1) # our own error handling > or die "\$string must be representable as bytes" > > or if you're accepting characters and need encoded bytes: > > utf8::encode($string); # unconditionally > > The only exception is if you're dealing with filenames, since perl > uses the internal representation of the string for system calls. > > << > > Are there any other cases someone might be tempted to call > utf8::is_utf8()? > > Tony
I like this. What I miss here is a small example of how to guarantee preventing double encoding/decoding, as I think that is what this function is most often (erroneously) used for.

--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.27 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
CC: Tony Cook <tony [...] develop-help.com>, Sawyer X <xsawyerx [...] gmail.com>, Karl Williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>, Dave Mitchell via RT <perlbug-followup [...] perl.org>
Date: Tue, 18 Jul 2017 03:13:40 -0400
From: Dan Book <grinnz [...] gmail.com>
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: "H.Merijn Brand" <h.m.brand [...] xs4all.nl>
On Tue, Jul 18, 2017 at 3:04 AM, H.Merijn Brand <h.m.brand@xs4all.nl> wrote:

> I like this. What I miss here is a small example of how to guarantee
> preventing double encoding/decoding, as I think that is what this
> function is most often (erroneously) used for.


This isn't something that you can guarantee. It always depends on knowing how you get your input. When people don't understand this they look for the magic bullet that is_utf8 appears to be, but it is not.
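To make the point concrete, here is a minimal sketch of why the flag cannot serve as a double-encoding guard. The sample string and the sub name `guarded_encode` are made up for illustration; the sub reproduces the code pattern criticised in the original report, and the flag values assume a plain string literal without `use utf8`.

```perl
use strict;
use warnings;

my $chars = "caf\x{e9}";     # 4 characters, all below 0x100

# Encoding once is correct.
my $once = $chars;
utf8::encode($once);

# Encoding the result again is double encoding.
my $twice = $once;
utf8::encode($twice);

# Character data whose flag happens to be off ...
my $chars_flag = utf8::is_utf8($chars) ? 1 : 0;

# ... and already-encoded bytes whose flag happens to be on.
my $upgraded_bytes = $once;
utf8::upgrade($upgraded_bytes);       # same bytes, different storage
my $bytes_flag = utf8::is_utf8($upgraded_bytes) ? 1 : 0;

# The anti-pattern: let the flag decide whether to encode.
sub guarded_encode {
    my ($s) = @_;
    utf8::encode($s) if utf8::is_utf8($s);
    return $s;
}

my $skipped = guarded_encode($chars);            # never encoded
my $doubled = guarded_encode($upgraded_bytes);   # encoded a second time

printf "once=%d twice=%d skipped=%d doubled=%d\n",
    length($once), length($twice), length($skipped), length($doubled);
```

In both failing cases the flag answered a question about storage, not about whether the data had been encoded; only knowledge of where the input came from can answer the latter.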
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: Dan Book <grinnz [...] gmail.com>
From: "H.Merijn Brand" <h.m.brand [...] xs4all.nl>
Date: Tue, 18 Jul 2017 09:18:34 +0200
CC: Tony Cook <tony [...] develop-help.com>, Sawyer X <xsawyerx [...] gmail.com>, Karl Williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>, Dave Mitchell via RT <perlbug-followup [...] perl.org>
On Tue, 18 Jul 2017 03:13:40 -0400, Dan Book <grinnz@gmail.com> wrote:
> On Tue, Jul 18, 2017 at 3:04 AM, H.Merijn Brand <h.m.brand@xs4all.nl> wrote:
> > > > > > I like this. What I miss here is a small example of how to guarantee > > preventing double encoding/decoding, as I think that is what is > > function is most often (erroneously) used for.
> > This isn't something that you can guarantee. It always depends on knowing > how you get your input. When people don't understand this they look for the > magic bullet that is_utf8 appears to be, but it is not.
My point exactly. Just have a piece of text that tells the user why it isn't and what the best alternative *could* be.

--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.27 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
To: perlbug-followup [...] perl.org
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
Date: Tue, 18 Jul 2017 11:29:15 +0200
From: Sawyer X <xsawyerx [...] gmail.com>
CC: perl5-porters [...] perl.org
On 07/17/2017 10:09 PM, Hugo van der Sanden via RT wrote:
> On Mon, 17 Jul 2017 01:47:32 -0700, xsawyerx@gmail.com wrote:
>> I have mixed thoughts about this.
> Me too. >
>> If we decide to have better named functions [...] >> For new developers, it will be simple at first, until you come >> in contact with the old name. It is likely this will also happen early, >> so you need to learn two names anyway. [...]
> I just want to reach a bit deeper into "likely this will also happen early" - in my limited experience, most people working in a perl shop tend to read lots of code in their local codebase, but rarely read code outside of it (not even for the CPAN modules they're using). So wherever the local codebase gets cleaned up, it may not happen that early for a good proportion of new developers. > > Maybe you had in mind primarily historical threads googled up from perlmonks, stackoverflow and the like; I agree those are less likely to get cleaned up. I cannot guess what proportion of newish developers would hit those; I can't even guess what proportion of me would hit those had I been born 30 years later.
I meant people who will start hacking on Perl core.
To: Sawyer X <xsawyerx [...] gmail.com>
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
CC: Karl Williamson <public [...] khwilliamson.com>, "H.Merijn Brand" <h.m.brand [...] xs4all.nl>, perl5-porters [...] perl.org, perlbug-followup [...] perl.org
Date: Wed, 19 Jul 2017 16:58:15 +1000
From: Tony Cook <tony [...] develop-help.com>
On Tue, Jul 18, 2017 at 10:53:53AM +1000, Tony Cook wrote:
> On Mon, Jul 17, 2017 at 10:46:59AM +0200, Sawyer X wrote:
> > [Top-posted] > > > > I have mixed thoughts about this. > > > > I'm sympathetic to both considerations: Having properly-named functions > > to reduce confusion for future developers (we hope to have some, right?) > > but not introduce additional cognitive load for existing developers. > > > > A few ways to make such a situation easier: > > > > * Document utf8::is_utf8() to prevent this confusion: This is by far the > > first thing that should be done. I have double checked the wording for > > utf8::is_utf8() from my blead (978b185): > > > > (Since Perl 5.8.1) Test whether $string is marked internally as > > encoded in UTF-8. Functionally the same as "Encode::is_utf8()". > > > > This is confusing, to say the least. "Marked internally" is the words > > core hackers are looking for and recognize, but "UTF-8" is what non-core > > hackers (those without the cognitive bias in core terms) see and > > understand. If we head over to Encode::is_utf8() we see: > > > > [INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/. > > If /CHECK/ is true, also checks whether /STRING/ contains > > well-formed UTF-8. Returns true if successful, false otherwise. > > > > As of Perl 5.8.1, utf8 <https://metacpan.org/pod/utf8> also has the > > |utf8::is_utf8| function. > > > > I like this wording better for several reasons: It is under the title > > "Messing with Perl's Internals"; it notes the "UTF8" flag, and it adds > > that it checks for well-formed UTF-8 only if that flag is true. There > > are improvements to be made here too. We can note what the flag means > > (subtle, complicated, bike-shed-able) or at the very least add a nice > > "this isn't the flag you're looking for" warning. We can also suggest > > when to use and when not to use the function (otherwise it's left to the > > reader, who can easily get it wrong, which is why we're here).
> > utf8::is_utf8() doesn't accept the second parameter and does no > validity checks (we have utf8::valid() for that), despite the note in > utf8.pm. >
> > If the document on both was better, then we could have possibly left > > this as unfortunate naming errors we're carrying with us (along with > > "wantarray" for noting whether the context is scalar, list, or void).
> ...
> > Overall, I'm still undecided. Maybe we could start with improving the > > existing documentation?
> > Perhaps something like: >
> >>
> > =item * C<$flag = utf8::is_utf8($string)> > > (Since Perl 5.8.1) Test whether I<$string> is marked internally as > encoded in UTF-8. Functionally the same as C<Encode::is_utf8($string)>. > Typically only necessary for debugging. > > If you need to force Unicode semantics for code that needs to be > compatible with perls older than 5.12, call C<utf8::upgrade($string)> > unconditionally. > > Using this flag to decide whether a string should be treated as > already encoded bytes or characters is wrong, this should be decided > as part of the interface of your function. > > If you're accepting bytes: > > utf8::downgrade($string); # throws an exception if code point over 0xFF > > utf8::downgrade($string, 1) # our own error handling > or die "\$string must be representable as bytes" > > or if you're accepting characters and need encoded bytes: > > utf8::encode($string); # unconditionally > > The only exception is if you're dealing with filenames, since perl > uses the internal representation of the string for system calls. > > << > > Are there any other cases someone might be tempted to call > utf8::is_utf8()?
Thinking about it further, I'm pretty sure this doesn't all belong here.

L<perlunifaq/What is "the UTF8 flag"?> provides a good description of the flag is_utf8() returns, and the whole of perlunifaq covers some of the things the above tries to cover.

perlunicook largely works at a higher level than the functions in utf8::* work at.

One thing from the above that doesn't seem to be discussed well[1] is what I tried to cover briefly in:
> Using this flag to decide whether a string should be treated as > already encoded bytes or characters is wrong, this should be decided > as part of the interface of your function.
which could perhaps use some expansion in perlunicode.

I'm not sure where the cheat sheet following belongs, though perlunifaq covers some of it (though using Encode instead of utf8::*).

Tony

[1] perlunifaq briefly mentions some of the issues under "What about binary data, like images?" and more detail in "What if I don't decode?"
CC: Karl Williamson <public [...] khwilliamson.com>, "H.Merijn Brand" <h.m.brand [...] xs4all.nl>, perl5-porters [...] perl.org, perlbug-followup [...] perl.org
Date: Wed, 19 Jul 2017 18:30:43 +0200
From: Sawyer X <xsawyerx [...] gmail.com>
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: Tony Cook <tony [...] develop-help.com>
On 07/19/2017 08:58 AM, Tony Cook wrote:
> On Tue, Jul 18, 2017 at 10:53:53AM +1000, Tony Cook wrote:
>> On Mon, Jul 17, 2017 at 10:46:59AM +0200, Sawyer X wrote:
>>> [Top-posted] >>> >>> I have mixed thoughts about this. >>> >>> I'm sympathetic to both considerations: Having properly-named functions >>> to reduce confusion for future developers (we hope to have some, right?) >>> but not introduce additional cognitive load for existing developers. >>> >>> A few ways to make such a situation easier: >>> >>> * Document utf8::is_utf8() to prevent this confusion: This is by far the >>> first thing that should be done. I have double checked the wording for >>> utf8::is_utf8() from my blead (978b185): >>> >>> (Since Perl 5.8.1) Test whether $string is marked internally as >>> encoded in UTF-8. Functionally the same as "Encode::is_utf8()". >>> >>> This is confusing, to say the least. "Marked internally" is the words >>> core hackers are looking for and recognize, but "UTF-8" is what non-core >>> hackers (those without the cognitive bias in core terms) see and >>> understand. If we head over to Encode::is_utf8() we see: >>> >>> [INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/. >>> If /CHECK/ is true, also checks whether /STRING/ contains >>> well-formed UTF-8. Returns true if successful, false otherwise. >>> >>> As of Perl 5.8.1, utf8 <https://metacpan.org/pod/utf8> also has the >>> |utf8::is_utf8| function. >>> >>> I like this wording better for several reasons: It is under the title >>> "Messing with Perl's Internals"; it notes the "UTF8" flag, and it adds >>> that it checks for well-formed UTF-8 only if that flag is true. There >>> are improvements to be made here too. We can note what the flag means >>> (subtle, complicated, bike-shed-able) or at the very least add a nice >>> "this isn't the flag you're looking for" warning. We can also suggest >>> when to use and when not to use the function (otherwise it's left to the >>> reader, who can easily get it wrong, which is why we're here).
>> utf8::is_utf8() doesn't accept the second parameter and does no >> validity checks (we have utf8::valid() for that), despite the note in >> utf8.pm. >>
>>> If the document on both was better, then we could have possibly left >>> this as unfortunate naming errors we're carrying with us (along with >>> "wantarray" for noting whether the context is scalar, list, or void).
>> ...
>>> Overall, I'm still undecided. Maybe we could start with improving the >>> existing documentation?
>> Perhaps something like:
>>
>> =item * C<$flag = utf8::is_utf8($string)>
>>
>> (Since Perl 5.8.1) Test whether I<$string> is marked internally as
>> encoded in UTF-8. Functionally the same as C<Encode::is_utf8($string)>.
>> Typically only necessary for debugging.
>>
>> If you need to force Unicode semantics for code that needs to be
>> compatible with perls older than 5.12, call C<utf8::upgrade($string)>
>> unconditionally.
>>
>> Using this flag to decide whether a string should be treated as
>> already encoded bytes or characters is wrong, this should be decided
>> as part of the interface of your function.
>>
>> If you're accepting bytes:
>>
>>   utf8::downgrade($string); # throws an exception if code point over 0xFF
>>
>>   utf8::downgrade($string, 1) # our own error handling
>>     or die "\$string must be representable as bytes"
>>
>> or if you're accepting characters and need encoded bytes:
>>
>>   utf8::encode($string); # unconditionally
>>
>> The only exception is if you're dealing with filenames, since perl
>> uses the internal representation of the string for system calls.
>>
>> <<
>>
>> Are there any other cases someone might be tempted to call
>> utf8::is_utf8()?
> Thinking about it further, I'm pretty sure this doesn't all belong
> here.
>
> L<perlunifaq/What is "the UTF8 flag"?> provides a good description of
> the flag is_utf8() returns, and the whole of perlunifaq covers some of
> the things the above tries to cover.
>
> perlunicook largely works at a higher level than the functions in
> utf8::* work at.
+1 on the suggested text. I think this addition is useful, even if it is also covered in more documents. We could also link to those documents for further learning.
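As a rough illustration of the suggested text, here is a minimal sketch of deciding "bytes or characters" at the interface instead of consulting the flag. The helper names chars_to_utf8_bytes() and require_bytes() are hypothetical and exist only for this example; only the utf8::* calls come from the discussion above.

```perl
use strict;
use warnings;

# Hypothetical helper: the interface says "characters in, UTF-8 bytes out".
sub chars_to_utf8_bytes {
    my ($chars) = @_;
    my $bytes = $chars;       # copy, so the caller's string is untouched
    utf8::encode($bytes);     # unconditional, the flag is never consulted
    return $bytes;
}

# Hypothetical helper: the interface says "bytes in".
sub require_bytes {
    my ($string) = @_;
    my $bytes = $string;
    utf8::downgrade($bytes, 1)
        or die "\$string must be representable as bytes";
    return $bytes;
}

# The same four characters held in two different internal representations.
my $native   = "caf\xE9";     # stored as native 8-bit octets
my $upgraded = "caf\xE9";
utf8::upgrade($upgraded);     # stored internally as UTF-8

my $same_chars   = $native eq $upgraded;
my $flags_differ = utf8::is_utf8($upgraded) && !utf8::is_utf8($native);

# Encoding unconditionally gives the same octets for both strings.
my $bytes_a = chars_to_utf8_bytes($native);
my $bytes_b = chars_to_utf8_bytes($upgraded);
```

The two strings compare equal as characters even though utf8::is_utf8() reports different things for them, which is why branching on the flag before utf8::encode() would encode one and skip the other.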
RT-Send-CC: perl5-porters [...] perl.org
On Tue, 18 Jul 2017 23:58:39 -0700, tonyc wrote:
> which could perhaps use some expansion in perlunicode.
perlunitut covers this reasonably well.
> I'm not sure where the cheat sheet following belongs, though > perlunifaq covers some of it (though using Encode instead of utf8::*).
Attached is a series of patches (as a single file), the first three fix some minor problems with the unicode documentation I found when going through it.

The fourth re-works the documentation in utf8.pm, taking bits from my little cheat sheet and hopefully putting them in the right places.

Tony
Subject: 131685-various-changes.patch
From bb94b5c97eb772aabac478a997537696cf953b39 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Wed, 19 Jul 2017 10:30:56 +1000
Subject: use utf8; doesn't force unicode semantics on all strings in scope

eg.

  $ perl -Mutf8 -le 'chr(0xdf) =~ /ss/i and print "match" or print "no match"'
  no match

perhaps this should be removed, or completely re-worded, it's worded
similarly to the next point which behaves differently.
---
 pod/perlunicode.pod | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index ef02b0a..d3ccf44 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -233,7 +233,7 @@ Unicode:
 Within the scope of S<C<use utf8>>
 
 If the whole program is Unicode (signified by using 8-bit B<U>nicode
-B<T>ransformation B<F>ormat), then all strings within it must be
+B<T>ransformation B<F>ormat), then all literal strings within it must be
 Unicode.
 
 =item *
-- 
2.1.4

From b8e048092606e8ab230e0915896cd44a1c900597 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Wed, 19 Jul 2017 10:45:33 +1000
Subject: encoding.pm no longer works

---
 pod/perlunicode.pod | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index d3ccf44..24102bf 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -60,10 +60,11 @@ filenames.
 Use the C<:encoding(...)> layer to read from and write to
 filehandles using the specified encoding. (See L<open>.)
 
-=item You should convert your non-ASCII, non-UTF-8 Perl scripts to be
+=item You must convert your non-ASCII, non-UTF-8 Perl scripts to be
 UTF-8.
 
-See L<encoding>.
+The L<encoding> module has been deprecated since perl 5.18 and the
+perl internals it requires have been removed with perl 5.26.
 
 =item C<use utf8> still needed to enable L<UTF-8|/Unicode Encodings>
 in scripts
-- 
2.1.4

From b997306c58fa50d12a10a92b73ecc075100c8518 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Wed, 19 Jul 2017 15:42:18 +1000
Subject: unfortunately sysread() tries to read characters

---
 pod/perluniintro.pod | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 0ad9dda..5e263b4 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -473,8 +473,11 @@ standardisation organisations are recognised; for a more detailed list see
 L<Encode::Supported>.
 
 C<read()> reads characters and returns the number of characters.
-C<seek()> and C<tell()> operate on byte counts, as do C<sysread()>
-and C<sysseek()>.
+C<seek()> and C<tell()> operate on byte counts, as does C<sysseek()>.
+
+C<sysread()> and C<syswrite()> should not be used on file handles with
+character encoding layers, they behave badly, and that behaviour has
+been deprecated since perl 5.24.
 
 Notice that because of the default behaviour of not doing any
 conversion upon input if there is no default layer,
-- 
2.1.4

From fb22d08dd9f174ddc4007c8ca6ef0e379fe34874 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Thu, 20 Jul 2017 15:44:49 +1000
Subject: (perl #131685) improve utf8::* function documentation

Splits the little cheat sheet I posted as a comment into pieces and
puts them closer to where they belong

- better document why you'd want to use utf8::upgrade()

- similarly for utf8::downgrade()

- try hard to convince people not to use utf8::is_utf8()

- no, utf8::is_utf8() isn't what you want instead of utf8::valid()
---
 lib/utf8.pm | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 52 insertions(+), 9 deletions(-)

diff --git a/lib/utf8.pm b/lib/utf8.pm
index 324cb87..9abbd06 100644
--- a/lib/utf8.pm
+++ b/lib/utf8.pm
@@ -2,7 +2,7 @@ package utf8;
 
 $utf8::hint_bits = 0x00800000;
 
-our $VERSION = '1.19';
+our $VERSION = '1.20';
 
 sub import {
     $^H |= $utf8::hint_bits;
@@ -109,11 +109,26 @@ you should not say that unless you really want to have UTF-8 source code.
 Converts in-place the internal representation of the string from an octet
 sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The
 logical character sequence itself is unchanged. If I<$string> is already
-stored as UTF-8, then this is a no-op. Returns the
-number of octets necessary to represent the string as UTF-8. Can be
-used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
-work as Unicode on strings containing non-ASCII characters whose code points
-are below 256.
+upgraded, then this is a no-op. Returns the
+number of octets necessary to represent the string as UTF-8.
+
+If your code needs to be compatible with versions of perl without
+C<use feature 'unicode_strings';>, you can force Unicode semantics on
+a given string:
+
+  # force unicode semantics for $string without the
+  # "unicode_strings" feature
+  utf8::upgrade($string);
+
+For example:
+
+  # without explicit or implicit use feature 'unicode_strings'
+  my $x = "\xDF"; # LATIN SMALL LETTER SHARP S
+  /ss/i; # won't match
+  my $y = uc($x); # won't comvert
+  utf8::upgrade($x);
+  /ss/i; # matches
+  my $z = uc($x); # converts to "SS"
 
 B<Note that this function does not handle arbitrary encodings>;
 use L<Encode> instead.
@@ -136,6 +151,15 @@ true, returns false.
 
 Returns true on success.
 
+If your code expects an octet sequence this can be used to validate
+that you've received one:
+
+  # throw an exception if not representable as octets
+  utf8::downgrade($string)
+
+  # or do your own error handling
+  utf8::downgrade($string, 1) or die "string must be octets";
+
 B<Note that this function does not handle arbitrary encodings>;
 use L<Encode> instead.
 
@@ -153,6 +177,11 @@ Returns nothing.
                 # ASCII platforms) 0xc4 and 0x80. On EBCDIC
                 # 1047, this would instead be 0x8C and 0x41.
 
+Similar to:
+
+  use Encode;
+  $a = Encode::encode("utf8", $a);
+
 B<Note that this function does not handle arbitrary encodings>;
 use L<Encode> instead.
 
@@ -208,7 +237,22 @@ platforms, so there is no performance hit in using it there.
 =item * C<$flag = utf8::is_utf8($string)>
 
 (Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in
-UTF-8. Functionally the same as C<Encode::is_utf8()>.
+UTF-8. Functionally the same as C<Encode::is_utf8($string)>.
+
+Typically only necessary for debugging and testing, if you need to
+dump the internals of an SV, L<Devel::Peek's|Devel::Peek> Dump()
+provides more detail in a compact form.
+
+If you still think you need this outside of debugging, testing or
+dealing with filenames, you should probably read L<perlunitut> and
+L<perlunifaq/What is "the UTF8 flag"?>.
+
+Don't use this flag as a marker to distinguish character and binary
+data, that should be decided for each variable when you write your
+code.
+
+To force unicode semantics in code portable to perl 5.8 and 5.10, call
+C<utf8::upgrade($string)> unconditionally.
 
 =item * C<$flag = utf8::valid($string)>
 
@@ -216,8 +260,7 @@ UTF-8. Functionally the same as C<Encode::is_utf8()>.
 UTF-8. Will return true if it is well-formed UTF-8 and has the UTF-8 flag
 on B<or> if I<$string> is held as bytes (both these states are 'consistent').
 Main reason for this routine is to allow Perl's test suite to check
-that operations have left strings in a consistent state. You most
-probably want to use C<utf8::is_utf8()> instead.
+that operations have left strings in a consistent state.
 
 =back
 
-- 
2.1.4
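The utf8::upgrade() documentation added by the fourth patch can be tried out with a short script. This is only a sketch of the behaviour the patch describes, written with an explicit match target; the variable names are made up for the example, and it assumes the "unicode_strings" feature is not in effect.

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" and no "use v5.12" (or
# later), which would enable that feature implicitly.

my $x = "\xDF";                            # LATIN SMALL LETTER SHARP S

# Native 8-bit storage: no Unicode semantics for this character.
my $match_before = $x =~ /ss/i ? 1 : 0;
my $uc_before    = uc($x);

utf8::upgrade($x);                         # same character, UTF-8 storage

# Upgraded storage: Unicode semantics apply.
my $match_after = $x =~ /ss/i ? 1 : 0;
my $uc_after    = uc($x);

printf "match: %d -> %d\n", $match_before, $match_after;
printf "uc length: %d -> %d\n", length($uc_before), length($uc_after);
```

If the patch text is accurate, the match should only succeed after the upgrade, and uc() should only produce "SS" after the upgrade.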
Date: Thu, 20 Jul 2017 09:23:44 +0200
From: Sawyer X <xsawyerx [...] gmail.com>
CC: perl5-porters [...] perl.org
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: perlbug-followup [...] perl.org
On 07/20/2017 07:50 AM, Tony Cook via RT wrote:
> On Tue, 18 Jul 2017 23:58:39 -0700, tonyc wrote:
>> which could perhaps use some expansion in perlunicode.
> perlunitut covers this reasonably well. >
>> I'm not sure where the cheat sheet following belongs, though >> perlunifaq covers some of it (though using Encode instead of utf8::*).
> Attached is a series of patches (as a single file), the first three > fix some minor problems with the unicode documentation I found when > going through it. > > The fourth re-works the documentation in utf8.pm, taking bits from my little cheat sheet and hopefully putting them in the right places.
Thank you, Tony. I have only two small nit-pickings on the patch: There's a typo for "convert" (says "comvert") and it uses "$a" in one of the examples which I think should be "$x" or some unreserved variable name, to avoid confusion.
Date: Thu, 20 Jul 2017 22:47:40 +0200
From: Sawyer X <xsawyerx [...] gmail.com>
CC: perl5-porters [...] perl.org
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: perlbug-followup [...] perl.org
On 07/20/2017 09:23 AM, Sawyer X wrote:
> > On 07/20/2017 07:50 AM, Tony Cook via RT wrote:
>> On Tue, 18 Jul 2017 23:58:39 -0700, tonyc wrote:
>>> which could perhaps use some expansion in perlunicode.
>> perlunitut covers this reasonably well. >>
>>> I'm not sure where the cheat sheet following belongs, though >>> perlunifaq covers some of it (though using Encode instead of utf8::*).
>> Attached is a series of patches (as a single file), the first three >> fix some minor problems with the unicode documentation I found when >> going through it. >> >> The fourth re-works the documentation in utf8.pm, taking bits from my little cheat sheet and hopefully putting them in the right places.
> Thank you, Tony. > > I have only two small nit-pickings on the patch: There's a typo for > "convert" (says "comvert") and it uses "$a" in one of the examples which > I think should be "$x" or some unreserved variable name, to avoid confusion.
For what it's worth, this received an offline +1 from rgs. :)
To: Sawyer X <xsawyerx [...] gmail.com>
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
Date: Fri, 21 Jul 2017 11:40:45 +1000
From: Tony Cook <tony [...] develop-help.com>
CC: perlbug-followup [...] perl.org, perl5-porters [...] perl.org
On Thu, Jul 20, 2017 at 09:23:44AM +0200, Sawyer X wrote:
> > > On 07/20/2017 07:50 AM, Tony Cook via RT wrote:
> > On Tue, 18 Jul 2017 23:58:39 -0700, tonyc wrote:
> >> which could perhaps use some expansion in perlunicode.
> > perlunitut covers this reasonably well. > >
> >> I'm not sure where the cheat sheet following belongs, though > >> perlunifaq covers some of it (though using Encode instead of utf8::*).
> > Attached is a series of patches (as a single file), the first three > > fix some minor problems with the unicode documentation I found when > > going through it. > > > > The fourth re-works the documentation in utf8.pm, taking bits from my little cheat sheet and hopefully putting them in the right places.
> > Thank you, Tony. > > I have only two small nit-pickings on the patch: There's a typo for > "convert" (says "comvert") and it uses "$a" in one of the examples which > I think should be "$x" or some unreserved variable name, to avoid confusion.
Updated patch attached.

Any opinions on whether the reference to C<use utf8;> modified by the first patch should be removed?

It's still misleading ("abc" in the scope of use utf8; isn't SVf_UTF8 marked), which isn't a big deal, until we do "abc\xDF" which also isn't marked.

Tony
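The point about literals in the scope of use utf8; can be checked with a small script. This is a sketch for illustration only; the third string, containing a code point above 0xFF, is an addition for contrast and is not part of the message above.

```perl
use strict;
use warnings;
use utf8;

my $ascii = "abc";          # ASCII-only literal
my $sharp = "abc\xDF";      # code point below 256, written as an escape
my $wide  = "abc\x{100}";   # code point above 255 (added for contrast)

# Report the length in characters and the state of the internal flag.
for my $s ($ascii, $sharp, $wide) {
    printf "%d chars, flag %s\n",
        length($s), utf8::is_utf8($s) ? "on" : "off";
}
```

Only the last string should report the flag as on, even though all three literals are in the scope of use utf8;.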

Message body is not shown because sender requested not to inline it.

From: Sawyer X <xsawyerx [...] gmail.com>
Date: Fri, 21 Jul 2017 11:01:46 +0200
CC: perlbug-followup [...] perl.org, perl5-porters [...] perl.org
To: Tony Cook <tony [...] develop-help.com>
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
+1

(Except "$a" still appears in the comments next to the lines that now say "$x". Sorry.)

On 07/21/2017 03:40 AM, Tony Cook wrote:
> On Thu, Jul 20, 2017 at 09:23:44AM +0200, Sawyer X wrote:
>> >> On 07/20/2017 07:50 AM, Tony Cook via RT wrote:
>>> On Tue, 18 Jul 2017 23:58:39 -0700, tonyc wrote:
>>>> which could perhaps use some expansion in perlunicode.
>>> perlunitut covers this reasonably well. >>>
>>>> I'm not sure where the cheat sheet following belongs, though >>>> perlunifaq covers some of it (though using Encode instead of utf8::*).
>>> Attached is a series of patches (as a single file), the first three >>> fix some minor problems with the unicode documentation I found when >>> going through it. >>> >>> The fourth re-works the documentation in utf8.pm, taking bits from my little cheat sheet and hopefully putting them in the right places.
>> Thank you, Tony. >> >> I have only two small nit-pickings on the patch: There's a typo for >> "convert" (says "comvert") and it uses "$a" in one of the examples which >> I think should be "$x" or some unreserved variable name, to avoid confusion.
> Updated patch attached. > > Any opinions on whether the reference to C<use utf8;> modified by the > first patch should be removed? > > It's still misleading ("abc" in the scope of use utf8; isn't SVf_UTF8 > marked), which isn't a big deal, until we do "abc\xDF" which also > isn't marked. > > Tony
RT-Send-CC: perl5-porters [...] perl.org
On Fri, 21 Jul 2017 02:02:08 -0700, xsawyerx@gmail.com wrote:
> +1 > > (Except "$a" still appears in the comments next to the lines that now > say "$x". Sorry.)
Fixed and applied as e423fa83496ce7d83b137bd7f0852864b6073b36, 01c3fbbc0d1b54bb0dd6fdc0abed7854e62c6717, ee329aefb9c0bfcee0e6cc41dcd6eb8b03206f30 and 0397beb0d12565d70e168bfea7376e2612a6748a.

Is there anything else we should do to avoid mis-use of these functions?

I previously said:
> > > Using this flag to decide whether a string should be treated as > > > already encoded bytes or characters is wrong, this should be > > > decided as part of the interface of your function.
> > which could perhaps use some expansion in perlunicode.
> perlunitut covers this reasonably well.
I'm referring to "I/O flow (the actual 5 minute tutorial)". Should this be expanded elsewhere? I don't think it should be expanded in perlunitut.

Tony
Date: Mon, 24 Jul 2017 14:35:45 +0200
From: pali [...] cpan.org
To: perlbug-followup [...] perl.org
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
On Sunday 23 July 2017 18:57:43 Tony Cook via RT wrote:
> On Fri, 21 Jul 2017 02:02:08 -0700, xsawyerx@gmail.com wrote:
> > +1 > > > > (Except "$a" still appears in the comments next to the lines that now > > say "$x". Sorry.)
> > Fixed and applied as e423fa83496ce7d83b137bd7f0852864b6073b36, 01c3fbbc0d1b54bb0dd6fdc0abed7854e62c6717, ee329aefb9c0bfcee0e6cc41dcd6eb8b03206f30 and 0397beb0d12565d70e168bfea7376e2612a6748a.
Just one note:

+Similar to:
+
+  use Encode;
+  $x = Encode::encode("utf8", $x);
+

Maybe the examples should show "UTF-8" instead of "utf8" to users/developers, so that anyone using Encode::encode gets "correct" UTF-8 output and not perl's extended utf8.

Commit 8e179dd8df306c5088bf6c15b494826d48278928 replaced the use of Encode "utf8" with "UTF-8", as it is better for people doing copy+paste without context.
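To make the difference concrete, here is a small sketch comparing the two Encode names on a code point beyond the Unicode range. The choice of 0x110000 is an assumption made for the example; it follows the distinction the Encode documentation draws between strict "UTF-8" and perl's lax "utf8".

```perl
use strict;
use warnings;
use Encode ();

# A code point above 0x10FFFF: perl can hold it, but it is not Unicode.
my $x = "\x{110000}";

my $lax    = Encode::encode("utf8",  $x);   # perl's extended utf8
my $strict = Encode::encode("UTF-8", $x);   # strict UTF-8

# Show both results as hex octets so they can be compared by eye.
printf "utf8:  %s\n", join " ", map { sprintf "%02X", ord } split //, $lax;
printf "UTF-8: %s\n", join " ", map { sprintf "%02X", ord } split //, $strict;
```

The two outputs should differ: the lax encoding passes the code point through, while the strict one does not.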
Date: Mon, 24 Jul 2017 14:50:03 +0200
From: pali [...] cpan.org
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: perlbug-followup [...] perl.org
On Sunday 23 July 2017 18:57:43 Tony Cook via RT wrote:
> Is there anything else we should do to avoid mis-use of these functions?
The most useful and legitimate functions are these:

  utf8::encode
  utf8::decode
  utf8::native_to_unicode
  utf8::unicode_to_native

What about moving them higher up in the synopsis and also in the description? That way we first show users the functions which they probably want to use in their code, and only afterwards describe upgrade/downgrade/is_utf8...

Adding an "[INTERNAL]" marker to the descriptions, like the one utf8::valid has, could probably help too.
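For reference, a minimal sketch of the two functions most code actually wants, utf8::encode() and utf8::decode(). Both work in place; the variable names are only illustrative.

```perl
use strict;
use warnings;

my $chars = "\x{263A}";            # one character: WHITE SMILING FACE

my $bytes = $chars;
utf8::encode($bytes);              # in place: characters -> UTF-8 octets
my $octet_count = length($bytes);  # now counts octets, not characters

my $round_trip = $bytes;
my $decoded_ok = utf8::decode($round_trip);  # in place: octets -> characters

my $bad    = "\xFF";               # not well-formed UTF-8
my $bad_ok = utf8::decode($bad);   # returns false and leaves $bad alone
```

Neither call needs utf8::is_utf8(): the caller knows from its own interface whether it holds characters or octets.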
CC: perl5-porters [...] perl.org
Date: Tue, 1 Aug 2017 23:25:25 -0600
From: Karl Williamson <public [...] khwilliamson.com>
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
To: perlbug-followup [...] perl.org
On 07/13/2017 08:28 PM, Father Chrysostomos via RT wrote:
> On Wed, 12 Jul 2017 21:55:03 -0700, public@khwilliamson.com wrote:
>> I guess we have a fundamental disagreement about language design and
>> the direction Perl should go, which makes me sad.
>
> I agree the disagreement is unfortunate.
>
>> The point of adding synonyms for deceptively-named functions and
>> macros is to make life easier overall. Forbidding new better-named
>> synonyms for problematically named things forces everyone who comes
>> along to deal with the gotchas and cognitive load that those people
>> already here have had to deal with. By creating better named things,
>> those people can largely avoid these problems. This allows them to
>> work more efficiently, avoiding traps, and with less cursing Perl.
>
> When you first put forward this argument (specifically with regard to
> av_len), it made sense to me, and I had no objection to it. Later,
> people wrote to p5p complaining that the new situation was more
> confusing; in addition, *I* started to get confused. That was when I
> started to have second thoughts.
I searched the archives of p5p for occurrences of av_top_index and av_tindex. There were two complaints I saw before the recent spate. One was Marc Lehmann; the other, more recent was Dave Mitchell saying av_tindex didn't seem natural to him.

I myself am confused by the previous names, and this helps *me*. There are times when I want to refer to the highest element. And there are times when the length is the more natural concept. I would like something for these occasions like 'av_true_len'.

Again, if I see av_len, I realize it's problematic and I have to slow down to think about how it is. Life is more difficult.
>
> I think Damian Conway was right when he wrote in PBP that one should
> not use English (the module). Since other people use punctuation
> variables, you are going to have to learn them anyway. Using the
> English names just forces others reading your code to look up the
> names that you are using. It just creates more cognitive burden.
That tells me that the names were not chosen well enough. It is an art, and few coders are good at it. I still have learned only a few of the punctuation variables.
>
> I think the same applies even to poorly named functions. You just
> have to learn the gotcha once, and then you can use the function and
> read code that uses it. (And if you use functions without reading
> either the documentation or the source, then you are coming close to
> what I would call autopodotoxy.)
>
>> Unless Perl is close to death, the number of people who are going to
>> come along before it does die dwarfs the number who are already
>> expert. Some people are knowledgeable in parts of Perl, but not all.
>> They also gain if gotchas get removed before they have to deal with
>> them.
>
> But the gotchas never get removed. You just end up with a larger pile
> of functions for people to sift through. Not to mention a lot of
> existing (and correct) code that they cannot read without learning
> the discouraged parlance. So they have to learn the different forms
> anyway.
>
> My personal experience is that what you are arguing for, while it
> sounds good, does not work in practice.
>
If you assume that new Perl XS programmers are mostly going to be reading old code that uses these constructs, yes they will have to learn them at some point. And, encountering those constructs will likely slow them down each time. But my hope is that there will be plenty of new Perl programmers programming Perl and XS on new projects, and they shouldn't have to be burdened by the past.

My father was good at double-clutching. He used that, the story goes, to save a tourist bus whose brakes had failed while he was driving it down a steep slope. He tried to teach me that art, and I did it a few times, but transmissions had gotten better, and I never had to do it, and couldn't do it now. Nowadays most people don't even know what it is, nor should they have to be burdened by a skill that technology has made essentially obsolete.

