Report information
Id: 119499
Status: resolved
Priority: 0/
Queue: perl5

Owner: khw <khw [at] cpan.org>
Requestors: vsespb <victor [at] vsespb.ru>
Cc:
AdminCc:

Operating System: (no value)
PatchStatus: (no value)
Severity: low
Type: unknown
Perl Version: (no value)
Fixed In: (no value)



Subject: $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Wed, 28 Aug 2013 12:51:45 +0400
To: perlbug [...] perl.org
From: Victor Efimov <victor [...] vsespb.ru>
$! is returned as a character string under 5.19.2+ in UTF-8 locales, but as a binary string
in single-byte-encoding locales.

I believe this is useless and just makes it harder to decode the value of $! properly.

Also, I am not sure whether it will be possible to decode it when a language with Latin-1-only characters is set.

LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.utf8 perl -MPOSIX -MDevel::Peek -e '$!=EACCES; Dump "$!"'

SV = PV(0x144dd80) at 0x14702a0
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK,UTF8)
  PV = 0x1468e30 "\320\236\321\202\320\272\320\260\320\267\320\260\320\275\320\276 \320\262 \320\264\320\276\321\201\321\202\321\203\320\277\320\265"\0 [UTF8 "\x{41e}\x{442}\x{43a}\x{430}\x{437}\x{430}\x{43d}\x{43e} \x{432} \x{434}\x{43e}\x{441}\x{442}\x{443}\x{43f}\x{435}"]
  CUR = 34
  LEN = 40


LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.CP1251 LC_MESSAGES=ru_RU.CP1251 perl -MPOSIX -MDevel::Peek -e '$!=EACCES; Dump "$!"'

SV = PV(0x1db8d80) at 0x1ddf7e0
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK)
  PV = 0x1f680d0 "\316\362\352\340\347\340\355\356 \342 \344\356\361\362\363\357\345"\0
  CUR = 18
  LEN = 24


RT-Send-CC: perl5-porters [...] perl.org
This seems to be a result of the fix for https://rt.perl.org/rt3/Ticket/Display.html?id=112208. However, I think the fix is wrong.

1) It breaks old code which:
a) tries to decode $! using Encode::decode and I18N::Langinfo::langinfo(I18N::Langinfo::CODESET())
b) prints error messages to the screen as-is (without "binmode STDOUT :encoding")

2) Sometimes it returns a binary string (under non-UTF-8 locales, or when the message is ASCII-only), sometimes a character string (when the locale is UTF-8). It is hard to distinguish one from the other.

A possible solution is utf8::is_utf8(), but use of utf8::is_utf8 is advertised as dangerous.

Another solution is to use Encode::decode_utf8 when the locale is UTF-8 (but not Encode::decode("UTF-8"...)). The problem is that this method's documentation is wrong; several people have reported this:
https://rt.cpan.org/Public/Bug/Display.html?id=87267
https://rt.cpan.org/Public/Bug/Display.html?id=61671
https://github.com/dankogai/p5-encode/pull/11
https://github.com/dankogai/p5-encode/pull/10

3) It is not documented in perllocale, perlunicode, or perlvar.

4) It is not clear how it works in the case of Latin-1 characters in a UTF-8 locale.
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Wed, 28 Aug 2013 11:18:57 -0600
To: perl5-porters [...] perl.org
From: Karl Williamson <public [...] khwilliamson.com>
On 08/28/2013 02:52 AM, Victor Efimov (via RT) wrote:

I am trying to understand your issues with this change. I believe it is working correctly now.

> $! returned as character string under 5.19.2+ and UTF-8 locales. But as
> binary strings
> under single-byte encoding locales.

I don't understand your use of the word 'binary' here. In both cases, it returns characters in contexts where strings are appropriate, and the numeric value in contexts where numbers are appropriate.

In string contexts, it returns the appropriate encoding. In UTF-8 locales, it returns the UTF-8 encoded character string. In non-UTF-8 locales, it returns the single-byte string in the correct encoding.

> I believe this is useless and just makes it harder to decode $! value
> properly.

I don't have a clue as to why you think this is useless. This change was to fix https://rt.perl.org/rt3/Ticket/Display.html?id=112208 (reported also as perl #117429, so more than one person found this to be a bug). The patch merely examines the string text of $!, and if it is UTF-8, sets the flag indicating that.

Code that is trying to decode $! should be using the (constant) numeric value rather than trying to parse the (locale-dependent) string.

> Also I am not sure if it will be possible to decode it when language with
> Latin-1 -only characters is set.

Again, use the numeric value when trying to parse the error.

> LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.utf8 perl -MPOSIX -MDevel::Peek
> -e '$!=EACCES; Dump "$!"'
>
> SV = PV(0x144dd80) at 0x14702a0
> REFCNT = 1
> FLAGS = (PADTMP,POK,pPOK,UTF8)
> PV = 0x1468e30
> "\320\236\321\202\320\272\320\260\320\267\320\260\320\275\320\276 \320\262
> \320\264\320\276\321\201\321\202\321\203\320\277\320\265"\0 [UTF8
> "\x{41e}\x{442}\x{43a}\x{430}\x{437}\x{430}\x{43d}\x{43e} \x{432}
> \x{434}\x{43e}\x{441}\x{442}\x{443}\x{43f}\x{435}"]
> CUR = 34
> LEN = 40

I ran this, substituting 'say $!' for the Dump, and got this output:

Отказано в доступе

which is the correct Cyrillic text. Prior to the patch, this would have printed garbage.

> LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.CP1251 LC_MESSAGES=ru_RU.CP1251
> perl -MPOSIX -MDevel::Peek -e '$!=EACCES; Dump "$!"'
>
> SV = PV(0x1db8d80) at 0x1ddf7e0
> REFCNT = 1
> FLAGS = (PADTMP,POK,pPOK)
> PV = 0x1f680d0 "\316\362\352\340\347\340\355\356 \342
> \344\356\361\362\363\357\345"\0
> CUR = 18
> LEN = 24

I do not have a Windows machine with CP1251, but I hand looked at this dump, and the characters are Отказано в доступе in that code page. So this looks proper.
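The advice above, to test the numeric value of $! rather than its text, could be sketched roughly as follows. This is only an illustration: the helper name classify_open_error and the path in the usage line are hypothetical and do not come from this ticket. It relies on the core Errno module and the %! hash, neither of which depends on the locale.

```perl
use strict;
use warnings;
use Errno qw(ENOENT EACCES);

# Hypothetical helper: try to open a file for reading and classify a failure
# by errno value, never by the locale-dependent message text.
sub classify_open_error {
    my ($path) = @_;
    open(my $fh, "<", $path) and return "ok";
    return "missing"   if $!{ENOENT};     # %! is keyed by errno symbol name
    return "forbidden" if $! == EACCES;   # numeric comparison works as well
    return "other";
}

# Hypothetical path, assumed not to exist on the machine running this.
print classify_open_error("/nonexistent/hypothetical-file"), "\n";
```

This covers code that needs to react to an error. It does not help code that only wants to show the message text to the user, which is the case the reporter raises next.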
RT-Send-CC: perl5-porters [...] perl.org
> Code that is trying to decode $! should be using the (constant) numeric
> value rather than trying to parse the (locale-dependent) string.

I am not trying to parse $!. I am trying to print the original error message to the screen for the user.

> In string contexts, it returns the appropriate encoding. In UTF-8
> locales, it returns the UTF-8 encoded character string. In non-UTF-8
> locales, it returns the single-byte string in the correct encoding.

It is just wrong to sometimes return bytes and sometimes characters.

The following example worked fine before this change:

    use strict;
    use warnings;
    use I18N::Langinfo;
    use Encode;

    my $enc = I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
    binmode STDOUT, ":encoding($enc)";
    my $filename = "not a file ".chr(0x444);
    open my $f, "<", $filename or do {
        my $error = "$!";
        $error = decode($enc, "$error");
        print "Error accessing file $filename: $error\n";
    };

But with this change:
- under non-Unicode locales it works fine;
- under UTF-8 locales it fails with "Cannot decode string with wide characters".

A possible fix for this example is to replace

    $error = decode($enc, "$error");

with

    $error = utf8::is_utf8($error) ? $error : decode($enc, "$error");

Another place where it breaks old code is:

    perl -e 'open my $f, "<", "notafile" or die $!'

which now prints the warning "Wide character in die" when the locale is UTF-8 and the message contains wide characters.

> I ran this, substituting 'say $!' for the Dump, and got this output:
> Отказано в доступе
> which is the correct Cyrillic text. Prior to the patch, this would have
> printed garbage.

No, prior to this patch it printed the correct (same) text, but without "Wide character" warnings.
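The workaround proposed in this message (decode only when perl has not already done so) could be wrapped in a small helper along these lines. This is a sketch under stated assumptions, not a recommended API: the name errno_message is hypothetical, it assumes that undecoded message bytes are in the codeset reported by langinfo(CODESET), and it falls back to returning the bytes unchanged when Encode does not recognise that codeset name.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: return the text for an errno value as a character
# string, whether or not perl has already decoded it.
sub errno_message {
    my ($errno) = @_;
    local $! = $errno;
    my $msg = "$!";
    return $msg if utf8::is_utf8($msg);   # already characters (5.19.2+, UTF-8 locale)
    # Assumption: the message bytes are in the codeset reported by langinfo.
    my $enc = find_encoding(langinfo(CODESET));
    return $msg unless $enc;              # codeset unknown to Encode: leave as-is
    return $enc->decode($msg);
}

binmode STDOUT, ":encoding(UTF-8)";       # assumption: the terminal expects UTF-8
print errno_message(2), "\n";             # 2 is ENOENT on most systems
```

The helper still depends on utf8::is_utf8, so it inherits the objection raised in this thread that the flag should not normally be inspected by user code.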
CC: bugs-bitbucket [...] rt.perl.org
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Wed, 28 Aug 2013 21:43:25 +0200
To: Perl5 Porters <perl5-porters [...] perl.org>
From: Leon Timmermans <fawaka [...] gmail.com>
On Wed, Aug 28, 2013 at 10:52 AM, Victor Efimov <perlbug-followup@perl.org> wrote:
> $! returned as character string under 5.19.2+ and UTF-8 locales. But as
> binary strings
> under single-byte encoding locales.
>
> I believe this is useless and just makes it harder to decode $! value
> properly.

Automatic decoding is definitely the more useful behavior. I agree inconsistency is a bad thing though. Not sure it's easy to fix though. Patches welcome.

> Also I am not sure if it will be possible to decode it when language with
> Latin-1 -only characters is set.

AFAIK that should work perfectly fine.

Leon
RT-Send-CC: perl5-porters [...] perl.org
> Automatic decoding is definitely the more useful behavior

Yes, when
a) it is documented (perllocale or perlunicode or perlvar)
b) it does not break existing code
OR
c) it is turned on with 'use feature' or something similar.

> I agree inconsistency is a bad thing though.

Yes, especially when it is sometimes bytes and sometimes characters, and you have to check the UTF-8 flag.

> Not sure it's easy to fix though.

I think in Perl you can get the encoding with I18N::Langinfo::langinfo(I18N::Langinfo::CODESET()) and then decode using the Encode module (both are core modules).

Perhaps that can be fixed in Perl code, in Errno (we already load Errno when %! is accessed), which would auto-load I18N::Langinfo and Encode?

And I am not at all sure about the perl C internals.

> Patches welcome.

I cannot do C coding.

Also, I think that old code relying on the old behaviour was not relying on something undocumented. It was partly documented:

http://perldoc.perl.org/perllocale.html

> Note especially that the string value of $! and the error messages
> given by external utilities may be changed by LC_MESSAGES

(perllocale also now has updates related to $! in blead)

http://perldoc.perl.org/perlunicode.html

> there are still many places where Unicode (in some encoding or
> another) could be given as arguments or received as results, or both,
> but it is not.

So the ideal fix would be, IMHO:
1. document it (perllocale or perlunicode or perlvar)
2. decode $! in non-UTF-8 locales; always return character strings
3. turn on the new behaviour only with 'use feature'
RT-Send-CC: perl5-porters [...] perl.org
There is a distribution which decodes POSIX::strerror with I18N::Langinfo:

http://search.cpan.org/~kryde/I18N-Langinfo-Wide-7/

Another possible problem is that all the examples in the perl documentation that use "die $!" now raise warnings:

http://perldoc.perl.org/perlopentut.html

    open(INFO, "datafile") || die("can't open datafile: $!");
    open(INFO, "< datafile") || die("can't open datafile: $!");
    open(RESULTS,"> runstats") || die("can't open runstats: $!");
    open(LOG, ">> logfile ") || die("can't open logfile: $!");

========
$ LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.utf8 perl -e 'open(INFO, "datafile") || die $!;'
Wide character in die at -e line 1.
Нет такого файла или каталога at -e line 1.
========
CC: perl5-porters [...] perl.org
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Wed, 28 Aug 2013 18:22:57 -0500
To: perlbug-followup [...] perl.org
From: Salvador Ortiz Garcia <sog [...] msg.com.mx>
Time to set PERL_UNICODE=SL?

---
Salvador Ortiz.
RT-Send-CC: perl5-porters [...] perl.org
On Wed Aug 28 10:19:47 2013, public@khwilliamson.com wrote:
> On 08/28/2013 02:52 AM, Victor Efimov (via RT) wrote:
> > I believe this is useless and just makes it harder to decode $! value
> > properly.
>
> I don't have a clue as to why you think this is useless. This change
> was to fix https://rt.perl.org/rt3/Ticket/Display.html?id=112208
> (reported also as perl #117429, so more than one person found this to be
> a bug). The patch merely examines the string text of $!, and if it is
> UTF-8, sets the flag indicating that.

You are describing this from the point of view of the internals. From the user’s standpoint, this means you are decoding $! if the character set is UTF-8, but leaving it encoded otherwise.

This means even #112208 is not fixed, because the test case was ‘use open <:std :encoding(utf-8)>’ followed by $!. If $! is not utf-8 and you try to feed it through STDOUT, you still get garbage on the screen.

The ultimate problem is that perl has no way of guaranteeing that $! can be fed to STDOUT and come out correctly. Even if it could do that, there is no way for it to tell that STDOUT/STDERR is where $! is going to go.

So now, $! may or may not be encoded, and you have no way of telling reliably without doing the same environment checks that perl itself did internally before deciding to decode $! itself.

--
Father Chrysostomos
RT-Send-CC: perl5-porters [...] perl.org
On Wed Aug 28 23:40:08 2013, sprout wrote:
> So now, $! may or may not be encoded, and you have no way of telling
> reliably without doing the same environment checks that perl itself did
> internally before deciding to decode $! itself.

Small corrections:

a) Actually there is a way: check the is_utf8($!) flag (which is not good, because is_utf8 is marked as dangerous, and it is documented that you cannot distinguish characters from bytes with this flag).

b) The current fix does not do environment checks; it just does a UTF-8 validity check:
http://perl5.git.perl.org/perl.git/commitdiff/1500bd919ffeae0f3252f8d1bb28b03b043d328e
CC: perl5-porters [...] perl.org, Lukas Mai <l.mai [...] web.de>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Thu, 29 Aug 2013 14:04:14 -0600
To: perlbug-followup [...] perl.org
From: Karl Williamson <public [...] khwilliamson.com>
On 08/29/2013 02:15 AM, Victor Efimov via RT wrote:
> On Wed Aug 28 23:40:08 2013, sprout wrote:
>> So now, $! may or may not be encoded, and you have no way of telling
>> reliably without doing the same environment checks that perl itself did
>> internally before deciding to decode $! itself.

I don't follow these arguments. What that commit did is only to look at the string returned by the operating system, and if it is encoded in UTF-8, to set that flag in the scalar. That's it (*). If the OS didn't return UTF-8, it leaves the flag alone. I find it hard to comprehend that this isn't the right thing to do. For the first time, $! in string context is no different than any other string scalar in Perl. They have a utf-8 bit set which means that the encoding is in UTF-8, or they don't have it set, which means that the encoding is unknown to Perl. This commit did not change the latter part one iota.

We have conventions as to what the bytes in that scalar mean depending on the context it is used, the pragmas that are in effect in those contexts, and the operations that are being performed on it. But they are just conventions. This commit did not change that.

What is different about $! is that we have made the decision to respect locale when accessing it even when not in the scope of 'use locale'. In light of these issues, perhaps this should be discussed again. I'll let the people who argued for that decision to again argue for it.

The change fixed two bug reports for the common case where the locales for messages and the I/O matched and where people had not taken pains to deal with locale. I think that should trump the less frequent cases, given the conflicts.

If code wants $! to be expressed in a certain language, it should set the locale to that language while accessing $! and then restore the old locale.

> Small corrections:
>
> a) Actually there is a way: check is_utf8($!) flag (which is not good
> because is_utf8 marked as danger, and it's documented you cant distinct
> characters from bytes with this flag)

I don't see that danger marked currently in the pod for utf8.pm. Where do you see that?

> b) Current fix does not do environment checks, it just tries to do UTF-8
> validity check
> http://perl5.git.perl.org/perl.git/commitdiff/1500bd919ffeae0f3252f8d1bb28b03b043d328e

(*) To be precise:

1) If the string returned by the OS is entirely ASCII, it does not set the UTF-8 flag. This is because ASCII UTF-8 and non-UTF-8 are identical, so the flag is irrelevant. And yes, this is buggy if operating under a non-ASCII 7-bit locale, as in ISO 646. These locales have all been superseded so should be rare today, but a bug report could be written on this.

2) As Victor notes, the commit does a UTF-8 validity check, so it is possible that that could give false positives. But as Wikipedia says, "One of the few cases where charset detection works reliably is detecting UTF-8. This is due to the large percentage of invalid byte sequences in UTF-8, so that text in any other encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test." (The original emphasized "extremely".) I checked this out with the CP1251 character set, and the only modern Russian character that could be a continuation byte is ё. All other vowels and consonants must be start bytes. That means that to generate a false positive, an OS message in CP1251 must only contain words whose 2nd, 4th, ... bytes are that vowel. That just isn't going to happen, though the common Russian word Её (her, hers, ...) could be confusable if there were no other words in the message.
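The suggestion in this message, to set the locale while accessing the error text and then restore it, might look roughly like the sketch below. The helper name strerror_in_locale is hypothetical. It assumes a platform where POSIX exports LC_MESSAGES, and whether the returned text actually follows LC_MESSAGES depends on the perl version, the platform, and the locales installed; only the "C" case is exercised here.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_MESSAGES strerror);

# Hypothetical helper: fetch the message for an errno value under a chosen
# LC_MESSAGES locale, then put the previous locale back.
sub strerror_in_locale {
    my ($errno, $locale) = @_;
    my $old      = setlocale(LC_MESSAGES);                   # query current setting
    my $switched = defined setlocale(LC_MESSAGES, $locale);  # undef if unavailable
    my $text     = strerror($errno);
    setlocale(LC_MESSAGES, $old) if $switched && defined $old;
    return $text;
}

print strerror_in_locale(2, "C"), "\n";   # 2 is ENOENT on most systems
```

Asking for the "C" locale is the conservative choice for log files or for messages that other programs parse, since it avoids non-ASCII text altogether.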
RT-Send-CC: perl5-porters [...] perl.org
On Thu Aug 29 13:05:00 2013, public@khwilliamson.com wrote:
> I don't see that danger marked currently in the pod for utf8.pm. Where
> do you see that?

http://perldoc.perl.org/perlunifaq.html#What-is-%22the-UTF8-flag%22?

======
Please, unless you're hacking the internals, or debugging weirdness, don't think about the UTF8 flag at all. That means that you very probably shouldn't use is_utf8 , _utf8_on or _utf8_off at all.
======

> 2) As Victor notes, the commit does a UTF-8 validity check, so it is
> possible that that could give false positives. But as Wikipedia says,
> "One of the few cases where charset detection works reliably is
> detecting UTF-8. This is due to the large percentage of invalid byte
> sequences in UTF-8, so that text in any other encoding that uses bytes
> with the high bit set is extremely unlikely to pass a UTF-8 validity
> test." (The original emphasized "extremely".) I checked this out with
> the CP1251 character set, and the only modern Russian character that
> could be a continuation byte is ё. All other vowels and consonants must
> be start bytes. That means that to generate a false positive, an OS
> message in CP1251 must only contain words whose 2nd, 4th, ... bytes are
> that vowel. That just isn't going to happen, though the common Russian
> word Её (her, hers, ...) could be confusable if there were no other
> words in the message.

I agree that it is pretty reliable. However, different languages and different encodings can show different misdetection rates. For example, the rate for CP866 (probably an ancient encoding) is higher than for CP1251.

Also, the Russian alphabet does not contain the A-Z characters, unlike German or French. So a French error message can contain just a couple of non-ASCII characters, unlike a Russian one.

I would not be surprised if this detection does *not* introduce a single bug for any combination of encoding and language. However, I would not be surprised either if this detection is broken for some language-encoding pair (perhaps for non-Western, non-Cyrillic languages).
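To make the heuristic under discussion concrete, here is a minimal sketch of a "does this byte string look like UTF-8" check. It is an approximation written for illustration, not the code from the commit: the name looks_like_utf8 is hypothetical, and it uses utf8::decode on a copy as the well-formedness test. The first sample is the first word of the CP1251 Dump from the original report; the second is a two-byte sequence that is well-formed UTF-8 but could equally be text in a single-byte encoding.

```perl
use strict;
use warnings;

# Hypothetical helper approximating the heuristic discussed above: treat a
# byte string as UTF-8 only if it has high-bit bytes and is well-formed.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return 0 unless $bytes =~ /[\x80-\xFF]/;   # pure ASCII: the flag is irrelevant
    my $copy = $bytes;
    return utf8::decode($copy) ? 1 : 0;        # true only for well-formed UTF-8
}

# CP1251 bytes from the Dump in the original report (first word only).
my $cp1251 = "\316\362\352\340\347\340\355\356";
# Two bytes that are well-formed UTF-8 and also valid single-byte text.
my $ambiguous = "\xC3\xBC";

printf "%d %d %d\n", looks_like_utf8("Permission denied"),
                     looks_like_utf8($cp1251),
                     looks_like_utf8($ambiguous);
```

The third case is the kind of input where a validity check alone cannot tell which encoding was intended, which is the concern raised for short messages with only a few non-ASCII characters.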
RT-Send-CC: perl5-porters [...] perl.org
On Thu Aug 29 14:06:57 2013, vsespb wrote:
> On Thu Aug 29 13:05:00 2013, public@khwilliamson.com wrote:
> > 2) As Victor notes, the commit does a UTF-8 validity check, so it is
> I agree that it's pretty reliable. However different languages and

Here is a generator of byte sequences that are valid in UTF-8 and in another encoding, and which represent letters (\w) in that other encoding.

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use Encode;
    use utf8;

    binmode STDOUT, ":encoding(UTF-8)";

    # Candidate characters: every word character from U+0080 to U+0400.
    my @A = grep { /\w/ } map { chr($_) } (128..1024);

    # Try every string of one to three candidate characters.
    for my $z1 (@A) {
    for my $z2 ('', @A) {
    for my $z3 ('', @A) {
    for my $encoding (qw/ISO-8859-1 ISO-8859-2 ISO-8859-3 ISO-8859-4 ISO-8859-7 ISO-8859-8 ISO-8859-9 ISO-8859-10/) {
        my $S = $z1.$z2.$z3;
        # Skip strings that cannot be represented in this encoding.
        my $e = eval { encode($encoding, "$S", Encode::FB_CROAK); };
        next unless $e;
        my $xx = $e;
        $xx =~ s/(.)/sprintf("\\x%02X",ord($1))/eg;
        # Reinterpret the encoded bytes as UTF-8 and test for validity.
        Encode::_utf8_on($e);
        if (utf8::valid($e)) {
            print "# $encoding [$S]".(length($S))." [$e] [$xx]\n";
            print <<"END";
    perl -e 'use Encode; binmode STDOUT, ":encoding(UTF-8)"; my \$z = "$xx"; print "[", decode("UTF-8", "\$z", Encode::FB_CROAK), "]\\t[", decode("$encoding", "\$z", Encode::FB_CROAK), "]\\n"'
    END
        }
    }
    }}}
    __END__

Example output:

    perl -e 'use Encode; binmode STDOUT, ":encoding(UTF-8)"; my $z = "\xC3\xBE"; print "[", decode("UTF-8", "$z", Encode::FB_CROAK), "]\t[", decode("ISO-8859-2", "$z", Encode::FB_CROAK), "]\n"'
    perl -e 'use Encode; binmode STDOUT, ":encoding(UTF-8)"; my $z = "\xC3\xBC"; print "[", decode("UTF-8", "$z", Encode::FB_CROAK), "]\t[", decode("ISO-8859-2", "$z", Encode::FB_CROAK), "]\n"'
    perl -e 'use Encode; binmode STDOUT, ":encoding(UTF-8)"; my $z = "\xC3\xA1"; print "[", decode("UTF-8", "$z", Encode::FB_CROAK), "]\t[", decode("ISO-8859-2", "$z", Encode::FB_CROAK), "]\n"'

Example of running one of the generated commands:

    $ perl -e 'use Encode; binmode STDOUT, ":encoding(UTF-8)"; my $z = "\xC3\xBC"; print "[", decode("UTF-8", "$z", Encode::FB_CROAK), "]\t[", decode("ISO-8859-2", "$z", Encode::FB_CROAK), "]\n"'
    [ü]	[Ăź]
RT-Send-CC: perl5-porters [...] perl.org
On Thu Aug 29 13:05:00 2013, public@khwilliamson.com wrote:
> have all been superseded so should be rare today, but a bug report could
> be written on this.

Rare under Linux. AFAIK FreeBSD 9 (latest stable) users still have a single-byte encoding locale by default (at least, they try hard to get UTF-8 working and it is only partly supported):
http://forums.freebsd.org/showthread.php?t=34682

There are real users with a non-UTF-8 locale; I saw one. We spent several hours trying to find out why my perl script hangs sometimes, and in the end found a bug in perlio https://rt.perl.org/rt3/Ticket/Display.html?id=117537 related to non-UTF

A real application which is broken by this change is 'ack' (ack 1 and ack 2). Russian error messages are now printed with a warning (under a UTF-8 locale!). French error messages are now corrupted (because they are Latin-1) under a UTF-8 locale too.
https://github.com/petdance/ack2/issues/367
RT-Send-CC: perl5-porters [...] perl.org
On Thu Aug 29 13:05:00 2013, public@khwilliamson.com wrote: Show quoted text
> On 08/29/2013 02:15 AM, Victor Efimov via RT wrote:
> > > > On Wed Aug 28 23:40:08 2013, sprout wrote:
> >> > >> So now, $! may or may not be encoded, and you have to way of
> telling
> >> reliably without doing the same environment checks that perl itself
> did
> >> internally before deciding to decode $! itself.
> > I don't follow these arguments. What that commit did is only to look > at > the string returned by the operating system, and if it is encoded in > UTF-8, to set that flag in the scalar. That's it (*). If the OS > didn't > return UTF-8, it leaves the flag alone. I find it hard to comprehend > that this isn't the right thing to do. For the first time, $! in > string > context is no different than any other string scalar in Perl. They > have > a utf-8 bit set which means that the encoding is in UTF-8,
You are still describing this from the point of view of the internals. From the user's point of view, the utf8 flag does not mean it is encoded in utf8. It means it is *de*coded; just a sequence of Unicode characters.

> or they don't have it set, which means that the encoding is unknown to Perl.

I.e., still encoded.

> This commit did not change the latter part one iota.

The former is the problem, not the latter. If a program can find out what encoding the OS is using for errno messages, it should be able to apply that encoding to $! via decode($os_encoding, $!, Encode::FB_CROAK). But that fails now when perl thought it saw utf8.

> We have conventions as to what the bytes in that scalar mean depending
> on the context it is used, the pragmas that are in effect in those
> contexts, and the operations that are being performed on it. But they
> are just conventions. This commit did not change that.

I don’t follow. The bytes inside the scalar are not visible to the Perl program without resorting to introspection that should never be used for dispatch. Your commit changed the content of the scalar as returned by ord and substr, but only sometimes. It’s the ‘only sometimes’ that is problematic.

> What is different about $! is that we have made the decision to respect
> locale when accessing it even when not in the scope of 'use locale'.

The problem here is that the locale is only sometimes being respected.

> In light of these issues, perhaps this should be discussed again. I'll let
> the people who argued for that decision to again argue for it.
>
> The change fixed two bug reports for the common case where the locales
> for messages and the I/O matched and where people had not taken pains to
> deal with locale. I think that should trump the less frequent cases,
> given the conflicts.

But the less frequent cases now require one to introspect internal scalar flags that should make no difference. Also, is that really more frequent? What about scripts that pass $! straight to STDOUT without layers, knowing that $! is already in the character set the terminal expects?

> If code wants $! to be expressed in a certain language, it should set
> the locale to that language while accessing $! and then restore the old
> locale.

Are you suggesting that perl itself start defaulting to the C locale for $!?
> > Small corrections:
> >
> > a) Actually there is a way: check is_utf8($!) flag (which is not good
> > because is_utf8 marked as danger, and it's documented you cant distinct
> > characters from bytes with this flag)
>
> I don't see that danger marked currently in the pod for utf8.pm. Where
> do you see that?

(Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in UTF-8. Functionally the same as Encode::is_utf8().

I think he is referring to ‘internally’ here, which indicates that you shouldn’t rely on it.

> > b) Current fix does not do environment checks, it just tries to do UTF-8
> > validity check

http://perl5.git.perl.org/perl.git/commitdiff/1500bd919ffeae0f3252f8d1bb28b03b043d328e

> (*) To be precise
>
> 1) if the string returned by the OS is entirely ASCII, it does not set
> the UTF-8 flag. This is because ASCII UTF-8 and non-UTF-8 are
> identical, so the flag is irrelevant. And yes, this is buggy if
> operating under a non-ASCII 7-bit locale, as in ISO 646. These locales
> have all been superseded so should be rare today, but a bug report could
> be written on this.
>
> 2) As Victor notes, the commit does a UTF-8 validity check, so it is
> possible that that could give false positives. But as Wikipedia says,
> "One of the few cases where charset detection works reliably is
> detecting UTF-8. This is due to the large percentage of invalid byte
> sequences in UTF-8, so that text in any other encoding that uses bytes
> with the high bit set is extremely unlikely to pass a UTF-8 validity
> test." (The original emphasized "extremely".) I checked this out with
> the CP1251 character set, and the only modern Russian character that
> could be a continuation byte is ё. All other vowels and consonants must
> be start bytes. That means that to generate a false positive, an OS
> message in CP1251 must only contain words whose 2nd, 4th, ... bytes are
> that vowel. That just isn't going to happen, though the common Russian
> word Её (her, hers, ...) could be confusable if there were no other
> words in the message.

That is all very nice, but how would you rewrite this code to work in 5.19.2 and up?

    if (!open fh, $filename) {
        # add_to_log expects a string of characters, so decode it
        add_to_log($filename, 0+$!, Encode::decode(
            I18N::Langinfo::langinfo(I18N::Langinfo::CODESET()),
            $!
        ));
        return;
    }

-- Father Chrysostomos
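For readers following the footnote quoted in this message: the heuristic it describes (leave pure-ASCII messages alone, otherwise flag the string if it passes a UTF-8 validity test) can be sketched at the Perl level roughly as below. This is only an illustration, not the code from the commit, which lives in perl's C source; `looks_like_utf8` is a hypothetical name, and the byte strings are taken from the Devel::Peek dumps in the original report plus the "Её" case mentioned in the footnote.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $bytes has at least one high-bit byte and is
# well-formed UTF-8, i.e. the case in which the footnote says the flag is set.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return 0 unless $bytes =~ /[^\x00-\x7F]/;   # pure ASCII: flag is irrelevant
    my $copy = $bytes;                          # utf8::decode works in place
    return utf8::decode($copy) ? 1 : 0;         # false if not well-formed UTF-8
}

# First word of the EACCES message from the report, in both encodings.
my $utf8   = "\320\236\321\202\320\272\320\260\320\267\320\260\320\275\320\276";
my $cp1251 = "\316\362\352\340\347\340\355\356";
my $her    = "\305\270";    # "Её" in CP1251, which is also valid UTF-8

printf "%d %d %d\n",
    looks_like_utf8($utf8), looks_like_utf8($cp1251), looks_like_utf8($her);
```

Under this sketch the UTF-8 bytes and the two-byte "Её" string are reported as UTF-8, while the CP1251 word is not, which matches the false-positive case described in the footnote.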
CC: perl5-porters [...] perl.org
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Sat, 31 Aug 2013 22:36:30 -0600
To: perlbug-followup [...] perl.org
From: Karl Williamson <public [...] khwilliamson.com>
On 08/31/2013 07:27 AM, Father Chrysostomos via RT wrote:
> On Thu Aug 29 13:05:00 2013, public@khwilliamson.com wrote:
>> On 08/29/2013 02:15 AM, Victor Efimov via RT wrote:
>>> On Wed Aug 28 23:40:08 2013, sprout wrote:
>>>> So now, $! may or may not be encoded, and you have to way of telling
>>>> reliably without doing the same environment checks that perl itself did
>>>> internally before deciding to decode $! itself.
>>
>> I don't follow these arguments. What that commit did is only to look at
>> the string returned by the operating system, and if it is encoded in
>> UTF-8, to set that flag in the scalar. That's it (*). If the OS didn't
>> return UTF-8, it leaves the flag alone. I find it hard to comprehend
>> that this isn't the right thing to do. For the first time, $! in string
>> context is no different than any other string scalar in Perl. They have
>> a utf-8 bit set which means that the encoding is in UTF-8,
>
> You are still describing this from the point of view of the internals.
I persist in this because I believe your point is a red herring. I believe that it is a valid and strong argument that bringing outlier behavior into conformity with the rest of how Perl operates may very well trump other concerns. I was attempting to show that that is what this commit did.

Rather than address most of the rest of your email, some of which I believe is specious or false, let's cut to the chase:

> how would you rewrite this code to work in
> 5.19.2 and up?
>
>     if (!open fh, $filename) {
>         # add_to_log expects a string of characters, so decode it
>         add_to_log($filename, 0+$!, Encode::decode(
>             I18N::Langinfo::langinfo(I18N::Langinfo::CODESET()),
>             $!
>         ));
>         return;
>     }

I feel compelled to point out that this code is buggy. I18N::Langinfo is not portable to all platforms that Perl runs on, and CODESET gives the locale of LC_CTYPE, which may not be the same locale that $! is returned in: LC_MESSAGES. (Note that the code could be modified to change LC_CTYPE to the locale of LC_MESSAGES temporarily around the langinfo call to address this bug.) Also, some vendors' nl_langinfo() was, at the time, so buggy that the core .t for this doesn't do any "real" testing. http://perl5.git.perl.org/perl.git/blame/HEAD:/ext/I18N-Langinfo/t/Langinfo.t

But on platforms where it works reliably, and the typical case where LC_CTYPE matches LC_MESSAGES, my commit does break this code. If it were my code here, I'd 'use bytes' (I don't believe bytes.pm should be removed from core; this area is one of the few valid uses for it, and this is not the thread to discuss it), or utf8::is_utf8() (I think we should soften somewhat the admonition against using that.)

I think all of us would agree that deference should be paid to (apparently) working code when making changes. And it may be that this commit is so egregious, or not really helpful in enough places, that its cost-benefit ratio is not high enough to keep.

And $! remains an outlier in the sense that it is AFAIK, and I've looked hard (perhaps not hard enough), now the only place (except for some POSIX:: routines) where the program's underlying current locale leaks outside the scope of 'use locale'.

The main argument that I've heard for doing that is that $! is often for the end-user and not the programmer. But it isn't for the end user if what gets displayed is gibberish, which includes being in some language the user doesn't know, though the latter is better than garbage bytes. So what I'm advocating is re-examining whether we wish $! to respect 'use locale' or not. If we chose to respect 'use locale', outside that, it would return messages in the system default locale, typically "C".
I'm pretty confident that the problem can't be solved so that no code has to change and things just start working correctly for everybody. Currently, using $! in production code that can be operated by users who might have their own locales is much more complicated than people imagine. "die $!" could print gibberish. Maybe a partial answer is to create a wrapper that does the best it can on the platform it is running on, and suggest people change to use it. If this commit is reverted, we do need to decide how we will address the bugs it fixed and the new ones that are sure to come in (barring some better answer). Do we reject them and say you need to handle $! yourself?
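As an illustration of the utf8::is_utf8() route mentioned in this message, here is a minimal sketch of how the add_to_log call could be adapted. It is not an endorsed fix: `errno_text` is a hypothetical helper, `$codeset` stands for whatever encoding the caller believes the OS uses for messages, and the approach relies on the internal flag, which is exactly what other participants in the thread object to.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the error text as a character string whether
# or not perl has already decoded it.
sub errno_text {
    my ($msg, $codeset) = @_;
    return $msg if utf8::is_utf8($msg);    # 5.19.2+ under UTF-8: already characters
    # $msg is a private copy, so in-place modification by FB_CROAK is harmless.
    return Encode::decode($codeset, $msg, Encode::FB_CROAK);
}

# Usage sketch (add_to_log and $codeset come from the caller's own code):
#   open my $fh, '<', $filename
#       or add_to_log($filename, 0+$!, errno_text("$!", $codeset));
```

Note that a pure-ASCII message is never flagged, so it goes through Encode::decode, which is harmless for ASCII-compatible codesets.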
CC: Father Chrysostomos via RT <perlbug-followup [...] perl.org>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Sun, 1 Sep 2013 18:36:53 +0400
To: Karl Williamson <public [...] khwilliamson.com>
From: Victor Efimov <victor [...] vsespb.ru>



2013/9/1 Karl Williamson <public@khwilliamson.com>
> And $! remains an outlier in the sense that it is AFAIK, and I've looked hard (perhaps not hard enough), now the only place (except for some POSIX:: routines) where the program's underlying current locale leaks outside the scope of 'use locale'.


But that is not the only place where non-ASCII characters can appear.

The following is documented in perlunicode:

"While Perl does have extensive ways to input and output in Unicode, and a few other "entry points" like the @ARGV array (which can sometimes be interpreted as
 UTF-8), there are still many places where Unicode (in some encoding or another) could be given as arguments or received as results, or both, but it is not."

I believe that note can mean that encoded $! is not a bug, but a feature.
If it's considered a bug, then all other places where non-ASCII appears encoded, and it's not explicitly documented, can be considered bugs too (examples are $0, %INC values, @INC, something else?)

Thus it's impossible for people to use those variables now, as they may change at any time in the future.

CC: Karl Williamson <public [...] khwilliamson.com>, Father Chrysostomos via RT <perlbug-followup [...] perl.org>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Sun, 1 Sep 2013 17:23:00 +0200
To: Victor Efimov <victor [...] vsespb.ru>
From: Leon Timmermans <fawaka [...] gmail.com>
On Sun, Sep 1, 2013 at 4:36 PM, Victor Efimov <victor@vsespb.ru> wrote:
> But that is not the only place, where non-ASCII character can appear.
>
> The following is documented in perlunicode:
>
> "While Perl does have extensive ways to input and output in Unicode, and a few other "entry points" like the @ARGV array (which can sometimes be interpreted as
> UTF-8), there are still many places where Unicode (in some encoding or another) could be given as arguments or received as results, or both, but it is not."
>
> I believe that note can mean that encoded $! is not a bug, but a feature.
> If it's considered a bug, then all other places where non-ASCII appears encoded, and it's not explicitly documented, can be considered as bug (examples are $0, %INC values, @INC, something else?)
>
> Thus it's impossible for people to use those variables now, as it may change anytime in the future.

$! is inherently a piece of text, not a piece of binary data. As such, it makes perfect sense to treat it as such and automatically decode it. The same is not necessarily true for your other examples.

Leon
CC: Father Chrysostomos via RT <perlbug-followup [...] perl.org>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Sun, 1 Sep 2013 17:25:09 +0200
To: Karl Williamson <public [...] khwilliamson.com>
From: Leon Timmermans <fawaka [...] gmail.com>
On Sun, Sep 1, 2013 at 6:36 AM, Karl Williamson <public@khwilliamson.com> wrote:
> And $! remains an outlier in the sense that it is AFAIK, and I've looked hard (perhaps not hard enough), now the only place (except for some POSIX:: routines) where the program's underlying current locale leaks outside the scope of 'use locale'.

Yeah, in POSIX, strftime and the is* functions are also affected.

> The main argument that I've heard for doing that is that $! is often for the end-user and not the programmer.  But it isn't for the end user if what gets displayed is gibberish, which includes being in some language the user doesn't know, though the latter is better than garbage bytes.  So what I'm advocating is re-examining whether we wish $! to respect 'use locale' or not.  If we chose to respect 'use locale', outside that, it would return messages in the system default locale, typically "C".

That does sound like consistency to me.

> I'm pretty confident that the problem can't be solved so that no code has to change and things just start working correctly for everybody.

That is my feeling too. The new situation feels rather unfinished to me, but the old situation was clearly not the most useful behavior we can offer.

> Currently, using $! in production code that can be operated by users who might have their own locales is much more complicated than people imagine.  "die $!" could print gibberish.

Indeed.

Leon
CC: perl5-porters [...] perl.org
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Sun, 01 Sep 2013 10:23:42 -0600
To: perlbug-followup [...] perl.org
From: Karl Williamson <public [...] khwilliamson.com>
On 08/31/2013 10:36 PM, Karl Williamson wrote:
> I feel compelled to point out that this code is buggy. I18N::Langinfo
> is not portable to all platforms that Perl runs on, and CODESET gives
> the locale of LC_CTYPE, which may not be the same locale that $! is
> returned in: LC_MESSAGES. (Note that the code could be modified to
> change LC_CTYPE to the locale of LC_MESSAGES temporarily around the
> langinfo call to addess this bug.) Also, some vendors' nl_langinfo()
> was, at the time, so buggy that the core .t for this doesn't do any
> "real" testing.
> http://perl5.git.perl.org/perl.git/blame/HEAD:/ext/I18N-Langinfo/t/Langinfo.

I now feel compelled to point out that I should have been clearer: this code is fine, not buggy, if used in the environment for which it was likely designed. On a platform with a working nl_langinfo(), where the programmer knows that LC_MESSAGES and LC_CTYPE are always in sync, this worked well, until I broke it.
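The modification mentioned in the quoted text (temporarily switching LC_CTYPE to the LC_MESSAGES locale around the langinfo call) might look roughly like the sketch below. It assumes a platform that has LC_MESSAGES and a working nl_langinfo(); `messages_codeset` is a hypothetical name, not an existing API.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE LC_MESSAGES);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: report the codeset of the LC_MESSAGES locale by
# temporarily pointing LC_CTYPE at it, then restoring LC_CTYPE.
sub messages_codeset {
    my $old_ctype = setlocale(LC_CTYPE);       # one-argument form only queries
    my $messages  = setlocale(LC_MESSAGES);
    setlocale(LC_CTYPE, $messages) if defined $messages;
    my $codeset = langinfo(CODESET);
    setlocale(LC_CTYPE, $old_ctype) if defined $old_ctype;
    return $codeset;
}

print messages_codeset(), "\n";
```

The returned name could then be passed to Encode::decode in place of the plain langinfo(CODESET) call, on platforms where the two categories can differ.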
CC: Karl Williamson <public [...] khwilliamson.com>, Father Chrysostomos via RT <perlbug-followup [...] perl.org>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Sun, 1 Sep 2013 20:47:54 +0400
To: Leon Timmermans <fawaka [...] gmail.com>
From: Victor Efimov <victor [...] vsespb.ru>

2013/9/1 Leon Timmermans <fawaka@gmail.com>
> $! is inherently a piece of text, not piece of binary data. As such, it makes perfect sense to treat it as such an automatically decode it. The same is not necessarily true for your other examples.


btw, it's interesting that $^E is not affected by this change, i.e. when $! is the same as $^E (I tested on Linux only), $^E does not have the UTF-8 flag, while $! does.

CC: Leon Timmermans <fawaka [...] gmail.com>, Father Chrysostomos via RT <perlbug-followup [...] perl.org>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Sun, 01 Sep 2013 11:15:30 -0600
To: Victor Efimov <victor [...] vsespb.ru>
From: Karl Williamson <public [...] khwilliamson.com>
On 09/01/2013 10:47 AM, Victor Efimov wrote:
> 2013/9/1 Leon Timmermans <fawaka@gmail.com>
>
>     $! is inherently a piece of text, not piece of binary data. As such,
>     it makes perfect sense to treat it as such an automatically decode
>     it. The same is not necessarily true for your other examples.
>
> btw, interesting that $^E is not affected by this change, i.e when $! is
> same as $^E (I tested on linux only), $^E does not have utf-8 flag,
> while $! has.

I've been wondering myself what should happen with $^E, and I believe the two should be made consistent.

Some other thoughts I've had about this issue:

The commit did not break the ISO 646 7-bit codings, as the behavior is unchanged for those. Those encodings must not be very important nor have been for quite some time, as it does not appear that Encode supports them.

We could have a feature automatically turned on in v5.20. I'll call it 'errno' for now ('mauve' having been taken ;) ). Without it being on, $! works as it did in <=v5.18. Within its scope Perl attempts to decode $! as best it can, autoloading Encode and trying to determine the locale using nl_langinfo() if available.

This may be a crazy idea; but I thought I'd put it out there to stimulate discussion.
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Sun, 1 Sep 2013 18:45:37 +0100
To: Perl5 Porters <perl5-porters [...] perl.org>
From: Zefram <zefram [...] fysh.org>
Karl Williamson wrote:
>Within its scope Perl attempts to decode $! as best it can,

Scoping doesn't work well for this sort of thing. The decoding happens in get magic, when the variable is being read. If that behaviour is affected by the lexical scope in which the reading happens, this means that different readers will see different values in the same variable, which is awfully confusing if a reference to the variable gets passed around. Worse, XS code gets the behaviour of *its caller's* lexical scope.

Amusingly, $[ used to influence $#foo magic variables in this manner. It's one of the reasons I'm glad we got rid of $[.

-zefram
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Sun, 1 Sep 2013 22:07:54 +0400
To: Zefram <zefram [...] fysh.org>
From: Victor Efimov <victor [...] vsespb.ru>
one problem with lexical scope is also POSIX::strerror.
currently it's implemented using

strerror  => 'errno => local $! = $_[0]; "$!"',

thus it has to be fixed too if we implement lexical featurization.
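For illustration, the one-liner above amounts to a function like the following sketch (`my_strerror` is a hypothetical stand-in, not the POSIX.pm source). Because it simply stringifies $!, any change to how $! is decoded would be inherited by such a function.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical stand-in mirroring the quoted one-liner: set $! locally to
# the error number and return its string value.
sub my_strerror {
    my ($errno) = @_;
    local $! = $errno;    # the caller's value of $! is restored on return
    return "$!";
}

print my_strerror(POSIX::EACCES()), "\n";
```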


2013/9/1 Zefram <zefram@fysh.org>
> Karl Williamson wrote:
> >Within its scope Perl attempts to decode $! as best it can,
>
> Scoping doesn't work well for this sort of thing.  The decoding happens in
> get magic, when the variable is being read.  If that behaviour is affected
> by the lexical scope in which the reading happens, this means that
> different readers will see different values in the same variable, which
> is awfully confusing if a reference to the variable gets passed around.
> Worse, XS code gets the behaviour of *its caller's* lexical scope.
>
> Amusingly, $[ used to influence $#foo magic variables in this manner.
> It's one of the reasons I'm glad we got rid of $[.
>
> -zefram

RT-Send-CC: perl5-porters [...] perl.org
On Sun Sep 01 09:24:19 2013, public@khwilliamson.com wrote:
> On 08/31/2013 10:36 PM, Karl Williamson wrote:
> > I feel compelled to point out that this code is buggy. I18N::Langinfo
> > is not portable to all platforms that Perl runs on, and CODESET gives
> > the locale of LC_CTYPE, which may not be the same locale that $! is
> > returned in: LC_MESSAGES. (Note that the code could be modified to
> > change LC_CTYPE to the locale of LC_MESSAGES temporarily around the
> > langinfo call to addess this bug.) Also, some vendors' nl_langinfo()
> > was, at the time, so buggy that the core .t for this doesn't do any
> > "real" testing.
> > http://perl5.git.perl.org/perl.git/blame/HEAD:/ext/I18N-Langinfo/t/Langinfo.
>
> I now feel compelled to point out that I should have been more clear
> that this code is fine, not buggy, if used in the environment in which
> it was likely designed for. On a platform with a working nl_langinfo()
> and the programmer knows that LC_MESSAGES and LC_CTYPE are always in
> sync, this worked well, until I broke it.

More importantly, as Victor pointed out, it breaks programs that are not trying to do anything with character sets or locales, such as ack, when they are running on a utf8 terminal in a utf8 locale. I dare to bet those are the most common.

On dromedary I get this with the system perl (5.12.3):

$ LC_ALL=hu_HU.utf8 perl -e 'open "oentuheon" or die $!'
Nincs ilyen fájl vagy könyvtár at -e line 1.

When I build my own (blead)perl, I get this:

$ LC_ALL=hu_HU.utf8 ./perl -e 'open "oentuheon" or die $!'
Nincs ilyen f?jl vagy k?nyvt?r at -e line 1.

-- Father Chrysostomos
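One plausible reading of the question marks, assuming the message is written to a handle with no encoding layer: once the text is a character string whose characters all fit in Latin-1, perl writes one byte per character, and a UTF-8 terminal cannot display those bytes. The sketch below reproduces the effect with in-memory handles; the sample word and variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode ();

my $os_bytes = "f\xC3\xA1jl";                        # "fájl" as UTF-8 bytes from the OS
my $decoded  = Encode::decode('UTF-8', $os_bytes);   # the same word as 4 characters

# Print both to in-memory handles that have no encoding layer.
open my $raw, '>', \my $raw_out or die "open: $!";
print {$raw} $os_bytes;
close $raw;

open my $dec, '>', \my $dec_out or die "open: $!";
print {$dec} $decoded;
close $dec;

# Show the bytes that actually reached each handle.
my $hex = sub { join ' ', map { sprintf '%02X', ord } split //, $_[0] };
print "bytes passed through: ", $hex->($raw_out), "\n";
print "decoded then printed: ", $hex->($dec_out), "\n";
```

The first line shows the original UTF-8 sequence C3 A1 intact; the second shows a single byte E1, which is Latin-1 "á" and not valid UTF-8 on its own.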
RT-Send-CC: perl5-porters [...] perl.org
On Sun Sep 01 11:09:41 2013, sprout wrote:
> On Sun Sep 01 09:24:19 2013, public@khwilliamson.com wrote:
> > On 08/31/2013 10:36 PM, Karl Williamson wrote:
> > > I feel compelled to point out that this code is buggy. I18N::Langinfo
> > > is not portable to all platforms that Perl runs on, and CODESET gives
> > > the locale of LC_CTYPE, which may not be the same locale that $! is
> > > returned in: LC_MESSAGES. (Note that the code could be modified to
> > > change LC_CTYPE to the locale of LC_MESSAGES temporarily around the
> > > langinfo call to addess this bug.) Also, some vendors' nl_langinfo()
> > > was, at the time, so buggy that the core .t for this doesn't do any
> > > "real" testing.
> > > http://perl5.git.perl.org/perl.git/blame/HEAD:/ext/I18N-Langinfo/t/Langinfo.
> >
> > I now feel compelled to point out that I should have been more clear
> > that this code is fine, not buggy, if used in the environment in which
> > it was likely designed for. On a platform with a working nl_langinfo()
> > and the programmer knows that LC_MESSAGES and LC_CTYPE are always in
> > sync, this worked well, until I broke it.
>
> More importantly, as Victor pointed out, it breaks programs that are not
> trying to do anything with character sets or locales, such as ack, when
> they are running on a utf8 terminal in a utf8 locale. I dare to bet
> those are the most common.
>
> On dromedary I get this with the system perl (5.12.3):
>
> $ LC_ALL=hu_HU.utf8 perl -e 'open "oentuheon" or die $!'
> Nincs ilyen f�jl vagy k�nyvt�r at -e line 1.

RT screwed it up. That appeared perfectly fine.

> When I build my own (blead)perl, I get this:
>
> $ LC_ALL=hu_HU.utf8 ./perl -e 'open "oentuheon" or die $!'
> Nincs ilyen f?jl vagy k?nyvt?r at -e line 1.

That is as it appeared, with question marks.

-- Father Chrysostomos
RT-Send-CC: perl5-porters [...] perl.org
On Sun Sep 01 10:46:13 2013, zefram@fysh.org wrote:
> Karl Williamson wrote:
> >Within its scope Perl attempts to decode $! as best it can,
>
> Scoping doesn't work well for this sort of thing. The decoding happens in
> get magic, when the variable is being read. If that behaviour is affected
> by the lexical scope in which the reading happens, this means that
> different readers will see different values in the same variable, which
> is awfully confusing if a reference to the variable gets passed around.

A new global variable is another option.

-- Father Chrysostomos
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Tue, 3 Sep 2013 03:10:13 +0400
To: Father Chrysostomos via RT <perlbug-followup [...] perl.org>
From: Victor Efimov <victor [...] vsespb.ru>

2013/9/1 Father Chrysostomos via RT <perlbug-followup@perl.org>
> A new global variable is another option.

perhaps ${^DECODED_ERROR} ?
CC: Father Chrysostomos via RT <perlbug-followup [...] perl.org>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Mon, 09 Sep 2013 19:06:57 -0600
To: Victor Efimov <victor [...] vsespb.ru>
From: Karl Williamson <public [...] khwilliamson.com>
On 09/02/2013 05:10 PM, Victor Efimov wrote:
> 2013/9/1 Father Chrysostomos via RT <perlbug-followup@perl.org>
>
>     A new global variable is another option.
>
> perhaps ${^DECODED_ERROR} ?

I have come to believe that this is probably the best way forward. That is, revert the $! change, and tell people who need it to use the new global variable which will decode as best it can on the given platform based on the locale in effect.

In any event, there should be uniform treatment of $! and $^E. That means that a parallel variable should be provided for $^E.

Does anyone know if the strings for the platforms that have separate $^E strings return those in the current locale or not?

These include vms, win32, dos, and os/2.
CC: Father Chrysostomos via RT <perlbug-followup [...] perl.org>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Tue, 10 Sep 2013 13:45:35 +0400
To: Karl Williamson <public [...] khwilliamson.com>
From: Victor Efimov <victor [...] vsespb.ru>

Also, a Unicode version of POSIX::strerror should probably be implemented.

2013/9/10 Karl Williamson <public@khwilliamson.com>
> In any event, there should be uniform treatment of $! and $^E.
> That means that a parallel variable should be provided for $^E.


CC: Father Chrysostomos via RT <perlbug-followup [...] perl.org>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Tue, 10 Sep 2013 14:01:17 +0400
To: Karl Williamson <public [...] khwilliamson.com>
From: Victor Efimov <victor [...] vsespb.ru>

On Win32 (Strawberry perl):

$chcp
Текущая кодовая страница: 866

$perl -MEncode -e "binmode STDOUT, ':encoding(CP866)'; open my $f, '<', 'notafile' or print decode('WINDOWS-1251', qq{error is: [$^E]})"
error is: [Не удается найти указанный файл]

(the first command outputs the "codepage" encoding, the last prints a sane Russian error message)

2013/9/10 Karl Williamson <public@khwilliamson.com>
> Does anyone know if the strings for the platforms that have separate $^E strings return those in the current locale or not?
>
> These include vms, win32, dos, and os/2.


CC: Father Chrysostomos via RT <perlbug-followup [...] perl.org>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Mon, 16 Sep 2013 10:04:06 -0600
To: Victor Efimov <victor [...] vsespb.ru>
From: Karl Williamson <public [...] khwilliamson.com>
On 09/09/2013 07:06 PM, Karl Williamson wrote:
> On 09/02/2013 05:10 PM, Victor Efimov wrote:
>> 2013/9/1 Father Chrysostomos via RT <perlbug-followup@perl.org>
>>
>>     A new global variable is another option.
>>
>> perhaps ${^DECODED_ERROR} ?
>
> I have come to believe that this is probably the best way forward. That
> is, revert the $! change, and tell people who need it to use the new
> global variable which will decode as best it can on the given platform
> based on the locale in effect.

In looking at this, I thought of something else. I do believe that the current behavior is correct for such a variable within the lexical scope of "use locale". But outside such scope the behavior would be to decode fully, as best as practicable on the platform being run on.

Then it occurred to me: would merely changing $! (and $^E) to behave this way address your issues? It is a change in behavior from the way things have always been, but outside "use locale", it would fully decode, which someone in the thread said was the issue with the current fix.
RT-Send-CC: perl5-porters [...] perl.org
So, you propose:

1. in the scope of 'use locale', implement the old behaviour (5.18 and earlier)
2. outside that scope, "decode fully, as best as practicable on the platform being run on"

I don't think this will solve the problem. Existing programs will still break (and to fix them you'll need to add 'use locale', which can introduce other bugs into the program). Existing programs might work with modern Unicode and, AFAIK, adding 'use locale' is just not recommended in that case. The fact that they use '$!' does not necessarily mean it's legacy code which doesn't work with Unicode. It can be brand new code written for 5.18.

Also, @Zefram mentioned here https://rt.perl.org/rt3/Ticket/Display.html?id=119499#txn-1250019 that lexical scope for such things isn't a good idea.

> decode fully, as best as practicable on the platform being run on

A function which sometimes returns a character string with the UTF-8 bit set, and sometimes returns a byte string in an unknown encoding, is useless IMHO. So if you decode $!, decoding should always be done. If decoding fails, IMHO it's better to return undef or something.

On Mon Sep 16 09:05:17 2013, public@khwilliamson.com wrote:
> On 09/09/2013 07:06 PM, Karl Williamson wrote:
> > On 09/02/2013 05:10 PM, Victor Efimov wrote:
> >> 2013/9/1 Father Chrysostomos via RT <perlbug-followup@perl.org>
> >>
> >>     A new global variable is another option.
> >>
> >> perhaps ${^DECODED_ERROR} ?
> >
> > I have come to believe that this is probably the best way forward. That
> > is, revert the $! change, and tell people who need it to use the new
> > global variable which will decode as best it can on the given platform
> > based on the locale in effect.
>
> In looking at this, I thought of something else. I do believe that the
> current behavior is correct for such a variable within the lexical scope
> of "use locale". But outside such scope the behavior would be to decode
> fully, as best as practicable on the platform being run on.
>
> Then it occurred to me would merely changing $! (and $^E) to behave this
> way address your issues? It is a change in behavior from the way things
> have alway been, but outside "use locale", it would fully decode, which
> someone in the thread was the issue with the current fix.
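The "always decode, or return undef" behaviour suggested in this message could be sketched in user code as follows. This is only an illustration of the idea, not a proposed core implementation; `decode_or_undef` is a hypothetical name and the codeset is assumed to be supplied by the caller.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: always return either a decoded character string or
# undef, never raw bytes in an unknown encoding.
sub decode_or_undef {
    my ($bytes, $codeset) = @_;
    my $text = eval {
        Encode::decode($codeset, $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text;    # undef if decoding croaked (bad bytes or unknown codeset)
}
```

With such a helper the caller has a single thing to check (defined or not) instead of having to guess whether it received characters or bytes.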
RT-Send-CC: perl5-porters [...] perl.org
On Mon Sep 16 09:05:17 2013, public@khwilliamson.com wrote: Show quoted text
I was the one who implied that. What I meant was that, if decoding happens unconditionally, at least one can check the Perl version to determine how to handle $!. It is still backward-incompatible. I was then going to suggest lexically scoping the new behaviour, but Zefram has already pointed out why that is not a good idea. A new global variable is the best choice at this point. -- Father Chrysostomos
CC: perl5-porters [...] perl.org
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Wed, 09 Oct 2013 19:46:02 -0600
To: perlbug-followup [...] perl.org
From: Karl Williamson <public [...] khwilliamson.com>
On 09/20/2013 09:11 PM, Father Chrysostomos via RT wrote: Show quoted text
tl;dr

0) A brief overview of how locales work with Perl is presented.
1) $! used to work as if it were always in the scope of both 'use locale' and 'use bytes'.
2) The blamed commit removed the 'use bytes' component, breaking code that relied on that and fixing some code that didn't.
3) Many people think that 'use bytes' should be outlawed. Thus we should take a good hard look before reverting the commit and restoring 'use bytes' behavior.
4) $! now acts (with regard to encoding) as any other scalar does within the scope of 'use locale'. My proposal is to leave it that way when in that scope. Thus, it doesn't become an outlier that has to be treated specially.
5) Outside such scope: on systems that have nl_langinfo(), $! would automatically be decoded to UTF-8; otherwise to English (C locale), which the end user could google translate if necessary.
6) An objection has been raised that this creates problems when references to $! are passed, and in XS code where it gets its caller's scope. But this is no different than any variable that deals with locales.
7) An alternative is to revert this commit (bringing back 'use bytes' behavior), and to create a new variable that always fully decodes. But that doesn't help code that is in 'use locale'. There would be no variable that gives correct behavior for that situation (the behavior of the current commit is that correct behavior). Perhaps another new variable would be created that does what the current commit does, regardless of scope, making 3 variables. Also, $^E has this problem too, and should have the same solution applied to it as we do to $!. That would mean 4 new variables would have to be created, making 6 variables. That seems overly ugly and confusing.

===================================

I'd like to start with a brief refresher on Perl and locales. Every C program always is running in a particular locale. Absent a setlocale() to the contrary, that locale is the "C" locale, which gives the behavior described in K&R. But a setlocale() call to something else will cause many libc functions to behave differently. Under those, theoretically:
1) any particular byte in a string could mean nearly any character (or portion of a character);
2) the language for the text of $! could be anything;
3) etc.
There can be single-byte locales, wide character (U16 or U32 usually) locales, and varying character length locales (which UTF-8 is). Perl has never officially supported anything other than single-byte locales. In practice, almost all published locales have every ASCII-range code point mean the corresponding ASCII character, hence differing only in non-ASCII bytes. Perl avoids assuming this ASCII correspondence pretty much as best it can.

One of the first things that Perl does when it starts up (with a minor exception for embedded Perl, added in the 5.19 series) is to call setlocale(), thus causing the libc functions to change behavior. The locale that is set is determined from the caller's environment, typically using the LANG or other environment variables. Increasingly, on Linux systems anyway, this is some UTF-8 locale.

But Perl isn't supposed to expose the underlying locale outside the scope of 'use locale'. Various patches in the 5.19 series have fixed all known such leaks except for various POSIX:: functions where it doesn't make sense to hide, and $!. The rationale for the latter is that $! is for the user of the program, not the programmer, and so should be output in the user's language, as gleaned from his/her locale.

What happens if a string scalar is in some locale, and a code point that requires UTF-8 is added to it? The answer is that this is generally not a good idea to do, but Perl copes by converting the scalar to UTF-8, with the code points below 256 assumed to be what they mean in the (single-byte) locale, even if they require 2 UTF-8 bytes to represent. This means that operations that cross the 255/256 boundary in a UTF-8 locale are undefined. For example, the uppercase of \xFF is \x{178} normally (in Unicode they are the small and capital y with diaeresis, respectively), but within the scope of 'use locale' uc("\xFF") remains \xFF, because we don't know what character \xFF really represents in that locale. In just the ISO-8859 series of locales, it can be U+00FF, or U+045F, U+0138, U+2019, or unassigned. (Note that if we knew that a locale is UTF-8, we would know what \xFF really is, and so could treat things just like non-locale Perl does.)

That the meaning of characters is context dependent means that when using locale, it generally is not a good idea to pass references to variables. Correct me if I'm wrong, but I believe this means that XS code gets its caller's lexical scope with regard to this.

Until the commit that generated this ticket, $! returned the bytes that comprise the message regardless of whether the message was in UTF-8 or not. Thus it behaved as if it were in the scope of both 'use locale' and 'use bytes'. What the commit effectively did was to remove the 'use bytes' behavior, causing $! to behave as any other string scalar does under 'use locale'. Many people on this list think that we should get rid of 'use bytes'; that its behavior is never desired. (I'm not one of them BTW, but I think it should be used only very rarely.) Thus, on the face of it, it is suspect that $! should behave as if it is in 'use bytes', and I'm having a hard time grokking the argument that we should revert back to that.

To clarify my proposal (since Victor misunderstood it), I propose, within 'use locale' scope, leaving the behavior as the commit changed it to. $! now behaves as other variables in such scope behave; it no longer is an outlier that has to be treated specially. Outside that scope, I propose to fully decode $! into Perl's internal coding (essentially UTF-8). The latter would automatically load the needed modules. If the system did not have nl_langinfo(), I now think that the best thing to do is to output the message in the C locale, yielding it in English, which the user could machine translate. We are not going to return undef, as Victor suggested, as that would be throwing away potentially crucial information.

As I mentioned above, it's not a good idea to pass references to locale-encoded variables. I don't see how $! is different from other locale variables in its orneriness. It just comes with the territory.

The idea of reverting this commit and having another global variable that does the full decoding harms code within 'use locale' scope. Instead of this variable being a typical scalar there, it becomes an outlier, which has to have special treatment. We could add a third variable which behaves as the current commit now does to accommodate such code. This is getting unwieldy. Whatever behavior we decide to do has to also be applied to $^E. Now we would then have 6 variables instead of 2.

I think my proposal is the least bad of those presented so far.
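The uc("\xFF") point in the message above can be tried out with a small sketch. It assumes a Perl that has the 'unicode_strings' feature (5.12 or later) and pins LC_CTYPE to the "C" locale so the result does not depend on the caller's environment. The helper names uc_unicode and uc_locale are made up for this illustration, and under other locales the second result may well differ.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: uppercase under Unicode rules, whatever the locale is.
sub uc_unicode {
    use feature 'unicode_strings';
    return uc($_[0]);
}

# Hypothetical helper: uppercase under the rules of the current LC_CTYPE locale.
sub uc_locale {
    use locale;
    return uc($_[0]);
}

# Pin the locale so the locale-based result is not taken from the environment.
setlocale(LC_CTYPE, "C");

my $y = "\xFF";    # y with diaeresis, but only if the byte is read as Latin-1

printf "Unicode rules: U+%04X\n", ord(uc_unicode($y));
printf "locale rules:  U+%04X\n", ord(uc_locale($y));
```

Under Unicode rules the byte is mapped to U+0178; under 'use locale' in the "C" locale it should come back unchanged, because the locale gives Perl no uppercase mapping for it.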
CC: Craig Berry via RT <perlbug-followup [...] perl.org>, "Perl5 Porters (E-mail)" <perl5-porters [...] perl.org>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Wed, 9 Oct 2013 22:27:45 -0500
To: Karl Williamson <public [...] khwilliamson.com>
From: "Craig A. Berry" <craig.a.berry [...] gmail.com>
On Wed, Oct 9, 2013 at 8:46 PM, Karl Williamson <public@khwilliamson.com> wrote: Show quoted text
> Whatever behavior we decide to do has to also be applied to $^E.
Probably not, actually. perlvar.pod says that $^E is "Error information specific to the current operating system." As I've indicated to you off-list, how you get different languages for system messages on any given operating system is as likely as not going to be completely orthogonal to the notion of a POSIX locale.
RT-Send-CC: perl5-porters [...] perl.org
And how should one fix the code below (both example1 and example2) so that it works the same way in 5.18 and 5.20?

===== example1.pl
use strict;
use warnings;
use Encode;
my %Config = ( default_locale_encoding => 'UTF-8' ); # user supplied
my $locale_encoding = eval {
    require I18N::Langinfo;
    my $enc = I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
    defined (find_encoding($enc)) ? $enc : undef;
};

$locale_encoding ||= $Config{default_locale_encoding};
binmode STDERR, ":encoding($locale_encoding)";

open (my $f, "<", "not_a_file") or do {
    die decode($locale_encoding, "$!", Encode::DIE_ON_ERR|Encode::LEAVE_SRC);
}
=====

$ perl example1.pl
No such file or directory at example1.pl line 15.

$ LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.utf8 perl example1.pl
Нет такого файла или каталога at example1.pl line 15.

===== example2.pl
use strict;
use warnings;
open (my $f, "<", "not_a_file") or do {
    die "$!";
}
=====

$ perl example2.pl
No such file or directory at example2.pl line 4.

$ LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.utf8 perl example2.pl
Нет такого файла или каталога at example2.pl line 4.

On Wed Oct 09 18:46:47 2013, public@khwilliamson.com wrote: Show quoted text
CC: perl5-porters [...] perl.org
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Tue, 15 Oct 2013 15:58:50 -0600
To: perlbug-followup [...] perl.org
From: Karl Williamson <public [...] khwilliamson.com>
On 10/10/2013 06:12 AM, Victor Efimov via RT wrote: Show quoted text
What you want is for $! to work like it's in 'use bytes'. I can change the patch so that it checks for 'use bytes' and if within that scope returns without the utf8 flag set. You would then just need to add a 'use bytes' to get it to work the same way it always has.

There are people who would disapprove of ever using bytes, which means they think the behavior you want is wrong. I'm not one of them. I think that 'use bytes' should be rare, mostly used in testing, but it sometimes is the easiest, clearest way of getting at the bytes that comprise a UTF-8-encoded character. utf8::encode() can be used for that, but destroys its argument and I think its name is much less clear than 'use bytes'.

I have tested doing this, and it works.
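The two ways of getting at the bytes that are compared in the message above can be put side by side in a short sketch. The sample character is an arbitrary choice for illustration, and nothing here touches $! itself.

```perl
use strict;
use warnings;

my $char = "\x{41E}";    # CYRILLIC CAPITAL LETTER O, a single character

# 'use bytes' is lexically scoped: only this block sees the internal bytes.
my $byte_len = do { use bytes; length($char) };

# utf8::encode() works in place, so encode a copy to keep $char intact.
my $copy = $char;
utf8::encode($copy);

printf "characters:                 %d\n", length($char);
printf "bytes under 'use bytes':    %d\n", $byte_len;
printf "bytes after utf8::encode(): %d\n", length($copy);
```

The copy is what "destroys its argument" amounts to in practice: utf8::encode() overwrites the scalar it is given, while 'use bytes' leaves the scalar alone and only changes how operations in that scope look at it.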
RT-Send-CC: perl5-porters [...] perl.org
On Tue Oct 15 14:59:45 2013, public@khwilliamson.com wrote: Show quoted text
The new behaviour looks sane to me. It's probably the way it was supposed to work from the beginning. The main problem is solved (when $! sometimes returned characters and sometimes bytes).

There were comments that enabling the new behaviour in a lexical scope is not good and is dangerous (but you stated that it's probably OK). We enabled it by default, and users can now switch to the *old* behaviour in a *lexical* scope (with use bytes or use locale). I think the arguments that lexical scope is not good can apply here too.

The big problem that I see now is backward compatibility. Any existing code that uses $! is probably broken. Users will have to fix it with use locale/use bytes.

A few examples that I found (where filenames are concatenated with $!):

====
File::Temp

    unless ($!{EEXIST}) {
        ${$options{ErrStr}} = "Could not create temp file $path: $!";
        return ();
    }

File::Find

    unless (defined $topnlink) {
        warnings::warnif "Can't stat $top_item: $!\n";
        next Proc_Top_Item;
    }

LWP::UserAgent

    my @stat = stat($tmpfile) or die "Could not stat tmpfile '$tmpfile': $!";

    or die "Cannot rename '$tmpfile' to '$file': $!\n";
====

Note that if the filename here contains non-ASCII characters and is a binary string, merging it with the character string $! would produce a broken result. Even if the filename is ASCII, it would break the old behaviour when the die exception is printed to STDERR. If the filename is a character string, that code did not work correctly previously.

Another issue is that there is POSIX::strerror, and IMHO it should behave just like $! for consistency (i.e. produce different things depending on lexical scope). POSIX::strerror is pure Perl.
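The "broken result" from merging a binary filename with a character-string message can be reproduced without touching the filesystem. The following is only a sketch: the filename and the message text are invented stand-ins (not real readdir() or $! output), and UTF-8 is assumed as the output encoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A filename as it might come from readdir(): UTF-8 *bytes*, not characters.
my $filename_bytes = encode('UTF-8', "\x{444}\x{430}\x{439}\x{43B}");

# An error message held as a *character* string (illustrative text only).
my $error_chars = "\x{41E}\x{448}\x{438}\x{431}\x{43A}\x{430}";

# What the terminal should receive: both parts as UTF-8 bytes.
my $wanted = $filename_bytes . ": " . encode('UTF-8', $error_chars);

# Naive concatenation: Perl upgrades the bytes as if they were Latin-1,
# so the filename ends up double-encoded on output.
my $naive = encode('UTF-8', $filename_bytes . ": " . $error_chars);

# Decoding the filename first keeps the whole string in characters.
my $fixed = encode('UTF-8',
    decode('UTF-8', $filename_bytes) . ": " . $error_chars);

print "naive matches wanted: ", ($naive eq $wanted ? "yes" : "no"), "\n";
print "fixed matches wanted: ", ($fixed eq $wanted ? "yes" : "no"), "\n";
```

Only the variant that decodes the filename before concatenating produces the intended bytes, which is the kind of change the quoted module code would need if $! became a character string.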
RT-Send-CC: perl5-porters [...] perl.org
On Wed Oct 09 18:46:47 2013, public@khwilliamson.com wrote:
> Until the commit that generated this ticket, $! returned the bytes that > comprise the message regardless of whether the message was in UTF-8 or > not. Thus it behaved as if it were in the scope of both 'use locale' > and 'use bytes'. What the commit effectively did was to remove the 'use > bytes' behavior, causing $! to behave as any other string scalar does > under 'use locale'. Many people on this list think that we should get > rid of 'use bytes'; that its behavior is never desired. (I'm not one of > them BTW, but I think it should be used only very rarely.) Thus, on the > face of it, it is suspect that $! should behave as if it is in 'use > bytes', and I'm having a hard time groking the argument that we should > revert back to that.
The problem with the bytes pragma is that two scalars may compare equal ($a eq $b) outside its scope, but be different ($a ne $b) within its scope. It changes the contents of scalars, but only some scalars.

$! does not do that. In fact, it is more akin to the default input and output streams, which do not do any automatic decoding or encoding until one asks for it.

I don’t have enough room in my brain to fit all the issues that are currently going on, so I can’t really comment on what makes sense under ‘use locale’. But I would ask that you consider things at a more practical level.

Simple programs like ack that do not take encodings into account should work without any change. The one-liner that I posted is still broken in bleadperl. Try running

LC_ALL=hu_HU.utf8 perl -e 'open "oentuheon" or die $!'

on dromedary with and without bleadperl. That type of code should continue to work, regardless of what we come up with.

Maybe what you are really after is a *function* that returns a decoded $!.

--
Father Chrysostomos
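The remark about scalars comparing equal outside 'use bytes' but not inside it can be shown with a minimal sketch. The variable names are illustrative only; the example relies on utf8::upgrade, which changes a string's internal representation without changing its characters.

```perl
use strict;
use warnings;

# Two scalars holding the same character, U+00E9, in different internal forms.
my $native   = "\xE9";          # stored as the single byte 0xE9
my $upgraded = "\xE9";
utf8::upgrade($upgraded);       # stored internally as the two bytes 0xC3 0xA9

# Outside 'use bytes', eq compares characters.
my $equal_outside = ($native eq $upgraded) ? 1 : 0;

# Inside 'use bytes', eq compares the internal byte buffers.
my $equal_inside = do {
    use bytes;
    ($native eq $upgraded) ? 1 : 0;
};

print "outside use bytes: $equal_outside\n";
print "inside use bytes:  $equal_inside\n";
```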
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Tue, 22 Oct 2013 23:40:24 +0200
To: perl5-porters [...] perl.org
From: Lukas Mai <plokinom [...] gmail.com>
On 22.10.2013 22:48, Father Chrysostomos via RT wrote:
> > Simple programs like ack that do not take encodings into account should > work without any change. The one-liner that I posted is still broken in > bleadperl. Try running > > LC_ALL=hu_HU.utf8 perl -e 'open "oentuheon" or die $!' > > on dromedary with and without bleadperl. That type of code should > continue to work, regardless of what we come up with.
With or without PERL_UNICODE=SL? Because that's on by default in my environment.
> Maybe what you are really after is a *function* that returns a decoded $!.
Doesn't interpolate nicely in error messages. -- Lukas Mai <plokinom@gmail.com>
RT-Send-CC: perl5-porters [...] perl.org
On Tue Oct 22 14:41:09 2013, plokinom@gmail.com wrote:
> On 22.10.2013 22:48, Father Chrysostomos via RT wrote:
> > > > Simple programs like ack that do not take encodings into account should > > work without any change. The one-liner that I posted is still broken in > > bleadperl. Try running > > > > LC_ALL=hu_HU.utf8 perl -e 'open "oentuheon" or die $!' > > > > on dromedary with and without bleadperl. That type of code should > > continue to work, regardless of what we come up with.
> > With or without PERL_UNICODE=SL? > > Because that's on by default in my environment.
All I can say is, ouch! I have always found use of PERL_UNICODE to be suspicious. The problem with PERL_UNICODE is that it enforces things on a program that might have its own STDIN/STDERR handling. -- Father Chrysostomos
RT-Send-CC: perl5-porters [...] perl.org
On Tue Oct 22 14:46:12 2013, sprout wrote:
> On Tue Oct 22 14:41:09 2013, plokinom@gmail.com wrote:
> > On 22.10.2013 22:48, Father Chrysostomos via RT wrote:
> > >
> > > Simple programs like ack that do not take encodings into account should
> > > work without any change. The one-liner that I posted is still broken in
> > > bleadperl. Try running
> > >
> > > LC_ALL=hu_HU.utf8 perl -e 'open "oentuheon" or die $!'
> > >
> > > on dromedary with and without bleadperl. That type of code should
> > > continue to work, regardless of what we come up with.
> > > > With or without PERL_UNICODE=SL? > > > > Because that's on by default in my environment.
> > All I can say is, ouch! I have always found use of PERL_UNICODE to be > suspicious.
I think I meant suspect, or whatever.
> The problem with PERL_UNICODE is that it enforces things on > a program that might have its own STDIN/STDERR handling.
In particular, PERL_UNICODE=SL breaks any simple Perl implementation of cat. -- Father Chrysostomos
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Wed, 23 Oct 2013 01:56:28 +0400
To: Lukas Mai <plokinom [...] gmail.com>
From: Victor Efimov <victor [...] vsespb.ru>



2013/10/23 Lukas Mai <plokinom@gmail.com>
On 22.10.2013 22:48, Father Chrysostomos via RT wrote:
>
> Simple programs like ack that do not take encodings into account should
> work without any change.  The one-liner that I posted is still broken in
> bleadperl.  Try running
>
> LC_ALL=hu_HU.utf8 perl -e 'open "oentuheon" or die $!'
>
> on dromedary with and without bleadperl.  That type of code should
> continue to work, regardless of what we come up with.

With or without PERL_UNICODE=SL?

Because that's on by default in my environment.



I think things like 'ack' won't work this way. They read data also from @ARGV, config files, they work with filesystem's filenames.
Actually, use of PERL_UNICODE=SL is pretty limited, imho.

Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Wed, 23 Oct 2013 00:03:17 +0200
To: perl5-porters [...] perl.org
From: Lukas Mai <plokinom [...] gmail.com>
On 22.10.2013 23:52, Father Chrysostomos via RT wrote:
> On Tue Oct 22 14:46:12 2013, sprout wrote: >
>> The problem with PERL_UNICODE is that it enforces things on >> a program that might have its own STDIN/STDERR handling.
> > In particular, PERL_UNICODE=SL breaks any simple Perl implementation of cat.
Isn't such a "simple" implementation already broken on systems like Windows? -- Lukas Mai <plokinom@gmail.com>
Date: Tue, 26 Nov 2013 21:24:58 -0700
To: Lukas Mai <plokinom [...] gmail.com>, perl5-porters [...] perl.org
From: Karl Williamson <public [...] khwilliamson.com>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
I have now pushed a series of patches that make the handling of this uniform for $^E and $! on Win32 and OS/2. That means changing a single place will automatically propagate to all areas, once we decide what that is. I hope to soon have some time to look further.
RT-Send-CC: perl5-porters [...] perl.org
On Tue Nov 26 20:25:58 2013, public@khwilliamson.com wrote:
> I have now pushed a series of patches that make the handling of this > uniform for $^E and $! on Win32 and OS/2. That means changing a single > place will automatically propagate to all areas, once we decide what > that is. I hope to soon have some time to look further.
This is a 5.20 blocker. Did you have time to look further? Though I'll admit the conversation has gone back and forth so much I'm not sure what remaining issues there are. Tony
CC: perl5-porters [...] perl.org
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
To: perlbug-followup [...] perl.org, "OtherRecipients of perl Ticket #119499:;" [...] smtp.indra.com
Date: Wed, 05 Feb 2014 16:51:24 -0700
From: Karl Williamson <public [...] khwilliamson.com>
On 02/05/2014 03:52 PM, Tony Cook via RT wrote:
> On Tue Nov 26 20:25:58 2013, public@khwilliamson.com wrote:
>> I have now pushed a series of patches that make the handling of this >> uniform for $^E and $! on Win32 and OS/2. That means changing a single >> place will automatically propagate to all areas, once we decide what >> that is. I hope to soon have some time to look further.
> > This is a 5.20 blocker. > > Did you have time to look further? > > Though I'll admit the conversation has gone back and forth so much I'm not sure what remaining issues there are. > > Tony >
This is correctly listed as a blocker. I have thought further about this, but am not ready to pursue it; I am trying to get all the user-visible changes in before I finish up my research on this.
To: perlbug-followup [...] perl.org
Date: Sat, 01 Mar 2014 22:43:34 -0700
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
CC: perl5-porters [...] perl.org
From: Karl Williamson <public [...] khwilliamson.com>
This is my attempt to bring some clarity to this issue and stake out my position regarding it. I haven't re-read the thread thoroughly just now, so I may miss some issues, but I have been very aware of the central problem regarding this for months now, and have been thinking about it for the same amount of time, so I believe that what follows is an adequate summary of that.

First the background. This ticket is about a commit that fixed two tickets with the same underlying cause, https://rt.perl.org/Ticket/Display.html?id=112208 "printing $! when open.pm sets utf8 default on filehandles yields garbage", and #117429, merged with the earlier ticket.

The problem is that $! was returning UTF-8 encoded text, but the UTF-8 flag was not set, so it displayed as garbage.

The fix was simply to set the UTF-8 flag if the text is valid UTF-8.

The problem with this is that it breaks code that just outputs the returned bytes when the filehandle doesn't have the utf8 default on. Now what gets displayed there looks like garbage.

If the broken code uses $! within the scope of the hated-by-some 'use bytes', then the UTF-8 flag doesn't get set, and the return is precisely what it used to be.

Thus a potential solution is to force such code to change to do a 'use bytes'. FC is concerned that programs like ack will have to change if we choose this scenario.

Otherwise we are in a quandary. If we revert the commit, code that "does the right thing" by setting their filehandle appropriately gets garbage; whereas if we keep it, code that is unprepared to handle UTF-8 can get garbage. There's probably far more of the latter than the former, but do we wish to punish code that DTRT?

Before proceeding, I want to make an assertion: I think that it is better for someone to get output in a language foreign to them, than it is to get garbage bytes. This is because they can put the former into something like Google translate to get a reasonable translation back into their own language; and I believe that what appears to be garbage bytes is much more problematical to figure out what was intended.

Do you accept or reject this assertion?

If you don't accept it, then you need to persuade me and others who do accept it, why not, and there's not much point in you reading the rest of this message.

If you do accept it, one solution is to always output $! in English, which we would do by always using the C locale when generating its text.

This could be relaxed by using the POSIX locale instead. On most platforms this will be identical to the C locale, but on VMS, at least, it can include Western European languages as well, though I think that VMS only returns $! in English.

A more general solution would be to output it in the native locale unless it is UTF-8 encoded, in which case it would be converted to English. This would then cause the code like (apparently) ack to see no change in behavior, except that some errors would now come out in English; and the code that was affected by #119499 would get English, instead of garbage.

The reason that this issue comes up for programs that don't handle UTF-8 is because $! does not respect 'use locale'. The reason for this is that $! typically gives the user an OS error that is outside Perl's purview, and it's best that these messages be displayed in the user's preferred language. But since what we have now causes garbage to be displayed for one class of user, it seems to me to be a higher priority, given my assertion, to output something sane for everybody, rather than something ideal for some, and garbage for others.

That leads to yet another possibility, one that rjbs has previously vetoed, but which I'm bringing up again here alongside this background that he may not have considered: And that is to have $! respect 'use locale'. Outside of 'use locale' it would be C or POSIX, which would mean English. Within the scope of 'use locale', it would be the user's language. Programs that do a 'use locale' can be assumed to be written to be able to handle them, including the increasingly common UTF-8 locales.

It seems to me wrong to deliver $! locale-encoded to programs that aren't prepared to accept it. We may have gotten away with this for non-UTF-8 locales because most code won't try to parse the stringified $! (and it's probably foolish to try to parse it), but the UTF-8 flag throws a wrench into this uneasy truce.

To state my position explicitly: I don't think it's a good idea to return a UTF-8 encoded string to code that isn't expecting that possibility. And I don't think it's OK to have users see garbage bytes. To avoid doing these, we have to return English whenever that could happen. 'use locale' in code should be enough to signal it's prepared to handle UTF-8; otherwise it's buggy.

So still another possibility is to deliver $! in the current locale if $! isn't UTF-8; otherwise to use English outside of 'use locale' and the UTF-8 inside. That leaves code that sets things up properly to get $! returned in the user's language; and code that doesn't will also get the user's language unless what is returned would be in UTF-8, in which case it will come out in English, instead of garbage. This seems to me to be the best solution.

Another possibility, suggested by FC, is to leave $! as-is, but create a new variable that behaves differently. I think it's far better to get $! to work reasonably than to come up with an alternative variable.
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
CC: Father Chrysostomos via RT <perlbug-followup [...] perl.org>, Perl5 Porters <perl5-porters [...] perl.org>
To: Karl Williamson <public [...] khwilliamson.com>
Date: Sun, 2 Mar 2014 12:10:29 +0400
From: Victor Efimov <victor [...] vsespb.ru>
2014-03-02 9:43 GMT+04:00 Karl Williamson <public@khwilliamson.com>:
> Before proceeding, I want to make an assertion: I think that it is better > for someone to get output in a language foreign to them, than it is to get > garbage bytes. This is because they can put the former into something like > Google translate to get a reasonable translation back into their own > language; and I believe that what appears to be garbage bytes is much more > problematical to figure out what was intended. > > Do you accept or reject this assertion? >
Of course English is better than garbage. BUT this is correct only for "broken" programs: "It's better if a broken program outputs English than garbage".

You want new programs (which follow the Perl documentation and are without bugs) to output sometimes English, sometimes other languages, depending on the locale charset. For end users it will look like this: "on one machine everything is fine; on another machine perl doesn't respect the locale; all GNU tools and Python scripts work fine and print messages in my language, but the Perl script doesn't seem to respect the locale in random circumstances".

To me it looks like English is better than garbage, but the 5.18 behaviour is even better anyway (we can put $! in the list http://perldoc.perl.org/perlunicode.html#When-Unicode-Does-Not-Happen together with @ARGV, %ENV etc.)
To: perlbug-followup [...] perl.org
Date: Mon, 10 Mar 2014 23:55:13 -0600
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
CC: perl5-porters [...] perl.org, Lukas Mai <l.mai [...] web.de>
From: Karl Williamson <public [...] khwilliamson.com>
tl;dr summary of this

I assert it is better to have an error message come out in a foreign language (probably English) than to have apparent garbage bytes emitted.

If we output UTF-8 bytes without the UTF-8 flag being on to code that handles UTF-8, they will appear to be garbage bytes. But if we set the flag, this breaks code that isn't expecting to handle UTF-8. We break one class or the other. The only way around it is to output bytes that are the same in both UTF-8 and non-UTF8, unless we are confident that the code can handle UTF-8. That means outputting ASCII when we don't have that confidence.

My bottom line proposal is to look at the $! text, and if it contains only ASCII, output it as-is. We can be reasonably confident that the program can handle UTF-8 if we are within the scope of 'use locale'. ($! should not be UTF-8 unless the current locale is UTF-8.) Within that scope we also output $!, as-is, setting the UTF-8 flag if it is UTF-8. But if we are not within such scope we can't be confident at all about how the I/O is set up, etc. In that case, for non-ASCII $! text, we switch momentarily to the C locale, and re-get $!, which we then output. This text will be in ASCII and (almost certainly) English, which can be placed in something like Google translate. This is not ideal but it pretty much assures that no one is going to get garbage bytes that Google translate won't likely be able to figure out.

On 03/01/2014 10:43 PM, Karl Williamson wrote:
> This is my attempt to bring some clarity to this issue and stake out my > position regarding it. I haven't re-read the thread thoroughly just > now, so I may miss some issues, but I have been very aware of the > central problem regarding this for months now, and have been thinking > about it for the same amount of time, so I believe that what follows is > an adequate summary of that. > > First the background. This ticket is about a commit that fixed two > tickets with the same underlying cause, > https://rt.perl.org/Ticket/Display.html?id=112208 "printing $! when > open.pm sets utf8 default on filehandles yields garbage", and #117429, > merged with the earlier ticket. > > The problem is that $! was returning UTF-8 encoded text, but the UTF-8 > flag was not set, so it displayed as garbage. > > The fix was simply to set the UTF-8 flag if the text is valid UTF-8. > > The problem with this is that it breaks code that just output the > returned bytes and the filehandle doesn't have the utf8 default on. Now > what gets displayed there looks like garbage. > > If the broken code uses $! within the scope of the hated-by-some 'use > bytes', then the UTF-8 flag doesn't get set, and the return is precisely > what it used to be. > > Thus a potential solution is to force such code to change to do a 'use > bytes'. FC is concerned that programs like ack will have to change if > we choose this scenario. > > Otherwise we are in a quandary. If we revert the commit, code that > "does the right thing" by setting their filehandle appropriately gets > garbage; whereas if we keep it, code that is unprepared to handle UTF-8 > can get garbage. There's probably far more of the latter than the > former, but do we wish to punish code that DTRT? > > Before proceeding, I want to make an assertion: I think that it is > better for someone to get output in a language foreign to them, than it > is to get garbage bytes. 
This is because they can put the former into > something like Google translate to get a reasonable translation back > into their own language; and I believe that what appears to be garbage > bytes is much more problematical to figure out what was intended. > > Do you accept or reject this assertion? > > If you don't accept it, then you need to persuade me and others who do > accept it, why not, and there's not much point in you reading the rest > of this message. > > If you do accept it, one solution is to always output $! in English, > which we would do by always using the C locale when generating its text. > > This could be relaxed by using the POSIX locale instead. On most > platforms this will be identical to the C locale, but on VMS, at least, > it can include Western European languages as well, though I think that > VMS only returns $! in English. > > A more general solution would be to output it in the native locale > unless it is UTF-8 encoded, in which case it would be converted to > English. This would then cause the code like (apparently) ack to see no > change in behavior, except that some errors would now come out in > English; and the code that was affected by #119499 would get English, > instead of garbage. > > The reason that this issue comes up for programs that don't handle UTF-8 > is because $! does not respect 'use locale'. The reason for this is > that $! typically gives the user an OS error that is outside Perl's > purview, and it's best that these messages be displayed in the user's > preferred language. But since what we have now causes garbage to be > displayed for one class of user, it seems to me to be a higher priority, > given my assertion, to output something sane for everybody, rather than > something ideal for some, and garbage for others. 
> > That leads to yet another possibility, one that rjbs has previously > vetoed, but which I'm bringing up again here alongside this background > that he may not have considered: And that is to have $! respect 'use > locale'. Outside of 'use locale' it would be C or POSIX, which would > mean English. Within the scope of 'use locale', it would be the user's > language. Programs that do a 'use locale' can be assumed to be written > to be able to handle them, including the increasingly common UTF-8 locales. > > It seems to me wrong to deliver $! locale-encoded to programs that > aren't prepared to accept it. We may have gotten away with this for > non-UTF-8 locales because most code won't try to parse the stringified > $! (and it's probably foolish to try to parse it), but the UTF-8 flag > throws a wrench into this uneasy truce. > > To state my position explicitly: I don't think it's a good idea to > return a UTF-8 encoded string to code that isn't expecting that > possibility. And I don't think it's OK to have user's see garbage > bytes. To avoid doing these, we have to return English whenever that > could happen. 'use locale' in code should be enough to signal it's > prepared to handle UTF-8; otherwise it's buggy. > > So still another possibility is to deliver $! in the current locale if > $! isn't UTF-8; otherwise to use English outside of 'use locale' and the > UTF-8 inside. That leaves code that sets things up properly to get $! > returned in the user's language; and code that doesn't will also get > the user's language unless what is returned would be in UTF-8, in which > case it will come out in English, instead of garbage. This seems to me > to be the best solution. > > Another possibility, suggested by FC, is to leave $! as-is, but create a > new variable that behaves differently. I think it's far better to get > $! to work reasonably than to come up with an alternative variable. > > > > > > > > >
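The decision rule in the tl;dr summary at the top of this message can be sketched in Perl roughly as follows. This is only an illustration under stated assumptions, not Perl's actual implementation: proposed_errno_text is a hypothetical helper, and the C-locale message and the "inside 'use locale'" state are passed in as plain arguments rather than obtained from the interpreter.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): decide what "$!"
# would stringify to under the proposal.
#   $native    - message bytes as produced in the current locale
#   $c_text    - the same error's message in the C locale (ASCII)
#   $in_locale - true when the caller is within the scope of 'use locale'
sub proposed_errno_text {
    my ($native, $c_text, $in_locale) = @_;

    # ASCII-only text is identical in UTF-8 and non-UTF-8: return it as-is.
    return $native unless $native =~ /[^\x00-\x7F]/;

    if ($in_locale) {
        my $copy = $native;
        # utf8::decode() turns the bytes into characters only when they are
        # well-formed UTF-8; otherwise it leaves the string untouched.
        utf8::decode($copy);
        return $copy;
    }

    # Outside 'use locale' with non-ASCII text: fall back to the C-locale message.
    return $c_text;
}

# Example: UTF-8 bytes for U+041E, outside and inside 'use locale'.
my $outside = proposed_errno_text("\xD0\x9E", "Permission denied", 0);
my $inside  = proposed_errno_text("\xD0\x9E", "Permission denied", 1);
printf "outside: %s, inside is character string: %d\n",
    $outside, utf8::is_utf8($inside) ? 1 : 0;
```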
RT-Send-CC: sog [...] msg.com.mx, fawaka [...] gmail.com, plokinom [...] gmail.com, l.mai [...] web.de, perl5-porters [...] perl.org
I looked at https://github.com/petdance/ack2/issues/367 which shows that ack is broken by the 5.19.2 change.

If you look at that link, you'll see that the Russian comes out fine, but with a warning that didn't use to be there; the French is broken.

What is happening is that ack treats everything as bytes, and so everything just worked. STDERR is opened as a byte-oriented file, and if $! actually did contain UTF-8, it wasn't marked as such, and its component bytes were output as-is, so that if in fact the terminal is expecting UTF-8, they come out looking like UTF-8 to it, and everything held together. (Garbage would ensue if the terminal wasn't expecting the encoding that $! is in; I haven't checked, but my guess is that the grep output is also output as-is, so if the file encodings differ from the terminal expectation, that garbage could be printed; but in practice I doubt that this is a problem.)

What the 5.19 change did effectively is to make the stringification of "$!" obey "use bytes". Most code isn't in bytes' scope, so the UTF-8 flag gets turned on if appropriate.

Perl's do_print() function checks if the stream is listed as UTF-8 or not. The string being output is converted to the stream's encoding if necessary and possible. If not possible, things are just output as-is, possibly with warnings. In ack's case the stream never is (AFAIK) UTF-8. Starting in 5.19.2+, the message can be marked as UTF-8, and so tries to get converted to the non-UTF-8 stream. This is impossible in Russian, so the bytes are output as-is, with a warning. Since the terminal really is UTF-8, they display correctly. But it is possible to convert the French text, as all the characters in the message in the bug report are Latin1. So do_print() does this, but since the terminal's encoding doesn't match what ack thinks it is, the non-ASCII characters come out as garbage.

Note that ack has some of its messages hard-coded in English. For example, it does a -e on the file name, and outputs English-only if it doesn't exist. rjbs has pointed out to me privately that typical uses of $! are of the form

die "my message in English: $!"

I am not an ack user, but it appears to me that ack is like a filter which doesn't care about encodings. It is byte rather than character oriented. This seems to me to be an appropriate use of 'use bytes', and if ack did this, this bug would not arise.

My proposal to only use ASCII characters in error messages unless within 'use locale' would also fix this problem. All messages that print in Russian and some messages in French, would now appear in English, adding to the several that already print in English no matter what.
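The difference between the Russian and French cases described above can be reproduced with a small sketch. It assumes an in-memory filehandle with no :utf8 layer as a stand-in for ack's byte-oriented STDERR; print_to_byte_handle is a hypothetical helper, and single characters stand in for the full messages.

```perl
use strict;
use warnings;

# Print a string to a byte-oriented in-memory handle (no :utf8 layer) and
# return the bytes written plus the number of warnings raised.
sub print_to_byte_handle {
    my ($string) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    my $buffer = '';
    open my $fh, '>', \$buffer or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return ($buffer, scalar @warnings);
}

# U+00E9 as a character string with the UTF-8 flag on; it fits in Latin-1,
# so it can be downgraded to a single byte on output.
my $french = "\xE9";
utf8::upgrade($french);

# U+041E (Cyrillic) cannot be represented as a single byte.
my $russian = "\x{41E}";

my ($fr_bytes, $fr_warn) = print_to_byte_handle($french);
my ($ru_bytes, $ru_warn) = print_to_byte_handle($russian);

printf "French:  %d byte(s), %d warning(s)\n", length $fr_bytes, $fr_warn;
printf "Russian: %d byte(s), %d warning(s)\n", length $ru_bytes, $ru_warn;
```

A UTF-8 terminal would show the Russian bytes correctly (after the "Wide character" warning) but would not recognise the lone Latin-1 byte written for the French character.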
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Thu, 27 Mar 2014 02:06:14 +0400
To: Father Chrysostomos via RT <perlbug-followup [...] perl.org>
CC: sog [...] msg.com.mx, Leon Timmermans <fawaka [...] gmail.com>, Lukas Mai <plokinom [...] gmail.com>, l.mai [...] web.de, Perl5 Porters <perl5-porters [...] perl.org>
From: Victor Efimov <victor [...] vsespb.ru>
2014-03-27 1:41 GMT+04:00 Karl Williamson via RT <perlbug-followup@perl.org>:
> I looked at https://github.com/petdance/ack2/issues/367 > which shows that ack is broken by the 5.19.2 change. > > If you look at that link, you'll see that the russian comes out fine, but with a warning that didn't use to be there; the french is broken. > > What is happening is that ack treats everything as bytes, and so everything just worked. STDERR is opened as a byte-oriented file, and if $! actually did contain UTF-8, it wasn't marked as such, and its component bytes were output as-is, so that if in fact the terminal is expecting UTF-8, they come out looking like UTF-8 to it, and everything held together. (Garbage would ensue if the terminal wasn't expecting the encoding that $! is in; I haven't checked, but my guess is that the grep output is also output as-is, so if the file encodings differ from the terminal expectation, that garbage could be printed; but in practice I doubt that this is a problem.) > > What the 5.19 change did effectively is to make the stringification of "$!" obey "use bytes". Most code isn't in bytes' scope, so the UTF-8 flag gets turned on if appropriate. > > Perl's do_print() function checks if the stream is listed as UTF-8 or not. The string being output is converted to the stream's encoding if necessary and possible. If not possible, things are just output as-is, possibly with warnings. In ack's case the stream never is (AFAIK) UTF-8. Starting in 5.19.2+, the message can be marked as UTF-8, and so tries to get converted to the non-UTF-8 stream. This is impossible in Russian, so the bytes are output as-is, with a warning. Since the terminal really is UTF-8, they display correctly. But it is possible to convert the French text, as all the characters in the message in the bug report are Latin1. So do_print() does this, but since the terminal's encoding doesn't match what ack thinks it is, the non-ascii characters come out as garbage.
Yes, agreed. Anyway, warnings are bad, and broken Latin-1 is bad too.
> > Note that ack has some of its messages hard-coded in English. For example, it does a -e on the file name, and outputs English-only if it doesn't exist. rjbs has pointed out to me privately that typical uses of $! are of the form > > die "my message in English: $!"
Right, usually "my message in English" is indeed in English because authors don't bother with full localization and translation into all languages, but for consistency it's better to see $! in the locale's language. Other programs usually show it in the user's language.
> > I am not an ack user, but it appears to me that ack is like a filter which doesn't care about encodings. It is byte rather than character oriented. This seems to me to be an appropriate use of 'use bytes', and if ack did this, this bug would not arise.
I would disagree; they are trying to migrate to Unicode:

https://github.com/petdance/ack2/issues/120
https://github.com/petdance/ack2/issues/344
https://github.com/petdance/ack2/issues/350
https://github.com/petdance/ack2/issues/355

ack searches _text_ using _perl regexps_ in text files. It even ignores files detected as binary (by default, at least, in my installation).
> > My proposal to only use ASCII characters in error messages unless within 'use locale' would also fix this problem. All messages that print in Russian and some messages in French, would now appear in English, adding to the several that already print in English no matter what. >
I am writing programs with correct use of modern Perl Unicode now, but have never used 'use locale'; it seems it adds additional side effects to code? Can there be a special option for 'use locale' to not change anything at all, except $! behaviour (in lexical scope)?

Also, can code without 'use locale' behave like 5.18 (i.e. not always in English; bytes), and with 'use locale :errno_only' change $! to return a Unicode character string?
> --- > via perlbug: queue: perl5 status: open > https://rt.perl.org/Ticket/Display.html?id=119499
From: Karl Williamson <public [...] khwilliamson.com>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
CC: sog [...] msg.com.mx, Leon Timmermans <fawaka [...] gmail.com>, Lukas Mai <plokinom [...] gmail.com>, l.mai [...] web.de, Perl5 Porters <perl5-porters [...] perl.org>
To: Victor Efimov <victor [...] vsespb.ru>, Father Chrysostomos via RT <perlbug-followup [...] perl.org>
Date: Wed, 26 Mar 2014 17:12:38 -0600
On 03/26/2014 04:06 PM, Victor Efimov wrote:
> 2014-03-27 1:41 GMT+04:00 Karl Williamson via RT <perlbug-followup@perl.org>:
>> I looked at https://github.com/petdance/ack2/issues/367 >> which shows that ack is broken by the 5.19.2 change. >> >> If you look at that link, you'll see that the russian comes out fine, but with a warning that didn't use to be there; the french is broken. >> >> What is happening is that ack treats everything as bytes, and so everything just worked. STDERR is opened as a byte-oriented file, and if $! actually did contain UTF-8, it wasn't marked as such, and its component bytes were output as-is, so that if in fact the terminal is expecting UTF-8, they come out looking like UTF-8 to it, and everything held together. (Garbage would ensue if the terminal wasn't expecting the encoding that $! is in; I haven't checked, but my guess is that the grep output is also output as-is, so if the file encodings differ from the terminal expectation, that garbage could be printed; but in practice I doubt that this is a problem.) >> >> What the 5.19 change did effectively is to make the stringification of "$!" obey "use bytes". Most code isn't in bytes' scope, so the UTF-8 flag gets turned on if appropriate. >> >> Perl's do_print() function checks if the stream is listed as UTF-8 or not. The string being output is converted to the stream's encoding if necessary and possible. If not possible, things are just output as-is, possibly with warnings. In ack's case the stream never is (AFAIK) UTF-8. Starting in 5.19.2+, the message can be marked as UTF-8, and so tries to get converted to the non-UTF-8 stream. This is impossible in Russian, so the bytes are output as-is, with a warning. Since the terminal really is UTF-8, they display correctly. But it is possible to convert the French text, as all the characters in the message in the bug report are Latin1. So do_print() does this, but since the terminal's encoding doesn't match what ack thinks it is, the non-ascii characters come out as garbage.
> > yes agree. anyway warnings are bad. and broken latin1 bad too.
It's arguable that the warnings should have been output all along, since really it is UTF-8 being output to a terminal that perl thinks can't handle it.
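The do_print() behaviour described in the quoted text can be reproduced without any locale setup. The following is a minimal sketch, not taken from ack or from the Perl core: the helper name `print_to_byte_handle` and both sample strings are made up for illustration, and an in-memory filehandle stands in for a STDERR that has no encoding layer.

```perl
use strict;
use warnings;

# Hypothetical helper: print $string to a handle with no :utf8/:encoding
# layer and report the bytes written plus the number of warnings raised.
sub print_to_byte_handle {
    my ($string) = @_;
    my $warnings = 0;
    local $SIG{__WARN__} = sub { $warnings++ };
    my $buffer = '';
    open my $fh, '>:raw', \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "cannot close in-memory handle: $!";
    return ($buffer, $warnings);
}

# Latin-1-only text carrying the UTF-8 flag (stand-in for a French message).
my $latin1_text = "acc\x{e8}s refus\x{e9}";
utf8::upgrade($latin1_text);

# Text with characters above U+00FF (stand-in for a Russian message).
my $wide_text = "\x{41e}\x{442}\x{43a}\x{430}\x{437}\x{430}\x{43d}\x{43e}";

my ($latin1_bytes, $latin1_warnings) = print_to_byte_handle($latin1_text);
my ($wide_bytes,   $wide_warnings)   = print_to_byte_handle($wide_text);

# The Latin-1 text is silently downgraded to one byte per character,
# which a terminal expecting UTF-8 cannot display correctly.
printf "latin1: %d chars -> %d bytes, %d warning(s)\n",
    length($latin1_text), length($latin1_bytes), $latin1_warnings;

# The wide text cannot be downgraded, so its UTF-8 bytes go out as-is,
# together with a "Wide character" warning.
printf "wide:   %d chars -> %d bytes, %d warning(s)\n",
    length($wide_text), length($wide_bytes), $wide_warnings;
```

This mirrors the two cases in the report: the Russian-like string keeps its UTF-8 bytes and only gains a warning, while the French-like string changes its byte sequence without any warning.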
>
>> >> Note that ack has some of its messages hard-coded in English. For example, it does a -e on the file name, and outputs English-only if it doesn't exist. rjbs has pointed out to me privately that typical uses of $! are of the form >> >> die "my message in English: $!"
> > Right, usually "my message in English" indeed is in English because > authors don't bother with full localization and translations to all > languages, but for consistency it's better to see $! in locale's > language. Other programs usually show it in user language. >
>> >> I am not an ack user, but it appears to me that ack is like a filter which doesn't care about encodings. It is byte rather than character oriented. This seems to me to be an appropriate use of 'use bytes', and if ack did this, this bug would not arise.
> > I would disagree, they try to migrate to unicode > > https://github.com/petdance/ack2/issues/120 > https://github.com/petdance/ack2/issues/344 > https://github.com/petdance/ack2/issues/350 > https://github.com/petdance/ack2/issues/355 > > ack is searching _text_ using _perl regexps_ in text files. it even > ignore files detected as binary (by default, at least, in my > installation)
I stand corrected.
>
>> >> My proposal to only use ASCII characters in error messages unless within 'use locale' would also fix this problem. All messages that print in Russian and some messages in French, would now appear in English, adding to the several that already print in English no matter what. >>
> > I am writing programs with correct use of modern Perl unicode now, but > never used 'use locale', seems it adds additional side effect to code? > Can there be special option for 'use locale' to not change anything at > all, except $! behaviour (in lexical scope) ?
locale works a lot better (I anticipate) in 5.20 than before. I think it should finally be possible to 'use locale' as a matter of habit. I was already thinking that 'use locale' in 5.22 should have the ability to select LC_CTYPE and LC_COLLATE individually. It seems logical to make this general, so you could say

    use locale ':messages, numeric';

to get just the effects you want. Some of this could conceivably be added in 5.20 if it helps to resolve this blocker.
> > also, can code without 'use locale' behave like 5.18 (i.e. not always > in English; bytes)
The problem is that the commit fixed real bugs in code that didn't "use locale". Thus the quandary. If we go back to 5.18 behavior, those bugs come back. I believe that my proposal that only ASCII messages get displayed outside of 'use locale' is the only "sure" method that doesn't display garbage to someone. (Note that ASCII doesn't necessarily mean English. Many error messages in Western European languages consist only of ASCII characters. I realize that doesn't help Russian or Chinese, etc.) Also, I hadn't realized this before, but sometimes the message's characters aren't just garbage that someone with the motivation and skill could figure out, but the UNICODE REPLACEMENT CHARACTER can be displayed instead, so information is lost and can't be recovered.
> ? and with 'use locale :errno_only' change $! to > return unicode character string.
I don't see how this differs from your suggestion above for an option to 'use locale' to just affect $! (which is BTW LC_MESSAGES). And that reminds me, MS Windows doesn't have LC_MESSAGES, AFAIK. Can someone explain what languages error messages are displayed in under varied locales?
From: Karl Williamson <public [...] khwilliamson.com>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
CC: sog [...] msg.com.mx, Leon Timmermans <fawaka [...] gmail.com>, Lukas Mai <plokinom [...] gmail.com>, l.mai [...] web.de, Perl5 Porters <perl5-porters [...] perl.org>
Date: Wed, 26 Mar 2014 20:08:35 -0600
To: Victor Efimov <victor [...] vsespb.ru>, Father Chrysostomos via RT <perlbug-followup [...] perl.org>
On 03/26/2014 05:12 PM, Karl Williamson wrote:
> On 03/26/2014 04:06 PM, Victor Efimov wrote:
>> 2014-03-27 1:41 GMT+04:00 Karl Williamson via RT >> <perlbug-followup@perl.org>:
>>> I looked at https://github.com/petdance/ack2/issues/367 >>> which shows that ack is broken by the 5.19.2 change. >>> >>> If you look at that link, you'll see that the russian comes out fine, >>> but with a warning that didn't use to be there; the french is broken. >>> >>> What is happening is that ack treats everything as bytes, and so >>> everything just worked. STDERR is opened as a byte-oriented file, >>> and if $! actually did contain UTF-8, it wasn't marked as such, and >>> its component bytes were output as-is, so that if in fact the >>> terminal is expecting UTF-8, they come out looking like UTF-8 to it, >>> and everything held together. (Garbage would ensue if the terminal >>> wasn't expecting the encoding that $! is in; I haven't checked, but >>> my guess is that the grep output is also output as-is, so if the file >>> encodings differ from the terminal expectation, that garbage could be >>> printed; but in practice I doubt that this is a problem.) >>> >>> What the 5.19 change did effectively is to make the stringification >>> of "$!" obey "use bytes". Most code isn't in bytes' scope, so the >>> UTF-8 flag gets turned on if appropriate. >>> >>> Perl's do_print() function checks if the stream is listed as UTF-8 or >>> not. The string being output is converted to the stream's encoding >>> if necessary and possible. If not possible, things are just output >>> as-is, possibly with warnings. In ack's case the stream never is >>> (AFAIK) UTF-8. Starting in 5.19.2+, the message can be marked as >>> UTF-8, and so tries to get converted to the non-UTF-8 stream. This >>> is impossible in Russian, so the bytes are output as-is, with a >>> warning. Since the terminal really is UTF-8, they display >>> correctly. But it is possible to convert the French text, as all the >>> characters in the message in the bug report are Latin1. 
So >>> do_print() does this, but since the terminal's encoding doesn't match >>> what ack thinks it is, the non-ascii characters come out as garbage.
>> >> yes agree. anyway warnings are bad. and broken latin1 bad too.
> > It's arguable that the warnings should have been output all along. since > really it is UTF-8 being output to a terminal that perl thinks can't > handle it.
>>
>>> >>> Note that ack has some of its messages hard-coded in English. For >>> example, it does a -e on the file name, and outputs English-only if >>> it doesn't exist. rjbs has pointed out to me privately that typical >>> uses of $! are of the form >>> >>> die "my message in English: $!"
>> >> Right, usually "my message in English" indeed is in English because >> authors don't bother with full localization and translations to all >> languages, but for consistency it's better to see $! in locale's >> language. Other programs usually show it in user language. >>
>>> >>> I am not an ack user, but it appears to me that ack is like a filter >>> which doesn't care about encodings. It is byte rather than character >>> oriented. This seems to me to be an appropriate use of 'use bytes', >>> and if ack did this, this bug would not arise.
>> >> I would disagree, they try to migrate to unicode >> >> https://github.com/petdance/ack2/issues/120 >> https://github.com/petdance/ack2/issues/344 >> https://github.com/petdance/ack2/issues/350 >> https://github.com/petdance/ack2/issues/355 >> >> ack is searching _text_ using _perl regexps_ in text files. it even >> ignore files detected as binary (by default, at least, in my >> installation)
> > I stand corrected. >
>>
>>> >>> My proposal to only use ASCII characters in error messages unless >>> within 'use locale' would also fix this problem. All messages that >>> print in Russian and some messages in French, would now appear in >>> English, adding to the several that already print in English no >>> matter what. >>>
>> >> I am writing programs with correct use of modern Perl unicode now, but >> never used 'use locale', seems it adds additional side effect to code? >> Can there be special option for 'use locale' to not change anything at >> all, except $! behaviour (in lexical scope) ?
> > locale works a lot better (I anticipate) in 5.20 than before. I think > it should finally be possible to 'use locale' as a matter of habit. > > I was already thinking that 'use locale' in 5.22 should have the ability > to select LC_CTYPE and LC_COLLATE individually. It seems logical to > make this general, so you could say > > 'use locale ':messages, numeric'; > > to get just the effects you want. Some of this could conceivably be > added in 5.20 if it helps to resolve this blocker. >
>> >> also, can code without 'use locale' behave like 5.18 (i.e. not always >> in English; bytes)
> > > The problem is that the commit fixed real bugs in code that didn't "use > locale" Thus the quandary. If we go back to 5.18 behavior, those bugs > come back. I believe that my proposal that only ASCII messages get > displayed outside of 'use locale' is the only "sure" method that doesn't > display garbage to someone. (Note that ASCII doesn't mean necessarily > English. Many error messages in Western European languages consist only > of ASCII characters. I realize that doesn't help Russian or Chinese, etc.) > > Also, I hadn't realized this before, but sometimes the message's > characters aren't just garbage that someone with the motivation and > skill could figure out, but the UNICODE REPLACEMENT CHARACTER can be > displayed instead, so information is lost and can't be recovered. >
> > ? and with 'use locale :errno_only' change $! to > > return unicode character string.
> > I don't see how this differs from your suggestion above for an option to > 'use locale' to just effect $! (which is BTW LC_MESSAGES). > > And that reminds me, MS Windows doesn't have LC_MESSAGES, AFAIK. Can > someone explain what languages error messages are displayed in under > varied locales? >
Another possibility to get programs like ack to work unchanged is to add a non-printing above-Latin1 character to the stringification of $! when it is UTF-8 and there are only Latin1 characters in it. A possibility is a ZERO WIDTH SPACE. Then do_print() wouldn't try to downgrade. The drawback is that code that analyzes $! could be thrown off. But code generally should be analyzing the numeric value anyway, and not the string representation.
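The point about analyzing the numeric value rather than the string can be illustrated as follows. This is only a sketch: the file name is hypothetical, and a fresh temporary directory is used simply to guarantee that the file does not exist.

```perl
use strict;
use warnings;
use Errno qw(ENOENT);
use File::Temp qw(tempdir);

# A fresh temporary directory is empty, so any name inside it is missing.
my $dir     = tempdir(CLEANUP => 1);
my $missing = "$dir/no-such-file";    # hypothetical file name

my $opened = open my $fh, '<', $missing;

# Capture the numeric side of $! immediately; unlike the message text,
# it does not depend on locale, encoding, or the UTF-8 flag.
my $errno     = $! + 0;
my $is_enoent = !$opened && $errno == ENOENT;

if ($is_enoent) {
    print "file is missing (errno $errno)\n";
}
```

Code written this way keeps working whether the message text comes back as bytes, as a character string, or in another language.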
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
CC: Father Chrysostomos via RT <perlbug-followup [...] perl.org>, sog [...] msg.com.mx, Leon Timmermans <fawaka [...] gmail.com>, Lukas Mai <plokinom [...] gmail.com>, l.mai [...] web.de, Perl5 Porters <perl5-porters [...] perl.org>
To: Karl Williamson <public [...] khwilliamson.com>
Date: Thu, 27 Mar 2014 12:01:22 +0400
From: Victor Efimov <victor [...] vsespb.ru>
2014-03-27 3:12 GMT+04:00 Karl Williamson <public@khwilliamson.com>:
> > locale works a lot better (I anticipate) in 5.20 than before.
So it worked badly before? Then it will be hard to write code compatible with both 5.20 and, say, 5.8.8 (this again relates to 'ack'-like programs: a command-line program has to work with the system perl installed by end users; it's not a web application where the programmer can choose the perl version).
> The problem is that the commit fixed real bugs in code that didn't "use > locale" Thus the quandary. If we go back to 5.18 behavior, those bugs come > back.
Who said it was a bug? I saw this behaviour but never thought of it as a bug, because there is a note in the documentation:
===
While Perl does have extensive ways to input and output in Unicode, and a few other "entry points" like the @ARGV array (which can sometimes be interpreted as UTF-8), there are still many places where Unicode (in some encoding or another) could be given as arguments or received as results, or both, but it is not.
===
A user reported this as a bug because he had not read this. For me it's documented behaviour.
From: Aristotle Pagaltzis <pagaltzis [...] gmx.de>
Date: Thu, 27 Mar 2014 11:57:24 +0100
To: perl5-porters [...] perl.org
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
* Karl Williamson <public@khwilliamson.com> [2014-03-27 03:10]:
> Another possibility to get programs like ack to work unchanged is to > add a non-printing above-Latin1 character to the stringification of $! > when it is UTF-8 and there are only Latin1 characters in it. > A possibility is a ZERO WIDTH SPACE. Then do_print() wouldn't try to > downgrade. The drawback is that code that analyzes $! could be thrown > off. But code generally should be analyzing the numeric value anyway, > and not the string representation
Maybe you can attach magic that prevents a downgrade?
From: Karl Williamson <public [...] khwilliamson.com>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
Date: Thu, 27 Mar 2014 11:42:27 -0600
To: Aristotle Pagaltzis <pagaltzis [...] gmx.de>, perl5-porters [...] perl.org
On 03/27/2014 04:57 AM, Aristotle Pagaltzis wrote:
> * Karl Williamson <public@khwilliamson.com> [2014-03-27 03:10]:
>> Another possibility to get programs like ack to work unchanged is to >> add a non-printing above-Latin1 character to the stringification of $! >> when it is UTF-8 and there are only Latin1 characters in it. >> A possibility is a ZERO WIDTH SPACE. Then do_print() wouldn't try to >> downgrade. The drawback is that code that analyzes $! could be thrown >> off. But code generally should be analyzing the numeric value anyway, >> and not the string representation
> > Maybe you can attach magic that prevents a downgrade? >
That sounds like a better approach, but it is an area that I know essentially nothing about. If I were to do it, it seems not so likely that I could get it right by 5.20; I don't know how hard it would be for someone experienced in the magical arts of Perl™. Likewise, adding the ZERO WIDTH SPACE would need to be done early in the development cycle to see what might break, not late, so shouldn't be considered as a 5.20 solution.
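For readers unfamiliar with why a ZERO WIDTH SPACE would stop the downgrade, here is a small sketch using utf8::downgrade() with its FAIL_OK argument. It only illustrates the principle; it is not how do_print() is implemented, and the sample string is invented.

```perl
use strict;
use warnings;

# Latin-1-only text with the UTF-8 flag on (stand-in for a French message).
my $latin1_text = "acc\x{e8}s refus\x{e9}";
utf8::upgrade($latin1_text);

# The same text with a trailing ZERO WIDTH SPACE (U+200B).
my $with_zwsp = $latin1_text . "\x{200B}";

# utf8::downgrade() works in place, so operate on copies.
# With a true second argument it returns false instead of dying on failure.
my $plain_copy = $latin1_text;
my $zwsp_copy  = $with_zwsp;
my $plain_downgrades = utf8::downgrade($plain_copy, 1) ? 1 : 0;
my $zwsp_downgrades  = utf8::downgrade($zwsp_copy,  1) ? 1 : 0;

printf "plain text can be downgraded: %s\n", $plain_downgrades ? 'yes' : 'no';
printf "text with U+200B can be downgraded: %s\n", $zwsp_downgrades ? 'yes' : 'no';

# The drawback mentioned above: string comparisons against the message
# no longer match once the invisible character is appended.
my $still_equal = $with_zwsp eq $latin1_text ? 1 : 0;
printf "strings still compare equal: %s\n", $still_equal ? 'yes' : 'no';
```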
From: Karl Williamson <public [...] khwilliamson.com>
CC: Father Chrysostomos via RT <perlbug-followup [...] perl.org>, sog [...] msg.com.mx, Leon Timmermans <fawaka [...] gmail.com>, Lukas Mai <plokinom [...] gmail.com>, l.mai [...] web.de, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
To: Victor Efimov <victor [...] vsespb.ru>
Date: Thu, 27 Mar 2014 12:14:44 -0600
On 03/27/2014 02:01 AM, Victor Efimov wrote:
> 2014-03-27 3:12 GMT+04:00 Karl Williamson <public@khwilliamson.com>:
>> >> locale works a lot better (I anticipate) in 5.20 than before.
> > So, it worked bad before? Than it will be hard to write code > compatible with 5.20 and, say, 5.8.8 at same time (that again related > to 'ack'-like programs - it's command line program that should work > in system perl installed by end users. it's not a web application > where programmer can choose perl version)
I don't follow your logic. 5.20 will contain a bunch of bug fixes related to locale handling. Earlier versions will continue to work as before. Perhaps what you meant is that it will be hard to write code that takes advantage of whatever 5.20 has, but still works in older releases. That could be true, but there is nothing that can be done about it, except possibly some things in ppport.h, if we end up adding new macros. It's a given that we can't break things like ack unless there is an easy workaround that is backwards compatible.
>
>> The problem is that the commit fixed real bugs in code that didn't "use >> locale" Thus the quandary. If we go back to 5.18 behavior, those bugs come >> back.
> > Who told that it was bug? I saw this behaviour but never thought it is > a bug, because there is note in documentation: > === > While Perl does have extensive ways to input and output in Unicode, > and a few other "entry points" like the @ARGV array (which can > sometimes be interpreted as UTF-8), there are still many places where > Unicode (in some encoding or another) could be given as arguments or > received as results, or both, but it is not. > === > a user reported this as bug because he did not read this. for me it's > documented behaviour.
I disagree that documenting bad behavior means it should not eventually be fixed. The commit that led to this ticket fixed two other tickets, now merged as https://rt.perl.org/Ticket/Display.html?id=112208. Those tickets seem to me to be perfectly legitimate as being bugs deserving of being fixed. If we revert this commit, those bugs come back.
From: Victor Efimov <victor [...] vsespb.ru>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
CC: Father Chrysostomos via RT <perlbug-followup [...] perl.org>, sog [...] msg.com.mx, Leon Timmermans <fawaka [...] gmail.com>, Lukas Mai <plokinom [...] gmail.com>, l.mai [...] web.de, Perl5 Porters <perl5-porters [...] perl.org>
Date: Thu, 27 Mar 2014 22:38:22 +0400
To: Karl Williamson <public [...] khwilliamson.com>
2014-03-27 22:14 GMT+04:00 Karl Williamson <public@khwilliamson.com>:
> I don't follow your logic. 5.20 will contain a bunch of bug fixes related > to locale handling. Earlier versions will continue to work as before. > Perhaps what you meant is that it will be hard to write code that takes > advantage of whatever 5.20 has, but still works in older releases.
It is hard to write code that works in 5.8 and 5.20 at the same time (_without_ taking advantage of 5.20), because now I need 'use locale', and I assume that in old versions of perl 'use locale' works badly and introduces additional complexity.
> I disagree that documenting bad behavior means it should not eventually be > fixed. The commit that led to this ticket fixed two other tickets, now > merged as https://rt.perl.org/Ticket/Display.html?id=112208. Those tickets > seem to me to be perfectly legitimate as being bugs deserving of being > fixed.
But those are not bugs compared to the real trouble now. And if the behaviour was documented, they are feature requests. The real trouble now is that old code such as

    open(my $f, ">", $filename) or die $!;

will issue warnings. There are a lot of "or die $!" in the Perl documentation, and now all of it is broken. Why is it so complex to just introduce $DECODED_ERRNO or a pragma (working in lexical scope) that turns the UTF-8 flag on? That's much better than breaking so much old code and inserting "zero width whitespaces" into messages.
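As a rough illustration of what an opt-in "decoded errno" could look like in user code, here is a sketch that works on both the old byte-string behaviour and the newer flagged-string behaviour. The helper name `decode_locale_text` is hypothetical (it is not the proposed $DECODED_ERRNO), it relies on utf8::is_utf8(), which the original report notes is usually discouraged, and it assumes that langinfo(CODESET) names the encoding the message is in.

```perl
use strict;
use warnings;
use Encode ();
use Errno qw(EACCES);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: return $text as a character string.
sub decode_locale_text {
    my ($text) = @_;

    # Already a character string (the 5.19.2+ case): nothing to do.
    return $text if utf8::is_utf8($text);

    # Otherwise treat it as bytes in the locale's codeset (assumption).
    my $codeset  = langinfo(CODESET);
    my $encoding = $codeset ? Encode::find_encoding($codeset) : undef;

    # Unknown codeset: leave the bytes alone rather than guess.
    return $text unless $encoding;
    return $encoding->decode($text);
}

# Stringify $! for a known error number, then normalise it.
local $! = EACCES;
my $message = decode_locale_text("$!");
print length($message), " characters in message\n";
```

A helper like this has to be called explicitly, so existing "or die $!" code would be unaffected by it.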
Date: Thu, 27 Mar 2014 23:07:36 +0100
To: Karl Williamson <public [...] khwilliamson.com>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
CC: Perl RT Bug Tracker <perlbug-followup [...] perl.org>, Perl5 Porteros <perl5-porters [...] perl.org>
From: demerphq <demerphq [...] gmail.com>
On 2 March 2014 06:43, Karl Williamson <public@khwilliamson.com> wrote:
> This is my attempt to bring some clarity to this issue and stake out my > position regarding it. I haven't re-read the thread thoroughly just now, so > I may miss some issues, but I have been very aware of the central problem > regarding this for months now, and have been thinking about it for the same > amount of time, so I believe that what follows is an adequate summary of > that. > > First the background. This ticket is about a commit that fixed two tickets > with the same underlying cause, > https://rt.perl.org/Ticket/Display.html?id=112208 "printing $! when open.pm > sets utf8 default on filehandles yields garbage", and #117429, merged with > the earlier ticket. > > The problem is that $! was returning UTF-8 encoded text, but the UTF-8 flag > was not set, so it displayed as garbage. > > The fix was simply to set the UTF-8 flag if the text is valid UTF-8. > > The problem with this is that it breaks code that just output the returned > bytes and the filehandle doesn't have the utf8 default on. Now what gets > displayed there looks like garbage. > > If the broken code uses $! within the scope of the hated-by-some 'use > bytes', then the UTF-8 flag doesn't get set, and the return is precisely > what it used to be. > > Thus a potential solution is to force such code to change to do a 'use > bytes'. FC is concerned that programs like ack will have to change if we > choose this scenario.
Unless I have misunderstood, it is not just ack, but pretty much every Perl program I ever wrote, or saw. This type of pattern is extremely pervasive:

    open my $fh, ">", $file
        or die "Failed to open '$file' for writing: $!";

I am under the impression you are saying they all have to change to:

    open my $fh, ">", $file
        or do { use bytes; die "Failed to open '$file' for writing: $!" };

Which I find almost astounding. Please tell me I have misunderstood.
> Otherwise we are in a quandary. If we revert the commit, code that "does > the right thing" by setting their filehandle appropriately gets garbage; > whereas if we keep it, code that is unprepared to handle UTF-8 can get > garbage. There's probably far more of the latter than the former, but do we > wish to punish code that DTRT? > > Before proceeding, I want to make an assertion: I think that it is better > for someone to get output in a language foreign to them, than it is to get > garbage bytes. This is because they can put the former into something like > Google translate to get a reasonable translation back into their own > language; and I believe that what appears to be garbage bytes is much more > problematical to figure out what was intended. > > Do you accept or reject this assertion?
I accept it. However, I think it is secondary to the question of requiring pretty much every script that uses filehandles to change. Maybe I am wrong that this is what you are suggesting, but if it is, then IMO it cannot be the right answer.
> If you don't accept it, then you need to persuade me and others who do > accept it, why not, and there's not much point in you reading the rest of > this message. > > If you do accept it, one solution is to always output $! in English, which > we would do by always using the C locale when generating its text. > > This could be relaxed by using the POSIX locale instead. On most platforms > this will be identical to the C locale, but on VMS, at least, it can include > Western European languages as well, though I think that VMS only returns $! > in English. > > A more general solution would be to output it in the native locale unless it > is UTF-8 encoded, in which case it would be converted to English. This > would then cause the code like (apparently) ack to see no change in > behavior, except that some errors would now come out in English; and the > code that was affected by #119499 would get English, instead of garbage. > > The reason that this issue comes up for programs that don't handle UTF-8 is > because $! does not respect 'use locale'. The reason for this is that $! > typically gives the user an OS error that is outside Perl's purview, and > it's best that these messages be displayed in the user's preferred language. > But since what we have now causes garbage to be displayed for one class of > user, it seems to me to be a higher priority, given my assertion, to output > something sane for everybody, rather than something ideal for some, and > garbage for others.
For me prioritising "use locale" over every other script is inappropriate. IMO relatively few scripts use it. IMO for years the general recommendation about "use locale" has been to avoid it. I personally would get rid of it completely.
> That leads to yet another possibility, one that rjbs has previously vetoed, > but which I'm bringing up again here alongside this background that he may > not have considered: And that is to have $! respect 'use locale'. Outside > of 'use locale' it would be C or POSIX, which would mean English. Within > the scope of 'use locale', it would be the user's language. Programs that > do a 'use locale' can be assumed to be written to be able to handle them, > including the increasingly common UTF-8 locales. > > It seems to me wrong to deliver $! locale-encoded to programs that aren't > prepared to accept it. We may have gotten away with this for non-UTF-8 > locales because most code won't try to parse the stringified $! (and it's > probably foolish to try to parse it), but the UTF-8 flag throws a wrench > into this uneasy truce. > > To state my position explicitly: I don't think it's a good idea to return a > UTF-8 encoded string to code that isn't expecting that possibility. And I > don't think it's OK to have user's see garbage bytes. To avoid doing these, > we have to return English whenever that could happen. 'use locale' in code > should be enough to signal it's prepared to handle UTF-8; otherwise it's > buggy. > > So still another possibility is to deliver $! in the current locale if $! > isn't UTF-8; otherwise to use English outside of 'use locale' and the UTF-8 > inside. That leaves code that sets things up properly to get $! returned in > the user's language; and code that doesn't will also get the user's > language unless what is returned would be in UTF-8, in which case it will > come out in English, instead of garbage. This seems to me to be the best > solution. > > Another possibility, suggested by FC, is to leave $! as-is, but create a new > variable that behaves differently. I think it's far better to get $! to > work reasonably than to come up with an alternative variable.
I personally think that $! should be left alone, and you should introduce a new pragma to control the decoding behavior of $!. Those people with bugs related to it can use the pragma.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Perl RT Bug Tracker <perlbug-followup [...] perl.org>, Perl5 Porteros <perl5-porters [...] perl.org>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
To: demerphq <demerphq [...] gmail.com>
Date: Thu, 27 Mar 2014 22:02:43 -0600
From: Karl Williamson <public [...] khwilliamson.com>
I was wrong in several things when I wrote this; please skip to later posts on the thread.

On 03/27/2014 04:07 PM, demerphq wrote:
> On 2 March 2014 06:43, Karl Williamson <public@khwilliamson.com> wrote:
>> This is my attempt to bring some clarity to this issue and stake out my >> position regarding it. I haven't re-read the thread thoroughly just now, so >> I may miss some issues, but I have been very aware of the central problem >> regarding this for months now, and have been thinking about it for the same >> amount of time, so I believe that what follows is an adequate summary of >> that. >> >> First the background. This ticket is about a commit that fixed two tickets >> with the same underlying cause, >> https://rt.perl.org/Ticket/Display.html?id=112208 "printing $! when open.pm >> sets utf8 default on filehandles yields garbage", and #117429, merged with >> the earlier ticket. >> >> The problem is that $! was returning UTF-8 encoded text, but the UTF-8 flag >> was not set, so it displayed as garbage. >> >> The fix was simply to set the UTF-8 flag if the text is valid UTF-8. >> >> The problem with this is that it breaks code that just output the returned >> bytes and the filehandle doesn't have the utf8 default on. Now what gets >> displayed there looks like garbage. >> >> If the broken code uses $! within the scope of the hated-by-some 'use >> bytes', then the UTF-8 flag doesn't get set, and the return is precisely >> what it used to be. >> >> Thus a potential solution is to force such code to change to do a 'use >> bytes'. FC is concerned that programs like ack will have to change if we >> choose this scenario.
> > Unless I have misunderstood then it is not just ack. > > But pretty much every Perl program I ever wrote, or saw, that was in Perl. > > This type of pattern is extremely pervasive: > > open my $fh, ">", $file > or die "Failed to open '$file' for writing: $!"; > > I am under the impression you are saying they all have change to: > > open my $fh, ">", $file > or do { use bytes; die "Failed to open '$file' for writing: $!" }; > > Which I find almost astounding. Please tell me I have misunderstood. >
>> Otherwise we are in a quandary. If we revert the commit, code that "does >> the right thing" by setting their filehandle appropriately gets garbage; >> whereas if we keep it, code that is unprepared to handle UTF-8 can get >> garbage. There's probably far more of the latter than the former, but do we >> wish to punish code that DTRT? >> >> Before proceeding, I want to make an assertion: I think that it is better >> for someone to get output in a language foreign to them, than it is to get >> garbage bytes. This is because they can put the former into something like >> Google translate to get a reasonable translation back into their own >> language; and I believe that what appears to be garbage bytes is much more >> problematical to figure out what was intended. >> >> Do you accept or reject this assertion?
> > I accept it. However I think it is secondary to the question of > requiring pretty much every script that uses filehandles to change. > Maybe I am wrong that is what you are suggestion, but if it is then > IMO it cannot be the right answer. >
>> If you don't accept it, then you need to persuade me and others who do >> accept it, why not, and there's not much point in you reading the rest of >> this message. >> >> If you do accept it, one solution is to always output $! in English, which >> we would do by always using the C locale when generating its text. >> >> This could be relaxed by using the POSIX locale instead. On most platforms >> this will be identical to the C locale, but on VMS, at least, it can include >> Western European languages as well, though I think that VMS only returns $! >> in English. >> >> A more general solution would be to output it in the native locale unless it >> is UTF-8 encoded, in which case it would be converted to English. This >> would then cause the code like (apparently) ack to see no change in >> behavior, except that some errors would now come out in English; and the >> code that was affected by #119499 would get English, instead of garbage. >> >> The reason that this issue comes up for programs that don't handle UTF-8 is >> because $! does not respect 'use locale'. The reason for this is that $! >> typically gives the user an OS error that is outside Perl's purview, and >> it's best that these messages be displayed in the user's preferred language. >> But since what we have now causes garbage to be displayed for one class of >> user, it seems to me to be a higher priority, given my assertion, to output >> something sane for everybody, rather than something ideal for some, and >> garbage for others.
> > For me prioritising "use locale" over every other script is > inappropriate. IMO relatively few scripts use it. IMO for years the > general recommendation about "use locale" has been to avoid it. I > personally would get rid of it completely. >
>> That leads to yet another possibility, one that rjbs has previously vetoed, >> but which I'm bringing up again here alongside this background that he may >> not have considered: And that is to have $! respect 'use locale'. Outside >> of 'use locale' it would be C or POSIX, which would mean English. Within >> the scope of 'use locale', it would be the user's language. Programs that >> do a 'use locale' can be assumed to be written to be able to handle them, >> including the increasingly common UTF-8 locales. >> >> It seems to me wrong to deliver $! locale-encoded to programs that aren't >> prepared to accept it. We may have gotten away with this for non-UTF-8 >> locales because most code won't try to parse the stringified $! (and it's >> probably foolish to try to parse it), but the UTF-8 flag throws a wrench >> into this uneasy truce. >> >> To state my position explicitly: I don't think it's a good idea to return a >> UTF-8 encoded string to code that isn't expecting that possibility. And I >> don't think it's OK to have users see garbage bytes. To avoid doing these, >> we have to return English whenever that could happen. 'use locale' in code >> should be enough to signal it's prepared to handle UTF-8; otherwise it's >> buggy. >> >> So still another possibility is to deliver $! in the current locale if $! >> isn't UTF-8; otherwise to use English outside of 'use locale' and the UTF-8 >> inside. That leaves code that sets things up properly to get $! returned in >> the user's language; and code that doesn't will also get the user's >> language unless what is returned would be in UTF-8, in which case it will >> come out in English, instead of garbage. This seems to me to be the best >> solution. >> >> Another possibility, suggested by FC, is to leave $! as-is, but create a new >> variable that behaves differently. I think it's far better to get $! to >> work reasonably than to come up with an alternative variable.
> > I personally think that $! should be left alone, and you should > introduce a new pragma to control the decoding behavior of $!. Those > people with bugs related to it can use the pragma. > > Yves > >
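For readers following the 'use bytes' discussion above, here is a minimal sketch of one way calling code might cope with either form of "$!" without a lexical pragma. The helper name message_as_bytes is hypothetical and not part of Perl. The sketch assumes that a string without the UTF-8 flag is already in the byte encoding the output handle expects, and it relies on utf8::is_utf8, which this ticket itself notes is discouraged.

```perl
use strict;
use warnings;

# Hypothetical helper (not a core function): return an error message as a
# byte string whether or not it arrived with the UTF-8 flag set.
sub message_as_bytes {
    my ($msg) = @_;
    my $copy = "$msg";    # stringify, so a dualvar such as $! yields its text

    # Character string: convert to UTF-8 encoded bytes in place.
    # Byte string: assumed to be locale-encoded already, left untouched.
    utf8::encode($copy) if utf8::is_utf8($copy);
    return $copy;
}
```

With such a helper, the pervasive pattern would become die "Failed to open '$file' for writing: " . message_as_bytes($!), which still means touching every call site; that is the cost being objected to in this thread.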
From: Karl Williamson <public [...] khwilliamson.com>
Subject: Re: [perl #119499] $! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+
CC: Father Chrysostomos via RT <perlbug-followup [...] perl.org>, sog [...] msg.com.mx, Leon Timmermans <fawaka [...] gmail.com>, Lukas Mai <plokinom [...] gmail.com>, l.mai [...] web.de, Perl5 Porters <perl5-porters [...] perl.org>
To: Victor Efimov <victor [...] vsespb.ru>
Date: Thu, 27 Mar 2014 23:09:40 -0600
In this post, I will just give some new insights I had today.

There are real bugs (even if the others previously mentioned aren't regarded as such) when "$!" isn't returned with the UTF-8 flag on, and when $! is stringified to its locale string outside of "use locale" scope.

Consider this one-liner:

    LC_ALL=zh_CN.utf8 ./perl -Ilib -le 'use utf8; $!=1; die "致命錯誤: $!"'

In blead, it prints, as it should,

    Wide character in die at -e line 1
    致命錯誤: 不允许的操作 at -e line 1

In 5.18.2 it prints this garbage instead

    Wide character in die at -e line 1
    致命錯誤: 不允许的操作 at -e line 1

The reason is that the program is encoded in utf8, and $! has returned utf8 (only in the 5.18 case) without setting the utf8 flag, and so Perl takes the bytes that form $! and upgrades those bytes into utf8 (again). In other words, it's encoding twice. (I chose Chinese because its script could not be confused with Western European characters, and I used Google translate, so the constant portion of the text may not make sense; I apologize to the Chinese speakers reading this.)

"use utf8" is not necessary for this. It could be die "$prefix: $!" where $prefix has its utf8 flag on.

These examples show, once again, the perils of having a scalar that's in UTF-8, but pretending it's not, even if it's just in a die(). I claim they conclusively show the brokenness of the 5.18 code.

Another problem with all existing versions is if the $prefix is written in Latin1. Recall that the default character sets of Perl are ASCII, Latin1, and full Unicode, each a superset of the previous. So someone writing in Hungarian might write

    ./perl -Ilib -le '$!=1; die "fatális hibát: $!"'

(apologies to the Hungarian speakers)

If this is however run in a non-Latin1 locale, like say

    LC_ALL=el_GR.iso88597 ./perl -Ilib -le '$!=1; die "fatális hibát: $!"'

The first part of the string is in Latin1, and the 2nd part is in Latin7. These are not compatible (except for their common ASCII range and a few punctuation characters). If the terminal is set to display Latin1, the first part looks ok, the second is garbage, and vice versa (except the common characters will look ok in both).

There is no current way for an application to guard against this; it is a sitting duck. $! always comes out in the underlying locale. (The reason this doesn't show up more often is apparently that people write their prefix messages in English, hence ASCII, and all the locales, like 88597, are supersets of ASCII.)

I claim this shows the perils of having stuff appear in the underlying locale outside the scope of 'use locale'. An unsuspecting application that doesn't even know that locales exist can be hit by the user's environment passing in a locale, or by any module somewhere in the tool chain doing a setlocale().

I believe the solution is to make $! return the C locale messages outside the scope of 'use locale', just like the other categories. By being in such scope, the caller is indicating its willingness to handle and be smart about locale issues. Otherwise it shouldn't have to be exposed to them.

My recent proposal also works. That is to use the $! locale value provided it is all ASCII. That means that a fair number of system messages in various European languages will come out natively, but not those that might adversely affect things like ack. The problem with this is that the application still doesn't have control.

Note that in the messages above, Perl itself outputs its warnings and messages like "at -e line 1". Nobody has any control over that, and I can't believe this fact hasn't discouraged some applications from using Perl in non-English settings.

What part of CPAN is expecting native-language $! ? I don't know, but given the vagaries, including some things always being in English, and being at the mercy of the user's locale environment, I suspect not much.
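The double-encoding effect described in this message can be reproduced without any locale setup. The following is only a sketch under stated assumptions: the byte string built with Encode::encode stands in for what 5.18 returned from "$!" under a UTF-8 locale, and the variable names and sample text are illustrative rather than taken from the ticket.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# A character string with wide characters, like a "use utf8" literal.
my $prefix = "\x{41e}\x{448}\x{438}\x{431}\x{43a}\x{430}: ";

# Stand-in for 5.18's "$!": UTF-8 encoded bytes with the UTF-8 flag off.
my $message_chars = "\x{41e}\x{442}\x{43a}\x{430}\x{437}\x{430}\x{43d}\x{43e}";
my $message_bytes = encode('UTF-8', $message_chars);

# Concatenating characters with bytes makes Perl treat each byte as a
# Latin-1 character, so the message is effectively encoded a second time
# when the result is written out as UTF-8.
my $mixed = $prefix . $message_bytes;

# Decoding the bytes first keeps the whole string in characters.
my $clean = $prefix . decode('UTF-8', $message_bytes);

printf "mixed: %d characters, clean: %d characters\n",
    length($mixed), length($clean);
```

The mixed string is longer because every byte of the message has become a separate character; encoding it for output then writes two bytes for each of them.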
RT-Send-CC: sog [...] msg.com.mx, fawaka [...] gmail.com, plokinom [...] gmail.com, l.mai [...] web.de, perl5-porters [...] perl.org
Fixed for v5.20 by b17e32ea3ba5ef7362d2a3d1a433661afb897786.

The plan for v5.21 is to make $! return locale messages only from within the scope of 'use locale'. In other words, locale has to be opt-in.

--
Karl Williamson
RT-Send-CC: perl5-porters [...] perl.org
I never received this message; I only received a notice that the bug was resolved. On Thu Mar 27 22:09:05 2014, public@khwilliamson.com wrote:
> In this post, I will just give some new insights I had today. > > There are real bugs (even if the others previously mentioned aren't > regarded as such) when "$!" isn't returned with the UTF-8 flag on, and > when $! is stringified to its locale string outside of "use locale" scope. > > Consider this one liner: > > LC_ALL=zh_CN.utf8 ./perl -Ilib -le 'use utf8; $!=1; die "致命錯誤: $!"' > > In blead, it prints, as it should, > Wide character in die at -e line 1 > 致命錯誤: 不允许的操作 at -e line 1 > > In 5.18.2 it prints this garbage instead > Wide character in die at -e line 1 > 致命錯誤: 不允许的操作 at -e line 1 >
It's a general limitation of Perl: one should not merge character strings with binary strings. Not a bug, but expected behaviour.
> Another problem with all existing versions is if the $prefix is written > in Latin1. Recall that the default character sets of Perl are ASCII, > Latin1, and full Unicode, each a superset of the previous. So someone > might in Hungarian might write > > ./perl -Ilib -le '$!=1; die "fatális hibát: $!"' > > (apologies to the Hungarian speakers) > > If this is however run in a non-Latin1 locale, like say > > LC_ALL=el_GR.iso88597 ./perl -Ilib -le '$!=1; die "fatális hibát: $!"' > > The first part of the string is in Latin1, and the 2nd part is in > Latin7. These are not compatible (except for their common ASCII range > and a few punctuation characters). If the terminal is set to display > Latin1, the first part looks ok, the second is garbage, and vice versa > (except the common characters will look ok in both)
The locale is iso88597, so the terminal should be set to iso88597 (otherwise everything is garbage). And if it is, it's no surprise that Latin1 is garbage.
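To make the Latin1 versus Latin7 point concrete, here is a small sketch showing that the same byte names a different character in each encoding. It uses only Encode::decode; the choice of the byte 0xE1 (the "á" of the Hungarian prefix in Latin1) is for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One byte from the Hungarian prefix as stored in a Latin1 source file.
my $byte = "\xE1";

# Interpreted as ISO-8859-1 it is U+00E1 (a with acute) ...
my $as_latin1 = decode('iso-8859-1', $byte);

# ... but a terminal set to ISO-8859-7 shows U+03B1 (Greek small alpha).
my $as_greek = decode('iso-8859-7', $byte);

printf "Latin1: U+%04X, Latin7: U+%04X\n", ord($as_latin1), ord($as_greek);
```

Whichever encoding the terminal uses, one half of a message mixing the two is shown with the wrong characters, which is the situation both sides of this exchange describe.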
> > What part of CPAN is expecting native-language $! ? I don't know, but > given the vagaries, including some things always being in English, and > being at the mercy of the user's locale environment, I suspect not much. >
So you are worrying more about broken tests on CPAN than about real bugs in users' code (which are not caught by tests). Users will be surprised that perl stopped giving $! in the locale's language, but they cannot catch this in tests because they would never suspect that such breakage could be introduced (unit tests are white-box testing: you can only test for bugs you expect).

