$! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+ #13208
Comments
From victor@vsespb.ru
$! is returned as a character string under 5.19.2+ and UTF-8 locales. But as I believe this is useless and just makes it harder to decode the $! value Also I am not sure if it will be possible to decode it when language with LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.utf8 perl -MPOSIX -MDevel::Peek SV = PV(0x144dd80) at 0x14702a0 LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.CP1251 LC_MESSAGES=ru_RU.CP1251 SV = PV(0x1db8d80) at 0x1ddf7e0 |
From victor@vsespb.ru
Seems this is the result of However I think the fix is wrong. 1) it breaks old code, which: a) tries to decode $! using Encode::decode and b) which prints error messages to screen as-is (without "binmode STDOUT 2) Sometimes it returns a binary string (under non-utf8 locales, or when It's hard to distinguish one from the other. A possible solution is Another solution is to use Encode::decode_utf8 when the locale is UTF-8 ( but The problem is that this method's documentation is wrong - several people https://rt.cpan.org/Public/Bug/Display.html?id=87267 3) It's not documented in perllocale, perlunicode, perlvar. 4) It's not clear how it works in case of Latin-1 characters in UTF-8 On Wed Aug 28 01:52:13 2013, vsespb wrote:
|
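One way to read the Encode::decode suggestion above (a hedged sketch, not a fix agreed in the thread; `errno_text` is a hypothetical helper name): normalize $! to a character string whether or not perl has already set the UTF-8 flag on it.

```perl
use strict;
use warnings;

# Hypothetical helper: return the current $! as a character string,
# whether perl 5.19.2+ already decoded it (UTF-8 flag on) or it is
# still raw bytes, as on earlier perls / non-UTF-8 locales.
sub errno_text {
    my $msg = "$!";
    return $msg if utf8::is_utf8($msg);   # already a character string
    my $copy = $msg;
    return $copy if utf8::decode($copy);  # valid UTF-8 bytes: decode them
    return $msg;                          # some other encoding: leave as-is
}
```

As discussed later in the thread, a UTF-8 validity check can misclassify byte strings that merely happen to be valid UTF-8, so this is a heuristic, not a guarantee.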
From @khwilliamsonOn 08/28/2013 02:52 AM, Victor Efimov (via RT) wrote:
I am trying to understand your issues with this change. I believe it is
I don't understand your use of the word 'binary' here. In both cases, In string contexts, it returns the appropriate encoding. In UTF-8
I don't have a clue as to why you think this is useless. This change Code that is trying to decode $! should be using the (constant) numeric
Again, use the numeric value when trying to parse the error.
I ran this, substituting 'say $!' for the Dump, and got this output: which is the correct Cyrillic text. Prior to the patch, this would have
I do not have a Windows machine with CP1251, but I hand looked at this |
The RT System itself - Status changed from 'new' to 'open' |
From victor@vsespb.ru
I am not trying to parse $!. I am trying to print the original error message
It is just wrong to sometimes return bytes, sometimes characters. The following example worked fine before this change:

use strict;
my $filename = "not a file " . chr(0x444);
open my $f, "<", $filename or do {

but with this change: - under non-Unicode locales it works fine. A possible fix for this example is: replace Another place where it breaks old code is: perl -e 'open my now prints the warning "Wide character in die" when the locale is UTF-8 and
No, prior to this patch it prints correct (same) text but without "Wide On Wed Aug 28 10:19:47 2013, public@khwilliamson.com wrote:
|
From @LeontOn Wed, Aug 28, 2013 at 10:52 AM, Victor Efimov
Automatic decoding is definitely the more useful behavior. I agree Also I am not sure if it will be possible to decode it when language with
AFAIK that should work perfectly fine. Leon |
From victor@vsespb.ru
yes. when b) it's not breaking existing code.
yes, especially when sometimes it's bytes, sometimes characters and you
I think in Perl you can get the encoding with Perhaps that can be fixed in Perl code, in Errno. (We already load And I am totally not sure about perl C internals.
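The locale's encoding can be queried from pure Perl like this (a sketch; I18N::Langinfo is a core module, but it is not available on every platform, e.g. Windows):

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

setlocale(LC_CTYPE, '');          # adopt the locale from the environment
my $codeset = langinfo(CODESET);  # e.g. "UTF-8", "KOI8-R", "ANSI_X3.4-1968"
print "codeset: $codeset\n";
```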
I cannot do C coding. Also I think that old code, relying on old behaviour was not relying on it was partly documented: http://perldoc.perl.org/perllocale.html
(also perllocale now have updates, related to $! in blead) http://perldoc.perl.org/perlunicode.html
So ideal fix would be imho: On Wed Aug 28 12:44:17 2013, LeonT wrote:
|
From victor@vsespb.ru
There is a distribution which decodes POSIX::strerror with I18N::Langinfo: http://search.cpan.org/~kryde/I18N-Langinfo-Wide-7/ also, another possible problem is that all examples in the perl documentation, http://perldoc.perl.org/perlopentut.html

open(INFO, "datafile") || die("can't open datafile: $!");

========
|
From sog@msg.mxTime to set PERL_UNICODE=SL ? Salvador Ortiz. On 08/28/2013 04:36 PM, Victor Efimov via RT wrote:
|
From @cpansproutOn Wed Aug 28 10:19:47 2013, public@khwilliamson.com wrote:
You are describing from the point of view of internals. From the user’s This means even #112208 is not fixed, because the test case was ‘use The ultimate problem is that perl has no way of guaranteeing that $! can So now, $! may or may not be encoded, and you have no way of telling -- Father Chrysostomos |
From victor@vsespb.ruOn Wed Aug 28 23:40:08 2013, sprout wrote:
Small corrections: a) Actually there is a way: check the is_utf8($!) flag (which is not good b) The current fix does not do environment checks, it just tries to do UTF-8
From @khwilliamsonOn 08/29/2013 02:15 AM, Victor Efimov via RT wrote:
I don't follow these arguments. What that commit did is only to look at What is different about $! is that we have made the decision to respect The change fixed two bug reports for the common case where the locales If code wants $! to be expressed in a certain language, it should set
I don't see that danger marked currently in the pod for utf8.pm. Where
(*) To be precise 1) if the string returned by the OS is entirely ASCII, it does not set 2) As Victor notes, the commit does a UTF-8 validity check, so it is |
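In Perl terms, the commit's test (as described in points 1 and 2 above) amounts to something like the following sketch; `would_set_flag` is an illustrative name, not the actual internal function:

```perl
use strict;
use warnings;

# The UTF-8 flag is set only if the message contains at least one
# non-ASCII byte AND the whole byte string is well-formed UTF-8.
sub would_set_flag {
    my ($bytes) = @_;
    return 0 unless $bytes =~ /[^\x00-\x7F]/;  # pure ASCII: flag left off
    my $copy = $bytes;
    return utf8::decode($copy) ? 1 : 0;        # well-formed UTF-8?
}
```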
From victor@vsespb.ruOn Thu Aug 29 13:05:00 2013, public@khwilliamson.com wrote:
http://perldoc.perl.org/perlunifaq.html#What-is-%22the-UTF8-flag%22?
Please, unless you're hacking the internals, or debugging weirdness,
|
From victor@vsespb.ruOn Thu Aug 29 14:06:57 2013, vsespb wrote:
Generator of byte sequences that are valid in UTF-8 and in another

#!/usr/bin/env perl
use strict;
binmode STDOUT, ":encoding(UTF-8)";
my @A = grep { /\w/ } map { chr($_) } (128..1024);
for my $z1 (@A) {
for my $z2 ('', @A) {
for my $z3 ('', @A) {

example output: perl -e 'use Encode; binmode STDOUT, ":encoding(UTF-8)"; my $z = example output of output example: $perl -e 'use Encode; binmode STDOUT, ":encoding(UTF-8)"; my $z = |
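A minimal instance of what the generator above searches for: one byte sequence that is well-formed, meaningful text in two different encodings at once (a hedged illustration):

```perl
use strict;
use warnings;
use Encode qw(decode);

# "\xd0\xbe" is the UTF-8 encoding of the single Cyrillic letter "о",
# but the very same two bytes are also valid CP1251 text ("Рѕ").
# A UTF-8 validity check therefore cannot prove which encoding the
# operating system actually used for the message.
my $bytes     = "\xd0\xbe";
my $as_utf8   = decode('UTF-8',  $bytes);   # 1 character: U+043E
my $as_cp1251 = decode('cp1251', $bytes);   # 2 characters: U+0420 U+0455
```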
From victor@vsespb.ruOn Thu Aug 29 13:05:00 2013, public@khwilliamson.com wrote:
Rare under linux. AFAIK FreeBSD 9 (latest stable) users still have There are real users with non-UTF8 locales, I saw one. We've spent A real application which is broken by this change is 'ack' (ack 1 and Russian error messages are now printed with a warning (under a UTF-8 locale!).
From @cpansproutOn Thu Aug 29 13:05:00 2013, public@khwilliamson.com wrote:
You are still describing this from the point of view of the internals. From the users point of view, the utf8 flag does not mean it is encoded
I.e., still encoded.
The former is the problem, not the latter. If a program can find out
I don’t follow. The bytes inside the scalar are not visible to the Perl Your commit changed the content of the scalar as returned by ord and
The problem here is that the locale is only sometimes being respected.
But the less frequent cases now require one to introspect internal Also, is that really more frequent? What about scripts that pass $!
Are you suggesting that perl itself start defaulting to the C locale for $!?
(Since Perl 5.8.1) Test whether I<$string> is marked internally as I think he is referring to ‘internally’ here, which indicates that you
http://perl5.git.perl.org/perl.git/commitdiff/1500bd919ffeae0f3252f8d1bb28b03b043d328e
That is all very nice, but how would you rewrite this code to work in

if (!open fh, $filename) {

-- Father Chrysostomos |
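One version-portable answer in the spirit of the "use the numeric value" advice (a sketch, not an endorsed fix from the thread): branch on the errno constant, and report the number rather than the possibly-encoded text.

```perl
use strict;
use warnings;
use Errno qw(ENOENT);

my $filename = "no_such_file";
if (!open my $fh, '<', $filename) {
    # $! compared numerically is locale- and encoding-independent
    if ($! == ENOENT) {
        warn "missing file: $filename (errno " . ($! + 0) . ")\n";
    } else {
        die "can't open $filename: errno " . ($! + 0) . "\n";
    }
}
```

The cost, of course, is exactly what Victor objects to: the user no longer sees a human-readable message in any language.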
From @khwilliamsonOn 08/31/2013 07:27 AM, Father Chrysostomos via RT wrote:
I persist in this because I believe your point is a red herring. I Rather than address most of the rest of your email, some of which I
I feel compelled to point out that this code is buggy. I18N::Langinfo But on platforms where it works reliably, and the typical case where I think all of us would agree that deference should be paid to And $! remains an outlier in the sense that it is AFAIK, and I've looked I'm pretty confident that the problem can't be solved so that no code If this commit is reverted, we do need to decide how we will address the |
From victor@vsespb.ru2013/9/1 Karl Williamson <public@khwilliamson.com>
But that is not the only place, where non-ASCII character can appear. The following is documented in perlunicode: "While Perl does have extensive ways to input and output in Unicode, and a I believe that note can mean that encoded $! is not a bug, but a feature. Thus it's impossible for people to use those variables now, as it may |
From @LeontOn Sun, Sep 1, 2013 at 4:36 PM, Victor Efimov <victor@vsespb.ru> wrote:
$! is inherently a piece of text, not piece of binary data. As such, it Leon |
From @LeontOn Sun, Sep 1, 2013 at 6:36 AM, Karl Williamson <public@khwilliamson.com>wrote:
Yeah, in POSIX strftime and the is* functions are also affected.
That does sounds like consistency to me. I'm pretty confident that the problem can't be solved so that no code has
That is my feeling too. The new situation feels rather unfinished to me, Currently, using $! in production code that can be operated by users who
Indeed. Leon |
From @khwilliamsonOn 08/31/2013 10:36 PM, Karl Williamson wrote:
I now feel compelled to point out that I should have been more clear |
From victor@vsespb.ru2013/9/1 Leon Timmermans <fawaka@gmail.com>
btw, interesting that |
From @khwilliamsonOn 09/01/2013 10:47 AM, Victor Efimov wrote:
I've been wondering myself what should happen with $^E, and I believe Some other thoughts I've had about this issue. The commit did not break the ISO 646 7-bit codings, as the behavior is Those encodings must not be very important nor have been for quite some We could have a feature automatically turned on in v5.20. I'll call it Without it being on, $! works as it did in <=v5.18. Within its scope Perl attempts to decode $! as best it can, autoloading This may be a crazy idea; but I thought I'd put it out there to |
From zefram@fysh.orgKarl Williamson wrote:
Scoping doesn't work well for this sort of thing. The decoding happens in Amusingly, -zefram |
From victor@vsespb.ruone problem with lexical scope is also POSIX::strerror. strerror => 'errno => local $! = thus it has to be fixed too if we implement lexical featurization. 2013/9/1 Zefram <zefram@fysh.org>
|
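The coupling Victor points out can be made concrete: strerror-style text can be produced from $! itself, so whatever encoding rules $! follows, such a helper follows automatically (a sketch; this is not necessarily how POSIX::strerror is actually implemented in any given perl):

```perl
use strict;
use warnings;

# Set the numeric errno, then let $!'s stringification produce the
# text. Whatever locale/encoding behaviour applies to $! applies here.
sub my_strerror {
    my ($errno) = @_;
    local $! = $errno;   # $! is restored automatically at scope exit
    return "$!";
}
```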
From @cpansproutOn Sun Sep 01 09:24:19 2013, public@khwilliamson.com wrote:
More importantly, as Victor pointed out, it breaks programs that are not On dromedary I get this with the system perl (5.12.3): $ LC_ALL=hu_HU.utf8 perl -e 'open "oentuheon" or die $!' When I build my own (blead)perl, I get this: $ LC_ALL=hu_HU.utf8 ./perl -e 'open "oentuheon" or die $!' -- Father Chrysostomos |
From @cpansproutOn Sun Sep 01 11:09:41 2013, sprout wrote:
RT screwed it up. That appeared perfectly fine.
That is as it appeared, with question marks. -- Father Chrysostomos |
From @cpansproutOn Sun Sep 01 10:46:13 2013, zefram@fysh.org wrote:
A new global variable is another option. -- Father Chrysostomos |
From victor@vsespb.ru2013/9/1 Father Chrysostomos via RT <perlbug-followup@perl.org>
|
From @khwilliamsonOn 09/02/2013 05:10 PM, Victor Efimov wrote:
I have come to believe that this is probably the best way forward. That In any event, there should be uniform treatment of Does anyone know if the strings for the platforms that have separate $^E These include vms, win32, dos, and os/2. |
From victor@vsespb.ruOn Tue Oct 15 14:59:45 2013, public@khwilliamson.com wrote:
New behaviour looks sane to me. It's probably that way it's supposed to There were comments that enabling new behaviour in lexical scope is not The big problem that I see now is backward compatibility. Any existing Users will have to fix it with use locale/use bytes. A few examples that I found (where filenames are concatenated with $!): ==== File::Find LWP::UserAgent ==== Note that if the filename here contains non-ASCII characters and is binary Even if the filename is ASCII, it would break old behaviour when die If the filename is a character string, that code did not work correctly Another issue is that there is POSIX::strerror, and IMHO it should behave
From @cpansproutOn Wed Oct 09 18:46:47 2013, public@khwilliamson.com wrote:
The problem with the bytes pragma is that two scalars may compare equal $! does not do that. In fact, it is more akin to the default input and I don’t have enough room in my brain to fit all the issues that are Simple programs like ack that do not take encodings into account should LC_ALL=hu_HU.utf8 perl -e 'open "oentuheon" or die $!' on dromedary with and without bleadperl. That type of code should Maybe what you are really after is a *function* that returns a decoded $!. -- Father Chrysostomos |
From @maukeOn 22.10.2013 22:48, Father Chrysostomos via RT wrote:
With or without PERL_UNICODE=SL? Because that's on by default in my environment.
Doesn't interpolate nicely in error messages. -- |
From @cpansproutOn Tue Oct 22 14:41:09 2013, plokinom@gmail.com wrote:
All I can say is, ouch! I have always found use of PERL_UNICODE to be -- Father Chrysostomos |
From @cpansproutOn Tue Oct 22 14:46:12 2013, sprout wrote:
I think I meant suspect, or whatever.
In particular, PERL_UNICODE=SL breaks any simple Perl implementation of cat. -- Father Chrysostomos |
From victor@vsespb.ru2013/10/23 Lukas Mai <plokinom@gmail.com>
I think things like 'ack' won't work this way. They read data also from |
From @maukeOn 22.10.2013 23:52, Father Chrysostomos via RT wrote:
Isn't such a "simple" implementation already broken on systems like Windows? -- |
From @khwilliamsonI have now pushed a series of patches that make the handling of this |
From @tonycozOn Tue Nov 26 20:25:58 2013, public@khwilliamson.com wrote:
This is a 5.20 blocker. Did you have time to look further? Though I'll admit the conversation has gone back and forth so much I'm not sure what remaining issues there are. Tony |
From @khwilliamsonOn 02/05/2014 03:52 PM, Tony Cook via RT wrote:
This is correctly listed as a blocker. I have thought further about |
From @khwilliamsonThis is my attempt to bring some clarity to this issue and stake out my First the background. This ticket is about a commit that fixed two The problem is that $! was returning UTF-8 encoded text, but the UTF-8 The fix was simply to set the UTF-8 flag if the text is valid UTF-8. The problem with this is that it breaks code that just output the If the broken code uses $! within the scope of the hated-by-some 'use Thus a potential solution is to force such code to change to do a 'use Otherwise we are in a quandary. If we revert the commit, code that Before proceeding, I want to make an assertion: I think that it is Do you accept or reject this assertion? If you don't accept it, then you need to persuade me and others who do If you do accept it, one solution is to always output $! in English, This could be relaxed by using the POSIX locale instead. On most A more general solution would be to output it in the native locale The reason that this issue comes up for programs that don't handle UTF-8 That leads to yet another possibility, one that rjbs has previously It seems to me wrong to deliver $! locale-encoded to programs that To state my position explicitly: I don't think it's a good idea to So still another possibility is to deliver $! in the current locale if Another possibility, suggested by FC, is to leave $! as-is, but create a |
From victor@vsespb.ru2014-03-02 9:43 GMT+04:00 Karl Williamson <public@khwilliamson.com>:
Of course English is better than garbage. BUT this is correct only for For end users it will look like this: "on one machine everything is to me it looks like English is better than garbage, but 5.18 behaviour
From @khwilliamsontl;dr summary of this I assert it is better to have an error message come out in a foreign If we output UTF-8 bytes without the UTF-8 flag being on to code that My bottom line proposal is to look at the $! text, and if it contains We can be reasonably confident that the program can handle UTF-8 if we But if we are not within such scope we can't be confident at all about This is not ideal but it pretty much assures that no one is going to get On 03/01/2014 10:43 PM, Karl Williamson wrote:
|
From @khwilliamsonI looked at beyondgrep/ack2#367 If you look at that link, you'll see that the russian comes out fine, but with a warning that didn't use to be there; the french is broken. What is happening is that ack treats everything as bytes, and so everything just worked. STDERR is opened as a byte-oriented file, and if What the 5.19 change did effectively is to make the stringification of "$!" obey "use bytes". Most code isn't in bytes' scope, so the UTF-8 flag gets turned on if appropriate. Perl's do_print() function checks if the stream is listed as UTF-8 or not. The string being output is converted to the stream's encoding if necessary and possible. If not possible, things are just output as-is, possibly with warnings. In ack's case the stream never is (AFAIK) UTF-8. Starting in 5.19.2+, the message can be marked as UTF-8, and so tries to get converted to the non-UTF-8 stream. This is impossible in Russian, so the bytes are output as-is, with a warning. Since the terminal really is UTF-8, they display correctly. But it is possible to convert the French text, as all the characters in the message in the bug report are Latin1. So do_print() does this, but since the terminal's encoding doesn't match what ack thinks it is, the non-ascii characters come out as garbage. Note that ack has some of its messages hard-coded in English. For example, it does a -e on the file name, and outputs English-only if it doesn't exist. rjbs has pointed out to me privately that typical uses of $! are of the form die "my message in English: $!" I am not an ack user, but it appears to me that ack is like a filter which doesn't care about encodings. It is byte rather than character oriented. This seems to me to be an appropriate use of 'use bytes', and if ack did this, this bug would not arise. My proposal to only use ASCII characters in error messages unless within 'use locale' would also fix this problem. 
All messages that print in Russian, and some messages in French, would now appear in English, adding to the several that already print in English no matter what. |
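The do_print() behaviour described above can be observed from pure Perl with in-memory filehandles (a sketch; the exact warning text can vary between versions):

```perl
use strict;
use warnings;

my $char = "\x{444}";    # "ф" as a character string (code point > 255)

# A stream with an :encoding layer converts characters on output:
open my $utf8_fh, '>:encoding(UTF-8)', \my $out_utf8 or die $!;
print $utf8_fh $char;
close $utf8_fh;
# $out_utf8 now holds the two bytes 0xD1 0x84

# A plain byte stream cannot represent the character; perl emits a
# "Wide character in print" warning and writes its internal UTF-8 bytes:
my $warned = 0;
local $SIG{__WARN__} = sub { $warned++ if $_[0] =~ /Wide character/ };
open my $byte_fh, '>', \my $out_bytes or die $!;
print $byte_fh $char;
close $byte_fh;
```

This is the mechanism behind both ack symptoms: the Russian text hits the warning path and comes out as raw UTF-8 bytes, while the Latin-1-representable French text gets silently converted to an encoding the terminal does not use.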
From victor@vsespb.ru2014-03-27 1:41 GMT+04:00 Karl Williamson via RT <perlbug-followup@perl.org>:
yes agree. anyway warnings are bad. and broken latin1 bad too.
Right, usually "my message in English" indeed is in English because
I would disagree, they try to migrate to unicode https://github.com/petdance/ack2/issues/120 ack is searching _text_ using _perl regexps_ in text files. it even
I am writing programs with correct use of modern Perl unicode now, but also, can code without 'use locale' behave like 5.18 (i.e. not always
|
From @khwilliamsonOn 03/26/2014 04:06 PM, Victor Efimov wrote:
It's arguable that the warnings should have been output all along.
I stand corrected.
locale works a lot better (I anticipate) in 5.20 than before. I think I was already thinking that 'use locale' in 5.22 should have the ability 'use locale ':messages, numeric'; to get just the effects you want. Some of this could conceivably be
The problem is that the commit fixed real bugs in code that didn't "use Also, I hadn't realized this before, but sometimes the message's
I don't see how this differs from your suggestion above for an option to And that reminds me, MS Windows doesn't have LC_MESSAGES, AFAIK. Can |
From @khwilliamsonOn 03/26/2014 05:12 PM, Karl Williamson wrote:
Another possibility to get programs like ack to work unchanged is to add |
From victor@vsespb.ru2014-03-27 3:12 GMT+04:00 Karl Williamson <public@khwilliamson.com>:
So, it worked badly before? Then it will be hard to write code
Who said that it was a bug? I saw this behaviour but never thought it was
|
From @ap* Karl Williamson <public@khwilliamson.com> [2014-03-27 03:10]:
Maybe you can attach magic that prevents a downgrade? |
From @khwilliamsonOn 03/27/2014 04:57 AM, Aristotle Pagaltzis wrote:
That sounds like a better approach, but it is an area that I know Likewise, adding the ZERO WIDTH SPACE would need to be done early in the |
From @khwilliamsonOn 03/27/2014 02:01 AM, Victor Efimov wrote:
I don't follow your logic. 5.20 will contain a bunch of bug fixes It's a given that we can't break things like ack unless there is an easy
I disagree that documenting bad behavior means it should not eventually If we revert this commit, those bugs come back. |
From victor@vsespb.ru2014-03-27 22:14 GMT+04:00 Karl Williamson <public@khwilliamson.com>:
It is hard to write code which works in 5.8 and 5.20 at the same time
But those are not bugs compared to the real trouble now. And if it was Why is it so complex to just introduce $DECODED_ERRNO or a pragma to
From @demerphqOn 2 March 2014 06:43, Karl Williamson <public@khwilliamson.com> wrote:
Unless I have misunderstood then it is not just ack. But pretty much every Perl program I ever wrote, or saw, that was in Perl. This type of pattern is extremely pervasive:

open my $fh, ">", $file

I am under the impression you are saying they all have to change to:

open my $fh, ">", $file

Which I find almost astounding. Please tell me I have misunderstood.
I accept it. However I think it is secondary to the question of
For me prioritising "use locale" over every other script is
I personally think that $! should be left alone, and you should Yves -- |
From @khwilliamsonI was wrong in several things when I wrote this; please skip to later On 03/27/2014 04:07 PM, demerphq wrote:
|
From @khwilliamsonIn this post, I will just give some new insights I had today. There are real bugs (even if the others previously mentioned aren't Consider this one liner: LC_ALL=zh_CN.utf8 ./perl -Ilib -le 'use utf8; In blead, it prints, as it should, In 5.18.2 it prints this garbage instead The reason is that the program is encoded in utf8, and $! has returned (I chose Chinese because its script could not be confused with Western "use utf8" is not necessary for this. It could be "die "$prefix: $!" These examples show, once again, the perils of having a scalar that's in Another problem with all existing versions is if the $prefix is written ./perl -Ilib -le '$!=1; die "fatális hibát: $!"' (apologies to the Hungarian speakers) If this is however run in a non-Latin1 locale, like say LC_ALL=el_GR.iso88597 ./perl -Ilib -le '$!=1; die "fatális hibát: $!"' The first part of the string is in Latin1, and the 2nd part is in There is no current way for an application to guard against this; it is I claim this shows the perils of having stuff appear in the underlying I believe the solution is to make $! return the C locale messages My recent proposal also works. That is to use the $! locale value Note that in the messages above, that Perl itself outputs its warnings What part of CPAN is expecting native-language $! ? I don't know, but |
From @khwilliamsonFixed for v5.20 by b17e32e The plan for v5.21 is to make $! return locale messages only from within the scope of 'use locale'. In other words, locale has to be opt-in. |
@khwilliamson - Status changed from 'open' to 'resolved' |
From victor@vsespb.ru
I never received this message; I only received a notice that the bug was resolved. On Thu Mar 27 22:09:05 2014, public@khwilliamson.com wrote:
It's a general limitation of perl - one should not merge character strings with binary strings. Not a bug, but expected behaviour.
The locale is iso88597, so the terminal should be set to iso88597 (otherwise everything is garbage). And if it is, it's not
So you are worrying more about broken tests on CPAN, and not much about real bugs in users' code (which are not caught by tests). Users will be surprised that perl stopped giving $! in the locale's language, but they cannot catch this in tests because they never suspect that such brokenness could be introduced (unit tests are white-box testing - you can only test for bugs you expect)
Migrated from rt.perl.org#119499 (status was 'resolved')
Searchable as RT119499$