$! returned with UTF-8 flag under UTF-8 locales only under 5.19.2+ #13208
Comments
From victor@vsespb.ru
$! is returned as a character string under 5.19.2+ and UTF-8 locales. But as I believe this is useless and just makes it harder to decode the $! value Also I am not sure if it will be possible to decode it when language with LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.utf8 perl -MPOSIX -MDevel::Peek SV = PV(0x144dd80) at 0x14702a0 LANG=ru_RU LANGUAGE=ru_RU:ru LC_ALL=ru_RU.CP1251 LC_MESSAGES=ru_RU.CP1251 SV = PV(0x1db8d80) at 0x1ddf7e0 |
From victor@vsespb.ru
Seems this is the result of However I think the fix is wrong. 1) it breaks old code, which: a) tries to decode $! using Encode::decode and b) which prints error messages to screen as-is (without "binmode STDOUT 2) Sometimes it returns a binary string (under non-utf8 locales, or when It's hard to distinguish one from the other. A possible solution is Another solution is to use Encode::decode_utf8 when the locale is UTF-8 ( but The problem is that this method's documentation is wrong - several people https://rt.cpan.org/Public/Bug/Display.html?id=87267 3) It's not documented in perllocale, perlunicode, perlvar. 4) It's not clear how it works in case of Latin-1 characters in UTF-8 On Wed Aug 28 01:52:13 2013, vsespb wrote:
|
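One way to read the Encode::decode suggestion above (a hedged sketch, not a fix agreed in the thread; `errno_text` is a hypothetical helper name): normalize $! to a character string whether or not perl has already set the UTF-8 flag on it.

```perl
use strict;
use warnings;

# Hypothetical helper: return the current $! as a character string,
# whether perl 5.19.2+ already decoded it (UTF-8 flag on) or it is
# still raw bytes, as on earlier perls / non-UTF-8 locales.
sub errno_text {
    my $msg = "$!";
    return $msg if utf8::is_utf8($msg);   # already a character string
    my $copy = $msg;
    return $copy if utf8::decode($copy);  # valid UTF-8 bytes: decode them
    return $msg;                          # some other encoding: leave as-is
}
```

As discussed later in the thread, a UTF-8 validity check can misclassify byte strings that merely happen to be valid UTF-8, so this is a heuristic, not a guarantee.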
From @khwilliamsonOn 08/28/2013 02:52 AM, Victor Efimov (via RT) wrote:
I am trying to understand your issues with this change. I believe it is
I don't understand your use of the word 'binary' here. In both cases, In string contexts, it returns the appropriate encoding. In UTF-8
I don't have a clue as to why you think this is useless. This change Code that is trying to decode $! should be using the (constant) numeric
Again, use the numeric value when trying to parse the error.
I ran this, substituting 'say $!' for the Dump, and got this output: which is the correct Cyrillic text. Prior to the patch, this would have
I do not have a Windows machine with CP1251, but I hand looked at this |
The RT System itself - Status changed from 'new' to 'open' |
From victor@vsespb.ru
I am not trying to parse $!. I am trying to print the original error message
It is just wrong to sometimes return bytes, sometimes characters. The following example worked fine before this change:

use strict;
my $filename = "not a file " . chr(0x444);
open my $f, "<", $filename or do {

but with this change: - under non-Unicode locales it works fine. A possible fix for this example is: replace Another place where it breaks old code is: perl -e 'open my now prints the warning "Wide character in die" when the locale is UTF-8 and
No, prior to this patch it prints correct (same) text but without "Wide On Wed Aug 28 10:19:47 2013, public@khwilliamson.com wrote:
|
From @LeontOn Wed, Aug 28, 2013 at 10:52 AM, Victor Efimov
Automatic decoding is definitely the more useful behavior. I agree Also I am not sure if it will be possible to decode it when language with
AFAIK that should work perfectly fine. Leon |
From victor@vsespb.ru
yes. when b) it's not breaking existing code.
yes, especially when sometimes it's bytes, sometimes characters and you
I think in Perl you can get the encoding with Perhaps that can be fixed in Perl code, in Errno. (We already load And I am totally not sure about perl C internals.
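The locale's encoding can be queried from pure Perl like this (a sketch; I18N::Langinfo is a core module, but it is not available on every platform, e.g. Windows):

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

setlocale(LC_CTYPE, '');          # adopt the locale from the environment
my $codeset = langinfo(CODESET);  # e.g. "UTF-8", "KOI8-R", "ANSI_X3.4-1968"
print "codeset: $codeset\n";
```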
I cannot do C coding. Also I think that old code, relying on old behaviour was not relying on it was partly documented: http://perldoc.perl.org/perllocale.html
(also perllocale now have updates, related to $! in blead) http://perldoc.perl.org/perlunicode.html
So ideal fix would be imho: On Wed Aug 28 12:44:17 2013, LeonT wrote:
|
From victor@vsespb.ru
There is a distribution which decodes POSIX::strerror with I18N::Langinfo: http://search.cpan.org/~kryde/I18N-Langinfo-Wide-7/ also, another possible problem is that all examples in the perl documentation, http://perldoc.perl.org/perlopentut.html

open(INFO, "datafile") || die("can't open datafile: $!");

========
|
From sog@msg.mxTime to set PERL_UNICODE=SL ? Salvador Ortiz. On 08/28/2013 04:36 PM, Victor Efimov via RT wrote:
|
From @cpansproutOn Wed Aug 28 10:19:47 2013, public@khwilliamson.com wrote:
You are describing from the point of view of internals. From the user’s This means even #112208 is not fixed, because the test case was ‘use The ultimate problem is that perl has no way of guaranteeing that $! can So now, $! may or may not be encoded, and you have no way of telling -- Father Chrysostomos |
From victor@vsespb.ruOn Wed Aug 28 23:40:08 2013, sprout wrote:
Small corrections: a) Actually there is a way: check the is_utf8($!) flag (which is not good b) The current fix does not do environment checks, it just tries to do UTF-8
From @khwilliamsonOn 08/29/2013 02:15 AM, Victor Efimov via RT wrote:
I don't follow these arguments. What that commit did is only to look at What is different about $! is that we have made the decision to respect The change fixed two bug reports for the common case where the locales If code wants $! to be expressed in a certain language, it should set
I don't see that danger marked currently in the pod for utf8.pm. Where
(*) To be precise 1) if the string returned by the OS is entirely ASCII, it does not set 2) As Victor notes, the commit does a UTF-8 validity check, so it is |
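In Perl terms, the commit's test (as described in points 1 and 2 above) amounts to something like the following sketch; `would_set_flag` is an illustrative name, not the actual internal function:

```perl
use strict;
use warnings;

# The UTF-8 flag is set only if the message contains at least one
# non-ASCII byte AND the whole byte string is well-formed UTF-8.
sub would_set_flag {
    my ($bytes) = @_;
    return 0 unless $bytes =~ /[^\x00-\x7F]/;  # pure ASCII: flag left off
    my $copy = $bytes;
    return utf8::decode($copy) ? 1 : 0;        # well-formed UTF-8?
}
```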
From victor@vsespb.ruOn Thu Aug 29 13:05:00 2013, public@khwilliamson.com wrote:
http://perldoc.perl.org/perlunifaq.html#What-is-%22the-UTF8-flag%22?
Please, unless you're hacking the internals, or debugging weirdness,
|
From victor@vsespb.ruOn Thu Aug 29 14:06:57 2013, vsespb wrote:
Generator of byte sequences that are valid in UTF-8 and in another

#!/usr/bin/env perl
use strict;
binmode STDOUT, ":encoding(UTF-8)";
my @A = grep { /\w/ } map { chr($_) } (128..1024);
for my $z1 (@A) {
for my $z2 ('', @A) {
for my $z3 ('', @A) {

example output: perl -e 'use Encode; binmode STDOUT, ":encoding(UTF-8)"; my $z = example output of output example: $perl -e 'use Encode; binmode STDOUT, ":encoding(UTF-8)"; my $z = |
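A minimal instance of what the generator above searches for: one byte sequence that is well-formed, meaningful text in two different encodings at once (a hedged illustration):

```perl
use strict;
use warnings;
use Encode qw(decode);

# "\xd0\xbe" is the UTF-8 encoding of the single Cyrillic letter "о",
# but the very same two bytes are also valid CP1251 text ("Рѕ").
# A UTF-8 validity check therefore cannot prove which encoding the
# operating system actually used for the message.
my $bytes     = "\xd0\xbe";
my $as_utf8   = decode('UTF-8',  $bytes);   # 1 character: U+043E
my $as_cp1251 = decode('cp1251', $bytes);   # 2 characters: U+0420 U+0455
```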
From victor@vsespb.ruOn Thu Aug 29 13:05:00 2013, public@khwilliamson.com wrote:
Rare under linux. AFAIK FreeBSD 9 (latest stable) users still have There are real users with non-UTF8 locales, I saw one. We've spent A real application which is broken by this change is 'ack' (ack 1 and Russian error messages are now printed with a warning (under a UTF-8 locale!).
From @cpansproutOn Thu Aug 29 13:05:00 2013, public@khwilliamson.com wrote:
You are still describing this from the point of view of the internals. From the users point of view, the utf8 flag does not mean it is encoded
I.e., still encoded.
The former is the problem, not the latter. If a program can find out
I don’t follow. The bytes inside the scalar are not visible to the Perl Your commit changed the content of the scalar as returned by ord and
The problem here is that the locale is only sometimes being respected.
But the less frequent cases now require one to introspect internal Also, is that really more frequent? What about scripts that pass $!
Are you suggesting that perl itself start defaulting to the C locale for $!?
(Since Perl 5.8.1) Test whether I<$string> is marked internally as I think he is referring to ‘internally’ here, which indicates that you
http://perl5.git.perl.org/perl.git/commitdiff/1500bd919ffeae0f3252f8d1bb28b03b043d328e
That is all very nice, but how would you rewrite this code to work in

if (!open fh, $filename) {

-- Father Chrysostomos |
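One version-portable answer in the spirit of the "use the numeric value" advice (a sketch, not an endorsed fix from the thread): branch on the errno constant, and report the number rather than the possibly-encoded text.

```perl
use strict;
use warnings;
use Errno qw(ENOENT);

my $filename = "no_such_file";
if (!open my $fh, '<', $filename) {
    # $! compared numerically is locale- and encoding-independent
    if ($! == ENOENT) {
        warn "missing file: $filename (errno " . ($! + 0) . ")\n";
    } else {
        die "can't open $filename: errno " . ($! + 0) . "\n";
    }
}
```

The cost, of course, is exactly what Victor objects to: the user no longer sees a human-readable message in any language.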
From @khwilliamsonOn 08/31/2013 07:27 AM, Father Chrysostomos via RT wrote:
I persist in this because I believe your point is a red herring. I Rather than address most of the rest of your email, some of which I
I feel compelled to point out that this code is buggy. I18N::Langinfo But on platforms where it works reliably, and the typical case where I think all of us would agree that deference should be paid to And $! remains an outlier in the sense that it is AFAIK, and I've looked I'm pretty confident that the problem can't be solved so that no code If this commit is reverted, we do need to decide how we will address the |
From victor@vsespb.ru2013/9/1 Karl Williamson <public@khwilliamson.com>
But that is not the only place, where non-ASCII character can appear. The following is documented in perlunicode: "While Perl does have extensive ways to input and output in Unicode, and a I believe that note can mean that encoded $! is not a bug, but a feature. Thus it's impossible for people to use those variables now, as it may |
From @LeontOn Sun, Sep 1, 2013 at 4:36 PM, Victor Efimov <victor@vsespb.ru> wrote:
$! is inherently a piece of text, not piece of binary data. As such, it Leon |
From @LeontOn Sun, Sep 1, 2013 at 6:36 AM, Karl Williamson <public@khwilliamson.com>wrote:
Yeah, in POSIX strftime and the is* functions are also affected.
That does sounds like consistency to me. I'm pretty confident that the problem can't be solved so that no code has
That is my feeling too. The new situation feels rather unfinished to me, Currently, using $! in production code that can be operated by users who
Indeed. Leon |
From @khwilliamsonOn 08/31/2013 10:36 PM, Karl Williamson wrote:
I now feel compelled to point out that I should have been more clear |
From victor@vsespb.ru2013/9/1 Leon Timmermans <fawaka@gmail.com>
btw, interesting that |
From @khwilliamsonOn 09/01/2013 10:47 AM, Victor Efimov wrote:
I've been wondering myself what should happen with $^E, and I believe Some other thoughts I've had about this issue. The commit did not break the ISO 646 7-bit codings, as the behavior is Those encodings must not be very important nor have been for quite some We could have a feature automatically turned on in v5.20. I'll call it Without it being on, $! works as it did in <=v5.18. Within its scope Perl attempts to decode $! as best it can, autoloading This may be a crazy idea; but I thought I'd put it out there to |
From zefram@fysh.orgKarl Williamson wrote:
Scoping doesn't work well for this sort of thing. The decoding happens in Amusingly, -zefram |
From victor@vsespb.ruone problem with lexical scope is also POSIX::strerror. strerror => 'errno => local $! = thus it has to be fixed too if we implement lexical featurization. 2013/9/1 Zefram <zefram@fysh.org>
|
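The coupling Victor points out can be made concrete: strerror-style text can be produced from $! itself, so whatever encoding rules $! follows, such a helper follows automatically (a sketch; this is not necessarily how POSIX::strerror is actually implemented in any given perl):

```perl
use strict;
use warnings;

# Set the numeric errno, then let $!'s stringification produce the
# text. Whatever locale/encoding behaviour applies to $! applies here.
sub my_strerror {
    my ($errno) = @_;
    local $! = $errno;   # $! is restored automatically at scope exit
    return "$!";
}
```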
From @cpansproutOn Sun Sep 01 09:24:19 2013, public@khwilliamson.com wrote:
More importantly, as Victor pointed out, it breaks programs that are not On dromedary I get this with the system perl (5.12.3): $ LC_ALL=hu_HU.utf8 perl -e 'open "oentuheon" or die $!' When I build my own (blead)perl, I get this: $ LC_ALL=hu_HU.utf8 ./perl -e 'open "oentuheon" or die $!' -- Father Chrysostomos |
From @cpansproutOn Sun Sep 01 11:09:41 2013, sprout wrote:
RT screwed it up. That appeared perfectly fine.
That is as it appeared, with question marks. -- Father Chrysostomos |
From @cpansproutOn Sun Sep 01 10:46:13 2013, zefram@fysh.org wrote:
A new global variable is another option. -- Father Chrysostomos |
From victor@vsespb.ru2013/9/1 Father Chrysostomos via RT <perlbug-followup@perl.org>
|
From @khwilliamsonOn 09/02/2013 05:10 PM, Victor Efimov wrote:
I have come to believe that this is probably the best way forward. That In any event, there should be uniform treatment of Does anyone know if the strings for the platforms that have separate $^E These include vms, win32, dos, and os/2. |
From victor@vsespb.ruOn Tue Oct 15 14:59:45 2013, public@khwilliamson.com wrote:
New behaviour looks sane to me. It's probably that way it's supposed to There were comments that enabling new behaviour in lexical scope is not The big problem that I see now is backward compatibility. Any existing Users will have to fix it with use locale/use bytes. A few examples that I found (where filenames are concatenated with $!): ==== File::Find LWP::UserAgent ==== Note that if the filename here contains non-ASCII characters and is binary Even if the filename is ASCII, it would break old behaviour when die If the filename is a character string, that code did not work correctly Another issue is that there is POSIX::strerror, and IMHO it should behave
From @cpansproutOn Wed Oct 09 18:46:47 2013, public@khwilliamson.com wrote:
The problem with the bytes pragma is that two scalars may compare equal $! does not do that. In fact, it is more akin to the default input and I don’t have enough room in my brain to fit all the issues that are Simple programs like ack that do not take encodings into account should LC_ALL=hu_HU.utf8 perl -e 'open "oentuheon" or die $!' on dromedary with and without bleadperl. That type of code should Maybe what you are really after is a *function* that returns a decoded $!. -- Father Chrysostomos |
From @maukeOn 22.10.2013 22:48, Father Chrysostomos via RT wrote:
With or without PERL_UNICODE=SL? Because that's on by default in my environment.
Doesn't interpolate nicely in error messages. -- |
From @cpansproutOn Tue Oct 22 14:41:09 2013, plokinom@gmail.com wrote:
All I can say is, ouch! I have always found use of PERL_UNICODE to be -- Father Chrysostomos |
From @cpansproutOn Tue Oct 22 14:46:12 2013, sprout wrote:
I think I meant suspect, or whatever.
In particular, PERL_UNICODE=SL breaks any simple Perl implementation of cat. -- Father Chrysostomos |
From victor@vsespb.ru2013/10/23 Lukas Mai <plokinom@gmail.com>
I think things like 'ack' won't work this way. They read data also from |
From @maukeOn 22.10.2013 23:52, Father Chrysostomos via RT wrote:
Isn't such a "simple" implementation already broken on systems like Windows? -- |
From @khwilliamsonI have now pushed a series of patches that make the handling of this |
From @tonycozOn Tue Nov 26 20:25:58 2013, public@khwilliamson.com wrote:
This is a 5.20 blocker. Did you have time to look further? Though I'll admit the conversation has gone back and forth so much I'm not sure what remaining issues there are. Tony |
From @khwilliamsonOn 02/05/2014 03:52 PM, Tony Cook via RT wrote:
This is correctly listed as a blocker. I have thought further about |
From @khwilliamsonThis is my attempt to bring some clarity to this issue and stake out my First the background. This ticket is about a commit that fixed two The problem is that $! was returning UTF-8 encoded text, but the UTF-8 The fix was simply to set the UTF-8 flag if the text is valid UTF-8. The problem with this is that it breaks code that just output the If the broken code uses $! within the scope of the hated-by-some 'use Thus a potential solution is to force such code to change to do a 'use Otherwise we are in a quandary. If we revert the commit, code that Before proceeding, I want to make an assertion: I think that it is Do you accept or reject this assertion? If you don't accept it, then you need to persuade me and others who do If you do accept it, one solution is to always output $! in English, This could be relaxed by using the POSIX locale instead. On most A more general solution would be to output it in the native locale The reason that this issue comes up for programs that don't handle UTF-8 That leads to yet another possibility, one that rjbs has previously It seems to me wrong to deliver $! locale-encoded to programs that To state my position explicitly: I don't think it's a good idea to So still another possibility is to deliver $! in the current locale if Another possibility, suggested by FC, is to leave $! as-is, but create a |
From victor@vsespb.ru2014-03-02 9:43 GMT+04:00 Karl Williamson <public@khwilliamson.com>:
Of course English is better than garbage. BUT this is correct only for For end users it will look like this: "on one machine everything is to me it looks like English is better than garbage, but 5.18 behaviour
From @khwilliamsontl;dr summary of this I assert it is better to have an error message come out in a foreign If we output UTF-8 bytes without the UTF-8 flag being on to code that My bottom line proposal is to look at the $! text, and if it contains We can be reasonably confident that the program can handle UTF-8 if we But if we are not within such scope we can't be confident at all about This is not ideal but it pretty much assures that no one is going to get On 03/01/2014 10:43 PM, Karl Williamson wrote:
|
From @khwilliamsonI looked at beyondgrep/ack2#367 If you look at that link, you'll see that the russian comes out fine, but with a warning that didn't use to be there; the french is broken. What is happening is that ack treats everything as bytes, and so everything just worked. STDERR is opened as a byte-oriented file, and if What the 5.19 change did effectively is to make the stringification of "$!" obey "use bytes". Most code isn't in bytes' scope, so the UTF-8 flag gets turned on if appropriate. Perl's do_print() function checks if the stream is listed as UTF-8 or not. The string being output is converted to the stream's encoding if necessary and possible. If not possible, things are just output as-is, possibly with warnings. In ack's case the stream never is (AFAIK) UTF-8. Starting in 5.19.2+, the message can be marked as UTF-8, and so tries to get converted to the non-UTF-8 stream. This is impossible in Russian, so the bytes are output as-is, with a warning. Since the terminal really is UTF-8, they display correctly. But it is possible to convert the French text, as all the characters in the message in the bug report are Latin1. So do_print() does this, but since the terminal's encoding doesn't match what ack thinks it is, the non-ascii characters come out as garbage. Note that ack has some of its messages hard-coded in English. For example, it does a -e on the file name, and outputs English-only if it doesn't exist. rjbs has pointed out to me privately that typical uses of $! are of the form die "my message in English: $!" I am not an ack user, but it appears to me that ack is like a filter which doesn't care about encodings. It is byte rather than character oriented. This seems to me to be an appropriate use of 'use bytes', and if ack did this, this bug would not arise. My proposal to only use ASCII characters in error messages unless within 'use locale' would also fix this problem. 
All messages that print in Russian, and some messages in French, would now appear in English, adding to the several that already print in English no matter what. |
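The do_print() behaviour described above can be observed from pure Perl with in-memory filehandles (a sketch; the exact warning text can vary between versions):

```perl
use strict;
use warnings;

my $char = "\x{444}";    # "ф" as a character string (code point > 255)

# A stream with an :encoding layer converts characters on output:
open my $utf8_fh, '>:encoding(UTF-8)', \my $out_utf8 or die $!;
print $utf8_fh $char;
close $utf8_fh;
# $out_utf8 now holds the two bytes 0xD1 0x84

# A plain byte stream cannot represent the character; perl emits a
# "Wide character in print" warning and writes its internal UTF-8 bytes:
my $warned = 0;
local $SIG{__WARN__} = sub { $warned++ if $_[0] =~ /Wide character/ };
open my $byte_fh, '>', \my $out_bytes or die $!;
print $byte_fh $char;
close $byte_fh;
```

This is the mechanism behind both ack symptoms: the Russian text hits the warning path and comes out as raw UTF-8 bytes, while the Latin-1-representable French text gets silently converted to an encoding the terminal does not use.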
From victor@vsespb.ru2014-03-27 1:41 GMT+04:00 Karl Williamson via RT <perlbug-followup@perl.org>:
yes agree. anyway warnings are bad. and broken latin1 bad too.
Right, usually "my message in English" indeed is in English because
I would disagree, they try to migrate to unicode https://github.com/petdance/ack2/issues/120 ack is searching _text_ using _perl regexps_ in text files. it even
I am writing programs with correct use of modern Perl unicode now, but also, can code without 'use locale' behave like 5.18 (i.e. not always
|
From @khwilliamsonOn 03/26/2014 04:06 PM, Victor Efimov wrote:
It's arguable that the warnings should have been output all along.
I stand corrected.
locale works a lot better (I anticipate) in 5.20 than before. I think I was already thinking that 'use locale' in 5.22 should have the ability 'use locale ':messages, numeric'; to get just the effects you want. Some of this could conceivably be
The problem is that the commit fixed real bugs in code that didn't "use Also, I hadn't realized this before, but sometimes the message's
I don't see how this differs from your suggestion above for an option to And that reminds me, MS Windows doesn't have LC_MESSAGES, AFAIK. Can |
From @khwilliamsonOn 03/26/2014 05:12 PM, Karl Williamson wrote:
Another possibility to get programs like ack to work unchanged is to add |
From victor@vsespb.ru2014-03-27 3:12 GMT+04:00 Karl Williamson <public@khwilliamson.com>:
So, it worked badly before? Then it will be hard to write code
Who said that it was a bug? I saw this behaviour but never thought it was
|
From @ap* Karl Williamson <public@khwilliamson.com> [2014-03-27 03:10]:
Maybe you can attach magic that prevents a downgrade? |
From @khwilliamsonOn 03/27/2014 04:57 AM, Aristotle Pagaltzis wrote:
That sounds like a better approach, but it is an area that I know Likewise, adding the ZERO WIDTH SPACE would need to be done early in the |
From @khwilliamsonOn 03/27/2014 02:01 AM, Victor Efimov wrote:
I don't follow your logic. 5.20 will contain a bunch of bug fixes It's a given that we can't break things like ack unless there is an easy
I disagree that documenting bad behavior means it should not eventually If we revert this commit, those bugs come back. |
From victor@vsespb.ru2014-03-27 22:14 GMT+04:00 Karl Williamson <public@khwilliamson.com>:
It is hard to write code which works in 5.8 and 5.20 at the same time
But those are not bugs compared to the real trouble now. And if it was Why is it so complex to just introduce $DECODED_ERRNO or a pragma to
From @demerphqOn 2 March 2014 06:43, Karl Williamson <public@khwilliamson.com> wrote:
Unless I have misunderstood then it is not just ack. But pretty much every Perl program I ever wrote, or saw, that was in Perl. This type of pattern is extremely pervasive:

open my $fh, ">", $file

I am under the impression you are saying they all have to change to:

open my $fh, ">", $file

Which I find almost astounding. Please tell me I have misunderstood.
I accept it. However I think it is secondary to the question of
For me prioritising "use locale" over every other script is
I personally think that $! should be left alone, and you should Yves -- |
From @khwilliamsonI was wrong in several things when I wrote this; please skip to later On 03/27/2014 04:07 PM, demerphq wrote:
|
From @khwilliamsonIn this post, I will just give some new insights I had today. There are real bugs (even if the others previously mentioned aren't Consider this one liner: LC_ALL=zh_CN.utf8 ./perl -Ilib -le 'use utf8; In blead, it prints, as it should, In 5.18.2 it prints this garbage instead The reason is that the program is encoded in utf8, and $! has returned (I chose Chinese because its script could not be confused with Western "use utf8" is not necessary for this. It could be "die "$prefix: $!" These examples show, once again, the perils of having a scalar that's in Another problem with all existing versions is if the $prefix is written ./perl -Ilib -le '$!=1; die "fatális hibát: $!"' (apologies to the Hungarian speakers) If this is however run in a non-Latin1 locale, like say LC_ALL=el_GR.iso88597 ./perl -Ilib -le '$!=1; die "fatális hibát: $!"' The first part of the string is in Latin1, and the 2nd part is in There is no current way for an application to guard against this; it is I claim this shows the perils of having stuff appear in the underlying I believe the solution is to make $! return the C locale messages My recent proposal also works. That is to use the $! locale value Note that in the messages above, that Perl itself outputs its warnings What part of CPAN is expecting native-language $! ? I don't know, but |
From @khwilliamsonFixed for v5.20 by b17e32e The plan for v5.21 is to make $! return locale messages only from within the scope of 'use locale'. In other words, locale has to be opt-in. |
@khwilliamson - Status changed from 'open' to 'resolved' |
From victor@vsespb.ru
I never received this message; I only received a notice that the bug was resolved. On Thu Mar 27 22:09:05 2014, public@khwilliamson.com wrote:
It's a general limitation of perl - one should not merge character strings with binary strings. Not a bug, but expected behaviour.
The locale is iso88597, so the terminal should be set to iso88597 (otherwise everything is garbage). And if it is, it's not
So you are worrying more about broken tests on CPAN, and not much about real bugs in users' code (which are not caught by tests). Users will be surprised that perl stopped giving $! in the locale's language, but they cannot catch this in tests because they never suspect that such brokenness could be introduced (unit tests are white-box testing - you can only test for bugs you expect)
Migrated from rt.perl.org#119499 (status was 'resolved')
Searchable as RT119499$