readdir() return value should be documented as always downgraded #13183
Comments
From victor@vsespb.ru
readdir() return value should be documented as always downgraded.
otherwise assumptions:
example:
opendir(my $dh, '.');
above code fails if binary strings were upgraded and there are line
"$ARGV[0] ? utf8::upgrade($_) : utf8::downgrade($_);"
represent the solution:
programmer should explicitly call utf8::downgrade($_) before "-e"
opendir(my $dh, '.');
but that is correct solution only if we are sure that readdir returns same
probably true for readlink and @ARGV
From @b2gills
On Tue, Aug 20, 2013 at 5:07 PM, Victor Efimov
We should DEFINITELY document that filesystem names
This can become problematic if you normalize a UTF8 text
So readdir should only return strings with the UTF8 flag set
The RT System itself - Status changed from 'new' to 'open'
From victor@vsespb.ru
I think it's documented that it returns binary data: http://perldoc.perl.org/perlunicode.html#When-Unicode-Does-Not-Happen
The problem is that it's hard to understand that readdir() will always return
Who knows, maybe sometimes it sets (or will set in the future) the utf-8 bit
The only thing documented about readdir(): 1) it does not return
so if we imagine that for filename "\xC2\xB5" readdir sets the UTF-8 flag, it
I think it's something different. Do you mean Unicode NFC/NFD
On Tue Aug 20 20:42:41 2013, brad wrote:
From @Leont
On Wed, Aug 21, 2013 at 12:07 AM, Victor Efimov
I think this "always downgraded" concept is misunderstanding how Unicode
2. binary data can be (randomly) upgraded or downgraded by 3rd party code
If you want to treat something as a textual string, then all your code
Anything else is madness.
If 3rd party code is upgrading your binary data, then you're passing binary
Correct.
4. Programmer might want to work with filenames as with binary strings (not
Worse, filesystem encoding is generally non-portable. On Windows and Mac,
No, he should not! He should either encode or downgrade, depending on
Utf-8 encoded data will roundtrip an upgrade/downgrade, but the
Leon
From victor@vsespb.ru
my $s1 = chr(0x100);
Another example when ASCII data can get utf-8 flag:
use encoding "utf8";
variable $x will upgrade binary string if concatenated with it.
and the point is that the programmer cannot control whether his binary data is upgraded or not.
On Wed Aug 21 01:33:37 2013, LeonT wrote:
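A minimal sketch of the implicit upgrade discussed in this comment. It assumes a filename made of the two bytes \xC2\xB5 and an ASCII suffix that happens to carry the UTF8 flag; both values are hypothetical and chosen only for illustration. It shows that concatenation upgrades the result while string comparison is unaffected, and that an explicit utf8::downgrade restores the byte representation.

```perl
use strict;
use warnings;

# Hypothetical filename bytes, such as readdir() might return
# (the UTF-8 encoding of U+00B5). No "use utf8", so this is a byte string.
my $bytes = "\xC2\xB5";

# An ASCII-only string that carries the UTF8 flag (assumed to come
# from some other part of the program).
my $suffix = ".txt";
utf8::upgrade($suffix);

# Concatenating with an upgraded string upgrades the result.
my $joined = $bytes . $suffix;

printf "bytes flagged:  %d\n", utf8::is_utf8($bytes)  ? 1 : 0;
printf "joined flagged: %d\n", utf8::is_utf8($joined) ? 1 : 0;

# String semantics are unchanged: same characters, same length.
print "still equal\n" if $joined eq "\xC2\xB5.txt";

# An explicit downgrade of a copy restores the byte representation.
my $copy = $joined;
utf8::downgrade($copy);
printf "copy flagged:   %d\n", utf8::is_utf8($copy) ? 1 : 0;
```

The flag only becomes visible to interfaces that look at the internal buffer, such as the file test operators discussed in this thread.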
From @Leont
On Wed, Aug 21, 2013 at 11:19 AM, Victor Efimov via RT <
I just explained that choice in my previous message… You're completely
That's silly and unfortunate. If one gets invalid input, one should treat
Did you notice the word "if"?
point is programmer cannot control if his binary data upgraded or no.
Bullshit. It's more difficult than it should be, but he can absolutely
that is probably why syswrite(), print() (at least with :raw layer),
PerlIO can explicitly switch both ways. And no, it doesn't always
Digest::SHA, MIME::Base64 and other functions, that work with binary
Neither of those have any excuse to do so IMHO.
Leon
From victor@vsespb.ru
point was I am not talking about textual data. only about binary data.
no, upgraded _binary_ data never raises "Wide character" errors/warnings.
syswrite with :raw layer does so too.
On Wed Aug 21 03:43:02 2013, LeonT wrote:
From @ap
* Leon Timmermans <fawaka@gmail.com> [2013-08-21 10:35]:
Uh, exactly where did you get that idea?
Downgrading means changing the string’s internal representation from the
Sure. But the unfortunate fact is that open() and friends still suffer
You are confusing upgrade/downgrade with decode/encode. You are correct
That really doesn’t matter to the issue in question.
That hinges on whether readdir() returns UTF8-encoded filenames decoded
That is, currently.
I hope we can eventually fix open() et al and put this issue behind us.
Only if the UTF8 flag is not respected. Which open() doesn’t. Otherwise
On the contrary, *if* readdir() does the right thing, then it is not just the
Regards,
From @Leont
On Sun, Aug 25, 2013 at 4:57 AM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
I could have phrased it better (in particular mention internal
Not sure what you mean with that, given that the bug is that there is
I'm not sure what you want; it currently already always returns bytes.
upgrade($foo) eq (is_utf8($foo) ? $foo : decode('latin-1', $foo));
Downgrade is nothing more or less than an efficient way to encode to
It does if we ever decide to support Win32's unicode filename APIs (we
It would not be semantically correct: it would already be mojibake, even if
Because it is double-encoded!
It's currently safe and completely unnecessary.
I hope we can eventually fix open() et al and put this issue behind us.
I think this concept of "UTF8 flag is not respected" is nonsensical in this
Leon
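The equivalence stated in this comment (upgrade behaves like decoding from Latin-1, and downgrade like encoding to Latin-1) can be checked with a short sketch. The byte string below is a hypothetical example, and the canonical encoding name iso-8859-1 is used in place of the 'latin-1' spelling from the comment.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical byte string with two high bytes (downgraded representation).
my $foo = "\xE9\xB5";

# upgrade() changes only the internal representation ...
my $up = $foo;
utf8::upgrade($up);

# ... and yields the same characters as decoding the bytes as ISO-8859-1.
my $decoded = decode('iso-8859-1', $foo);

# downgrade() goes the other way, like encoding to ISO-8859-1.
my $down = $up;
utf8::downgrade($down);
my $encoded = encode('iso-8859-1', $up);

print "upgrade matches decode\n"   if $up   eq $decoded;
print "downgrade matches encode\n" if $down eq $encoded;
```

Note that eq compares characters, not internal buffers, so both comparisons hold regardless of which representation each operand uses.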
From victor@vsespb.ru
Seems we disagree only about whether binary data can be upgraded by
So, here is code. Programmer concatenates binary data (filename) with text
From @Leont
On Tue, Aug 27, 2013 at 11:59 PM, Victor Efimov <victor@vsespb.ru> wrote:
No, I believe the fix is either of:
In particular, I prefer preventing implicit upgrades, as I find them
Your approach works in this case because your explicit downgrade is matched
My arguments for downgrading binary data:
In discussing these things, sometimes some words mean different things to
That doesn't necessarily make it sensible to do.
That makes no sense to me, care to explain?
That is true either way.
That's not exactly relevant to this discussion.
Both JSON::XS and PerlIO allow you to explicitly switch between either
I think that sort of choice is often the best way to go forward. E.g.
No, it comes from downgrading data, not necessarily 'binary' data.
Leon
From victor@vsespb.ru
(1) - this ticket was only about the case when the programmer doesn't know the
(2) so I was right when I said that
No! I am talking only about filenames as "binary strings" - (that was ==== ====
"binary" explained here:
http://perldoc.perl.org/perlunifaq.html#How-can-I-determine-if-a-string-is-a-text-string-or-a-binary-string?
"How can I determine if a string is a text string or a binary string?"
"This is something you, the programmer, has to keep track of; sorry."
http://perldoc.perl.org/perlunicode.html#When-Unicode-Does-Not-Happen
"The following are such interfaces. Also, see The Unicode Bug. For all
So programmer can treat filenames as binary data. And work with it just
I meant filenames are binary data. but often people have to work with
On Wed Aug 28 10:44:12 2013, LeonT wrote:
From @xdg
On Tue, Aug 20, 2013 at 6:07 PM, Victor Efimov
I've really gotten lost in this thread. I think the original point of
I think the remaining confusion is over what to recommend for
Here, I think I more or less agree with Leon that preventing implicit
In other words, if you know you're dealing with binary data, make sure
Thus it seems like the recommendation should be:
(1) If you know the encoding of a name read from readdir(), decode it
(2) If you *don't* know the encoding, make sure any strings you're
Have I misunderstood something?
David
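Recommendation (1) above might look like the following sketch. It assumes the names on disk are UTF-8, which is an assumption about the filesystem and not something Perl can know; the temporary directory and the file name are hypothetical. Names are decoded right after readdir() and encoded again before being handed to a file test.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempdir);

# Set up a scratch directory holding one file with a non-ASCII name
# (the bytes \xC2\xB5 are the UTF-8 encoding of U+00B5).
my $dir = tempdir(CLEANUP => 1);
open(my $fh, '>', "$dir/\xC2\xB5.txt") or die "open: $!";
close($fh);

opendir(my $dh, $dir) or die "opendir: $!";
my @raw = grep { $_ ne '.' && $_ ne '..' } readdir($dh);
closedir($dh);

my @existing;
for my $raw (@raw) {
    # Decode on input: from here on $text is a character string.
    my $text = decode('UTF-8', $raw);

    # Encode on output: file operations get bytes again.
    my $bytes = encode('UTF-8', $text);
    push @existing, $text if -e "$dir/$bytes";
}

printf "%d name(s) found, first has %d character(s)\n",
    scalar(@existing), length($existing[0]);
```

With the encoding unknown, as in recommendation (2), the decode and encode steps would be dropped and the names kept as byte strings throughout.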
From zefram@fysh.org
David Golden wrote:
I disagree. The internal representation should be as invisible as
It also means that operators that take a string operand and just use the
-zefram
From victor@vsespb.ru
I actually thought that _could_ be placed at _least_ in Unicode documentation
Currently what is documented is:
http://perldoc.perl.org/perlunicode.html#When-Unicode-Does-Not-Happen
"For all of these interfaces Perl currently (as of v5.16.0) simply
i.e. documented as "byte strings" without note that it's downgraded().
Yes, it's indeed wise. At least it's faster and saves memory. But is
Perl code (eq, print, syswrite) works fine with upgraded binary strings.
I think it's not documented that upgraded binary data is invalid.
I also suspect there can be existing code written which works with binary
Now, if we recommend never concatenating strings with UTF-8 flag with
"If you ever concatenate ASCII string with UTF-8 bit with filename,
After that it seems we'll have three categories of strings:
1. character strings
previously we had only two categories (1,2)
So I suggest simply documenting that filenames are always downgraded, so
No.
On Wed Aug 28 12:12:34 2013, xdg@xdg.me wrote:
From zefram@fysh.org
Victor Efimov via RT wrote:
That sort of statement is rather ambiguous. From the now-recommended
But what that statement in perlunicode(1) really means is that these
Actually that documentation isn't written to be interpreted that way.
-zefram
From @Leont
On Wed, Aug 28, 2013 at 9:27 PM, Zefram <zefram@fysh.org> wrote:
I don't see the disagreement.
That's an entirely different (but valid) discussion, orthogonal to this
Leon
From @ikegami
On Wed, Aug 21, 2013 at 4:32 AM, Leon Timmermans <fawaka@gmail.com> wrote:
It has nothing to do with UTF-8 or latin-1. (Both upgraded and downgraded
Indeed.
2. binary data can be (randomly) upgraded or downgraded by 3rd party code
You will need to correctly decode it on input, and correctly encode it on
(The only time you need to upgrade or downgrade is when you deal with buggy
Unfortunately. This is one of the last instances of The Unicode Bug in core.
But it is here. upgrade and downgrade are the only way to get predictable
Perl treats file names as bytes, and its operators expect these bytes to
On some systems, you can get away with passing upgraded Unicode code points.
Options:
-e _d($file_name_bytes)
sub _d { my ($s) = @_; utf8::downgrade($s); $s }
(In that particular example, _u isn't actually needed because decode_utf8
- Eric
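A self-contained sketch of the _d helper from this comment in use. The scratch directory, the file name, and the deliberate utf8::upgrade call (standing in for an accidental upgrade by other code) are assumptions made for illustration, and it assumes a filesystem that accepts the byte name used here.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Return a downgraded copy of the argument. utf8::downgrade() dies if the
# string contains code points above 0xFF, which a byte string never does.
sub _d { my ($s) = @_; utf8::downgrade($s); $s }

# Create a file whose name contains non-ASCII bytes (hypothetical name).
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/\xC2\xB5";
open(my $fh, '>', $path) or die "open: $!";
close($fh);

# Simulate an accidental upgrade of the byte string by other code.
utf8::upgrade($path);

# Downgrade before the file test so the operator sees the original bytes.
my $bytes = _d($path);
my $found = -e $bytes ? 1 : 0;

print "found via downgraded name: $found\n";
```

Because _d works on a copy, the caller's string keeps whatever representation it had.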
From @xdg
On Wed, Aug 28, 2013 at 3:27 PM, Zefram <zefram@fysh.org> wrote:
If you flip those statements around, it's because we have faulty
I agree 100% that it would be great if it could be invisible, but I'm
readdir( $dir_handle, ":utf8" ); # read and decode
For file *content* we have layers. For file *names* we make users
David
From Mark@Overmeer.net
* David Golden (xdg@xdg.me) [130828 22:09]:
Although very well possible, it is very inconvenient when the admin has
Therefore, a good default is probably the codeset in LC_CTYPE
Mark Overmeer MSc
MARKOV Solutions
From victor@vsespb.ru
- this ticket only for case when encoding is unknown and filenames are
2013/8/29 Mark Overmeer <mark@overmeer.net>
Migrated from rt.perl.org#119395 (status was 'open')
Searchable as RT119395$