Perl's open() has broken Unicode file name support #15883
Comments
From @pali:

The open() function has broken processing of non-ASCII file names. Look at these two examples:

```
$ perl -e 'open my $file, ">", "\N{U+FF}"'
$ perl -e 'open my $file, ">", "\xFF"'
```

The first one creates a file named 0xc3 0xbf (UTF-8 for ÿ), the second one a file named 0xff. But those two strings, "\N{U+FF}" and "\xFF", are equal, so they must name the same file:

```
$ perl -e '"\xFF" eq "\N{U+FF}" && print "equal\n"'
```

The bug is in the open() implementation, PP(pp_open) in pp_sys.c. The file name is read from the perl scalar into a C char* as:

```
tmps = SvPV_const(sv, len);
```

but SvUTF8(sv) is *not* consulted afterwards to check whether tmps is UTF-8 encoded. So to fix this bug we need to define how open() should interpret scalars with the UTF8 flag set, which also means defining what Perl_do_open6() should receive. There are basically two problems with it:

1) On some systems (e.g. on Linux) a file name can be an arbitrary sequence of octets.
2) Perl modules probably already use perl Unicode scalars as open() arguments.

Any decision should still allow opening any file on the VFS from 1) without breaking the modules from 2). The current state is worse, as both 1) and 2) are broken.
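For contrast, a small sketch (in Python, purely as an illustration of the encode-at-the-boundary approach discussed later in this thread, not of Perl's current behaviour) of why the two spellings cannot diverge once filenames are encoded from one canonical string value:

```python
import os

# Two spellings of the same character U+00FF: as an escape and by name.
a = "\xff"
b = "\N{LATIN SMALL LETTER Y WITH DIAERESIS}"

# They are the same string value...
assert a == b

# ...and os.fsencode() maps equal strings to equal filesystem octets,
# so opening either spelling always touches the same file.
assert os.fsencode(a) == os.fsencode(b)
```

The point is that the octets handed to the OS are derived from the string's characters, never from its internal representation.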
From @jkeenan:
On Tue, 21 Feb 2017 20:58:03 GMT, pali@cpan.org wrote:

ISTR seeing a fair amount of discussion of this issue on #p5p. Would anyone care to summarize this discussion? Thank you very much.
The RT System itself - Status changed from 'new' to 'open'
From @pali:

Some more information: Windows has two sets of functions for accessing files, the -A (ANSI) variants and the -W (wide-character) variants. Linux stores file names as binary octets; there is no encoding or normalization enforced. Which means there is no way to have uniform, identical multiplatform behaviour. I'm thinking that for Linux we could specify some (hint) variable which defines the encoding to use. This would allow us to have a uniform open() function which takes Unicode file names. But the problem is how Perl's open() function is currently implemented.
From zefram@fysh.org:
pali@cpan.org wrote:
Depends what you're trying to do "uniformly". If you want to be able The thing that can't be done uniformly is to generate a filename from an The strings that we should be using as filenames are strings that are
Ugh. If the `hint' is lexically scoped, this loses as soon as a
These are both crap as defaults. The locale's nominal encoding is quite The trouble here really arises because the scheme effectively uses the
The current behaviour is broken on any platform. To get to anything sane -zefram
From @Leont:
On Mon, Feb 27, 2017 at 10:21 PM, <pali@cpan.org> wrote:
Correct observations. Except OS X makes this more complicated still:
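The OS X complication alluded to here is that HFS+ normalizes filenames to a decomposed (NFD-like) form, so the stored name can differ code point by code point from what the program passed in. A Python illustration of the NFC/NFD difference (the example name is hypothetical):

```python
import unicodedata

name = "\xff"                                 # "ÿ" in composed (NFC) form
nfd = unicodedata.normalize("NFD", name)      # the decomposed form HFS+ favours

assert nfd == "y\u0308"                       # "y" + combining diaeresis
assert nfd != name                            # unequal as plain strings...
assert unicodedata.normalize("NFC", nfd) == name  # ...until re-normalized
```

So on such a filesystem, naive byte or code point comparison of "the name you created" against "the name readdir() returns" can fail even for fully valid Unicode.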
I would welcome a 'unicode_filenames' feature. I don't think any value
Both. Neither. Welcome to The Unicode Bug. Leon |
From @tonycoz:
On Tue, 21 Feb 2017 12:58:03 -0800, pali@cpan.org wrote:
This sounds like something that could be prototyped on CPAN by replacing CORE::GLOBAL::open, CORE::GLOBAL::readdir, etc.

Tony
From @pali:
On Monday 27 February 2017 15:27:32, Zefram via RT wrote:
Yes, global. Ideally something which can be set when starting perl (e.g.
Yes. And for these reasons, modules in normal cases should not change
We can use the Encode::encode() function in non-croak mode, which replaces characters that cannot be encoded. This could be the default behaviour, so that all those OS-related functions do not croak.
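A sketch of the cost of such a substituting, non-croak mode (shown in Python with an assumed ASCII locale, not with Encode itself): it never fails, but distinct names can collapse into one, which is the objection raised further down the thread.

```python
# Two distinct Unicode filenames...
name1 = "caf\u00e9"   # "café" (e with acute)
name2 = "caf\u0119"   # "cafę" (e with ogonek)

# ...encoded with a substituting (non-croaking) fallback...
enc1 = name1.encode("ascii", errors="replace")
enc2 = name2.encode("ascii", errors="replace")

# ...collapse to the same byte string, so they would name the same file.
assert enc1 == enc2 == b"caf?"
```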
It is not a crap as default. Currently locale encoding is what is used So if your locale encoding is set to ASCII then more applications are But as there are too many functions from Unicode space to bytes and more Therefore locale encoding is what we can use as it is the only one
Latin-1 is not sane as it is unable to handle Unicode strings with
If we want to handle any Unicode string created in perl and passed to open(), we need one behaviour; if we want to open an arbitrary file stored on disk (in bytes), then we need another. Both cannot be achieved at once. And if there is some function, it is still not automatic: the user or application (or library or system) must know which encoding applies. Therefore I suggest using the default encoding from the locale, with the ability to override it: either temporarily change the locale encoding, or pass some argument to perl.
We can define a new `use feature 'unicode_filenames'`, or something like that.
From @pali:
On Tuesday 28 February 2017 00:35:45, Leon Timmermans wrote:
It is not a problem or a complicated issue. It just means that OS X uses its own non-standard NFD variant of Unicode, encoded in UTF-8.
So it is time for the unicode_filenames feature, to fix that bug.
From zefram@fysh.org:
pali@cpan.org wrote:
No, that fucks up the filenames. After such a substituting decode,
I don't follow your argument here. You don't seem to be addressing the
My objective isn't to make every Unicode string represent a filename.
Impossible. The locale model of character encoding (as you treat it The locale encoding is OK if one treats it strictly as a user
I run in the C locale, which on this system has nominally ASCII encoding. The only programs I've encountered that have any difficulty with
That would be a lexically-scoped effect, which (as mentioned earlier) -zefram
From zefram@fysh.org:
I wrote:
Oops, I meant "arbitrary octet strings" there. -zefram
From @pali:
On Thursday 02 March 2017 04:23:35, Zefram via RT wrote:
Understood. As I wrote in my first email, we probably cannot have both.
Basically, the output from ordinary applications is Unicode file names, not octets. Similarly, a user enters a filename into a file-open dialog, or on console stdin, as Unicode. Also, I want to create a file named "ÿ" with perl in the same way on Windows and on Linux. So to have a fixed open() we need to be able to represent every perl Unicode string as a file name.
I understand... and in the current model of perl strings it is impossible.
Right!
Yes, it is broken. But problem is that it is used by system
Probably some programs, like "ls", are not able to print UTF-8 encoded file
We need to store the "unicode filename" information in the perl scalar itself.

Another idea: can we not create a new kind of magic, like for vstrings, which would contain both the octet and the character representation? This could fix the problem of accessing an arbitrary file: you just compose the name.

Is it usable? Or are there also problems?
From zefram@fysh.org:
pali@cpan.org wrote:
No, it *is* possible, and easy. What's not possible is to do that and
It can't print them *literally*, and it handles that issue quite well.
No. This would be the octet-vs-character distinction all over again; -zefram
From @xenu:
On Sat, 4 Mar 2017 05:21:37 +0000
Is it? Remember that we're also talking about Windows.
From zefram@fysh.org:
Tomasz Konojacki wrote:
See upthread. The easy way to do it is different on Windows from how it is elsewhere. An issue that we haven't yet considered is passing filenames as arguments in argv[]. In any system with a fixed open(), this probably ought to continue to work. What about on Windows? What form does argv[] take, in its most native representation? -zefram
From @pali:
On Saturday 04 March 2017 06:22:18, you wrote:
So it is not possible (at least not easily). See my first post. As I wrote, I want to create a file named "ÿ" which is stored in a perl Unicode string.
Why?
But this is your argument. On Linux it is needed to use octets as file
Any pointers?
Yes, but what is the problem? It would be a magic scalar which we all get/set. Also, functions like readdir() can correctly prepare such a scalar. So what is the problem with this idea?
From @pali:
On Saturday 04 March 2017 15:28:02, you wrote:
You suggest that on Linux we should use only binary octets for file names. So if a user wants to create a file named "ÿ", they would need to do `use utf8;` plus utf8::encode (resp. replace utf8::encode with another function which converts a perl string to the locale encoding). So this approach of yours is not useful: a script for creating a file named "ÿ" would not be platform-independent. To solve this problem, you need to be able to pass a Unicode string as the file name.
Like other WinAPI functions, for argv you also have -A and -W variants here.
From zefram@fysh.org:
pali@cpan.org wrote:
You can't do this, because, at the level you're specifying it, this isn't well defined. The code that you then give is a bit more specific. You *could* make the top-level program cleaner by hiding the platform differences behind a helper:

```
open my $file, ">", pali_filename_encode("\xff") or die;
```

The filename encoder translates an arbitrary Unicode string into the octets the platform expects. -zefram
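A minimal sketch of such an encoder, written here in Python with an assumed UTF-8 filesystem encoding (the name `filename_encode` is hypothetical, mirroring the `pali_filename_encode` idea above, not an existing API):

```python
def filename_encode(name: str, encoding: str = "utf-8") -> bytes:
    """Translate an arbitrary Unicode string into the octets the
    platform filesystem expects, failing loudly (rather than silently
    mangling the name) on input the encoding cannot represent."""
    return name.encode(encoding, errors="strict")

# "\xff" (U+00FF) deterministically becomes the UTF-8 octets 0xC3 0xBF,
# regardless of how the string was spelled in the source program.
assert filename_encode("\xff") == b"\xc3\xbf"
```

The design choice here is "strict": an unrepresentable name raises an error at the call site instead of quietly creating a differently-named file.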
From @pali:
On Sunday 05 March 2017 11:44:40, you wrote:
Exactly! This is what high-level programs want to achieve. The encoder does not have to always fail on non-encodable input; it can e.g. substitute a replacement. But before we can start implementing such a thing (e.g. in File::Spec)
From @pali:
On Tuesday 28 February 2017 00:35:45, Leon Timmermans wrote:
For completeness:

Windows uses UCS-2 for file names, also in the corresponding WinAPI -W functions. OS X uses a non-standard variant of Unicode, NFD encoded in UTF-8. Linux uses just binary octets.

Idea how to handle file names in Perl: store file names in extended Perl Unicode, with code points above the standard Unicode range reserved for octets that cannot be decoded.

On Linux, take the file name (which is a char*) and start decoding it from UTF-8, mapping undecodable bytes into that reserved range. On OS X, take the file name (which is a char*, but in UTF-8) and just decode it. On Windows, take the file name (a wchar_t*, which is uint16_t*) compatible with the -W functions.
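This reserved-code-points scheme is essentially what Python adopted in PEP 383, except that Python uses the lone surrogates U+DC80..U+DCFF rather than points above the Unicode range; a sketch of the round trip:

```python
raw = b"ab\xff"   # an arbitrary Linux filename; not valid UTF-8

# Decode: the undecodable byte 0xFF is smuggled in as U+DCFF.
name = raw.decode("utf-8", errors="surrogateescape")
assert name == "ab\udcff"

# Encode: the reserved code point turns back into the original byte,
# so every octet filename round-trips losslessly through a string.
assert name.encode("utf-8", errors="surrogateescape") == raw
```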
From @dur-randir:
On Mon, 20 Aug 2018 01:48:07 -0700, pali@cpan.org wrote:
And then someone passes this string to an API call that expects well-formed UTF-8, and everything crashes. Perl core has recently taken a lot of steps to allow only well-formed UTF-8 strings to reach the user, and now you suggest taking a step back; I don't think that's a good idea. It could work if you could separate such strings into their own namespace, but that would require an API change for all filesystem-related functions.
From @pali:
On Tuesday 21 August 2018 02:02:18, Sergey Aleynikov via RT wrote:
Yesterday on IRC I presented the following idea, which could solve the above problem. Introduce a new qf operator which takes a Unicode string and returns a perl file name object. Functions like readdir() would also return these file name objects. Those file name objects could have a proper stringification operator too. This would allow all the fs functions (like open()) to work like before, so there would be no compatibility break.
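A sketch of what such file name objects could look like, modeled here in Python (the class name and methods are hypothetical, not the proposed Perl API): the object carries the native octets for the OS calls while stringifying to a readable Unicode form.

```python
import os

class FileName:
    """Hypothetical filename object: wraps the native octets, and
    stringifies to a best-effort Unicode form for display."""
    def __init__(self, raw: bytes):
        self.raw = raw
    def __str__(self):
        return self.raw.decode("utf-8", errors="replace")
    def __fspath__(self):
        # What open()/stat() actually receive: the exact octets.
        return self.raw

f = FileName(b"caf\xe9")            # Latin-1 octets, not valid UTF-8
assert str(f) == "caf\ufffd"        # display form substitutes U+FFFD
assert os.fspath(f) == b"caf\xe9"   # OS calls still see the real name
```

The key property is that stringification can be lossy without ever losing the ability to reopen the exact file.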
From @dur-randir:
On Tue, 21 Aug 2018 02:11:41 -0700, pali@cpan.org wrote:
Yeah, that's a path that changes the API.
Did anything happen about this? There is a general issue with Unicode from the environment/command line on Windows which is really causing issues with a large app I have, as its usage has exploded and we have to support more and more languages. The issue seems to be discussed, with a suggested fix, here: https://www.nu42.com/2017/02/perl-unicode-windows-trilogy-one.html I did test this fix and it does solve the issues I have, but I'm not in a position to say whether it's enough in general. Also, since I use
It's complex, if we want to:
it's not going to be a simple change.
The additional difficulty is that we would probably need a solution that makes sense on both Unix and on Windows, despite the two having wildly different handling of character encodings in the file-system (and other system APIs).
Given how long these issues have been extant, can I assume that they won't ever be fixed? It's just somewhat sad that this is going to provide more fuel to the general ascendancy of Python which, as far as I know, doesn't have these issues...
Unicode on Windows almost certainly won't be fixed in 5.34, but no one said it won't happen in a later release. We have a pretty good understanding of those issues, they were discussed many times on various channels, and there's definitely the will to fix them. It's an extremely complicated issue and we have yet to decide how exactly it should be fixed, but I'm sure we will get there eventually.
BTW, there's a workaround for those issues. If you're using Windows 10 1803 or newer, enabling "Use Unicode UTF-8 for worldwide language support" in the region settings helps. Keep in mind that this switch is global and it may break some legacy applications.
It won't fix upgraded vs downgraded SVs referring to different filenames.
Sure, but that issue exists on the other platforms (like Linux) too.
Very useful to know about that beta Windows option, and that MS is finally joining everyone else on UTF-8. I tried this and it indeed worked nicely, and presumably it reduces the messing about needed when this is eventually solved properly.
Shouldn't we document this somehow?
I wrote a module that I think fixes at least the "upgraded vs downgraded SVs referring to different filenames" problem:
There is an "easy" work-around for handling filenames that are not valid UTF-8 or UTF-16 in OSes where those encodings are the default. Perl's utf8 is able to encode characters in the range 0-0x7FFFFFFFFFFFFFFF, but currently Unicode defines fewer than 300,000 symbols. That means most of that space is unused and is going to remain unused for the foreseeable future.

We can create an encoding (butf8 - bijective utf8) that uses some of those unused code points (for instance, the last 128) to represent invalid UTF-8 sequences. For example, in an OS where

The user may prepend the string

Characters with code points in the reserved range appearing in file names should also be handled as a special case. For instance, if the filename contains the sequence of bytes

In Windows, invalid UTF-16 sequences can be represented as pairs of bytes as in the previous case, or just a 16-bit space can be reserved.
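A toy sketch of the proposed bijection. Since Python strings cannot hold code points above U+10FFFF, a Private Use Area block stands in here for the far-above-Unicode reserved range; all names are hypothetical, and a real implementation would also have to escape literal reserved-range characters to stay strictly bijective:

```python
RESERVED = 0xF700  # stand-in for the proposal's reserved code points

def butf8_decode(raw: bytes) -> str:
    """Decode UTF-8, escaping each invalid byte into the reserved range."""
    out, i = [], 0
    while i < len(raw):
        for n in (1, 2, 3, 4):          # try the shortest valid sequence first
            try:
                out.append(raw[i:i + n].decode("utf-8"))
                i += n
                break
            except UnicodeDecodeError:
                continue
        else:                            # no valid sequence starts here
            out.append(chr(RESERVED + raw[i]))
            i += 1
    return "".join(out)

def butf8_encode(name: str) -> bytes:
    """Inverse: reserved-range characters become their original byte."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if RESERVED <= cp <= RESERVED + 0xFF:
            out.append(cp - RESERVED)
        else:
            out += ch.encode("utf-8")
    return bytes(out)

raw = b"mix\xff\xfe.txt"                         # contains invalid UTF-8 bytes
assert butf8_encode(butf8_decode(raw)) == raw    # lossless round trip
```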
@salva: This is something which I have already suggested in this discussion. See my comment #15883 (comment) and the following discussion (as it has some issues).
@pali, I had missed your comment. I don't see any issue with that approach. Code points up to 0x7FFFFFFFFFFFFFFF can already be generated in Perl, and so they can already be passed to external functions. If that is not desirable, then what's required is a

That's also just a hypothetical issue. In practice, most libraries know that bad data exists and handle it in some way (for instance, ignoring it, or signalling an error).

In the end, what I see is that it is 2021 and Perl still doesn't know how to access my file system correctly. This is a very critical issue for people outside the ASCII bubble!!! Just blocking any proposed solution because of minor issues is the wrong thing to do.
Just for reference:
and...
So, it seems all of those languages except perl do the right thing :-(
@salortiz What do you think of the notion of using extra flags in the SV to indicate that the string is text? Then Perl could implement the semantics you envision:
This would even facilitate working Windows filesystem operations. :)
@FGasper: This is also something which I proposed in this ticket. See my comment #15883 (comment) about
@pali Sort of… my proposal is to use extra flags on the SV to solve the more general problem of differentiating text strings from byte strings.
IMO that is the wrong approach. You are just pushing onto the developer the responsibility to encode/decode the data before/after calling any filesystem-related builtin. Using the right encoding at the file-system level is not something optional that you do when you know you have some non-ASCII data. On the contrary, it must be the default, and every piece of code must take that into account, always; the sensible way to make that happen is to teach Perl how to do it transparently, using sane defaults. In practice that means doing what every other language is already doing:
And then decide what is more important to you: absolute backward compatibility, so that the feature is only available when explicitly activated, or enabling it by default. Finally, add the required machinery to let the user/developer disable it when, for whatever reason, they want to do otherwise.
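The decode-at-the-boundary default described above, as it looks in Python (shown only as an existence proof of the approach, not as a design Perl must copy): OS-facing data arrives already decoded to text, with an explicit escape hatch for raw octets.

```python
import os
import sys

# argv and the environment arrive as text, decoded at the boundary.
assert isinstance(sys.argv[0], str)
assert all(isinstance(k, str) for k in os.environ)

# Directory listings are text too...
assert all(isinstance(name, str) for name in os.listdir("."))

# ...unless the caller explicitly asks for the raw octets by passing
# a bytes path, which flips the whole call into octet mode.
assert all(isinstance(name, bytes) for name in os.listdir(b"."))
```

Ordinary code never sees raw octets, yet nothing on disk becomes unreachable.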
I certainly agree that an encode-for-the-system-by-default workflow makes the most sense. As long as it also preserves an easy way to express any arbitrary filename that the system supports, it sounds good to me.
Migrated from rt.perl.org#130831 (status was 'open')
Searchable as RT130831$