Report information
Id: 130831
Status: open
Priority: 0
Queue: perl5

Owner: Nobody
Requestors: pali [at] cpan.org
Cc:
AdminCc:

Operating System: (no value)
PatchStatus: (no value)
Severity: low
Type: unknown
Perl Version: (no value)
Fixed In: (no value)



To: perlbug [...] perl.org
From: pali [...] cpan.org
Subject: Perl's open() has broken Unicode file name support
Date: Tue, 21 Feb 2017 21:14:57 +0100
Function open() has broken processing of non-ASCII file names. Look at these two examples:

$ perl -e 'open my $file, ">", "\N{U+FF}"'

$ perl -e 'open my $file, ">", "\xFF"'

The first one creates a file named 0xC3 0xBF (ÿ), the second one a file named 0xFF. Because the two strings "\N{U+FF}" and "\xFF" compare equal, they must create the same file, not two different ones:

$ perl -e '"\xFF" eq "\N{U+FF}" && print "equal\n"'
equal

The bug is in the open() implementation in PP(pp_open) in file pp_sys.c. The file name is read from the perl scalar into a C char* as:

tmps = SvPV_const(sv, len);

But SvUTF8(sv) is *not* consulted afterwards to check whether char* tmps is encoded in UTF-8 or Latin-1; tmps is passed directly to the do_open6() function without the SvUTF8 information.

So to fix this bug we need to define how open() should process the filename: either as binary octets, in which case SvPVbyte() should be used instead of SvPV(), or as a Unicode string, in which case SvPVutf8() should be used instead of SvPV().

That also means defining what Perl_do_open6() should expect. Its file name argument is of type const char *oname; it should be either binary octets or UTF-8.

There are basically two problems with this:

1) On some systems (e.g. on Linux) a file name can be an arbitrary sequence of bytes. It does not have to be a valid UTF-8 representation.

2) Perl modules probably already use perl Unicode scalars as file name arguments.

The decision should still allow opening any file on a VFS from 1), and probably should not break 2). I am not sure it is possible to have both 1) and 2) together. The current state is worse, as both 1) and 2) are broken.
RT-Send-CC: perl5-porters [...] perl.org
On Tue, 21 Feb 2017 20:58:03 GMT, pali@cpan.org wrote:
> Function open() has broken processing of non-ASCII file names. [...]
>
> So to fix this bug we need to define how open() should process the
> filename. [...]
>
> The current state is worse, as both 1) and 2) are broken.
ISTR seeing a fair amount of discussion of this issue on #p5p. Would anyone care to summarize this discussion? Thank you very much.

--
James E Keenan (jkeenan@cpan.org)
Date: Mon, 27 Feb 2017 22:21:32 +0100
To: perlbug-followup [...] perl.org
From: pali [...] cpan.org
Subject: Re: [perl #130831] Perl's open() has broken Unicode file name support
Some more information:

Windows has two sets of functions for accessing files. The first, with the -A suffix, takes file names in the encoding of the current 8-bit codepage. The second, with the -W suffix, takes file names in Unicode (more precisely, in the Windows variant of UTF-16). With the -A functions it is possible to access only those files whose names contain only characters available in the current 8-bit codepage. Internally, all file names are stored in Unicode, so the -W functions must be used to have access to any file name. Therefore on Windows we need a Unicode file name in perl's open() function to have access to any file stored on disk.

Linux stores file names as binary octets; there is no encoding or requirement for Unicode. Therefore, to access any file on Linux, Perl's open() function should take a downgraded/non-Unicode file name.

Which means there is no way to have uniform multiplatform support for file access without hacks.

I'm thinking that for Linux we could specify some (hint) variable which would contain an encoding name (it could be hidden in some pragma module...). Perl's open() function could then take a Unicode file name and convert it to that encoding. The default value for that variable could be taken from the locale, or default to UTF-8 (which is probably the most used and sane default value).

This would allow us to have a uniform open() function which takes a Unicode file name on (probably) any platform. I think this is the only sane approach if Perl wants to support Unicode file names.

But the problem is how Perl's open() function is currently implemented. Does it expect bytes or a Unicode string?
Date: Mon, 27 Feb 2017 23:26:57 +0000
Subject: Re: [perl #130831] Perl's open() has broken Unicode file name support
From: Zefram <zefram [...] fysh.org>
To: perl5-porters [...] perl.org
pali@cpan.org wrote:
>Which means there is no way to have uniform multiplatform
>support for file access without hacks.
Depends what you're trying to do "uniformly". If you want to be able to open any file, then each platform has an obvious way of representing any filename as a Perl string (as a full Unicode string on Windows and as an octet string on Unix), so using Perl strings for filenames could be a uniform interface. The format of filename strings does vary between platforms, but we already have such variation in the directory separators, and we have File::Spec to provide a uniform interface to it.

The thing that can't be done uniformly is to generate a filename from an arbitrary Unicode string in accordance with the platform's conventions. We could of course add a File::Spec method that attempts to do this, but there's a fundamental problem that Unix doesn't actually have a consistent convention for it.

But this isn't really a big problem. We don't need to use arbitrary Unicode strings, that weren't intended to be filenames, as filenames. It's something to avoid: a lot of security problems have arisen from programs that tried to use arbitrary data strings in this way. The strings that we should be using as filenames are strings that are explicitly specified by the user as filenames. The user, at runtime, can be expected to be aware of platform conventions and to supply locally-appropriate filenames.
>I'm thinking that for Linux we could specify some (hint) variable
>which would contain an encoding name (it could be hidden in some
>pragma module...).
Ugh. If the `hint' is lexically scoped, this loses as soon as a filename crosses a module boundary. If global, that would be saner; it's effectively part of the interface to the OS. But you then have a backcompat issue that you have to handle encoding failures in code paths that currently never generate exceptions. There's also a terrible problem with OS interfaces that return filenames (readdir(3), readlink(2), et al): you have to *decode* the filename, and if it doesn't decode then you've lost the ability to work with arbitrary existing files.
> As the default value for that variable (for encoding) we can use the
> locale, or default to UTF-8 (which is probably the most used and sane
> default value).
These are both crap as defaults. The locale's nominal encoding is quite likely to be ASCII, and both ASCII and UTF-8 are incapable of generating certain octet strings as output. Thus if filenames are subjected to either of these encodings then it is impossible for the user to specify some filenames that are valid at the syscall interface, and if such a filename actually exists then you run into the above-mentioned decoding problem. For example, the one-octet string "\xc0" doesn't decode as either ASCII or UTF-8. The only sane default, if you want to offer this encoding system, is Latin-1, which behaves as a null encoding on Perl octet strings.

The trouble here really arises because the scheme effectively uses the encoding in reverse. Normally we use a character encoding to encode a character string as an octet string, so that we can store those octets and later read them to recover the original character string. With Unix filenames, however, the thing that we want to represent and store, which is the filename as it appears at the OS interface, is an octet string. The encoding layer, if there is one, is concerned with representing that octet string as a character string. An encoding that can't handle all octet strings is a problem, just as in normal circumstances a character encoding that can't handle all character strings is a problem. Most character encodings are just not designed to be used in reverse, and don't have a design goal of encoding to all octet strings or of decode-then-encode round-tripping.
>But the problem is how Perl's open() function is currently
>implemented. Does it expect bytes or a Unicode string?
The current behaviour is broken on any platform. To get to anything sane we will need a change that breaks some backcompat. In that situation we are not constrained by the present arrangement of the open() internals.

-zefram
Date: Tue, 28 Feb 2017 00:35:45 +0100
CC: perlbug <perlbug-followup [...] perl.org>
To: pali [...] cpan.org
From: Leon Timmermans <fawaka [...] gmail.com>
Subject: Re: [perl #130831] Perl's open() has broken Unicode file name support
On Mon, Feb 27, 2017 at 10:21 PM, <pali@cpan.org> wrote:
> Windows has two sets of functions for accessing files. [...] So the -W
> functions must be used to have access to any file name. And therefore
> on Windows we need a Unicode file name in perl's open() function to
> have access to any file stored on disk.
>
> Linux stores file names as binary octets; there is no encoding or
> requirement for Unicode. Therefore, to access any file on Linux,
> Perl's open() function should take a downgraded/non-Unicode file name.
>
> Which means there is no way to have uniform multiplatform support
> for file access without hacks.
Correct observations. Except OS X makes this more complicated still: it uses UTF-8 encoded bytes, normalized using a non-standard variation of NFD.
> I'm thinking that for Linux we could specify some (hint) variable
> which would contain an encoding name (it could be hidden in some
> pragma module...). [...]
>
> This would allow us to have a uniform open() function which takes a
> Unicode file name on (probably) any platform. I think this is the only
> sane approach if Perl wants to support Unicode file names.
I would welcome a 'unicode_filenames' feature. I don't think any value other than binary is sane on Linux though. I think we learned that lesson from perl 5.8.0.
> But the problem is how Perl's open() function is currently
> implemented. Does it expect bytes or a Unicode string?
Both. Neither. Welcome to The Unicode Bug.

Leon
RT-Send-CC: perl5-porters [...] perl.org
On Tue, 21 Feb 2017 12:58:03 -0800, pali@cpan.org wrote:
> So to fix this bug we need to define how open() should process the
> filename: either as binary octets, in which case SvPVbyte() should be
> used instead of SvPV(), or as a Unicode string, in which case
> SvPVutf8() should be used instead of SvPV().
>
> That also means defining what Perl_do_open6() should expect. Its file
> name argument is of type const char *oname; it should be either binary
> octets or UTF-8.
This sounds like something that could be prototyped on CPAN by replacing CORE::GLOBAL::open, CORE::GLOBAL::readdir etc.

Tony
Date: Wed, 1 Mar 2017 15:42:32 +0100
Subject: Re: [perl #130831] Perl's open() has broken Unicode file name support
From: pali [...] cpan.org
To: perlbug-followup [...] perl.org
On Monday 27 February 2017 15:27:32 Zefram via RT wrote:
> > I'm thinking that for Linux we could specify some (hint) variable
> > which would contain an encoding name (it could be hidden in some
> > pragma module...).
>
> Ugh. If the `hint' is lexically scoped, this loses as soon as a
> filename crosses a module boundary. If global, that would be saner;
Yes, global. Ideally something which can be set when starting perl (e.g. a perl parameter) or via an env variable.
> it's effectively part of the interface to the OS.
Yes. And for this reason, modules should normally not change the value of that variable.
> But you then have a backcompat issue that you have to handle encoding
> failures in code paths that currently never generate exceptions.
> There's also a terrible problem with OS interfaces that return
> filenames (readdir(3), readlink(2), et al): you have to *decode* the
> filename, and if it doesn't decode then you've lost the ability to
> work with arbitrary existing files.
We can use the Encode::encode() function in non-croak mode, which replaces invalid characters with some replacement and throws a warning about it. This could be the default behaviour, so all those OS-related functions do not die. Maybe there could be some switch (a feature?) which changes the mode of the encode function to die, and new code can then handle and deal with it.
> > As the default value for that variable (for encoding) we can use the
> > locale, or default to UTF-8 (which is probably the most used and
> > sane default value).
>
> These are both crap as defaults. The locale's nominal encoding is
> quite likely to be ASCII, and both ASCII and UTF-8 are incapable of
> generating certain octet strings as output.
It is not crap as a default. The locale encoding is what is currently used for such actions: it is used for converting multibyte characters into octets and vice versa in other applications. So if your locale encoding is set to ASCII, then many applications are unable to print non-ASCII characters on your terminal. But as there are too many mappings from the Unicode space to bytes, and several of them are in some cases "correct" and in use, there is no single one which should be used; whichever one you choose, you still get problems. Therefore the locale encoding is what we can use, as it is the only piece of information we get from the operating system here.
> Thus if filenames are subjected to either of these encodings then it
> is impossible for the user to specify some filenames that are valid at
> the syscall interface, and if such a filename actually exists then you
> run into the above-mentioned decoding problem. For example, the
> one-octet string "\xc0" doesn't decode as either ASCII or UTF-8. The
> only sane default, if you want to offer this encoding system, is
> Latin-1, which behaves as a null encoding on Perl octet strings.
Latin-1 is not sane, as it is unable to handle Unicode strings with characters above U+00FF. It is as wrong as ASCII or UTF-8.
> The trouble here really arises because the scheme effectively uses the
> encoding in reverse. Normally we use a character encoding to encode a
> character string as an octet string, so that we can store those octets
> and later read them to recover the original character string. With
> Unix filenames, however, the thing that we want to represent and
> store, which is the filename as it appears at the OS interface, is an
> octet string. The encoding layer, if there is one, is concerned with
> representing that octet string as a character string. An encoding that
> can't handle all octet strings is a problem, just as in normal
> circumstances a character encoding that can't handle all character
> strings is a problem. Most character encodings are just not designed
> to be used in reverse, and don't have a design goal of encoding to all
> octet strings or of decode-then-encode round-tripping.
If we want to handle any Unicode string created in perl and passed to Perl's open() function, we need to use some Unicode transformation function. If we want to open an arbitrary file stored on disk (in bytes), then we need an encoding which maps the whole space of byte sequences to some Unicode strings. Both cannot be achieved at once, and even if such a function existed it would still not be useful, as the file names on disk are already stored in some encoding. The kernel just does not care about it and does not even know that encoding. So the user or application (or library or system) must know in which encoding the file names are stored. And this should be present in the current locale.

Therefore I suggest using the default encoding from the locale, with the ability to change it. If a user has stored files in an encoding different from the one specified in the locale, then that user already has a problem handling such files in applications which use wchar_t, and probably already knows how to deal with it: either temporarily change the locale encoding, or pass some argument to perl (or an env variable or perl variable) to specify the correct one.
> > But the problem is how Perl's open() function is currently
> > implemented. Does it expect bytes or a Unicode string?
>
> The current behaviour is broken on any platform. To get to anything
> sane we will need a change that breaks some backcompat. In that
> situation we are not constrained by the present arrangement of the
> open() internals.
We can define a new use feature 'unicode_filenames', or something like that, and then Perl's open() function can be "fixed".
To: Leon Timmermans <fawaka [...] gmail.com>
Subject: Re: [perl #130831] Perl's open() has broken Unicode file name support
From: pali [...] cpan.org
CC: perlbug <perlbug-followup [...] perl.org>
Date: Wed, 1 Mar 2017 15:49:58 +0100
On Tuesday 28 February 2017 00:35:45 Leon Timmermans wrote:
> On Mon, Feb 27, 2017 at 10:21 PM, <pali@cpan.org> wrote:
> > Windows has two sets of functions for accessing files. [...] So the
> > -W functions must be used to have access to any file name. And
> > therefore on Windows we need a Unicode file name in perl's open()
> > function to have access to any file stored on disk.
> >
> > Linux stores file names as binary octets; there is no encoding or
> > requirement for Unicode. [...]
> >
> > Which means there is no way to have uniform multiplatform support
> > for file access without hacks.
> Correct observations. Except OS X makes this more complicated still:
> it uses UTF-8 encoded bytes, normalized using a non-standard variation
> of NFD.
It is not a problem or a complicated issue. It just means that OS X also uses a Unicode API, same as Windows; it just uses a different representation of Unicode, say the OS X variant of UTF-8. We have no problem here generating the OS X representation from a perl string and vice versa. It just needs platform-specific code, same as Windows for its variant of UTF-16.
> > I'm thinking that for Linux we could specify some (hint) variable
> > which would contain an encoding name (it could be hidden in some
> > pragma module...). [...]
> >
> > This would allow us to have a uniform open() function which takes a
> > Unicode file name on (probably) any platform. I think this is the
> > only sane approach if Perl wants to support Unicode file names.
>
> I would welcome a 'unicode_filenames' feature. I don't think any value
> other than binary is sane on Linux though. I think we learned that
> lesson from perl 5.8.0.
> > But the problem is how Perl's open() function is currently
> > implemented. Does it expect bytes or a Unicode string?
>
> Both. Neither. Welcome to The Unicode Bug.
So it is time for a 'unicode_filenames' feature, to fix that bug.
Date: Thu, 2 Mar 2017 03:22:49 +0000
To: perl5-porters [...] perl.org
From: Zefram <zefram [...] fysh.org>
Subject: Re: [perl #130831] Perl's open() has broken Unicode file name support
pali@cpan.org wrote:
>We can use the Encode::encode() function in non-croak mode, which
>replaces invalid characters with some replacement

No, that fucks up the filenames. After such a substituting decode, re-encoding the result will produce some octet string different from the original. So if you read a filename from a directory, attempting to use that filename to address the file will at best fail because it's a non-existent name. (If you're unlucky then it'll address a *different* file.)
>So if your locale encoding is set to ASCII then many applications are
>unable to print non-ASCII characters on your terminal.

I don't follow your argument here. You don't seem to be addressing the crapness of making it impossible to deal with arbitrary filenames at the syscall interface.
>Latin-1 is not sane, as it is unable to handle Unicode strings with
>characters above U+00FF. It is as wrong as ASCII or UTF-8.

My objective isn't to make every Unicode string represent a filename. My objective is to have every filename represented by some Perl string. Latin-1 would be a poor choice in situations where it is desired to represent arbitrary Unicode strings, but it's an excellent choice for the job of representing filenames. Different jobs have different requirements, leading to different design choices.
>So the user or application (or library or system) must know in which
>encoding the file names are stored. And this should be present in the
>current locale.

Impossible. The locale model of character encoding (as you treat it here) is fundamentally broken. The model is that every string in the universe (every file content, filename, command line argument, etc.) is encoded in the same way, and the locale environment variable tells you which universe you're in. But in the real universe, files, filenames, and so on turn up encoded how their authors liked to encode them, and that's not always the same. In the real universe we have to cope with data that is not encoded in our preferred way.

The locale encoding is OK if one treats it strictly as a user *preference*. What one can do with such a preference without risking running into uncooperative data is quite limited.
> So if a user has stored files in an encoding different from the one
> specified in the locale, then that user already has a problem handling
> such files

I run in the C locale, which on this system has nominally ASCII encoding (which is in fact my preferred encoding), and yet I occasionally run into filenames that are derived from UTF-8 or Latin-1 encoding. Do you realise how much difficulty I have in dealing with such files? None at all. For my shell is 8-bit clean, and every program I use just passes the octet string straight through (e.g., from argv to syscalls). This is a healthy system.

The only programs I've encountered that have any difficulty with non-ASCII filenames are two language implementations (Rakudo Perl 6 and GNU Guile 2.0) that I don't use for real work. Both of them have decided, independently, that filenames must be encodings of arbitrary Unicode strings. Interestingly, they've reached different conclusions about what encoding is used: Guile considers it to be the locale's nominal encoding, whereas Rakudo reckons it's UTF-8 regardless of locale. (Rakudo is making an attempt to augment its concept of Unicode strings to be able to represent arbitrary Unicode strings in a way compatible with UTF-8, but that's not fully working yet, and I'm not convinced that it can ever work satisfactorily.) Don't make the same mistake as these projects.
>We can define a new use feature 'unicode_filenames' or something like
>that and then Perl's open() function can be "fixed".

That would be a lexically-scoped effect, which (as mentioned earlier) loses as soon as a filename crosses a module boundary.

-zefram
Subject: Re: [perl #130831] Perl's open() has broken Unicode file name support
From: Zefram <zefram [...] fysh.org>
To: perl5-porters [...] perl.org
Date: Thu, 2 Mar 2017 03:25:34 +0000
I wrote:
>(Rakudo is making an attempt to augment its concept of Unicode strings
>to be able to represent arbitrary Unicode strings in a way compatible
>with UTF-8,
Oops, I meant "arbitrary octet strings" there.

-zefram
Date: Sat, 4 Mar 2017 01:09:44 +0100
Subject: Re: [perl #130831] Perl's open() has broken Unicode file name support
From: pali [...] cpan.org
To: perlbug-followup [...] perl.org
On Thursday 02 March 2017 04:23:35 Zefram via RT wrote:
> pali@cpan.org wrote:
> > We can use the Encode::encode() function in non-croak mode, which
> > replaces invalid characters with some replacement
>
> No, that fucks up the filenames. After such a substituting decode,
> re-encoding the result will produce some octet string different from
> the original. So if you read a filename from a directory, attempting
> to use that filename to address the file will at best fail because
> it's a non-existent name. (If you're unlucky then it'll address a
> *different* file.)
> > So if your locale encoding is set to ASCII then many applications
> > are unable to print non-ASCII characters on your terminal.
>
> I don't follow your argument here. You don't seem to be addressing
> the crapness of making it impossible to deal with arbitrary
> filenames at the syscall interface.
Understood. As I wrote in my first email, we probably cannot have both the ability to access an arbitrary file and uniform access to files represented by perl Unicode strings.
> > Latin-1 is not sane, as it is unable to handle Unicode strings with
> > characters above U+00FF. It is as wrong as ASCII or UTF-8.
>
> My objective isn't to make every Unicode string represent a filename.
Basically, the output of ordinary applications, which is shown to users, consists of Unicode file names, not bytes. Similarly, what a user enters into a file-open dialog, or as a filename on console stdin, is a sequence of key presses representing characters (which map fully to Unicode), not a sequence of bytes. Also, I want to create a file named "ÿ" with perl in the same way on Windows and Linux. So to have a fixed open() we need to be able to represent every perl Unicode string as a file name (with the possibility of failure if the underlying system is not able to store a particular file name).
> My objective is to have every filename represented by some Perl
> string.

I understand... and in the current model with perl strings it is impossible.
> Latin-1 would be a poor choice in situations where it is desired to
> represent arbitrary Unicode strings,

Right!
> but it's an excellent choice for the job of representing filenames.
> Different jobs have different requirements, leading to different
> design choices.
>
> > So the user or application (or library or system) must know in which
> > encoding the file names are stored. And this should be present in
> > the current locale.
>
> Impossible. The locale model of character encoding (as you treat it
> here) is fundamentally broken.
Yes, it is broken. But the problem is that it is used by system applications... :-(
> The locale encoding is OK if one treats it strictly as a user
> *preference*. What one can do with such a preference without risking
> running into uncooperative data is quite limited.
>
> > So if a user has stored files in an encoding different from the one
> > specified in the locale, then that user already has a problem
> > handling such files
>
> I run in the C locale, which on this system has nominally ASCII
> encoding (which is in fact my preferred encoding), and yet I
> occasionally run into filenames that are derived from UTF-8 or
> Latin-1 encoding. Do you realise how much difficulty I have in
> dealing with such files? None at all. For my shell is 8-bit clean,
> and every program I use just passes the octet string straight
> through (e.g., from argv to syscalls). This is a healthy system.
Probably some programs like "ls" are not able to print UTF-8 encoded file names on your terminal...
> The only programs I've encountered that have any difficulty with
> non-ASCII filenames are two language implementations (Rakudo Perl 6
> and GNU Guile 2.0) that I don't use for real work. Both of them have
> decided, independently, that filenames must be encodings of arbitrary
> Unicode strings. Interestingly, they've reached different
> conclusions about what encoding is used: Guile considers it to be
> the locale's nominal encoding, whereas Rakudo reckons it's UTF-8
> regardless of locale. (Rakudo is making an attempt to augment its
> concept of Unicode strings to be able to represent arbitrary Unicode
> strings in a way compatible with UTF-8, but that's not fully working
> yet, and I'm not convinced that it can ever work satisfactorily.)
> Don't make the same mistake as these projects.
>
> > We can define a new use feature 'unicode_filenames' or something
> > like that and then Perl's open() function can be "fixed".
>
> That would be a lexically-scoped effect, which (as mentioned earlier)
> loses as soon as a filename crosses a module boundary.
We need to store the "unicode filename" information in the perl scalar itself, and make sure it won't be lost when doing assignment or other string operations...

Another idea: could we not create a new magic, like the one for vstrings, which would contain additional information for a file name? Functions like readdir could create such magic scalars, and when one is passed to open() it would be handled correctly. And like a vstring it could contain some string representation in the PV slot, so it would be possible to pass such a scalar to print/warn or to any XS function not capable of handling the new magic. The magic property could store platform/system dependent settings, like which encoding is used.

This could fix the problem of accessing an arbitrary file: you just compose a magic scalar (maybe via some function or pragma) in the system-dependent representation and pass it to open(). And it would also fix the problem of passing any Unicode file name: you compose a normal perl Unicode string and, based on some setting, it would be converted by open() to the system-dependent representation. open() would first try to use the magic properties, and if they are not present it would fall back to Encode on the content of the string. Maybe the use of Encode would need to be enabled globally (or locally).

Is that usable? Or are there also problems?
To: perl5-porters [...] perl.org
From: Zefram <zefram [...] fysh.org>
Subject: Re: [perl #130831] Perl's open() has broken Unicode file name support
Date: Sat, 4 Mar 2017 05:21:37 +0000
pali@cpan.org wrote:
>On Thursday 02 March 2017 04:23:35 Zefram via RT wrote:
>> My objective is to have every filename represented by some Perl string.
>I understand... and in current model with perl strings it is impossible.
No, it *is* possible, and easy. What's not possible is to do that and simultaneously achieve your other goal of having almost all Unicode strings represent some filename in a manner that's conventional for the platform. One of these goals is more important than the other.
>Probably some programs like "ls" is not able to print UTF-8 encoded file names into your terminal...
It can't print them *literally*, and it handles that issue quite well. GNU ls(1) pays attention to the locale encoding in a sensible manner, mainly looking at the character repertoire. In the ASCII locale, by default it displays a question mark in place of high-half octets, which clues me in that there's a problematic octet. With the -b option it represents them as backslash escapes, which if need be I can copy into a shell $'' construct. Actually tab completion is almost always the solution to entering the filename at the shell, and the completion that it generates uses $''. This is a healthy system: I have no difficulty in examining and using awkward filenames through my preferred medium of ASCII.
>Cannot we create new magic like for vstring which would contains additional informations for file name?
No. This would be octet-vs-character distinction all over again; see several previous discussions on p5p. vstrings kinda work, though not fully, because we hardly ever perform string operations on version numbers with an expectation of producing a version number as output. But we manipulate filenames by string means all the time.

-zefram
From: Tomasz Konojacki <me [...] xenu.pl>
Subject: Re: [perl #130831] Perl's open() has broken Unicode file name support
To: perl5-porters [...] perl.org
Date: Sat, 04 Mar 2017 15:06:24 +0100
On Sat, 4 Mar 2017 05:21:37 +0000 Zefram <zefram@fysh.org> wrote:
> pali@cpan.org wrote:
> >On Thursday 02 March 2017 04:23:35 Zefram via RT wrote:
> >> My objective is to have every filename represented by some Perl string.
> >I understand... and in current model with perl strings it is impossible.
> No, it *is* possible, and easy.
Is it? Remember that we're also talking about Windows.
Date: Sat, 4 Mar 2017 14:27:37 +0000
To: perl5-porters [...] perl.org
From: Zefram <zefram [...] fysh.org>
Subject: Re: [perl #130831] Perl's open() has broken Unicode file name support
Tomasz Konojacki wrote:
>Is it? Remember that we're also talking about Windows.
See upthread. The easy way to do it is different on Windows from how it is on Unix, but in both cases there's an obvious and simple way to represent all native filenames as Perl strings. The parts that would be platform-dependent are reasonably well localised within the core; programs written in Perl wouldn't need to be aware of the difference.

An issue that we haven't yet considered is passing filenames as command-line arguments. Before Unicode, we could expect something like open(H, "<", $ARGV[0]) to work. (Well, pre-SvUTF8 Perl didn't have three-arg open, but apart from the syntax that would work.) Currently $ENV{PERL_UNICODE} means that a program can't fully predict how argv[] will be mapped into @ARGV, but as it happens the Unicode bug in open() papers over that, so feeding an @ARGV element directly into open() like this will still work. (You lose if you perform any string operation on the way, though.)

In any system with a fixed open(), this probably ought to continue to work: a filename supplied as a command-line argument, in the platform's conventional manner, should yield an @ARGV element which, if fed to open() et al, functions as that filename. Unlike the question of encoding character strings as filenames, Unix does have well-defined conventions for this, with argv elements and filenames in the syscall API both being nul-terminated octet strings, and an identity mapping expected between them.

What about on Windows? What form does argv[] take, in its most native version? How does one conventionally encode a Unicode filename as a command-line argument?

-zefram
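The Unix round-trip described here can be sketched in a few lines (a demonstration assuming a Unix-like system; the temp directory is only scaffolding, and a local variable stands in for an @ARGV element, which on Unix carries the same octets):

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Create a file whose name contains the raw octet 0xFF (not valid UTF-8),
# then reopen it using the same octet string, exactly as an argv element
# would arrive: argv and the filesystem API both carry plain octets.
my $dir  = tempdir(CLEANUP => 1);
my $name = "$dir/\xFF";

open my $out, '>', $name or die "create: $!";
close $out;

my $argv_style = $name;    # stands in for $ARGV[0]: the same octets
open my $in, '<', $argv_style or die "reopen: $!";
print "reopened\n";
```

Note that this works precisely because the octets pass through open() untouched while the scalar stays downgraded; a string operation that upgrades the scalar on the way would break it, as described above.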
Date: Sun, 5 Mar 2017 11:03:37 +0100
Subject: Re: [perl #130831] Perl's open() has broken Unicode file name support
From: pali [...] cpan.org
To: perlbug-followup [...] perl.org
On Saturday 04 March 2017 06:22:18 you wrote:
> pali@cpan.org wrote:
> >On Thursday 02 March 2017 04:23:35 Zefram via RT wrote:
> >> My objective is to have every filename represented by some Perl string.
> >I understand... and in current model with perl strings it is impossible.
> No, it *is* possible, and easy. What's not possible is to do that
> and simultaneously achieve your other goal of having almost all
> Unicode strings represent some filename in a manner that's
> conventional for the platform. One of these goals is more important
> than the other.
So it is not possible (at least not easily). See the first post which I wrote to this bug. For you it is just not important, but it is important for me and for other people too. And what I wrote in the first post is a bug which I would like to see fixed. As I wrote, I want to create a file named "ÿ" from a perl string, and I should be able to do it via a uniform perl function, without hacks like checking $^O.
> >Cannot we create new magic like for vstring which would contains
> >additional informations for file name?
> No.
Why?
> This would be octet-vs-character distinction all over again;
But this is your own argument. On Linux, octets need to be used as the file name in order to support any arbitrary file stored on disk.
> see several previous discussions on p5p.
Any pointers?
> vstrings kinda work, though not fully, because we hardly ever perform
> string operations on version numbers with an expectation of producing
> a version number as output. But we manipulate filenames by string
> means all the time.
Yes, but what is the problem? It would be a magic scalar, and all get/set operations on it could be implemented in a platform-dependent manner. Functions like readdir() could also prepare such a scalar correctly, so whether you modify it or pass it directly to open(), any file would open correctly. So what is the problem with this idea?
Date: Sun, 5 Mar 2017 11:13:22 +0100
Subject: Re: [perl #130831] Perl's open() has broken Unicode file name support
From: pali [...] cpan.org
To: perlbug-followup [...] perl.org
On Saturday 04 March 2017 15:28:02 you wrote:
> Tomasz Konojacki wrote:
> >Is it? Remember that we're also talking about Windows.
> See upthread. The easy way to do it is different on Windows from how
> it is on Unix, but in both cases there's an obvious and simple way to
> represent all native filenames as Perl strings.
You suggest that on Linux we should use only binary octets for file names. That will not work on Windows, where you need to pass Unicode strings as file names. So if a user wants to create a file named "ÿ", something like this would be needed:

use utf8;
my $filename = "ÿ";
utf8::encode($filename) if $^O ne "MSWin32";
open my $file, ">", $filename or die;

(or replace utf8::encode with another function which converts a perl Unicode string to byte octets). So this approach of yours is not useful: a script for creating a file named "ÿ" would need to deal with every platform and its platform-dependent behaviour. To solve this problem, you need to be able to pass a Unicode string as the file name to open().
> What about on Windows? What form does argv[] take, in its most
> native version? How does one conventionally encode a Unicode
> filename as a command-line argument?
Like other WinAPI functions, argv also has -A and -W variants. The -A variant is encoded in the current locale and the -W variant in a modified UTF-16. So if you want, you can take a Unicode string.
From: Zefram <zefram [...] fysh.org>
Subject: Re: [perl #130831] Perl's open() has broken Unicode file name support
To: perl5-porters [...] perl.org
Date: Sun, 5 Mar 2017 10:43:12 +0000
pali@cpan.org wrote:
>So if user want to create file named "ÿ",
You can't do this, because, at the level you're specifying it, this isn't a well-defined action on Unix. Some encoding needs to be used to turn the character into an octet string, and there isn't anything intrinsic to the platform that determines which encoding to use.

The code that you then give is a bit more specific. I think the effect you're trying to specify in the code is that you use the octet string "\xc3\xbf" on Unix and the character string "\x{ff}" on Windows. If this lower-level description is actually what you want to achieve, then you should expect to need platform-dependent code to do it, because this is by definition a platform-dependent effect.

You *could* make the top-level program cleaner by hiding the platform dependence, and on Unix the choice of encoding, in a module. Your program could then look like

open my $file, ">", pali_filename_encode("\xff") or die;

The filename encoder translates an arbitrary Unicode string into a filename in a manner that is conventional for the platform, and represents the filename as a Perl string in the manner required for open(). It could well become part of File::Spec. Note that the corresponding decoder must fail on some inputs.

-zefram
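A minimal sketch of the encoder described here: the name pali_filename_encode comes from the example above, but no such module exists, and the choice of UTF-8 as the Unix convention is an assumption.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical encoder: turn an arbitrary Unicode string into the form
# open() needs on the current platform. On Windows the -W API wants
# characters, so pass them through; on Unix we assume UTF-8 as the
# conventional encoding and produce octets.
sub pali_filename_encode {
    my ($chars) = @_;
    return $chars if $^O eq 'MSWin32';
    return Encode::encode('UTF-8', $chars);
}

# Usage, as in the example above:
#   open my $file, ">", pali_filename_encode("\xff") or die;
```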
Date: Sun, 5 Mar 2017 11:59:04 +0100
To: perlbug-followup [...] perl.org
Subject: Re: [perl #130831] Perl's open() has broken Unicode file name support
From: pali [...] cpan.org
On Sunday 05 March 2017 11:44:40 you wrote:
> pali@cpan.org wrote:
> >So if user want to create file named "ÿ",
> You can't do this, because, at the level you're specifying it, this
> isn't a well-defined action on Unix. Some encoding needs to be used
> to turn the character into an octet string, and there isn't anything
> intrinsic to the platform that determines which encoding to use.
>
> The code that you then give is a bit more specific. I think the
> effect you're trying to specify in the code is that you use the
> octet string "\xc3\xbf" on Unix and the character string "\x{ff}" on
> Windows. If this lower-level description is actually what you want
> to achieve, then you should expect to need platform-dependent code
> to do it, because this is by definition a platform-dependent effect.
>
> You *could* make the top-level program cleaner by hiding the platform
> dependence, and on Unix the choice of encoding, in a module. Your
> program could then look like
>
> open my $file, ">", pali_filename_encode("\xff") or die;
>
> The filename encoder translates an arbitrary Unicode string into
> a filename in a manner that is conventional for the platform, and
> represents the filename as a Perl string in the manner required
> for open(). It could well become part of File::Spec. Note that the
> corresponding decoder must fail on some inputs.
>
> -zefram
Exactly! This is what a high-level program wants to achieve. It really should not have to care about low-level OS differences. The decoder does not have to always fail on non-encodable input; it can e.g. directly use the Encode module and allow the caller to specify what to do with bad input: https://metacpan.org/pod/Encode#Handling-Malformed-Data But before we can start implementing such a thing (e.g. in the File::Spec module), we need to define the API for open() and resolve this bug ("\xFF" eq "\N{U+FF}") which I described in the first post. Because right now it is not specified whether open() takes a Unicode string or byte octets...
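The Encode behaviour referred to here: the CHECK argument lets the caller choose what happens with malformed input, from silent replacement to an exception. A small illustration:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

my $octets = "a\xFFb";    # \xFF is not valid UTF-8

# Default CHECK: the malformed byte is replaced with U+FFFD.
my $copy1   = $octets;
my $lenient = decode('UTF-8', $copy1);

# FB_CROAK: malformed input raises an exception instead.
my $copy2     = $octets;
my $strict_ok = eval { decode('UTF-8', $copy2, FB_CROAK); 1 };

printf "strict decode died: %s\n", $strict_ok ? "no" : "yes";
```

(Copies are passed to decode() because some CHECK modes modify the source scalar in place.)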
To: perlbug-followup [...] perl.org
Date: Mon, 20 Aug 2018 10:47:49 +0200
Subject: Re: [perl #130831] Perl's open() has broken Unicode file name support
From: pali [...] cpan.org
On Tuesday 28 February 2017 00:35:45 Leon Timmermans wrote:
> On Mon, Feb 27, 2017 at 10:21 PM, <pali@cpan.org> wrote:
> > Windows has two sets of functions for accessing files. First with -A
> > suffix which takes file names in encoding of current 8bit codepage.
> > Second with -W suffix which takes file names in Unicode (more precisely
> > in Windows variant of UTF-16). With -A functions it is possible to
> > access only those files which file names contains only characters
> > available in current 8bit codepage. Internally are all file names stored
> > in Unicode. So -W functions must be used to have access to any file
> > name. And therefore for Windows we need Unicode file name in perl open()
> > function to have access to any file stored on disk.
> >
> > Linux stores file names in binary octets, there is no encoding or
> > requirement for Unicode. Therefore to access any file on Linux, Perl's
> > open() function should takes downgraded/non-Unicode file name.
> >
> > Which means there is no way to have uniform and same multiplaform
> > support for file access without hacks.
> Correct observations. Except OS X makes this more complicated still:
> it uses UTF-8 encoded bytes, normalized using a non-standard variation
> of NFD.
For completeness: Windows uses UCS-2 for file names, and so do the corresponding WinAPI -W functions which operate on file names. It is not UTF-16, as file names may really contain unpaired surrogates. OS X uses a non-standard variant of Unicode NFD encoded in UTF-8. Linux uses just binary octets.

An idea for how to handle file names in Perl:

Store file names in extended Perl Unicode (with code points above U+1FFFFF). Non-extended code points would represent normal Unicode code points, and code points above U+1FFFFF would represent the parts of a file name which cannot be unambiguously represented in Unicode.

On Linux, take the file name (which is a char*) and start decoding it from UTF-8. Any sequence of bytes which cannot be decoded as UTF-8 would be decoded as a sequence of extended code points (e.g. U+200000 - U+2000FF). This operation has an inverse and can therefore convert any file name stored on a Linux system. Plus it is UTF-8 friendly: if filenames in the VFS are stored in UTF-8 (which is now common), then perl's say function can print them correctly.

On OS X, take the file name (which is a char*, but in UTF-8) and just decode it from UTF-8. For conversion from Perl Unicode to char*, apply the non-standard NFD normalization and encode to UTF-8.

On Windows, take the file name (a wchar_t*, which is uint16_t*) compatible with the -W WinAPI functions, which represents a UCS-2 sequence, and decode it to Unicode. There can be unpaired surrogates; represent them either as Unicode surrogate code points or using extended Perl code points (above U+1FFFFF). The reverse process (from Perl Unicode to wchar_t*/uint16_t*) is obvious.
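The proposed Linux mapping can be sketched today, since Perl strings do permit code points above U+10FFFF (the U+200000 offset is the proposal's choice, not an existing Perl convention; this is a sketch of the idea, not core behaviour):

```perl
use strict;
use warnings;
no warnings 'non_unicode';   # we deliberately create code points > U+10FFFF
use Encode ();

# Decode filename octets as UTF-8; any byte that is not part of a valid
# UTF-8 sequence becomes the extended code point 0x200000 + byte value,
# making the mapping reversible for every possible filename.
sub decode_filename {
    my ($octets) = @_;
    my $chars = '';
    while (length $octets) {
        # FB_QUIET decodes a valid prefix and leaves the rest in $octets.
        $chars .= Encode::decode('UTF-8', $octets, Encode::FB_QUIET);
        if (length $octets) {
            # First remaining byte is invalid: escape it.
            $chars .= chr(0x200000 + ord(substr($octets, 0, 1, '')));
        }
    }
    return $chars;
}

sub encode_filename {
    my ($chars) = @_;
    my $octets = '';
    for my $cp (map { ord } split //, $chars) {
        if ($cp >= 0x200000 && $cp <= 0x2000FF) {
            $octets .= chr($cp - 0x200000);          # restore the raw byte
        } else {
            $octets .= Encode::encode('UTF-8', chr($cp));
        }
    }
    return $octets;
}
```

A filename such as "abc\xFF\xC3\xBF" round-trips: "abc" and "\xC3\xBF" decode as UTF-8, while the lone "\xFF" becomes the extended code point 0x2000FF and is restored on encoding.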
RT-Send-CC: perl5-porters [...] perl.org
On Mon, 20 Aug 2018 01:48:07 -0700, pali@cpan.org wrote:
> Store file names in extended Perl's Unicode (with code points above
> U+1FFFFF). Non-extended code points would represent normal Unicode code
> points. And code points above U+1FFFFF would represent parts of file
> name which cannot be unambiguously represented in Unicode.
And then someone passes this string to an API call that expects well-formed UTF-8, and everything crashes. Perl core has recently taken a lot of steps to allow only well-formed UTF-8 strings to be available to the user, and now you suggest taking a step back - I don't think that's a good idea. It could work if you could separate such strings into their own namespace - but that'd require an API change for all filesystem-related functions.
To: perlbug-followup [...] perl.org
From: pali [...] cpan.org
Subject: Re: [perl #130831] Perl's open() has broken Unicode file name support
Date: Tue, 21 Aug 2018 11:11:01 +0200
On Tuesday 21 August 2018 02:02:18 Sergey Aleynikov via RT wrote:
> On Mon, 20 Aug 2018 01:48:07 -0700, pali@cpan.org wrote:
> > Store file names in extended Perl's Unicode (with code points above
> > U+1FFFFF). Non-extended code points would represent normal Unicode code
> > points. And code points above U+1FFFFF would represent parts of file
> > name which cannot be unambiguously represented in Unicode.
> And then someone passes this string to an API call that expects
> well-formed UTF-8, and everything crashes. Perl core has recently taken
> a lot of steps to allow only well-formed UTF-8 string to be available
> to user, and now you suggest to take a step back - I don't think that's
> a good idea.
>
> It could work if you could separate such strings into their own
> namespace - but that'd require and API change for all filesystem-related
> functions.
Yesterday on IRC I presented the following idea, which could solve the above problem.

Introduce a new qf operator which takes a Unicode string and returns a perl object representing a file name. Internally the object can store the file name however it needs to (e.g. as a sequence of integer code points, if storing code points above U+1FFFFF in a UTF-8 string is bad), and every perl filesystem function (like open()) would interpret these file name objects specially -- without The Unicode Bug, etc...

Functions like readdir() would also return these file name objects instead of regular strings.

The file name objects could have a proper stringification operator so they always produce a printable representation of the file name. For the non-representable code points above U+1FFFFF, the stringification function could escape them via some ASCII sequences.

This would allow module ABC to create a file name via the qf operator and pass it into module CDE, which calls open() on the argument passed from module ABC. All the fs functions (like open()) would work as before, so there would be no regression for existing code. Only when the passed argument is that special object would it be handled differently.
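The proposed object could look roughly like this. Everything here is hypothetical: there is no qf operator, so a plain constructor stands in for it, the package name Filename is invented, and the \x{...} escape syntax for unrepresentable parts is chosen only for illustration.

```perl
use strict;
use warnings;

package Filename;
use overload '""' => \&_stringify;   # printable form for print/warn

# Hypothetical stand-in for the proposed qf operator: store the name
# internally as a list of integer code points.
sub new {
    my ($class, @codepoints) = @_;
    return bless { cp => [@codepoints] }, $class;
}

# Stringify for display: normal code points come out as characters,
# code points above U+1FFFFF are escaped as an (invented) ASCII sequence.
sub _stringify {
    my ($self) = @_;
    return join '', map {
        $_ > 0x1FFFFF ? sprintf('\x{%X}', $_) : chr($_)
    } @{ $self->{cp} };
}

package main;
my $fn = Filename->new(0x61, 0x200041);   # 'a' plus one escaped part
print "$fn\n";                            # prints a\x{200041}
```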
RT-Send-CC: perl5-porters [...] perl.org
On Tue, 21 Aug 2018 02:11:41 -0700, pali@cpan.org wrote:
> Introduce a new qf operator which takes Unicode string and returns perl
> object which would represent file name. Internally object itself can
> store file name as it needs (e.g. sequence of integer code points, if
> storing code points above U+1FFFFF in UTF-8 string is bad) and every
> perl's filesystem function (like open()) would interpret these file name
> objects specially -- without The Unicode bug, etc...
>
> Also functions like readdir() would return these file name objects
> instead of regular strings.
Yeah, that's the path of changing the API.

