[meta] Win32 Unicode support #17094

Open
p5pRT opened this issue Jul 15, 2019 · 14 comments
Labels
distro-mswin32 type-core Unicode and System Calls Bad interactions of syscalls and UTF-8

Comments

@p5pRT

p5pRT commented Jul 15, 2019

Migrated from rt.perl.org#134286 (status was 'open')

Searchable as RT134286$

@p5pRT
Author

p5pRT commented Jul 15, 2019

From @tonycoz

Created by @tonycoz

Perl typically uses the so-called "ANSI" APIs for compatibility
with other platforms; e.g. calling unlink() is cross-platform, and
to a certain extent works in Unicode locales (using the encoded
bytes rather than characters), but the equivalents on Win32 don't
work.

chcp 65001 isn't a solution, since pretty much everything but
console output will still be in the system code page.

So this is a meta ticket covering tickets improving Perl's Win32
Unicode support.

Any changes need to be switchable (so that a user with old ANSI
encoded filenames, for example, doesn't lose a bunch of work), and
must not break other platforms.

I can see at least the following issues:

- command line arguments

- filenames across many different operators

- process creation (system, exec, readpipe/qx(), pipe open())

- environment variables

- console (maybe)

- many similar changes in bundled modules

Discussion is welcome, patches that satisfy the above
requirements more so.

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.28.1:

Configured by strawberry-perl at Sun Dec  2 14:25:09 2018.

Summary of my perl5 (revision 5 version 28 subversion 1) configuration:
   
  Platform:
    osname=MSWin32
    osvers=10.0.17134.407
    archname=MSWin32-x64-multi-thread
    uname='Win32 strawberry-perl 5.28.1.1 #1 Sun Dec  2 14:24:00 2018 x64'
    config_args='undef'
    hint=recommended
    useposix=true
    d_sigaction=undef
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=undef
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
    bincompat5005=undef
  Compiler:
    cc='gcc'
    ccflags =' -s -O2 -DWIN32 -DWIN64 -DCONSERVATIVE -D__USE_MINGW_ANSI_STDIO -DPERL_TEXTMODE_SCRIPTS -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -fwrapv -fno-strict-aliasing -mms-bitfields'
    optimize='-s -O2'
    cppflags='-DWIN32'
    ccversion=''
    gccversion='7.1.0'
    gccosandvers=''
    intsize=4
    longsize=4
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='long long'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='g++.exe'
    ldflags ='-s -L"C:\sperl-5.28.1.1-64bit-portable\perl\lib\CORE" -L"C:\sperl-5.28.1.1-64bit-portable\c\lib"'
    libpth=C:\sperl-5.28.1.1-64bit-portable\c\lib C:\sperl-5.28.1.1-64bit-portable\c\x86_64-w64-mingw32\lib C:\sperl-5.28.1.1-64bit-portable\c\lib\gcc\x86_64-w64-mingw32\7.1.0
    libs= -lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
    perllibs= -lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
    libc=
    so=dll
    useshrplib=true
    libperl=libperl528.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs
    dlext=xs.dll
    d_dlsymun=undef
    ccdlflags=' '
    cccdlflags=' '
    lddlflags='-mdll -s -L"C:\sperl-5.28.1.1-64bit-portable\perl\lib\CORE" -L"C:\sperl-5.28.1.1-64bit-portable\c\lib"'



@INC for perl 5.28.1:
    c:/sperl-5.28.1.1-64bit-portable/perl/site/lib
    c:/sperl-5.28.1.1-64bit-portable/perl/vendor/lib
    c:/sperl-5.28.1.1-64bit-portable/perl/lib


Environment for perl 5.28.1:
    HOME (unset)
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=c:\sperl-5.28.1.1-64bit-portable\perl\site\bin;c:\sperl-5.28.1.1-64bit-portable\perl\bin;c:\sperl-5.28.1.1-64bit-portable\c\bin;C:\Program Files\Microsoft MPI\Bin\;C:\Program Files (x86)\Common Files\Intel\Shared Files\cpp\bin\Intel64;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files\Git\cmd;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\Microsoft SQL Server\130\Tools\Binn\;C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\;C:\Users\Tony\AppData\Local\Microsoft\WindowsApps;C:\Users\Tony\AppData\Local\GitHubDesktop\bin
    PERL_BADLANG (unset)
    SHELL (unset)

@p5pRT
Author

p5pRT commented Jul 15, 2019

From bokutin@bokut.in

I've been waiting.

Python and Ruby are already possible.

I don't think it will be much help, but I have built this myself in the past.
https://github.com/bokutin/strawberry-perl-USING_WIDE-revival

@p5pRT
Author

p5pRT commented Jul 15, 2019

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Author

p5pRT commented Jul 18, 2019

From @pali

I can see at least the following issues:
...
- filenames across many different operators
- process creation (system, exec, readpipe/qx(), pipe open())
...
Discussion is welcome, patches that satisfy the above
requirements more so.

The same problem also exists on Linux systems. I already created a ticket,
https://rt.perl.org/Public/Bug/Display.html?id=130831
which discusses how to handle it. I proposed a solution (which could
work for both Windows and Linux systems), but it seems other people do not
like it.

(PS: Please CC me on future discussion, as I do not know how to add
myself to the CC list on RT)

@tonycoz
Contributor

tonycoz commented Dec 16, 2020

Platforms

This ticket is specifically about Win32, but ideally we should do the same for POSIX-like systems.

Aim

Provide support for accessing filenames, program arguments and the environment as strings of characters rather than as strings of bytes, and support Win32's unicode interfaces.

This would need to handle possibly mis-encoded filenames, environment entries and argv entries on both Win32 and non-Win32 systems.

This would be opt-in

The new behaviour would only occur under a new -C flag, possibly -Cs, to avoid breaking existing code written to work under the older rules.

User visible behaviour

With the new flag, as follows.

Win32

  • argv[] would be populated with utf8 versions of the command-line parameters (-CA would be needed to decode it.)

  • %ENV will be populated from the Unicode versions of the environment, and

  • readdir() will return the UTF-8 version of the UTF-16 name.

POSIX-like systems

  • If both -CA and -Cs are provided, @ARGV encoding will follow "Handling of mis-encoded names" (from a quick look, -CA currently just switches the UTF-8 flag on)

  • %ENV will be populated as if the environment was UTF-8 and will follow "Handling of mis-encoded names"

  • readdir() will decode the returned names as UTF-8
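
To make the POSIX-side readdir() item concrete, here is a minimal Perl sketch of what scripts have to do by hand today: readdir() returns raw bytes and the caller decodes them. The temporary directory and the file name are made up for illustration, and the sketch assumes a filesystem that stores names as plain bytes (as typical Linux filesystems do). Under the proposed flag the decode step would presumably move inside perl.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical scratch directory and file name, for illustration only.
my $dir        = tempdir(CLEANUP => 1);
my $name_bytes = "caf\xC3\xA9.txt";    # UTF-8 bytes for "café.txt"

open my $fh, '>', "$dir/$name_bytes" or die "open: $!";
close $fh;

# Today readdir() hands back the bytes exactly as the OS stores them.
opendir my $dh, $dir or die "opendir: $!";
my @raw = grep { !/^\.\.?\z/ } readdir $dh;
closedir $dh;

# Manual decode step; utf8::decode() leaves the string untouched if the
# bytes are not valid UTF-8, which is where the mis-encoded-name rules
# of the proposal would come in.
my @decoded = map { my $n = $_; utf8::decode($n); $n } @raw;

printf "%d bytes -> %d characters\n", length $raw[0], length $decoded[0];
```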

The UTF-8 flag

No promises are made on the value of the flag: it may be set for strings representable as ISO 8859-1, or it may not.

Handling of mis-encoded names

Outside of Win32, the strings returned by POSIX readdir() and found in argv[] and the environment can contain non-UTF-8 byte sequences.

On Win32, filenames may contain lone surrogates, which cannot properly be converted to UTF-8.

When translating from bytes to characters, non-UTF-8 sequences such as overlongs or extended UTF-8 sequences (surrogates, code points above 0x10FFFF) will be treated as invalidly encoded bytes.

Invalidly encoded bytes will be encoded as code point 0x200000 + byte value.

On Win32, lone surrogates will be encoded as code point 0x210000 + surrogate code point.

This will be reversed when calling the underlying file API.

Extended characters outside these ranges will result in a file not found error or perhaps EINVAL.
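
As a rough illustration of the byte-escaping rule above, here is a pure-Perl sketch. The function names decode_name and encode_name are hypothetical and are not part of any proposed API; a real implementation would live in C inside perl. It only covers the POSIX byte case (0x200000 + byte value) and signals an unmappable character by returning undef where the proposal talks about a file-not-found error or EINVAL.

```perl
use strict;
use warnings;

# Strictly valid UTF-8: no overlongs, no surrogates, nothing above U+10FFFF.
my $valid = qr/
    [\x00-\x7F]
  | [\xC2-\xDF][\x80-\xBF]
  | \xE0[\xA0-\xBF][\x80-\xBF]
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
  | \xED[\x80-\x9F][\x80-\xBF]
  | \xF0[\x90-\xBF][\x80-\xBF]{2}
  | [\xF1-\xF3][\x80-\xBF]{3}
  | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Bytes -> characters. Any byte that is not part of a valid sequence
# becomes the code point 0x200000 + byte value.
sub decode_name {
    my ($bytes) = @_;
    my $out = '';
    while ($bytes =~ /\G(?:($valid)|(.))/gs) {
        if (defined $1) {
            my $seq = $1;
            utf8::decode($seq);
            $out .= $seq;
        }
        else {
            $out .= chr(0x200000 + ord $2);
        }
    }
    return $out;
}

# Characters -> bytes, reversing the escape. Returns undef for code
# points that cannot be mapped back.
sub encode_name {
    my ($chars) = @_;
    my $out = '';
    for my $ch (split //, $chars) {
        my $cp = ord $ch;
        if ($cp >= 0x200000 && $cp <= 0x2000FF) {
            $out .= chr($cp - 0x200000);    # escaped raw byte
        }
        elsif ($cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
            return undef;                   # outside the escape range
        }
        else {
            my $s = $ch;
            utf8::encode($s);
            $out .= $s;
        }
    }
    return $out;
}

my $weird = "a\xFF\xC0\x80b";    # stray byte plus an overlong NUL
my $chars = decode_name($weird);
printf "%d bytes became %d characters\n", length $weird, length $chars;
```

The round trip encode_name(decode_name($bytes)) returns the original bytes for any input, which is the property the proposal relies on when it says the mapping is reversed before calling the file API.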

XS/API

New APIs will be provided:

PerlIO *PerlIO_open_sv(SV *path, const char *mode);
int PerlLIO_rename_sv(SV *oldname, SV *newname);
etc

These would either follow the current behaviour, or perform the filename conversions discussed above if perl was invoked with -Cs.

Rationale

Why not make the behaviour lexical

Filenames are passed around between perl lexical scopes; consider a filename received from File::Find, or passed to IO::File.

I use supers/overlongs in my filenames, this will break that

Don't use this option, it's intended to support displayable filenames transparently.

Why do all the fancy decoding?

POSIX filenames are byte strings, they may not be valid UTF-8, and Win32 has a similar issue with lone surrogates.

Simply storing the byte sequences with the UTF-8 flag on would produce invalid internal state for perl.

This is less of an issue for Win32's lone surrogates, but treating them as valid seems incorrect to me.
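
For the Win32 side, the lone-surrogate rule from "Handling of mis-encoded names" (0x210000 + surrogate code point) could be sketched as follows. This is pure Perl working on a list of UTF-16 code units purely for illustration; the function names are hypothetical, and real code would operate on the wide strings returned by the Win32 APIs.

```perl
use strict;
use warnings;

# UTF-16 code units -> Perl string. Proper pairs are combined as usual;
# a surrogate without a partner becomes 0x210000 + surrogate.
sub utf16_units_to_string {
    my @u   = @_;
    my $out = '';
    while (@u) {
        my $hi = shift @u;
        if (   $hi >= 0xD800 && $hi <= 0xDBFF
            && @u && $u[0] >= 0xDC00 && $u[0] <= 0xDFFF)
        {
            my $lo = shift @u;
            $out .= chr(0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00));
        }
        elsif ($hi >= 0xD800 && $hi <= 0xDFFF) {
            $out .= chr(0x210000 + $hi);    # lone surrogate
        }
        else {
            $out .= chr($hi);
        }
    }
    return $out;
}

# Perl string -> UTF-16 code units, reversing the escape. Returns an
# empty list for characters that cannot be mapped (note that an empty
# string also yields an empty list in this simplified sketch).
sub string_to_utf16_units {
    my ($str) = @_;
    my @units;
    for my $cp (map { ord } split //, $str) {
        if ($cp >= 0x21D800 && $cp <= 0x21DFFF) {
            push @units, $cp - 0x210000;
        }
        elsif ($cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
            return;
        }
        elsif ($cp >= 0x10000) {
            my $v = $cp - 0x10000;
            push @units, 0xD800 + ($v >> 10), 0xDC00 + ($v & 0x3FF);
        }
        else {
            push @units, $cp;
        }
    }
    return @units;
}

my @name = (0x41, 0xD800, 0xD83D, 0xDE00, 0xDC00);
my $str  = utf16_units_to_string(@name);
printf "%d code units became %d characters\n", scalar @name, length $str;
```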

References

[PEP383] https://www.python.org/dev/peps/pep-0383/ - "PEP 383 -- Non-decodable Bytes in System Character Interfaces." This makes the faulty assumption that Win32 filenames are Unicode.

@Grinnz Grinnz added the Unicode and System Calls Bad interactions of syscalls and UTF-8 label Dec 16, 2020
@Grinnz
Contributor

Grinnz commented Dec 16, 2020

IMO there should be a mechanism to enable this behavior other than the commandline flag. The current -C options other than -CA have an equivalent operation via the open pragma (though sometimes obtusely). There's also precedent for such an operation having a global effect (unfortunately) as PerlIO layers can't be lexically scoped on the global standard handles.
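
For readers unfamiliar with the existing precedent, the following short sketch shows the open pragma doing roughly what -CSD does from the command line, including the global effect on the standard handles mentioned above. It only inspects the resulting layers; the exact layer names printed may vary between perl versions.

```perl
use strict;
use warnings;

# Rough pragma counterpart of -CSD: sets default layers for open()
# lexically, but the :std part changes STDIN/STDOUT/STDERR globally.
use open qw(:std :encoding(UTF-8));

my @layers = PerlIO::get_layers(*STDOUT);
print "STDOUT layers: @layers\n";
```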

@xenu
Member

xenu commented Dec 16, 2020

I think it should be a completely separate CLI flag, that is, not a subflag of -C.

IMO there should be a mechanism to enable this behavior other than the commandline flag.

CLI flags can be set via the PERL5OPT environment variable and using the shebang. Also, I imagine that this feature will eventually become opt-out instead of opt-in.
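
A small sketch of the PERL5OPT route, using the existing -CA switch as a stand-in since the new flag does not exist yet. It starts a child perl with PERL5OPT set and reads back ${^UNICODE}, which reflects the active -C settings. The helper name is made up for illustration. One caveat worth keeping in mind: perlrun notes that since 5.10.1 a -C on the #! line must also be given on the command line, so the shebang route may not work if the new flag ends up as a -C subflag.

```perl
use strict;
use warnings;

# Hypothetical helper: run a child perl with the given PERL5OPT and
# return the value of ${^UNICODE} that the child sees.
sub unicode_flags_of_child {
    my ($perl5opt) = @_;
    local $ENV{PERL5OPT} = $perl5opt;
    delete local $ENV{PERL_UNICODE};    # keep the environment predictable
    open my $fh, '-|', $^X, '-e', 'print ${^UNICODE}'
        or die "cannot run $^X: $!";
    my $out = <$fh>;
    close $fh;
    return $out;
}

my $flags = unicode_flags_of_child('-CA');
print "child started with PERL5OPT=-CA saw \${^UNICODE} = $flags\n";
```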

Anyway, I generally agree with this proposal, however I don't understand why you don't want to use "Low Surrogates" and "High Surrogates" blocks for unpaired surrogates, as specified by WTF-8. What do we gain by keeping them illegal?

@xenu
Member

xenu commented Dec 16, 2020

IMO there should be a mechanism to enable this behavior other than the commandline flag.

CLI flags can be set via the PERL5OPT environment variable and using the shebang. Also, I imagine that this feature will eventually become opt-out instead of opt-in.

On second thought, a global variable (in addition to the flag) probably would make sense but I'm worried that people will want to abuse it with local.

@pali
Member

pali commented Dec 16, 2020

Anyway, I generally agree with this proposal, however I don't understand why you don't want to use "Low Surrogates" and "High Surrogates" blocks for unpaired surrogates, as specified by WTF-8. What do we gain by keeping them illegal?

Unpaired "Low Surrogates" and "High Surrogates" are illegal in UNICODE. So I think we should avoid using them, and also add the possibility to distinguish between real unpaired surrogates which come from other places (and are illegal) and those from win32 filenames (which are legal).

@pali
Member

pali commented Dec 16, 2020

Win32

  • argv[] would be populated with utf8 versions of the command-line parameters (-CA would be needed to decode it.)

I would suggest to have argv[] in UNICODE (not in UTF-8, not in UTF-16)

  • readdir() will return the UTF-8 version of the UTF-16 name.

Same here, I would suggest to have it in UNICODE (not UTF-8, not UTF-16).

POSIX-like systems

  • %ENV will be populated as if the environment was UTF-8 and will follow "Handling of mis-encoded names"
  • readdir() will decode the returned names as UTF-8

And same here.

Why not make the behaviour lexical

Filenames are passed around between perl lexical scopes; consider a filename received from File::Find, or passed to IO::File.

Earlier I suggested a solution for this issue: add a new SV flag (or any other way to mark a particular SV*) that indicates that its value is the UNICODE version of a filename. A lexical pragma could then change the behavior of Perl so that it sets this new flag for filenames, ensuring that e.g. the result from readdir() will be UNICODE also when stored into a variable (SV*) and used outside of the lexical block.

@pali
Member

pali commented Dec 16, 2020

Related issues:
#15883
#17091
#9578

@tonycoz
Contributor

tonycoz commented Dec 16, 2020

Win32

  • argv[] would be populated with utf8 versions of the command-line parameters (-CA would be needed to decode it.)

I would suggest to have argv[] in UNICODE (not in UTF-8, not in UTF-16)

argv[] is a C variable, it has no unicode flag

  • readdir() will return the UTF-8 version of the UTF-16 name.

Same here, I would suggest to have it in UNICODE (not UTF-8, not UTF-16).

POSIX-like systems

  • %ENV will be populated as if the environment was UTF-8 and will follow "Handling of mis-encoded names"
  • readdir() will decode the returned names as UTF-8

And same here.

Maybe I wasn't clear enough here. The intent in each case is that @ARGV (with -CA), %ENV and the result of readdir() would be populated in utf8 from the wide version of the system APIs per the description below, and if necessary upgraded.

Saying "in UNICODE" here without saying what you mean in terms of an effect on the implementation isn't meaningful.

Why not make the behaviour lexical
Filenames are passed around between perl lexical scopes; consider a filename received from File::Find, or passed to IO::File.

Earlier I suggested a solution for this issue: add a new SV flag (or any other way to mark a particular SV*) that indicates that its value is the UNICODE version of a filename. A lexical pragma could then change the behavior of Perl so that it sets this new flag for filenames, ensuring that e.g. the result from readdir() will be UNICODE also when stored into a variable (SV*) and used outside of the lexical block.

You don't say what happens when such a SV is combined with an SV obtained while the unicode interface isn't active. Or how it's combined with SVs not from readdir(). I would see this requiring two flags to prevent problems, one to indicate it came from readdir() and another to indicate the mode it was read in.

Also, such a flag would be lost when the value is serialized (JSON, as a database column value, etc.); making this effectively global (it's up to the developer to make it global across interacting processes) avoids that.

@pali
Member

pali commented Dec 16, 2020

Win32

  • argv[] would be populated with utf8 versions of the command-line parameters (-CA would be needed to decode it.)

I would suggest to have argv[] in UNICODE (not in UTF-8, not in UTF-16)

argv[] is a C variable, it has no unicode flag

Oh, sorry about that. C variables are of course in UTF-8.

  • readdir() will return the UTF-8 version of the UTF-16 name.

Same here, I would suggest to have it in UNICODE (not UTF-8, not UTF-16).

POSIX-like systems

  • %ENV will be populated as if the environment was UTF-8 and will follow "Handling of mis-encoded names"
  • readdir() will decode the returned names as UTF-8

And same here.

Maybe I wasn't clear enough here. The intent in each case is that @ARGV (with -CA), %ENV and the result of readdir() would be populated in utf8 from the wide version of the system APIs per the description below, and if necessary upgraded.

Saying "in UNICODE" here without saying what you mean in terms of an effect on the implementation isn't meaningful.

By UNICODE I mean that Perl scalars would contain a sequence of UNICODE code points. By UTF-8 I mean that a scalar would contain a sequence of UTF-8 bytes. For example, the letter á in UNICODE is "\N{U+E1}" and in UTF-8 is "\x{c3}\x{a1}". I hope it is clear now.
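
The distinction drawn here can be checked directly in Perl. This is only a small illustration of the two representations of á; the variable names are arbitrary.

```perl
use strict;
use warnings;

my $chars = "\N{U+E1}";    # one character: the code point U+00E1
my $bytes = "\xC3\xA1";    # two bytes: the UTF-8 encoding of U+00E1

printf "character string length: %d\n", length $chars;
printf "byte string length:      %d\n", length $bytes;

# Decoding the bytes yields the character string.
my $decoded = $bytes;
utf8::decode($decoded);
print "decoded bytes equal the character string\n" if $decoded eq $chars;
```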

Why not make the behaviour lexical
Filenames are passed around between perl lexical scopes; consider a filename received from File::Find, or passed to IO::File.

Earlier I suggested a solution for this issue: add a new SV flag (or any other way to mark a particular SV*) that indicates that its value is the UNICODE version of a filename. A lexical pragma could then change the behavior of Perl so that it sets this new flag for filenames, ensuring that e.g. the result from readdir() will be UNICODE also when stored into a variable (SV*) and used outside of the lexical block.

You don't say what happens when such a SV is combined with an SV obtained while the unicode interface isn't active. Or how it's combined with SVs not from readdir(). I would see this requiring two flags to prevent problems, one to indicate it came from readdir() and another to indicate the mode it was read in.

This is something which needs to be discussed and designed. I just want to show that it is possible to design and implement it. Of course it is not easy and there are a lot of edge cases...

Also, such a flag would be lost when the value is serialized (JSON, as a database column value, etc.); making this effectively global (it's up to the developer to make it global across interacting processes) avoids that.

Without fixing and extending serializers, it would not work. But at least it would work for pure-perl code, and the "pragma" can be initially marked as experimental to provide at least something, with later versions fixing and extending it until we come up with a stabilized implementation.

I'm just trying to show that it is possible to fix this issue.

@tonycoz
Contributor

tonycoz commented Dec 17, 2020

IMO there should be a mechanism to enable this behavior other than the commandline flag. The current -C options other than -CA have an equivalent operation via the open pragma (though sometimes obtusely). There's also precedent for such an operation having a global effect (unfortunately) as PerlIO layers can't be lexically scoped on the global standard handles.

I can see being able to use a global variable to control filename handling (the result of readdir(), and how open, rename, unlink etc handle names provided), but any problems encountered when a filename crosses this boundary would be the user's problem.

The command-line option is needed to correctly set up @ARGV and %ENV, and to control what happens when %ENV is modified. I don't think the %ENV handling should be controllable at runtime.

A command-line option is actually on the late side: we need argv[] properly set up on Win32 for -I to make sense, and fixing that might require some moderately ugly hackery.
