@ARGV, -CA and Win32 #16681
From @tonycoz
Created by @tonycoz
Parsing non-ASCII command-lines on Win32 is moderately broken. Whether in codepage 65001 or not, non-ASCII (or at least, non-system
J:\dev\perl\git\perl\win32>..\perl -I..\lib -MDevel::Peek -Mutf8 -e "Dump(qq(αω)
J:\dev\perl\git\perl\win32>..\perl -I..\lib -MDevel::Peek -Mutf8 -e "Dump(shift)
The -CA switch makes no difference.
While this is caused by the interaction between the C runtime and the
Simplest (for the user) would be for -CA to retrieve the UTF-16
Relying only on -CA might also have some backcompat issues, since if
To simplify the implementation we might depend on an environment
Perl Info
From @xenu On Sun, 02 Sep 2018 18:26:01 -0700
While this ticket is about @ARGV and not filenames, I consider it a
It's *exactly* the same problem. In this case, command line arguments
While this brings *horrible* experience to perl users, I don't think
The RT System itself - Status changed from 'new' to 'open'
From @tonycoz On Sun, 02 Sep 2018 19:49:57 -0700, me@xenu.pl wrote:
It's certainly a strongly related problem, but it's not the same problem. It came up while I was diagnosing code that accepted a filename (the name in the test case) and called the Win32 specific APIs to open it and failed.
I think it's something we can improve. The main issue right now is that code that accepts strings from the command-line gets nonsensical results - unless the caller does nonsensical things.
The attached patch modifies perl to re-generate argv from the UTF-16 command-line if it sees a -CA switch, and it works for me for commands run from the command prompt. However, one test fails:
run/switchC.t .. ok 10 - \#!perl -C
This fails because backticks have the same type of problem - backticks don't know about Unicode either. By the time the chr(256) gets to win32_popen() all it sees is "\xc4\x80", and it can't tell if that was Latin-1 or UTF-8, so it's passed through as 00C4 0080 rather than 0100.
Tony
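To make the ambiguity concrete, here is a minimal portable sketch (not perl's or Windows' actual code) that decodes the same two bytes both ways. The function names latin1_to_utf16 and utf8_to_utf16_bmp are hypothetical, and the UTF-8 decoder is deliberately limited to the BMP and does not reject overlong forms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: treat every byte as one Latin-1 character,
   so the byte value is the code point. Returns units written. */
size_t latin1_to_utf16(const unsigned char *in, size_t len, uint16_t *out)
{
    size_t i;
    assert(in != NULL && out != NULL);
    for (i = 0; i < len; ++i)
        out[i] = in[i];
    return len;
}

/* Hypothetical helper: decode 1- to 3-byte UTF-8 sequences (BMP only).
   Returns units written, or (size_t)-1 on malformed or unsupported
   input. Overlong encodings are not rejected in this sketch. */
size_t utf8_to_utf16_bmp(const unsigned char *in, size_t len, uint16_t *out)
{
    size_t i = 0, n = 0;
    assert(in != NULL && out != NULL);
    while (i < len) {
        unsigned char b = in[i];
        if (b < 0x80) {
            out[n++] = b;
            i += 1;
        }
        else if ((b & 0xE0) == 0xC0 && i + 1 < len
                 && (in[i + 1] & 0xC0) == 0x80) {
            out[n++] = (uint16_t)(((b & 0x1F) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        }
        else if ((b & 0xF0) == 0xE0 && i + 2 < len
                 && (in[i + 1] & 0xC0) == 0x80
                 && (in[i + 2] & 0xC0) == 0x80) {
            out[n++] = (uint16_t)(((b & 0x0F) << 12)
                                  | ((in[i + 1] & 0x3F) << 6)
                                  | (in[i + 2] & 0x3F));
            i += 3;
        }
        else {
            return (size_t)-1;
        }
    }
    return n;
}
```

Fed the bytes C4 80, the first function yields the two units 00C4 0080 and the second yields the single unit 0100. Nothing in the bytes themselves says which reading was intended, which is why the receiving side needs some out-of-band signal.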
From @tonycoz
0001-perl-133496-convert-utf-16-arguments-to-utf-8-if-CA-.patch
From 3c90673de0140195fa62034ce7b49f7fd9949d8c Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Wed, 5 Sep 2018 13:24:30 +1000
Subject: (perl #133496) convert utf-16 arguments to utf-8 if -CA set
perl generally uses the ANSI/multi-byte APIs on Win32, which means
that the Unicode values that Windows keeps can be corrupted when
accepted by perl.
When the -CA switch is supplied this change refetches and splits the
UTF-16 command line from the system and converts each argument
to UTF-8, replacing the original argv.
This has two effects:
- any other following arguments, especially including -[eE] code, are
encoded as UTF-8, so -Mutf8 works sensibly
- @ARGV is properly populated with Unicode arguments instead of
the ANSI-fied version of the UTF-16 arguments.
With this change, the test for -CA in run/switchC.t fails, since it
ends up passing "\xc4\x80" to system, which ends up as UTF-16 00c4 0080
rather than 0100.
---
perl.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 61 insertions(+), 1 deletion(-)
diff --git a/perl.c b/perl.c
index 30ee577483..10a3510c11 100644
--- a/perl.c
+++ b/perl.c
@@ -42,6 +42,10 @@
#include "nwutil.h"
#endif
+#ifdef WIN32
+#include <shellapi.h>
+#endif
+
#ifdef DEBUG_LEAKING_SCALARS_FORK_DUMP
# ifdef I_SYSUIO
# include <sys/uio.h>
@@ -1557,6 +1561,55 @@ Perl_call_atexit(pTHX_ ATEXIT_t fn, void *ptr)
++PL_exitlistlen;
}
+#ifdef WIN32
+
+#define win32_upgrade_argv(pargv) S_win32_upgrade_argv(aTHX_ (pargv))
+
+STATIC void
+S_win32_upgrade_argv(pTHX_ char ***pargv) {
+ LPWSTR cmdline = GetCommandLineW();
+ int num_args;
+ LPWSTR *argsw = CommandLineToArgvW(cmdline, &num_args);
+ STRLEN argsc_sz =
+ WideCharToMultiByte(CP_UTF8, 0, cmdline, -1,
+ NULL, 0, 0, 0);
+ /* maybe combine these into one block */
+ /* FIXME: leaking this memory for now */
+ LPSTR *argvc = safemalloc(sizeof(LPSTR) * (num_args+1) + argsc_sz);
+ LPSTR argsc = (LPSTR)(argvc + num_args+1);
+ LPSTR argsc_end = argsc + argsc_sz;
+ LPSTR dest = argsc;
+ int i;
+
+ for (i = 0; i < num_args; ++i) {
+ argvc[i] = dest;
+ dest += WideCharToMultiByte(CP_UTF8, 0, argsw[i], -1,
+ dest, argsc_end - dest, 0, 0);
+ }
+ argvc[i] = NULL;
+
+ LocalFree(argsw);
+
+ /* this assumes both the CRT and CommandLineToArgvW() split
+ the command-line in the same way
+ */
+ assert(num_args == PL_origargc);
+ *pargv = argvc + (*pargv - PL_origargv);
+
+ SAVEFREEPV(argvc);
+}
+
+#define WIN32_UPGRADE_ARGV \
+ STMT_START { \
+ if (!replaced_args && (PL_unicode & PERL_UNICODE_ARGV_FLAG)) { \
+ win32_upgrade_argv(&argv); \
+ replaced_args = TRUE; \
+ } \
+ } STMT_END
+#else
+#define WIN32_UPGRADE_ARGV NOOP
+#endif
+
/*
=for apidoc Am|int|perl_parse|PerlInterpreter *my_perl|XSINIT_t xsinit|int argc|char **argv|char **env
@@ -2034,6 +2087,9 @@ S_parse_body(pTHX_ char **env, XSINIT_t xsinit)
SV *linestr_sv = NULL;
bool add_read_e_script = FALSE;
U32 lex_start_flags = 0;
+#ifdef WIN32
+ bool replaced_args = FALSE;
+#endif
PERL_SET_PHASE(PERL_PHASE_START);
@@ -2072,8 +2128,9 @@ S_parse_body(pTHX_ char **env, XSINIT_t xsinit)
case 'W':
case 'X':
case 'w':
- if ((s = moreswitches(s)))
+ if ((s = moreswitches(s))) {
goto reswitch;
+ }
break;
case 't':
@@ -2192,6 +2249,7 @@ S_parse_body(pTHX_ char **env, XSINIT_t xsinit)
default:
Perl_croak(aTHX_ "Unrecognized switch: -%s (-h will show valid options)",s);
}
+ WIN32_UPGRADE_ARGV;
}
}
@@ -2265,6 +2323,7 @@ S_parse_body(pTHX_ char **env, XSINIT_t xsinit)
#endif
} else {
moreswitches(d);
+ WIN32_UPGRADE_ARGV;
}
}
}
@@ -2387,6 +2446,7 @@ S_parse_body(pTHX_ char **env, XSINIT_t xsinit)
linestr_sv = newSV_type(SVt_PV);
lex_start_flags |= LEX_START_COPIED;
find_beginning(linestr_sv, rsfp);
+ WIN32_UPGRADE_ARGV;
if (cddir && PerlDir_chdir( (char *)cddir ) < 0)
Perl_croak(aTHX_ "Can't chdir to %s",cddir);
}
--
2.14.1.windows.1
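As an aside on the patch above: S_win32_upgrade_argv() puts the argv pointer array and the string data in a single allocation, then moves the caller's argv cursor to the same offset in the new array. The following is a minimal portable sketch of that layout only. It copies the strings unchanged instead of converting them from UTF-16, and the names pack_argv and rebase_cursor are hypothetical, not part of perl.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy argc strings into one malloc'd block laid
   out as [ptr0 .. ptr(argc-1), NULL][str0\0 str1\0 ...].
   A single free() on the returned pointer releases everything.
   Returns NULL if the allocation fails. */
char **pack_argv(int argc, char *const *argv)
{
    size_t bytes = 0;
    char **packed;
    char *dest;
    int i;

    assert(argc >= 0);
    for (i = 0; i < argc; ++i)
        bytes += strlen(argv[i]) + 1;

    packed = malloc(sizeof(char *) * ((size_t)argc + 1) + bytes);
    if (packed == NULL)
        return NULL;

    /* String storage starts right after the pointer array. */
    dest = (char *)(packed + argc + 1);
    for (i = 0; i < argc; ++i) {
        size_t n = strlen(argv[i]) + 1;
        packed[i] = dest;
        memcpy(dest, argv[i], n);
        dest += n;
    }
    packed[argc] = NULL;
    return packed;
}

/* Hypothetical helper: given a cursor into the old array, return the
   same position in the new array (the patch does this for *pargv). */
char **rebase_cursor(char **old_base, char **cursor, char **new_base)
{
    return new_base + (cursor - old_base);
}
```

The rebasing step only makes sense if both arrays have the same number of entries in the same order, which is the assumption the patch guards with its assert on PL_origargc.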
From @tonycoz On Tue, 04 Sep 2018 20:40:16 -0700, tonyc wrote:
Also, it breaks embedding, so don't apply this patch.
Maybe an alternative is to not make it depend on the -CA switch, but on the current code page. If the current code page is 65001 then main() (in win32/runperl.c) could do the conversion to UTF-8 that I do in my patch. The program then depends on the normal -CA behaviour to treat that as UTF-8, so perl code sees Unicode in @ARGV.
It does mean that a user has to do something unusual (chcp 65001) to get reasonable behaviour.
Tony
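For readers unfamiliar with the conversion step itself, here is a rough portable sketch of encoding UTF-16 code units as UTF-8, the job the patch hands to WideCharToMultiByte(CP_UTF8, ...). The function name utf16_to_utf8 is hypothetical, and returning an error on an unpaired surrogate is a choice made for this sketch, not a claim about how the Windows API treats such input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode len UTF-16 code units as UTF-8 into out
   (capacity cap bytes). Returns the number of bytes written, or
   (size_t)-1 on an unpaired surrogate or insufficient space. */
size_t utf16_to_utf8(const uint16_t *in, size_t len,
                     unsigned char *out, size_t cap)
{
    size_t i = 0, n = 0;
    assert(in != NULL && out != NULL);
    while (i < len) {
        uint32_t cp = in[i++];
        size_t need;

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (i >= len || in[i] < 0xDC00 || in[i] > 0xDFFF)
                return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10)
                 + ((uint32_t)in[i++] - 0xDC00);
        }
        else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;          /* lone low surrogate */
        }

        need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (cap - n < need)
            return (size_t)-1;

        if (need == 1) {
            out[n++] = (unsigned char)cp;
        }
        else if (need == 2) {
            out[n++] = (unsigned char)(0xC0 | (cp >> 6));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
        else if (need == 3) {
            out[n++] = (unsigned char)(0xE0 | (cp >> 12));
            out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
        else {
            out[n++] = (unsigned char)(0xF0 | (cp >> 18));
            out[n++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return n;
}
```

Once the arguments are in this form, the existing -CA handling can mark them as UTF-8 without needing to know where they came from, which is the appeal of doing the conversion in main() rather than inside the switch parser.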
From @khwilliamson On 09/05/2018 05:20 PM, Tony Cook via RT wrote:
I haven't looked at this thread in detail, but using script runs can be
From @xenu On Wed, 05 Sep 2018 16:20:03 -0700
You mean the console codepage? There are some problems with that approach.
Console codepages don't exist in windows subsystem applications (like
C:\Users\xenu>wperl -MWin32 -E "open my($fh), '>', 'a.txt'; print {$fh} Win32::GetConsoleCP()"
Another problem is that it won't cover situations where it's impossible
I think that the only reasonable way to fix the win32 unicode bug is to
IMO we should reintroduce that switch.
On second thought, I think, in the long run, we should enable unicode
From @tonycoz On Thu, 06 Sep 2018 11:44:09 -0700, me@xenu.pl wrote:
The argv handling looks similar to what my patch does - with the same problem for embedding. The wide system calls handling appears to assume all SVs are UTF-8 encoded, even without the SVf_UTF8 flag set.
I think fixing @ARGV would be reasonably painless for backcompat, but the rest of the wide-character support is too likely to break things, I think.
I wonder how much CPAN testing was done with -C for perl 5.8.0.
Tony
Migrated from rt.perl.org#133496 (status was 'open')
Searchable as RT133496$