Report information
Id: 121783
Status: open
Priority: 0/
Queue: perl5

Owner: tonyc <tony [at] develop-help.com>
Requestors: nanis [at] cpan.org
Cc:
AdminCc:

Operating System: mswin32
PatchStatus: (no value)
Severity: medium
Type: core
Perl Version: 5.19.12
Fixed In: (no value)

Attachments
0001-perl-121783-work-around-a-bug-in-WriteFile.patch



Date: Thu, 1 May 2014 21:25:55 -0400
To: perlbug [...] perl.org
Subject: Windows: UTF-8 encoded output in cmd.exe with code page 65001 causes unexpected output
From: "A. Sinan Unur" <nanis [...] cpan.org>
This is a bug report for perl from nanis@cpan.org,
generated with the help of perlbug 1.40 running under perl 5.19.12.

-----------------------------------------------------------------
[Please describe your issue here]

On Windows 8.1 64-bit and Windows Vista 32-bit, using self-built 5.18.2
and 5.19.12, and ActivePerl 5.16.3, printing UTF-8 encoded text in a
cmd.exe shell where the code page was set to 65001 (UTF-8) causes
unexpected output.

E.g.:

    # Normal
    C:\Users\sinan\src> perl -e "print qq{abc\n}"
    abc

    # alpha, beta, gamma
    # last octet, 0xb3, seems to be repeated on a separate line
    C:\Users\sinan\src> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3\n}"
    αβγ
    �

    # last octet, 0x7a, seems to be repeated on a separate line
    C:\Users\sinan\src> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz\n}"
    αβγxyz
    z

    # without a newline, more unexpected octets
    C:\Users\sinan\src> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3}"
    αβγ�γ�

    # with trailing ascii, last three octets seem to be repeated
    C:\Users\sinan\src> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz}"
    αβγxyzxyz

    C:\Users\sinan\src> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz123}"
    αβγxyz123123

For comparison, the following C program, compiled with Microsoft (R)
C/C++ Optimizing Compiler Version 16.00.30319.01 for x64 outputs just
αβγ:

    #include <stdio.h>

    int main(void) {
        /* UTF-8 encoded alpha, beta, gamma */
        char x[] = { 0xce, 0xb1, 0xce, 0xb2, 0xce, 0xb3, 0x00 };
        puts(x);
        return 0;
    }

Further,

    C:\Users\sinan\src> type pttt.pl
    use utf8;
    use strict;
    use warnings;
    use warnings qw(FATAL utf8);
    binmode STDOUT, ':utf8';
    print 'αβγxyz', "\n";
    print 'αβγxyz';

    C:\Users\sinan\src> perl pttt.pl
    αβγxyz
    z
    αβγxyzxyz

Note that piping the Perl scripts through xxd or other programs or
saving output to file works as expected:

    C:\Users\sinan\src> perl pttt.pl > ttt

    C:\Users\sinan\src> type ttt
    αβγxyz
    αβγxyz

More info:

http://blog.nu42.com/2014/05/utf-8-ouput-from-perl-and-c-programs-in.html
http://stackoverflow.com/questions/23416075/why-am-i-getting-the-last-octet-repeated-when-my-perl-program-outputs-a-utf-8-en

Also, when the console code page is set to 437, the output from the C
program and the Perl program are identical.

Finally, using `syswrite` with the UTF-8 encoded string also works as
expected:

    C:\Users\sinan\src> perl -e "syswrite STDOUT, qq{\xce\xb1\xce\xb2\xce\xb3\n }"
    αβγ

I suspect an interaction between Perl's IO layers and cmd.exe set to
code page = 65001, but I haven't been able to pinpoint it yet.

For code pages, see http://msdn.microsoft.com/en-us/library/bb643325.aspx

    65001    utf-8    Unicode (UTF-8)

Thank you,

-- Sinan

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=medium
---
Site configuration information for perl 5.19.12:

Configured by sinan at Thu May 1 19:41:21 2014.

Summary of my perl5 (revision 5 version 19 subversion 12) configuration:
  Derived from:
  Platform:
    osname=MSWin32, osvers=6.3, archname=MSWin32-x64-multi-thread
    uname=''
    config_args='undef'
    hint=recommended, useposix=true, d_sigaction=undef
    useithreads=define, usemultiplicity=define
    use64bitint=define, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cl', ccflags ='-nologo -GF -W3 -O1 /favor:INTEL64 -MD -Zi -DNDEBUG -GL -fp:precise -DWIN32 -D_CONSOLE -DNO_STRICT -DWIN64 -DCONSERVATIVE -D_CRT_SECURE_NO_DEPRECATE -D_CRT_NONSTDC_NO_DEPRECATE -DPERL_TEXTMODE_SCRIPTS -DUSE_SITECUSTOMIZE -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO',
    optimize='-O1 /favor:INTEL64 -MD -Zi -DNDEBUG -GL -fp:precise',
    cppflags='-DWIN32'
    ccversion='16.00.30319.01', gccversion='', gccosandvers=''
    intsize=4, longsize=4, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=8
    ivtype='__int64', ivsize=8, nvtype='double', nvsize=8, Off_t='__int64', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf -ltcg -libpath:"c:\Users\sinan\perl\blead\lib\CORE" -machine:AMD64 "/manifestdependency:type='Win32' name='Microsoft.Windows.Common-Controls' version='6.0.0.0' processorArchitecture='*' publicKeyToken='6595b64144ccf1df' language='*'"'
    libpth="C:\Program Files\Microsoft SDKs\Windows\v7.1\lib\x64"
    libs=oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib
    perllibs=oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib
    libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl519.lib
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug -opt:ref,icf -ltcg -libpath:"c:\Users\sinan\perl\blead\lib\CORE" -machine:AMD64 "/manifestdependency:type='Win32' name='Microsoft.Windows.Common-Controls' version='6.0.0.0' processorArchitecture='*' publicKeyToken='6595b64144ccf1df' language='*'"'

Locally applied patches:
    uncommitted-changes

---
@INC for perl 5.19.12:
    C:/Users/sinan/perl/blead/site/lib
    C:/Users/sinan/perl/blead/lib
    .

---
Environment for perl 5.19.12:
    HOME (unset)
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=C:\opt\bin;C:\opt\vim\vim74;C:\opt\gs\gs9.14\bin;C:\opt\gs\gs9.14\lib;C:\Program Files\TortoiseHg;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files\Microsoft Windows Performance Toolkit\;C:\Program Files\TortoiseSVN\bin;C:\Users\sinan\perl\blead\site\bin;C:\Users\sinan\perl\blead\bin
    PERL_BADLANG (unset)
    SHELL (unset)
RT-Send-CC: perl5-porters [...] perl.org
On Thu May 01 18:26:16 2014, nanis@cpan.org wrote:
> On Windows 8.1 64-bit and Windows Vista 32-bit, using self-built 5.18.2
> and 5.19.12, and ActivePerl 5.16.3, printing UTF-8 encoded text in a
> cmd.exe shell where the code page was set to 65001 (UTF-8) causes
> unexpected output.
>
> E.g.:
>
> # Normal
> C:\Users\sinan\src> perl -e "print qq{abc\n}"
> abc
>
> # alpha, beta, gamma
> # last octet, 0xb3, seems to be repeated on a separate line
> C:\Users\sinan\src> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3\n}"
> αβγ
> �
>
> # last octet, 0x7a, seems to be repeated on a separate line
> C:\Users\sinan\src> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz\n}"
> αβγxyz
> z

This is caused by a bug in Windows.

When writing to a console set to code page 65001, WriteFile() returns the number of characters written instead of the number of bytes.

So the write loop in PerlIOBuf_flush() is told that only 8 bytes have been written (6 visible, CR, LF) instead of the 11 that actually were, and so it loops and writes the last 3 again (z, CR, LF).

See:

http://social.msdn.microsoft.com/Forums/vstudio/en-US/e4b91f49-6f60-4ffe-887a-e18e39250905/possible-bugs-in-writefile-and-crt-unicode-issues?forum=vcgeneral

for a thread on how this breaks MSVC console output.

Other languages have the same problem:

haskell: https://ghc.haskell.org/trac/ghc/ticket/4471
python: http://bugs.python.org/issue1602

As to fixing[1] it, maybe we could add a perlio flag that assumes successful writes are always complete, and set that for the Win32 console.

Tony

[1] working around Microsoft's long-standing bug
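
The retry behaviour described above can be reproduced without Windows. The following is a minimal sketch, not Perl's actual code: `fake_console`, `fake_console_write()` and `flush_trusting_count()` are hypothetical names, and the character-counting rule (a stray continuation byte counts as one character) is an assumption chosen to model the console, not documented WriteFile() behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the console: records every byte it is given. */
struct fake_console {
    char out[64];
    size_t used;
};

/* Count UTF-8 characters in p[0..n). A stray continuation byte or a
 * truncated sequence is counted as one character (an assumption). */
static size_t utf8_chars(const unsigned char *p, size_t n)
{
    size_t chars = 0, i = 0;
    while (i < n) {
        unsigned char c = p[i];
        size_t len = 1;                         /* ASCII or stray byte */
        if (c >= 0xC0 && c <= 0xDF) len = 2;
        else if (c >= 0xE0 && c <= 0xEF) len = 3;
        else if (c >= 0xF0 && c <= 0xF7) len = 4;
        if (len > n - i) len = n - i;           /* truncated sequence */
        i += len;
        chars++;
    }
    return chars;
}

/* Models the bug: ALL bytes are written, but the return value is the
 * number of characters, not the number of bytes. */
static int fake_console_write(struct fake_console *con,
                              const void *buf, size_t cnt)
{
    assert(con->used + cnt <= sizeof con->out);
    memcpy(con->out + con->used, buf, cnt);
    con->used += cnt;
    return (int)utf8_chars((const unsigned char *)buf, cnt);
}

/* A flush loop in the style described for PerlIOBuf_flush(): it trusts
 * the return value and retries whatever it believes is still unwritten. */
static void flush_trusting_count(struct fake_console *con,
                                 const char *buf, size_t len)
{
    while (len > 0) {
        int n = fake_console_write(con, buf, len);
        if (n <= 0)
            break;
        buf += n;
        len -= (size_t)n;
    }
}
```

Flushing the 11 bytes of "αβγxyz\r\n" through this model leaves 14 bytes on the fake console: the original 11 followed by a second "z\r\n", which matches the reported output. Flushing the 6 bytes of "αβγ" alone leaves 10 bytes, the shape of the reported "αβγ�γ�".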
From: "A. Sinan Unur" <nanis [...] cpan.org>
Subject: Re: [perl #121783] Windows: UTF-8 encoded output in cmd.exe with code page 65001 causes unexpected output
To: perlbug-followup [...] perl.org
Date: Thu, 29 May 2014 15:37:34 -0400
Thank you, Tony, I was not aware of this issue.

Now, reading the MSDN documentation[1], I see that WriteFile using synchronous IO either writes everything that was requested, or fails:

> The WriteFile function returns when one of the following conditions occur:
>
> a) The number of bytes requested is written.
> b) A read operation releases buffer space on the read end of the pipe (if the write was blocked). For more information, see the Pipes section.
> c) An asynchronous handle is being used and the write is occurring asynchronously.
> d) An error occurs.

Note that, if a pipe is broken during a synchronous write, WriteFile returns with error.

Therefore, it seems to me, if WriteFile succeeds, there are only two possible values for &len in PerlIOWin32_write. If the bytes are counted correctly, len will equal count. Otherwise, if bytes are counted incorrectly, len will not equal count even though everything was written. In any other case, WriteFile will return an error, so PerlIOWin32_write will end up returning -1 anyway.

Therefore, the right thing to do seems to be to always return count from PerlIOWin32_write if WriteFile succeeds regardless of the value of len.

In fact, I just re-built 5.20.0 with this change. However, the behavior remains the same:

    C:\Users\sinan> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3123}"
    αβγ123123

I am baffled.

-- Sinan

[1]: http://msdn.microsoft.com/en-us/library/windows/desktop/aa365747%28v=vs.85%29.aspx
-- A. Sinan Unur http://www.unur.com/sinan/
From: Tony Cook <tony [...] develop-help.com>
CC: perlbug-followup [...] perl.org
Subject: Re: [perl #121783] Windows: UTF-8 encoded output in cmd.exe with code page 65001 causes unexpected output
To: "A. Sinan Unur" <nanis [...] cpan.org>
Date: Fri, 30 May 2014 15:56:54 +1000
On Thu, May 29, 2014 at 03:37:34PM -0400, A. Sinan Unur wrote:
> Thank you, Tony, I was not aware of this issue.
>
> Now, reading the MSDN documentation[1], I see that WriteFile using
> synchronous IO either writes everything that was requested, or fails:
>
> > The WriteFile function returns when one of the following conditions occur:
> >
> > a) The number of bytes requested is written.
> > b) A read operation releases buffer space on the read end of the pipe (if the write was blocked). For more information, see the Pipes section.
> > c) An asynchronous handle is being used and the write is occurring asynchronously.
> > d) An error occurs.
>
> Note that, if a pipe is broken during a synchronous write, WriteFile
> returns with error.
>
> Therefore, it seems to me, if WriteFile succeeds, there are only two
> possible values for &len in PerlIOWin32_write. If the bytes are
> counted correctly, len will equal count. Otherwise, if bytes are
> counted incorrectly, len will not equal count even though everything
> was written. In any other case, WriteFile will return an error, so
> PerlIOWin32_write will end up returning -1 anyway.
>
> Therefore, the right thing to do seems to be to always return count
> from PerlIOWin32_write if WriteFile succeeds regardless of the value
> of len.
>
> In fact, I just re-built 5.20.0 with this change. However, the
> behavior remains the same:
>
> C:\Users\sinan> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3123}"
> αβγ123123
>
> I am baffled.

PerlIOWin32_write() is part of the :win32 layer, which is incomplete and isn't used.

Win32 uses :unix as the bottom layer for file handles, so your change makes no difference.

Tony
Subject: Re: [perl #121783] Windows: UTF-8 encoded output in cmd.exe with code page 65001 causes unexpected output
From: "A. Sinan Unur" <nanis [...] cpan.org>
Date: Fri, 30 May 2014 10:46:31 -0400
To: perlbug-followup [...] perl.org
On Fri, May 30, 2014 at 1:56 AM, Tony Cook <tony@develop-help.com> wrote:
> On Thu, May 29, 2014 at 03:37:34PM -0400, A. Sinan Unur wrote:
>> Thank you, Tony, I was not aware of this issue.
>>
...
>>
>> Therefore, the right thing to do seems to be to always return count
>> from PerlIOWin32_write if WriteFile succeeds regardless of the value
>> of len.
>>
>> In fact, I just re-built 5.20.0 with this change. However, the
>> behavior remains the same:
>>
>> C:\Users\sinan> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3123}"
>> αβγ123123
>>
>> I am baffled.
>
> PerlIOWin32_write() is part of the :win32 layer, which is incomplete
> and isn't used.
>
> Win32 uses :unix as the bottom layer for file handles so your change
> makes no difference.

Well, that explains a lot, doesn't it. I focused my attention explicitly on the WriteFile function based on the description of the bug and thought that write calls on Windows were directed to that somehow. I am just feeling my way in the dark here.

I'll look in perlio.* and perliol.h then.

Thank you,

-- A. Sinan Unur http://www.unur.com/sinan/
RT-Send-CC: perl5-porters [...] perl.org
On Tue May 27 17:08:24 2014, tonyc wrote:
> As to fixing[1] it, maybe we could add a perlio flag that assumes
> successful writes are always complete, and set that for the Win32
> console.
>
> Tony
>
> [1] working around Microsoft's long-standing bug

Here's a patch that does roughly what I suggested, though at the win32_write() level rather than at the PerlIO level.

Tony
Subject: 0001-perl-121783-work-around-a-bug-in-WriteFile.patch
From ef02acb1c78894083637626b9cda8d411b923cc2 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Thu, 16 Oct 2014 12:17:33 +1100
Subject: [perl #121783] work around a bug in WriteFile()

---
 win32/win32.c |   33 ++++++++++++++++++++++++++++++++-
 1 files changed, 32 insertions(+), 1 deletions(-)

diff --git a/win32/win32.c b/win32/win32.c
index 26d419e..a13522d 100644
--- a/win32/win32.c
+++ b/win32/win32.c
@@ -3322,7 +3322,38 @@ win32_read(int fd, void *buf, unsigned int cnt)
 DllExport int
 win32_write(int fd, const void *buf, unsigned int cnt)
 {
-    return write(fd, buf, cnt);
+    int len = write(fd, buf, cnt);
+    if (len != cnt && len > 0) {
+        /* make sure win32_isatty() doesn't fiddle with
+         * errno/GetLastError()
+         */
+        dSAVE_ERRNO;
+        if (win32_isatty(fd)) {
+            /* WriteFile() to a console returns the number of characters
+             * written to the display rather than the number of bytes
+             * written.
+             *
+             * eg. if the console CP is set to 65001, and the console
+             * font is a TrueType font, writing
+             * "\xce\xb1\xce\xb2\xce\xb3\n" will return 4 instead of 7.
+             *
+             * If the console font is a raster font, it will return the
+             * full count instead, this means we can't reliably convert
+             * the returned character count into a byte count.
+             *
+             * Since WriteConsole() (which WriteFile() appears to
+             * implemented in terms of) simply fails when supplied too
+             * much data, we assume the same holds for WriteFile().
+             *
+             * So we assume that if anything was written that the entire
+             * buffer was written correctly.
+             */
+            len = cnt;
+        }
+        RESTORE_ERRNO;
+    }
+
+    return len;
 }
 
 DllExport int
-- 
1.7.4.msysgit.0
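
The effect of the patch's "any positive short count on a console means everything was written" rule can be checked in isolation. This is a rough sketch under stated assumptions rather than the patched Perl code: `fake_console`, `fake_write()`, `write_assume_complete()` and `flush()` are hypothetical names, the `is_console` field stands in for win32_isatty(), and the fake counts characters simply as "bytes that are not UTF-8 continuation bytes".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a file descriptor; is_console plays the
 * role of win32_isatty(fd). */
struct fake_console {
    char out[64];
    size_t used;
    int is_console;
};

/* Writes every byte, but on a "console" reports characters instead of
 * bytes (simplified model: count the non-continuation bytes). */
static int fake_write(struct fake_console *con,
                      const void *buf, unsigned int cnt)
{
    const unsigned char *p = (const unsigned char *)buf;
    unsigned int i;
    int chars = 0;

    assert(con->used + cnt <= sizeof con->out);
    memcpy(con->out + con->used, buf, cnt);
    con->used += cnt;

    if (!con->is_console)
        return (int)cnt;
    for (i = 0; i < cnt; i++)
        if ((p[i] & 0xC0) != 0x80)
            chars++;
    return chars;
}

/* Same idea as the patched win32_write(): a short but positive count
 * on a console is treated as a complete write. */
static int write_assume_complete(struct fake_console *con,
                                 const void *buf, unsigned int cnt)
{
    int len = fake_write(con, buf, cnt);
    if (len > 0 && (unsigned int)len != cnt && con->is_console)
        len = (int)cnt;
    return len;
}

/* Retry loop that trusts the (now corrected) return value. */
static void flush(struct fake_console *con, const char *buf, unsigned int len)
{
    while (len > 0) {
        int n = write_assume_complete(con, buf, len);
        if (n <= 0)
            break;
        buf += n;
        len -= (unsigned int)n;
    }
}
```

With the wrapper in place, flushing the 11 bytes of "αβγxyz\r\n" calls the writer once and leaves exactly those 11 bytes on the fake console. The trade-off is the one the patch comment states: a genuinely partial console write would go unnoticed, which the patch accepts on the assumption that console writes either fail or complete.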

