UTF-16 endianess switch after line break #8038

Open
p5pRT opened this issue Jul 26, 2005 · 13 comments

@p5pRT

p5pRT commented Jul 26, 2005

Migrated from rt.perl.org#36659 (status was 'open')

Searchable as RT36659$

@p5pRT
Author

p5pRT commented Jul 26, 2005

From jr@terragate.net

Created by jr@terragate.net

This is a bug report for perl from jr@terragate.net,
generated with the help of perlbug 1.35 running under perl v5.8.7.

-----------------------------------------------------------------
There is an endianness switch after each newline when outputting UTF-16 on
Win32.

Example script:

#!/usr/bin/perl

binmode(STDOUT, ':encoding(UTF-16)');
map { print $_ . "\n" } @ARGV;

This produces the following (called with "foo bar baz" as command line params):

0000000: feff 0066 006f 006f 000d 0a00 6200 6100 ...f.o.o....b.a.
0000010: 7200 0d0a 0062 0061 007a 000d 0a r....b.a.z...

The carriage return (\r) is correctly output as 0x000d, but the newline that
follows is printed in little endian (0x0a00 instead of 0x000a). Furthermore, all
following characters are printed in LE until the next line break.

In my view this looks like a bug in the code that transparently adds a \r
before each newline on the Windows platform.

So each line break causes a switch from BE to LE and vice versa.

Output of the same script on Mac OS X (Perl 5.8.6):

0000000: feff 0066 006f 006f 000a 0062 0061 0072 ...f.o.o...b.a.r
0000010: 000a 0062 0061 007a 000a ...b.a.z..

No problem here.

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl v5.8.7:

Configured by builder at Mon Jun  6 13:36:05 2005.

Summary of my perl5 (revision 5 version 8 subversion 7) configuration:
  Platform:
    osname=MSWin32, osvers=5.0, archname=MSWin32-x86-multi-thread
    uname=''
    config_args='undef'
    hint=recommended, useposix=true, d_sigaction=undef
    usethreads=define use5005threads=undef useithreads=define
usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cl', ccflags ='-nologo -Gf -W3 -MD -Zi -DNDEBUG -O1 -DWIN32 -D_CONSOLE
-DNO_STRICT -DHAVE_DES_FCRYPT -DBUILT_BY_ACTIVESTATE -DNO_HASH_SEED
-DUSE_SITECUSTOMIZE -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO
-DPERL_MSVCRT_READFIX',
    optimize='-MD -Zi -DNDEBUG -O1',
    cppflags='-DWIN32'
    ccversion='12.00.8804', gccversion='', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=10
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='__int64',
lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf 
-libpath:"C:\Perl\lib\CORE"  -machine:x86'
    libpth=\lib
    libs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib 
comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib  netapi32.lib
uuid.lib ws2_32.lib mpr.lib winmm.lib  version.lib odbc32.lib odbccp32.lib
msvcrt.lib
    perllibs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib 
comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib  netapi32.lib
uuid.lib ws2_32.lib mpr.lib winmm.lib  version.lib odbc32.lib odbccp32.lib
msvcrt.lib
    libc=msvcrt.lib, so=dll, useshrplib=yes, libperl=perl58.lib
    gnulibc_version='undef'
  Dynamic Linking:
    dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug -opt:ref,icf 
-libpath:"C:\Perl\lib\CORE"  -machine:x86'

Locally applied patches:
    ACTIVEPERL_LOCAL_PATCHES_ENTRY
    #  if !defined(PERL_DARWIN)
    Iin_load_module moved for compatibility with build 806
    #  endif
    #  if defined(__hpux)
    Avoid signal flag SA_RESTART for older versions of HP-UX
    #  endif
    PerlEx hacks for CGI::Carp
    Less verbose ExtUtils::Install and Pod::Find
    instmodsh upgraded from ExtUtils-MakeMaker-6.25
    24699 ICMP_UNREACHABLE handling in Net::Ping
    21540 Fix backward-compatibility issues in if.pm


@INC for perl v5.8.7:
    C:/Perl/lib
    C:/Perl/site/lib
    .


Environment for perl v5.8.7:
    HOME (unset)
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=C:\Perl\bin\;C:\WINNT\system32;C:\WINNT;C:\WINNT\System32\Wbem
    PERL_BADLANG (unset)
    SHELL (unset)


@p5pRT
Author

p5pRT commented Jul 26, 2005

From shouldbedomo@mac.com

On 2005-07-26, at 11:31, Jeremias Reith (via RT) wrote:

> There is an endianness switch after each newline while outputting UTF-16 on
> Win32.
>
> Example script:
>
> #!/usr/bin/perl
>
> binmode(STDOUT, ':encoding(UTF-16)');
> map { print $_ . "\n" } @ARGV;
>
> This produces the following (called with "foo bar baz" as command line params):
>
> 0000000: feff 0066 006f 006f 000d 0a00 6200 6100 ...f.o.o....b.a.
> 0000010: 7200 0d0a 0062 0061 007a 000d 0a r....b.a.z...
>
> The carriage return (\r) is correctly outputted as 0xd but after that the
> newline is printed in little endian (0xa00 instead of 0xa). Furthermore all
> following chars are printed in LE until the next line break.
>
> In my point of view this looks like a bug in the code that transparently adds
> a \r for each newline on the windows platform.
>
> So each line break causes a switch from BE to LE and vice versa.
>
> Output of the same script on Mac OS X (Perl 5.8.6):
>
> 0000000: feff 0066 006f 006f 000a 0062 0061 0072 ...f.o.o...b.a.r
> 0000010: 000a 0062 0061 007a 000a ...b.a.z..
>
> No problem here.

The issue seems to be that the implicit push of the :crlf layer onto
the handle takes place before the explicit push of :encoding(UTF-16)
when, to get the correct results, it should happen after. Some more
tests on Mac OS X perl 5.8.6:

$ perl -we 'binmode(STDOUT, ":crlf"); binmode(STDOUT, ":encoding(UTF-16)"); map { print $_ . "\n" } @ARGV;' foo bar baz | od -x
0000000 feff 0066 006f 006f 000d 0a00 6200 6100
0000020 7200 0d0a 0062 0061 007a 000d 0a00
0000035
$ perl -we 'binmode(STDOUT, ":encoding(UTF-16)"); binmode(STDOUT, ":crlf"); map { print $_ . "\n" } @ARGV;' foo bar baz | od -x
0000000 feff 0066 006f 006f 000d 000a 0062 0061
0000020 0072 000d 000a 0062 0061 007a 000d 000a
0000040
$ perl -we 'binmode(STDOUT, ":raw:encoding(UTF-16):crlf"); map { print $_ . "\n" } @ARGV;' foo bar baz | od -x
0000000 feff 0066 006f 006f 000d 000a 0062 0061
0000020 0072 000d 000a 0062 0061 007a 000d 000a
0000040

The first is wrong, and what I suspect is happening due to the
implicit push on Windows; the second, with the pushes swapped, gives
the right answer; the third does too and is what I think you need to
get around the issue. (The :raw is a no-op for Mac OS X, but won't be
for Windows.) As I don't have a Windows box to hand, please try it
and get back to us with the result.

Thanks.
--
Dominic Dunlop
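
The ordering effect shown by the od output in this message can also be checked without a shell pipeline. The following is a minimal sketch, assuming that in-memory filehandles accept the same layer specifications as STDOUT. `bytes_via` is a hypothetical helper name, and UTF-16BE is used instead of UTF-16 only to keep the BOM out of the output.

```perl
use strict;
use warnings;

# Write "foo\n" through the given layer stack into an in-memory buffer
# and return the resulting bytes as space-separated hex pairs.
sub bytes_via {
    my ($layers) = @_;
    my $buf = '';
    open(my $fh, ">$layers", \$buf) or die "open failed: $!";
    print {$fh} "foo\n";
    close($fh) or die "close failed: $!";
    return join ' ', unpack '(H2)*', $buf;
}

# :crlf pushed first, so it sits below :encoding and sees encoded bytes.
print "crlf below encoding: ", bytes_via(':crlf:encoding(UTF-16BE)'), "\n";

# :crlf pushed last, so it sits above :encoding and sees characters.
print "crlf above encoding: ", bytes_via(':encoding(UTF-16BE):crlf'), "\n";
```

If the sketch behaves like the STDOUT tests above, the first stack should end in the three bytes `00 0d 0a` and the second in `00 0d 00 0a`.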

@p5pRT
Author

p5pRT commented Jul 26, 2005

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Author

p5pRT commented Jul 27, 2005

From jr@terragate.net

C:\>perl -we "binmode(STDOUT, ':crlf'); binmode(STDOUT, ':encoding(UTF-16)'); map { print $_ . \"\n\" } @ARGV;" foo bar baz > 1.txt

@khwilliamson
Contributor

This persists in 5.31

@Leont
Contributor

Leont commented Nov 27, 2019

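The problem here is that this binmode call effectively produces the layer stack :crlf:encoding(UTF-16), when what you want is probably :encoding(UTF-16):crlf. It's not the endianness that's wrong (though I can see why you would think that); the CRLF translation is done on the encoded bytes instead of on the text. The encoded newline is 00 0a, the :crlf layer turns the 0a byte into 0d 0a, and the extra byte shifts every following code unit by one byte until the next newline shifts it back.

Calling binmode with :raw:encoding(...):crlf should give the desired result.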
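
To see why the order matters, the two stacks can be imitated with plain string operations. This is only a sketch of the idea, not how PerlIO implements its layers: `crlf_on_bytes` and `crlf_on_text` are hypothetical helper names, the LF to CRLF translation is approximated with a substitution, and UTF-16BE is used so that no BOM appears.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex pairs.
sub hex_bytes { return join ' ', unpack '(H2)*', $_[0] }

# Imitates :crlf:encoding(...): characters are encoded first, then the
# LF -> CRLF translation runs over the already encoded bytes.
sub crlf_on_bytes {
    my ($text, $encoding) = @_;
    my $bytes = encode($encoding, $text);
    $bytes =~ s/\x0A/\x0D\x0A/g;
    return $bytes;
}

# Imitates :encoding(...):crlf: the LF -> CRLF translation runs over
# characters, and the result is encoded afterwards.
sub crlf_on_text {
    my ($text, $encoding) = @_;
    (my $copy = $text) =~ s/\x0A/\x0D\x0A/g;
    return encode($encoding, $copy);
}

my $text = "foo\nbar\n";
print hex_bytes(crlf_on_bytes($text, 'UTF-16BE')), "\n";
# 00 66 00 6f 00 6f 00 0d 0a 00 62 00 61 00 72 00 0d 0a
print hex_bytes(crlf_on_text($text, 'UTF-16BE')), "\n";
# 00 66 00 6f 00 6f 00 0d 00 0a 00 62 00 61 00 72 00 0d 00 0a
```

The first line reproduces the byte sequence from the original report (minus the BOM): nothing is byte-swapped, the stream is just one byte longer after each newline.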

@khwilliamson
Contributor

Confirmed: that does give the desired result.

So what should we do about this ticket? Is there some documentation that should be improved?

@Leont
Contributor

Leont commented Dec 2, 2019

I think there are three options:

  1. Make :encoding detect an active :crlf below itself and insert itself underneath instead of on top. It probably should have done this from the start, but changing it now raises backwards-compatibility issues.
  2. Make :encoding detect an active :crlf below itself and warn the user that this probably doesn't do what they want.
  3. Document that this doesn't DWIM.

Actually, there is a fourth option: first doing 2 and later doing 1.
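
Until something like option 2 exists in :encoding itself, a similar check can be approximated in user code. The sketch below is an assumption-laden illustration, not a proposed patch: `push_encoding_checked` is a hypothetical helper, the regex is a crude stand-in for a real "not ASCII-safe" test, and the check relies on `PerlIO::get_layers` reporting an active :crlf layer (on Windows a disabled :crlf layer may still be listed, so the helper may warn too often there).

```perl
use strict;
use warnings;

# Hypothetical helper: push :encoding(...) onto a handle, but warn first
# when :crlf is already on the stack and the encoding is UTF-16/UTF-32.
sub push_encoding_checked {
    my ($fh, $encoding) = @_;
    my @layers = PerlIO::get_layers($fh, output => 1);

    # Crude stand-in for an "is this encoding ASCII-safe?" query.
    my $unsafe = $encoding =~ /^UTF-?(?:16|32)/i;

    if ($unsafe && grep { $_ eq 'crlf' } @layers) {
        warn ":encoding($encoding) pushed on top of :crlf; "
           . "line endings will be translated on encoded bytes\n";
    }
    binmode($fh, ":encoding($encoding)") or die "binmode failed: $!";
    return;
}

# Usage: an in-memory handle with an explicit :crlf layer triggers the warning.
my $buf = '';
open(my $fh, '>:crlf', \$buf) or die "open failed: $!";
push_encoding_checked($fh, 'UTF-16BE');
close($fh) or die "close failed: $!";
```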

@richardleach
Contributor

Option 4 sounds ideal, even if it means having quite a few releases between the two steps.

@Leont
Contributor

Leont commented Dec 8, 2019

A complication with option 2 (and by extension option 4) is that it only does the wrong thing for encodings that are not ASCII-safe (such as UTF-16 and UTF-32). For UTF-8 and the ISO-8859 encodings it does the right thing, because the order of the operations doesn't matter for them.

@toddr
Member

toddr commented Feb 13, 2020

I'm assuming that while Windows is heavily affected here, it's not a Windows bug in particular, right?

@Leont
Contributor

Leont commented Feb 14, 2020

> I'm assuming that while Windows is heavily affected here, it's not a Windows bug in particular, right?

Technically no, practically yes. It takes effort to achieve this on other platforms; it's trivial to do it accidentally on Windows.

@Leont
Contributor

Leont commented Feb 14, 2020

> A complication with option 2 (and by extension option 4) is that it only does the wrong thing for encodings that are not ASCII-safe (such as UTF-16 and UTF-32). For UTF-8 and the ISO-8859 encodings it does the right thing, because the order of the operations doesn't matter for them.

I think resolving this will require Encode objects to have a method to check whether they're ASCII-safe. It should only warn for encodings that aren't.
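
No such method exists on Encode objects today, but the property can be probed from the outside. The following is a rough sketch under that assumption: `is_ascii_transparent` is a hypothetical function, and it only tests that each ASCII character encodes to the identical single byte, which is a necessary condition for ASCII-safety rather than a full definition of it.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical probe, not part of Encode: true if every ASCII character
# (0x00 .. 0x7F) encodes to the same single byte in the given encoding.
sub is_ascii_transparent {
    my ($name) = @_;
    my $enc = find_encoding($name) or die "unknown encoding: $name";
    my $ascii = join '', map { chr } 0 .. 0x7F;
    my $copy  = $ascii;    # encode a copy so the reference string stays intact
    return $enc->encode($copy) eq $ascii;
}

for my $name (qw(UTF-8 iso-8859-1 UTF-16 UTF-16BE UTF-32LE)) {
    printf "%-10s %s\n", $name,
        is_ascii_transparent($name) ? 'ASCII-safe' : 'not ASCII-safe';
}
```

A check along these lines would let a warning fire for UTF-16 and UTF-32 while staying quiet for UTF-8 and ISO-8859-1, which is the distinction the comment above asks for.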

@Leont Leont self-assigned this Feb 15, 2020
@xenu xenu removed the affects-5.8 label Nov 19, 2021