
[Bug Report] Bad \n convert, using UTF-16 on Win32 #10869

Open
p5pRT opened this issue Dec 1, 2010 · 18 comments

p5pRT commented Dec 1, 2010

Migrated from rt.perl.org#80058 (status was 'open')

Searchable as RT80058$

p5pRT commented Dec 1, 2010

From mezmerik@gmail.com

Created by mezmerik@gmail.com

Hello,

I'm using ActivePerl 5.12.2 on Windows 7.

Perl for Win32 has a feature to convert a single "LF" (without a
preceding "CR") to "CRLF", but my perl seems to determine what "LF" is
in UTF-16 incorrectly. In ANSI and UTF-8 files, LF's byte is "0A";
in UTF-16, LF should be "000A" (Big Endian) or "0A00" (Little
Endian), but my perl seems to regard a single "0A" byte as LF too! Thus, it
will do the wrong thing, which is adding a "0D" before the "0A" (my perl
also regards a lone "0D" as CR; the right CR in UTF-16 should be "000D").

Here's the test program:

open FH_IN, "<:encoding(utf16be)", "src.txt" or die;
open FH_OUT, ">:encoding(utf16be)", "output.txt" or die;

while (<FH_IN>) {
   print FH_OUT $_;
}

I think "src.txt" and "output.txt" should be identical. But not.

1) if "src.txt" is only two CRLFs, its bytecodes are "FE FF 00 0D 00
0A 00 0D 00 0A"; the "output.txt" becomes "FE FF 00 0D 00 0D 0A 00 0D
00 0D 0A", each "0A" gets a unnecessary and wrong preceding "0D".

2) if "src.txt" is only one chinese charater "�", whose unicode and
UTF-16BE bytecode is "4E 0A", with BOM, the file's whole bytes are "FE
FF 4E 0A"; the "output.txt" becomes "FE FF 4E 0D 0A".

Modify the program code:

while (<FH_IN>) {
   chomp;
   print FH_OUT $_;
}

1) "src.txt" which is only two CRLFs, "FE FF 00 0D 00 0A 00 0D 00 0A"
becomes "FE FF 00 0D 00 0D". So, chomp only get rid of LF(00 0A). it
should erase 4 bytes "00 0D 00 0A".

That's what I found when working with UTF-16 files. I'd appreciate your
efforts to improve Unicode support. Many thanks!

 Joey

Perl Info

Flags:
   category=core
   severity=low

Site configuration information for perl 5.12.2:

Configured by SYSTEM at Mon Sep  6 23:12:49 2010.

Summary of my perl5 (revision 5 version 12 subversion 2) configuration:

 Platform:
   osname=MSWin32, osvers=5.00, archname=MSWin32-x86-multi-thread
   uname=''
   config_args='undef'
   hint=recommended, useposix=true, d_sigaction=undef
   useithreads=define, usemultiplicity=define
   useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
   use64bitint=undef, use64bitall=undef, uselongdouble=undef
   usemymalloc=n, bincompat5005=undef
 Compiler:
   cc='cl', ccflags ='-nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32
-D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE
-DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO
-D_USE_32BIT_TIME_T -DPERL_MSVCRT_READFIX',
   optimize='-MD -Zi -DNDEBUG -O1',
   cppflags='-DWIN32'
   ccversion='12.00.8804', gccversion='', gccosandvers=''
   intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
   d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=8
   ivtype='long', ivsize=4, nvtype='double', nvsize=8,
Off_t='__int64', lseeksize=8
   alignbytes=8, prototype=define
 Linker and Libraries:
   ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf
-libpath:"C:\Perl\lib\CORE"  -machine:x86'
   libpth=\lib
   libs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib
comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib
netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib  version.lib
odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib
   perllibs=  oldnames.lib kernel32.lib user32.lib gdi32.lib
winspool.lib  comdlg32.lib advapi32.lib shell32.lib ole32.lib
oleaut32.lib  netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib
version.lib odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib
   libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl512.lib
   gnulibc_version=''
 Dynamic Linking:
   dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
   cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug
-opt:ref,icf  -libpath:"C:\Perl\lib\CORE"  -machine:x86'

Locally applied patches:
   ACTIVEPERL_LOCAL_PATCHES_ENTRY
   1fd8fa4 Add Wolfram Humann to AUTHORS
   f120055 make string-append on win32 100 times faster
   a2a8d15 Define _USE_32BIT_TIME_T for VC6 and VC7
   007cfe1 Don't pretend to support really old VC++ compilers
   6d8f7c9 Get rid of obsolete PerlCRT.dll support
   d956618 Make Term::ReadLine::findConsole fall back to STDIN if
/dev/tty can't be opened
   321e50c Escape patch strings before embedding them in patchlevel.h


@INC for perl 5.12.2:
   C:/Perl/site/lib
   C:/Perl/lib
   .


Environment for perl 5.12.2:
   HOME (unset)
   LANG (unset)
   LANGUAGE (unset)
   LD_LIBRARY_PATH (unset)
   LOGDIR (unset)
   PATH=C:\Program Files\ActiveState Komodo IDE
6\;C:\Perl\site\bin;C:\Perl\bin;C:\Program Files\NVIDIA
Corporation\PhysX\Common;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\
   PERL_BADLANG (unset)
   SHELL (unset)

p5pRT commented Sep 20, 2013

From @tonycoz

On Wed Dec 01 04:38:26 2010, mezmerik@gmail.com wrote:

This is a bug report for perl from mezmerik@gmail.com,
generated with the help of perlbug 1.39 running under perl 5.12.2.

-----------------------------------------------------------------
[Please describe your issue here]

Hello,

I'm using ActivePerl 5.12.2 on Windows 7.

Perl for Win32 has a feature to convert a single "LF" (without a
preceding "CR") to "CRLF", but my perl seems to determine what "LF" is
in UTF-16 incorrectly. In ANSI and UTF-8 files, LF's byte is "0A";
in UTF-16, LF should be "000A" (Big Endian) or "0A00" (Little
Endian), but my perl seems to regard a single "0A" byte as LF too! Thus, it
will do the wrong thing, which is adding a "0D" before the "0A" (my perl
also regards a lone "0D" as CR; the right CR in UTF-16 should be "000D").

I believe this is a known problem with the way the default :crlf layer
works on Win32.

Since the layer is immediately on top of the :unix layer, it's working
at a byte level, adding CRs to the bytes *after* translation from
characters.

This means you get other broken behaviour, such as inserting a 0d byte
before characters in the U+0Axx range:

C:\Users\tony>perl -e "open my $fh, '>:encoding(utf16be)', 'foo.txt' or die $!; print $fh qq(\x{a90}hello\n)"

C:\Users\tony>perl -e "binmode STDIN; $/ = \16; while (<>) { print unpack('H*', $_), qq'\n' }" <foo.txt
0d0a9000680065006c006c006f000d0a

The workaround (or perhaps the only real fix) is to clear the :crlf
layer and add it back on above your unicode layer:

C:\Users\tony>perl -e "open my $fh, '>:raw:encoding(utf16be):crlf', 'foo.txt' or die $!; print $fh qq(\x{a90}hello\n)"

C:\Users\tony>perl -e "binmode STDIN; $/ = \16; while (<>) { print unpack('H*', $_), qq'\n' }" <foo.txt
0a9000680065006c006c006f000d000a
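
For the script in the original report, a minimal sketch of the same workaround
(same filenames; the read side is reordered the same way, which I believe is
needed so that CRLF handling also happens on characters rather than bytes):

use strict;
use warnings;

# :raw clears the default :crlf, the encoding layer goes on next, and
# :crlf is pushed back on top so newline translation sees characters.
open my $fh_in,  '<:raw:encoding(UTF-16BE):crlf', 'src.txt'    or die $!;
open my $fh_out, '>:raw:encoding(UTF-16BE):crlf', 'output.txt' or die $!;

while (<$fh_in>) {
    print {$fh_out} $_;
}

close $fh_out or die $!;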

The only way I can see to fix this would be to make :crlf special, so it
always remains on top, but I suspect that's going to be fairly ugly from
an implementation point of view - do we make other layers special too?

(/me avoids going wild with speculation)

Tony

p5pRT commented Sep 20, 2013

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Sep 20, 2013

From @nwc10

On Thu, Sep 19, 2013 at 10:27:32PM -0700, Tony Cook via RT wrote:

I believe this is a known problem with the way the default :crlf layer
works on Win32.

Since the layer is immediately on top of the :unix layer, it's working
at a byte level, adding CRs to the bytes *after* translation from
characters.

The only way I can see to fix this would be to make :crlf special, so it
always remains on top, but I suspect that's going to be fairly ugly from
an implementation point of view - do we make other layers special too?

I guess it's a kind of (emergent) flaw with the whole design of layers.
In that a layer can meaningfully be any of

  text -> binary (eg Unicode -> UTF-8)
  binary -> binary (eg gzip)
  binary -> text (eg uuencode, or these days Base64)
  text -> text (pedantically rot13)

(in terms of output)
(I think I've read that Python 3 went too far the other way on this by
banning all but the first)

I think on that categorisation LF -> CRLF would be text -> text.

(Certainly for output it's not binary -> text or binary -> binary, as it
corrupts binary data.)

(/me avoids going wild with speculation)

So the design of layers ought to categorise their feed and not-feed* sides
as text or binary, and forbid plugging the wrong sorts together.

If we had that, then attempting to push UTF-16 atop CRLF would be an error
immediately. But, of course, the DWIM approach is that pushing UTF-16 onto
CRLF would cause UTF-16 to burrow under CRLF, given that issuing an error
of the form of "I can see what you're trying to do, but I'm not going to
help you" isn't very nice.

I believe that the second half of Jarkko's quote applies​:

  Documenting bugs before they're found is kinda hard.
  Can I borrow your time machine? Mine won't start.

Failing that, we wrestle with the punchline to the "Irishman giving directions"
joke, and wonder how to retrofit sanity.

Gah. It keeps coming back to this - to handle Unicode properly, you need a
type system. Or at least enough of a type system to distinguish "text" from
"binary".

Nicholas Clark

* feed is a word visibly distinct from input. I'm failing to find a suitable
  antonym for this meaning of feed.
 

p5pRT commented Sep 20, 2013

From @ikegami

On Fri, Sep 20, 2013 at 8:58 AM, Nicholas Clark <nick@ccl4.org> wrote:

Gah. It keeps coming back to this - to handle Unicode properly, you need a
type system. Or at least enough of a type system to distinguish "text" from
"binary".

I fully agree, but I don't see how that would help here. Wouldn't that
prevent :crlf (a text processor) from being placed on a binary handle as
Perl does?

:crlf is a special case. It would therefore make sense for :encoding to
handle it specially and "burrow under" it, as you called it. This is
independent of the existence of a text/binary string semantics system.

p5pRT commented Sep 20, 2013

From @nwc10

On Fri, Sep 20, 2013 at 10:18:33AM -0400, Eric Brine wrote:

On Fri, Sep 20, 2013 at 8:58 AM, Nicholas Clark <nick@ccl4.org> wrote:

Gah. It keeps coming back to this - to handle Unicode properly, you need a
type system. Or at least enough of a type system to distinguish "text" from
"binary".

I fully agree, but I don't see how that would help here. Wouldn't that
prevent :crlf (a text processor) from being placed on a binary handle as
Perl does?

Yes, it would, if handles default to binary. But I think that then that's
part of the mess. In that in a Unicode world, every platform needs to care
about whether a handle is binary or text. And the old convenience of "just"
opening a file, without (at that point) caring whether it's a CSV or a JPEG
goes out of the window.

:crlf is a special case. It would therefore make sense for :encoding to
handle it specially and "burrow under" it, as you called it. This is
independent of the existence of text/binary string semantic system.

Problem is that it's no more of a special case than a layer that converts
all Unicode line endings to LF. Or a layer that does NFD. It's not unique.
That's what's bugging me.

You sort of need some sort of "apply layer" logic, which assumes

FILE -> [binary -> binary] {0,*} -> [binary -> text] -> [text -> text] {0,*}

at which point, applying any binary->binary or text->text layer *stacks* it
at the right point, and applying any binary->text layer swaps out the
previous.

(And I might be missing one in that diagram - maybe FILE should be
[source -> binary], which would solve the :stdio vs :unix mess)

And if you want to apply a [text -> binary] layer, or build something funkier
than the above model permits, or for (some reason) remove a layer, you use a
second API which does "build the entire stack", or "splice".
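
A sketch of that "apply layer" logic under the same assumptions (hypothetical
helper functions, hard-coded layer kinds, not PerlIO's real API):

# FILE -> [binary -> binary]{0,*} -> [binary -> text] -> [text -> text]{0,*}
my %kind = (gzip => 'bin2bin', encoding => 'bin2text', crlf => 'text2text');
my %pipe = (bin2bin => [], bin2text => undef, text2text => []);

sub apply_layer {
    my ($name) = @_;
    (my $base = $name) =~ s/\(.*\)\z//;          # "encoding(UTF-16BE)" -> "encoding"
    my $k = $kind{$base} or die "unknown layer $name\n";
    if ($k eq 'bin2text') {
        $pipe{bin2text} = $name;                 # swaps out the previous conversion
    } else {
        push @{ $pipe{$k} }, $name;              # stacks in its own region
    }
}

sub pipeline {
    return ('FILE', @{ $pipe{bin2bin} },
            (defined $pipe{bin2text} ? $pipe{bin2text} : ()),
            @{ $pipe{text2text} });
}

apply_layer('crlf');                    # the platform default text -> text layer
apply_layer('encoding(UTF-16BE)');      # fills the conversion slot *below* crlf
print join(' -> ', pipeline()), "\n";   # FILE -> encoding(UTF-16BE) -> crlf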

Nicholas Clark

p5pRT commented Sep 20, 2013

From @Tux

On Fri, 20 Sep 2013 15:29:16 +0100, Nicholas Clark <nick@ccl4.org> wrote:

On Fri, Sep 20, 2013 at 10:18:33AM -0400, Eric Brine wrote:

On Fri, Sep 20, 2013 at 8:58 AM, Nicholas Clark <nick@ccl4.org> wrote:

Gah. It keeps coming back to this - to handle Unicode properly, you need a
type system. Or at least enough of a type system to distinguish "text" from
"binary".

I fully agree, but I don't see how that would help here. Wouldn't that
prevent :crlf (a text processor) from being placed on a binary handle as
Perl does?

Yes, it would, if handles default to binary. But I think that then that's
part of the mess. In that in a Unicode world, every platform needs to care
about whether a handle is binary or text. And the old convenience of "just"
opening a file, without (at that point) caring whether it's a CSV or a JPEG
goes out of the window.

Not that it happens a lot, but in CSV there is no overall encoding. The
CSV format allows you to pass every line/record in a different encoding,
or even every field within a line/record. Not that that would be a sane
thing to do, but the definition allows it :(

What *does* happen (all too often) is that the lines are exported in
CSV with iso-8859-1 when every character falls in that range and in
UTF-8 when the record contains a field that contains a character
outside of the iso range. The decoder now has to check twice for
validity. These generators suck, but I have to deal with their output
on a daily basis.

:crlf is a special case. It would therefore make sense for :encoding to
handle it specially and "burrow under" it, as you called it. This is
independent of the existence of text/binary string semantic system.

Problem is that it's no more of a special case than a layer that converts
all Unicode line endings to LF. Or a layer that does NFD. It's not unique.
That's what's bugging me.

You sort of need some sort of "apply layer" logic, which assumes

FILE -> [binary -> binary] {0,*} -> [binary -> text] -> [text -> text] {0,*}

at which point, applying any binary->binary or text->text layer *stacks* it
at the right point, and applying any binary->text layer swaps out the
previous.

(And I might be missing one in that diagram - maybe FILE should be
[source -> binary], which would solve the :stdio vs :unix mess)

And if you want to apply a [text -> binary] or build more funky than the
above model permits, or for (some reason) remove a layer, you use a second
API which does "build the entire stack", or "splice".

Nicholas Clark

--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.19 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented Sep 20, 2013

From zefram@fysh.org

Nicholas Clark wrote:

at which point, applying any binary->binary or text->text layer *stacks* it
at the right point, and applying any binary->text layer swaps out the
previous.

It sounds like the concept of "applying" a layer has been overloaded
beyond usefulness. Inserting layers at different positions constitutes
different operations, and replacing a layer (or group of layers) is different again.

-zefram

p5pRT commented Sep 20, 2013

From @nwc10

On Fri, Sep 20, 2013 at 04:00:25PM +0100, Zefram wrote:

Nicholas Clark wrote:

at which point, applying any binary->binary or text->text layer *stacks* it
at the right point, and applying any binary->text layer swaps out the
previous.

It sounds like the concept of "applying" a layer has been overloaded
beyond usefulness. Inserting layers at different positions are different
operations, and replacing a layer (or group of layers) is different again.

Yes, half the time I agree with you here. It's too complex to be useful.

But it's bugging me that the only 2 frequent operations a programmer does are

1) State that the file is binary
2) State that the file is text in a particular encoding

with the bothersome problem that the default is text, with platform specific
line ending post-processing, which should be retained on a text file even if
the encoding is changed from the default.

And that (1) and (2) above ought to be easy to do, without needing to resort
to a more flexible syntax.

Nicholas Clark

p5pRT commented Sep 20, 2013

From @cpansprout

On Fri Sep 20 05:59:34 2013, nicholas wrote:

I believe that the second half of Jarkko's quote applies:

  Documenting bugs before they're found is kinda hard.
  Can I borrow your time machine? Mine won't start.

Failing that, we wrestle with the punchline to the "Irishman giving directions"
joke, and wonder how to retrofit sanity.

Gah. It keeps coming back to this - to handle Unicode properly, you need a
type system. Or at least enough of a type system to distinguish "text" from
"binary".

I usually just work around the whole issue with explicit encode/decode.
Also, where possible, I avoid UTF-16 and Windows. Life is so much
simpler that way!
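
For reference, a minimal sketch of that explicit encode/decode approach (not
from the thread; it uses the core Encode module and does the newline
translation by hand):

use Encode qw(decode encode);

# Read raw bytes and decode by hand.
open my $in, '<:raw', 'src.txt' or die $!;
my $octets = do { local $/; <$in> };
close $in;

# 'UTF-16' (no endianness) consumes the leading BOM; 'UTF-16BE' would
# keep U+FEFF as an ordinary character at the start of the string.
my $text = decode('UTF-16', $octets);
$text =~ s/\r\n/\n/g;                          # CRLF -> LF on the way in

# ... work with $text as a character string ...

(my $out_text = $text) =~ s/\n/\r\n/g;         # LF -> CRLF on the way out
open my $out, '>:raw', 'output.txt' or die $!;
print {$out} encode('UTF-16BE', "\x{FEFF}" . $out_text);   # write the BOM explicitly
close $out or die $!;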

Nicholas Clark

* feed is a word visibly distinct from input. I'm failing to find a suitable
  antonym for this meaning of feed.

The opposite of feed is usually starve. But that doesn't work here.
Maybe spew? Extort?

--

Father Chrysostomos

p5pRT commented Sep 20, 2013

From @Leont

On Fri, Sep 20, 2013 at 7:27 AM, Tony Cook via RT <perlbug-followup@perl.org> wrote:

Hello,

I'm using ActivePerl 5.12.2 on Windows 7.

Perl for Win32 has a feature to convert a single "LF" (without a
preceding "CR") to "CRLF", but my perl seems to determine what "LF" is
in UTF-16 incorrectly. In ANSI and UTF-8 files, LF's byte is "0A";
in UTF-16, LF should be "000A" (Big Endian) or "0A00" (Little
Endian), but my perl seems to regard a single "0A" byte as LF too! Thus, it
will do the wrong thing, which is adding a "0D" before the "0A" (my perl
also regards a lone "0D" as CR; the right CR in UTF-16 should be "000D").

I believe this is a known problem with the way the default :crlf layer
works on Win32.

Since the layer is immediately on top of the :unix layer, it's working
at a byte level, adding CRs to the bytes *after* translation from
characters.

This means you get other broken behaviour, such as inserting a 0d byte
before characters in the U+0Axx range:

C:\Users\tony>perl -e "open my $fh, '>:encoding(utf16be)', 'foo.txt' or die $!; print $fh qq(\x{a90}hello\n)"

C:\Users\tony>perl -e "binmode STDIN; $/ = \16; while (<>) { print unpack('H*', $_), qq'\n' }" <foo.txt
0d0a9000680065006c006c006f000d0a

The workaround (or perhaps the only real fix) is to clear the :crlf
layer and add it back on above your unicode layer:

C:\Users\tony>perl -e "open my $fh, '>:raw:encoding(utf16be):crlf', 'foo.txt' or die $!; print $fh qq(\x{a90}hello\n)"

All correct. I wrote the ":text" pseudo layer that shortens that to
':text(utf-16be)', except that it will not do that whole dance on unix
systems.

Also note that before 5.14, binmode $fh, ':raw:encoding(utf-16be):crlf' did
not work correctly.
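
A usage sketch, assuming the pseudo layer is installed and takes its argument
exactly as written above (it is not a core layer):

# Hypothetical usage of the ":text" pseudo layer described above; on Win32
# it would expand to the :raw:encoding(...):crlf dance, elsewhere to just
# the encoding layer.
open my $fh, '<:text(utf-16be)', 'src.txt' or die $!;
while (my $line = <$fh>) {
    chomp $line;            # with :crlf correctly on top, chomp strips a plain "\n"
    print "read: $line\n";
}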

The only way I can see to fix this would be to make :crlf special, so it
always remains on top, but I suspect that's going to be fairly ugly from
an implementation point of view - do we make other layers special too?

(/me avoids going wild with speculation)

The way to correct this is to make open be sensible. That is not a trivial
problem.

Leon

p5pRT commented Sep 20, 2013

From robertmay@cpan.org

On 20 September 2013 16:24, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:

* feed is a word visibly distinct from input. I'm failing to find a suitable
  antonym for this meaning of feed.

The opposite of feed is usually starve. But that doesn't work here.
Maybe spew? Extort?

drain?

p5pRT commented Sep 20, 2013

From @ap

* Nicholas Clark <nick@ccl4.org> [2013-09-20 15:00]:

I guess it's a kind of (emergent) flaw with the whole design of
layers. In that a layer can meaningfully be any of

  text -> binary    (eg Unicode -> UTF-8)
  binary -> binary  (eg gzip)
  binary -> text    (eg uuencode, or these days Base64)
  text -> text      (pedantically rot13)

(in terms of output)
(I think I've read that Python 3 went too far the other way on this by
banning all but the first)

I think on that categorisation LF -> CRLF would be text -> text.

(Certainly for output it's not binary -> text or binary -> binary, as
it corrupts binary data.)

So the design of layers ought to categorise their feed and not-feed*
sides as text or binary, and forbid plugging the wrong sorts together.

If we had that, then attempting to push UTF-16 atop CRLF would be an
error immediately. But, of course, the DWIM approach is that pushing
UTF-16 onto CRLF would cause UTF-16 to burrow under CRLF, given that
issuing an error of the form of "I can see what you're trying to do,
but I'm not going to help you" isn't very nice.

I believe that the second half of Jarkko's quote applies:

  Documenting bugs before they're found is kinda hard.
  Can I borrow your time machine? Mine won't start.

Failing that, we wrestle with the punchline to the "Irishman giving
directions" joke, and wonder how to retrofit sanity.

Do we want to provide first-class support for layer cakes like this?

  (text → binary) (binary → text) (text → binary)

Because if we don't, then the solution would seem to be very easy, at
least at the conceptual level: make each handle have two stacks, one
for (text → text) layers and another one for (binary → binary), plus
a single slot for (text → binary). And then you *set* this single slot
(no pushing/popping there), plus you push/pop the other types of layers
on their respective stacks.

In that design, we also have a (text → binary) layer (named "derp"? :-))
whose output direction implements the current behaviour of `print` and
friends, wherein they try to downgrade a string for output but warn and
output the utf8 buffer as bytes if they can't.

(As far as I can see there is no reason to have (binary → text) layers
if layers can go in both directions depending on whether they're applied
to input or output (as is currently the case - you push :encoding(UTF-8)
no matter whether it's an input or output handle).)

"Derp" is then the default (text → binary) layer for handles on which
nothing else has been set. This solves the question of "how do I set
:crlf on an otherwise unconfigured handle if layers are typed?"

Note that this solves the problem with pushing UTF-16 onto CRLF, because
in this design you don't do that - you *set* UTF-16 for the conversion
slot, and in any case the CRLF layer is in a stack by itself, so if you
push any (binary → binary) layers, they will push "under" the CRLF layer
automatically. So by this design PerlIO will DTRT automatically.

I'd suggest a migration in which (text → text), (binary → binary) and
(text → binary) layers all move into different namespaces, so that once
completed it becomes impossible to even *say* the wrong thing.

Note that even layer cakes can be supported as a second-class construct,
by providing a reverse-direction pseudo-layer that implements a nested
layer stack, which you can then push onto a layer stack as a unit.

(This even neatly solves the question of how code is supposed to keep
track of the relative ordering of layers in really complex situations.
If you build layer cakes out of (possibly recursively) nested
stacks, then each participating bit of code only needs to care about the
nested stack it is managing itself, and by virtue of the fixed order of
layers within a handle or a pseudo-layer, the overall resulting pipeline
is guaranteed to assemble into something that makes sense.)
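
A structural sketch of the two-stacks-plus-a-slot idea (purely illustrative;
the package and method names are invented, this is not an actual PerlIO API):

package Hypothetical::LayeredHandle;
use strict;
use warnings;

sub new {
    my ($class) = @_;
    return bless {
        binary_stack => [],       # (binary -> binary) layers, e.g. gzip
        conversion   => 'derp',   # the single (text -> binary) slot, defaulting as above
        text_stack   => [],       # (text -> text) layers, e.g. crlf
    }, $class;
}

# The two typed stacks are pushed/popped ...
sub push_binary { push @{ $_[0]{binary_stack} }, $_[1] }
sub push_text   { push @{ $_[0]{text_stack}   }, $_[1] }

# ... while the conversion layer is *set*, never stacked.
sub set_conversion { $_[0]{conversion} = $_[1] }

# Assemble the whole pipeline, file end first.
sub pipeline {
    my ($self) = @_;
    return (@{ $self->{binary_stack} }, $self->{conversion}, @{ $self->{text_stack} });
}

package main;
my $h = Hypothetical::LayeredHandle->new;
$h->push_text('crlf');                        # default text handling on Win32
$h->set_conversion('encoding(UTF-16BE)');     # replaces 'derp', cannot displace crlf
print join(' -> ', 'FILE', $h->pipeline), "\n";   # FILE -> encoding(UTF-16BE) -> crlf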

Now - how we turn what we have now into the system I outlined is quite
another matter…

… or maybe it ain't? How hard do the PerlIO people here think this might
be? (Leon?)

* feed is a word visibly distinct from input. I'm failing to find
a suitable antonym for this meaning of feed.

Spew? :-)

--
No trees were harmed in the transmission of this email
but trillions of electrons were excited to participate.

p5pRT commented Sep 20, 2013

From @ap

* Robert May <robertmay@cpan.org> [2013-09-20 17:45]:

On 20 September 2013 16:24, Father Chrysostomos wrote:

The opposite of feed is usually starve. But that doesn't work here.
Maybe spew? Extort?

drain?

Ah, y'all jogged my memory. Nicholas is looking for "source" and "sink"
I think - cf. <https://en.wikipedia.org/wiki/Sink_%28computing%29>.

--
Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT commented Sep 20, 2013

From @Leont

On Fri, Sep 20, 2013 at 2:58 PM, Nicholas Clark <nick@ccl4.org> wrote:

On Thu, Sep 19, 2013 at 10:27:32PM -0700, Tony Cook via RT wrote:

I believe this is a known problem with the way the default :crlf layer
works on Win32.

Since the layer is immediately on top of the :unix layer, it's working
at a byte level, adding CRs to the bytes *after* translation from
characters.

The only way I can see to fix this would be to make :crlf special, so it
always remains on top, but I suspect that's going to be fairly ugly from
an implementation point of view - do we make other layers special too?

I guess it's a kind of (emergent) flaw with the whole design of layers.
In that a layer can meaningfully be any of

  text -> binary    (eg Unicode -> UTF-8)
  binary -> binary  (eg gzip)
  binary -> text    (eg uuencode, or these days Base64)
  text -> text      (pedantically rot13)

(in terms of output)
(I think I've read that Python 3 went too far the other way on this by
banning all but the first)

I think on that categorisation LF -> CRLF would be text -> text.

(Certainly for output it's not binary -> text or binary -> binary, as it
corrupts binary data.)

Except that PerlIO internally doesn't work in terms of text or binary, but
in terms of latin-1/binary octets versus utf8 octets :-/.

So for example if you open a handle with ":encoding(utf-16be):bytes", you'd
read UTF-16 converted to UTF-8 and then interpreted as Latin-1. Obviously not
something someone would deliberately do, but the fact that you can do it
accidentally is bad enough.

Another dimension of this is that some layers only make sense at the bottom
(e.g. :unix), and others only above such a bottom layer (e.g. most layers).

So the design of layers ought to categorise their feed and not-feed* sides
as text or binary, and forbid plugging the wrong sorts together.

If we had that, then attempting to push UTF-16 atop CRLF would be an error
immediately. But, of course, the DWIM approach is that pushing UTF-16 onto
CRLF would cause UTF-16 to burrow under CRLF, given that issuing an error
of the form of "I can see what you're trying to do, but I'm not going to
help you" isn't very nice.

Yes, absolutely!

I believe that the second half of Jarkko's quote applies:

  Documenting bugs before they're found is kinda hard.
  Can I borrow your time machine? Mine won't start.

Failing that, we wrestle with the punchline to the "Irishman giving
directions" joke, and wonder how to retrofit sanity.

Gah. It keeps coming back to this - to handle Unicode properly, you need a
type system. Or at least enough of a type system to distinguish "text" from
"binary".

I'm wondering if it's really too late, given how much brokenness there is
in this area.

* feed is a word visibly distinct from input. I'm failing to find a suitable
  antonym for this meaning of feed.

I prefer to just call them top and bottom.

Leon

p5pRT commented Sep 20, 2013

From @Leont

On Fri, Sep 20, 2013 at 6:26 PM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:

Ah, y'all jogged my memory. Nicholas is looking for "source" and "sink"
I think - cf. <https://en.wikipedia.org/wiki/Sink_%28computing%29>.

IO goes in both directions; so one side will be the source for input but
the sink for output and vice versa.

Source and sink would be *terribly* confusing terms.

Leon
