
[Bug Report] Bad \n convert, using UTF-16 on Win32 #10869

Open
p5pRT opened this issue Dec 1, 2010 · 18 comments

p5pRT commented Dec 1, 2010

Migrated from rt.perl.org#80058 (status was 'open')

Searchable as RT80058$

p5pRT commented Dec 1, 2010

From mezmerik@gmail.com

Created by mezmerik@gmail.com

Hello,

I'm using ActivePerl 5.12.2 on Windows 7.

Perl for Win32 has a feature to convert a single "LF" (without a
preceding "CR") to "CRLF", but my perl seems to determine what "LF" is
in UTF-16 incorrectly. In ANSI and UTF-8 files, LF's byte is "0A";
in UTF-16, LF should be "000A" (Big Endian) or "0A00" (Little
Endian), but my perl seems to regard a single "0A" byte as LF too! Thus, it
will do the wrong thing, which is adding a "0D" before the "0A" (my perl
also regards a lone "0D" as CR; the right CR in UTF-16 should be "000D").

Here's the test program:

open FH_IN, "<:encoding(utf16be)", "src.txt" or die;
open FH_OUT, ">:encoding(utf16be)", "output.txt" or die;

while (<FH_IN>) {
   print FH_OUT $_;
}

I think "src.txt" and "output.txt" should be identical. But not.

1) if "src.txt" is only two CRLFs, its bytecodes are "FE FF 00 0D 00
0A 00 0D 00 0A"; the "output.txt" becomes "FE FF 00 0D 00 0D 0A 00 0D
00 0D 0A", each "0A" gets a unnecessary and wrong preceding "0D".

2) if "src.txt" is only one chinese charater "�", whose unicode and
UTF-16BE bytecode is "4E 0A", with BOM, the file's whole bytes are "FE
FF 4E 0A"; the "output.txt" becomes "FE FF 4E 0D 0A".

Modify the program code:

while (<FH_IN>) {
   chomp;
   print FH_OUT $_;
}

1) "src.txt" which is only two CRLFs, "FE FF 00 0D 00 0A 00 0D 00 0A"
becomes "FE FF 00 0D 00 0D". So, chomp only get rid of LF(00 0A). it
should erase 4 bytes "00 0D 00 0A".

That's what I found when working with UTF-16 files. I'd appreciate your
efforts to improve Unicode support. Many thanks!

 Joey

Perl Info

Flags:
   category=core
   severity=low

Site configuration information for perl 5.12.2:

Configured by SYSTEM at Mon Sep  6 23:12:49 2010.

Summary of my perl5 (revision 5 version 12 subversion 2) configuration:

 Platform:
   osname=MSWin32, osvers=5.00, archname=MSWin32-x86-multi-thread
   uname=''
   config_args='undef'
   hint=recommended, useposix=true, d_sigaction=undef
   useithreads=define, usemultiplicity=define
   useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
   use64bitint=undef, use64bitall=undef, uselongdouble=undef
   usemymalloc=n, bincompat5005=undef
 Compiler:
   cc='cl', ccflags ='-nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32
-D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE
-DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO
-D_USE_32BIT_TIME_T -DPERL_MSVCRT_READFIX',
   optimize='-MD -Zi -DNDEBUG -O1',
   cppflags='-DWIN32'
   ccversion='12.00.8804', gccversion='', gccosandvers=''
   intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
   d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=8
   ivtype='long', ivsize=4, nvtype='double', nvsize=8,
Off_t='__int64', lseeksize=8
   alignbytes=8, prototype=define
 Linker and Libraries:
   ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf
-libpath:"C:\Perl\lib\CORE"  -machine:x86'
   libpth=\lib
   libs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib
comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib
netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib  version.lib
odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib
   perllibs=  oldnames.lib kernel32.lib user32.lib gdi32.lib
winspool.lib  comdlg32.lib advapi32.lib shell32.lib ole32.lib
oleaut32.lib  netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib
version.lib odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib
   libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl512.lib
   gnulibc_version=''
 Dynamic Linking:
   dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
   cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug
-opt:ref,icf  -libpath:"C:\Perl\lib\CORE"  -machine:x86'

Locally applied patches:
   ACTIVEPERL_LOCAL_PATCHES_ENTRY
   1fd8fa4 Add Wolfram Humann to AUTHORS
   f120055 make string-append on win32 100 times faster
   a2a8d15 Define _USE_32BIT_TIME_T for VC6 and VC7
   007cfe1 Don't pretend to support really old VC++ compilers
   6d8f7c9 Get rid of obsolete PerlCRT.dll support
   d956618 Make Term::ReadLine::findConsole fall back to STDIN if
/dev/tty can't be opened
   321e50c Escape patch strings before embedding them in patchlevel.h


@INC for perl 5.12.2:
   C:/Perl/site/lib
   C:/Perl/lib
   .


Environment for perl 5.12.2:
   HOME (unset)
   LANG (unset)
   LANGUAGE (unset)
   LD_LIBRARY_PATH (unset)
   LOGDIR (unset)
   PATH=C:\Program Files\ActiveState Komodo IDE
6\;C:\Perl\site\bin;C:\Perl\bin;C:\Program Files\NVIDIA
Corporation\PhysX\Common;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\
   PERL_BADLANG (unset)
   SHELL (unset)

p5pRT commented Sep 20, 2013

From @tonycoz

On Wed Dec 01 04:38:26 2010, mezmerik@gmail.com wrote:

This is a bug report for perl from mezmerik@gmail.com,
generated with the help of perlbug 1.39 running under perl 5.12.2.

-----------------------------------------------------------------
[Please describe your issue here]

Hello,

I'm using ActivePerl 5.12.2 on Windows 7.

Perl for Win32 has a feature to convert a single "LF" (without a
preceding "CR") to "CRLF", but my perl seems to determine what "LF" is
in UTF-16 incorrectly. In ANSI and UTF-8 files, LF's byte is "0A";
in UTF-16, LF should be "000A" (Big Endian) or "0A00" (Little
Endian), but my perl seems to regard a single "0A" byte as LF too! Thus, it
will do the wrong thing, which is adding a "0D" before the "0A" (my perl
also regards a lone "0D" as CR; the right CR in UTF-16 should be "000D").

I believe this is a known problem with the way the default :crlf layer
works on Win32.

Since the layer is immediately on top of the :unix layer, it's working
at a byte level, adding CRs to the bytes *after* translation from
characters.

This means you get other broken behaviour, such as inserting a 0d byte
before characters in the U+0Axx range:

C:\Users\tony>perl -e "open my $fh, '>:encoding(utf16be)', 'foo.txt' or die $!; print $fh qq(\x{a90}hello\n)"

C:\Users\tony>perl -e "binmode STDIN; $/ = \16; while (<>) { print unpack('H*', $_), qq'\n' }" <foo.txt
0d0a9000680065006c006c006f000d0a

The workaround (or perhaps the only real fix) is to clear the :crlf
layer and add it back on above your unicode layer:

C:\Users\tony>perl -e "open my $fh, '>:raw:encoding(utf16be):crlf', 'foo.txt' or die $!; print $fh qq(\x{a90}hello\n)"

C:\Users\tony>perl -e "binmode STDIN; $/ = \16; while (<>) { print unpack('H*', $_), qq'\n' }" <foo.txt
0a9000680065006c006c006f000d000a
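
For the script in the original report, a minimal sketch of the same workaround
(same filenames; the read side is reordered the same way, which I believe is
needed so that CRLF handling also happens on characters rather than bytes):

use strict;
use warnings;

# :raw clears the default :crlf, the encoding layer goes on next, and
# :crlf is pushed back on top so newline translation sees characters.
open my $fh_in,  '<:raw:encoding(UTF-16BE):crlf', 'src.txt'    or die $!;
open my $fh_out, '>:raw:encoding(UTF-16BE):crlf', 'output.txt' or die $!;

while (<$fh_in>) {
    print {$fh_out} $_;
}

close $fh_out or die $!;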

The only way I can see to fix this would be to make :crlf special, so it
always remains on top, but I suspect that's going to be fairly ugly from
an implementation point of view - do we make other layers special too?

(/me avoids going wild with speculation)

Tony

p5pRT commented Sep 20, 2013

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Sep 20, 2013

From @nwc10

On Thu, Sep 19, 2013 at 10:27:32PM -0700, Tony Cook via RT wrote:

I believe this is a known problem with the way the default :crlf layer
works on Win32.

Since the layer is immediately on top of the :unix layer, it's working
at a byte level, adding CRs to the bytes *after* translation from
characters.

The only way I can see to fix this would be to make :crlf special, so it
always remains on top, but I suspect that's going to be fairly ugly from
an implementation point of view - do we make other layers special too?

I guess it's a kind of (emergent) flaw with the whole design of layers.
In that a layer can meaningfully be any of

  text -> binary (eg Unicode -> UTF-8)
  binary -> binary (eg gzip)
  binary -> text (eg uuencode, or these days Base64)
  text -> text (pedantically rot13)

(in terms of output)
(I think I've read that Python 3 went too far the other way on this by
banning all but the first)

I think on that categorisation LF -> CRLF would be text -> text.

(Certainly for output it's not binary -> text or binary -> binary, as it
corrupts binary data.)

(/me avoids going wild with speculation)

So the design of layers ought to categorise their feed and not-feed* sides
as text or binary, and forbid plugging the wrong sorts together.

If we had that, then attempting to push UTF-16 atop CRLF would be an error
immediately. But, of course, the DWIM approach is that pushing UTF-16 onto
CRLF would cause UTF-16 to burrow under CRLF, given that issuing an error
of the form of "I can see what you're trying to do, but I'm not going to
help you" isn't very nice.

I believe that the second half of Jarkko's quote applies​:

  Documenting bugs before they're found is kinda hard.
  Can I borrow your time machine? Mine won't start.

Failing that, we wrestle with the punchline to the "Irishman giving directions"
joke, and wonder how to retrofit sanity.

Gah. It keeps coming back to this - to handle Unicode properly, you need a
type system. Or at least enough of a type system to distinguish "text" from
"binary".

Nicholas Clark

* feed is a word visibly distinct from input. I'm failing to find a suitable
  antonym for this meaning of feed.
 

p5pRT commented Sep 20, 2013

From @ikegami

On Fri, Sep 20, 2013 at 8:58 AM, Nicholas Clark <nick@ccl4.org> wrote:

Gah. It keeps coming back to this - to handle Unicode properly, you need a
type system. Or at least enough of a type system to distinguish "text" from
"binary".

I fully agree, but I don't see how that would help here. Wouldn't that
prevent :crlf (a text processor) from being placed on a binary handle as
Perl does?

:crlf is a special case. It would therefore make sense for :encoding to
handle it specially and "burrow under" it, as you called it. This is
independent of the existence of a text/binary string semantics system.

p5pRT commented Sep 20, 2013

From @nwc10

On Fri, Sep 20, 2013 at 10:18:33AM -0400, Eric Brine wrote:

On Fri, Sep 20, 2013 at 8:58 AM, Nicholas Clark <nick@ccl4.org> wrote:

Gah. It keeps coming back to this - to handle Unicode properly, you need a
type system. Or at least enough of a type system to distinguish "text" from
"binary".

I fully agree, but I don't see how that would help here. Wouldn't that
prevent :crlf (a text processor) from being placed on a binary handle as
Perl does?

Yes, it would, if handles default to binary. But I think that then that's
part of the mess. In that in a Unicode world, every platform needs to care
about whether a handle is binary or text. And the old convenience of "just"
opening a file, without (at that point) caring whether it's a CSV or a JPEG
goes out of the window.

:crlf is a special case. It would therefore make sense for :encoding to
handle it specially and "burrow under" it, as you called it. This is
independent of the existence of text/binary string semantic system.

Problem is that it's no more of a special case than a layer that converts
all Unicode line endings to LF. Or a layer that does NFD. It's not unique.
That's what's bugging me.

You sort of need some sort of "apply layer" logic, which assumes

FILE -> [binary -> binary] {0,*} -> [binary -> text] -> [text -> text] {0,*}

at which point, applying any binary->binary or text->text layer *stacks* it
at the right point, and applying any binary->text layer swaps out the
previous.

(And I might be missing one in that diagram - maybe FILE should be
[source -> binary], which would solve the :stdio vs :unix mess)

And if you want to apply a [text -> binary] layer, or build something funkier
than the above model permits, or for (some reason) remove a layer, you use a
second API which does "build the entire stack", or "splice".
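
A sketch of that "apply layer" logic under the same assumptions (hypothetical
helper functions, hard-coded layer kinds, not PerlIO's real API):

# FILE -> [binary -> binary]{0,*} -> [binary -> text] -> [text -> text]{0,*}
my %kind = (gzip => 'bin2bin', encoding => 'bin2text', crlf => 'text2text');
my %pipe = (bin2bin => [], bin2text => undef, text2text => []);

sub apply_layer {
    my ($name) = @_;
    (my $base = $name) =~ s/\(.*\)\z//;          # "encoding(UTF-16BE)" -> "encoding"
    my $k = $kind{$base} or die "unknown layer $name\n";
    if ($k eq 'bin2text') {
        $pipe{bin2text} = $name;                 # swaps out the previous conversion
    } else {
        push @{ $pipe{$k} }, $name;              # stacks in its own region
    }
}

sub pipeline {
    return ('FILE', @{ $pipe{bin2bin} },
            (defined $pipe{bin2text} ? $pipe{bin2text} : ()),
            @{ $pipe{text2text} });
}

apply_layer('crlf');                    # the platform default text -> text layer
apply_layer('encoding(UTF-16BE)');      # fills the conversion slot *below* crlf
print join(' -> ', pipeline()), "\n";   # FILE -> encoding(UTF-16BE) -> crlf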

Nicholas Clark

p5pRT commented Sep 20, 2013

From @Tux

On Fri, 20 Sep 2013 15:29:16 +0100, Nicholas Clark <nick@ccl4.org> wrote:

On Fri, Sep 20, 2013 at 10:18:33AM -0400, Eric Brine wrote:

On Fri, Sep 20, 2013 at 8:58 AM, Nicholas Clark <nick@ccl4.org> wrote:

Gah. It keeps coming back to this - to handle Unicode properly, you need a
type system. Or at least enough of a type system to distinguish "text" from
"binary".

I fully agree, but I don't see how that would help here. Wouldn't that
prevent :crlf (a text processor) from being placed on a binary handle as
Perl does?

Yes, it would, if handles default to binary. But I think that then that's
part of the mess. In that in a Unicode world, every platform needs to care
about whether a handle is binary or text. And the old convenience of "just"
opening a file, without (at that point) caring whether it's a CSV or a JPEG
goes out of the window.

Not that it happens a lot, but in CSV there is no overall encoding. The
CSV format allows you to pass every line/record in a different encoding,
or even every field within a line/record. Not that that would be a sane
thing to do, but the definition allows it :(

What *does* happen (all too often) is that the lines are exported in
CSV with iso-8859-1 when every character falls in that range and in
UTF-8 when the record contains a field that contains a character
outside of the iso range. The decoder now has to check twice for
validity. These generators suck, but I have to deal with their output
on a daily basis.

:crlf is a special case. It would therefore make sense for :encoding to
handle it specially and "burrow under" it, as you called it. This is
independent of the existence of text/binary string semantic system.

Problem is that it's no more of a special case than a layer that converts
all Unicode line endings to LF. Or a layer that does NFD. It's not unique.
That's what's bugging me.

You sort of need some sort of "apply layer" logic, which assumes

FILE -> [binary -> binary] {0,*} -> [binary -> text] -> [text -> text] {0,*}

at which point, applying any binary->binary or text->text layer *stacks* it
at the right point, and applying any binary->text layer swaps out the
previous.

(And I might be missing one in that diagram - maybe FILE should be
[source -> binary], which would solve the :stdio vs :unix mess)

And if you want to apply a [text -> binary] or build more funky than the
above model permits, or for (some reason) remove a layer, you use a second
API which does "build the entire stack", or "splice".

Nicholas Clark

--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.19 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented Sep 20, 2013

From zefram@fysh.org

Nicholas Clark wrote:

at which point, applying any binary->binary or text->text layer *stacks* it
at the right point, and applying any binary->text layer swaps out the
previous.

It sounds like the concept of "applying" a layer has been overloaded
beyond usefulness. Inserting layers at different positions constitutes
different operations, and replacing a layer (or group of layers) is different again.

-zefram

p5pRT commented Sep 20, 2013

From @nwc10

On Fri, Sep 20, 2013 at 04:00:25PM +0100, Zefram wrote:

Nicholas Clark wrote:

at which point, applying any binary->binary or text->text layer *stacks* it
at the right point, and applying any binary->text layer swaps out the
previous.

It sounds like the concept of "applying" a layer has been overloaded
beyond usefulness. Inserting layers at different positions are different
operations, and replacing a layer (or group of layers) is different again.

Yes, half the time I agree with you here. It's too complex to be useful.

But it's bugging me that the only 2 frequent operations a programmer does are

1) State that the file is binary
2) State that the file is text in a particular encoding

with the bothersome problem that the default is text, with platform specific
line ending post-processing, which should be retained on a text file even if
the encoding is changed from the default.

And that (1) and (2) above ought to be easy to do, without needing to resort
to a more flexible syntax.

Nicholas Clark

p5pRT commented Sep 20, 2013

From @cpansprout

On Fri Sep 20 05:59:34 2013, nicholas wrote:

I believe that the second half of Jarkko's quote applies:

  Documenting bugs before they're found is kinda hard.
  Can I borrow your time machine? Mine won't start.

Failing that, we wrestle with the punchline to the "Irishman giving directions"
joke, and wonder how to retrofit sanity.

Gah. It keeps coming back to this - to handle Unicode properly, you need a
type system. Or at least enough of a type system to distinguish "text" from
"binary".

I usually just work around the whole issue with explicit encode/decode.
Also, where possible, I avoid UTF-16 and Windows. Life is so much
simpler that way!
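
For reference, a minimal sketch of that explicit encode/decode approach (not
from the thread; it uses the core Encode module and does the newline
translation by hand):

use Encode qw(decode encode);

# Read raw bytes and decode by hand.
open my $in, '<:raw', 'src.txt' or die $!;
my $octets = do { local $/; <$in> };
close $in;

# 'UTF-16' (no endianness) consumes the leading BOM; 'UTF-16BE' would
# keep U+FEFF as an ordinary character at the start of the string.
my $text = decode('UTF-16', $octets);
$text =~ s/\r\n/\n/g;                          # CRLF -> LF on the way in

# ... work with $text as a character string ...

(my $out_text = $text) =~ s/\n/\r\n/g;         # LF -> CRLF on the way out
open my $out, '>:raw', 'output.txt' or die $!;
print {$out} encode('UTF-16BE', "\x{FEFF}" . $out_text);   # write the BOM explicitly
close $out or die $!;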

Nicholas Clark

* feed is a word visibly distinct from input. I'm failing to find a suitable
  antonym for this meaning of feed.

The opposite of feed is usually starve. But that doesn't work here.
Maybe spew? Extort?

--

Father Chrysostomos

p5pRT commented Sep 20, 2013

From @Leont

On Fri, Sep 20, 2013 at 7:27 AM, Tony Cook via RT <perlbug-followup@perl.org> wrote:

Hello,

I'm using ActivePerl 5.12.2 on Windows 7.

Perl for Win32 has a feature to convert a single "LF" (without a
preceding "CR") to "CRLF", but my perl seems to determine what "LF" is
in UTF-16 incorrectly. In ANSI and UTF-8 files, LF's byte is "0A";
in UTF-16, LF should be "000A" (Big Endian) or "0A00" (Little
Endian), but my perl seems to regard a single "0A" byte as LF too! Thus, it
will do the wrong thing, which is adding a "0D" before the "0A" (my perl
also regards a lone "0D" as CR; the right CR in UTF-16 should be "000D").

I believe this is a known problem with the way the default :crlf layer
works on Win32.

Since the layer is immediately on top of the :unix layer, it's working
at a byte level, adding CRs to the bytes *after* translation from
characters.

This means you get other broken behaviour, such as inserting a 0d byte
before characters in the U+0Axx range:

C:\Users\tony>perl -e "open my $fh, '>:encoding(utf16be)', 'foo.txt' or die $!; print $fh qq(\x{a90}hello\n)"

C:\Users\tony>perl -e "binmode STDIN; $/ = \16; while (<>) { print unpack('H*', $_), qq'\n' }" <foo.txt
0d0a9000680065006c006c006f000d0a

The workaround (or perhaps the only real fix) is to clear the :crlf
layer and add it back on above your unicode layer:

C:\Users\tony>perl -e "open my $fh, '>:raw:encoding(utf16be):crlf', 'foo.txt' or die $!; print $fh qq(\x{a90}hello\n)"

All correct. I wrote the ":text" pseudo layer that shortens that to
':text(utf-16be)', except that it will not do that whole dance on unix
systems.

Also note that before 5.14, binmode $fh, ':raw:encoding(utf-16be):crlf' did
not work correctly.
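
A usage sketch, assuming the pseudo layer is installed and takes its argument
exactly as written above (it is not a core layer):

# Hypothetical usage of the ":text" pseudo layer described above; on Win32
# it would expand to the :raw:encoding(...):crlf dance, elsewhere to just
# the encoding layer.
open my $fh, '<:text(utf-16be)', 'src.txt' or die $!;
while (my $line = <$fh>) {
    chomp $line;            # with :crlf correctly on top, chomp strips a plain "\n"
    print "read: $line\n";
}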

The only way I can see to fix this would be to make :crlf special, so it
always remains on top, but I suspect that's going to be fairly ugly from
an implementation point of view - do we make other layers special too?

(/me avoids going wild with speculation)

The way to correct this is to make open be sensible. That is not a trivial
problem.

Leon

p5pRT commented Sep 20, 2013

From robertmay@cpan.org

On 20 September 2013 16:24, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:

* feed is a word visibly distinct from input. I'm failing to find a suitable
  antonym for this meaning of feed.

The opposite of feed is usually starve. But that doesn't work here.
Maybe spew? Extort?

drain?

p5pRT commented Sep 20, 2013

From @ap

* Nicholas Clark <nick@ccl4.org> [2013-09-20 15:00]:

I guess it's a kind of (emergent) flaw with the whole design of
layers. In that a layer can meaningfully be any of

  text -> binary    (eg Unicode -> UTF-8)
  binary -> binary  (eg gzip)
  binary -> text    (eg uuencode, or these days Base64)
  text -> text      (pedantically rot13)

(in terms of output)
(I think I've read that Python 3 went too far the other way on this by
banning all but the first)

I think on that categorisation LF -> CRLF would be text -> text.

(Certainly for output it's not binary -> text or binary -> binary, as
it corrupts binary data.)

So the design of layers ought to categorise their feed and not-feed*
sides as text or binary, and forbid plugging the wrong sorts together.

If we had that, then attempting to push UTF-16 atop CRLF would be an
error immediately. But, of course, the DWIM approach is that pushing
UTF-16 onto CRLF would cause UTF-16 to burrow under CRLF, given that
issuing an error of the form of "I can see what you're trying to do,
but I'm not going to help you" isn't very nice.

I believe that the second half of Jarkko's quote applies:

  Documenting bugs before they're found is kinda hard.
  Can I borrow your time machine? Mine won't start.

Failing that, we wrestle with the punchline to the "Irishman giving
directions" joke, and wonder how to retrofit sanity.

Do we want to provide first-class support for layer cakes like this?

  (text → binary) (binary → text) (text → binary)

Because if we don't, then the solution would seem to be very easy, at
least at the conceptual level: make each handle have two stacks, one
for (text → text) layers and another one for (binary → binary), plus
a single slot for (text → binary). And then you *set* this single slot
(no pushing/popping there), plus you push/pop the other types of layers
on their respective stacks.

In that design, we also have a (text → binary) layer (named "derp"? :-))
whose output direction implements the current behaviour of `print` and
friends, wherein they try to downgrade a string for output but warn and
output the utf8 buffer as bytes if they can't.

(As far as I can see there is no reason to have (binary → text) layers
if layers can go in both directions depending on whether they're applied
to input or output (as is currently the case - you push :encoding(UTF-8)
no matter whether it's an input or output handle).)

"Derp" is then the default (text → binary) layer for handles on which
nothing else has been set. This solves the question of "how do I set
:crlf on an otherwise unconfigured handle if layers are typed?"

Note that this solves the problem with pushing UTF-16 onto CRLF, because
in this design you don't do that - you *set* UTF-16 for the conversion
slot, and in any case the CRLF layer is in a stack by itself, so if you
push any (binary → binary) layers, they will push "under" the CRLF layer
automatically. So by this design PerlIO will DTRT automatically.

I'd suggest a migration in which (text → text), (binary → binary) and
(text → binary) layers all move into different namespaces, so that once
completed it becomes impossible to even *say* the wrong thing.

Note that even layer cakes can be supported as a second-class construct,
by providing a reverse-direction pseudo-layer that implements a nested
layer stack, which you can then push onto a layer stack as a unit.

(This even neatly solves the question of how code is supposed to keep
track of the relative ordering of layers in really complex situations.
If you build layer cakes out of (possibly recursively) nested
stacks, then each participating bit of code only needs to care about the
nested stack it is managing itself, and by virtue of the fixed order of
layers within a handle or a pseudo-layer, the overall resulting pipeline
is guaranteed to assemble into something that makes sense.)
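
A structural sketch of the two-stacks-plus-a-slot idea (purely illustrative;
the package and method names are invented, this is not an actual PerlIO API):

package Hypothetical::LayeredHandle;
use strict;
use warnings;

sub new {
    my ($class) = @_;
    return bless {
        binary_stack => [],       # (binary -> binary) layers, e.g. gzip
        conversion   => 'derp',   # the single (text -> binary) slot, defaulting as above
        text_stack   => [],       # (text -> text) layers, e.g. crlf
    }, $class;
}

# The two typed stacks are pushed/popped ...
sub push_binary { push @{ $_[0]{binary_stack} }, $_[1] }
sub push_text   { push @{ $_[0]{text_stack}   }, $_[1] }

# ... while the conversion layer is *set*, never stacked.
sub set_conversion { $_[0]{conversion} = $_[1] }

# Assemble the whole pipeline, file end first.
sub pipeline {
    my ($self) = @_;
    return (@{ $self->{binary_stack} }, $self->{conversion}, @{ $self->{text_stack} });
}

package main;
my $h = Hypothetical::LayeredHandle->new;
$h->push_text('crlf');                        # default text handling on Win32
$h->set_conversion('encoding(UTF-16BE)');     # replaces 'derp', cannot displace crlf
print join(' -> ', 'FILE', $h->pipeline), "\n";   # FILE -> encoding(UTF-16BE) -> crlf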

Now - how we turn what we have now into the system I outlined is quite
another matter…

… or maybe it ain't? How hard do the PerlIO people here think this might
be? (Leon?)

* feed is a word visibly distinct from input. I'm failing to find
a suitable antonym for this meaning of feed.

Spew? :-)

--
No trees were harmed in the transmission of this email
but trillions of electrons were excited to participate.

p5pRT commented Sep 20, 2013

From @ap

* Robert May <robertmay@cpan.org> [2013-09-20 17:45]:

On 20 September 2013 16:24, Father Chrysostomos wrote:

The opposite of feed is usually starve. But that doesn't work here.
Maybe spew? Extort?

drain?

Ah, y'all jogged my memory. Nicholas is looking for "source" and "sink"
I think - cf. <https://en.wikipedia.org/wiki/Sink_%28computing%29>.

--
Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT commented Sep 20, 2013

From @Leont

On Fri, Sep 20, 2013 at 2:58 PM, Nicholas Clark <nick@ccl4.org> wrote:

On Thu, Sep 19, 2013 at 10:27:32PM -0700, Tony Cook via RT wrote:

I believe this is a known problem with the way the default :crlf layer
works on Win32.

Since the layer is immediately on top of the :unix layer, it's working
at a byte level, adding CRs to the bytes *after* translation from
characters.

The only way I can see to fix this would be to make :crlf special, so it
always remains on top, but I suspect that's going to be fairly ugly from
an implementation point of view - do we make other layers special too?

I guess it's a kind of (emergent) flaw with the whole design of layers.
In that a layer can meaningfully be any of

  text -> binary    (eg Unicode -> UTF-8)
  binary -> binary  (eg gzip)
  binary -> text    (eg uuencode, or these days Base64)
  text -> text      (pedantically rot13)

(in terms of output)
(I think I've read that Python 3 went too far the other way on this by
banning all but the first)

I think on that categorisation LF -> CRLF would be text -> text.

(Certainly for output it's not binary -> text or binary -> binary, as it
corrupts binary data.)

Except that PerlIO internally doesn't work in terms of text or binary, but
in terms of latin-1/binary octets versus utf8 octets :-/.

So for example if you open a handle with ":encoding(utf-16be):bytes", you'd
read UTF-16 converted to UTF-8 and then interpreted as Latin-1. Obviously not
something someone would deliberately do, but the fact that you can do it
accidentally is bad enough.

Another dimension of this is that some layers only make sense at the bottom
(e.g. :unix), and others only above such a bottom layer (e.g. most layers).

So the design of layers ought to categorise their feed and not-feed* sides
as text or binary, and forbid plugging the wrong sorts together.

If we had that, then attempting to push UTF-16 atop CRLF would be an error
immediately. But, of course, the DWIM approach is that pushing UTF-16 onto
CRLF would cause UTF-16 to burrow under CRLF, given that issuing an error
of the form of "I can see what you're trying to do, but I'm not going to
help you" isn't very nice.

Yes, absolutely!

I believe that the second half of Jarkko's quote applies:

  Documenting bugs before they're found is kinda hard.
  Can I borrow your time machine? Mine won't start.

Failing that, we wrestle with the punchline to the "Irishman giving
directions" joke, and wonder how to retrofit sanity.

Gah. It keeps coming back to this - to handle Unicode properly, you need a
type system. Or at least enough of a type system to distinguish "text" from
"binary".

I'm wondering if it's really too late, given how much brokenness there is
in this area.

* feed is a word visibly distinct from input. I'm failing to find a suitable
  antonym for this meaning of feed.

I prefer to just call them top and bottom.

Leon

p5pRT commented Sep 20, 2013

From @Leont

On Fri, Sep 20, 2013 at 6:26 PM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:

Ah, y'all jogged my memory. Nicholas is looking for "source" and "sink"
I think - cf. <https://en.wikipedia.org/wiki/Sink_%28computing%29>.

IO goes in both directions; so one side will be the source for input but
the sink for output and vice versa.

Source and sink would be *terribly* confusing terms.

Leon
