[Bug Report] Bad \n convert, using UTF-16 on Win32 #10869
From mezmerik@gmail.com
Created by mezmerik@gmail.com
Hello, I'm using ActivePerl 5.12.2 on Windows 7. Perl for Win32 has a feature to convert a single "LF" (without
Here's the test program: open FH_IN, "<:encoding(utf16be)", "src.txt" or die; while (<FH_IN>) {
I think "src.txt" and "output.txt" should be identical, but they are not.
1) if "src.txt" is only two CRLFs, its bytes are "FE FF 00 0D 00
2) if "src.txt" is only one Chinese character "�", whose Unicode
and modify the program code: while (<FH_IN>) {
1) "src.txt" which is only two CRLFs, "FE FF 00 0D 00 0A 00 0D 00 0A"
That's what I found when operating on UTF-16 files. I'll appreciate your
Joey
Perl Info
From @tonycoz
On Wed Dec 01 04:38:26 2010, mezmerik@gmail.com wrote:
I believe this is a known problem with the way the default :crlf layer
Since the layer is immediately on top of the :unix layer, it's working
This means you get other broken behaviour, such as inserting a 0d byte
C:\Users\tony>perl -e "open my $fh, '>:encoding(utf16be)', 'foo.txt' or
C:\Users\tony>perl -e "binmode STDIN; $/ = \16; while (<>) { print
The workaround (or perhaps the only real fix) is to clear the :crlf
C:\Users\tony>perl -e "open my $fh, '>:raw:encoding(utf16be):crlf',
C:\Users\tony>perl -e "binmode STDIN; $/ = \16; while (<>) { print
The only way I can see to fix this would be to make :crlf special, so it
(/me avoids going wild with speculation)
Tony
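The difference between the two layer orders can be modelled at the byte level. The following is only a sketch in Python, not PerlIO itself: the helper names `crlf_on_bytes` and `crlf_on_text` are made up for illustration, and the model assumes that a :crlf layer sitting below the encoding layer sees encoded bytes, while one sitting above it sees characters.

```python
# Byte-level model of the two layer orders for output.
# This imitates the effect described above; it does not call PerlIO.

def crlf_on_bytes(data: bytes) -> bytes:
    """LF -> CRLF applied to already-encoded bytes (crlf below the encoding)."""
    return data.replace(b"\n", b"\r\n")

def crlf_on_text(text: str) -> str:
    """LF -> CRLF applied to characters (crlf above the encoding)."""
    return text.replace("\n", "\r\n")

text = "\n\n"

# Model of '>:encoding(utf16be)' with the default crlf underneath:
# encode first, then translate bytes. A lone 0d byte lands before each 0a.
broken = crlf_on_bytes(text.encode("utf-16-be"))

# Model of '>:raw:encoding(utf16be):crlf':
# translate characters first, then encode.
fixed = crlf_on_text(text).encode("utf-16-be")

print(broken.hex(" "))  # 00 0d 0a 00 0d 0a
print(fixed.hex(" "))   # 00 0d 00 0a 00 0d 00 0a
```

In the first result the stream is no longer made of whole 16-bit units per character, which matches the "inserting a 0d byte" symptom; the second result is the byte sequence the original report expected for two CRLFs (without the BOM).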
The RT System itself - Status changed from 'new' to 'open'
From @nwc10
On Thu, Sep 19, 2013 at 10:27:32PM -0700, Tony Cook via RT wrote:
I guess it's a kind of (emergent) flaw with the whole design of layers.
text -> binary (eg Unicode -> UTF-8) (in terms of output)
I think on that categorisation LF -> CRLF would be text -> text.
(Certainly for output it's not binary -> text or binary -> binary, as it
So the design of layers ought to categorise their feed and not-feed* sides
If we had that, then attempting to push UTF-16 atop CRLF would be an error
I believe that the second half of Jarkko's quote applies:
Documenting bugs before they're found is kinda hard.
Failing that, we wrestle with the punchline to the "Irishman giving directions"
Gah. It keeps coming back to this - to handle Unicode properly, you need a
Nicholas Clark
* feed is a word visibly distinct from input. I'm failing to find a suitable
From @ikegami
On Fri, Sep 20, 2013 at 8:58 AM, Nicholas Clark <nick@ccl4.org> wrote:
I fully agree, but I don't see how that would help here. Wouldn't that
:crlf is a special case. It would therefore make sense for :encoding to
From @nwc10
On Fri, Sep 20, 2013 at 10:18:33AM -0400, Eric Brine wrote:
Yes, it would, if handles default to binary. But I think that then that's
Problem is that it's no more of a special case than a layer that converts
You sort of need some sort of "apply layer" logic, which assumes
FILE -> [binary -> binary] {0,*} -> [binary -> text] -> [text -> text] {0,*}
at which point, applying any binary->binary or text->text layer *stacks* it
(And I might be missing one in that diagram - maybe FILE should be
And if you want to apply a [text -> binary] or build more funky than the
Nicholas Clark
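The stack shape in that diagram can be checked mechanically once each layer declares what it consumes and produces. The following Python sketch is only an illustration of the idea, not an existing PerlIO facility: the `Layer` class, the `check_stack` function, and the "gzip" layer are hypothetical, and it assumes the file itself counts as binary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Layer:
    name: str
    lower: str  # kind of data on the side facing the file: "binary" or "text"
    upper: str  # kind of data on the side facing the program

def check_stack(layers):
    """Return True if the layers, listed from the file upward, fit
    FILE -> [binary -> binary]* -> [binary -> text] -> [text -> text]*."""
    current = "binary"  # assumption: the file end of the stack is binary
    for layer in layers:
        if layer.lower != current:
            return False  # layer is fed the wrong kind of data
        if current == "text" and layer.upper != "text":
            return False  # text -> binary is outside this simple shape
        current = layer.upper
    return current == "text"

# Hypothetical layer descriptions, categorised as in the discussion above.
GZIP = Layer("gzip", "binary", "binary")
UTF16 = Layer("encoding(utf16be)", "binary", "text")
CRLF = Layer("crlf", "text", "text")

print(check_stack([UTF16, CRLF]))        # True
print(check_stack([CRLF, UTF16]))        # False
print(check_stack([GZIP, UTF16, CRLF]))  # True
```

Under this categorisation, pushing UTF-16 on top of CRLF is rejected because the text -> text layer would be fed binary data, which is the error case described earlier in the thread.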
From @Tux
On Fri, 20 Sep 2013 15:29:16 +0100, Nicholas Clark <nick@ccl4.org>
Not that it happens a lot, but in CSV there is no overall encoding. The
What *does* happen (quite too often) is that the lines are exported in
From zefram@fysh.org
Nicholas Clark wrote:
It sounds like the concept of "applying" a layer has been overloaded
-zefram
From @nwc10
On Fri, Sep 20, 2013 at 04:00:25PM +0100, Zefram wrote:
Yes, half the time I agree with you here. It's too complex to be useful.
But it's bugging me that the only 2 frequent operations a programmer does are
1) State that the file is binary
with the bothersome problem that the default is text, with platform specific
And that (1) and (2) above ought to be easy to do, without needing to resort
Nicholas Clark
From @cpansprout
On Fri Sep 20 05:59:34 2013, nicholas wrote:
I usually just work around the whole issue with explicit encode/decode.
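As a rough illustration of that workaround, the handle can be kept binary while the line-ending translation and the encoding are done explicitly in the program. This is a Python sketch rather than Perl; the function names and the temporary "output.txt" path are assumptions made for the example.

```python
import os
import tempfile

def write_utf16be_crlf(path, text):
    """Translate LF -> CRLF on characters, encode, then write raw bytes."""
    data = text.replace("\n", "\r\n").encode("utf-16-be")
    with open(path, "wb") as fh:  # binary handle: no implicit translation
        fh.write(data)

def read_utf16be_crlf(path):
    """Read raw bytes, decode, then translate CRLF -> LF on characters."""
    with open(path, "rb") as fh:
        data = fh.read()
    return data.decode("utf-16-be").replace("\r\n", "\n")

path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_utf16be_crlf(path, "line one\nline two\n")
round_tripped = read_utf16be_crlf(path)
print(round_tripped == "line one\nline two\n")  # True
```

Because the I/O layer only ever sees bytes, the order of the two steps is decided by the program rather than by the layer stack.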
The opposite of feed is usually starve. But that doesn't work here.
--
Father Chrysostomos
From @Leont
On Fri, Sep 20, 2013 at 7:27 AM, Tony Cook via RT <perlbug-followup@perl.org
All correct. I wrote the ":text" pseudo layer that shortens that to
Also note that before 5.14, binmode $fh, ':raw:encoding(utf-16be):crlf' did
The way to correct this is to make open be sensible. That is not a trivial
Leon
From robertmay@cpan.org
On 20 September 2013 16:24, Father Chrysostomos via RT
drain?
From @ap
* Nicholas Clark <nick@ccl4.org> [2013-09-20 15:00]:
Do we want to provide first-class support for layer cakes like this?
(text -> binary) (binary -> text) (text -> binary)
Because if we don't, then the solution would seem to be very easy, at
In that design, we also have a (text -> binary) layer (named "derp"? :-))
(As far as I can see there is no reason to have (binary -> text) layers
"Derp" is then the default (text -> binary) layer for handles on which
Note that this solves the problem with pushing UTF-16 onto CRLF, because
I'd suggest a migration in which (text -> text), (binary -> binary) and
Note that even layer cakes can be supported as a second-class construct,
(This even neatly solves the question of how code is supposed to keep
Now - how we turn what we have now into the system I outlined is quite
- or maybe it ain't? How hard do the PerlIO people here think this might
Spew? :-)
From @ap
* Robert May <robertmay@cpan.org> [2013-09-20 17:45]:
Ah, y'all jogged my memory. Nicholas is looking for "source" and "sink"
From @Leont
On Fri, Sep 20, 2013 at 2:58 PM, Nicholas Clark <nick@ccl4.org> wrote:
Except that PerlIO internally doesn't work in terms of text or binary, but
So for example if you'd open a ":encoding(utf-16be):bytes". You'd read
Another dimension of this is that some layers only make sense at the bottom
I'm wondering if it's really too late. Given how much brokenness there is
I prefer to just call them top and bottom.
Leon
From @Leont
On Fri, Sep 20, 2013 at 6:26 PM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
IO goes in both directions; so one side will be the source for input but
Source and sink would be *terribly* confusing terms.
Leon
Migrated from rt.perl.org#80058 (status was 'open')
Searchable as RT80058$