Encoding I/O Layer Difference on Windows #15218

p5pRT · 2016-03-07T17:21:02Z

Migrated from rt.perl.org#127668 (status was 'open')

Searchable as RT127668$

p5pRT · 2016-03-07T17:21:02Z

From dwheeler@cpan.org

I have a file to which I’ve written:

"\xc3\xa5\xc3\xa5\xc3\xa5\x0a"

These bytes correspond to the UTF-8 string:

"ååå\n"

I want to read this file into a scalar, so I wrote this Perl:

sub slurp {
    my ($file) = @_;
    open my $fh, "<:encoding(UTF-8)", $file or die $!;
    return '' if eof $fh;
    local $/;
    return <$fh>;
}

my $fn = shift or die "Usage: $0 [file]\n";
slurp $fn;

Works great on my Mac and on *nix machines, but not on Windows. There it emits a warning:

utf8 "\xA5" does not map to Unicode at try.pl line 6

I can fix it by changing the I/O layer to :raw:encoding(UTF-8). My guess is that it then reads the file in as raw bytes first and only afterwards does the conversion. I can also break it on my Mac by changing the layer to :crlf:encoding(UTF-8).
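For reference, here is a minimal sketch of that workaround. The sub name slurp_raw_utf8 and the scratch file are only illustrative (they are not part of my real script); the layer string is the one mentioned above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the exact bytes from this report to a scratch file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xc3\xa5\xc3\xa5\xc3\xa5\x0a";
close $out or die "close: $!";

# Hypothetical variant of slurp() that names its layers explicitly,
# so no platform default layer sits underneath the decoder.
sub slurp_raw_utf8 {
    my ($file) = @_;
    open my $fh, '<:raw:encoding(UTF-8)', $file or die "open $file: $!";
    return '' if eof $fh;
    local $/;
    return <$fh>;
}

my $text = slurp_raw_utf8($path);
printf "%d characters\n", length $text;
```

With this layer string the file decodes to three "å" characters plus the newline, and I see no warning on either platform.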

But I don’t understand why the encoding layer’s parsing of a file with \x0a line endings should vary by platform. Sure, line endings on Windows are typically \x0d\x0a, but why would the I/O layer care when I’ve told it that the file encoding is UTF-8?

Note that this does not occur with data in memory. This works fine on both platforms:

use Encode qw(decode_utf8);
my $data = "\xc3\xa5\xc3\xa5\xc3\xa5\x0a";
decode_utf8 $data;

Is the I/O layer assuming that, because we’re on Windows, line endings need to be converted to \r\n before decoding? Is it implicitly using :crlf on Windows? Doesn’t seem like it’d be necessary if I’ve already told it what encoding to use, and shouldn't bork that encoding in any event.
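One way to check the layer question directly is to ask the filehandle which layers it ended up with. This is only a sketch: PerlIO::get_layers is a built-in, but the scratch file is an assumption made for the example, and the exact list printed will depend on the platform and build.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# An empty scratch file is enough; only the layer stack matters here.
my ($out, $path) = tempfile(UNLINK => 1);
close $out or die "close: $!";

open my $fh, '<:encoding(UTF-8)', $path or die "open $path: $!";

# PerlIO::get_layers returns the names of the layers on the handle,
# bottom-most first.
my @layers = PerlIO::get_layers($fh);
print "layers: @layers\n";
close $fh;
```

If "crlf" shows up in that list on Windows but not on the Mac, that would suggest the encoding layer is being pushed on top of a default :crlf layer rather than replacing it.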

p5pRT · 2016-03-18T13:19:04Z

From @arc

As requested, I've attached a program and test inputs demonstrating that the problem shows up iff all the following are true:

- The :crlf layer is used
- The :encoding(UTF-8) layer is used (not :utf8)
- The input ends in LF rather than CRLF
- The program tests eof($fh) before reading from the filehandle
- $/ is set to undef

I get the following results when running the program in various ways:

$ for mode in ':encoding(UTF-8)' :crlf:utf8 ':crlf:encoding(UTF-8)'; do
    for file in crlf.txt lf.txt; do
      for slurp in 0 1; do
        for use_eof in 0 1; do
          echo "mode=$mode file=$file slurp=$slurp use_eof=$use_eof"
          perl script.pl $mode $file $slurp $use_eof
        done
      done
    done
  done
mode=:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=0
mode=:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=1
mode=:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=0
mode=:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=1
mode=:encoding(UTF-8) file=lf.txt slurp=0 use_eof=0
mode=:encoding(UTF-8) file=lf.txt slurp=0 use_eof=1
mode=:encoding(UTF-8) file=lf.txt slurp=1 use_eof=0
mode=:encoding(UTF-8) file=lf.txt slurp=1 use_eof=1
mode=:crlf:utf8 file=crlf.txt slurp=0 use_eof=0
mode=:crlf:utf8 file=crlf.txt slurp=0 use_eof=1
mode=:crlf:utf8 file=crlf.txt slurp=1 use_eof=0
mode=:crlf:utf8 file=crlf.txt slurp=1 use_eof=1
mode=:crlf:utf8 file=lf.txt slurp=0 use_eof=0
mode=:crlf:utf8 file=lf.txt slurp=0 use_eof=1
mode=:crlf:utf8 file=lf.txt slurp=1 use_eof=0
mode=:crlf:utf8 file=lf.txt slurp=1 use_eof=1
mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=0
mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=1
mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=0
mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=1
mode=:crlf:encoding(UTF-8) file=lf.txt slurp=0 use_eof=0
mode=:crlf:encoding(UTF-8) file=lf.txt slurp=0 use_eof=1
mode=:crlf:encoding(UTF-8) file=lf.txt slurp=1 use_eof=0
mode=:crlf:encoding(UTF-8) file=lf.txt slurp=1 use_eof=1
utf8 "\xA5" does not map to Unicode at foo.pl line 5.
$
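The attached script.pl is not reproduced in this thread. As a rough idea of what such a driver could look like, here is a hypothetical stand-in that takes the same four arguments (mode, file, slurp, use_eof). The sub name read_with and the warning capture are assumptions for illustration, not the attachment's actual contents.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the attached script.pl.
# Usage: perl repro.pl MODE FILE SLURP USE_EOF
sub read_with {
    my ($mode, $file, $slurp, $use_eof) = @_;

    # Collect warnings instead of printing them straight away.
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };

    open my $fh, "<$mode", $file or die "open $file: $!";

    # Optionally test eof before the first read, as the original slurp() did.
    if ($use_eof && eof $fh) {
        return ('', \@warnings);
    }

    # Slurp the whole file, or read it line by line.
    local $/ = $slurp ? undef : "\n";
    my @chunks = <$fh>;
    return (join('', @chunks), \@warnings);
}

if (@ARGV == 4) {
    my ($text, $warnings) = read_with(@ARGV);
    print STDERR @$warnings;
}
```

Run under the shell loop above, a driver of this shape would print any decoding warning directly after the corresponding "mode=..." line, which is how the output above should be read.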

The original p5p thread on this may have some additional information: http://nntp.perl.org/group/perl.perl5.porters/234856

--
Aaron Crane ** http://aaroncrane.co.uk/

p5pRT · 2016-03-18T13:19:04Z

From @arc

å

p5pRT · 2016-03-18T13:19:04Z

From @arc

å

p5pRT · 2016-03-18T13:19:04Z

From @arc

script.pl

p5pRT · 2017-02-28T13:03:53Z

@jkeenan - Status changed from 'new' to 'open'

p5pRT · 2017-02-28T13:25:56Z

From @jkeenan

After the above post from Aaron, Leon Timmermans added the following, which I'm quoting here to get all the current state of discussion into RT:

#####
On Sun, Mar 6, 2016 at 2:54 AM, Aaron Crane <arc@cpan.org> wrote:

> I'm far from being an expert in the workings of PerlIO, but my guess
> is that the combination of :crlf and :encoding(UTF-8) layers isn't
> handling the C<< eof $fh >> test correctly: it looks from the outside
> like a whole character gets read from the filehandle (to determine
> whether it has reached its end), but then only the last byte of that
> character is returned to the buffer.

Shocking, an issue in :crlf or :encoding…

What happens is that a byte gets read and then unread. For a :perlio layer
that byte would just go back to the existing buffer (which has space
because it just came out of there), but :crlf is uniquely special
(PerlIOCrlf_unread in perlio.c, if you're curious). Commenting out said
unique snowflake code and using the slower but more obvious path
(PerlIOBase_unread) seems to solve this, so that may be one half of the
solution.

The other half would be not to do this read/unread silliness in the first
place. We can check for eof without removing anything from the buffer. -T
and -B probably have a similar issue, but I have a hard time imagining how
someone triggers that accidentally.

> - using a file that ends in \r\n rather than \n

That actually was the crucial hint to where the problem is located :-)

Leon
#####
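Until the read/unread issue described above is fixed in perlio.c, user code can sidestep it by not calling eof before the read. The following is only a sketch under that assumption: slurp_no_eof is a hypothetical name, and the scratch file just recreates the LF-terminated input from the original report.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical slurp variant: no eof test before the read, so the
# read/unread of a single byte through :crlf is never triggered here.
sub slurp_no_eof {
    my ($file, $layers) = @_;
    open my $fh, "<$layers", $file or die "open $file: $!";
    local $/;
    my $data = <$fh>;
    # Fall back to '' if the read returned undef.
    return defined $data ? $data : '';
}

# Recreate the LF-terminated test input.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xc3\xa5\xc3\xa5\xc3\xa5\x0a";
close $out or die "close: $!";

my $text = slurp_no_eof($path, ':crlf:encoding(UTF-8)');
printf "read %d characters\n", length $text;
```

This matches the use_eof=0 rows in Aaron's results, where the :crlf:encoding(UTF-8) combination produced no warning. It is a workaround in calling code only and does not address PerlIOCrlf_unread itself.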

Is anyone able to analyze this further?

Thank you very much.

--
James E Keenan (jkeenan@cpan.org)

p5pRT · 2017-02-28T14:04:16Z

From @Leont

On Tue, Feb 28, 2017 at 2:25 PM, James E Keenan via RT
<perlbug-followup@perl.org> wrote:

> Is anyone able to analyze this further?
>
> Thank you very much.

I don't think this ticket is in need of analysis, I think it's in need of a fix.

Leon

p5pRT · 2017-02-28T15:53:28Z

From cm.perl@abtela.com

On 28/02/2017 at 14:25, James E Keenan via RT wrote:

> Is anyone able to analyze this further?
>
> Thank you very much.

To me it looks very much like #120797, in which Leon Timmermans suggested
simply getting rid of PerlIOCrlf_unread in perlio.c.

Regards,

--Christian

p5pRT added the Severity Low label Oct 19, 2019

xenu added the distro-mswin32 label Oct 20, 2021

xenu removed the Severity Low label Dec 29, 2021
