Encoding I/O Layer Difference on Windows #15218

Open
p5pRT opened this issue Mar 7, 2016 · 9 comments

Comments

@p5pRT

p5pRT commented Mar 7, 2016

Migrated from rt.perl.org#127668 (status was 'open')

Searchable as RT127668$

@p5pRT

p5pRT commented Mar 7, 2016

From dwheeler@cpan.org

I have a file to which I’ve written:

  "\xc3\xa5\xc3\xa5\xc3\xa5\x0a"

These bytes correspond to the UTF-8 string:

  "ååå\n"

I want to read this file into a scalar, so I wrote this Perl:

  sub slurp {
      my ($file) = @_;
      open my $fh, "<:encoding(UTF-8)", $file or die $!;
      return '' if eof $fh;
      local $/;
      return <$fh>;
  }

  my $fn = shift or die "Usage: $0 [file]\n";
  slurp $fn;

Works great on my Mac and on *nix machines, but not on Windows. There it emits a warning:

  utf8 "\xA5" does not map to Unicode at try.pl line 6

I can fix it by changing the I/O layer to :raw:encoding(UTF-8). My guess is that it then reads the file in as raw bytes first and does the conversion afterwards. I can also break it on my Mac by changing the layer to :crlf:encoding(UTF-8).
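A minimal sketch of the :raw workaround described above. The scratch file and the name slurp_raw are assumptions made for illustration; only the layer string comes from the report.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the UTF-8 bytes plus LF from the report to a scratch file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;    # raw bytes, no newline translation on write
print {$out} "\xc3\xa5\xc3\xa5\xc3\xa5\x0a";
close $out or die "close: $!";

# Same slurp as in the report, but with :raw ahead of the decoder so that
# no platform-default :crlf layer sits underneath :encoding(UTF-8).
# slurp_raw is a hypothetical name.
sub slurp_raw {
    my ($file) = @_;
    open my $fh, '<:raw:encoding(UTF-8)', $file or die "open $file: $!";
    return '' if eof $fh;
    local $/;
    return <$fh>;
}

my $text = slurp_raw($path);
printf "%d characters\n", length $text;
```

The trade-off is that a CRLF-terminated file read this way keeps its \r characters, so the caller has to deal with them if they matter.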

But I don’t understand why the encoding layer’s parsing of a file with \x0a line endings should vary by platform. Sure, line endings on Windows might typically be \x0d\x0a, but why would the I/O layer care when I’ve told it that the file encoding is UTF-8?

Note that this does not occur with data in memory. This works fine on both platforms:

  use Encode qw(decode_utf8);
  my $data = "\xc3\xa5\xc3\xa5\xc3\xa5\x0a";
  decode_utf8 $data;

Is the I/O layer assuming that, because we’re on Windows, line endings need to be converted to \r\n before decoding? Is it implicitly using :crlf on Windows? Doesn’t seem like it’d be necessary if I’ve already told it what encoding to use, and shouldn't bork that encoding in any event.
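One way to see which layers actually end up on the handle is PerlIO::get_layers. The sketch below opens an empty scratch file (an assumption made for illustration) with the same mode string as the report and prints the layer stack.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# An empty scratch file is enough; only the layer stack is of interest.
my ($out, $path) = tempfile(UNLINK => 1);
close $out or die "close: $!";

open my $fh, '<:encoding(UTF-8)', $path or die "open $path: $!";

# get_layers returns the layer names from the bottom of the stack upwards.
my @layers = PerlIO::get_layers($fh);
print "layers: @layers\n";
close $fh;
```

If a crlf entry shows up below the encoding layer, the platform default is being applied implicitly, which is what the questions above are asking about.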

@p5pRT

p5pRT commented Mar 18, 2016

From @arc

As requested, I've attached a program and test inputs demonstrating that the problem shows up iff all the following are true:

- The :crlf layer is used
- The :encoding(UTF-8) layer is used (not :utf8)
- The input ends in LF rather than CRLF
- The program tests eof($fh) before reading from the filehandle
- $/ is set to undef

I get the following results when running the program in various ways:

$ for mode in ':encoding(UTF-8)' :crlf:utf8 ':crlf:encoding(UTF-8)'; do
    for file in crlf.txt lf.txt; do
      for slurp in 0 1; do
        for use_eof in 0 1; do
          echo "mode=$mode file=$file slurp=$slurp use_eof=$use_eof"
          perl script.pl $mode $file $slurp $use_eof
        done
      done
    done
  done
mode=:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=0
mode=:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=1
mode=:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=0
mode=:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=1
mode=:encoding(UTF-8) file=lf.txt slurp=0 use_eof=0
mode=:encoding(UTF-8) file=lf.txt slurp=0 use_eof=1
mode=:encoding(UTF-8) file=lf.txt slurp=1 use_eof=0
mode=:encoding(UTF-8) file=lf.txt slurp=1 use_eof=1
mode=:crlf:utf8 file=crlf.txt slurp=0 use_eof=0
mode=:crlf:utf8 file=crlf.txt slurp=0 use_eof=1
mode=:crlf:utf8 file=crlf.txt slurp=1 use_eof=0
mode=:crlf:utf8 file=crlf.txt slurp=1 use_eof=1
mode=:crlf:utf8 file=lf.txt slurp=0 use_eof=0
mode=:crlf:utf8 file=lf.txt slurp=0 use_eof=1
mode=:crlf:utf8 file=lf.txt slurp=1 use_eof=0
mode=:crlf:utf8 file=lf.txt slurp=1 use_eof=1
mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=0
mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=1
mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=0
mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=1
mode=:crlf:encoding(UTF-8) file=lf.txt slurp=0 use_eof=0
mode=:crlf:encoding(UTF-8) file=lf.txt slurp=0 use_eof=1
mode=:crlf:encoding(UTF-8) file=lf.txt slurp=1 use_eof=0
mode=:crlf:encoding(UTF-8) file=lf.txt slurp=1 use_eof=1
utf8 "\xA5" does not map to Unicode at foo.pl line 5.
$

The original p5p thread on this may have some additional information: http://nntp.perl.org/group/perl.perl5.porters/234856
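The attached script.pl is not reproduced in this thread. The following is a rough stand-in for the matrix above, not the attachment itself: the name try_read, the generated file contents and the in-process loop are all assumptions. It records the warnings raised for each combination. Whether any appear depends on the perl build being tested.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical stand-in for script.pl: read one file through the given
# layers, optionally testing eof first and optionally slurping.
# Returns the warnings raised while reading.
sub try_read {
    my ($mode, $file, $slurp, $use_eof) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    open my $fh, "<$mode", $file or die "open $file: $!";
    return @warnings if $use_eof && eof $fh;
    local $/ = $slurp ? undef : "\n";
    while (defined(my $chunk = <$fh>)) { }    # read and discard everything
    close $fh;
    return @warnings;
}

# Assumed test inputs: the bytes from the original report, once with an
# LF terminator and once with CRLF.
my $dir = tempdir(CLEANUP => 1);
my %content = (
    'lf.txt'   => "\xc3\xa5\xc3\xa5\xc3\xa5\x0a",
    'crlf.txt' => "\xc3\xa5\xc3\xa5\xc3\xa5\x0d\x0a",
);
for my $name (sort keys %content) {
    open my $out, '>:raw', "$dir/$name" or die "open $name: $!";
    print {$out} $content{$name};
    close $out or die "close $name: $!";
}

my @results;
for my $mode (':encoding(UTF-8)', ':crlf:utf8', ':crlf:encoding(UTF-8)') {
    for my $name ('crlf.txt', 'lf.txt') {
        for my $slurp (0, 1) {
            for my $use_eof (0, 1) {
                my @w = try_read($mode, "$dir/$name", $slurp, $use_eof);
                push @results, [$mode, $name, $slurp, $use_eof, scalar @w];
                print "mode=$mode file=$name slurp=$slurp use_eof=$use_eof\n";
                print "  $_" for @w;
            }
        }
    }
}
```

Running everything in one process keeps the sketch self-contained. The shell loop above starts a fresh perl for each combination instead.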

--
Aaron Crane ** http://aaroncrane.co.uk/

@p5pRT

p5pRT commented Mar 18, 2016

From @arc

å

@p5pRT

p5pRT commented Mar 18, 2016

From @arc

å

@p5pRT

p5pRT commented Mar 18, 2016

From @arc

script.pl

@p5pRT

p5pRT commented Feb 28, 2017

@jkeenan - Status changed from 'new' to 'open'

@p5pRT

p5pRT commented Feb 28, 2017

From @jkeenan

On Fri, 18 Mar 2016 13:19:04 GMT, arc wrote:

As requested, I've attached a program and test inputs demonstrating
that the problem shows up iff all the following are true:

- The :crlf layer is used
- The :encoding(UTF-8) layer is used (not :utf8)
- The input ends in LF rather than CRLF
- The program tests eof($fh) before reading from the filehandle
- $/ is set to undef

I get the following results when running the program in various ways:

$ for mode in ':encoding(UTF-8)' :crlf:utf8 ':crlf:encoding(UTF-8)'; do
    for file in crlf.txt lf.txt; do
      for slurp in 0 1; do
        for use_eof in 0 1; do
          echo "mode=$mode file=$file slurp=$slurp use_eof=$use_eof"
          perl script.pl $mode $file $slurp $use_eof
        done
      done
    done
  done
mode=:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=0
mode=:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=1
mode=:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=0
mode=:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=1
mode=:encoding(UTF-8) file=lf.txt slurp=0 use_eof=0
mode=:encoding(UTF-8) file=lf.txt slurp=0 use_eof=1
mode=:encoding(UTF-8) file=lf.txt slurp=1 use_eof=0
mode=:encoding(UTF-8) file=lf.txt slurp=1 use_eof=1
mode=:crlf:utf8 file=crlf.txt slurp=0 use_eof=0
mode=:crlf:utf8 file=crlf.txt slurp=0 use_eof=1
mode=:crlf:utf8 file=crlf.txt slurp=1 use_eof=0
mode=:crlf:utf8 file=crlf.txt slurp=1 use_eof=1
mode=:crlf:utf8 file=lf.txt slurp=0 use_eof=0
mode=:crlf:utf8 file=lf.txt slurp=0 use_eof=1
mode=:crlf:utf8 file=lf.txt slurp=1 use_eof=0
mode=:crlf:utf8 file=lf.txt slurp=1 use_eof=1
mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=0
mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=0 use_eof=1
mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=0
mode=:crlf:encoding(UTF-8) file=crlf.txt slurp=1 use_eof=1
mode=:crlf:encoding(UTF-8) file=lf.txt slurp=0 use_eof=0
mode=:crlf:encoding(UTF-8) file=lf.txt slurp=0 use_eof=1
mode=:crlf:encoding(UTF-8) file=lf.txt slurp=1 use_eof=0
mode=:crlf:encoding(UTF-8) file=lf.txt slurp=1 use_eof=1
utf8 "\xA5" does not map to Unicode at foo.pl line 5.
$

The original p5p thread on this may have some additional information:
http://nntp.perl.org/group/perl.perl5.porters/234856

After the above post from Aaron, Leon Timmermans added the following, which I'm quoting here to get all the current state of discussion into RT​:

#####
On Sun, Mar 6, 2016 at 2:54 AM, Aaron Crane <arc@cpan.org> wrote:

I'm far from being an expert in the workings of PerlIO, but my guess
is that the combination of :crlf and :encoding(UTF-8) layers isn't
handling the C<< eof $fh >> test correctly: it looks from the outside
like a whole character gets read from the filehandle (to determine
whether it has reached its end), but then only the last byte of that
character is returned to the buffer.

Shocking, an issue in :crlf or :encoding…

What happens is that a byte gets read and then unread. For a :perlio layer
that byte would just go back to the existent buffer (which has space
because it just came out of there), but :crlf is uniquely special
(PerlIOCrlf_unread in perlio.c, if you're curious). Commenting out said
unique snowflake code and using the slower but more obvious path
(PerlIOBase_unread) seems to solve this, so that may be one half of the
solution.

The other half would be not to do this read/unread silliness in the first
place. We can check for eof without removing anything from the buffer. -T
and -B probably have a similar issue, but I have a hard time imagining how
someone triggers that accidentally.

- using a file that ends in \r\n rather than \n

That actually was the crucial hint to where the problem is located :-)

Leon
#####
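On the caller's side, the read/unread probe described above can be sidestepped by not testing eof before the slurp, since the report lists that test as one of the necessary conditions. A minimal sketch, assuming a scratch file with the LF-terminated bytes from the original report; slurp_no_eof is a hypothetical name:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file holding the LF-terminated bytes from the original report.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xc3\xa5\xc3\xa5\xc3\xa5\x0a";
close $out or die "close: $!";

# slurp_no_eof never calls eof, so nothing is read and pushed back before
# the real read. In slurp mode the first readline on an empty file returns
# '', so the defined-check only guards a handle that is already exhausted.
sub slurp_no_eof {
    my ($file, $layers) = @_;
    open my $fh, "<$layers", $file or die "open $file: $!";
    local $/;
    my $data = <$fh>;
    return defined $data ? $data : '';
}

my $text = slurp_no_eof($path, ':crlf:encoding(UTF-8)');
printf "%d characters\n", length $text;
```

This only avoids the trigger in calling code. It does not change the behaviour of PerlIOCrlf_unread itself.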

Is anyone able to analyze this further?

Thank you very much.

--
James E Keenan (jkeenan@cpan.org)

@p5pRT

p5pRT commented Feb 28, 2017

From @Leont

On Tue, Feb 28, 2017 at 2:25 PM, James E Keenan via RT
<perlbug-followup@perl.org> wrote:

Is anyone able to analyze this further?

Thank you very much.

I don't think this ticket is in need of analysis, I think it's in need of a fix.

Leon

@p5pRT

p5pRT commented Feb 28, 2017

From cm.perl@abtela.com

On 28/02/2017 at 14:25, James E Keenan via RT wrote:

Is anyone able to analyze this further?

Thank you very much.

To me it looks very much like #120797, in which Leon Timmermans suggested
simply getting rid of PerlIOCrlf_unread in perlio.c.

Regards,

--Christian
