
[PATCH] Replace File::Slurp with Path::Tiny in perlfaq5 #13836

Closed
p5pRT opened this issue May 14, 2014 · 54 comments


p5pRT commented May 14, 2014

Migrated from rt.perl.org#121870 (status was 'resolved')

Searchable as RT121870$


p5pRT commented May 14, 2014

From mcgrath.martin@gmail.com

Apologies if this has already been discussed. I don't subscribe to p5p
but have searched the list before posting, and didn't find anything
relating to this issue.

This patch replaces the example/recommendation of using File​::Slurp
with Path​::Tiny in perlfaq5.

The reasoning behind this is an outstanding critical bug with File​::Slurp​:

https://rt.cpan.org/Public/Bug/Display.html?id=83126

This is a bug report for perl from mcgrath.martin@​gmail.com,
generated with the help of perlbug 1.40 running under perl 5.19.12.

Inline Patch
diff --git a/cpan/perlfaq/lib/perlfaq5.pod b/cpan/perlfaq/lib/perlfaq5.pod
index a2baf16..25d494f 100644
--- a/cpan/perlfaq/lib/perlfaq5.pod
+++ b/cpan/perlfaq/lib/perlfaq5.pod
@@ -200,7 +200,7 @@ you can fit the whole thing in memory!):

     print $out $content;

-Modules such as L<File::Slurp> and L<Tie::File> can help with that
+Modules such as L<Path::Tiny> and L<Tie::File> can help with that
 too. If you can, however, avoid reading the entire file at once. Perl
 won't give that memory back to the operating system until the process
 finishes.
@@ -1133,13 +1133,13 @@ C<$DB_RECNO> bindings, which allow you to tie an array to a file so that
 accessing an element of the array actually accesses the corresponding
 line in the file.
 
-If you want to load the entire file, you can use the L<File::Slurp>
+If you want to load the entire file, you can use the L<Path::Tiny>
 module to do it in one simple and efficient step:
 
-    use File::Slurp;
+    use Path::Tiny;
 
-    my $all_of_it = read_file($filename); # entire file in scalar
-    my @all_lines = read_file($filename); # one line per element
+    my $all_of_it = path($filename)->slurp; # entire file in scalar
+    my @all_lines = path($filename)->lines; # one line per element
 
 Or you can read the entire file contents into a scalar like this:
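
For context, a minimal sketch of the Path::Tiny interface the patch recommends
(the file name is a placeholder; slurp_utf8, lines_utf8 and spew_utf8 are the
encoding-aware variants of the calls shown in the diff):

    use Path::Tiny;

    my $file = path("example.txt");               # placeholder file name
    my $all  = $file->slurp_utf8;                 # whole file, decoded as UTF-8
    my @rows = $file->lines_utf8({ chomp => 1 }); # decoded lines, newlines stripped
    $file->spew_utf8("replacement contents\n");   # writes to a temp file, then renames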


Flags​:
  category=docs
  severity=wishlist


Site configuration information for perl 5.19.12​:

Configured by marto at Sat May 10 12​:15​:37 BST 2014.

Summary of my perl5 (revision 5 version 19 subversion 12) configuration​:
  Derived from​: f78f6d1
  Platform​:
  osname=linux, osvers=3.11.0-20-generic, archname=x86_64-linux
  uname='linux marto-virtualbox 3.11.0-20-generic #34-ubuntu smp tue
apr 1 20​:40​:25 utc 2014 x86_64 x86_64 x86_64 gnulinux '
  config_args='-des -Dusedevel -Dprefix=/home/marto/bleedperl'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=undef, usemultiplicity=undef
  use64bitint=define, use64bitall=define, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags ='-fwrapv -fno-strict-aliasing -pipe
-fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64',
  optimize='-O2',
  cppflags='-fwrapv -fno-strict-aliasing -pipe -fstack-protector
-I/usr/local/include'
  ccversion='', gccversion='4.8.1', gccosandvers=''
  intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
  ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib
/usr/lib/gcc/x86_64-linux-gnu/4.8/include-fixed
/usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu
/lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
  libs=-lnsl -ldl -lm -lcrypt -lutil -lc
  perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
  libc=libc-2.17.so, so=so, useshrplib=false, libperl=libperl.a
  gnulibc_version='2.17'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib
-fstack-protector'

Locally applied patches​:
  uncommitted-changes


@​INC for perl 5.19.12​:
  /home/marto/bleedperl/lib/site_perl/5.19.12/x86_64-linux
  /home/marto/bleedperl/lib/site_perl/5.19.12
  /home/marto/bleedperl/lib/5.19.12/x86_64-linux
  /home/marto/bleedperl/lib/5.19.12
  .


Environment for perl 5.19.12​:
  HOME=/home/marto
  LANG=en_GB.UTF-8
  LANGUAGE=en_GB​:en
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/usr/lib/lightdm/lightdm​:/usr/local/sbin​:/usr/local/bin​:/usr/sbin​:/usr/bin​:/sbin​:/bin​:/usr/games​:/usr/local/games
  PERL_BADLANG (unset)
  SHELL=/bin/bash


p5pRT commented May 14, 2014

From mcgrath.martin@gmail.com

0001-Replace-File-Slurp-with-Path-Tiny-in-perlfaq5.patch
From: Martin McGrath <mcgrath.martin@gmail.com>
Date: Sat, 10 May 2014 12:47:51 +0100
Subject: [PATCH] Replace File::Slurp with Path::Tiny in perlfaq5
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------------1.8.3.2"

This is a multi-part message in MIME format.
--------------1.8.3.2
Content-Type: text/plain; charset=UTF-8; format=fixed
Content-Transfer-Encoding: 8bit

---
 cpan/perlfaq/lib/perlfaq5.pod | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)


--------------1.8.3.2
Content-Type: text/x-patch; name="0001-Replace-File-Slurp-with-Path-Tiny-in-perlfaq5.patch"
Content-Transfer-Encoding: 8bit
Content-Disposition: attachment; filename="0001-Replace-File-Slurp-with-Path-Tiny-in-perlfaq5.patch"

diff --git a/cpan/perlfaq/lib/perlfaq5.pod b/cpan/perlfaq/lib/perlfaq5.pod
index a2baf16..25d494f 100644
--- a/cpan/perlfaq/lib/perlfaq5.pod
+++ b/cpan/perlfaq/lib/perlfaq5.pod
@@ -200,7 +200,7 @@ you can fit the whole thing in memory!):
 
     print $out $content;
 
-Modules such as L<File::Slurp> and L<Tie::File> can help with that
+Modules such as L<Path::Tiny> and L<Tie::File> can help with that
 too. If you can, however, avoid reading the entire file at once. Perl
 won't give that memory back to the operating system until the process
 finishes.
@@ -1133,13 +1133,13 @@ C<$DB_RECNO> bindings, which allow you to tie an array to a file so that
 accessing an element of the array actually accesses the corresponding
 line in the file.
 
-If you want to load the entire file, you can use the L<File::Slurp>
+If you want to load the entire file, you can use the L<Path::Tiny>
 module to do it in one simple and efficient step:
 
-    use File::Slurp;
+    use Path::Tiny;
 
-    my $all_of_it = read_file($filename); # entire file in scalar
-    my @all_lines = read_file($filename); # one line per element
+    my $all_of_it = path($filename)->slurp; # entire file in scalar
+    my @all_lines = path($filename)->lines; # one line per element
 
 Or you can read the entire file contents into a scalar like this:
 

--------------1.8.3.2--



p5pRT commented May 14, 2014

From @jhi

On Wednesday-201405-14, 5​:11, Martin McGrath (via RT) wrote​:

# New Ticket Created by Martin McGrath
# Please include the string​: [perl #121870]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121870 >

Apologies if this has already been discussed. I don't subscribe to p5p
but have searched the list before posting, and didn't find anything
relating to this issue.

This patch replaces the example/recommendation of using File​::Slurp
with Path​::Tiny in perlfaq5.

The reasoning behind this is an outstanding critical bug with File​::Slurp​:

https://rt.cpan.org/Public/Bug/Display.html?id=83126

In addition to File​::Slurp / perlfaq problem, shouldn't this​:

sysread treats any :encoding(...) as effectively :utf8.
Thus, requesting { binmode => "​:encoding(UTF-8)" } (e.g. strict UTF-8
compliance) actually results in Perl's lax, insecure utf8 decoding being
used instead.
...
More importantly, it will interpret any encoding as :utf8, even for
example :encoding(UTF-16).

be classified as a serious issue in sysread?

Though, I think, one could well argue that trying to read anything else
than bytes (be it utf8 of any kind, or any other encoding) with sysread
is a dubious proposition to begin with.

This is a bug report for perl from mcgrath.martin@​gmail.com,
generated with the help of perlbug 1.40 running under perl 5.19.12.


p5pRT commented May 14, 2014

The RT System itself - Status changed from 'new' to 'open'


p5pRT commented May 14, 2014

From @Leont

On Wed, May 14, 2014 at 4​:14 PM, Jarkko Hietaniemi <jhi@​iki.fi> wrote​:

In addition to File​::Slurp / perlfaq problem, shouldn't this​:

sysread treats any :encoding(...) as effectively :utf8.
Thus, requesting { binmode => "​:encoding(UTF-8)" } (e.g. strict UTF-8
compliance) actually results in Perl's lax, insecure utf8 decoding being
used instead.
...
More importantly, it will interpret any encoding as :utf8, even for
example :encoding(UTF-16).

be classified as a serious issue in sysread?

Though, I think, one could well argue that trying to read anything else
than bytes (be it utf8 of any kind, or any other encoding) with sysread is
a dubious proposition to begin with.

Yes. This is essentially a core bug that's leaking out because File​::Slurp
is stupid. sysread might as well die in this particular case IMO, but I'm
not sure if we can retroactively make it do that.

Leon


p5pRT commented May 14, 2014

From @epa

Interesting. Have you suggested this as a Perl​::Critic policy too?

--
Ed Avis <eda@​waniasset.com>


p5pRT commented May 14, 2014

From @jhi

On Wednesday-201405-14, 10​:23, Leon Timmermans wrote​:

Yes. This is essentially a core bug that's leaking out because
File​::Slurp is stupid. sysread might as well die in this particular case
IMO, but I'm not sure if we can retroactively make it do that.

Probably not for 5.20, harhar... but certainly we could try purposefully
breaking this for 5.21.


p5pRT commented May 14, 2014

From @jhi

Yes. This is essentially a core bug that's leaking out because
File​::Slurp is stupid. sysread might as well die in this particular case
IMO, but I'm not sure if we can retroactively make it do that.

I'll file a bug on this.

Leon


p5pRT commented May 14, 2014

From @perhunter

On 05/14/2014 10​:23 AM, Leon Timmermans wrote​:

On Wed, May 14, 2014 at 4​:14 PM, Jarkko Hietaniemi <jhi@​iki.fi> wrote​:

In addition to File​::Slurp / perlfaq problem, shouldn't this​:

sysread treats any :encoding(...) as effectively :utf8.
Thus, requesting { binmode => "​:encoding(UTF-8)" } (e.g. strict UTF-8
compliance) actually results in Perl's lax, insecure utf8 decoding being
used instead.
...
More importantly, it will interpret any encoding as :utf8, even for
example :encoding(UTF-16).

be classified as a serious issue in sysread?

Though, I think, one could well argue that trying to read anything else
than bytes (be it utf8 of any kind, or any other encoding) with sysread is
a dubious proposition to begin with.

Yes. This is essentially a core bug that's leaking out because File​::Slurp
is stupid. sysread might as well die in this particular case IMO, but I'm
not sure if we can retroactively make it do that.

and how would you go about fixing it in file​::slurp? i am not a
unicode/utf guru by any stretch (i say i am an ascii bigot! :). would
just doing a raw sysread all the time and then converting the buffer as
directed by binmode work? any other suggestions for how to read utf
files with ease and speed?

thanx,

uri

--
Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http​://PerlHunter.com


p5pRT commented May 14, 2014

From @Leont

On Wed, May 14, 2014 at 5​:14 PM, Uri Guttman <uri@​stemsystems.com> wrote​:

and how would you go about fixing it in file​::slurp? i am not a
unicode/utf guru by any stretch (i say i am an ascii bigot! :). would just
doing a raw sysread all the time and then converting the buffer as directed
by binmode work? any other suggestions for how to read utf files with ease
and speed?

File​::Slurp is nothing more than a workaround for a performance bug in
perl. There is no reason «my $foo = do { local $/; <$fh> };» would need to
be any slower than what File​::Slurp is doing right now. In fact I'm
planning to fix just that in 5.21.

Leon


p5pRT commented May 14, 2014

From @jhi

On Wednesday-201405-14, 11​:30, Leon Timmermans wrote​:

On Wed, May 14, 2014 at 5​:14 PM, Uri Guttman <uri@​stemsystems.com
<mailto​:uri@​stemsystems.com>> wrote​:

and how would you go about fixing it in file::slurp? i am not a
unicode/utf guru by any stretch (i say i am an ascii bigot! :).
would just doing a raw sysread all the time and then converting the
buffer as directed by binmode work? any other suggestions for how to
read utf files with ease and speed?

File​::Slurp is nothing more than a workaround for a performance bug in
perl. There is no reason «my $foo = do { local $/; <$fh> };» would need
to be any slower than what File​::Slurp is doing right now. In fact I'm
planning to fix just that in 5.21.

Leon

FWIW re the UTF-8 validity/strictness​: I think we're well within our
rights to break things, quoting Standards (via UTF-8 Wikipedia page)​:

RFC 3629 states "Implementations of the decoding algorithm MUST protect
against decoding invalid sequences." ... The Unicode Standard requires
decoders to "...treat any ill-formed code unit sequence as an error
condition. This guarantees that it will neither interpret nor emit an
ill-formed code unit sequence."
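
As an illustration of the strictness those standards call for, a minimal sketch
(not from the original mail) using Encode's strict "UTF-8" encoding with
FB_CROAK; the byte string is an arbitrary ill-formed example:

    use Encode qw( decode FB_CROAK );

    my $bytes = "abc\xC3\x28def"; # 0xC3 opens a two-byte sequence, 0x28 is not a continuation byte

    # Strict "UTF-8" with FB_CROAK refuses to decode the ill-formed sequence
    # instead of silently guessing, which is what RFC 3629 asks for.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK) };
    print defined $text ? "decoded\n" : "rejected: $@";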


p5pRT commented May 14, 2014

From @xdg

On Wed, May 14, 2014 at 12​:34 PM, Jarkko Hietaniemi <jhi@​iki.fi> wrote​:

RFC 3629 states "Implementations of the decoding algorithm MUST protect
against decoding invalid sequences." ... The Unicode Standard requires
decoders to "...treat any ill-formed code unit sequence as an error
condition. This guarantees that it will neither interpret nor emit an
ill-formed code unit sequence."

It's worth considering two separate issues​:

* invalid UTF-8 encoding
* private/nonchar codepoints

As I understand it, :utf8 is lax in both respects. I'd be very
supportive of fixing the first, even if it breaks backwards
compatibility, but don't think we should address the second.

--
David Golden <xdg@​xdg.me> Twitter/IRC​: @​xdg


p5pRT commented May 15, 2014

From @druud62

On 2014-05-14 17​:30, Leon Timmermans wrote​:

On Wed, May 14, 2014 at 5​:14 PM, Uri Guttman <uri@​stemsystems.com
<mailto​:uri@​stemsystems.com>> wrote​:

and how would you go about fixing it in file::slurp? i am not a
unicode/utf guru by any stretch (i say i am an ascii bigot! :).
would just doing a raw sysread all the time and then converting the
buffer as directed by binmode work? any other suggestions for how to
read utf files with ease and speed?

File​::Slurp is nothing more than a workaround for a performance bug in
perl. There is no reason «my $foo = do { local $/; <$fh> };» would need
to be any slower than what File​::Slurp is doing right now. In fact I'm
planning to fix just that in 5.21.

Not sure if this is still true for the latest Perls, but you better
write it as​:

  my $foo; { local $/; $foo = <$fh> }

or it will use twice the memory.

--
Ruud


p5pRT commented May 15, 2014

From @perhunter

On 05/14/2014 11​:30 AM, Leon Timmermans wrote​:

On Wed, May 14, 2014 at 5​:14 PM, Uri Guttman <uri@​stemsystems.com> wrote​:

and how would you go about fixing it in file​::slurp? i am not a
unicode/utf guru by any stretch (i say i am an ascii bigot! :). would just
doing a raw sysread all the time and then converting the buffer as directed
by binmode work? any other suggestions for how to read utf files with ease
and speed?

File​::Slurp is nothing more than a workaround for a performance bug in
perl. There is no reason «my $foo = do { local $/; <$fh> };» would need to
be any slower than what File​::Slurp is doing right now. In fact I'm
planning to fix just that in 5.21.

if you think that is all there is to it, then do the workaround. most
coders think read_file( $filename) is a much better piece of code than
the infamous do/local hack. it isn't just a speed up (and check out the
benchmark script to see how much it does blows away do/local) but also
has a better api, better error handling options, more flexible ways to
get the data back. of course you can do atomic writing on your own.
edit_file and edit_file_lines would be trivial for you to write with
open/read/write and all the other options. and where is the equivalent
of write_file with do local? gonna need at least an open call and a
print along with error handling and more.

over 500 modules and distros use file​::slurp. tell them all to use the
do/local thing. i am sure you will be rewarded with gold and jewels.

in the mean time, instead of saying it uses sysread and can't handle utf
correctly, how about saying how to properly handle it? maybe those 500
modules will like that instead.

uri

--
Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http​://PerlHunter.com


p5pRT commented May 15, 2014

From mcgrath.martin@gmail.com

On 14 May 2014 15​:45, Ed Avis via RT <perlbug-followup@​perl.org> wrote​:

Interesting. Have you suggested this as a Perl​::Critic policy too?

Not as yet, I've forked the repo and will work on this tonight.

Thanks


p5pRT commented May 15, 2014

From @jhi

On Wednesday-201405-14, 22​:57, Uri Guttman wrote​:

in the mean time, instead of saying it uses sysread and can't handle utf
correctly, how about saying how to properly handle it? maybe those 500
modules will like that instead.

Based on some torture testing I just did, sysread() isn't the only thing
broken regarding :encoding(blah)... working on it, but it will be 5.21
before fixes happen.


p5pRT commented May 15, 2014

From @tonycoz

On Wed, May 14, 2014 at 10​:57​:41PM -0400, Uri Guttman wrote​:

in the mean time, instead of saying it uses sysread and can't handle
utf correctly, how about saying how to properly handle it? maybe
those 500 modules will like that instead.

Don't use sysread() if there are any layers - since it ignores
everything but the :utf8 flag[1]

I'd tend to fallback to read(), whether that meets your performance
requirements I don't know.

Tony

[1] in a scary way, eg. if your file is UCS-2BE and you open with
:encoding(UCS-2BE), sysread() will (try to) treat that UCS-2BE data as
UTF-8 data.
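
A minimal sketch of that read()-based fallback ($filename and the encoding are
placeholders); unlike sysread(), read() goes through the whole layer stack, at
the cost of buffering:

    open my $fh, '<:encoding(UTF-16BE)', $filename or die "open $filename: $!";
    my $content = '';
    while (1) {
        my $got = read($fh, $content, 1 << 16, length $content); # append to the end
        die "read $filename: $!" unless defined $got;
        last unless $got;                                        # 0 means EOF
    }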


p5pRT commented May 15, 2014

From @Leont

On Thu, May 15, 2014 at 4​:57 AM, Uri Guttman <uri@​stemsystems.com> wrote​:

if you think that is all there is to it, then do the workaround. most
coders think read_file( $filename) is a much better piece of code than the
infamous do/local hack. it isn't just a speed up (and check out the
benchmark script to see how much it does blows away do/local) but also has
a better api, better error handling options, more flexible ways to get the
data back. of course you can do atomic writing on your own.

All of that is true, but you don't need hundreds of lines of conceptually
buggy code to achieve that.

edit_file and edit_file_lines would be trivial for you to write with
open/read/write and all the other options. and where is the equivalent of
write_file with do local? gonna need at least an open call and a print
along with error handling and more.

edit_file and edit_file_lines are horribly broken by ignoring encodings
altogether.

over 500 modules and distros use file​::slurp. tell them all to use the
do/local thing. i am sure you will be rewarded with gold and jewels.

in the mean time, instead of saying it uses sysread and can't handle utf
correctly, how about saying how to properly handle it? maybe those 500
modules will like that instead.

Don't use sysread, at least not unless you're absolutely sure it gives the
right answer. It's really that simple.

Leon


p5pRT commented May 15, 2014

From @perhunter

On 05/15/2014 08​:43 AM, Tony Cook wrote​:

On Wed, May 14, 2014 at 10​:57​:41PM -0400, Uri Guttman wrote​:

in the mean time, instead of saying it uses sysread and can't handle
utf correctly, how about saying how to properly handle it? maybe
those 500 modules will like that instead.

Don't use sysread() if there are any layers - since it ignores
everything but the :utf8 flag[1]

I'd tend to fallback to read(), whether that meets your performance
requirements I don't know.

so what about sysread in raw mode and then converting the buffer to the
requested format? that should bypass any sysread issues and then rely on
the utf conversion subs directly.

thanx,

uri

--
Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http​://PerlHunter.com


p5pRT commented May 15, 2014

From @perhunter

On 05/15/2014 08​:54 AM, Leon Timmermans wrote​:

On Thu, May 15, 2014 at 4​:57 AM, Uri Guttman <uri@​stemsystems.com> wrote​:

if you think that is all there is to it, then do the workaround. most
coders think read_file( $filename) is a much better piece of code than the
infamous do/local hack. it isn't just a speed up (and check out the
benchmark script to see how much it does blows away do/local) but also has
a better api, better error handling options, more flexible ways to get the
data back. of course you can do atomic writing on your own.

All of that is true, but you don't need hundreds of lines of conceptually
buggy code to achieve that.

and there aren't many bug reports on those hundreds of lines. in fact
most are utf related. the code is very solid and that isn't just my opinion.

edit_file and edit_file_lines would be trivial for you to write with
open/read/write and all the other options. and where is the equivalent of
write_file with do local? gonna need at least an open call and a print
along with error handling and more.

edit_file and edit_file_lines are horribly broken by ignoring encodings
altogether.

they don't ignore them. they pass on the encoding options to read_file
and write_file.

over 500 modules and distros use file​::slurp. tell them all to use the
do/local thing. i am sure you will be rewarded with gold and jewels.

in the mean time, instead of saying it uses sysread and can't handle utf
correctly, how about saying how to properly handle it? maybe those 500
modules will like that instead.

Don't use sysread, at least not unless you're absolutely sure it gives the
right answer. It's really that simple.

and you didn't answer the question about calling sysread in raw mode and
encoding afterwards. try answering that and being useful here. it isn't
my fault that sysread encoding is broken and i wrote file​::slurp 11
years ago when sysread was fine and dandy.

uri

--
Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http​://PerlHunter.com


p5pRT commented May 15, 2014

From @craigberry

On Thu, May 15, 2014 at 9​:26 AM, Uri Guttman <uri@​stemsystems.com> wrote​:

so what about sysread in raw mode and then converting the buffer to the
requested format? that should bypass any sysread issues and then rely on the
utf conversion subs directly.

The main problem I'm aware of is that there is no guarantee the buffer
ends on a character boundary if the data are in a multi-byte encoding.
The end of the buffer might have a partial character in it and the
rest of that character hasn't been read in yet. You can't convert a
partial character.
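
For what it's worth, a sketch (not from the original mail) of how block-wise
decoding can cope with a character split across buffer boundaries: with
FB_QUIET, Encode::decode stops at the incomplete character and leaves its bytes
behind for the next pass. $fh is assumed to be a handle opened with ':raw', and
real code would still need to tell a trailing partial character apart from
genuinely malformed input.

    use Encode qw( decode FB_QUIET );

    my ($pending, $text) = ('', '');
    while (sysread $fh, my $block, 1 << 16) {
        $pending .= $block;
        # decode() removes what it could decode from $pending and leaves
        # the unprocessed tail (e.g. a partial character) in place.
        $text .= decode('UTF-8', $pending, FB_QUIET);
    }
    die "undecodable bytes left at end of input" if length $pending;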


p5pRT commented May 15, 2014

From @perhunter

On 05/15/2014 10​:46 AM, Craig A. Berry wrote​:

On Thu, May 15, 2014 at 9​:26 AM, Uri Guttman <uri@​stemsystems.com> wrote​:

so what about sysread in raw mode and then converting the buffer to the
requested format? that should bypass any sysread issues and then rely on the
utf conversion subs directly.

The main problem I'm aware of is that there is no guarantee the buffer
ends on a character boundary if the data are in a multi-byte encoding.
The end of the buffer might have a partial character in it and the
rest of that character hasn't been read in yet. You can't convert a
partial character.

the whole idea of slurp is to read in the entire file into a string. any
partial chars at the end are the fault of whatever wrote the file. this
conversion wouldn't be done on a block by block basis (and i see your
point there as it finally is clear why sysread would be broken if used
that way).

so again, i will ask the simple question to which i have not gotten a
proper answer from all you unicode experts. if file​::slurp reads in an
ENTIRE file with raw mode sysread into a single string, and THEN
converts its to the requested format, will that work? given the actual
description of the bug above, i can't see why it won't work. and it will
be an EASY fix (patches welcome as slurp is now on github under perlhunter).

so i will be blunt here, put up or shut up. i am asking for help on this
unicode/utf issue where i am very weak. i know many of you are experts
on perl/utf so what is the best way to do this? what subs do i call to
do the conversion of a full buffer? i can make sysread do raw mode and
delay the binmode option to a later conversion.

thanx,

uri

--
Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http​://PerlHunter.com


p5pRT commented May 15, 2014

From @demerphq

On 15 May 2014 17​:09, Uri Guttman <uri@​stemsystems.com> wrote​:

On 05/15/2014 10​:46 AM, Craig A. Berry wrote​:

On Thu, May 15, 2014 at 9​:26 AM, Uri Guttman <uri@​stemsystems.com> wrote​:

so what about sysread in raw mode and then converting the buffer to the

requested format? that should bypass any sysread issues and then rely on
the
utf conversion subs directly.

The main problem I'm aware of is that there is no guarantee the buffer
ends on a character boundary if the data are in a multi-byte encoding.
The end of the buffer might have a partial character in it and the
rest of that character hasn't been read in yet. You can't convert a
partial character.

the whole idea of slurp is to read in the entire file into a string. any
partial chars at the end are the fault of whatever wrote the file. this
conversion wouldn't be done on a block by block basis (and i see your point
there as it finally is clear why sysread would be broken if used that way).

so again, i will ask the simple question to which i have not gotten a
proper answer from all you unicode experts. if file​::slurp reads in an
ENTIRE file with raw mode sysread into a single string, and THEN converts
its to the requested format, will that work? given the actual description
of the bug above, i can't see why it won't work. and it will be an EASY fix
(patches welcome as slurp is now on github under perlhunter).

so i will be blunt here, put up or shut up. i am asking for help on this
unicode/utf issue where i am very weak. i know many of you are experts on
perl/utf so what is the best way to do this? what subs do i call to do the
conversion of a full buffer? i can make sysread do raw mode and delay the
binmode option to a later conversion.

Reading the full file and then calling Encode​::decode() on it with the
right flags would work fine.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
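
A rough sketch of that approach ($filename is a placeholder): slurp the raw
bytes first, then decode the whole buffer in one go with the strict encoding.

    use Encode qw( decode FB_CROAK );

    open my $fh, '<:raw', $filename or die "open $filename: $!";
    my $bytes = '';
    1 while sysread $fh, $bytes, 1 << 20, length $bytes;  # append 1MB chunks until EOF
    my $text = decode('UTF-8', $bytes, FB_CROAK);         # strict decode of the full buffer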


p5pRT commented May 15, 2014

From @perhunter

On 05/15/2014 11​:31 AM, demerphq wrote​:

On 15 May 2014 17​:09, Uri Guttman <uri@​stemsystems.com> wrote​:

On 05/15/2014 10​:46 AM, Craig A. Berry wrote​:

On Thu, May 15, 2014 at 9​:26 AM, Uri Guttman <uri@​stemsystems.com> wrote​:

so what about sysread in raw mode and then converting the buffer to the

requested format? that should bypass any sysread issues and then rely on
the
utf conversion subs directly.

The main problem I'm aware of is that there is no guarantee the buffer
ends on a character boundary if the data are in a multi-byte encoding.
The end of the buffer might have a partial character in it and the
rest of that character hasn't been read in yet. You can't convert a
partial character.

the whole idea of slurp is to read in the entire file into a string. any
partial chars at the end are the fault of whatever wrote the file. this
conversion wouldn't be done on a block by block basis (and i see your point
there as it finally is clear why sysread would be broken if used that way).

so again, i will ask the simple question to which i have not gotten a
proper answer from all you unicode experts. if file​::slurp reads in an
ENTIRE file with raw mode sysread into a single string, and THEN converts
its to the requested format, will that work? given the actual description
of the bug above, i can't see why it won't work. and it will be an EASY fix
(patches welcome as slurp is now on github under perlhunter).

so i will be blunt here, put up or shut up. i am asking for help on this
unicode/utf issue where i am very weak. i know many of you are experts on
perl/utf so what is the best way to do this? what subs do i call to do the
conversion of a full buffer? i can make sysread do raw mode and delay the
binmode option to a later conversion.

Reading the full file and then calling Encode​::decode() on it with the
right flags would work fine.

yves,

thanks for the direct answer. now the question is should i just pass in
what i get in the bin_mode option to decode or something else? i have
used decode before but i don't recall if it uses the same args as binmode.

thanx,

uri

--
Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http​://PerlHunter.com


p5pRT commented May 15, 2014

From @autarch

On Thu, 15 May 2014, Uri Guttman wrote​:

thanks for the direct answer. now the question is should i just pass in what
i get in the bin_mode option to decode or something else? i have used decode
before but i don't recall if it uses the same args as binmode.

This will only sort of work. With binmode you can set multiple layers and
also do things unrelated to character sets (like gzip encode/decode, line
ending translation, etc.).

Encode can accept anything that would be passed to the "​:encoding(...)"
layer, as well as "​:utf8", but the two don't really correspond.

I suspect a better fix would be to simply _not_ use sysread if binmode is
anything besides "​:utf8" or "​:raw".

-dave

/*============================================================
http​://VegGuide.org http​://blog.urth.org
Your guide to all that's veg House Absolute(ly Pointless)
============================================================*/
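
A hypothetical sketch of that dispatch ($opts, $fh and $buf stand in for
File::Slurp's internals): keep the sysread() fast path only for the modes
sysread() actually honours, and fall back to a buffered read() for everything
else.

    my $mode = $opts->{binmode} // '';
    my $buf;
    if ($mode eq '' || $mode eq ':raw' || $mode eq ':utf8') {
        binmode $fh, $mode if $mode;
        sysread $fh, $buf, -s $fh;   # fast path; sysread honours only the :utf8 flag
    }
    else {
        binmode $fh, $mode or die "binmode: $!";
        read $fh, $buf, -s $fh;      # buffered read honours :encoding(...), :crlf, ...
    }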


p5pRT commented May 15, 2014

From @perhunter

On 05/15/2014 12​:09 PM, Dave Rolsky wrote​:

On Thu, 15 May 2014, Uri Guttman wrote​:

thanks for the direct answer. now the question is should i just pass
in what i get in the bin_mode option to decode or something else? i
have used decode before but i don't recall if it uses the same args as
binmode.

This will only sort of work. With binmode you can set multiple layers
and also do things unrelated to character sets (like gzip encode/decode,
line ending translation, etc.).

Encode can accept anything that would be passed to the "​:encoding(...)"
layer, as well as "​:utf8", but the two don't really correspond.

I suspect a better fix would be to simply _not_ use sysread if binmode
is anything besides "​:utf8" or "​:raw".

what about adding another option that is passed directly to decode?
read() would still fail as files are read in blocks and that would also
trigger the same error if a char was partially read. what i can do is
always do :raw on sysread and call encode when the binmode is something
allowed by decode like utf8 or something. it would ignore it when the
mode is requested to be :raw (could be a binary file). i don't see why i
can't keep sysread if somehow the user can specify how to decode. an
alternative is for callers to always use :raw and then do decode on
their own. that could be added to the pod saying if the new decode
feature doesn't work for them, roll your own.

thanx,

uri

--
Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http​://PerlHunter.com


p5pRT commented May 15, 2014

From @ikegami

On Thu, May 15, 2014 at 11​:09 AM, Uri Guttman <uri@​stemsystems.com> wrote​:

so again, i will ask the simple question to which i have not gotten a
proper answer from all you unicode experts. if file​::slurp reads in an
ENTIRE file with raw mode sysread into a single string, and THEN converts
its to the requested format, will that work?

Yes, but it won't handle layers other than :encoding, and it won't handle
already-opened handles with layers.


p5pRT commented May 15, 2014

From @perhunter

On 05/15/2014 01​:16 PM, Eric Brine wrote​:

On Thu, May 15, 2014 at 11​:09 AM, Uri Guttman <uri@​stemsystems.com> wrote​:

so again, i will ask the simple question to which i have not gotten a
proper answer from all you unicode experts. if file​::slurp reads in an
ENTIRE file with raw mode sysread into a single string, and THEN converts
its to the requested format, will that work?

Yes, but it won't handle layers other than :encoding, and it won't handle
already-opened handles with layers.

that can be documented. most callers of read_file won't use it with
already open files. maybe for sockets and other handles (e.g. <DATA>)
which may not have layers. also with slurping the only common layer i
know is already done which is line splitting (works on all platforms).
so the decoding is the only issue. if you need more complex stuff, then
do it yourself. this is for common basic read in a whole file with utf
decoding as needed. and that is by far the common case and what the
module is intended for. it isn't meant to replace complex i/o, just make
common i/o simpler and faster.

thanx,

uri

--
Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http​://PerlHunter.com


p5pRT commented May 15, 2014

From @Leont

On Thu, May 15, 2014 at 4​:31 PM, Uri Guttman <uri@​stemsystems.com> wrote​:

and there aren't many bug reports on those hundreds of lines. in fact most
are utf related. the code is very solid and that isn't just my opinion.

It's just *completely* broken with encodings, which is a fairly critical
bug. It just happens to get explicit crlf wrong. Its dealing with layers is
dubious at best and flat out wrong more often than it should.

they don't ignore them. they pass on the encoding options to read_file and

write_file.

Sorry, it seems I had misread the docs there.

Don't use sysread, at least not unless you're absolutely sure it gives the

right answer. It's really that simple.

and you didn't answer the question about calling sysread in raw mode and
encoding afterwards. try answering that and being useful here. it isn't my
fault that sysread encoding is broken and i wrote file​::slurp 11 years ago
when sysread was fine and dandy.

Yes, you can, but why would you go through these complications? What makes
you think you'd be faster/better/whatever than

sub read_file($filename, $encoding = 'utf8') {
  open my $fh, '<​:encoding($encoding)', $filename or die "Can't open
$filename​: $!";
  return do { local $/; <$fh> }
}

It's not that this can't be made to work, it's that a 100+ lines of code
doing tricky things for optimizations that can be achieved using
significantly less code, and you're proposing to add more kludges to this.

Leon
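
For reference, a self-contained version of that sketch (not a drop-in
File::Slurp replacement): it needs the then-experimental signatures feature,
and the layer string has to be double-quoted so $encoding actually
interpolates.

    use strict;
    use warnings;
    use feature 'signatures';
    no warnings 'experimental::signatures';

    sub read_file ($filename, $encoding = 'UTF-8') {
        open my $fh, "<:encoding($encoding)", $filename
            or die "Can't open $filename: $!";
        local $/;              # slurp mode: read everything at once
        return scalar <$fh>;
    }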


p5pRT commented May 15, 2014

From @perhunter

On 05/15/2014 02​:10 PM, Leon Timmermans wrote​:

On Thu, May 15, 2014 at 4​:31 PM, Uri Guttman <uri@​stemsystems.com> wrote​:

and there aren't many bug reports on those hundreds of lines. in fact most
are utf related. the code is very solid and that isn't just my opinion.

It's just *completely* broken with encodings, which is a fairly critical
bug. It just happens to get explicit crlf wrong. Its dealing with layers is
dubious at best and flat out wrong more often than it should.

and why would you need to explicitly pass in crlf when it can do line
splitting already? and as i said, layers aren't a critical part of the
concept. if you want complex i/o do it yourself. slurping is mostly for
common text files (think config, source, etc.) which don't need anything
but a possible decode. this isn't line by line or some other layered thing.

they don't ignore them. they pass on the encoding options to read_file and

write_file.

Sorry, it seems I had misread the docs there.

i am not that dumb as to not pass in those options. the idea of those is
to use read_file/write_file as they are working. eating my own dog food.

Don't use sysread, at least not unless you're absolutely sure it gives the

right answer. It's really that simple.

and you didn't answer the question about calling sysread in raw mode and
encoding afterwards. try answering that and being useful here. it isn't my
fault that sysread encoding is broken and i wrote file​::slurp 11 years ago
when sysread was fine and dandy.

Yes, you can, but why would you go through these complications? What makes
you think you'd be faster/better/whatever than

sub read_file($filename, $encoding = 'utf8') {
open my $fh, '<​:encoding($encoding)', $filename or die "Can't open
$filename​: $!";
return do { local $/; <$fh> }
}

so do that your self. and what about all the users out there who don't
need that and want a simple call? you can say the same thing for so many
modules which may have a simple workaround in 6 lines of code but do a
whole lot more. so stop your rant on this because slurp is used all over
and it isn't going away. don't use it if you don't want to. helping to
fix the decode problem is useful as other here have done.

It's not that this can't be made to work, it's that a 100+ lines of code
doing tricky things for optimizations that can be achieved using
significantly less code, and you're proposing to add more kludges to this.

tricky things for optimizations? you still haven't really read the code.
most of the 'tricky' code is handling options, line splitting on
winblows (that is a bad special case), and such. very little other than
looping over a LARGE sysread call of 1mb is done for optimization. and
run the benchmark already. see the difference in MANY styles of slurping
in files. choose a file size, etc. the do/local way is almost the
slowest possible way and FUGLY as all hell. i would never allow someone
to write that code when i review it. even a sub wrapper on that is
better but then you might as well use file​::slurp. and yes, i review
code for all my work, both consulting and placement so i have some
experience on what good code looks like.

uri

--
Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http​://PerlHunter.com


p5pRT commented May 15, 2014

From @ikegami

On Thu, May 15, 2014 at 2​:04 PM, Uri Guttman <uri@​stemsystems.com> wrote​:

also with slurping the only common layer i know is already done which is
line splitting (works on all platforms). so the decoding is the only issue

:crlf is also very common.


p5pRT commented May 15, 2014

From @perhunter

On 05/15/2014 04​:00 PM, Eric Brine wrote​:

On Thu, May 15, 2014 at 2​:04 PM, Uri Guttman <uri@​stemsystems.com> wrote​:

also with slurping the only common layer i know is already done which is
line splitting (works on all platforms). so the decoding is the only issue

:crlf is also very common.

from what i see in perldoc perlio it just converts crlf to \n on
platforms that need it. read_file has already done that for a long long
time. this is true for both files slurped into a scalar and into an
array of lines. so that isn't anything i need to be concerned about with
layers and decoding.

thanx,

uri

--
Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http​://PerlHunter.com


p5pRT commented May 15, 2014

From @Leont

On Thu, May 15, 2014 at 10​:00 PM, Eric Brine <ikegami@​adaelis.com> wrote​:

On Thu, May 15, 2014 at 2​:04 PM, Uri Guttman <uri@​stemsystems.com> wrote​:

also with slurping the only common layer i know is already done which is
line splitting (works on all platforms). so the decoding is the only issue

:crlf is also very common.

«binmode => '​:crlf'» completely broken on File​::Slurp. On Unix it's a
no-op, on Windows it will turn crlf translation *off*. Because​:

${$buf_ref} =~ s/\015\012/\n/g if $is_win32 && !$opts->{'binmode'} ;

Emulating PerlIO layers really is a deep rabbit hole.

Leon


p5pRT commented May 15, 2014

From @perhunter

On 05/15/2014 04​:25 PM, Leon Timmermans wrote​:

On Thu, May 15, 2014 at 10​:00 PM, Eric Brine <ikegami@​adaelis.com> wrote​:

On Thu, May 15, 2014 at 2​:04 PM, Uri Guttman <uri@​stemsystems.com> wrote​:

also with slurping the only common layer i know is already done which is
line splitting (works on all platforms). so the decoding is the only issue

:crlf is also very common.

«binmode => '​:crlf'» completely broken on File​::Slurp. On Unix it's a
no-op, on Windows it will turn crlf translation *off*. Because​:

${$buf_ref} =~ s/\015\012/\n/g if $is_win32 && !$opts->{'binmode'} ;

Emulating PerlIO layers really is a deep rabbit hole.

so you don't pass in :crlf to slurp. is that so hard to understand?? it
isn't NEEDED. and it is emulated just fine and it works. no reports on
that feature in a long time. try to find something constructive to say
or just keep quiet on this. i get it. you don't like the module. good
for you. doesn't mean you have to keep ripping it with unhelpful
comments and such. crlf is handled - period. no need to deal with the
:crlf layer. next?

uri

--
Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http​://PerlHunter.com


p5pRT commented May 15, 2014

From @davidnicol

Just to be difficult, has anyone benchmarked

  sub read_file($) { `/bin/cat \Q$_[0]\E` } # or "type" on windows

as an alternative, recently?


p5pRT commented May 16, 2014

From @ikegami

On Thu, May 15, 2014 at 4​:43 PM, Uri Guttman <uri@​stemsystems.com> wrote​:

crlf is handled - period.

Yes, but you were talking of breaking it by providing an implementation
that only supported :encoding. I simply reminded you that you also needed
to support :crlf too.

i get it. you don't like the module. good for you. doesn't mean you have to

keep ripping it with unhelpful comments and such.

Yeah, we hate it so much we just had to try to help you fix it???


p5pRT commented May 16, 2014

From @ikegami

On Thu, May 15, 2014 at 4​:50 PM, David Nicol <davidnicol@​gmail.com> wrote​:

Just to be difficult, has anyone benchmarked

sub read_file($) { `/bin/cat \Q$_[0]\E` } # or "type" on windows

as an alternative, recently?

Backticks read the same way C<read> does (8K at a time), so it's slower
than using C<read> to read from a file. (There's quite a number of other
issues too, in general and as a File​::Slurp implementation).

use Benchmark qw( cmpthese );

sub _read {
  my $buf;
  open(my $fh, '<', $ARGV[0]) or die $!;
  my $sz = -s $fh;
  read($fh, $buf, $sz);
}

sub _sysread {
  my $buf;
  open(my $fh, '<', $ARGV[0]) or die $!;
  my $sz = -s $fh;
  sysread($fh, $buf, $sz);
}

cmpthese (-5, {
  qx => sub { my $buf = `/bin/cat \Q$ARGV[0]\E`; 1 },
  read => \&_read,
  sysread => \&_read,
});

             Rate    qx sysread  read
qx         59.7/s    --    -87%  -87%
sysread     451/s  655%      --   -1%
read        455/s  662%      1%    --

1.5MB file


p5pRT commented May 16, 2014

From @perhunter

On 05/16/2014 12​:33 AM, Eric Brine wrote​:

On Thu, May 15, 2014 at 4​:43 PM, Uri Guttman <uri@​stemsystems.com> wrote​:

crlf is handled - period.

Yes, but you were talking of breaking it by providing an implementation
that only supported :encoding. I simply reminded you that you also needed
to support :crlf too.

how does it need to be supported more than it is now? it can be ignored
if passed in as crlf conversion always happens now on winblows.

i get it. you don't like the module. good for you. doesn't mean you have to

keep ripping it with unhelpful comments and such.

Yeah, we hate it so much we just had to try to help you fix it???

i was referring to leon, not you.

thanx,

uri

--
Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http​://PerlHunter.com


p5pRT commented May 16, 2014

From @perhunter

On 05/16/2014 12​:46 AM, Eric Brine wrote​:

On Thu, May 15, 2014 at 4​:50 PM, David Nicol <davidnicol@​gmail.com> wrote​:

Just to be difficult, has anyone benchmarked

sub read_file($) { `/bin/cat \Q$_[0]\E` } # or "type" on windows

as an alternative, recently?

Backticks read the same way C<read> does (8K at a time), so it's slower
than using C<read> to read from a file. (There's quite a number of other
issues too, in general and as a File​::Slurp implementation).

use Benchmark qw( cmpthese );

sub _read {
my $buf;
open(my $fh, '<', $ARGV[0]) or die $!;
my $sz = -s $fh;
read($fh, $buf, $sz);
}

sub _sysread {
my $buf;
open(my $fh, '<', $ARGV[0]) or die $!;
my $sz = -s $fh;
sysread($fh, $buf, $sz);
}

cmpthese (-5, {
qx => sub { my $buf = `/bin/cat \Q$ARGV[0]\E`; 1 },
read => \&_read,
sysread => \&_read,
});

is that a mistake with both read and sysread calling _read?

             Rate    qx sysread  read
qx         59.7/s    --    -87%  -87%
sysread     451/s  655%      --   -1%
read        455/s  662%      1%    --

those last two are basically the same result as they call the same code.
sysread should be much faster than read. file​::slurp has a good
benchmark script in extras which can easily be added to. it has many
variations of slurping already in there and you can set the file size
and other things.

thanx,

uri

--
Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http​://PerlHunter.com


p5pRT commented May 16, 2014

From @Smylers

Leon Timmermans writes​:

There is no reason «my $foo = do { local $/; <$fh> };» would need to
be any slower than what File​::Slurp is doing right now. In fact I'm
planning to fix just that in 5.21.

That's great to hear, Leon. As well as speeding up people using that
idiom directly, it'll also help users of Path​::Class and Path​::Tiny,
both of which have -> slurp methods which are implemented using it.

Uri Guttman writes​:

On 05/15/2014 02​:10 PM, Leon Timmermans wrote​:

sub read_file($filename, $encoding = 'utf8') {
open my $fh, '<​:encoding($encoding)', $filename or die "Can't open
$filename​: $!";
return do { local $/; <$fh> }
}

so do that your self. and what about all the users out there who don't
need that and want a simple call?

Indeed, File​::Slurp provides a convenient interface for many users, who
will wish to use it for that reason regardless of any efficiency
concerns.

I interpreted Leon's suggestion above not as something users should do
_instead_ of using read_file, but as a potential _implementation_ of
read_file in File​::Slurp (or at least the nub of such).

Once using <$fh> has been sped up sufficiently, one possibility would be
for File::Slurp to switch to using that, thereby simplifying its
implementation and avoiding any of the issues with using sysread.

Best wishes

Smylers
--
Why drug companies shouldn't be allowed to hide research results that they
don't like​: http​://gu.com/p/3zat8 — UK gov wasted millions on Tamiflu
Sign the AllTrials petition​: http​://www.alltrials.net/


p5pRT commented May 16, 2014

From @Smylers

I just wrote​:

As well as speeding up people using that idiom directly, ...

Erm, I meant speeding up their code, not actually speeding up the
people.

Smylers
--
Why drug companies shouldn't be allowed to hide research results that they
don't like​: http​://gu.com/p/3zat8 — UK gov wasted millions on Tamiflu
Sign the AllTrials petition​: http​://www.alltrials.net/


p5pRT commented May 16, 2014

From @Leont

On Thu, May 15, 2014 at 10​:43 PM, Uri Guttman <uri@​stemsystems.com> wrote​:

so you don't pass in :crlf to slurp. is that so hard to understand?? it
isn't NEEDED. and it is emulated just fine and it works. no reports on that
feature in a long time.

You need it to (portably) open text files created on Windows. That is not
an obscure use-case.

try to find something constructive to say or just keep quiet on this. i
get it. you don't like the module. good for you. doesn't mean you have to
keep ripping it with unhelpful comments and such. crlf is handled - period.
no need to deal with the :crlf layer. next?

How is pointing out bugs "unhelpful comments"? How is offering solutions
(like not using sysread when that's not appropriate) not "constructive"?

Leon


p5pRT commented May 16, 2014

From @ikegami

On Fri, May 16, 2014 at 1​:38 AM, Uri Guttman <uri@​stemsystems.com> wrote​:

is that a mistake with both read and sysread calling _read?

Of course. Was super duper tired. Fixed​:

             Rate    qx  read sysread
qx         62.3/s    --  -86%    -91%
read        430/s  591%    --    -40%
sysread     718/s 1052%   67%      --


p5pRT commented May 16, 2014

From @perhunter

On 05/16/2014 07​:38 AM, Leon Timmermans wrote​:

On Thu, May 15, 2014 at 10​:43 PM, Uri Guttman <uri@​stemsystems.com> wrote​:

so you don't pass in :crlf to slurp. is that so hard to understand?? it
isn't NEEDED. and it is emulated just fine and it works. no reports on that
feature in a long time.

You need it to (portably) open text files created on Windows. That is not
an obscure use-case.

so it is a trivial fix to check for :crlf, do the same line conversion
as now and not set the actual layer. this isn't rocket surgery.

uri

--
Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http​://PerlHunter.com


p5pRT commented May 16, 2014

From @perhunter

On 05/16/2014 08​:12 AM, Eric Brine wrote​:

On Fri, May 16, 2014 at 1​:38 AM, Uri Guttman <uri@​stemsystems.com> wrote​:

is that a mistake with both read and sysread calling _read?

Of course. Was super duper tired. Fixed​:

             Rate    qx  read sysread
qx         62.3/s    --  -86%    -91%
read        430/s  591%    --    -40%
sysread     718/s 1052%   67%      --

more like what i have seen. that is a lot of speed to make up. i don't
see how do/local or read can get to sysread speed given the i/o layers
overhead. but as i said, even with a speedup, the slurp api is cleaner.
and replacing the implementation isn't an issue as i said most of the
code is handling options, errors and such. little code other than
sysread and large read sizes are optimizations. i even found that just
the overhead of option and error checking and handling is noticeable in
benchmarks. a plain sysread in the benchmark script beats out read_file
for just that reason. i keep telling you all to look at and run the
benchmark.

also the most common homebrew code i have seen is more like this​:

  $text = join( '', <$fh> ) ;

the use of $/ is not well known so advocating it is not a good solution
IMNSHO.

thanx,

uri

--
Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http​://PerlHunter.com


p5pRT commented May 16, 2014

From @Leont

On Fri, May 16, 2014 at 4​:29 PM, Uri Guttman <uri@​stemsystems.com> wrote​:

more like what i have seen. that is a lot of speed to make up. i don't see
how do/local or read can get to sysread speed given the i/o layers overhead.

It's not so much layer overhead per se, but buffering overhead in a
specific case that doesn't need buffering. Turning it off will remarkably
increase the performance. Something along the lines of​:

sub read_file ($filename, $layer = '​:unix') {
  open my $fh, "<​:$layer", $filename or die "Couldn't open $filename​: $!";
  read $fh, my $buffer, -s $fh or die "Couldn't read $filename​: $!";
  return $buffer;
}

is just as fast as the code in File​::Slurp (the read itself is the
bottleneck, there isn't much left to optimize).

As soon as you do anything that uses a buffer, this performance benefit
goes away. This includes crlf translation, line based IO and decoding.

Leon


p5pRT commented May 16, 2014

From @jkeenan

On Wed May 14 02​:11​:21 2014, mcgrath.martin@​gmail.com wrote​:

Apologies if this has already been discussed. I don't subscribe to p5p
but have searched the list before posting, and didn't find anything
relating to this issue.

This patch replaces the example/recommendation of using File​::Slurp
with Path​::Tiny in perlfaq5.

The reasoning behind this is an outstanding critical bug with
File​::Slurp​:

https://rt.cpan.org/Public/Bug/Display.html?id=83126

Discussion in this RT appears to have shifted to how best to fix File​::Slurp. File​::Slurp is not a core module, so discussion of how to improve it would normally have been conducted in its own bug queue (presumably, https://rt.cpan.org//Dist/Display.html?Queue=File-Slurp).  I realize that shutting off discussion in p5p at this point would be Canutian, but can I at least pose these questions​:

Do we want to modify perlfaq5 in the manner suggested by the OP two days ago?

Do we want to modify perlfaq5 at all?

If we do, does this go into 5.20?

My two cents​:

File​::Slurp is a well-known library. I can recall lightning talks Uri gave about it at YAPCs a decade ago. Whatever problems it has can be hashed out, preferably in its own bug queue and don't warrant its deletion from perlfaq5. "Mend it, don't end it." Perhaps some discussion of Path​::Tiny's slurping capabilities ought to be added to perlfaq5. And, no, this does not go into 5.20.

Thank you very much.
Jim Keenan


p5pRT commented May 16, 2014

From @rjbs

* James E Keenan via RT <perlbug-followup@​perl.org> [2014-05-16T17​:56​:32]

If we do, does this go into 5.20?

No.

--
rjbs


p5pRT commented May 17, 2014

From @xdg

On Fri, May 16, 2014 at 10​:29 AM, Uri Guttman <uri@​stemsystems.com> wrote​:

more like what i have seen. that is a lot of speed to make up. i don't see
how do/local or read can get to sysread speed given the i/o layers overhead.

FWIW, if you open the file with :unix, then you don't get :perlio and
then you don't have the buffering. At that point, read() on the file
size (for ordinary files), does sysread looping in C -- i.e. adding up
bytes read in chunks in C and not doing byte arithmetic and logic
using perl ops.

In my benchmarking, that was as fast or faster in some cases. I've
tried to quickly reconstruct that here​:

Benchmark​: https://gist.github.com/dagolden/c4dad291cd0b7156f268

Source​: https://gist.github.com/dagolden/973bcfa11816e2d96c65

(Small was 50k, medium was 500k, large was 5000k)

I'm not claiming it's an entirely fair benchmark, but it does show a
way to get even faster performance in some circumstances.

Path​::Tiny has a bunch of method call overhead and abstraction (plus
locking) that the others don't have, which is why it's a bit slower.

David
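
A minimal sketch of the variant described above (not taken from the gists;
$filename is a placeholder): with the :unix layer there is no PerlIO buffering,
and a single read() loops over the underlying reads in C.

    open my $fh, '<:unix', $filename or die "open $filename: $!";
    my $size = -s $fh;
    read $fh, my $buf, $size;
    die "short read on $filename" unless length($buf) == $size;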


p5pRT commented May 17, 2014

From @perhunter

On 05/16/2014 10​:53 PM, David Golden wrote​:

On Fri, May 16, 2014 at 10​:29 AM, Uri Guttman <uri@​stemsystems.com> wrote​:

more like what i have seen. that is a lot of speed to make up. i don't see
how do/local or read can get to sysread speed given the i/o layers overhead.

FWIW, if you open the file with :unix, then you don't get :perlio and
then you don't have the buffering. At that point, read() on the file
size (for ordinary files), does sysread looping in C -- i.e. adding up
bytes read in chunks in C and not doing byte arithmetic and logic
using perl ops.

and why would that be better than calling sysread directly? given the
large read size of 1mb (most text files are way below that) you only get
one loop and little overhead. i benchmarked a variation which does only
one read on smaller files but the overhead of checking that was seen
(compared to a bare sysread with no extra code).

In my benchmarking, that was as fast or faster in some cases. I've
tried to quickly reconstruct that here​:

Benchmark​: https://gist.github.com/dagolden/c4dad291cd0b7156f268

Source​: https://gist.github.com/dagolden/973bcfa11816e2d96c65

(Small was 50k, medium was 500k, large was 5000k)

I'm not claiming it's an entirely fair benchmark, but it does show a
way to get even faster performance in some circumstances.

Path​::Tiny has a bunch of method call overhead and abstraction (plus
locking) that the others don't have, which is why it's a bit slower.

you can easily add an entry into the benchmark that come with
file​::slurp. i browsed the code of path​::tiny today (not a tiny module!
:). if you want to talk about some ideas back and forth, i am open to it.

thanx,

uri

--
Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http​://PerlHunter.com


p5pRT commented May 18, 2014

From @xdg

On Sat, May 17, 2014 at 12​:14 AM, Uri Guttman <uri@​stemsystems.com> wrote​:

FWIW, if you open the file with :unix, then you don't get :perlio and
then you don't have the buffering. At that point, read() on the file
size (for ordinary files), does sysread looping in C -- i.e. adding up
bytes read in chunks in C and not doing byte arithmetic and logic
using perl ops.

and why would that be better than calling sysread directly?

In the real world, probably not much. In a tight slurp loop benchmark
case, perl's read() function (in the :unix layer at least) does almost
exactly what File​::Slurp is doing with sysread(), except in C instead
of in Perl, so avoids overhead carrying out things like loops,
comparison, and addition via Perl's op-tree, pads and stack.

you can easily add an entry into the benchmark that come with file​::slurp. i
browsed the code of path​::tiny today (not a tiny module! :). if you want to
talk about some ideas back and forth, i am open to it.

I'll put that on my round tuit list. :-)

David

--
David Golden <xdg@​xdg.me> Twitter/IRC​: @​xdg


p5pRT commented May 20, 2014

From @druud62

On 2014-05-15 02​:47, Dr.Ruud wrote​:

On 2014-05-14 17​:30, Leon Timmermans wrote​:

On Wed, May 14, 2014 at 5​:14 PM, Uri Guttman <uri@​stemsystems.com
<mailto​:uri@​stemsystems.com>> wrote​:

and how would you go about fixing it in file::slurp? i am not a
unicode/utf guru by any stretch (i say i am an ascii bigot! :).
would just doing a raw sysread all the time and then converting the
buffer as directed by binmode work? any other suggestions for how to
read utf files with ease and speed?

File​::Slurp is nothing more than a workaround for a performance bug in
perl. There is no reason «my $foo = do { local $/; <$fh> };» would need
to be any slower than what File​::Slurp is doing right now. In fact I'm
planning to fix just that in 5.21.

Not sure if this is still true for the latest Perls, but you better
write it as​:

my $foo; { local $/; $foo = <$fh> }

or it will use twice the memory.

Apparently fixed in 5.19.

--
Ruud


p5pRT commented Oct 8, 2014

From @tonycoz

On Wed May 14 02​:11​:21 2014, mcgrath.martin@​gmail.com wrote​:

This patch replaces the example/recommendation of using File​::Slurp
with Path​::Tiny in perlfaq5.

The reasoning behind this is an outstanding critical bug with
File​::Slurp​:

https://rt.cpan.org/Public/Bug/Display.html?id=83126

perlfaq is upstream CPAN, issues in it should be reported at​:

https://github.com/perl-doc-cats/perlfaq/issues

though I suggest you point them at this discussion.

Discussion of the bug in File​::Slurp belongs in the File​::Slurp ticket​:

https://rt.cpan.org/Ticket/Display.html?id=83126

Closing this ticket.

Tony


p5pRT commented Oct 8, 2014

@tonycoz - Status changed from 'open' to 'resolved'
