
Setting $/ to read fixed records can corrupt valid UTF-8 input #10865

Closed
p5pRT opened this issue Nov 29, 2010 · 49 comments

Comments


p5pRT commented Nov 29, 2010

Migrated from rt.perl.org#79960 (status was 'resolved')

Searchable as RT79960$


p5pRT commented Nov 29, 2010

From @nwc10

Created by @nwc10

It's possible to get the perl interpreter to have corrupt internal state on
a valid UTF-8 input stream, by setting $/ to cause fixed-length reads.

[Command-line -C7 sets UTF-8 on STD{IN,OUT,ERR}, and $/ = \4096 sets reads to
a fixed size of 4096]

$ ./perl -C7 -e 'print "\x{20AC}" x 1366' | ./perl -C7 -e '$/ = \4096; $_ = <>; printf "%s\n", length $_'
Malformed UTF-8 character (unexpected end of string) in length at -e line 1, <> chunk 1.
1365

Note that unlike other concerns with the utf8 layer not trapping *in*valid
input, this bug is for *valid* input.

Clearer to see is:

$ ./perl -C7 -e 'print "\x{20AC}"' | ./perl -C7 -e '$/ = \2; $_ = <>; printf "%s\n", length $_'
Malformed UTF-8 character (unexpected end of string) in length at -e line 1, <> chunk 1.
0
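The failure mechanics are language-independent: U+20AC occupies three octets in UTF-8, so a two-octet record necessarily splits it. A quick sketch of this (in Python rather than Perl, purely for illustration):

```python
# U+20AC (the euro sign) is three octets in UTF-8.
data = "\u20ac".encode("utf-8")
assert data == b"\xe2\x82\xac"

# A fixed 2-octet record slices the character mid-sequence:
# a start byte plus one continuation byte, nothing more.
try:
    data[:2].decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)  # unexpected end of data
```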

The input is truncated at 2 octets:

$ ./perl -C7 -e 'print "\x{20AC}"' | ./perl -C7 -Ilib -MDevel::Peek -e '$/ = \2; $_ = <>; Dump $_'
SV = PV(0xa1e090) at 0xa40f50
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0xa3b3e0 "\342\202"\0 [UTF8 "\x{2080}"]
  CUR = 2
  LEN = 80

The dump should look like this:

$ ./perl -C7 -Ilib -MDevel::Peek -e 'Dump "\x{20AC}"'
SV = PV(0xa1e2a0) at 0xa33098
  REFCNT = 1
  FLAGS = (POK,READONLY,pPOK,UTF8)
  PV = 0xa3aca0 "\342\202\254"\0 [UTF8 "\x{20ac}"]
  CUR = 3
  LEN = 16

Curiously, there also seems to be a range-checking error in the dump code, as a
truncated pound sign causes a lot more grief:

$ ./perl -C7 -e 'print "\x{A3}"' | ./perl -Ilib -MDevel::Peek -C7 -we '$/ = \1; $_ = <>; Dump $_'
utf8 "\xC2" does not map to Unicode at -e line 1, <> chunk 1.
SV = PV(0xa1e090) at 0xa40f50
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0xa3b3e0 "\302"\0Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xc2) in subroutine entry at -e line 1, <> chunk 1.
[UTF8 "\x{0}"]
  CUR = 1
  LEN = 80

The relevant code for this problem is in S_sv_gets_read_record().
[I refactored it out of Perl_sv_gets() earlier today]

It's not immediately obvious to me what the correct solution is.

On the one hand, the user asked for a fixed record length, and on VMS we use
a record based file API, so we could try to honour that either by

a: refusing to read on UTF-8 file handles. (make it croak)
b: throwing an error if the read results in a truncated UTF-8 sequence
   (make it croak *some* of the time)

Or we could try to do what read and sysread do, and treat the length parameter
as characters, so that on a UTF-8 flagged handle we loop until we read in
sufficient characters. But that blows the idea of "record based" completely
on a UTF-8 handle.
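The read/sysread-style option amounts to looping on the byte stream until the decoder has completed the requested number of characters. A rough sketch of that loop (in Python with a hypothetical helper name, not the actual S_sv_gets_read_record logic):

```python
import codecs
import io

def read_record_chars(stream, nchars):
    """Loop until `nchars` whole characters have been decoded;
    a trailing partial UTF-8 sequence stays buffered in the decoder."""
    dec = codecs.getincrementaldecoder("utf-8")()
    out = []
    while len(out) < nchars:
        b = stream.read(1)   # byte-at-a-time only for clarity
        if not b:
            break            # EOF: return what we have
        out.extend(dec.decode(b))
    return "".join(out)

stream = io.BytesIO(("\u20ac" * 3).encode("utf-8"))
print(len(read_record_chars(stream, 2)))  # 2 -- never a torn character
```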

Nicholas Clark

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.13.7:

Configured by nick at Mon Nov 29 15:00:15 GMT 2010.

Summary of my perl5 (revision 5 version 13 subversion 7) configuration:
  Derived from: 0f93bb20132f1d122993dac5d6e249240a28646e
  Platform:
    osname=linux, osvers=2.6.35.4, archname=x86_64-linux
    uname='linux eris 2.6.35.4 #4 smp tue sep 21 09:54:22 bst 2010 x86_64 gnulinux '
    config_args='-Dusedevel=y -Dcc=ccache gcc -Dld=gcc -Ubincompat5005 -Uinstallusrbinperl -Dcf_email=nick@ccl4.org -Dperladmin=nick@ccl4.org -Dinc_version_list=  -Dinc_version_list_init=0 -Doptimize=-g -Uusethreads -Uuselongdouble -Uuse64bitall -Uusemymalloc -Duseperlio -Dprefix=~/Sandpit/snap5.9.x-v5.13.7-190-g0f93bb20 -Uusevendorprefix -Uvendorprefix=~/Sandpit/snap5.9.x-v5.13.7-190-g0f93bb20 -Dinstallman1dir=none -Dinstallman3dir=none -Uuserelocatableinc -Umad -Accccflags-DPURIFY -de'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='ccache gcc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-g',
    cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.3.2', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
    libs=-lnsl -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.7.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.7'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -g -L/usr/local/lib -fstack-protector'

Locally applied patches:
    


@INC for perl 5.13.7:
    lib
    /home/nick/Sandpit/snap5.9.x-v5.13.7-190-g0f93bb20/lib/perl5/site_perl/5.13.7/x86_64-linux
    /home/nick/Sandpit/snap5.9.x-v5.13.7-190-g0f93bb20/lib/perl5/site_perl/5.13.7
    /home/nick/Sandpit/snap5.9.x-v5.13.7-190-g0f93bb20/lib/perl5/5.13.7/x86_64-linux
    /home/nick/Sandpit/snap5.9.x-v5.13.7-190-g0f93bb20/lib/perl5/5.13.7
    .


Environment for perl 5.13.7:
    HOME=/home/nick
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/nick/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/usr/local/sbin:/sbin:/usr/sbin
    PERL_BADLANG (unset)
    SHELL=/bin/bash


p5pRT commented Sep 28, 2011

From @Hugmeir

On Mon Nov 29 08:05:05 2010, nicholas wrote:

[quote of the original bug report omitted; it duplicates the report above]

I'd say make it croak, and maybe add a "consider using sysread() or
binmode() instead"-like entry in perldiag. I guess it could lead to a bit
of action-at-a-distance, but getting broken UTF-8 is basically never
right.


p5pRT commented Sep 28, 2011

The RT System itself - Status changed from 'new' to 'open'


p5pRT commented Sep 29, 2011

From PeterCMartini@GMail.com

On Sep 28, 2011, at 12:23, "Brian Fraser via RT" <perlbug-followup@perl.org> wrote:

I'd say make it croak, and maybe add a "consider using sysread() or
binmode() instead"-like entry in perldiag. I guess it could lead to a bit
of action-at-a-distance, but getting broken UTF-8 is basically never
right.

It's certainly never right, but it happens (not every writing process knows how to truncate UTF-8 properly). I like the idea of croaking immediately and pointing to binmode, hopefully with an additional note about how to recover from byte-wise rather than character-wise truncations.

I've always thought of $/ as a tool for reading fixed-length packed records anyway, potentially with mixed binary and text, and utf8 mode would never be appropriate for that.


p5pRT commented Oct 2, 2011

From @cpansprout

On Mon Nov 29 08:05:05 2010, nicholas wrote:

It's not immediately obvious to me what the correct solution is.

On the one hand, the user asked for a fixed record length, and on VMS we use
a record based file API, so we could try to honour that either by

a: refusing to read on UTF-8 file handles. (make it croak)
b: throwing an error if the read results in a truncated UTF-8 sequence
   (make it croak *some* of the time)

I choose b: Someone may be using this feature on space-padded (or
null-padded) UTF8 fields of fixed byte lengths. We shouldn’t break that.


p5pRT commented Oct 3, 2011

From @ikegami

On Mon, Nov 29, 2010 at 11:05 AM, Nicholas Clark
<perlbug-followup@perl.org> wrote:

Or we could try to do what read and sysread do, and treat the length parameter
as characters, so that on a UTF-8 flagged handle we loop until we read in
sufficient characters. But that blows the idea of "record based" completely
on a UTF-8 handle.

It seems that the implication is that (a) or (b) would somehow allow
records over UTF-8 handles. Let's call behaving like read "(c)".

--

Scenario 1: Let's say a record consists of two 5 byte fields of UTF-8 text,
and that :utf8 is being used on the handle. Let's say one of the records is
C3 A9 20 20 20 41 42 43 20 20.

a) Croaks.
b) Produces "é ABC " which cannot be parsed into two fields
c) One byte too many is returned.

(b) and (c) aren't useful.

--

Scenario 2: Let's say a record consists of two 5 character fields, and that
:utf8 is being used on the handle. Let's say one of the records is C3 A9 20
20 20 20 41 42 43 20 20.

a) Croaks.
b) One byte too few is returned.
c) Produces the fields when passed through C<< decode "UTF-8", unpack "A5A5"
$_ >>.

(c) is useful, but (b) isn't.

--

So:
(b) is never useful with :utf8 handles
(c) is sometimes useful with :utf8 handles

Since (b) and (c) behave the same on binary handles, it seems to me that (c)
is clearly superior to (b).

Between (a) and (c), I prefer (c) since I don't see enough justification to
artificially restrict what the user can do by giving them (a). I have
expected $/ to behave like read in the past.

- Eric


p5pRT commented Oct 3, 2011

From @ikegami

c) Produces the fields when passed through C<< decode "UTF-8", unpack
"A5A5" $_ >>.

That should be C<< map decode "UTF-8", unpack "A5A5", $_ >>.
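For Scenario 1's byte-measured fields, the unpack-then-decode step can be sketched outside Perl too; here in Python, with struct's "5s5s" standing in for A5A5 plus an explicit strip of the space padding (which A5A5 does itself):

```python
import struct

# Scenario 1's record: "é" + 3 spaces of padding, then "ABC" + 2 spaces.
record = bytes.fromhex("c3a9202020" "4142432020")

# Split into two 5-byte fields first, then decode each field separately.
fields = [f.rstrip(b" ").decode("utf-8")
          for f in struct.unpack("5s5s", record)]
print(fields)  # ['é', 'ABC']
```

This only works because the fields are measured in bytes and the record was read raw; decoding the whole record before splitting would put the field boundary at the wrong offset.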


p5pRT commented Oct 23, 2011

From @cpansprout

On Sun Oct 02 22:24:46 2011, ikegami@adaelis.com wrote:

On Mon, Nov 29, 2010 at 11:05 AM, Nicholas Clark
<perlbug-followup@perl.org> wrote:

Or we could try to do what read and sysread do, and treat the length
parameter
as characters, so that on a UTF-8 flagged handle we loop until we
read in
sufficient characters. But that blows the idea of "record based"
completely
on a UTF-8 handle.

It seems that the implication is that (a) or (b) would somehow allow
records
over UTF-8 handles. Let's name working like read "(c)".

--

Scenario 1​: Let's say a record consists of two 5 byte fields of UTF-8
text,
and that :utf8 is being used on the handle. Let's say one of the
records is
C3 A9 20 20 20 41 42 43 20 20.

a) Croaks.
b) Produces "é ABC " which cannot be parsed into two fields
c) One byte too many is returned.

(b) and (c) aren't useful.

--

Scenario 2​: Let's say a record consists of two 5 character fields, and
that
:utf8 is being used on the handle. Let's say one of the records is C3
A9 20
20 20 20 41 42 43 20 20.

a) Croaks.
b) One byte too few is returned.
c) Produces the fields when passed through C<< decode "UTF-8", unpack
"A5A5"
$_ >>.

(c) is useful, but (b) isn't.

--

So​:
(b) is never useful with :utf8 handles

It can be useful if the records are all single fields.

(c) is sometimes useful with :utf8 handles

Since (b) and (c) behave the same on binary handles, it seems to me
that (c)
is clearly superior to (b).

Between (a) and (c), I prefer (c) since I don't see enough
justification to
artificially restrict what the user can do by giving them (a). I have
expected $/ to behave like read in the past.

- Eric


p5pRT commented Oct 24, 2011

From @ikegami

On Sun, Oct 23, 2011 at 4:56 PM, Father Chrysostomos via RT
<perlbug-followup@perl.org> wrote:

On Sun Oct 02 22:24:46 2011, ikegami@adaelis.com wrote:

So​:
(b) is never useful with :utf8 handles

It can be useful if the records are all single fields.

(c) is sometimes useful with :utf8 handles

ok, so we have:

a: Refusing to read on UTF-8 file handles by croaking.
b: Treat the record length as bytes. Croak on truncated UTF-8 results.
c: Treat the record length as characters like read and sysread.

handle          a                     b                      c
------------    --------------------  ---------------------  --------------------
bytes,          ok: Returns record    ok: Returns record     ok: Returns record
N fields/rec    measured in bytes     measured in bytes      measured in bytes

utf8,           ok: Croaks            ok: Returns "record"   ok: Returns record
1 field/rec                           measured in bytes      measured in chars

utf8,           ok: Croaks            FAIL: Record can't be  ok: Returns record
N fields/rec                          split into fields      measured in chars

- Only (c) can handle both records measured in bytes and records measured in chars.
- Only (c) is consistent with read and sysread.
- (a) and (c) are more self-consistent than (b): One either deals with bytes or chars, not both at the same time.

But:

- Only (b) is backwards compatible with existing behaviour (although the behaviour isn't exactly documented).

- Eric


p5pRT commented Feb 22, 2012

From @davidnicol

On Mon, Oct 24, 2011 at 1:54 PM, Eric Brine <ikegami@adaelis.com> wrote:

ok, so we have:

a: Refusing to read on UTF-8 file handles by croaking.
b: Treat the record length as bytes. Croak on truncated UTF-8 results.
c: Treat the record length as characters like read and sysread.

- Only (c) can handle both records measured in bytes and records measured in chars.
- Only (c) is consistent with read and sysread.

Why would anyone possibly want fixed-length records in chars? Because
they're packing a Twitter archive? If this use case really exists,
there should be a way to turn it on instead -- maybe by setting $/ to
a reference to an array of integers which will be interpreted as field
lengths in characters and cycled through.
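That cycled-lengths interface doesn't exist in perl, but the idea is easy to prototype in user code; a sketch (in Python for illustration, reading from an already-decoded text stream, all names hypothetical):

```python
import io
from itertools import cycle

def read_fields(text_stream, lengths):
    """Yield fields whose lengths, counted in characters, cycle."""
    for n in cycle(lengths):
        field = text_stream.read(n)  # read() on a text stream counts chars
        if not field:
            return                   # EOF; a short final field is yielded
        yield field

stream = io.StringIO("\u20ac\u20ac12345\u20acX1234")
print(list(read_fields(stream, [2, 5])))  # note the short last field
```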

- (a) and (c) are more self-consistent than (b): One either deals with bytes or chars, not both at the same time.

But:

- Only (b) is backwards compatible with existing behaviour (although the behaviour isn't exactly documented).

- Eric

d: b, but there is a way to turn off the croaking, and when it has
been turned off, the invalid segments get downgraded to bytes. ("no
strict utf8" perhaps?) The mechanism for making that adjustment is
named in the croak. Would (d) support the forward-looking case of
migrating a working legacy system handling packed records to a new
environment where the input streams are all chars instead, or a future
perl where -C7 is the default, with a minimum of maintenance? Would it
be better in that situation to require byte mode on the file handle
in question? If so, that's

e: a, plus also croak earlier, at compile time if possible, by doing
flow analysis: whenever the parser notices that $/ is going to get set
to a reference... hmm, that isn't practical.


p5pRT commented Feb 24, 2012

From @ikegami

On Tue, Feb 21, 2012 at 11:33 PM, David Nicol <davidnicol@gmail.com> wrote:


d​: b, but there is a way to turn off the croaking, and when it has
been turned off, the invalid segments get downgraded to bytes. ("no
strict utf8" perhaps?)

If I read that correctly, (d) would mean that

no strict "utf8"; # or whatever
open(my $fh, '<:encoding(UTF-16be)', \"\x26\x60\x26\x60\x26\x60");
sysread($fh, my $buf, $i);

does

when ($i==1) { $buf = "\xE2"; }
when ($i==2) { $buf = "\xE2\x99"; }
when ($i==3) { $buf = "\x{2660}"; }
when ($i==4) { $buf = "\x{2660}\xE2"; }
when ($i==5) { $buf = "\xE2\x99\xA0\xE2\x99"; }
when ($i==6) { $buf = "\x{2660}\x{2660}"; }

I don't see how returning bytes that don't even exist in the file is of any
use.

- Eric


p5pRT commented Mar 1, 2012

From @khwilliamson

I can't find my proposal in the record of this ticket, nor anyone
responding to it. The documentation says that $/ gives the *maximum*
record size. So why not return as many whole characters as will fit in
$/ bytes?
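Karl's whole-characters-that-fit behaviour can be modelled with an incremental decoder: hand it each fixed-size byte record, keep only the characters it completes, and carry any partial tail into the next read. A sketch of the idea (Python for illustration, not the proposed perl implementation):

```python
import codecs

dec = codecs.getincrementaldecoder("utf-8")()

# Two 4-byte "records" over a stream of 3-byte euro signs.
rec1 = b"\xe2\x82\xac\xe2"  # one whole char, then one partial byte
rec2 = b"\x82\xac\xe2\x82"  # tail completes; the next char is partial

print(len(dec.decode(rec1)))  # 1 -- the lone \xe2 stays buffered
print(len(dec.decode(rec2)))  # 1 -- buffered tail joins the new bytes
```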

I think we need to do something on this for 5.16. At the minimum, we
could emit a warning when a variable length encoded file is opened under
a fixed-length $/.

If even that isn't acceptable, we could add this to the
intend-to-deprecate section in perldelta.


p5pRT commented Mar 1, 2012

From @nwc10

On Thu, Mar 01, 2012 at 10:13:15AM -0800, Karl Williamson via RT wrote:

I can't find my proposal in the record of this ticket, nor anyone
responding to it. The documentation says that $/ gives the *maximum*
record size. So why not return as many whole characters as will fit in
$/ bytes?

Specifically, the code is emulated on "everything else", but intended to
do something real and useful on VMS:

#ifdef VMS
  /* VMS wants read instead of fread, because fread doesn't respect */
  /* RMS record boundaries. This is not necessarily a good thing to be */
  /* doing, but we've got no other real choice - except avoid stdio
  as implementation - perhaps write a :vms layer ?
  */
  fd = PerlIO_fileno(fp);
  if (fd != -1) {
  bytesread = PerlLIO_read(fd, buffer, recsize);
  }
  else /* in-memory file from PerlIO::Scalar */
#endif

perlvar.pod says:

  On VMS, record reads are done with the equivalent of C<sysread>,
  so it's best not to mix record and non-record reads on the same
  file. (This is unlikely to be a problem, because any file you'd
  want to read in record mode is probably unusable in line mode.)
  Non-VMS systems do normal I/O, so it's safe to mix record and
  non-record reads of a file.

I think we need to do something on this for 5.16. At the minimum, we
could emit a warning when a variable length encoded file is opened under
a fixed-length $/.

If even that isn't acceptable, we could add this to the
intend-to-deprecate section in perldelta.

So I'd like to know, if a programmer on VMS sets $/ to read records, but on
a file handle marked with :utf8, what do they want?

(and if the answer is "their head examining", that's actually useful, as it
means that the least insane thing to implement is what we get)

Nicholas Clark


p5pRT commented Mar 1, 2012

From @ikegami

On Thu, Mar 1, 2012 at 1:13 PM, Karl Williamson via RT
<perlbug-followup@perl.org> wrote:

I can't find my proposal in the record of this ticket, nor anyone
responding to it. The documentation says that $/ gives the *maximum*
record size. So why not return as many whole characters as will fit in
$/ bytes?

Sounds like "can't make everyone happy, so make everyone unhappy".

It doesn't help those who want $/ to control the number of elements read by
the leftmost layer.

It doesn't help those who want $/ to control the number of elements read by
the rightmost layer.


p5pRT commented Mar 1, 2012

From @ikegami

On Thu, Mar 1, 2012 at 2:22 PM, Eric Brine <ikegami@adaelis.com> wrote:

On Thu, Mar 1, 2012 at 1:13 PM, Karl Williamson via RT
<perlbug-followup@perl.org> wrote:

I can't find my proposal in the record of this ticket, nor anyone
responding to it. The documentation says that $/ gives the *maximum*
record size. So why not return as many whole characters as will fit in
$/ bytes?

Sounds like "can't make everyone happy, so make everyone unhappy".

It doesn't help those who want $/ to control the number of elements read
by the leftmost layer.

It doesn't help those who want $/ to control the number of elements read
by the rightmost layer.

That last "read" should say "returned".

- Eric


p5pRT commented Mar 1, 2012

From @craigberry

On Mar 1, 2012, at 12:30 PM, Nicholas Clark wrote:

On Thu, Mar 01, 2012 at 10:13:15AM -0800, Karl Williamson via RT wrote:

I can't find my proposal in the record of this ticket, nor anyone
responding to it. The documentation says that $/ gives the *maximum*
record size. So why not return as many whole characters as will fit in
$/ bytes?

I think that would require making :utf8 into its own layer with its own buffer, which has been discussed over in [perl #100058].

Specifically, the code is emulated on "everything else", but intended to
do something real and useful on VMS​:

#ifdef VMS
/* VMS wants read instead of fread, because fread doesn't respect */
/* RMS record boundaries. This is not necessarily a good thing to be */
/* doing, but we've got no other real choice - except avoid stdio
as implementation - perhaps write a :vms layer ?
*/
fd = PerlIO_fileno(fp);
if (fd != -1) {
bytesread = PerlLIO_read(fd, buffer, recsize);
}
else /* in-memory file from PerlIO​::Scalar */
#endif

I don't think this code is as meaningful as it used to be since unix I/O is the bottom layer for PerlIO now. Which means that PerlLIO_read and PerlIO_read (differing only by the "L") are really the same thing, i.e., both boil down to read(). I guess we can't simplify this code until and unless using stdio as the bottom layer is truly deprecated and expunged.

perlvar.pod says​:

On VMS, record reads are done with the equivalent of C<sysread>,
so it's best not to mix record and non-record reads on the same
file. (This is unlikely to be a problem, because any file you'd
want to read in record mode is probably unusable in line mode.)
Non-VMS systems do normal I/O, so it's safe to mix record and
non-record reads of a file.

I think we need to do something on this for 5.16. At the minimum, we
could emit a warning when a variable length encoded file is opened under
a fixed-length $/.

If even that isn't acceptable, we could add this to the
intend-to-deprecate section in perldelta.

So I'd like to know, if a programmer on VMS sets $/ to read records, but on
a file handle marked with :utf8, what do they want?

(and if the answer is "their head examining", that's actually useful, as it
means that the least insane thing to implement is what we get)

Yes, it's pretty daft to expect whole, varying-width characters to stay whole when you can only get a fixed-width chunk at a time and the chunks are measured in bytes. So far the only difference for VMS that I've thought of derives from this note in the CRTL help entry on read()​:

  The read function does not span record boundaries in a
  record file and, therefore, reads at most one record. A
  separate read must be done for each record.

So that means that if you set $/ to N on a record-oriented file and N is larger than the record size, you won't get as much as you asked for and you may chop varying-width characters in pieces around the record boundaries. Trying to overload the meaning of $/ so that N means number of characters rather than number of bytes obviously could not make it give you more bytes than the record holds.

While it might be less of a corner case and more of a mainstream thing to do on VMS, I can't think of any way that this is substantively different from what would happen on any OS when reading through a pipe or a socket or a PerlIO layer or /dev/mumble that has a fixed-sized buffer measured in bytes. What happens on Unix when you have a pipe buffer that is 8192 bytes and you set $/ to 8193 and read a record containing UTF-8 data through the pipe?

________________________________________
Craig A. Berry
mailto:craigberry@mac.com

"... getting out of a sonnet is much more
difficult than getting in."
  Brad Leithauser


p5pRT commented Mar 2, 2012

From @sciurius

Nicholas Clark <nick@ccl4.org> writes:

So I'd like to know, if a programmer on VMS sets $/ to read records, but on
a file handle marked with :utf8, what do they want?

(and if the answer is "their head examining", that's actually useful, as it
means that the least insane thing to implement is what we get)

In VMS, and its predecessor RSX, the purpose of a file with fixed width
records has always been that you can easily and efficiently retrieve a
specific record at all times. These files were often referred to as
'random access files'. Random access files typically didn't have record
terminators.

In flat file systems, you can find record NNN by seeking to position
NNN*width, and seeking loses its efficiency when done on characters.

-- Johan
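The random-access property Johan describes is exactly this byte arithmetic; a minimal sketch (Python for illustration, with an assumed 8-byte record layout):

```python
import io

WIDTH = 8  # bytes per record; an assumed fixed-width layout

# A fake "random access file" of 100 fixed-width records.
f = io.BytesIO(b"".join(b"rec%04d " % i for i in range(100)))

def get_record(f, n, width=WIDTH):
    f.seek(n * width)   # O(1) jump -- only possible when width is in bytes
    return f.read(width)

print(get_record(f, 42))  # b'rec0042 '
```

With variable-width characters the offset of record NNN depends on every record before it, which is why seeking "loses its efficiency when done on characters".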


p5pRT commented Mar 2, 2012

From @ikegami

On Thu, Mar 1, 2012 at 6:17 PM, Craig A. Berry <craigberry@mac.com> wrote:

What happens on Unix when you have a pipe buffer that is 8192 bytes and
you set $/ to 8193 and read a record containing UTF-8 data through the pipe?

Perl requests 8K (formerly 4K) chunks until it has received enough. It
requests 8K even if it only needs 1 byte.


p5pRT commented Mar 2, 2012

From @craigberry

On Mar 2, 2012, at 3:07 AM, Eric Brine wrote:

On Thu, Mar 1, 2012 at 6:17 PM, Craig A. Berry <craigberry@mac.com> wrote:
What happens on Unix when you have a pipe buffer that is 8192 bytes and you set $/ to 8193 and read a record containing UTF-8 data through the pipe?

Perl requests 8K (formerly 4K) chunks until it has received enough. It requests 8K even if it only needs 1 byte.

I think you're thinking of the PerlIO buffer that I increased from 4K to the larger of 8K and BUFSIZ in 5.14, and which only applies to the perlio layer. But S_sv_gets_read_record calls PerlIO_read, which just retrieves the base layer (formerly stdio, currently unix) and calls its Read method, which is just read(). So there is no buffering under Perl's control.

I was thinking of a situation where something external to Perl limits how much data you can get in one read and thus gives you less than the full amount requested by $/. I'm pretty sure you'll get mangled UTF-8 if you happen to be mid-character when you hit the end of the device buffer. To test this, you'd need to know something about the internals of your system's pipe implementation (or other device with a fixed buffer).



p5pRT commented Mar 2, 2012

From @nwc10

On Thu, Mar 01, 2012 at 05​:17​:27PM -0600, Craig A. Berry wrote​:

On Mar 1, 2012, at 12​:30 PM, Nicholas Clark wrote​:

On Thu, Mar 01, 2012 at 10​:13​:15AM -0800, Karl Williamson via RT wrote​:

I can't find my proposal in the record of this ticket, nor anyone
responding to it. The documentation says that $/ gives the *maximum*
record size. So why not return as many whole characters as will fit in
$/ bytes?

I think that would require making :utf8 into its own layer with its own buffer, which has been discussed over in [perl #100058].

Specifically, the code is emulated on "everything else", but intended to
do something real and useful on VMS​:

#ifdef VMS
    /* VMS wants read instead of fread, because fread doesn't respect */
    /* RMS record boundaries. This is not necessarily a good thing to be */
    /* doing, but we've got no other real choice - except avoid stdio
       as implementation - perhaps write a :vms layer ?
     */
    fd = PerlIO_fileno(fp);
    if (fd != -1) {
        bytesread = PerlLIO_read(fd, buffer, recsize);
    }
    else /* in-memory file from PerlIO::Scalar */
#endif

I don't think this code is as meaningful as it used to be since unix I/O is the bottom layer for PerlIO now. Which means that PerlLIO_read and PerlIO_read (differing only by the "L") are really the same thing, i.e., both boil down to read(). I guess we can't simplify this code until and unless using stdio as the bottom layer is truly deprecated and expunged.

I don't think you're correct on that one. read() is not stdio. It's (at least
on Unix) a syscall. fread() is stdio, and loops on read() until it gets enough
octets. So the code for VMS (if I'm following it correctly) is still grabbing
a single record.
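That fread()-style looping (keep calling the underlying read until the requested count arrives, EOF, or an error) has roughly this shape, sketched here in Perl with sysread standing in for the raw read() call (names are illustrative):

```perl
use strict;
use warnings;

# Loop until $want bytes arrive, EOF, or a real error - the same
# shape as stdio's fread() looping over read().
sub read_fully {
    my ($fh, $want) = @_;
    my $buf = '';
    while (length($buf) < $want) {
        my $got = sysread $fh, $buf, $want - length($buf), length($buf);
        die "read error: $!" unless defined $got;
        last if $got == 0;    # EOF: return a short result
    }
    return $buf;
}

open my $fh, '<:raw', \"hello world" or die $!;
print read_fully($fh, 5), "\n";    # hello
```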

On Fri, Mar 02, 2012 at 08​:11​:10AM -0600, Craig A. Berry wrote​:

On Mar 2, 2012, at 3​:07 AM, Eric Brine wrote​:

On Thu, Mar 1, 2012 at 6​:17 PM, Craig A. Berry <craigberry@​mac.com> wrote​:
What happens on Unix when you have a pipe buffer that is 8192 bytes and you set $/ to 8193 and read a record containing UTF-8 data through the pipe?

You mean set $/ to \8193

Perl requests 8K (formerly 4K) chunks until it has received enough. It requests 8K even if it only needs 1 byte.

I think you're thinking of the PerlIO buffer that I increased from 4K to the larger of 8K and BUFSIZ in 5.14, and which only applies to the perlio layer. But S_sv_gets_read_record calls PerlIO_read, which just retrieves the base layer (formerly stdio, currently unix) and calls its Read method, which is just read(). So there is no buffering under Perl's control.

I was thinking of a situation where something external to Perl limits how much data you can get in one read and thus gives you less than the full amount requested by $/. I'm pretty sure you'll get mangled UTF-8 if you happen to be mid-character when you hit the end of the device buffer. To test this, you'd need to know something about the internals of your system's pipe implementation (or other device with a fixed buffer).

I don't think that discussing this in terms of what non-VMS does with $/ set
to a reference to an integer is necessarily that useful. I think it's really
only been added as "this feature can't be VMS only" (all added in commit

http://perl5.git.perl.org/perl.git/commitdiff/5b2b9c687790241e85aa7b76

)

In that, the whole bug report is about "what *should* this do?" because what
it currently does is badly broken.

The reason I'm specifically asking "what does a VMS programmer *want*?" is
because the fixed size records feature was put in for VMS, with non-VMS an
afterthought. So

1) is there a sane VMS native interpretation of "UTF-8 coming from a fixed
  record file" ?

and only when that's answered is there

2) what do we fake on other platforms?

[and I think it's also premature to consider whether this needs :utf8 as a
real layer to implement. I'd like to get a feeling for what the Perl space
behaviour, if any, should be]

The possibly useful analogy is "what happens with a :utf8 layer on sysread?"
which is, well, summed up with​:

  goto more_bytes;

ie - it's actually a different behaviour. It makes multiple syscalls. Blech.

[and, thinking about it now, about 14 years later, possibly that non-VMS
code in sv_gets() should have been using read(), not fread(), so that it
would be useful on a datagram socket. But that's a bit late to fix]

Nicholas Clark


p5pRT commented Mar 2, 2012

From @craigberry

On Mar 2, 2012, at 1​:51 AM, Johan Vromans wrote​:

Nicholas Clark <nick@​ccl4.org> writes​:

So I'd like to know, if a programmer on VMS sets $/ to read records, but on
a file handle marked with :utf8, what do they want?

(and if the answer is "their head examining", that's actually useful, as it
means that the least insane thing to implement is what we get)

In VMS, and its predecessor RSX, the purpose of a file with fixed width
records has always been that you can easily and efficiently retrieve a
specific record at all times. These files were often referred to as
'random access files'. Random access files typically didn't have record
terminators.

In flat files systems, you can find record NNN by seeking to position
NNN*width, and seeking loses its efficiency when done on characters.

Not really relevant to the discussion of $/ and UTF-8, but to be pedantically correct, relative access to files with fixed-length records is only one of several random access methods available on VMS. The others don't require the records to be fixed length.



p5pRT commented Mar 2, 2012

From @craigberry

On Mar 2, 2012, at 9​:15 AM, Nicholas Clark wrote​:

On Thu, Mar 01, 2012 at 05​:17​:27PM -0600, Craig A. Berry wrote​:

On Mar 1, 2012, at 12​:30 PM, Nicholas Clark wrote​:

Specifically, the code is emulated on "everything else", but intended to
do something real and useful on VMS​:

#ifdef VMS
    /* VMS wants read instead of fread, because fread doesn't respect */
    /* RMS record boundaries. This is not necessarily a good thing to be */
    /* doing, but we've got no other real choice - except avoid stdio
       as implementation - perhaps write a :vms layer ?
     */
    fd = PerlIO_fileno(fp);
    if (fd != -1) {
        bytesread = PerlLIO_read(fd, buffer, recsize);
    }
    else /* in-memory file from PerlIO::Scalar */
#endif

I don't think this code is as meaningful as it used to be since unix I/O is the bottom layer for PerlIO now. Which means that PerlLIO_read and PerlIO_read (differing only by the "L") are really the same thing, i.e., both boil down to read(). I guess we can't simplify this code until and unless using stdio as the bottom layer is truly deprecated and expunged.

I don't think you're correct on that one. read() is not stdio. It's (at least
on Unix) a syscall. fread() is stdio, and loops on read() until it gets enough
octets.

Yes, I know read() is not stdio and that fread() is. That was my point. Unless I'm really missing something, there is no fread() involved anymore since unix is the bottom PerlIO layer. The comment in the code about avoiding fread() on VMS was only relevant when stdio was the bottom layer, and I believe it may have been the only layer at all when that comment was written.

PerlIO_read (from the branch of the else that you snipped)​:

    {
        bytesread = PerlIO_read(fp, buffer, recsize);
    }

used to be a wrapper around fread() when stdio was the bottom layer, but is now a wrapper around read(). Which means that both branches of the if do exactly the same thing​: call read(). Which means we could get rid of the VMS-specific code in S_sv_gets_read_record but *only* if we were willing to say that stdio can't ever be the bottom layer (as opposed to no longer being the default bottom layer).

So the code for VMS (if I'm following it correctly) is still grabbing
a single record.

Yes, calling read() grabs a single record. My point was just that unless we configure with -UUSEPERLIO, we'll currently get read() from both branches of that if.

The reason I'm specifically asking "what does a VMS programmer *want*?" is
because the fixed size records feature was put in for VMS, with non-VMS an
afterthought. So

1) is there a sane VMS native interpretation of "UTF-8 coming from a fixed
record file" ?

No. Over in [perl #100058] I started but never sent a response to David Nicol's question that may be relevant here​:

On Fri, Oct 14, 2011 at 3​:11 PM, David Nicol <davidnicol@​gmail.com> wrote​:

Is anyone here actually shoehorning UTF8 into fixed-length records, using
any system besides Perl to do it?

I work with record-oriented files, fixed-length and variable-length, almost every day and I have done so off and on for many years. My experience is certainly not comprehensive and my memory may be faulty, but there is no scenario I can remember or imagine where any character interpretation at all (even ASCII) would be imposed on a fixed-length record. The record may very well contain structured data, and some of the fields in it may contain character data. Interpreting that data has nothing to do with processing the records and vice versa.

and only when that's answered is there

2) what do we fake on other platforms?

Yeah, now for the hard part. I'm not much of a language designer and not much of a Unicode wonk, but my feeling is that reading a specific number of characters in one go when the characters are of variable size is a problem that is utterly different from and unrelated to dealing with fixed-length records. It is a third way of defining a record that is as different from defining it by length or delimiter as those two ways are from each other.

I understand that something must be done to fix the mayhem that results when imposing :utf8 on a byte stream that may get truncated mid-character. I don't know the best way to do that, but I don't think pretending that the byte stream is not a byte stream makes sense.

[and I think it's also premature to consider whether this needs :utf8 as a
real layer to implement. I'd like to get a feeling for what the Perl space
behaviour, if any, should be]

The possibly useful analogy is "what happens with a :utf8 layer on sysread?"
which is, well, summed up with​:

    goto more_bytes;

ie - it's actually a different behaviour. It makes multiple syscalls. Blech.

[and, thinking about it now, about 14 years later, possibly that non-VMS
code in sv_gets() should have been using read(), not fread(), so that it
would be useful on a datagram socket. But that's a bit late to fix]

Again, I'm almost positive we switched that fread() to read() when we switched the default bottom layer from stdio to unix.



p5pRT commented Mar 2, 2012

From @ikegami

On Fri, Mar 2, 2012 at 9​:11 AM, Craig A. Berry <craigberry@​mac.com> wrote​:

On Mar 2, 2012, at 3​:07 AM, Eric Brine wrote​:

On Thu, Mar 1, 2012 at 6​:17 PM, Craig A. Berry <craigberry@​mac.com>
wrote​:
What happens on Unix when you have a pipe buffer that is 8192 bytes and
you set $/ to 8193 and read a record containing UTF-8 data through the pipe?

Perl requests 8K (formerly 4K) chunks until it has received enough. It
requests 8K even if it only needs 1 byte.

I think you're thinking of the PerlIO buffer that I increased from 4K to
the larger of 8K and BUFSIZ in 5.14,

I'm not "thinking" anything. I'm reporting what Perl does as seen by
C<strace>.

and which only applies to the perlio layer.

Yes, I'm only reporting what happens on a standard unix build. But isn't
that what you asked?

I was thinking of a situation where something external to Perl limits how

much data you can get in one read and thus gives you less than the full
amount requested by $/.

That's exactly the situation I described. Here, let me provide the strace
output.

$ strace perl -e'$/=\40; <>;' < /dev/random
...
read(0, "\5|\200\"\360T0*\325\223\276\322\20S\244\16\341", 8192) = 17
read(0, "\370\356 \2652\236\27>", 8192) = 8
read(0, "\0\270\ve\332\223\225\312", 8192) = 8
read(0, "\316\366\272\311\215.\204\361", 8192) = 8
...

I'm pretty sure you'll get mangled UTF-8 if you happen to be
mid-character when you hit the end of the device buffer.

No, because Perl will just ask for more. You'll get mangled UTF-8 if you
happen to request a number of bytes that ends you mid-character (which is
what this ticket is about).

(If we were talking about sysread instead of readline or read, then yes, it
could happen then. Unlike read and readline, sysread returns as soon as
bytes are available.)

- Eric


p5pRT commented Mar 3, 2012

From @craigberry

On Mar 2, 2012, at 11​:10 AM, Craig A. Berry wrote​:

On Mar 2, 2012, at 9​:15 AM, Nicholas Clark wrote​:

On Thu, Mar 01, 2012 at 05​:17​:27PM -0600, Craig A. Berry wrote​:

On Mar 1, 2012, at 12​:30 PM, Nicholas Clark wrote​:

Specifically, the code is emulated on "everything else", but intended to
do something real and useful on VMS​:

#ifdef VMS
    /* VMS wants read instead of fread, because fread doesn't respect */
    /* RMS record boundaries. This is not necessarily a good thing to be */
    /* doing, but we've got no other real choice - except avoid stdio
       as implementation - perhaps write a :vms layer ?
     */
    fd = PerlIO_fileno(fp);
    if (fd != -1) {
        bytesread = PerlLIO_read(fd, buffer, recsize);
    }
    else /* in-memory file from PerlIO::Scalar */
#endif

I don't think this code is as meaningful as it used to be since unix I/O is the bottom layer for PerlIO now. Which means that PerlLIO_read and PerlIO_read (differing only by the "L") are really the same thing, i.e., both boil down to read(). I guess we can't simplify this code until and unless using stdio as the bottom layer is truly deprecated and expunged.

I don't think you're correct on that one. read() is not stdio. It's (at least
on Unix) a syscall. fread() is stdio, and loops on read() until it gets enough
octets.

Yes, I know read() is not stdio and that fread() is. That was my point. Unless I'm really missing something, there is no fread() involved anymore since unix is the bottom PerlIO layer.

Sigh. This was only about half right. It's true that there is no fread() involved anymore, and read() sits at the bottom in all cases, but PerlIO_read, as we currently do on non-VMS, does buffering, whereas PerlLIO_read (which in the non-threads case is actually just a macro defined as read) does not.

On non-VMS the call stack when using $/ = \N looks like​:

read syscall
PerlIOUnix_read
Perl_PerlIO_read (unix layer)
PerlIOBuf_fill
Perl_PerlIO_fill
PerlIOBase_read
PerlIOBuf_read
Perl_PerlIO_read (perlio layer)
S_sv_gets_read_record
Perl_sv_gets
Perl_do_readline
Perl_pp_readline
Perl_runops_debug
S_run_body
perl_run

whereas on VMS, the call stack looks like​:

read syscall, aka PerlLIO_read
S_sv_gets_read_record
Perl_sv_gets
Perl_do_readline
Perl_pp_readline
Perl_runops_debug
S_run_body
perl_run



p5pRT commented Mar 3, 2012

From @ikegami

On Fri, Mar 2, 2012 at 2​:03 PM, Eric Brine <ikegami@​adaelis.com> wrote​:

On Fri, Mar 2, 2012 at 9​:11 AM, Craig A. Berry <craigberry@​mac.com> wrote​:

I was thinking of a situation where something external to Perl limits how

much data you can get in one read and thus gives you less than the full
amount requested by $/.

That's exactly the situation I described. Here, let me provide the strace
output.

$ strace perl -e'$/=\40; <>;' < /dev/random
...
read(0, "\5|\200\"\360T0*\325\223\276\322\20S\244\16\341", 8192) = 17
read(0, "\370\356 \2652\236\27>", 8192) = 8
read(0, "\0\270\ve\332\223\225\312", 8192) = 8
read(0, "\316\366\272\311\215.\204\361", 8192) = 8
...

I'm pretty sure you'll get mangled UTF-8 if you happen to be
mid-character when you hit the end of the device buffer.

No, because Perl will just ask for more. You'll get mangled UTF-8 if you
happen to request a number of bytes that ends you mid-character (which is
what this ticket is about).

(If we were talking about sysread instead of readline or read, then yes,
it could happen then. Unlike read and readline, sysread returns as soon as
bytes are available.)

And here's an example where one character is read using two reads​:

$ perl -C -e'print "a"x8191, chr(0x2660)' > x

$ ls -l x
-rw------- 1 ikegami group 8194 Mar 2 23​:26 x

$ perl -le'use open ":std", ":utf8"; $/=\8194; $_=<>; print $_ eq
("a"x8191).chr(0x2660) ?1:0;' < x
1

strace​:

read(0, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 8192) = 8192
read(0, "\231\240", 8192) = 2


p5pRT commented Mar 3, 2012

From @demerphq

On 1 March 2012 20​:22, Eric Brine <ikegami@​adaelis.com> wrote​:

On Thu, Mar 1, 2012 at 1​:13 PM, Karl Williamson via RT
<perlbug-followup@​perl.org> wrote​:

I can't find my proposal in the record of this ticket, nor anyone
responding to it.  The documentation says that $/ gives the *maximum*
record size.  So why not return as many whole characters as will fit in
$/ bytes?

Sounds like "can't make everyone happy, so make everyone unhappy".

It doesn't help those who want $/ to control the number of elements read by
the leftmost layer.

It doesn't help those who want $/ to control the number of elements read by
the rightmost layer.

I think that controlling the leftmost layer, iow, bytes read from
source, is the most important point and the reason that someone would
use fixed length records. I think that we should read the requested
bytes, and if the last bytes form an unterminated utf8 sequence we
should drop those bytes and replace them with zeros and warn that we
have done so, possibly under strict we should die if we do so.

I mean, maybe the person doing the reading has a supposedly null
padded utf8 sequence, and something was too long and got truncated. I
could see that with telecoms flat files.
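A byte-level sketch of that drop-and-zero idea (the function name is invented; this is only the proposal's shape, not anything Perl implements):

```perl
use strict;
use warnings;

# If the buffer ends part-way through a UTF-8 sequence, overwrite the
# partial sequence with NULs so the record keeps its byte length.
sub trim_partial_utf8 {
    my ($buf) = @_;
    my $len = length $buf;

    # Walk back over trailing continuation bytes (0b10xxxxxx).
    my $i = $len;
    $i-- while $i > 0 && (ord(substr $buf, $i - 1, 1) & 0xC0) == 0x80;
    return $buf if $i == 0;    # nothing but continuation bytes; give up

    # How long should the sequence introduced by the lead byte be?
    my $lead = ord substr $buf, $i - 1, 1;
    my $need = $lead >= 0xF0 ? 4 : $lead >= 0xE0 ? 3 : $lead >= 0xC0 ? 2 : 1;

    return $buf if $len - $i + 1 >= $need;            # sequence is complete
    substr($buf, $i - 1) = "\0" x ($len - $i + 1);    # zero the partial char
    return $buf;
}

# "ab" followed by the first two bytes of a three-byte character:
printf "%v02X\n", trim_partial_utf8("ab\xE2\x82");    # 61.62.00.00
```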

yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"


p5pRT commented Mar 3, 2012

From @craigberry

On Mar 3, 2012, at 1​:29 AM, Eric Brine wrote​:

On Fri, Mar 2, 2012 at 2​:03 PM, Eric Brine <ikegami@​adaelis.com> wrote​:
On Fri, Mar 2, 2012 at 9​:11 AM, Craig A. Berry <craigberry@​mac.com> wrote​:
I was thinking of a situation where something external to Perl limits how much data you can get in one read and thus gives you less than the full amount requested by $/.

That's exactly the situation I described. Here, let me provide the strace output.

$ strace perl -e'$/=\40; <>;' < /dev/random
...
read(0, "\5|\200\"\360T0*\325\223\276\322\20S\244\16\341", 8192) = 17
read(0, "\370\356 \2652\236\27>", 8192) = 8
read(0, "\0\270\ve\332\223\225\312", 8192) = 8
read(0, "\316\366\272\311\215.\204\361", 8192) = 8
...

I'm pretty sure you'll get mangled UTF-8 if you happen to be mid-character when you hit the end of the device buffer.

No, because Perl will just ask for more. You'll get mangled UTF-8 if you happen to request a number of bytes that ends you mid-character (which is what this ticket is about).

(If we were talking about sysread instead of readline or read, then yes, it could happen then. Unlike read and readline, sysread returns as soon as bytes are available.)

And here's an example where one character is read using two reads​:

$ perl -C -e'print "a"x8191, chr(0x2660)' > x

$ ls -l x
-rw------- 1 ikegami group 8194 Mar 2 23​:26 x

$ perl -le'use open ":std", ":utf8"; $/=\8194; $_=<>; print $_ eq ("a"x8191).chr(0x2660) ?1:0;' < x
1

strace​:

read(0, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 8192) = 8192
read(0, "\231\240", 8192) = 2

Thanks for clarifying my muddy thinking, Eric. I was neglecting the effects of the buffering layer because it's not used for record mode on VMS and I had erroneously convinced myself that it's not used elsewhere either, but it is. As long as the perlio buffer is larger than the requested record size, it looks like it will insulate you from anything external to Perl giving you less than the requested size.

So does your second example demonstrate that if you request something larger than the perlio buffer, then you can get caught mid-character on buffer boundaries as well as record boundaries? And does that first 8192-byte chunk get loaded into an SV that is then invalid if its UTF-8 flag is on?



p5pRT commented Mar 5, 2012

From @ikegami

On Sat, Mar 3, 2012 at 6​:56 PM, Craig A. Berry <craigberry@​mac.com> wrote​:

So does your second example demonstrate that if you request something
larger than the perlio buffer,

then you can get caught mid-character on buffer boundaries as well as

record boundaries?

I'm not sure what "caught" means here.

It demonstrates that if a character is split across a chunk boundary (which
is inevitable), that there is no problem.

And does that first 8192-byte chunk get loaded into an SV that is then

invalid if its UTF-8 flag is on?

I don't know the implementation, so I don't know if it's possible for the
SV to have its UTF-8 flag on (or if the buffer is an SV at all).

I think the real question is​: What happens if the second read fails?

- Eric


p5pRT commented Mar 6, 2012

From @tonycoz

On Sun, Oct 02, 2011 at 01​:40​:51PM -0700, Father Chrysostomos via RT wrote​:

On Mon Nov 29 08​:05​:05 2010, nicholas wrote​:

It's not immediately obvious to me what the correct solution is.

On the one hand, the user asked for a fixed record length, and on VMS
we use
a record based file API, so we could try to honour that either by

a​: refusing to read on UTF-8 file handles. (make it croak)
b​: throwing an error if the read results in a truncated UTF-8 sequence
(make it croak *some* of the time)

I choose b​: Someone may be using this feature on space-padded (or
null-padded) UTF8 fields of fixed byte lengths. We shouldn’t break that.

Except it only vaguely accidentally works now.*

It's not counting bytes in the file, but bytes read from the top most
layer​:

tony@mars:~$ echo 'ABCDEDGH' | iconv -t UTF16LE | ~/perl-v5.15.8-76-g1cd16d8/bin/perl -MDevel::Peek -e 'binmode STDIN, ":encoding(utf16le)"; $/ = \4; $x = <STDIN>; Dump($x)'
SV = PV(0x19b2010) at 0x19c7a48
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x19d8120 "ABCD"\0 [UTF8 "ABCD"]
  CUR = 4
  LEN = 80

If that C< $/ = \4; > were returning that number of bytes from the
file $x would only have two characters, not 4.

tony@mars:~$ echo '€ABCDEFGH' | iconv -t UTF16LE | ~/perl-v5.15.8-76-g1cd16d8/bin/perl -MDevel::Peek -e 'binmode STDIN, ":encoding(utf16le)"; $/ = \4; $x = <STDIN>; Dump($x)'
SV = PV(0xf44010) at 0xf59a48
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0xf6a120 "\342\202\254A"\0 [UTF8 "\x{20ac}A"]
  CUR = 4
  LEN = 80

In this case I got lucky and managed to consume 4 bytes from the file.

Expecting byte semantics out of a stream we've deliberately chosen to
work in a non-byte fashion doesn't seem realistic to me.

I think it should behave like read().

Tony

* it isn't specifically this message, but several seem to assume that
  you can expect byte behaviour from a character stream.
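Tony's observation - that the count applies to what comes out of the topmost layer, not to bytes in the file - is easy to reproduce with an in-memory handle (a sketch; the data is made up):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Eight ASCII characters stored as UTF-16LE: sixteen bytes of "file".
my $octets = encode('UTF-16LE', 'ABCDEFGH');

open my $fh, '<:encoding(UTF-16LE)', \$octets or die $!;
$/ = \4;            # a "4-unit" record
my $rec = <$fh>;
print "$rec\n";     # ABCD - four decoded characters, eight file bytes
```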


p5pRT commented Mar 12, 2012

From @rjbs

I think the list came up with three likely ways forward. I don't /think/ any of them was entirely
shot down, after some re-reading.

(1) $/=\10 makes <$fh> behave like read $fh, $str, 10 -- meaning that $str is either (a) ten
units longer or (b) fewer than ten units longer and $fh is at eof or (c) there is some error. If
you're reading a raw handle, you get bytes, and there is no complexity. If you're reading a
handle that's decoding a variable length encoding, the read may have to double-dip. If the
double-dip doesn't help, or has an error, that error should propagate and we still end up with
(c) an error condition. i.e., the getline diamond (<>) returns undef and $! is populated, or
(more likely) the encoding layer throws its "malformed input" exception.

(2) $/=\10 being applied to a non-single-width encoded filehandle is fatal on read.

(3) $/=\10 is left working as it does now, but there is a warning when it's in effect for a
non-single-width encoded filehandle.

Option (1) seemed to get the least discussion, at least that I saw in my re-reading, but I think
it's the best. It also seems the least likely to get implemented promptly, and I wonder
whether we're best off doing (3) for now until (1) is available.

I'm not sure how feasible these are for implementation, as I imagine there will be some grotty
layer interaction issues.
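For comparison, the read() semantics that option (1) aims at already treat LENGTH as characters on an encoding-layer handle, so a read can't stop mid-character (a minimal sketch with made-up data):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $octets = encode('UTF-8', "\x{20AC}" x 3);    # 3 characters, 9 bytes

open my $fh, '<:encoding(UTF-8)', \$octets or die $!;
my $got = read $fh, my $str, 2;    # LENGTH counts characters here
print "$got characters, length ", length($str), "\n";
```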


p5pRT commented Mar 12, 2012

From @ikegami

On Mon, Mar 12, 2012 at 10​:23 AM, Ricardo SIGNES via RT <
perlbug-followup@​perl.org> wrote​:

$/=\10 makes <$fh> behave like read $fh, $str, 10

[...]

It also seems the least likely to get implemented promptly, and I wonder
whether we're best off doing (3) for now until (1) is available.

I'm not sure how feasible these are for implementation, as I imagine there
will be some grotty
layer interaction issues.

If it's the same behaviour as C<read($fh, $buf, ${ $/ })> as you say, it
should be easy to implement. Does the detailed behaviour actually deviate
from C<read>'s, and are those deviations intentional? (I have no idea how
to cause read errors to test it.)

- Eric

meaning that $str is either (a) ten units longer

Nit​: "Long", not "longer". C<readline> doesn't have anything to which to
append. (Even C<read> doesn't append.)

If you're reading a handle that's decoding a variable length encoding, the

read may have to double-dip.

Nit​: Any of zero, one or more dips can be needed for either "kind" of
handles.


p5pRT commented Mar 16, 2012

From @Leont

On Mon Mar 12 07​:23​:03 2012, rjbs wrote​:

(3) $/=\10 is left working as it does now, but there is a warning when
it's in effect for a n-s-w encoded filehandle.

It doesn't really work like that. You can't know if a handle is
non-single-width encoded, you can only know if it outputs an ordinary
bytestream or a utf8 encoded bytestream.

Also, how does this work wrt crlf-translation? That is a similar issue
in the end.

Leon


p5pRT commented Mar 16, 2012

From @tonycoz

On Mon, Mar 12, 2012 at 07​:23​:04AM -0700, Ricardo SIGNES via RT wrote​:

I think the list came up with three likely ways forward. I don't /think/ either was entirely shot
down, after some re-reading.

(1) $/=\10 makes <$fh> behave like read $fh, $str, 10 -- meaning that $str is either (a) ten
units longer or (b) fewer than ten units longer and $fh is at eof or (c) there is some error. If
you're reading a raw handle, you get bytes, and there is no complexity. If you're reading a
handle that's decoding a variable length encoding, the read may have to double-dip. If the
double-dip doesn't help, or has an error, that error should propagate and we still end up with
(c) an error condition. i.e., the getline diamond (<>) returns undef and $! is populated, or
(more likely) the encoding layer throws its "malformed input" exception.

Sort of implemented in tonyc/readline-fixed.

The biggest issue is VMS, since that falls back to directly calling
PerlLIO_read() for a real file.

That make sense for a VMS record-based file, but if it's not record
based we have the same problem blead currently has with $/ = \10 on
utf8 handles.

The other issue is that the :utf8 handle can return invalid utf-8 -
which is an issue outside the new code too, but how to handle it?

Tony
(a quick email before bed)


p5pRT commented Mar 16, 2012

From @rjbs

* Tony Cook <tony@​develop-help.com> [2012-03-16T09​:27​:55]

Sort of implemented in tonyc/readline-fixed.

Keen!

The biggest issue is VMS, since that falls back to directly calling
PerlLIO_read() for a real file.

That make sense for a VMS record-based file, but if it's not record
based we have the same problem blead currently has with $/ = \10 on
utf8 handles.

Okay. I will think about that, but hope to hear from Craig [or any other
VMS-using folk].

The other issue is that the :utf8 handle can return invalid utf-8 -
which is an issue outside the new code too, but how to handle it?

This is getting worked on elsewhere and I assume that when it is addressed,
it won't be relevant here in a special way. In other words​: I think we can
stick to solving these problems independently. Agreed?

--
rjbs


p5pRT commented Mar 16, 2012

From @craigberry

On Fri, Mar 16, 2012 at 9​:05 AM, Ricardo Signes
<perl.p5p@​rjbs.manxome.org> wrote​:

* Tony Cook <tony@​develop-help.com> [2012-03-16T09​:27​:55]

Sort of implemented in tonyc/readline-fixed.

Keen!

The biggest issue is VMS, since that falls back to directly calling
PerlLIO_read() for a real file.

That make sense for a VMS record-based file, but if it's not record
based we have the same problem blead currently has with $/ = \10 on
utf8 handles.

Okay.  I will think about that, but hope to hear from Craig [or any other
VMS-using folk].

I've browsed Tony's branch and his commit message correctly notes​:

- VMS is the elephant in the room - the conditional means that the new
  code is completely ignored for reading from files. If we can detect
  record-oriented files in some way this could change.

It is possible to determine whether a file we have open is
record-oriented. I did something similar here​:

<http://perl5.git.perl.org/perl.git/commit/8c8488cd4fce90cb5c03fb3f89e89c05e5275498>

I think what Tony is suggesting is that if we are reading from a
fixed-length, record-oriented file, then and only then would we employ
the traditional behavior and go directly to the unix layer and read a
specific number of bytes. Otherwise we would do what other platforms
do and let the perlio layer intervene. And the new behavior on top of
that would be to make $/ = \number mean number of characters not
number of bytes for handles with utf-8 set. Is that a fair summary?

I think that could work for some definition of work. Adding a third
meaning to $/ that is utterly different from its other two meanings is
an, um, interesting design decision. I'm willing to accept that it
might be the right one given the emergent behaviors surrounding using
it on non-binary data. I've never had occasion to do that myself, but
I gather that people use it as a "sip don't slurp" feature where they
don't actually care how much data they get back as long as it's not
too much at once. If that's the most common use case, then maybe this
approach is the best way forward.


p5pRT commented Mar 16, 2012

From @khwilliamson

On 03/16/2012 10​:59 AM, Craig A. Berry wrote​:

On Fri, Mar 16, 2012 at 9​:05 AM, Ricardo Signes
<perl.p5p@​rjbs.manxome.org> wrote​:

* Tony Cook<tony@​develop-help.com> [2012-03-16T09​:27​:55]

Sort of implemented in tonyc/readline-fixed.

Keen!

The biggest issue is VMS, since that falls back to directly calling
PerlLIO_read() for a real file.

That makes sense for a VMS record-based file, but if it's not record
based we have the same problem blead currently has with $/ = \10 on
utf8 handles.

Okay. I will think about that, but hope to hear from Craig [or any other
VMS-using folk].

I've browsed Tony's branch and his commit message correctly notes​:

- VMS is the elephant in the room - the conditional means that the new
code is completely ignored for reading from files. If we can detect
record-oriented files in some way this could change.

It is possible to determine whether a file we have open is
record-oriented. I did something similar here​:

<http​://perl5.git.perl.org/perl.git/commit/8c8488cd4fce90cb5c03fb3f89e89c05e5275498>

I think what Tony is suggesting is that if we are reading from a
fixed-length, record-oriented file, then and only then would we employ
the traditional behavior and go directly to the unix layer and read a
specific number of bytes. Otherwise we would do what other platforms
do and let the perlio layer intervene. And the new behavior on top of
that would be to make $/ = \number mean number of characters not
number of bytes for handles with utf-8 set. Is that a fair summary?

That is my understanding.

I think that could work for some definition of work. Adding a third
meaning to $/ that is utterly different from its other two meanings is
a, um, interesting design decision. I'm willing to accept that it
might be the right one given the emergent behaviors surrounding using
it on non-binary data. I've never had occasion to do that myself, but
I gather that people use it as a "sip don't slurp" feature where they
don't actually care how much data they get back as long as it's not
too much at once. If that's the most common use case, then maybe this
approach is the best way forward.

I am not convinced that it is the right thing to do to add this new
interpretation. The only use case I've heard, as someone suggested
earlier in this thread, is for twitter's arbitrary 140 character limit.
  Maybe that's a good enough case to do it.

I believe that it's possible to be "too clever by half" in the sense of
being half-witted about it. Sometimes we think we know what DWIM should
be, but we're wrong, and we are actually doing a disservice by making
the wrong guess. I'm not so sure that this isn't in that category.


p5pRT commented Mar 16, 2012

From @tonycoz

On Sat, Mar 17, 2012 at 12​:27​:55AM +1100, Tony Cook wrote​:

The biggest issue is VMS, since that falls back to directly calling
PerlLIO_read() for a real file.

That makes sense for a VMS record-based file, but if it's not record
based we have the same problem blead currently has with $/ = \10 on
utf8 handles.

Actually, thinking about it further - it doesn't make sense.

Currently reading from a file with $/ = \N under VMS ignores
everything about PerlIO except the UTF-8 flag.

So any encoding layer is ignored.

But I suspect fixing that for record oriented files would require a
"PerlIO_read_record()" or something similar that promises to fulfil
the results from a single PerlLIO_read() or from buffered data without
a further PerlLIO_read().

Tony


p5pRT commented Mar 16, 2012

From @tonycoz

On Fri, Mar 16, 2012 at 01​:32​:54PM -0600, Karl Williamson wrote​:

I am not convinced that it is the right thing to do to add this new
interpretation. The only use case I've heard, as someone suggested
earlier in this thread, is for twitter's arbitrary 140 character
limit. Maybe that's a good enough case to do it.

This doesn't add a new interpretation, it replaces the existing
behaviour.

From "stream is utf-8 ? read N bytes of the stream as encoded in UTF-8
after it's passed through the I/O layers : read N bytes" to "read N
characters from the stream".

I believe that it's possible to be "too clever by half" in the sense
of being half-witted about it. Sometimes we think we know what DWIM
should be, but we're wrong, and we are actually doing a disservice
by making the wrong guess. I'm not so sure that this isn't in that
category.

The problem is the current behaviour of getline with C<$/ = \N;> is
broken in at least four ways​:

1) it ignores the abstraction - the user has put the stream in a mode
  where otherwise a byte doesn't map exactly to a character, but the
  current behaviour treats N as a byte count.

2) the byte count only ever matches the underlying stream for utf-8
  encoding​:

# read a latin1 stream, see the ticket for other examples
$ perl -e 'print "foo\xE4"' | perl -MDevel::Peek -e 'binmode STDIN, ":encoding(latin1)"; $/ = \4; $line = <STDIN>; Dump $line'
SV = PV(0x2476af8) at 0x2494e98
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x2512610 "foo\303"\0 [UTF8 "foo\x{c0}"]
  CUR = 4
  LEN = 80

3) getline() with $/ = \N can leave the stream in an inconsistent
  state (ie. in the middle of a character).

4) it can produce SVs marked as UTF-8 with invalid UTF-8 data
  (breaking our SV abstraction)***

My change is intended to produce the same behaviour as read() on the
same stream; if this altered behaviour is broken, then maybe read()
needs to change too.

I can think of some other reasonable behaviours​:

a) croak() if the stream is in utf-8 mode - but in general (I think)
  perl tends to try a little bit harder to make things work rather
  than throw up its hands and show "CAN'T!".

b) always call PerlLIO_read() (aka sysread()), but in this case it
  should never set the UTF-8 flag on the result, since setting UTF-8
  would cause the same problem as 4) above*. This means that we're
  ignoring the encoding of the underlying file, which is inconsistent
  with the $/ = "sep" behaviour of getline().

c) extend the PerlIO_read_record() mentioned in my note on VMS
  behaviour to be a "read N bytes from the underlying file and fulfil
  the request from that" on stream files. Any incomplete characters
  should be treated the same as an incomplete character at EOF. I
  suspect this is the best solution, but implementing it isn't going
  to happen before 5.16.

Now I can't think of a case where I'd use $/ = \N or read($fh,
$scalar, N) on an encoded stream, but making it behave sanely would be
nice, even if that sanity is "CAN'T!".

Tony

* this is probably a bug in sysread()**​:

$ perl -e 'print "foo\xE4"' | perl -MDevel::Peek -e 'binmode STDIN, ":encoding(latin1)"; sysread(STDIN, $x, 4); Dump $x'
SV = PVMG(0x1466970) at 0x142fe30
  REFCNT = 1
  FLAGS = (SMG,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x143e5e0 "foo\344"\0 [UTF8 "foo\x{4000}"]
  CUR = 4
  LEN = 8
  MAGIC = 0x14381f0
  MG_VIRTUAL = &PL_vtbl_utf8
  MG_TYPE = PERL_MAGIC_utf8(w)
  MG_LEN = -1

** even if it's documented to work that way, ugh

*** I think badly encoded UTF-8 SVs should cause a panic when detected


p5pRT commented Mar 17, 2012

From @rjbs

* Karl Williamson <public@​khwilliamson.com> [2012-03-16T15​:32​:54]

On 03/16/2012 10​:59 AM, Craig A. Berry wrote​:

I think that could work for some definition of work. Adding a third
meaning to $/ that is utterly different from its other two meanings is
a, um, interesting design decision. I'm willing to accept that it
might be the right one given the emergent behaviors surrounding using
it on non-binary data. I've never had occasion to do that myself, but
I gather that people use it as a "sip don't slurp" feature where they
don't actually care how much data they get back as long as it's not
too much at once. If that's the most common use case, then maybe this
approach is the best way forward.

I am not convinced that it is the right thing to do to add this new
interpretation. The only use case I've heard, as someone suggested
earlier in this thread, is for twitter's arbitrary 140 character
limit. Maybe that's a good enough case to do it.

I would not say that we are adding a new meaning. I think we are making
$/ more consistent. $/=\10 now means "read ten bytes" even when the
handle is not in the business of providing bytes and when even read and
sysread would give you ten *characters* if you passed them a length of
10.

I believe that it's possible to be "too clever by half" in the sense
of being half-witted about it. Sometimes we think we know what DWIM
should be, but we're wrong, and we are actually doing a disservice
by making the wrong guess. I'm not so sure that this isn't in that
category.

I agree​: I am not a fan of too much DWIM. I don't like the "figure out
something halfway sane to do, rather than tell the user to make some
darn sense."

I don't see this as a weird made-up behavior to avoid an exception. I
see it as a clarification of what $/=\10 means​: it means to read ten
at a time, just like other reads would read ten.

So, rather than "who would use this?" I wonder "which design is more
coherent overall?" To me, it seems like this one is easier to explain.
"It means that your <> reads X at a time, like read(), but with a
notation some will find less surprising."

The alternative is something like, "It means to read X bytes at a time,
like read, except it is fatal if you have any layer in place other than
raw." And then we're also back to Leon's question about the crlf layer.
We can find the answer to that, but then it needs to go into the
explanation, too.

So​: I think this is easier to explain, which is usually a good sign for
"not too clever."

But​: I am open to being told that I'm wrong and nuts, especially if it
comes with a good explanation of how or why.

--
rjbs


p5pRT commented Mar 17, 2012

From @craigberry

On Fri, Mar 16, 2012 at 5​:10 PM, Tony Cook <tony@​develop-help.com> wrote​:

On Sat, Mar 17, 2012 at 12​:27​:55AM +1100, Tony Cook wrote​:

The biggest issue is VMS, since that falls back to directly calling
PerlLIO_read() for a real file.

That makes sense for a VMS record-based file, but if it's not record
based we have the same problem blead currently has with $/ = \10 on
utf8 handles.

Actually, thinking about it further - it doesn't make sense.

Currently reading from a file with $/ = \N under VMS ignores
everything about PerlIO except the UTF-8 flag.

So any encoding layer is ignored.

But I suspect fixing that for record oriented files would require a
"PerlIO_read_record()" or something similar that promises to fulfil
the results from a single PerlLIO_read() or from buffered data without
a further PerlLIO_read().

That's what it already does on VMS, but it currently does it
regardless of file type and it doesn't fix anything related to
encoding. The attached patch against the readline-fixed branch makes
it call PerlLIO_read only for record-oriented files, leaving
stream-oriented files to do exactly what's done on other platforms.

The patch introduces a couple of new test failures (t/io/errno.t and
dist/IO/t/io_dir.t) that appear at first glance to have something to
do with eol or eof expectations, but I haven't investigated them. The
test failures almost certainly have something to do with behavior on
stream-oriented files (the "else" branch expanded by the patch) since
as far as I know there is no actual testing of $/ = \N on
record-oriented files.

The patch doesn't do anything for the problem with reading
varying-width characters from record-oriented files. I don't really
think there is anything that can be done about that, except telling
people not to. Records are measured in bytes. Period. This is the
model upon which $/ = \N was based. It's no sin to extend the model
as in the suggestion to read characters rather than bytes, but the
extended model can't sensibly be superimposed on the real thing. It
just won't work.


p5pRT commented Mar 17, 2012

From @craigberry

0001-Only-handle-PL_rs-differently-on-VMS-for-record-orie.patch
From 4bce77cf02b9db2e4ca0e81ba33d9c3500faa1ff Mon Sep 17 00:00:00 2001
From: "Craig A. Berry" <craigberry@mac.com>
Date: Fri, 16 Mar 2012 14:20:29 -0500
Subject: [PATCH] Only handle PL_rs differently on VMS for record-oriented
 files.

For stream-oriented files, the effects of buffering and other
layers should be exactly as they are on other platforms. For true,
record-oriented files, though, setting $/ = \number must provide
exactly one low-level read per record.  If a read were ever to
return less than a full record (due to, for example, filling up
the perlio buffer), a subsequent read would get the *next* record,
losing whatever data remained in the partially-read record.
---
 sv.c |   29 +++++++++++++++++++----------
 1 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/sv.c b/sv.c
index 6a303cc..6a784d7 100644
--- a/sv.c
+++ b/sv.c
@@ -7561,22 +7561,31 @@ S_sv_gets_read_record(pTHX_ SV *const sv, PerlIO *const fp, I32 append)
     const STRLEN recsize = SvUV(SvRV(PL_rs)); /* RsRECORD() guarantees > 0. */
       /* Grab the size of the record we're getting */
     char *buffer = SvGROW(sv, (STRLEN)(recsize + append + 1)) + append;
+    
+    /* Go yank in */
 #ifdef VMS
+#include <rms.h>
     int fd;
-#endif
+    Stat_t st;
 
-    /* Go yank in */
-#ifdef VMS
-    /* VMS wants read instead of fread, because fread doesn't respect */
-    /* RMS record boundaries. This is not necessarily a good thing to be */
-    /* doing, but we've got no other real choice - except avoid stdio
-       as implementation - perhaps write a :vms layer ?
-    */
+    /* With a true, record-oriented file on VMS, we need to use read directly
+     * to ensure that we respect RMS record boundaries.  The user is responsible
+     * for providing a PL_rs value that corresponds to the FAB$W_MRS (maximum
+     * record size) field.  N.B. This is likely to produce invalid results on
+     * varying-width character data when a record ends mid-character.
+     */
     fd = PerlIO_fileno(fp);
-    if (fd != -1) {
+    if (fd != -1
+	&& PerlLIO_fstat(fd, &st) == 0
+	&& (st.st_fab_rfm == FAB$C_VAR
+	    || st.st_fab_rfm == FAB$C_VFC
+	    || st.st_fab_rfm == FAB$C_FIX)) {
+
 	bytesread = PerlLIO_read(fd, buffer, recsize);
     }
-    else /* in-memory file from PerlIO::Scalar */
+    else /* in-memory file from PerlIO::Scalar
+          * or not a record-oriented file
+          */
 #endif
     {
 	bytesread = PerlIO_read(fp, buffer, recsize);
-- 
1.7.7.GIT


p5pRT commented Mar 17, 2012

From @ikegami

For those advocating $/ to control the number of bytes read at the lowest
level rather than the number of units returned.

If you're going to create a fixed-width record binary file whose records
consist entirely of text

my $rec = encode($enc, $text);
die if length($rec) > $REC_SIZE;
$rec .= $pad x (($REC_SIZE-length($rec))/length($pad));
print($bin_fh $rec);

Why do you find it necessary to then go and treat it as a text file

# :encoding($enc)
local $/ = \$REC_SIZE;
my $rec = <$fh> =~ s/(?:\Q$decoded_pad\E)+\z//r;

when it's just as simple as treating it as a binary file?

# :raw
local $/ = \$REC_SIZE;
my $rec = decode($enc, <$fh> =~ s/(?:\Q$pad\E)+\z//r);

There is no need for $/ to indicate the number of bytes read at the lowest
level, a "feature" that has far too many problem. If you have a binary file
such as a fixed-width record file, don't use :encoding!

- Eric

PS - The problem seems to boil down to the inability to tell Perl how to
handle errors it might receive, such as a partial character at the end of
the record.


p5pRT commented Mar 17, 2012

From @tonycoz

On Mon, Mar 12, 2012 at 07​:23​:04AM -0700, Ricardo SIGNES via RT wrote​:

I think the list came up with three likely ways forward. I don't /think/ any was entirely shot
down, after some re-reading.

An alternative approach for 5.16 - document the badness and warn the
user.

Also visible at​:

http​://perl5.git.perl.org/perl.git/shortlog/refs/heads/tonyc/readline-fixed-doc

Tony


p5pRT commented Mar 17, 2012

From @tonycoz

0001-rt-79960-document-how-broken-N-is-for-unicode-stream.patch
From 1d8aef309f72b162395b7beed4462bc5964d973f Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Sun, 18 Mar 2012 10:26:22 +1100
Subject: [PATCH] [rt #79960] document how broken $/ = \N is for unicode streams

It's kind of late in the release process to change how $/ = \N works
for unicode streams, briefly document how broken it is and let the
user know it may change.
---
 pod/perlvar.pod |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/pod/perlvar.pod b/pod/perlvar.pod
index 72968f1..ea1f601 100644
--- a/pod/perlvar.pod
+++ b/pod/perlvar.pod
@@ -1348,6 +1348,13 @@ want to read in record mode is probably unusable in line mode.)
 Non-VMS systems do normal I/O, so it's safe to mix record and
 non-record reads of a file.
 
+If you perform a record read on a FILE with an encoding layer such as
+C<:encoding(latin1)> or C<:utf8>, you may get an invalid string as a
+result, may leave the FILE positioned between characters in the stream
+and may not be reading the number of bytes from the underlying file
+that you specified.  This behaviour may change without warning in a
+future version of perl.
+
 See also L<perlport/"Newlines">. Also see L</$.>.
 
 Mnemonic: / delimits line boundaries when quoting poetry.
-- 
1.7.2.5


p5pRT commented Mar 19, 2012

From @khwilliamson

On 03/17/2012 04​:53 AM, Eric Brine wrote​:

For those advocating $/ to control the number of bytes read at the
lowest level rather than the number of units returned.

If you're going to create fixed-with record binary file whose records
consist entirely of text

my $rec = encode($enc, $text);
die if length($rec) > $REC_SIZE;
$rec .= $pad x (($REC_SIZE-length($rec))/length($pad));
print($bin_fh $rec);

Why do you find it necessary to then go and treat it as a text file

# :encoding($enc)
local $/ = \$REC_SIZE;
my $rec = <$fh> =~ s/(?​:\Q$decoded_pad\E)+\z//r;

when it's just as simple as treating it as a binary file?

# :raw
local $/ = \$REC_SIZE;
my $rec = decode($enc, <$fh> =~ s/(?​:\Q$pad\E)+\z//r);

There is no need for $/ to indicate the number of bytes read at the
lowest level, a "feature" that has far too many problem. If you have a
binary file such as a fixed-width record file, don't use :encoding!

I agree, and doing otherwise has been called crazy and insane by others
on this list. This ticket was filed by Nicholas, and it is his
recollection that it came about not because some programmer was crazy
enough to do this, but that reading the code made him wonder what would
happen if....

I have come to the opinion that we don't have to accommodate this,
except to croak or warn upon its occurrence. I think that is sufficient
to do in 5.16, and perhaps forever.


p5pRT commented Mar 19, 2012

From @khwilliamson

On 03/16/2012 06​:25 PM, Ricardo Signes wrote​:

* Karl Williamson<public@​khwilliamson.com> [2012-03-16T15​:32​:54]

On 03/16/2012 10​:59 AM, Craig A. Berry wrote​:

I think that could work for some definition of work. Adding a third
meaning to $/ that is utterly different from its other two meanings is
a, um, interesting design decision. I'm willing to accept that it
might be the right one given the emergent behaviors surrounding using
it on non-binary data. I've never had occasion to do that myself, but
I gather that people use it as a "sip don't slurp" feature where they
don't actually care how much data they get back as long as it's not
too much at once. If that's the most common use case, then maybe this
approach is the best way forward.

I am not convinced that it is the right thing to do to add this new
interpretation. The only use case I've heard, as someone suggested
earlier in this thread, is for twitter's arbitrary 140 character
limit. Maybe that's a good enough case to do it.

I would not say that we are adding a new meaning. I think we are making
$/ more consistent. $/=\10 now means "read ten bytes" even when the
handle is not in the business of providing bytes and when even read and
sysread would give you ten *characters* if you passed them a length of
10.

I believe that it's possible to be "too clever by half" in the sense
of being half-witted about it. Sometimes we think we know what DWIM
should be, but we're wrong, and we are actually doing a disservice
by making the wrong guess. I'm not so sure that this isn't in that
category.

I agree​: I am not a fan of too much DWIM. I don't like the "figure out
something halfway sane to do, rather than tell the user to make some
darn sense."

I don't see this as a weird made-up behavior to avoid an exception. I
see it as a clarification of what $/=\10 means​: it means to read ten
at a time, just like other reads would read ten.

So, rather than "who would use this?" I wonder "which design is more
coherent overall?" To me, it seems like this one is easier to explain.
"It means that your<> reads X at a time, like read(), but with a
notation some will find less surprising."

The alternative is something like, "It means to read X bytes at a time,
like read, except it is fatal if you have any layer in place other than
raw." And then we're also back to Leon's question about the crlf layer.
We can find the answer to that, but then it needs to go into the
explanation, too.

So​: I think this is easier to explain, which is usually a good sign for
"not too clever."

But​: I am open to being told that I'm wrong and nuts, especially if it
comes with a good explanation of how or why.

I am persuaded by your argument that this is a reasonable thing to do
and isn't extending DWIM too far. But I'm not persuaded it is something
that need be done. But since Tony has gone to the trouble of making a
patch for it (which I haven't had time to review yet, and I have no
expertise in this area of Perl anyway), I no longer have an objection to
doing it.

I do think something should be done for 5.16, and that a warning or
croaking is sufficient. I don't have an opinion on whether a more
full-blown change should go into 5.16 or not.


p5pRT commented Dec 10, 2012

From @tonycoz

On Mon Nov 29 08​:05​:05 2010, nicholas wrote​:

It's possible to get the perl interpreter to have corrupt internal
state on a valid UTF-8 input stream, by setting $/ to cause
fixed-length reads.

fixed in 3421318

Tony


p5pRT commented Dec 10, 2012

@tonycoz - Status changed from 'open' to 'resolved'

p5pRT closed this as completed Dec 10, 2012

p5pRT commented Dec 11, 2012

From @rjbs

* Tony Cook via RT <perlbug-followup@​perl.org> [2012-12-10T18​:34​:04]

On Mon Nov 29 08​:05​:05 2010, nicholas wrote​:

It's possible to get the perl interpreter to have corrupt internal
state on a valid UTF-8 input stream, by setting $/ to cause
fixed-length reads.

fixed in 3421318

Thanks very much! I'm glad this got applied in the end!

--
rjbs
