
Perl leaves broken UTF-8 in SVs whose UTF8 is set #11671

Open
p5pRT opened this issue Sep 26, 2011 · 25 comments
Comments


p5pRT commented Sep 26, 2011

Migrated from rt.perl.org#100058 (status was 'open')

Searchable as RT100058$


p5pRT commented Sep 26, 2011

From tchrist@perl.com

Remembering how setting $/ to an integer ref can cause Perl to erroneously
leave broken Perl strings (malformed UTF-8, etc.), I've noticed that you can
get this to happen even more easily than that.

  % perl -C0 -le 'print "\xC0\x81"' | perl -CS -nle 'printf "U+%v04X\n", $_'
  Malformed UTF-8 character (2 bytes, need 1, after start byte 0xc0) in printf at -e line 1, <> line 1.
  U+0000

  % perl -C0 -le 'print "\xC1\x81"' | perl -CS -nle 'print for length, defined, ord'
  Malformed UTF-8 character (2 bytes, need 1, after start byte 0xc1) in ord at -e line 1, <> line 1.
  1
  1
  0

Surely this is an error?? We are actually storing invalid UTF-8
and yet we are marking it valid​:

  % perl -C0 -le 'print "\xC1\x81"' | perl -MDevel::Peek -CS -nle 'Dump($_)'
  SV = PV(0x3c0250e4) at 0x3c04b084
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x3c031920 "\301\201"\0Malformed UTF-8 character (2 bytes, need 1, after start byte 0xc1) in subroutine entry at -e line 1, <> line 1.
  [UTF8 "\x{0}"]
  CUR = 2
  LEN = 80

  % perl -C0 -le 'print "bad\xC1\x81stuff"' | perl -MDevel::Peek -CS -nle 'Dump($_)'
  SV = PV(0x3c0250e4) at 0x3c04b084
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x3c031920 "bad\301\201stuff"\0Malformed UTF-8 character (2 bytes, need 1, after start byte 0xc1) in subroutine entry at -e line 1, <> line 1.
  [UTF8 "bad\x{0}stuff"]
  CUR = 10
  LEN = 80

  % perl -C0 -le 'print "bad\xC1\x88stuff"' | perl -MDevel::Peek -CS -nle 'Dump($_)'
  SV = PV(0x3c0250e4) at 0x3c04b084
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x3c031920 "bad\301\210stuff"\0Malformed UTF-8 character (2 bytes, need 1, after start byte 0xc1) in subroutine entry at -e line 1, <> line 1.
  [UTF8 "bad\x{0}stuff"]
  CUR = 10
  LEN = 80

The UTF8 flag is on, but that is not UTF8.

I can't see how this isn't a bug, but am willing to be enlightened.

--tom

Summary of my perl5 (revision 5 version 14 subversion 0) configuration​:
 
  Platform​:
  osname=openbsd, osvers=4.4, archname=OpenBSD.i386-openbsd
  uname='openbsd chthon 4.4 generic#0 i386 '
  config_args='-des'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=undef, usemultiplicity=undef
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=undef, use64bitall=undef, uselongdouble=undef
  usemymalloc=y, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
  optimize='-O2',
  cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
  ccversion='', gccversion='3.3.5 (propolice)', gccosandvers='openbsd4.4'
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags ='-Wl,-E -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /usr/lib
  libs=-lgdbm -lm -lutil -lc
  perllibs=-lm -lutil -lc
  libc=/usr/lib/libc.so.48.0, so=so, useshrplib=false, libperl=libperl.a
  gnulibc_version=''
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
  cccdlflags='-DPIC -fPIC ', lddlflags='-shared -fPIC -L/usr/local/lib -fstack-protector'

Characteristics of this binary (from libperl)​:
  Compile-time options​: MYMALLOC PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP
  PERL_PRESERVE_IVUV USE_LARGE_FILES USE_PERLIO
  USE_PERL_ATOF
  Built under openbsd
  Compiled at Jun 11 2011 11​:48​:28
  %ENV​:
  PERL_UNICODE="SA"
  @​INC​:
  /usr/local/lib/perl5/site_perl/5.14.0/OpenBSD.i386-openbsd
  /usr/local/lib/perl5/site_perl/5.14.0
  /usr/local/lib/perl5/5.14.0/OpenBSD.i386-openbsd
  /usr/local/lib/perl5/5.14.0
  /usr/local/lib/perl5/site_perl/5.12.3
  /usr/local/lib/perl5/site_perl/5.11.3
  /usr/local/lib/perl5/site_perl/5.10.1
  /usr/local/lib/perl5/site_perl/5.10.0
  /usr/local/lib/perl5/site_perl/5.8.7
  /usr/local/lib/perl5/site_perl/5.8.0
  /usr/local/lib/perl5/site_perl/5.6.0
  /usr/local/lib/perl5/site_perl/5.005
  /usr/local/lib/perl5/site_perl
  .

@p5pRT
Copy link
Author

p5pRT commented Sep 26, 2011

From @cpansprout

On Mon Sep 26 13:19:50 2011, tom christiansen wrote:

Remembering how setting $/ to an integer ref can cause Perl to erroneously
leave broken Perl strings (malformed UTF-8, etc.), I've noticed that you can
get this to happen even more easily than that.

  % perl -C0 -le 'print "\xC0\x81"' | perl -CS -nle 'printf "U+%v04X\n", $_'
  Malformed UTF-8 character (2 bytes, need 1, after start byte 0xc0) in printf at -e line 1, <> line 1.
  U+0000

  % perl -C0 -le 'print "\xC1\x81"' | perl -CS -nle 'print for length, defined, ord'
  Malformed UTF-8 character (2 bytes, need 1, after start byte 0xc1) in ord at -e line 1, <> line 1.
  1
  1
  0

Surely this is an error?? We are actually storing invalid UTF-8
and yet we are marking it valid:

  % perl -C0 -le 'print "\xC1\x81"' | perl -MDevel::Peek -CS -nle 'Dump($_)'
  SV = PV(0x3c0250e4) at 0x3c04b084
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x3c031920 "\301\201"\0Malformed UTF-8 character (2 bytes, need 1, after start byte 0xc1) in subroutine entry at -e line 1, <> line 1.
  [UTF8 "\x{0}"]
  CUR = 2
  LEN = 80

  % perl -C0 -le 'print "bad\xC1\x81stuff"' | perl -MDevel::Peek -CS -nle 'Dump($_)'
  SV = PV(0x3c0250e4) at 0x3c04b084
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x3c031920 "bad\301\201stuff"\0Malformed UTF-8 character (2 bytes, need 1, after start byte 0xc1) in subroutine entry at -e line 1, <> line 1.
  [UTF8 "bad\x{0}stuff"]
  CUR = 10
  LEN = 80

  % perl -C0 -le 'print "bad\xC1\x88stuff"' | perl -MDevel::Peek -CS -nle 'Dump($_)'
  SV = PV(0x3c0250e4) at 0x3c04b084
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x3c031920 "bad\301\210stuff"\0Malformed UTF-8 character (2 bytes, need 1, after start byte 0xc1) in subroutine entry at -e line 1, <> line 1.
  [UTF8 "bad\x{0}stuff"]
  CUR = 10
  LEN = 80

The UTF8 flag is on, but that is not UTF8.

I can't see how this isn't a bug, but am willing to be enlightened.

I think it was agreed some time ago that that is a bug. The utf8 layer
should at least check for well-formedness (meaning that it produces a
valid perl scalar), even if it does not check for strict UTF-8 (disallow
certain code points), the latter being a matter of controversy.



p5pRT commented Sep 26, 2011

The RT System itself - Status changed from 'new' to 'open'


p5pRT commented Sep 26, 2011

From tchrist@perl.com

I think it was agreed some time ago that that is a bug. The utf8 layer
should at least check for well-formedness (meaning that it produces a
valid perl scalar), even if it does not check for strict UTF-8 (disallow
certain code points), the latter being a matter of controversy.

I do have some mail from Mark Davis explaining why a UTF-8 decoder must
allow everything in the range U+0000 through U+10FFFF *except* for
surrogates. Our "nonchar" warnings apparently shouldn't be there.

--tom


p5pRT commented Sep 26, 2011

From @Leont

On Mon, Sep 26, 2011 at 10:25 PM, Father Chrysostomos via RT
<perlbug-followup@perl.org> wrote:

I think it was agreed some time ago that that is a bug. The utf8 layer
should at least check for well-formedness (meaning that it produces a
valid perl scalar), even if it does not check for strict UTF-8 (disallow
certain code points), the latter being a matter of controversy.

The complication of course is that there is no such thing as a utf8
layer. It's just a flag on top of another layer. For that matter
making it a real layer may actually be a rather sensible thing, and
since we're emulating doing that right now it shouldn't be very user
visible either.

Leon


p5pRT commented Sep 26, 2011

From tchrist@perl.com

"Leon Timmermans via RT" <perlbug-followup@perl.org> wrote
  on Mon, 26 Sep 2011 13:31:08 PDT:

The complication of course is that there is no such thing as a utf8
layer. It's just a flag on top of another layer. For that matter
making it a real layer may actually be a rather sensible thing, and
since we're emulating doing that right now it shouldn't be very user
visible either.

The description in perlrun of what happens when you use :utf8 in the PERLIO
envariable confuses me, because it says that that layer does not do validity
checks, but as far as I can discern, it does.

--tom


p5pRT commented Sep 27, 2011

From @ikegami

On Mon, Sep 26, 2011 at 4:55 PM, Tom Christiansen <tchrist@perl.com> wrote:

The description in perlrun of what happens when you use :utf8 in the PERLIO
envariable confuses me, because it says that that layer does not do validity
checks, but as far as I can discern, it does.

The warning is coming from sprintf, not the layer.

$ perl -C0 -le 'print "\xC0\x81"' | perl -CS -nle1

$ perl -C0 -le 'print "\xC0\x81"' | perl -CS -MDevel::Peek -nle'Dump($_)'
SV = PV(0x9c72788) at 0x9c9ed98
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x9ca1070 "\300\201"\0Malformed UTF-8 character (2 bytes, need 1, after start byte 0xc0) in subroutine entry at -e line 1, <> line 1.
  [UTF8 "\x{0}"]
  CUR = 2
  LEN = 80


p5pRT commented Sep 27, 2011

From @khwilliamson

On 09/26/2011 02:27 PM, Tom Christiansen wrote:

I think it was agreed some time ago that that is a bug. The utf8 layer
should at least check for well-formedness (meaning that it produces a
valid perl scalar), even if it does not check for strict UTF-8 (disallow
certain code points), the latter being a matter of controversy.

I do have some mail from Mark Davis explaining why a UTF-8 decoder must
allow everything in the range U+0000 through U+10FFFF *except* for
surrogates. Our "nonchar" warnings apparently shouldn't be there.

--tom

This issue keeps coming back up, when I think we have long ago resolved
how to fix it. Here is my view of how the API should work, and I
thought that it followed the consensus view. This follows what I think
Zefram and David Golden proposed more than a year ago.

The default utf8 layer should prohibit malformed utf8, surrogates,
non-character code points and above-Unicode code points.

There should be an alternate layer, called something like utf8-lax,
which allows all three, but not malformed utf8. There should be three
other layers, with names like no-surrogates, no-nonchars, and
only-unicode which disallow exactly one class, as indicated by their
names. It should be then possible to combine these to orthogonally
allow any combination of the three problematic input types.

My understanding is that the original reason for not doing the input
checks was performance. Security is a far more important issue now, and
Nicholas has demonstrated code that does the parsing with a minimal
performance hit.

I have been waiting for that code to be complete, and then planned to
implement the other layers, unless someone else wanted to.

Having now read Mark's email, I don't think that contradicts anything
said above. It should be possible for a utf8 decoder to allow
non-characters, but it should be possible for such a decoder to disallow
them as well, and that should be what you get by default. Only by
taking extra action should you be able to specify that you want atypical
code points allowed.


p5pRT commented Sep 27, 2011

From @Leont

On Wed, Sep 28, 2011 at 1:09 AM, Karl Williamson
<public@khwilliamson.com> wrote:

This issue keeps coming back up, when I think we have long ago resolved how
to fix it.  Here is my view of how the API should work, and I thought that
it followed the consensus view.  This follows what I think Zefram and David
Golden proposed more than a year ago.

The default utf8 layer should prohibit malformed utf8, surrogates,
non-character code points and above-Unicode code points.

There should be an alternate layer, called something like utf8-lax, which
allows all three, but not malformed utf8.  There should be three other
layers, with names like no-surrogates, no-nonchars, and only-unicode which
disallow exactly one class, as indicated by their names.  It should be then
possible to combine these to orthogonally allow any combination of the three
problematic input types.

I would personally prefer it to be one layer with multiple options. I
suspect that would be conceptually cleaner when you want to combine
them. E.g.: «open my $fh, '<:utf8(surrogates-ok,nonchars-ok)',
$filename» or some such.

Leon


p5pRT commented Sep 28, 2011

From @nwc10

On Tue, Sep 27, 2011 at 05:09:33PM -0600, Karl Williamson wrote:

My understanding is that the original reason for not doing the input
checks was performance. Security is a far more important issue now, and
Nicholas has demonstrated code that does the parsing with a minimal
performance hit.

I had hoped to work on it over last Christmas, but everyone got ill and
my laptop power supply failed. So it didn't happen.

Whilst I have a feel for how to do it for UTF-8, I have no idea how to do
it for UTF-8 and UTF-EBCDIC, or at least "not break EBCDIC platforms" or
"make something hard to port to EBCDIC" as a side effect.

I also wasn't sure how to benchmark it properly, to be confident about the
magnitude of the performance change. I had thought that my test code should
be *more* efficient than the current code in utf8.c [it did less work], but
all the numbers I could collect showed it to be slightly slower. Hence why
I'm not trusting my intuition about what's happening.

It's also blocking on lack of feedback to bug #79960

Nicholas Clark


p5pRT commented Sep 28, 2011

From @khwilliamson

On 09/28/2011 05:50 AM, Nicholas Clark wrote:

On Tue, Sep 27, 2011 at 05:09:33PM -0600, Karl Williamson wrote:

My understanding is that the original reason for not doing the input
checks was performance. Security is a far more important issue now, and
Nicholas has demonstrated code that does the parsing with a minimal
performance hit.

I had hoped to work on it over last Christmas, but everyone got ill and
my laptop power supply failed. So it didn't happen.

Whilst I have a feel for how to do it for UTF-8, I have no idea how to do
it for UTF-8 and UTF-EBCDIC, or at least "not break EBCDIC platforms" or
"make something hard to port to EBCDIC" as a side effect.

I believe I have the expertise to take what you do for UTF-8 and extend
it to work on UTF-EBCDIC. I have in the past ginned up a test platform
to test some EBCDIC things on Linux; and this looks like a feasible
candidate for the same treatment.

I also wasn't sure how to benchmark it properly, to be confident about the
magnitude of the performance change. I had thought that my test code should
be *more* efficient than the current code in utf8.c [it did less work], but
all the numbers I could collect showed it to be slightly slower. Hence why
I'm not trusting my intuition about what's happening.

I remember seeing the code somewhere, and thinking that it could be
faster than what we have already. I believe that the security concerns
of not doing anything out-weigh any performance impacts. I suspect
there are performance experts on this list that Jesse could lean on to
evaluate this extremely important work, which should help keep us from
getting more CVEs.

It's also blocking on lack of feedback to bug #79960

So, here are my comments on that bug. FWIW, here is a link to what
Unicode says should happen for input validation:
http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf#page=42

I have never used $/ set to a fixed length, but reading the pod, it
appears to me that the crux of the matter is this, "[it] will attempt to
read records instead of lines, with the maximum record size being the
referenced integer." It also says, "any file you'd want to read in
record mode is probably unusable in line mode." That tells me it is ok
to croak in this situation.

But why not just return only as many complete characters as will fit in
the fixed length, leaving the pointer at the beginning of the next
partial character? The documentation already says that you can't always
expect a full-length record, and it doesn't say this occurs just at EOF.
It would croak if that partial character is too long to ever fit ($/
being very small, as in some of your examples).

I do think that the buffer length should only be construed as bytes and
not characters.


p5pRT commented Sep 28, 2011

From tchrist@perl.com

Karl Williamson <public@khwilliamson.com> wrote
  on Wed, 28 Sep 2011 17:35:01 MDT:

I do think that the buffer length should only be construed as bytes and
not characters.

Could you please explain why you think that?

Why not have

  binmode(FH, ":utf8");
  $/ = \1000;
  $_ = <FH>;

mean

  binmode(FH, ":utf8");
  read(FH, $_, 1000);

I vaguely feel like you should never have byte operations
on an encoded stream.

But maybe I'm wrong.

--tom


p5pRT commented Sep 29, 2011

From @khwilliamson

On 09/28/2011 05:56 PM, Tom Christiansen wrote:

Karl Williamson <public@khwilliamson.com> wrote
  on Wed, 28 Sep 2011 17:35:01 MDT:

I do think that the buffer length should only be construed as bytes and
not characters.

Could you please explain why you think that?

Why not have

  binmode(FH, ":utf8");
  $/ = \1000;
  $_ = <FH>;

mean

  binmode(FH, ":utf8");
  read(FH, $_, 1000);

I vaguely feel like you should never have byte operations
on an encoded stream.

But maybe I'm wrong.

--tom

I found this persuasive (from the original ticket) "Or we could try to
do what read and sysread do, and treat the length parameter as
characters, so that on a UTF-8 flagged handle we loop until we read in
sufficient characters. But that blows the idea of "record based"
completely on a UTF-8 handle."

I would also be ok with just croaking when attempting a byte-type
operation on an encoded string.


p5pRT commented Sep 29, 2011

From @nwc10

On Wed, Sep 28, 2011 at 01:43:18AM +0200, Leon Timmermans wrote:

I would personally prefer it to be one layer with multiple options. I
suspect that would be conceptually cleaner when you want to combine
them. E.g.: «open my $fh, '<:utf8(surrogates-ok,nonchars-ok)',
$filename» or some such.

I *think* so, but somewhere I have notes on what made sense, and some
combinations don't.

Whilst it would be nice for :utf8 to be the flexible layer, I think it would
lead to various problems, including problems with security expectations.
Code you wrote on a perl with it would work just fine, nicely locked down.

Then you run that code on an older perl:

$ echo Works already | perl -we 'open my $fh, "<:utf8(surrogates-ok,nonchars-ok)", "/dev/fd/0"; print <$fh>'
Works already
$ echo Works already | perl -we 'open my $fh, "<:utf8(maximally-paranoid)", "/dev/fd/0"; print <$fh>'
Works already

a) No error. No warning that your input isn't subject to paranoia
b) No way to write a compatibility version that works on older perls.

I guess that one can solve (a) by having :utf8 fault the new arguments
unless in the scope of a suitable :feature, but it's starting to feel
clunky.

Also, in terms of Jesse's 5.16+ plan, I can't see how layers are anything
but interpreter-global. In that, if we change the default for :utf8, it
has to be for everyone. The code doing validation can't do it on the basis
of lexical scope, because validation probably has to happen when a buffer is
filled, not as data are read out.

Nicholas Clark


p5pRT commented Oct 13, 2011

From @Leont

On Thu, Sep 29, 2011 at 10:09 AM, Nicholas Clark <nick@ccl4.org> wrote:

Also, in terms of Jesse's 5.16+ plan, I can't see how layers are anything
but interpreter-global. In that, if we change the default for :utf8, it
has to be for everyone. The code doing validation can't do it on the basis
of lexical scope, because validation probably has to happen when a buffer is
filled, not as data are read out.

If we had aliases in PerlIO, all of this could be handled much more
cleanly. :utf8 would mean :utf8-new or :utf8-old depending on scope.

Leon


p5pRT commented Oct 14, 2011

From @davidnicol

On Wed, Sep 28, 2011 at 8:00 PM, Karl Williamson <public@khwilliamson.com> wrote:

I found this persuasive (from the original ticket) "Or we could try to do
what read and sysread do, and treat the length parameter as characters, so
that on a UTF-8 flagged handle we loop until we read in
sufficient characters. But that blows the idea of "record based" completely
on a UTF-8 handle."

I would also be ok with just croaking when attempting a byte-type operation
on an encoded string.

It may torpedo and sink the original fixed-length records for mainframe IO
optimization idea, but that driving need may have been eclipsed by the need
to port systems that now use fixed-character-count fields out of habit into
today's brave new world, where a toothbrush can have more computing power in
it than ...

Is anyone here actually shoehorning UTF8 into fixed-length records, using
any system besides Perl to do it?

How do major commercial databases handle unicode and "CHAR 20" fields?


p5pRT commented Oct 14, 2011

From tchrist@perl.com

Is anyone here actually shoehorning UTF8 into fixed-length records,

Oh sure.

using any system besides Perl to do it?

Java.

--tom


p5pRT commented Oct 15, 2011

From @Hugmeir

On Fri, Oct 14, 2011 at 5:13 PM, Tom Christiansen <tchrist@perl.com> wrote:

Is anyone here actually shoehorning UTF8 into fixed-length records,

Oh sure.

using any system besides Perl to do it?

Java.

Is their model worth borrowing?


p5pRT commented Oct 23, 2011

From @cpansprout

On Tue Sep 27 16:11:29 2011, public@khwilliamson.com wrote:

On 09/26/2011 02:27 PM, Tom Christiansen wrote:

I think it was agreed some time ago that that is a bug. The utf8 layer
should at least check for well-formedness (meaning that it produces a
valid perl scalar), even if it does not check for strict UTF-8 (disallow
certain code points), the latter being a matter of controversy.

I do have some mail from Mark Davis explaining why a UTF-8 decoder must
allow everything in the range U+0000 through U+10FFFF *except* for
surrogates. Our "nonchar" warnings apparently shouldn't be there.

--tom

This issue keeps coming back up, when I think we have long ago resolved
how to fix it. Here is my view of how the API should work, and I
thought that it followed the consensus view. This follows what I think
Zefram and David Golden proposed more than a year ago.

The default utf8 layer should prohibit malformed utf8,

Yes, of course.

surrogates,
non-character code points and above-Unicode code points.

That might be going too far.

There should be an alternate layer, called something like utf8-lax,
which allows all three, but not malformed utf8. There should be three
other layers, with names like no-surrogates, no-nonchars, and
only-unicode which disallow exactly one class, as indicated by their
names. It should be then possible to combine these to orthogonally
allow any combination of the three problematic input types.

My understanding is that the original reason for not doing the input
checks was performance. Security is a far more important issue now,

Indeed, but the only example given where non-characters were a security
issue involved three pieces of buggy software interacting, including a
‘security’ layer that wasn’t.

(Have I already said this? I have a backlog of messages I wanted to
reply to, so I may be repeating myself.)


p5pRT commented Mar 22, 2012

From @rjbs

I believe that the correct behavior is for $/=\10 to cause getlines to
behave as reads with length 10, as discussed. However, this will be
introduced in the 5.17 series. In the meantime, the horrors of this will
be documented.

The security concern, for now, can be addressed at a different level.


p5pRT commented Mar 22, 2012

From @rjbs

On Thu Mar 22 14:53:40 2012, rjbs wrote:

I believe that the correct behavior is for $/=\10 to cause getlines to
behave as reads with length 10, as discussed. However, this will be
introduced in the 5.17 series. In the meantime, the horrors of this will
be documented.

The security concern, for now, can be addressed at a different level.

...clearly I'm talking about the $/ bug specifically, which I've marked as not blocking 5.16.



p5pRT commented Mar 22, 2012

From tchrist@perl.com

Ok.

--tom


p5pRT commented Apr 26, 2012

From @rjbs

As of cd7e6c8 the lone remaining change to be made is the default-strictness of the :utf8 layer,
which is not going to happen in 5.16.0. It will be made most likely in early 5.17.

The last few commits, however, make UTF-8 handling much stricter, including
fixing a number of bugs reported here and elsewhere. I invite interested
parties to have a look!

This ticket will now be removed from blocking.


p5pRT commented Aug 26, 2013

From @rjbs

On Thu Apr 26 11:11:43 2012, rjbs wrote:

As of cd7e6c8 the lone remaining change to be made is the
default-strictness of the :utf8 layer, which is not going to happen in
5.16.0. It will be made most likely in early 5.17.

The last few commits, however, make UTF-8 handling much stricter,
including fixing a number of bugs reported here and elsewhere. I invite
interested parties to have a look!

We hoped to see strict utf8 available in 5.17, but it didn't happen for various reasons. I got an
update not too long ago that this should be underway again. Is that still the case? Can we get a
status report? That'd be keen.

Thanks!

--
rjbs

@toddr toddr removed the khw label Oct 25, 2019
@xenu xenu removed the Severity Low label Dec 29, 2021