Failing regex test that dies with "panic: pp_match start/end pointers" #10194

Open

p5pRT opened this issue Feb 22, 2010 · 18 comments
Comments

@p5pRT

p5pRT commented Feb 22, 2010

Migrated from rt.perl.org#73018 (status was 'open')

Searchable as RT73018$

@p5pRT
Author

p5pRT commented Feb 22, 2010

From @avar

Created by @avar

It's possible to make Perl die in pp_match by manually setting the
utf8 flag on a specially crafted string. Initially this happened in a
program that wasn't manually setting the flag but I've distilled it
down to this simple report.

See this message on perl5-porters for the full report:
http://www.nntp.perl.org/group/perl.perl5.porters/2010/02/msg156829.html

This is just an RT copy, as requested.

Perl Info

Flags:
    category=core
    severity=medium

This perlbug was built using Perl 5.11.4 - Tue Feb 16 12:45:44 GMT 2010
It is being executed now by  Perl 5.10.1 - Sun Feb  7 13:06:31 UTC 2010.

Site configuration information for perl 5.10.1:

Configured by Debian Project at Sun Feb  7 13:06:31 UTC 2010.

Summary of my perl5 (revision 5 version 10 subversion 1) configuration:
   
  Platform:
    osname=linux, osvers=2.6.32-trunk-amd64, archname=x86_64-linux-gnu-thread-multi
    uname='linux madeleine 2.6.32-trunk-amd64 #1 smp sun jan 10 22:40:40 utc 2010 x86_64 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.1 -Dsitearch=/usr/local/lib/perl/5.10.1 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.1 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.4.3', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.10.2.so, so=so, useshrplib=true, libperl=libperl.so.5.10.1
    gnulibc_version='2.10.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib -fstack-protector'

Locally applied patches:
    


@INC for perl 5.10.1:
    /etc/perl
    /usr/local/lib/perl/5.10.1
    /usr/local/share/perl/5.10.1
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    .


Environment for perl 5.10.1:
    HOME=/home/avar
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/home/avar/g/misc-scripts:/home/avar/g/misc-scripts
    PERLDOC=-MPod::Text::Ansi
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Author

p5pRT commented Feb 23, 2010

From marvin@rectangular.com

What's the proposed remedy? The behavior we have now is the behavior I
would expect.

Munging the internal data of SVf_UTF8 strings isn't an easy mistake to
make -- if I recall correctly, the bitwise operators pose the only
silent threat. Any tool that lets you set the SVf_UTF8 flag directly --
such as Encode::_utf8_on(), used in the test case -- has big fat red
warnings all over its docs: "Do not use unless you know that the STRING
is well-formed UTF-8."

I don't know about this specific case, but in general, always being
prepared to recover gracefully from bad UTF-8 data at any moment is
expensive. I don't think we should sacrifice performance or code
clarity for the sake of, e.g. perpetually preparing to swap in the
Unicode replacement character. But I don't think we ought to silently
propagate bad data via bad match pointers, either. In my opinion, dying
with an informative error message is the best behavior.

The time to validate UTF-8 data is on input.

@p5pRT
Author

p5pRT commented Feb 23, 2010

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Author

p5pRT commented Feb 23, 2010

From @avar

On Tue, Feb 23, 2010 at 05:19, Marvin Humphrey via RT
<perlbug-followup@perl.org> wrote:

What's the proposed remedy?  The behavior we have now is the behavior I
would expect.

Munging the internal data of SVf_UTF8 strings isn't an easy mistake to
make -- if I recall correctly, the bitwise operators pose the only
silent threat.  Any tool that lets you set the SVf_UTF8 flag directly --
such as Encode::_utf8_on(), used in the test case -- has big fat red
warnings all over its docs:  "Do not use unless you know that the STRING
is well-formed UTF-8."

I don't know about this specific case, but in general, always being
prepared to recover gracefully from bad UTF-8 data at any moment is
expensive.  I don't think we should sacrifice performance or code
clarity for the sake of, e.g. perpetually preparing to swap in the
Unicode replacement character.  But I don't think we ought to silently
propagate bad data via bad match pointers, either.  In my opinion, dying
with an informative error message is the best behavior.

The core issue here seems to be that when this data gets passed into
core, the regex engine ends up with a different idea of the string's
length than the core itself has. This can be seen in the test case by
breaking in pp_match where it dies.

As noted in the bug this initially happened on a program that wasn't
doing anything particularly evil. I spotted this while parsing a
malformed line of the #perl6 IRC log with POE::Component::IRC::Common
which does this:

use Encode qw(decode);
use Encode::Guess qw(guess_encoding);

sub irc_to_utf8 {
  my ($line) = @_;
  my $utf8 = guess_encoding($line, 'utf8');
  return ref $utf8 ? decode('utf8', $line) : decode('cp1252', $line);
}

Unfortunately I haven't been able to reproduce that again..

@p5pRT
Author

p5pRT commented Feb 23, 2010

From marvin@rectangular.com

On Tue, Feb 23, 2010 at 05:31:47AM +0000, Ævar Arnfjörð Bjarmason wrote:

The core issue here seems to be that when this data gets passed into
core, the regex engine ends up with a different idea of the string's
length than the core itself has.

But that's because the scalar is in an invalid state. It's not something
either the core or the regex engine is equipped to recover from, and my
contention is that equipping either the core or the regex engine so that they
could successfully patch up that scalar and return it to a valid state under
all circumstances would be too expensive. We can't afford to promote
"SVf_UTF8 flag set, but invalid byte sequence" to an allowed state.

As noted in the bug this initially happened on a program that wasn't
doing anything particularly evil. I spotted this while parsing a
malformed line of the #perl6 IRC log with POE::Component::IRC::Common
which does this:

sub irc_to_utf8 {
my ($line) = @_;
my $utf8 = guess_encoding($line, 'utf8');
return ref $utf8 ? decode('utf8', $line) : decode('cp1252', $line);
}

Unfortunately I haven't been able to reproduce that again..

I submit that that's where the fixable bug lies, and that efforts should go
into isolating the problem there. A start might be to submit random bytes
sequences to decode() with both of those encoding specifications to see if you
can get it to choke.

There's room for one possible improvement within Perl itself​: the "panic"
error message doesn't tell you what the scalar contained, which would be
useful in tracking down what went wrong. A Devel​::Peek-style escaped string
dump would be nice to have.

Otherwise, I vote to close this bug as a "won't fix".

Marvin Humphrey

@p5pRT
Author

p5pRT commented Feb 24, 2010

From @avar

On Tue, Feb 23, 2010 at 20:53, Marvin Humphrey <marvin@rectangular.com> wrote:

On Tue, Feb 23, 2010 at 05:31:47AM +0000, Ævar Arnfjörð Bjarmason wrote:

The core issue here seems to be that when this data gets passed into
core, the regex engine ends up with a different idea of the string's
length than the core itself has.

But that's because the scalar is in an invalid state.  It's not something
either the core or the regex engine is equipped to recover from, and my
contention is that equipping either the core or the regex engine so that they
could successfully patch up that scalar and return it to a valid state under
all circumstances would be too expensive.  We can't afford to promote
"SVf_UTF8 flag set, but invalid byte sequence" to an allowed state.

I'm not familiar with the UTF-8 bits of the regex engine but why is
its idea about the length of an invalid UTF-8 byte sequence marked as
UTF-8 different from that of core's scalar functions? That suggests
something odd to me.

As noted in the bug this initially happened on a program that wasn't
doing anything particularly evil. I spotted this while parsing a
malformed line of the #perl6 IRC log with POE::Component::IRC::Common
which does this:

sub irc_to_utf8 {
    my ($line) = @_;
    my $utf8 = guess_encoding($line, 'utf8');
    return ref $utf8 ? decode('utf8', $line) : decode('cp1252', $line);
}

Unfortunately I haven't been able to reproduce that again..

I submit that that's where the fixable bug lies, and that efforts should go
into isolating the problem there.  A start might be to submit random bytes
sequences to decode() with both of those encoding specifications to see if you
can get it to choke.

There's room for one possible improvement within Perl itself​: the "panic"
error message doesn't tell you what the scalar contained, which would be
useful in tracking down what went wrong.  A Devel​::Peek-style escaped string
dump would be nice to have.

Otherwise, I vote to close this bug as a "won't fix".

Here's the short testcase that doesn't do anything too evil that I was
looking for:

  $ perl -e 'print "\xe2\x90"' | perl -MDevel::Peek -CI -E 'my $in =
  join q[], <>; Dump($in); () = $in =~ /(.*)/;'
  SV = PV(0x9672810) at 0x9694668
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x9698918 "\342\220"\0 [UTF8 "\x{2400}"]
  CUR = 2
  LEN = 4
  panic: pp_match start/end pointers at -e line 1, <> line 1.

The problem with saying that you shouldn't turn on the UTF-8 flag on
invalid data is that people are doing that all the time now on data
that may very well be garbage. A lot of modules / programs are using
the -C switch or the equivalent $PERL_UNICODE or open.pm options. As
it is you can't parse /dev/urandom or IRC log files that may be 99.9%
valid UTF-8 data without that 0.1% eventually killing perl if you use
these widely used and recommended modules.

I don't think it's acceptable that perl die this way.

@p5pRT
Author

p5pRT commented Feb 24, 2010

From ben@morrow.me.uk

Quoth avarab@gmail.com (Ævar Arnfjörð Bjarmason):

Here's the short testcase that doesn't do anything too evil that I was
looking for:

$ perl -e 'print "\xe2\x90"' | perl -MDevel::Peek -CI -E 'my $in =
join q[], <>; Dump($in); () = $in =~ /(.*)/;'
SV = PV(0x9672810) at 0x9694668
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x9698918 "\342\220"\0 [UTF8 "\x{2400}"]
CUR = 2
LEN = 4
panic: pp_match start/end pointers at -e line 1, <> line 1.

The problem with saying that you shouldn't turn on the UTF-8 flag on
invalid data is that people are doing that all the time now on data
that may very well be garbage. A lot of modules / programs are using
the -C switch or the equivalent $PERL_UNICODE or open.pm options. As
it is you can't parse /dev/urandom or IRC log files that may be 99.9%
valid UTF-8 data without that 0.1% eventually killing perl if you use
these widely used and recommended modules.

As currently implemented pushing a :utf8 layer (as opposed to an
:encoding(utf8) layer) onto an input filehandle is *always* a bug, for
exactly this reason. IMHO :utf8 should do at least minimal correctness
checking, and throw a catchable exception on failure (and no, I don't
care if it's slower). :encoding(utf8) is still there for those who need
more control.

Ben

@p5pRT
Author

p5pRT commented Feb 24, 2010

From marvin@rectangular.com

On Wed, Feb 24, 2010 at 12:27:38AM +0000, Ævar Arnfjörð Bjarmason wrote:

I'm not familiar with the UTF-8 bits of the regex engine but why is
its idea about the length of an invalid UTF-8 byte sequence marked as
UTF-8 different from that of core's scalar functions? That suggests
something odd to me.

I don't know the regex engine well myself, but I have written a certain amount
of low-level UTF-8 manipulation C code for other projects. The typical way to
repair invalid UTF-8 byte sequences is to swap in the replacement character --
e.g. that's what Encode::decode() does by default. That's not going to happen
on the fly in the middle of a compiled regex operation, or a substring, or
what have you, because it might require reallocation.

However, if you don't do that and you blithely keep on going, it's easy to get
stuff like memory read or write overruns. Or rather, you could write the code
to be safe, but it would involve one of two fixes. 1) Prescan and if necessary
repair the scalar before any dangerous operation. 2) Rewrite your low-level
algorithms to perform constant bounds checking, effectively pushing down
validation to the byte level. Either of those approaches is unacceptably
costly.

Here's the code in pp_hot.c that dies:

  if (RX_OFFS(rx)[i].end < 0 || RX_OFFS(rx)[i].start < 0 ||
      len < 0 || len > strend - s)
      DIE(aTHX_ "panic: pp_match start/end pointers");

That's a post-operation sanity check to ensure that the match points are
actually within bounds. It's hard to know exactly how they got out of place,
but my guess is that the match position was incremented by a UTF8_SKIP based on
a UTF-8 header byte which happened to be followed by an inadequate number of
trailing bytes. You're not going to prevent that without checking every
UTF8_SKIP addition, which will slow things waaaay down.

Here's the short testcase that doesn't do anything too evil that I was
looking for:

$ perl -e 'print "\xe2\x90"' | perl -MDevel::Peek -CI -E 'my $in =
join q[], <>; Dump($in); () = $in =~ /(.*)/;'
SV = PV(0x9672810) at 0x9694668
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x9698918 "\342\220"\0 [UTF8 "\x{2400}"]
CUR = 2
LEN = 4
panic: pp_match start/end pointers at -e line 1, <> line 1.

The problem with saying that you shouldn't turn on the UTF-8 flag on
invalid data is that people are doing that all the time now on data
that may very well be garbage. A lot of modules / programs are using
the -C switch or the equivalent $PERL_UNICODE or open.pm options. As
it is you can't parse /dev/urandom or IRC log files that may be 99.9%
valid UTF-8 data without that 0.1% eventually killing perl if you use
these widely used and recommended modules.

It's not safe to use -CI with data that is not known to be valid UTF-8.
From the docs for binmode():

  To mark FILEHANDLE as UTF-8, use :utf8 or :encoding(utf8). :utf8 just
  marks the data as UTF-8 without further checking, while :encoding(utf8)
  checks the data for actually being valid UTF-8.

I don't think it's acceptable that perl die this way.

I don't think what you want can be achieved without slowing Perl unicode string
manipulation operations to a crawl, and that only after theoretical herculean
efforts.

Instead, validate on input.

Marvin Humphrey

@p5pRT
Author

p5pRT commented Feb 24, 2010

From @ikegami

On Tue, Feb 23, 2010 at 8:49 PM, Marvin Humphrey <marvin@rectangular.com> wrote:

It's not safe to use -CI with data that is not known to be valid UTF-8.
From the docs for binmode():

To mark FILEHANDLE as UTF-8, use :utf8 or :encoding(utf8). :utf8 just
marks the data as UTF-8 without further checking, while :encoding(utf8)
checks the data for actually being valid UTF-8.

That sounds like a bug. How much of a penalty is it to do a basic check on
an IO stream?

@p5pRT
Author

p5pRT commented Feb 24, 2010

From @ikegami

On Wed, Feb 24, 2010 at 1:37 PM, Eric Brine <ikegami@adaelis.com> wrote:

On Tue, Feb 23, 2010 at 8:49 PM, Marvin Humphrey <marvin@rectangular.com> wrote:

It's not safe to use -CI with data that is not known to be valid UTF-8.
From the docs for binmode():

To mark FILEHANDLE as UTF-8, use :utf8 or :encoding(utf8). :utf8 just
marks the data as UTF-8 without further checking, while :encoding(utf8)
checks the data for actually being valid UTF-8.

That sounds like a bug. How much of a penalty is it to do a basic check on
an IO stream?

Note that IO stream checks could take advantage of algorithms that work with
UTF-8 (32-bit encoded form max) that we can't use internally for utf8.

@p5pRT
Author

p5pRT commented Feb 24, 2010

From ben@morrow.me.uk

Quoth ikegami@adaelis.com (Eric Brine):

On Wed, Feb 24, 2010 at 1:37 PM, Eric Brine <ikegami@adaelis.com> wrote:

On Tue, Feb 23, 2010 at 8:49 PM, Marvin Humphrey <marvin@rectangular.com> wrote:

It's not safe to use -CI with data that is not known to be valid UTF-8.
From the docs for binmode():

To mark FILEHANDLE as UTF-8, use :utf8 or :encoding(utf8). :utf8 just
marks the data as UTF-8 without further checking, while :encoding(utf8)
checks the data for actually being valid UTF-8.

That sounds like a bug. How much of a penalty is it to do a basic check on
an IO stream?

Note that IO stream checks could take advantage of algorithms that work with
UTF-8 (32-bit encoded form max) that we can't use internally for utf8.

Not if you're expecting to be able to read the output of perl -CO you
can't, and that seems a fairly obvious assumption.

Ben

@p5pRT
Author

p5pRT commented Feb 24, 2010

From @avar

On Wed, Feb 24, 2010 at 22:57, Ben Morrow <ben@morrow.me.uk> wrote:

Quoth ikegami@adaelis.com (Eric Brine):

On Wed, Feb 24, 2010 at 1:37 PM, Eric Brine <ikegami@adaelis.com> wrote:

On Tue, Feb 23, 2010 at 8:49 PM, Marvin Humphrey <marvin@rectangular.com> wrote:

It's not safe to use -CI with data that is not known to be valid UTF-8.
From the docs for binmode():

   To mark FILEHANDLE as UTF-8, use :utf8 or :encoding(utf8). :utf8 just
   marks the data as UTF-8 without further checking, while :encoding(utf8)
   checks the data for actually being valid UTF-8.

That sounds like a bug. How much of a penalty is it to do a basic check on
an IO stream?

Note that IO stream checks could take advantage of algorithms that work with
UTF-8 (32-bit encoded form max) that we can't use internally for utf8.

Not if you're expecting to be able to read the output of perl -CO you
can't, and that seems a fairly obvious assumption.

Internally processing the stream as UTF-32 doesn't mean you'd output as UTF-32.

@p5pRT
Author

p5pRT commented Feb 24, 2010

From ben@morrow.me.uk

At 11PM +0000 on 24/02/10 Ævar Arnfjörð Bjarmason wrote:

On Wed, Feb 24, 2010 at 22:57, Ben Morrow <ben@morrow.me.uk> wrote:

Quoth ikegami@adaelis.com (Eric Brine):

On Wed, Feb 24, 2010 at 1:37 PM, Eric Brine <ikegami@adaelis.com> wrote:

That sounds like a bug. How much of a penalty is it to do a basic
check on an IO stream?

Note that IO stream checks could take advantage of algorithms that
work with UTF-8 (32-bit encoded form max) that we can't use
internally for utf8.

Not if you're expecting to be able to read the output of perl -CO you
can't, and that seems a fairly obvious assumption.

Internally processing the stream as UTF-32 doesn't mean you'd output
as UTF-32.

I wasn't clear. Since perl can represent codepoints greater than 2**32
internally, perl -CO (or any other use of :utf8 on an output handle)
can produce such technically-invalid over-long utf8 sequences. If :utf8
on input were to assume that all characters fit into 32 bits, this would
mean that (sometimes) you wouldn't be able to feed perl's output back to
perl without it complaining. This seems unreasonable.

Ben

@p5pRT
Author

p5pRT commented Feb 25, 2010

From @ikegami

On Wed, Feb 24, 2010 at 6:21 PM, Ben Morrow <ben@morrow.me.uk> wrote:

If :utf8 on input were to assume that all characters fit into 32 bits,

First, I didn't mention :utf8. :utf8 is an interface between a PerlIO layer
and other Perl code, and I was talking about an external interface. Maybe
that's the same thing, maybe not.

this would mean that (sometimes) you wouldn't be able to feed perl's output
back to perl without it complaining. This seems unreasonable.

The purpose of IO is to interact, and outputting illegal UTF-8 doesn't
accomplish that. Outputting utf8 instead of UTF-8 should be a conscious
decision, not the default.

In fact, we already have warnings for some illegal characters, so we're not
currently outputting utf8:

$ perl -C -we'print chr 0xFFFF' | od -c
Unicode character 0xffff is illegal at -e line 1.
0000000 357 277 277
0000003

But we're not outputting UTF-8 either​:

$ perl -C -we'print chr 0x200_000' | od -c
0000000 370 210 200 200 200
0000005

@p5pRT
Author

p5pRT commented Feb 25, 2010

From @khwilliamson

Eric Brine wrote:

On Wed, Feb 24, 2010 at 6:21 PM, Ben Morrow <ben@morrow.me.uk> wrote:

If :utf8 on input were to assume that all characters fit into 32 bits,

First, I didn't mention :utf8. :utf8 is an interface between a PerlIO layer
and other Perl code, and I was talking about an external interface. Maybe
that's the same thing, maybe not.

this would mean that (sometimes) you wouldn't be able to feed perl's output
back to perl without it complaining. This seems unreasonable.

The purpose of IO is to interact, and outputting illegal UTF-8 doesn't
accomplish that. Outputting utf8 instead of UTF-8 should be a conscious
decision, not the default.

In fact, we already have warnings for some illegal characters, so we're not
currently outputting utf8:

$ perl -C -we'print chr 0xFFFF' | od -c
Unicode character 0xffff is illegal at -e line 1.
0000000 357 277 277
0000003

I feel the need to point out that this character and similar ones are no
longer considered illegal by Unicode, but are "noncharacters." To that
end, 5.11.x's message now says:
  ./perl -Ilib -C -we'print chr 0xFFFF' | od -x
Unicode non-character 0xffff is illegal for interchange at -e line 1.
0000000 bfef 00bf
0000003

A set of cooperating applications can accept these characters in I/O,
but they should not be sent to an unsuspecting application. And the od
output gives the legal utf8 for that code point (I checked).

But we're not outputting UTF-8 either​:

$ perl -C -we'print chr 0x200_000' | od -c
0000000 370 210 200 200 200
0000005

When I do an od -x instead, I get
0000000 88f8 8080 0080
0000005

I think this is the correct utf8-like byte sequence for this code point.
  (I can never remember which is the correct term for the utf8 that
stops at 0x10FFFF, like the standard says, and which incorporates the
same algorithm to get values beyond that which Unicode says it will
never go.)

@p5pRT
Author

p5pRT commented Feb 25, 2010

From ben@morrow.me.uk

Quoth public@khwilliamson.com (karl williamson):

I think this is the correct utf8-like byte sequence for this code point.
(I can never remember which is the correct term for the utf8 that
stops at 0x10FFFF, like the standard says, and which incorporates the
same algorithm to get values beyond that which Unicode says it will
never go.)

In perl Encode terms, 'UTF-8' means the encoding as defined by the
standard (limited to 0x10FFFF, surrogates are illegal, &c.) whereas
'utf8' means perl's extended version of that. I don't know if there are
terms in more general use for these concepts.

Ben

@khwilliamson
Contributor

This should be automatically fixed when https://github.com/Perl/perl5/pull/19121 is finally merged.
