Please implement Unicode Corrigendum #9 (noncharacters) #13594

Closed
p5pRT opened this issue Feb 11, 2014 · 33 comments

p5pRT commented Feb 11, 2014

Migrated from rt.perl.org#121226 (status was 'rejected')

Searchable as RT121226$

p5pRT commented Feb 11, 2014

From gpiero@rm-rf.it

Created by gpiero@rm-rf.it

Currently perl issues a serious warning when trying to output (or input)
Unicode Noncharacters [0]:

$ perl -CS -le 'print "noncharacter: \x{FDEF}"'
Unicode non-character U+FDEF is illegal for open interchange at -e line 1.
noncharacter: �

This is due to a common interpretation of the Unicode standard. Anyway, the
Unicode Technical Committee has issued Corrigendum #9 [1] on 2013-Jan-30.
Quoting from it:

`` Noncharacters in the Unicode Standard are intended for internal use and have
no standard interpretation when exchanged outside the context of internal use.
However, they are not illegal in interchange nor do they cause ill-formed
Unicode text. This has always been the intent of the standard, as expressed by
the Unicode Technical Committee. This is necessary for the effective use of
noncharacters, because anytime a Unicode string crosses an API boundary, it is
in effect being "interchanged". ``

As this is labeled as a clarification, I don't think we have to wait for the
next Unicode version for adhering to it (I mean: adhering to Corrigendum #9
does not break compliance with previous versions of Unicode, IMO).

I admit that at this point it isn't clear to me the distinction between
private-use[2] and noncharacters, but, as for what perl is concerned, I think
they should be managed in the same way. I.e.:

$ perl -CS -le 'print "private-use character: \x{F8FF}"'
private-use character: 

(no warning issued).

At the very least, the severity of the 'nonchar' warning should be lowered.

Thanks,
Gian Piero.

[0] http://www.unicode.org/faq/private_use.html#noncharacters
[1] http://www.unicode.org/versions/corrigendum9.html
[2] http://www.unicode.org/faq/private_use.html#private_use

Perl Info

Flags:
     category=core
     severity=wishlist

Site configuration information for perl 5.19.8:

Configured by gpiero at Mon Feb 10 20:17:47 CET 2014.

Summary of my perl5 (revision 5 version 19 subversion 8) configuration:
    
   Platform:
     osname=linux, osvers=3.12-1-amd64, archname=x86_64-linux
     uname='linux caimano 3.12-1-amd64 #1 smp debian 3.12.8-1 (2014-01-19) x86_64 gnulinux '
     config_args='-de -Dprefix=/home/gpiero/perl5/perlbrew/perls/perl-5.19.8 -Dusedevel -Aeval:scriptdir=/home/gpiero/perl5/perlbrew/perls/perl-5.19.8/bin'
     hint=recommended, useposix=true, d_sigaction=define
     useithreads=undef, usemultiplicity=undef
     use64bitint=define, use64bitall=define, uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
     optimize='-O2',
     cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
     ccversion='', gccversion='4.8.2', gccosandvers=''
     intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
     ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
     alignbytes=8, prototype=define
   Linker and Libraries:
     ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
     libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/4.8/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
     libs=-lnsl -ldl -lm -lcrypt -lutil -lc
     perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
     libc=, so=so, useshrplib=false, libperl=libperl.a
     gnulibc_version='2.17'
   Dynamic Linking:
     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
     cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'



@INC for perl 5.19.8:
     /home/gpiero/perl5/perlbrew/perls/perl-5.19.8/lib/site_perl/5.19.8/x86_64-linux
     /home/gpiero/perl5/perlbrew/perls/perl-5.19.8/lib/site_perl/5.19.8
     /home/gpiero/perl5/perlbrew/perls/perl-5.19.8/lib/5.19.8/x86_64-linux
     /home/gpiero/perl5/perlbrew/perls/perl-5.19.8/lib/5.19.8
     .


Environment for perl 5.19.8:
     HOME=/home/gpiero
     LANG=en_US.UTF-8
     LANGUAGE=en_US:en
     LC_COLLATE=C
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     PATH=/home/gpiero/perl5/perlbrew/bin:/home/gpiero/perl5/perlbrew/perls/perl-5.19.8/bin:/home/gpiero/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
     PERLBREW=command perlbrew
     PERLBREW_BASHRC_VERSION=0.67
     PERLBREW_HOME=/home/gpiero/.perlbrew
     PERLBREW_MANPATH=/home/gpiero/perl5/perlbrew/perls/perl-5.19.8/man
     PERLBREW_PATH=/home/gpiero/perl5/perlbrew/bin:/home/gpiero/perl5/perlbrew/perls/perl-5.19.8/bin
     PERLBREW_PERL=perl-5.19.8
     PERLBREW_ROOT=/home/gpiero/perl5/perlbrew
     PERLBREW_VERSION=0.67
     PERL_BADLANG (unset)
     SHELL=/bin/bash

p5pRT commented Feb 11, 2014

From @ap

* Gian Piero <gpiero@rm-rf.it> [2014-02-11 17:10]:

> The Unicode Technical Committee has issued Corrigendum #9 [1] on
> 2013-Jan-30.

I found most helpful the sections from
<http://www.unicode.org/faq/private_use.html#nonchar7> on down.

They clearly describe the intent that these noncharacters should be
usable without further ado from any infrastructure.

> I admit that at this point it isn't clear to me the distinction
> between private-use[2] and noncharacters, but, as for what perl is
> concerned, I think they should be managed in the same way. I.e.:
>
> $ perl -CS -le 'print "private-use character: \x{F8FF}"'
> private-use character: 
>
> (no warning issued).

That is very much my interpretation of the FAQ.

Quoting from <http://www.unicode.org/faq/private_use.html#nonchar2>:

  Noncharacters are in a sense a kind of private-use character,
  because they are reserved for internal (private) use. However, that
  internal use is intended as a “super” private use, not normally
  interchanged with other users.

Prior to that, the FAQ expends some verbiage to convey that private-use
characters were intended specifically for interchange among parties who
have agreed on some interpretation for those characters amongst each
other.

So in answer to your question, it appears that the UTC conceives the
difference between non- and private-use characters to be that the
meaning of noncharacters should always be considered unknown whenever
they cross the boundaries of a particular system, while private-use
characters may meaningfully pass the boundaries between systems that
share an agreed-upon interpretation for them.

It inescapably follows that if even the use of noncharacters must not
cause warnings, then much more so neither must the use of private-use
characters.
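
The distinction drawn above can be checked from Perl itself. The following is a minimal sketch, assuming the built-in Unicode properties `\p{Noncharacter_Code_Point}` and `\p{Private_Use}`; the helper name `classify_code_point` is hypothetical and only serves the illustration.

```perl
use strict;
use warnings;

# Classify a code point using Perl's built-in Unicode properties.
# classify_code_point is a hypothetical helper, not part of Perl.
sub classify_code_point {
    my ($cp) = @_;
    return 'above-Unicode' if $cp > 0x10FFFF;
    my $char = chr $cp;
    return 'noncharacter' if $char =~ /\p{Noncharacter_Code_Point}/;
    return 'private-use'  if $char =~ /\p{Private_Use}/;
    return 'other';
}

# Only ASCII text is printed here, so no output layer is needed.
printf "U+%04X is %s\n", $_, classify_code_point($_)
    for 0xF8FF, 0xFDEF, 0xFFFE, 0x0041;
```

Both kinds of code point are valid in a Perl string; the properties only report which set a code point belongs to, not whether it may be output.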

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT commented Feb 11, 2014

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Feb 12, 2014

From gpiero@rm-rf.it

Hi Aristotle,

thank you for your reply.

* [Tue, Feb 11, 2014 at 11:47:16AM -0800] Aristotle Pagaltzis via RT:

>> I admit that at this point it isn't clear to me the distinction
>> between private-use[2] and noncharacters, but, as for what perl is
>> concerned, I think they should be managed in the same way. I.e.:
>
> So in answer to your question, it appears that the UTC conceives the
> difference between non- and private-use characters to be that the
> meaning of noncharacters should always be considered unknown whenever
> they cross the boundaries of a particular system, while private-use
> characters may meaningfully pass the boundaries between systems that
> share an agreed-upon interpretation for them.

Yes, but the definition of 'system' in this context is not so clear to
me and probably isn't commonly agreed upon in general (as also the UTC
expresses some concerns when talking about distributed software).
Personally I don't see much difference between the two sets in
practical cases, but probably I'm wrong.
Anyway this is just a personal consideration. The point here is that
perl should not warn when using noncharacters, and I think we agree on
this.

> It inescapably follows that if even the use of noncharacters must not
> cause warnings, then much more so neither must the use of private-use
> characters.

I'm afraid I wasn't clear in the initial report: AFAIK perl already
correctly manages private-use characters and does not seem to issue
warnings when using them. It does however cause a (serious) warning when
using noncharacters, and this is the only warning I was asking to
dismiss.

Ciao,
Gian Piero.

p5pRT commented Feb 12, 2014

From @khwilliamson

I have mixed feelings about this request.

First, some clarifications. Private-use characters have always been
intended to be freely interchangeable, but the meanings are not
specified by the Standard. Typically, corporations or other groups
decide they want to use certain ones for certain purposes, and their
code is written knowing this. But there is
nothing preventing another group from using the same code points for
something else. As long as the two groups don't ever exchange files
which use these code points, there is no problem. As an example, the
Apple Corporation has chosen a particular code point to represent their
logo. All their software recognizes this code point and treats it
accordingly. If you are writing software that might run on one of their
devices, and you need private-use code points, it would be best if you
avoided using that particular one. Another example: there is a
registry of private-use code points run by an individual, IIRC. He
publishes the list so that people can avoid conflicts. It includes
characters from the Klingon script and similar ones, that Unicode
refuses to encode, but which have communities who want them. Some
scripts started out in this registry, but Unicode was eventually
persuaded to encode them, and code that used the old values could be
changed to simply add a constant number to any code point to get the
Unicode value.

Non-character code points have a different genesis altogether.
Originally Unicode was conceived as having just 2**16 code points. If
one wants to loop over all of them using a 16-bit word size, you can't
use the typical "while (x < MAX) {}" loop without overflowing. They
solved this by just saying U+FFFF isn't ever going to be a real code
point, so you could say "while (x < 0xFFFF)" and cover everything of
interest. They also wanted to reserve FFFE, since the Byte-Order Mark
(FEFF) looks like that value when the byte ordering is wrong. You don't
want a legal character to be confused with the BOM.
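
The byte-order point can be illustrated with a short sketch. It assumes UTF-16 input held as raw bytes; the helper name `detect_utf16_byte_order` is hypothetical. Because U+FFFE is never a real character, reading 0xFFFE as the first 16-bit unit can only mean a byte-swapped BOM.

```perl
use strict;
use warnings;

# Guess the byte order of UTF-16 data from its first two bytes.
# detect_utf16_byte_order is a hypothetical helper, not part of Perl.
sub detect_utf16_byte_order {
    my ($bytes) = @_;
    return if length($bytes) < 2;
    my $unit = unpack 'n', $bytes;      # first 16-bit unit, read big-endian
    return 'big-endian'    if $unit == 0xFEFF;   # BOM seen as written
    return 'little-endian' if $unit == 0xFFFE;   # BOM with bytes swapped
    return;                                      # no BOM present
}

for my $sample ("\xFE\xFF\x00A", "\xFF\xFE\x41\x00", "AB") {
    my $order = detect_utf16_byte_order($sample);
    print defined $order ? "$order\n" : "no BOM\n";
}
```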

When Unicode was expanded beyond 16-bits, they created the plane
concept, with Plane0 being 0-65535, Plane1 being the next set, etc. It
was envisioned that software would work on a given plane, switching at
times, so they reserved FFFF and FFFE on each of the 16 planes.

Eventually it became clear that there is a need for text-processing
software to be able to have sentinel code points that it knows won't be
in the middle of a stream of text that it is processing. Thus, they
added the other non-character code points. (There may be a reason for
these particular ones to be not-desirable to use for other purposes, but
if so, I'm unaware of what it is.)
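
As a rough sketch of the resulting set: the noncharacters are the contiguous range U+FDD0..U+FDEF plus the last two code points of every plane, counting Plane 0 through Plane 16. The helper name `is_noncharacter` is hypothetical; Perl's own `\p{Noncharacter_Code_Point}` property covers the same set.

```perl
use strict;
use warnings;

# True if $cp is one of the Unicode noncharacters.
# is_noncharacter is a hypothetical helper, not part of Perl.
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp < 0 || $cp > 0x10FFFF;         # outside Unicode entirely
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;    # the later, contiguous block
    # Last two code points of each plane: low 16 bits are FFFE or FFFF.
    return ( ($cp & 0xFFFE) == 0xFFFE ) ? 1 : 0;
}

# Count them across the whole code space.
my $total = grep { is_noncharacter($_) } 0 .. 0x10FFFF;
print "noncharacters found: $total\n";
```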

Non-character code points should not be foisted off on an unsuspecting
application, unlike private-use code points. Software has been written
expecting that it can use these code points for its own purposes and not
have to worry about them being in an input stream. One should have a
gate keeper that rejects these by default. An example is a text editor
that is intended to edit any Unicode-conformant text. It doesn't know
what any private-use character is intended for, nor does it need to
know. What it knows is that such a character should be treated like any
other. But a text editor may use an algorithm that intersperses
characters that have special meaning to it with the ones that are being
edited. That's what non-characters are for. A conformant text editor
does not have to accept text with non-characters. A conformant text
editor does have to accept text with private-use characters. The
Corrigendum says "Noncharacter: A code point that is permanently
reserved for internal use"

Now to the request. I agree that the warning is not severe; however we
wanted it to be on by default, and the only way to do that currently in
Perl is to make it "severe". The question is should you be warned if
you are outputting a code point that is "permanently reserved for
internal use". It sure sounds like it to me, but I can see the other
side too. But that's why we made a new and separate warning category
for just the input and output of these code points. If your application
does this, it is a simple matter to say

"no warnings 'nonchar'"

to silence just them.
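
A minimal sketch of that scoping, assuming an in-memory filehandle opened with the `:utf8` layer stands in for a real output stream. The helper name `warnings_when_printing` is hypothetical, and the exact wording of the warning differs between perl versions, so the sketch only counts warnings.

```perl
use strict;
use warnings;

# Print $text to an in-memory UTF-8 handle and return any warnings raised.
# warnings_when_printing is a hypothetical helper, not part of Perl.
sub warnings_when_printing {
    my ($text, $silence_nonchar) = @_;
    my @caught;
    local $SIG{__WARN__} = sub { push @caught, $_[0] };

    open(my $fh, '>:utf8', \my $buffer)
        or die "cannot open in-memory handle: $!";
    if ($silence_nonchar) {
        no warnings 'nonchar';    # lexically scoped to this block only
        print {$fh} $text;
    }
    else {
        print {$fh} $text;        # 'nonchar' still enabled here
    }
    close $fh;
    return @caught;
}

my @default  = warnings_when_printing("\x{FDEF}", 0);
my @silenced = warnings_when_printing("\x{FDEF}", 1);
print "default scope:  ", scalar(@default),  " warning(s)\n";
print "silenced scope: ", scalar(@silenced), " warning(s)\n";
```

Because the pragma is lexical, the rest of the program keeps the warning; only the block that deliberately emits noncharacters opts out.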

p5pRT commented Feb 12, 2014

From @xdg

perldiag says this​:

  Unicode non-character U+%X is illegal for open interchange
  (S utf8, nonchar) Certain codepoints, such as U+FFFE and U+FFFF,
  are defined by the Unicode standard to be non-characters. Those
  are legal codepoints, but are reserved for internal use; so,
  applications shouldn't attempt to exchange them. If you know what
  you are doing you can turn off this warning by "no warnings
  'nonchar';".

That seems to explain it solely as "non-characters" not "private characters".

From Karl's explanation and corrigendum #9, I think it's clear that
"interchange" is allowed, even if it's an odd case. Certainly, they
are not "illegal".

A new 'privatechar' warning category should be added to cover those
distinct from 'nonchar', and I think the wording needs to be softer:

E.g.

  Unicode private character U+%x in %x, may not be portable

and

  Unicode non-character U+%x in %x, may not be portable

In those, the second "%x" would be the op that triggered the warning,
akin to the "wide character" warnings.

Unlike the wide character warning, though, where the IO handle is
wholly unprepared for character data, I'm not convinced that nonchar
and privatechar need to be on by default, however. They should be
enabled by "use warnings".

Of course, an IO layer should be able to decide if those are acceptable. E.g.

  binmode(STDOUT, ":utf8_private_strict");

Should something like that be created, it should allow private
characters but warn on non characters.
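
No such layer exists; as a rough illustration of the policy only, the same decision can be sketched as an ordinary function rather than a PerlIO layer. The function name `find_rejected_code_points` and its policy keys are hypothetical.

```perl
use strict;
use warnings;

# Return the code points in $text that the given policy rejects.
# find_rejected_code_points and its policy keys are hypothetical.
sub find_rejected_code_points {
    my ($text, %policy) = @_;
    my @rejected;
    for my $char (split //, $text) {
        if ($char =~ /\p{Noncharacter_Code_Point}/) {
            push @rejected, ord $char unless $policy{allow_noncharacters};
        }
        elsif ($char =~ /\p{Private_Use}/) {
            push @rejected, ord $char unless $policy{allow_private_use};
        }
    }
    return @rejected;
}

# "Private strict": private-use characters pass, noncharacters are flagged.
my @bad = find_rejected_code_points("a\x{F8FF}\x{FDEF}", allow_private_use => 1);
printf "rejected: U+%04X\n", $_ for @bad;
```

A caller could run such a check before writing to a handle and decide for itself whether to warn, die, or pass the data through.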

David

--
David Golden <xdg@xdg.me>
Take back your inbox! → http://www.bunchmail.com/
Twitter/IRC: @xdg

p5pRT commented Feb 13, 2014

From tchrist@perl.com

I don't understand why you would ever want to issue a warning
for emitting a PUA code point.

  use charnames ":alias" => {
      "APPLE CORPORATE LOGO" => 0xF8FF,
  };

  print "\N{APPLE CORPORATE LOGO}\n";

Let alone all the fun I have with my Tengwar module.

  ### This one matches the assignments of the Free Tengwar Font Project
  ### @ http://freetengwar.sourceforge.net/
  use constant TENGWAR_BASE => _CONSCRIPT_UNICODE_REGISTRY;

  ### Whereas This one matches the official roadmap:
  # use constant TENGWAR_BASE => _UNICODE_CONSORTIIUM;

  ## if In file, can do this:
  ## use charnames ":full", ":alias" => "tengwar";

  use charnames ":full", ":alias" => { reverse (

  (TENGWAR_BASE + 0x00) => "TENGWAR LETTER TINCO",
  (TENGWAR_BASE + 0x01) => "TENGWAR LETTER PARMA",
  (TENGWAR_BASE + 0x02) => "TENGWAR LETTER CALMA",
  (TENGWAR_BASE + 0x03) => "TENGWAR LETTER QUESSE",
  (TENGWAR_BASE + 0x04) => "TENGWAR LETTER ANDO",
  (TENGWAR_BASE + 0x05) => "TENGWAR LETTER UMBAR",

  ....

--tom

p5pRT commented Feb 13, 2014

From @Tux

On Wed, 12 Feb 2014 21:56:33 -0700, Tom Christiansen <tchrist@perl.com>
wrote:

> I don't understand why you would ever want to issue a warning
> for emitting a PUA code point.
>
>   use charnames ":alias" => {
>       "APPLE CORPORATE LOGO" => 0xF8FF,
>   };
>
>   print "\N{APPLE CORPORATE LOGO}\n";
>
> Let alone all the fun I have with my Tengwar module.
>
>   ### This one matches the assignments of the Free Tengwar Font Project
>   ### @ http://freetengwar.sourceforge.net/
>   use constant TENGWAR_BASE => _CONSCRIPT_UNICODE_REGISTRY;
>
>   ### Whereas This one matches the official roadmap:
>   # use constant TENGWAR_BASE => _UNICODE_CONSORTIIUM;
>
>   ## if In file, can do this:
>   ## use charnames ":full", ":alias" => "tengwar";
>
>   use charnames ":full", ":alias" => { reverse (
>
>   (TENGWAR_BASE + 0x00) => "TENGWAR LETTER TINCO",
>   (TENGWAR_BASE + 0x01) => "TENGWAR LETTER PARMA",
>   (TENGWAR_BASE + 0x02) => "TENGWAR LETTER CALMA",
>   (TENGWAR_BASE + 0x03) => "TENGWAR LETTER QUESSE",
>   (TENGWAR_BASE + 0x04) => "TENGWAR LETTER ANDO",
>   (TENGWAR_BASE + 0x05) => "TENGWAR LETTER UMBAR",

I am really happy to see :alias being used like this. It clearly proves
I am not the only one who uses it on a daily basis since 2002 :)

--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.19 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented Feb 13, 2014

From gpiero@rm-rf.it

Hello Karl,

thank you, very interesting explanation.

* [Wed, Feb 12, 2014 at 11:23:08AM -0800] karl williamson via RT:

> But a text editor may use an algorithm that intersperses
> characters that have special meaning to it with the ones that are being
> edited. That's what non-characters are for.

Oh well, I thought UTC recommended use of out-of-range sentinels in
those cases (forward compatibility as the Unicode range expands is left
as an open exercise). So now I guess I haven't understood use cases for
sentinels (but maybe the end-of-... cases).

On the other hand, neither noncharacters nor sentinels may be the
best choice here. I assume you're saving those "formatting codes" in
your data files as they have a meaning you don't want to lose. Now you
(or someone else, for that matter) want to build a full office suite
and start coding an email client or presentation software. You
probably also want to be able to include or import the data files
created with your text editor (hopefully preserving the formatting).
Does this mean that your noncharacters suddenly become illegal and you
have to replace them all with private-use chars to be able to
exchange them between the programs involved? Or do you consider the
office suite to be a single 'system', so that you're allowed to use them?
In the latter case obviously the two programs should agree upon their
meaning, so... well, isn't that very similar to the definition of
private-use characters?

I guess I would use private-use chars in most cases for avoiding
problems in the future...
... mmh, and will probably end up with some private-use character
clashing... damn it...

> A conformant text editor does not have to accept text with
> non-characters.
> A conformant text editor does have to accept text with private-use
> characters.

Ok, I'm following you here. Nevertheless I'm wondering how much is it
practical...
Hopefully I'm not the only one that in 2014 still doesn't have a clear
understanding of Unicode. Hmm, on second thought, I do hope I am...

> Now to the request. I agree that the warning is not severe; however we
> wanted it to be on by default, and the only way to do that currently in
> Perl is to make it "severe". The question is should you be warned if
> you are outputting a code point that is "permanently reserved for
> internal use". It sure sounds like it to me, but I can see the other
> side too. But that's why we made a new and separate warning category
> for just the input and output of these code points. If your application
> does this, it is a simple matter to say
>
> "no warnings 'nonchar'"
>
> to silence just them.

Ok, so I misinterpreted the reason for the warning. I thought it was
Perl telling me: "hey, you cannot in/out-put that char", and having to
explicitly reply: "no, no, I assure you I can... cf. Corrigendum #9"
seemed somewhat wrong. But I see your point now.

Ciao,
Gian Piero.

p5pRT commented Feb 13, 2014

From @xdg

On Wed, Feb 12, 2014 at 11:56 PM, Tom Christiansen <tchrist@perl.com> wrote:

> I don't understand why you would ever want to issue a warning
> for emitting a PUA code point.

Because they require prior agreement between parties, I think it
sensible to (optionally) warn when private characters go into an IO
stream.

This is particularly important for characters read from one source and
output to another destination. Without knowing that the source and
destination agree to the same semantics, unexpected results could
occur and those are exactly the sort of things that Perl tends to warn
users about.

Absolutely, the warning should not be on by default. When warnings
are enabled, turning off the private character warning or bypassing
the warning by putting on an IO layer that knows that private
characters are OK would be the way to explicitly indicate "prior
agreement".

I don't think ":utf8" should warn about private or non-characters. I
do think ":encoding(UTF-8)" should (as it is currently defined as
"strict"). I would love to have someone implement a middle ground
that is strict about ill-formed data but allows
private/non-characters. E.g. ":encoding(UTF-8-any)".

Then one could say C<< binmode($fh, ":encoding(UTF-8-any)") >> and
merrily use private characters without issue, plus the code would be
self-documenting that private/non character use is allowed.

If at some point in the future, we move to make ":utf8" itself
"strict", then I would favor it having "UTF-8-any" semantics.

David

--
David Golden <xdg@xdg.me>
Take back your inbox! → http://www.bunchmail.com/
Twitter/IRC: @xdg

p5pRT commented Feb 17, 2014

From gpiero@rm-rf.it

* [Wed, Feb 12, 2014 at 01:37:08PM -0800] David Golden via RT:

> A new 'privatechar' warning category should be added to cover those
> distinct from 'nonchar', and I think the wording needs to be softer:
>
> E.g.
>
>   Unicode private character U+%x in %x, may not be portable
>
> and
>
>   Unicode non-character U+%x in %x, may not be portable
>
> In those, the second "%x" would be the op that triggered the warning,
> akin to the "wide character" warnings.

* [Thu, Feb 13, 2014 at 06:47:07AM -0800] David Golden via RT:

> On Wed, Feb 12, 2014 at 11:56 PM, Tom Christiansen <tchrist@perl.com> wrote:
>
>> I don't understand why you would ever want to issue a warning
>> for emitting a PUA code point.
>
> Because they require prior agreement between parties, I think it
> sensible to (optionally) warn when private characters go into an IO
> stream.
>
> This is particularly important for characters read from one source and
> output to another destination. Without knowing that the source and
> destination agree to the same semantics, unexpected results could
> occur and those are exactly the sort of things that Perl tends to warn
> users about.

I'm not sure yet, but I think I agree with you about a warning for
private-use chars, for the reasons you gave. Put it this way:
if you were on the receiving end of a stream containing PU chars you
didn't expect, you would probably wish that the developer of the offending
code had been warned about the problem.
Still, I wonder if there should be a way to enable (extra-)optional
warnings. Something like:

use warnings;
use warnings 'private_chars'; # not enabled by previous statement.

Unlike the wide character warning, though, where the IO handle is
wholly unprepared for character data, I'm not convinced that nonchar
and privatechar need to be on by default, however. They should be
enabled by "use warnings".

Strongly uncertain about this matter. My first reaction was: "absolutely
they should be explicitly enabled via a 'use warnings'".
But after thinking a bit about Karl's explanation, I can see the point
of having them always enabled. Quoting myself: "if you were on the
receiving end of a stream containing... etc." (and you want the
developer on the other side to _always_ see the warning).
But then again, making it severe is unfriendly to one-liners,
where you have to type extra chars only to tell the interpreter that you
don't care about those warnings.
Well, in the end I probably agree with you about this point too, but I'm
reserving the right to change my mind in almost no time if I feel like it.

Of course, an IO layer should be able to decide if those are
acceptable. E.g.

binmode(STDOUT, "​:utf8_private_strict");

Should something like that be created, it should allow private
characters but warn on non characters.

Interesting idea. I also see a use for a layer that would accept Unicode
non-characters but would continue to warn about non-Unicode characters.
Please note that currently the 'nonchar' warning tag is confusing and
probably its scope is too wide.

$ perl-5.19.8 -CS -le 'print "\x{FFFF_1234}"' >/dev/null
Code point 0xFFFF1234 is not Unicode, may not be portable at -e line 1.
$ perl-5.19.8 -CS -le 'no warnings "nonchar"; print "\x{FFFF_1234}"' >/dev/null

So, not only does it turn off warnings about non-characters, but it also
shuts up entirely about non-Unicode code points. I think it makes sense to
split it into two separate tags, 'nonchar' and 'nonunicode', probably
lowering the severity of the former.
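
To make the distinction between these kinds of code point concrete, here is a minimal sketch of how they could be told apart by value alone. `classify_code_point` is a hypothetical helper, not something perl provides. The labels 'surrogate', 'nonchar' and 'non_unicode' borrow the names of the existing warning categories, while 'private_use' is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a code point purely by its numeric value.
# 'ordinary' here simply means "none of the special ranges"; it says
# nothing about whether the code point is actually assigned.
sub classify_code_point {
    my ($cp) = @_;

    # Anything above U+10FFFF is outside Unicode altogether.
    return 'non_unicode' if $cp > 0x10FFFF;

    # UTF-16 surrogates.
    return 'surrogate' if $cp >= 0xD800 && $cp <= 0xDFFF;

    # Noncharacters: U+FDD0..U+FDEF, plus the last two code points
    # of every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...).
    return 'nonchar'
        if ($cp >= 0xFDD0 && $cp <= 0xFDEF) || ($cp & 0xFFFE) == 0xFFFE;

    # Private-use: the BMP private use area and planes 15 and 16.
    return 'private_use'
        if ($cp >= 0xE000   && $cp <= 0xF8FF)
        || ($cp >= 0xF0000  && $cp <= 0xFFFFD)
        || ($cp >= 0x100000 && $cp <= 0x10FFFD);

    return 'ordinary';
}

printf "U+%04X => %s\n", $_, classify_code_point($_)
    for 0x41, 0xE000, 0xFDEF, 0xFFFF, 0xFFFF1234;
```

The point of the sketch is only that the four classes are disjoint, so nothing technical forces a single warning tag to cover more than one of them.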

Back to the subject of layers, I probably would also love a couple of
layers that would silently strip off private-use and/or nonchars, so
that you could 'sanitize' your input without worrying about characters
that you would discard anyway (but probably in a less efficient way).
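
No such layers exist as far as this thread establishes, so the following is only a sketch of the same idea done by hand on an already-decoded string. `strip_internal_code_points` is a hypothetical name. The Unicode properties `\p{Noncharacter_Code_Point}` and `\p{Co}` (General_Category = Private_Use) are standard regex properties.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of a decoded (character) string
# with every noncharacter and private-use code point removed.
sub strip_internal_code_points {
    my ($text) = @_;
    (my $clean = $text) =~ s/[\p{Noncharacter_Code_Point}\p{Co}]//g;
    return $clean;
}

my $dirty = "a\x{FDEF}b\x{E000}c";
my $clean = strip_internal_code_points($dirty);
printf "%d characters in, %d characters out\n", length $dirty, length $clean;
```

A real layer would do this while decoding rather than in a second pass over the string, which is the efficiency difference alluded to above.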

Ciao,
Gian Piero.

p5pRT commented Feb 17, 2014

From gpiero@rm-rf.it

* [Mon, Feb 17, 2014 at 08​:23​:14PM +0100] Gian Piero Carrubba​:

Please note that currently the 'nonchar' warning tag is confusing and
probably its scope is too wide.

$ perl-5.19.8 -CS -le 'print "\x{FFFF_1234}"' >/dev/null
Code point 0xFFFF1234 is not Unicode, may not be portable at -e line 1.
$ perl-5.19.8 -CS -le 'no warnings "nonchar"; print "\x{FFFF_1234}"' >/dev/null

So, not only does it turn off warnings about non-characters, but it also
shuts up entirely about non-Unicode code points. I think it makes sense
to split it into two separate tags, 'nonchar' and 'nonunicode',
probably lowering the severity of the former.

Well, not exactly. The 'non_unicode' tag already exists and is severe
too, but the 'no warnings "tag"' syntax acts ...mmh.. strangely.

$ perlbrew exec --with perl-5.19.8 perl -CS -l >/dev/null
  print "\x{FDEF}";
  binmode(STDOUT, '​:encoding(notexist)');
Unicode non-character U+FDEF is illegal for open interchange at - line 1.
Cannot find encoding "notexist" at - line 2.

$ perlbrew exec --with perl-5.19.8 perl -CS -l >/dev/null
  no warnings 'io';
  print "\x{FDEF}";
  binmode(STDOUT, '​:encoding(notexist)');
(no warnings)

wtf? Disabling a 'severe' warning results in _all_ warnings being
silenced if you don't also explicitly say 'use warnings'. Definitely
not what I would expect.

I cannot find it reported, nor does this behaviour seem to be documented.
Should it be reported separately?

Ciao,
Gian Piero.

p5pRT commented Feb 18, 2014

From @xdg

On Mon, Feb 17, 2014 at 2​:23 PM, Gian Piero Carrubba <gpiero@​rm-rf.it> wrote​:

Unlike the wide character warning, though, where the IO handle is
wholly unprepared for character data, I'm not convinced that nonchar
and privatechar need to be on by default, however. They should be
enabled by "use warnings".

Strongly uncertain about this matter. My first reaction was​: "absolutely
they should be explicitly enabled via a 'use warnings'".

That's actually what I meant. I think nonchar and privatechar should
not be "severe" warnings (that fire regardless of "use warnings"). I
think they should be regular warnings that are enabled with "use
warnings" like any other warning. They should not be "optional"
warnings that need to be explicitly turned on -- because no one will
bother to do so and thus there's little point.

David

--
David Golden <xdg@​xdg.me>
Take back your inbox! → http​://www.bunchmail.com/
Twitter/IRC​: @​xdg

p5pRT commented Feb 18, 2014

From @rjbs

949cf49 introduced "a new set of flags to disallow those code points." For
example, UNICODE_WARN_NONCHAR. Encode​::Unicode seems to always pass
UNICODE_WARN_ILLEGAL_INTERCHANGE as its flags. Would exposing a means to
tweak this be plausible?

In general, I think the problem is that there are cases for wanting these
warnings or not, not only on a program or scope basis, but per-handle. If I've
opened a handle to some generic input file, I may want to be alerted to any
non-characters, while a connection to another part of my internal
infrastructure may be quite prepared to truck in them.

Making the warning non-severe seems reasonable, although I'm not worked up
about it. To me, severe warnings are for things that are going to change in
the future or that are almost certainly a mistake or ambiguity. Although
non-Unicode (characters U+110000 and up) warnings seem like candidates for
severe warnings, I'm not sure Unicode non-character warnings fit.

I also don't know how commonly code is being run with no warnings enabled, so
I'm not sure how significant this distinction is in practice.

--
rjbs

p5pRT commented Feb 18, 2014

From @xdg

On Tue, Feb 18, 2014 at 8​:27 AM, Ricardo Signes
<perl.p5p@​rjbs.manxome.org> wrote​:

In general, I think the problem is that there are cases for wanting these
warnings or not, not only on a program or scope basis, but per-handle. If I've
opened a handle to some generic input file, I may want to be alerted to any
non-characters, while a connection to another part of my internal
infrastructure may be quite prepared to truck in them.

That's my rationale for having strict UTF-8 layers do warnif() to a
category and letting users enable or disable those warnings as usual.

  use warnings;

  say $nonchar; # warns about wide char

  binmode(STDOUT, '​:utf8');
  say $nonchar; # lax​: doesn't warn

  binmode(STDOUT, '​:encoding(UTF-8)');
  say $nonchar; # strict​: warns about nonchar

  {
  no warnings 'nonchar';
  say $nonchar; # doesn't warn
  }

  binmode(STDOUT, '​:encoding(UTF-8-any)');
  say $nonchar; # permissive​: doesn't warn
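
The listing above describes proposed behaviour, including the hypothetical "UTF-8-any" encoding. For comparison, here is a minimal sketch of the part that can already be expressed with the existing 'nonchar' category: switching the warning off for one lexical scope. It assumes a perl in which the nonchar warning is on by default and fires when printing to a handle with the ":utf8" layer, as in the original report; the exact wording of the message may differ between releases.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so the effect of the
# lexical "no warnings" scope can be inspected afterwards.
my @seen;
local $SIG{__WARN__} = sub { push @seen, $_[0] };

# In-memory handle with the :utf8 layer, standing in for STDOUT.
open my $fh, '>:utf8', \my $buffer or die "open failed: $!";

print {$fh} "\x{FDEF}";        # expected to raise the nonchar warning
my $with_warning = @seen;

{
    no warnings 'nonchar';
    print {$fh} "\x{FDEF}";    # silenced inside this block only
}
my $after_scope = @seen;

close $fh;
print "warnings before the block: $with_warning, after it: $after_scope\n";
```

This only gives per-scope control. The per-handle control discussed in this thread would still need something like the layer-based proposal.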

--
David Golden <xdg@​xdg.me>
Take back your inbox! → http​://www.bunchmail.com/
Twitter/IRC​: @​xdg

p5pRT commented Feb 18, 2014

From @ikegami

On Mon, Feb 17, 2014 at 2​:23 PM, Gian Piero Carrubba <gpiero@​rm-rf.it>wrote​:

Still, I wonder if there should be a way to enable (extra-)optional
warnings. Something like​:

use warnings;
use warnings 'private_chars'; # not enabled by previous statement.

C<< use warnings; >> is documented to enable all warnings. Don't break this
promise.

p5pRT commented Feb 19, 2014

From @pjcj

On Tue, Feb 18, 2014 at 12​:49​:26PM -0500, Eric Brine wrote​:

On Mon, Feb 17, 2014 at 2​:23 PM, Gian Piero Carrubba <gpiero@​rm-rf.it>wrote​:

Still, I wonder if there should be a way to enable (extra-)optional
warnings. Something like​:

use warnings;
use warnings 'private_chars'; # not enabled by previous statement.

C<< use warnings; >> is documented to enable all warnings. Don't break this
promise.

I have no comments on this specific proposal, but I do have thoughts
about the idea of warnings which are not enabled by default.
warnings.pm says​:

  If no import list is supplied, all possible warnings are either enabled
  or disabled.

First, I'm not even sure whether this is totally accurate, but even if
it is, I do not see it as a promise, but rather as documenting the
current situation. strict.pm says something very similar. In neither
case do I see a problem with new categories being added which are not
enabled by default.

However, I do see a problem with adding new categories which are enabled
by default and start complaining about constructs which may not be
particularly problematic. And it would be a shame if there were no way
to add such categories, in both warnings and strict.

--
Paul Johnson - paul@​pjcj.net
http​://www.pjcj.net

p5pRT commented Feb 19, 2014

From @druud62

On 2014-02-18 20​:28, Paul Johnson wrote​:

On Tue, Feb 18, 2014 at 12​:49​:26PM -0500, Eric Brine wrote​:

On Mon, Feb 17, 2014 at 2​:23 PM, Gian Piero Carrubba <gpiero@​rm-rf.it>wrote​:

Still, I wonder if there should be a way to enable (extra-)optional
warnings. Something like​:

use warnings;
use warnings 'private_chars'; # not enabled by previous statement.

C<< use warnings; >> is documented to enable all warnings. Don't break this
promise.

I have no comments on this specific proposal, but I do have thoughts
about the idea of warnings which are not enabled by default.
warnings.pm says​:

If no import list is supplied, all possible warnings are either enabled
or disabled.

First, I'm not even sure whether this is totally accurate, but even if
it is, I do not see it as a promise, but rather as documenting the
current situation. strict.pm says something very similar. In neither
case do I see a problem with new categories being added which are not
enabled by default.

However, I do see a problem with adding new categories which are enabled
by default and start complaining about constructs which may not be
particularly problematic. And it would be a shame if there were no way
to add such categories, in both warnings and strict.

use warnings '​:most';
;)

--
Ruud

p5pRT commented Feb 19, 2014

From @ikegami

On Tue, Feb 18, 2014 at 2​:28 PM, Paul Johnson <paul@​pjcj.net> wrote​:

On Tue, Feb 18, 2014 at 12​:49​:26PM -0500, Eric Brine wrote​:

On Mon, Feb 17, 2014 at 2​:23 PM, Gian Piero Carrubba <gpiero@​rm-rf.it
wrote​:

Still, I wonder if there should be a way to enable (extra-)optional
warnings. Something like​:

use warnings;
use warnings 'private_chars'; # not enabled by previous statement.

C<< use warnings; >> is documented to enable all warnings. Don't break
this
promise.

I have no comments on this specific proposal, but I do have thoughts
about the idea of warnings which are not enabled by default.
warnings.pm says​:

If no import list is supplied, all possible warnings are either enabled
or disabled.

First, I'm not even sure whether this is totally accurate, but even if
it is, I do not see it as a promise, but rather as documenting the
current situation.

If this is the case, I haven't been doing and recommending what I think I
have been. What should I use to enable all warnings if not C<< use
warnings; >> (which is documented to be C<< use warnings ':all'; >>)?

However, I do see a problem with adding new categories which are enabled
by default and start complaining about constructs which may not be
particularly problematic. And it would be a shame if there were no way
to add such categories, in both warnings and strict.

I didn't say they should be enabled by default. I said they should be
enabled by C<< use warnings '​:all'; >> aka C<< use warnings; >>.

p5pRT commented Apr 21, 2014

From @khwilliamson

Regardless of what the ultimate disposition of this is, I have attached a patch that would clarify the current situation for at least 5.20. Any objections to it?

p5pRT commented Apr 21, 2014

From @khwilliamson

0002-Proposed-5.20-wording-for-non-char-code-point-usage.patch
From 6b1134ef7e53209fcf4f197707a95e4b5b330f86 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Mon, 21 Apr 2014 08:49:00 -0600
Subject: [PATCH 2/2] Proposed 5.20 wording for non-char code point usage

This clarifies how things work in 5.20.
---
 pod/perldiag.pod | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index 5482684..64a1bff 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -5739,9 +5739,15 @@ with the characters in the Lao and Thai scripts.
 (S nonchar) Certain codepoints, such as U+FFFE and U+FFFF, are
 defined by the Unicode standard to be non-characters.  Those are
 legal codepoints, but are reserved for internal use; so, applications
-shouldn't attempt to exchange them.  If you know what you are doing
+shouldn't attempt to exchange them.  An application may not be
+expecting any of these characters at all, and receiving them
+may lead to bugs.  If you know what you are doing
 you can turn off this warning by C<no warnings 'nonchar';>.
 
+This is not really a "serious" error, but it is supposed to be raised
+by default even if warnings are not enabled, and currently the only
+way to do that in Perl is to mark it as serious.
+
 =item Unicode surrogate U+%X is illegal in UTF-8
 
 (S surrogate) You had a UTF-16 surrogate in a context where they are
-- 
1.8.3.2

p5pRT commented Apr 22, 2014

From @rjbs

* Karl Williamson via RT <perlbug-followup@​perl.org> [2014-04-21T10​:57​:26]

Regardless of what the ultimate disposition of this is, I have attached a
patch that would clarify the current situation for at least 5.20. Any
objections to it?

None.

--
rjbs

p5pRT commented May 21, 2014

From @jhi

It seems that Perl is lagging on the handling for Unicode
"non-characters" [1]​: they are these days valid for interchange​:

http​://www.unicode.org/versions/corrigendum9.html

In other words, they should be handled much like PUA (private use area)
characters [2]​: passed through as-is.

How we are currently doing it wrong​:

(a) ./perl -CO -we 'print chr(0xFFFF)'

Unicode non-character U+FFFF is illegal for open interchange at -e line 1.
�%

(Somewhat strangely, the -CO is required for the warning to appear.)

We shouldn't warn.

It is possible we still could warn somehow, to alert users about the
special nature of "non-characters" (a very unfortunate name), but they
are definitely legal characters, and they can be interchanged. (They
are not *intended* for interchange, but that is quite different from
"forbidden".)

(b) In Encode, the "utf8" lets the non-chars through, but the strict
"UTF-8" mangles them to the Unicode REPLACEMENT CHARACTER U+FFFD​:

./perl -Ilib -MEncode=decode -MDevel​::Peek -we 'Dump(decode("utf8",
"\xEF\xBF\xBF"))'
SV = PV(0x7ffba18041f0) at 0x7ffba1803438
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x7ffba143e6c0 "\357\277\277"\0 [UTF8 "\x{ffff}"]
  CUR = 3
  LEN = 16
./perl -Ilib -MEncode=decode -MDevel::Peek -we 'Dump(decode("UTF-8",
"\xEF\xBF\xBF"))'
SV = PV(0x7ff34104aa50) at 0x7ff341031f28
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x7ff340d022e0 "\357\277\275"\0 [UTF8 "\x{fffd}"]
  CUR = 3
  LEN = 16

We shouldn't mangle.


[1] http​://www.unicode.org/faq/private_use.html#nonchar1
[2] http​://www.unicode.org/faq/private_use.html

p5pRT commented May 21, 2014

From @jhi

On Wednesday-201405-21, 9​:55, Jarkko Hietaniemi (via RT) wrote​:

How we are currently doing it wrong​:

Should've said​:

"Currently known wrongnesses include, but are probably not limited to"

p5pRT commented May 22, 2014

From @tonycoz

On Wed May 21 06​:55​:23 2014, jhi wrote​:

It seems that Perl is lagging on the handling for Unicode
"non-characters" [1]​: they are these days valid for interchange​:

http​://www.unicode.org/versions/corrigendum9.html

In other words, they should be handled much like PUA (private use area)
characters [2]​: passed through as-is.

How we are currently doing it wrong​:

(a) ./perl -CO -we 'print chr(0xFFFF)'

Unicode non-character U+FFFF is illegal for open interchange at -e line 1.
�%

(Somewhat strangely, the -CO is required for the warning to appear.)

We shouldn't warn.

It is possible we still could warn somehow, to alert users about the
special nature of "non-characters" (a very unfortunate name), but they
are definitely legal characters, and they can be interchanged. (They
are not *intended* for interchange, but that is quite different from
"forbidden".)

(b) In Encode, the "utf8" lets the non-chars through, but the strict
"UTF-8" mangles them to the Unicode REPLACEMENT CHARACTER U+FFFD​:

./perl -Ilib -MEncode=decode -MDevel​::Peek -we 'Dump(decode("utf8",
"\xEF\xBF\xBF"))'
SV = PV(0x7ffba18041f0) at 0x7ffba1803438
REFCNT = 1
FLAGS = (TEMP,POK,pPOK,UTF8)
PV = 0x7ffba143e6c0 "\357\277\277"\0 [UTF8 "\x{ffff}"]
CUR = 3
LEN = 16
./perl -Ilib -MEncode=decode -MDevel::Peek -we 'Dump(decode("UTF-8",
"\xEF\xBF\xBF"))'
SV = PV(0x7ff34104aa50) at 0x7ff341031f28
REFCNT = 1
FLAGS = (TEMP,POK,pPOK,UTF8)
PV = 0x7ff340d022e0 "\357\277\275"\0 [UTF8 "\x{fffd}"]
CUR = 3
LEN = 16

We shouldn't mangle.

This looks like a duplicate of

https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121226

Tony

p5pRT commented May 22, 2014

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented May 22, 2014

From @jhi

On Wednesday-201405-21, 23​:42, Tony Cook via RT wrote​:

This looks like a duplicate of

https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121226

Yup, the same issue.

FWIW, I started poking at this.

p5pRT commented May 29, 2014

From @khwilliamson

On Thu May 22 05​:32​:49 2014, jhi wrote​:

On Wednesday-201405-21, 23​:42, Tony Cook via RT wrote​:

This looks like a duplicate of

https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121226

Yup, the same issue.

FWIW, I started poking at this.

I have now merged these two tickets. I've been thinking about and doing some research in the Unicode standard about this, and am having trouble with the idea that we should now just change to accept non-characters without warning.

Non-characters are still "permanently reserved for internal use", quoting from Corrigendum #9. I want to emphasize that word "internal". An application should be able to presume that data it receives from an external source does not contain non-characters, so it is free to use them in any way it wishes. This is the whole point of non-characters, to have some code points available for you that you are assured won't be coming from somewhere else.

And how do things come from somewhere else? Through I/O. Hence, the presumption by Perl should be that I/O is related to an external interface. It may be that an application is composed of cooperating processes that communicate via I/O, but Perl's presumption must be, unless indicated otherwise, that I/O is for external interfaces.

An application that uses non-characters will want its inputs not to contain any of them. It wants them filtered out; the best choice is to have them turned into REPLACEMENT CHARACTERS. My claim is that Perl should do this by default. Corrigendum #9 doesn't change this. And there should be a way to change the default. That is what Corrigendum #9 makes clear, and which Perl already does in (too) many cases. That Corrigendum was not aimed at Perl, but at other Unicode implementations. My point is that Perl already implements this Corrigendum, and neither needs to nor should change because of it.
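
As an illustration of the filtering described here, the following is a minimal sketch of turning noncharacters into REPLACEMENT CHARACTERs at an input boundary. It works on a string that has already been decoded; `replace_nonchars` is a hypothetical name and not an existing perl or Encode function.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every noncharacter in a decoded string
# with U+FFFD and report how many were replaced.
sub replace_nonchars {
    my ($text) = @_;
    my $clean = $text;
    # s///g returns the number of substitutions, or an empty string
    # when nothing matched, hence the "|| 0".
    my $count = ($clean =~ s/\p{Noncharacter_Code_Point}/\x{FFFD}/g) || 0;
    return ($clean, $count);
}

my ($clean, $count) = replace_nonchars("a\x{FFFF}b\x{FDD0}");
printf "replaced %d noncharacter(s)\n", $count;
```

An application that uses noncharacters as internal sentinels could apply something like this to everything it reads, and then rely on any sentinel it later encounters being its own.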

We agreed long ago that the default input for Perl should be strict, and that explicit action should be taken to override that. Strict input should continue to exclude non-characters. If we were to change that, existing applications would be suddenly and silently exposed to security holes, where an attacker who knows the internal structure of the application inserts non-characters to fool it.

Let me reiterate my main point. We already implement Corrigendum #9. We should not make changes because of it.

Private-use characters are not the same as non-characters. An application has no right to presume that external inputs don't include private-use characters. But it is free to ascribe its own meanings to them. In practice, most applications will just treat them as some generic code points.

I think David Golden's ideas would be a useful addition, but it's not my itch. I would be happy to consult with someone who wishes to scratch it, though.
--
Karl Williamson

p5pRT commented May 29, 2014

From @jhi

There's input and there's output.

I agree that default input should be strict​: but I think stricter than
what we have now, e.g. not accept U+200000. And not accept non-chars.

(There's also more spectrum than just spewing warnings​: currently we
generate U+FFFD but then *continue* reading. We could e.g. truncate
and stop reading, and/or croak...)

I am not entirely certain about the definition of "internal" here,
though. Internal to what? One "process"? What if Perl is just a
"library" and not an "application"? A set of Perl applications? A set
of mixed applications?

But on output if I output U+FFFF I don't want to output U+FFFD. (This
doesn't happen now, either​: we just warn. But being strict the wrong
way, this could happen.) This is no different from chr(0xFFFF), really,
if I write that I don't want magic making it chr(0xFFFD).

Again, quoting the C9​: "However, they are not illegal in interchange nor
do they cause ill-formed Unicode text. This has always been the intent
of the standard, as expressed by the Unicode Technical Committee." So
our current warning that non-chars are illegal for interchange is wrong.
They are not.

p5pRT commented May 29, 2014

From @jhi

On Thursday-201405-29, 15​:49, Jarkko Hietaniemi wrote​:

This is no different from chr(0xFFFF), really,
if I write that I don't want magic making it chr(0xFFFD).

Or this​:

perl -MEncode=decode -MDevel​::Peek -we 'Dump(decode("UTF-8",
"\xEF\xBF\xBF"))'

giving me the bytes \xEF\xBF\xBD, aka U+FFFD.

p5pRT commented May 30, 2014

From @khwilliamson

On 05/29/2014 01​:49 PM, Jarkko Hietaniemi wrote​:

There's input and there's output.

I agree that default input should be strict​: but I think stricter than
what we have now, e.g. not accept U+200000. And not accept non-chars.

This has been hashed around a lot before, and I think everyone now
agrees with you here.

(There's also more spectrum than just spewing warnings​: currently we
generate U+FFFD but then *continue* reading. We could e.g. truncate
and stop reading, and/or croak...)

Perhaps options.

I am not entirely certain about the definition of "internal" here,
though. Internal to what? One "process"? What if Perl is just a
"library" and not an "application"? A set of Perl applications? A set
of mixed applications?

That's why there has to be flexibility. We have to make the default the
sanest and safest, but allow the programmer(s) to override it for their
needs.

But on output if I output U+FFFF I don't want to output U+FFFD. (This
doesn't happen now, either​: we just warn. But being strict the wrong
way, this could happen.) This is no different from chr(0xFFFF), really,
if I write that I don't want magic making it chr(0xFFFD).

Agreed. The reason we warn is so you know you're outputting something
somebody else likely won't be able to handle. The only time something
should be translated into FFFD is on input. I don't know about ENV or ARGV.

Again, quoting the C9​: "However, they are not illegal in interchange nor
do they cause ill-formed Unicode text. This has always been the intent
of the standard, as expressed by the Unicode Technical Committee." So
our current warning that non-chars are illegal for interchange is wrong.
They are not.

The wording should change, but I do believe there should be a warning
nonetheless. I do wish Unicode had phrased the original and the
Corrigendum better. They do seem to me to have an aversion to
straightforward language.

p5pRT commented Mar 4, 2015

From @jhi

There was some discussion but it was all over the place, and this ticket as such is pretty useless. Rejecting.

p5pRT commented Mar 4, 2015

@jhi - Status changed from 'open' to 'rejected'
