Please implement Unicode Corrigendum #9 (noncharacters) #13594
From gpiero@rm-rf.it
Created by gpiero@rm-rf.it

Currently perl issues a serious warning when trying to output (or input)

$ perl -CS -le 'print "noncharacter: \x{FDEF}"'

This is due to a common interpretation of the Unicode standard. Anyway the

`` Noncharacters in the Unicode Standard are intended for internal use and have

As this is labeled as a clarification, I don't think we have to wait for the

I admit that at this point it isn't clear to me the distinction between

$ perl -CS -le 'print "private-use character: \x{F8FF}"'

(no warning issued).

At the very least, the severity of the 'nonchar' warning should be lowered.

Thanks,

[0] http://www.unicode.org/faq/private_use.html#noncharacters

Perl Info
|
From @ap
* Gian Piero <gpiero@rm-rf.it> [2014-02-11 17:10]:
I found most helpful the sections from They clearly describe the intent that these noncharacters should be
That is very much my interpretation of the FAQ. Quoting from <http://www.unicode.org/faq/private_use.html#nonchar2>: Noncharacters are in a sense a kind of private-use character, Prior to that, the FAQ expends some verbiage to convey that private-use So in answer to your question, it appears that the UTC conceives the It inescapably follows that if even the use of noncharacters must not Regards, |
The RT System itself - Status changed from 'new' to 'open' |
From gpiero@rm-rf.it
Hi Aristotle, thank you for your reply.
* [Tue, Feb 11, 2014 at 11:47:16AM -0800] Aristotle Pagaltzis via RT:
Yes, but the definition of 'system' in this context is not so clear to
I'm afraid I wasn't clear in the initial report: AFAIK perl already Ciao, |
From @khwilliamson
On 02/11/2014 02:16 PM, Gian Piero Carrubba wrote:
I have mixed feelings about this request.

First, some clarifications. Private-use characters have always been

Non-character code points have a different genesis altogether. When Unicode was expanded beyond 16 bits, they created the plane

Eventually it became clear that there is a need for text-processing

Non-character code points should not be foisted off on an unsuspecting

Now to the request. I agree that the warning is not severe; however we

"no warnings 'nonchar'" to silence just them. |
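The lexical opt-out mentioned above can be sketched as follows. This is a minimal illustration, not code from the ticket: the helper name `warnings_when_printing` and the in-memory filehandle are assumptions made here so the example is self-contained. It also assumes a perl recent enough (5.14 or later) to have the 'nonchar' category; the exact wording of the warning differs between releases, so the sketch only counts warnings.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text to an in-memory handle that has the
# :utf8 layer, and return whatever warnings perl raised while doing so.
sub warnings_when_printing {
    my ($text, $silence_nonchar) = @_;
    my @caught;
    local $SIG{__WARN__} = sub { push @caught, @_ };

    open(my $fh, '>:utf8', \my $buffer)
        or die "cannot open in-memory handle: $!";
    if ($silence_nonchar) {
        no warnings 'nonchar';    # lexically scoped to this block only
        print {$fh} $text;
    }
    else {
        print {$fh} $text;        # the default-on 'nonchar' warning applies
    }
    close $fh;
    return @caught;
}

my @loud  = warnings_when_printing("noncharacter: \x{FDEF}", 0);
my @quiet = warnings_when_printing("noncharacter: \x{FDEF}", 1);
printf "default: %d warning(s), silenced: %d warning(s)\n",
    scalar @loud, scalar @quiet;
```

Because the pragma is lexical, only the one `print` inside the block is affected; other warnings in the rest of the program stay enabled.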
From @xdg
perldiag says this:

    Unicode non-character U+%X is illegal for open interchange

That seems to explain it solely as "non-characters" not "private characters".

From Karl's explanation and corrigendum #9, I think it's clear that

A new 'privatechar' warning category should be added to cover those

E.g.

    Unicode private character U+%x in %x, may not be portable

and

    Unicode non-character U+%x in %x, may not be portable

In those, the second "%x" would be the op that triggered the warning,

Unlike the wide character warning, though, where the IO handle is

Of course, an IO layer should be able to decide if those are acceptable.

E.g.

    binmode(STDOUT, ":utf8_private_strict");

Should something like that be created, it should allow private

David

On Wed, Feb 12, 2014 at 2:22 PM, Karl Williamson
-- |
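To make the distinction between the two kinds of code point concrete, here is a small sketch that classifies a code point using Perl's Unicode properties. The function name `classify_code_point` is hypothetical; `\p{Noncharacter_Code_Point}` and `\p{Co}` (General_Category Private_Use) are standard properties listed in perluniprops. A 'privatechar' category like the one proposed above would presumably need a test along these lines, but nothing here reflects how perl itself implements its warnings.

```perl
use strict;
use warnings;

# Hypothetical helper: report which of the two categories discussed in
# this thread a code point falls into, if any.
sub classify_code_point {
    my ($code_point) = @_;
    my $char = chr $code_point;
    return 'noncharacter' if $char =~ /\A\p{Noncharacter_Code_Point}\z/;
    return 'private-use'  if $char =~ /\A\p{Co}\z/;
    return 'other';
}

# U+FDEF and U+FFFF are non-characters, U+F8FF is in the private use
# area, and U+0041 is an ordinary assigned character.
for my $code_point (0xFDEF, 0xFFFF, 0xF8FF, 0x0041) {
    printf "U+%04X: %s\n", $code_point, classify_code_point($code_point);
}
```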
From tchrist@perl.com
I don't understand why you would ever want to issue a warning

    use charnames ":alias" => {

    print "\N{APPLE CORPORATE LOGO}\n";

Let alone all the fun I have with my Tengwar module.

    ### This one matches the assignments of the Free Tengwar Font Project

    ### Whereas This one matches the official roadmap:

    ## if

In file, can do this:

    use charnames ":full", ":alias" => { reverse (

    (TENGWAR_BASE + 0x00) => "TENGWAR LETTER TINCO",
    ....

--tom |
From @Tux
On Wed, 12 Feb 2014 21:56:33 -0700, Tom Christiansen <tchrist@perl.com>
I am really happy to see :alias being used like this. It clearly proves -- |
From gpiero@rm-rf.it
Hello Karl, thank you, very interesting explanation.
* [Wed, Feb 12, 2014 at 11:23:08AM -0800] karl williamson via RT:
Oh well, I thought UTC recommended use of out-of-range sentinels in On the other hand, both noncharacters and sentinels could not be the I guess I would use private-use chars in most cases for avoiding
Ok, I'm following you here. Nevertheless I'm wondering how much is it
Ok, so I mis-interpreted the reason for the warning. I thought it was Ciao, |
From @xdg
On Wed, Feb 12, 2014 at 11:56 PM, Tom Christiansen <tchrist@perl.com> wrote:
Because they require prior agreement between parties, I think it This is particularly important for characters read from one source and Absolutely, the warning should not be on by default. When warnings I don't think ":utf8" should warn about private or non-characters. I Then one could say C<< binmode($fh, ":encoding(UTF-8-any)") >> and If at some point in the future, we move to make ":utf8" itself David -- |
From gpiero@rm-rf.it
* [Wed, Feb 12, 2014 at 01:37:08PM -0800] David Golden via RT:
* [Thu, Feb 13, 2014 at 06:47:07AM -0800] David Golden via RT:
Not yet sure, but I think I agree with you about a warning related to use warnings;
Strongly uncertain about this matter. My first reaction was: "absolutely
Interesting idea. I also see a use for a layer that would accept Unicode $ perl-5.19.8 -CS -le 'print "\x{FFFF_1234}"' >/dev/null So, not only it turns off warnings about non-characters, but it also Back to the subject of layers, I probably would also love a couple of Ciao, |
From gpiero@rm-rf.it
* [Mon, Feb 17, 2014 at 08:23:14PM +0100] Gian Piero Carrubba:
Well, not exactly. The 'non_unicode' tag already exists and is severe

$ perlbrew exec --with perl-5.19.8 perl -CS -l >/dev/null

$ perlbrew exec --with perl-5.19.8 perl -CS -l >/dev/null

wtf? Disabling a 'severe' warning results in _all_ warnings being

I cannot find it reported, nor does this behaviour seem to be documented. Should it

Ciao, |
From @xdg
On Mon, Feb 17, 2014 at 2:23 PM, Gian Piero Carrubba <gpiero@rm-rf.it> wrote:
That's actually what I meant. I think nonchar and privatechar should David -- |
From @rjbs
949cf49 introduced "a new set of flags to disallow those code points." For

In general, I think the problem is that there are cases for wanting these

Making the warning non-severe seems reasonable, although I'm not worked up

I also don't know how commonly code is being run with no warnings enabled, so

-- |
From @xdg
On Tue, Feb 18, 2014 at 8:27 AM, Ricardo Signes
That's my rationale for having strict UTF-8 layers do warnif() to a

    use warnings;

    say $nonchar; # warns about wide char

    binmode(STDOUT, ':utf8');

    binmode(STDOUT, ':encoding(UTF-8)');

    {

    binmode(STDOUT, ':encoding(UTF-8-any)');

-- |
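One way a strict check could route its diagnostics through a category that callers can switch off is the `warnings::register` and `warnings::warnif` pair from the core `warnings` module. The sketch below is an assumption about how such a check might look, not the proposed layer itself: the package `Hypothetical::StrictCheck` and its `check` function are invented for illustration, and it registers a category named after the package rather than the 'nonchar' or 'privatechar' categories discussed in this thread.

```perl
use strict;
use warnings;

{
    package Hypothetical::StrictCheck;
    use warnings::register;    # creates a category named after this package

    # Warn, if the caller has this package's category enabled, when $text
    # contains a non-character; return $text unchanged either way.
    sub check {
        my ($text) = @_;
        warnings::warnif('text contains a Unicode non-character')
            if $text =~ /\p{Noncharacter_Code_Point}/;
        return $text;
    }
}

# Collect warnings instead of printing them to STDERR.
my @caught;
$SIG{__WARN__} = sub { push @caught, @_ };

# Category enabled by the file-wide "use warnings": this call warns.
Hypothetical::StrictCheck::check("\x{FFFF}");

{
    # The caller opts out lexically: this call stays silent.
    no warnings 'Hypothetical::StrictCheck';
    Hypothetical::StrictCheck::check("\x{FFFF}");
}

printf "%d warning(s) caught\n", scalar @caught;
```

The point of the design is that the decision to warn is made in the caller's lexical scope, so a program that knowingly handles non-characters can silence just this check without touching the rest of its warnings.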
From @ikegami
On Mon, Feb 17, 2014 at 2:23 PM, Gian Piero Carrubba <gpiero@rm-rf.it> wrote:
C<< use warnings; >> is documented to enable all warnings. Don't break this |
From @pjcj
On Tue, Feb 18, 2014 at 12:49:26PM -0500, Eric Brine wrote:
I have no comments on this specific proposal, but I do have thoughts If no import list is supplied, all possible warnings are either enabled First, I'm not even sure whether this is totally accurate, but even if However, I do see a problem with adding new categories which are enabled -- |
From @druud62
On 2014-02-18 20:28, Paul Johnson wrote:
use warnings ':most'; -- |
From @ikegami
On Tue, Feb 18, 2014 at 2:28 PM, Paul Johnson <paul@pjcj.net> wrote:
If this is the case, I haven't been doing and recommending when I think I However, I do see a problem with adding new categories which are enabled
I didn't say they should be enabled by default. I said they should be |
From @khwilliamson
Regardless of what the ultimate disposition of this is, I have attached a patch that would clarify the current situation for at least 5.20. Any objections to it? |
From @khwilliamson
0002-Proposed-5.20-wording-for-non-char-code-point-usage.patch
From 6b1134ef7e53209fcf4f197707a95e4b5b330f86 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Mon, 21 Apr 2014 08:49:00 -0600
Subject: [PATCH 2/2] Proposed 5.20 wording for non-char code point usage
This clarifies how things work in 5.20.
---
pod/perldiag.pod | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index 5482684..64a1bff 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -5739,9 +5739,15 @@ with the characters in the Lao and Thai scripts.
(S nonchar) Certain codepoints, such as U+FFFE and U+FFFF, are
defined by the Unicode standard to be non-characters. Those are
legal codepoints, but are reserved for internal use; so, applications
-shouldn't attempt to exchange them. If you know what you are doing
+shouldn't attempt to exchange them. An application may not be
+expecting any of these characters at all, and receiving them
+may lead to bugs. If you know what you are doing
you can turn off this warning by C<no warnings 'nonchar';>.
+This is not really a "serious" error, but it is supposed to be raised
+by default even if warnings are not enabled, and currently the only
+way to do that in Perl is to mark it as serious.
+
=item Unicode surrogate U+%X is illegal in UTF-8
(S surrogate) You had a UTF-16 surrogate in a context where they are
--
1.8.3.2
|
From @rjbs
* Karl Williamson via RT <perlbug-followup@perl.org> [2014-04-21T10:57:26]
None. -- |
From @jhi
It seems that Perl is lagging on the handling for Unicode

http://www.unicode.org/versions/corrigendum9.html

In other words, they should be handled much like PUA (private use area)

How we are currently doing it wrong:

(a)

./perl -CO -we 'print chr(0xFFFF)'
Unicode non-character U+FFFF is illegal for open interchange at -e line 1.

(Somewhat strangely, the -CO is required for the warning to appear.)

We shouldn't warn.

It is possible we still could warn somehow, to alert users about the

(b) In Encode, the "utf8" lets the non-chars through, but the strict

./perl -Ilib -MEncode=decode -MDevel::Peek -we 'Dump(decode("utf8",

We shouldn't mangle.

[1] http://www.unicode.org/faq/private_use.html#nonchar1 |
From @jhi
On Wednesday-201405-21, 9:55, Jarkko Hietaniemi (via RT) wrote:
Should've said: "Currently known wrongnesses include, but are probably not limited to" |
From @tonycoz
On Wed May 21 06:55:23 2014, jhi wrote:
This looks like a duplicate of https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121226 Tony |
The RT System itself - Status changed from 'new' to 'open' |
From @jhi
On Wednesday-201405-21, 23:42, Tony Cook via RT wrote:
Yup, the same issue. FWIW, I started poking at this. |
From @khwilliamson
On Thu May 22 05:32:49 2014, jhi wrote:
I have now merged these two tickets.

I've been thinking about this and doing some research in the Unicode standard, and am having trouble with the idea that we should now just change to accept non-characters without warning. Non-characters are still "permanently reserved for internal use", quoting from Corrigendum #9. I want to emphasize that word "internal". An application should be able to presume that data it receives from an external source does not contain non-characters, so it is free to use them in any way it wishes. This is the whole point of non-characters: to have some code points available for you that you are assured won't be coming from somewhere else.

And how do things come from somewhere else? Through I/O. Hence, the presumption by Perl should be that I/O is related to an external interface. It may be that an application is composed of cooperating processes that communicate via I/O, but Perl's presumption must be, unless indicated otherwise, that I/O is for external interfaces.

An application that uses non-characters will not want any of them coming in on its inputs. It wants them filtered out; the best choice is to have them turned into REPLACEMENT CHARACTERs. My claim is that Perl should do this by default. Corrigendum #9 doesn't change this.

And there should be a way to change the default. That is what Corrigendum #9 makes clear, and which Perl already does in (too) many cases. That Corrigendum was not aimed at Perl, but at other Unicode implementations. My point is that Perl already implements this Corrigendum, and neither needs to nor should change because of it.

We long ago agreed that the default input for Perl should be strict, and that explicit action should be taken to override that. Strict input should continue to exclude non-characters. If we were to change that, existing applications would be suddenly and silently exposed to security holes, where an attacker who knows the internal structure of the application inserts non-characters to fool it.

Let me reiterate my main point. We already implement Corrigendum #9. We should not make changes because of it.

Private-use characters are not the same as non-characters. An application has no right to presume that external inputs don't include private-use characters. But it is free to ascribe its own meanings to them. In practice, most applications will just treat them as some generic code points.

I think David Golden's ideas would be a useful addition, but it's not my itch. I would be happy to consult with someone who wishes to scratch it, though. |
From @jhi
There's input and there's output.

I agree that default input should be strict: but I think stricter than

(There's also more spectrum than just spewing warnings: currently we

I am not entirely certain about the definition of "internal" here,

But on output if I output U+FFFF I don't want to output U+FFFD. (This

Again, quoting the C9: "However, they are not illegal in interchange nor |
From @jhi
On Thursday-201405-29, 15:49, Jarkko Hietaniemi wrote:
Or this:

perl -MEncode=decode -MDevel::Peek -we 'Dump(decode("UTF-8",

giving me the bytes \xEF\xBF\xBD, aka U+FFFD. |
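The difference between the two Encode decoders described above can be checked with a short script. This is a sketch written for this thread, not the truncated one-liner from the comment: it decodes the three bytes that encode U+FFFF with both the lax "utf8" and the strict "UTF-8" encodings and prints the first resulting code point. The ticket reports U+FFFD for the strict decoder; other Encode releases may behave differently, so no expected output is given here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 byte sequence for the non-character U+FFFF.
my $bytes = "\xEF\xBF\xBF";

# "utf8" is Perl's lax encoding; "UTF-8" is the strict variant.
my $lax    = decode('utf8',  $bytes);
my $strict = decode('UTF-8', $bytes);

# ord() reports the first character of each decoded string.
printf "lax    (utf8):  U+%04X\n", ord $lax;
printf "strict (UTF-8): U+%04X\n", ord $strict;
```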
From @khwilliamson
On 05/29/2014 01:49 PM, Jarkko Hietaniemi wrote:
This has been hashed around a lot before, and I think every one now
Perhaps options.
That's why there has to be flexibility. We have to make the default the
Agreed. The reason we warn is so you know you're outputting something
|
From @jhi
There was some discussion but it was all over the place, and this ticket as such is pretty useless. Rejecting. |
@jhi - Status changed from 'open' to 'rejected' |
Migrated from rt.perl.org#121226 (status was 'rejected')
Searchable as RT121226$