
CC: zefram [...] fysh.org
Subject: utf8 fatal warning
Date: Tue, 24 Feb 2009 21:26:57 +0000
To: perlbug [...] perl.org
From: Zefram <zefram [...] fysh.org>
This is a bug report for perl from zefram@fysh.org,
generated with the help of perlbug 1.35 running under perl v5.8.8.

-----------------------------------------------------------------
[Please enter your report here]

$ perl -le 'print do { no warnings "utf8"; "\x{d800}" } =~ /\A[\x{123}]/ ? "yes" : "no"'
no
$ perl -lwe 'print do { no warnings "utf8"; "\x{d800}" } =~ /\A[\x{123}]/ ? "yes" : "no"'
Malformed UTF-8 character (fatal) at -e line 1.
$

Turning warnings on makes the regexp operation die, whereas with warnings
off it produced the correct answer.  If 'no warnings "utf8"' is in scope
at the regexp op, then the error does not occur.

The form of the regexp affects behaviour.  If the regexp is /\A\x{123}/
(not a character class) then there is no error or warning.  If the regexp
is /\A\x{23}/ (ASCII character, no character class) then a *warning* is
issued and the right answer is returned.  If the regexp is /\A[\x{23}]/
(ASCII character, character class) then the error occurs on 5.8.8 and
a warning is issued on 5.10.0 or 5.8.9.  Non-ASCII Latin-1 characters
behave the same as ASCII characters.

The characters in the string that can be complained about this way
are U+d800 to U+dfff (the surrogates) and U+ffff (one of many reserved
non-characters).

The character is in fact encoded correctly in Perl-internal UTF-8.
The error message is wrong.  Curiously, and probably related,
Devel::Peek::Dump() suffers the same kind of problem when dumping the
string: it generates a warning and gives the wrong UTF-8 decode iff
'use warnings "utf8"' is in scope at its call site.

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=medium
---
Site configuration information for perl v5.8.8:

Configured by Debian Project at Fri Dec 19 00:43:54 EST 2008.
Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.18-6-686, archname=i486-linux-gnu-thread-multi
    uname='linux etch 2.6.18-6-686 #1 smp fri dec 12 16:48:28 utc 2008 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.8 -Dsitearch=/usr/local/lib/perl/5.8.8 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.8 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.1.2 20061115 (prerelease) (Debian 4.1.1-21)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.3.6.so, so=so, useshrplib=true, libperl=libperl.so.5.8.8
    gnulibc_version='2.3.6'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:

---
@INC for perl v5.8.8:
    /etc/perl
    /usr/local/lib/perl/5.8.8
    /usr/local/share/perl/5.8.8
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.8
    /usr/share/perl/5.8
    /usr/local/lib/site_perl
    /usr/local/lib/perl/5.8.4
    /usr/local/share/perl/5.8.4
    .

---
Environment for perl v5.8.8:
    HOME=/home/zefram
    LANG (unset)
    LANGUAGE (unset)
    LC_CTYPE=en_GB
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/zefram/pub/i686-pc-linux-gnu/bin:/home/zefram/pub/common/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/local/bin:/usr/games
    PERL_BADLANG (unset)
    SHELL=/usr/bin/zsh
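
To see how a particular perl handles the four regexp forms described in the report, a small probe along the following lines may help. It is only a sketch: the helper name `probe` and the result layout are made up for illustration, and because the outcome depends on the perl version no expected output is given.

```perl
use strict;
use warnings;

# Build the subject string with utf8 warnings off, as the report does.
my $subject = do { no warnings 'utf8'; "\x{d800}" };

# Hypothetical helper (not part of perl): run one match with warnings
# enabled and record whether it matched, died, or produced warnings.
sub probe {
    my ($label, $re) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    my $matched = eval { $subject =~ $re ? 'yes' : 'no' };
    return {
        label    => $label,
        outcome  => defined $matched ? $matched : 'died',
        warnings => scalar @warnings,
    };
}

# The four pattern shapes discussed in the report.
my @cases = (
    [ 'non-ASCII, character class' => qr/\A[\x{123}]/ ],
    [ 'non-ASCII, no class'        => qr/\A\x{123}/   ],
    [ 'ASCII, no class'            => qr/\A\x{23}/    ],
    [ 'ASCII, character class'     => qr/\A[\x{23}]/  ],
);

for my $case (@cases) {
    my $r = probe(@$case);
    printf "%-28s outcome=%-4s warnings=%d\n",
        @{$r}{qw(label outcome warnings)};
}
```

An outcome of `died` together with a warning count that varies by pattern shape would correspond to the inconsistency the report describes.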
RT-Send-CC: perl5-porters [...] perl.org
On Tue Feb 24 13:27:21 2009, zefram@fysh.org wrote:
I'm not sure what to do about this ticket. The basics of it anyway are behaving as designed, which is that non-characters and surrogates generate errors unless warnings are turned off, but then things should work.

The message in 5.12 for U+FFFF has been clarified that this character is illegal for interchange. This should be extended in a later release to the other 65 noncharacters.

Surrogates, on the other hand, should never appear in well-formed utf8, and there are security considerations for doing so that I don't fully understand but can see why.

It seems to me that the current design is sufficient.
Zefram,

Is it ok if I close this ticket? The Unicode standard says, "Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed."

The message for FFFF has been changed to be correct.

--Karl Williamson
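
The byte sequences that the quoted passage refers to can be worked out by hand. The sketch below applies the ordinary three-byte UTF-8 bit layout to code points in U+0800..U+FFFF without any validity checking; the function names are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Apply the three-byte UTF-8 bit layout (1110xxxx 10xxxxxx 10xxxxxx)
# to a code point, with no check for surrogates or non-characters.
sub three_byte_form {
    my ($cp) = @_;
    die "outside U+0800..U+FFFF\n" if $cp < 0x800 || $cp > 0xFFFF;
    return (
        0xE0 | ($cp >> 12),
        0x80 | (($cp >> 6) & 0x3F),
        0x80 | ($cp & 0x3F),
    );
}

# Format a list of byte values as space-separated hex.
sub hex_bytes { return join ' ', map { sprintf '%02X', $_ } @_ }

print hex_bytes(three_byte_form(0xD800)), "\n";   # ED A0 80
print hex_bytes(three_byte_form(0xDFFF)), "\n";   # ED BF BF
print hex_bytes(three_byte_form(0xFFFF)), "\n";   # EF BF BF
```

The first two sequences are the ones the standard calls ill-formed. Perl's internal encoding still uses them for surrogate code points, which is the sense in which the report says the character is encoded correctly in Perl-internal UTF-8.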
Subject: Re: [perl #63446] utf8 fatal warning
Date: Wed, 20 Oct 2010 08:06:32 +0100
To: Karl Williamson via RT <perlbug-followup [...] perl.org>
From: Zefram <zefram [...] fysh.org>
Karl Williamson via RT wrote:
>Is it ok if I close this ticket?
No. It's not OK for a warning to be fatal. The situation should either be a fatal error (regardless of warning flags) or a non-fatal warning (controlled by warning flags). A warning would make a lot more sense, because Perl is generally happy to process codepoints in ways that Unicode does not permit.

-zefram
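
For comparison, this is roughly how perl keeps the two cases apart elsewhere: a warning category is non-fatal unless the caller opts in with `use warnings FATAL => ...`. The sketch uses the `numeric` category instead of `utf8`, because which operations raise utf8 warnings differs between perl versions; that substitution is an assumption made for the example and does not come from the ticket.

```perl
use strict;
use warnings;

# Two separate variables so each conversion is attempted from scratch.
my ($first, $second) = ('abc', 'abc');

# Default behaviour: the warning is reported and execution continues.
my @seen;
my $plain = do {
    local $SIG{__WARN__} = sub { push @seen, $_[0] };
    $first + 0;
};

# Opt-in behaviour: the caller promotes the same category to an error.
my $survived = eval {
    use warnings FATAL => 'numeric';
    my $n = $second + 0;
    1;
};
my $error = $@;

printf "non-fatal: result=%d, warnings=%d\n", $plain, scalar @seen;
printf "fatal:     survived=%s\n", $survived ? 'yes' : 'no';
```

In both branches the decision belongs to the calling code, which is the property the regexp case in this ticket lacks.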
RT-Send-CC: perl5-porters [...] perl.org, public [...] khwilliamson.com
On Sun Mar 21 19:08:59 2010, khw wrote:
> On Tue Feb 24 13:27:21 2009, zefram@fysh.org wrote:
> > $ perl -le 'print do { no warnings "utf8"; "\x{d800}" } =~ > > /\A[\x{123}]/ ? "yes" : "no"' > > no > > $ perl -lwe 'print do { no warnings "utf8"; "\x{d800}" } =~ > > /\A[\x{123}]/ ? "yes" : "no"' > > Malformed UTF-8 character (fatal) at -e line 1. > > $ > > > > Turning warnings on makes the regexp operation die, whereas with > > warnings > > off it produced the correct answer. If 'no warnings "utf8"' is in > > scope > > at the regexp op, then the error does not occur.
> > I'm not sure what to do about this ticket. The basics of it anyway are > behaving as designed, which is that non-characters and surrogates > generate errors unless warnings are turned off, but then things should > work.
It may be working as designed, but it was not designed very well.
> The message in 5.12 for U+FFFF has been clarified that this > character is illegal for interchange. This should be extended in a > later release to the other 65 noncharacters. > > Surrogates, on the other hand, should never appear in well-formed utf8, > and there are security considerations for doing so that I don't fully > understand but can see why.
The regular expression engine is not a security layer. It should not pretend to be one. If I want to implement a security layer using regular expressions, then this bug (yes, I do consider it a bug) will get in the way.

Furthermore, Perl’s strings are not just Unicode. Unicode strings are merely a subset of the strings that Perl supports.

Regular expressions are for looking at strings. So it should not warn or die based on the contents of the string, as long as it is a valid Perl string.

perl already warns for "\x{d800}" and chr 0xd800. So if such a string is passed to a regular expression, we get multiple warnings for the same character.

I use Perl’s strings for storing 16-bit binary data. The result is that not only the code creating such strings, but any code looking at the strings, has to turn off utf8 warnings. So I can’t use any CPAN modules such as Data::Dump::Streamer.

I propose we stop the regular expression engine from rejecting or warning about these characters altogether. The only checking should be for code that creates such characters or for I/O layers.

There are three patches attached that fix a few cases. There will be more to come.
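
As a minimal sketch of the use case described above, the following stores arbitrary 16-bit values as characters of a Perl string and reads them back by ordinal. The sample values are made up, and `no warnings 'utf8'` is applied around both creation and inspection so the example stays quiet on perls that warn at either point.

```perl
use strict;
use warnings;

# Arbitrary 16-bit values, including surrogates and a non-character.
my @words = (0x0041, 0xD800, 0xDFFF, 0xFFFF, 0x1234);

my ($packed, @round_trip);
{
    no warnings 'utf8';
    # One character per 16-bit word.
    $packed = join '', map { chr $_ } @words;
    # Read the words back by position, without involving the regexp engine.
    @round_trip = map { ord substr($packed, $_, 1) } 0 .. length($packed) - 1;
}

printf "%d characters: %s\n", length($packed),
    join ' ', map { sprintf '%04X', $_ } @round_trip;
# 5 characters: 0041 D800 DFFF FFFF 1234
```

The complaint in the message is that any third-party code which then applies a regexp to `$packed` is outside the scope of this `no warnings` block.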
diff -up blead-63446.base/MANIFEST blead-63446-utf8-warnings/MANIFEST
--- blead-63446.base/MANIFEST	2010-11-26 08:06:10.000000000 -0800
+++ blead-63446-utf8-warnings/MANIFEST	2010-11-27 21:36:33.000000000 -0800
@@ -4804,6 +4804,7 @@ t/porting/podcheck.t		Test the POD of sh
 t/porting/regen.t		Check that regen.pl doesn't need running
 t/porting/test_bootstrap.t	Test that the instructions for test bootstrapping aren't accidentally overlooked.
 t/README			Instructions for regression tests
+t/re/beyond_unicode.t		See if regexps work with all characters
 t/re/fold_grind.t		See if case folding works properly
 t/re/overload.t			Test against string corruption in pattern matches on overloaded objects
 t/re/pat_advanced.t		See if advanced esoteric patterns work
diff -up blead-63446.base/regcomp.c blead-63446-utf8-warnings/regcomp.c
--- blead-63446.base/regcomp.c	2010-11-24 09:59:12.000000000 -0800
+++ blead-63446-utf8-warnings/regcomp.c	2010-11-28 05:37:38.000000000 -0800
@@ -3038,7 +3038,8 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_
 		if (UTF) {
 		    const U8 * const s = (U8*)STRING(scan);
 		    l = utf8_length(s, s + l);
-		    uc = utf8_to_uvchr(s, NULL);
+		    uc =
+		      utf8n_to_uvchr(s, UTF8_MAXBYTES, NULL, UTF8_ALLOW_ANYUV);
 		} else {
 		    uc = *((U8*)STRING(scan));
 		}
@@ -7779,7 +7780,7 @@ tryagain:
 		    if (UTF8_IS_START(*p) && UTF) {
 			STRLEN numlen;
 			ender = utf8n_to_uvchr((U8*)p, RExC_end - p,
-					       &numlen, UTF8_ALLOW_DEFAULT);
+					       &numlen, UTF8_ALLOW_ANYUV);
 			p += numlen;
 		    }
 		    else
@@ -9078,7 +9079,10 @@ S_reguni(pTHX_ const RExC_state_t *pRExC
 
     PERL_ARGS_ASSERT_REGUNI;
 
-    return SIZE_ONLY ? UNISKIP(uv) : (uvchr_to_utf8((U8*)s, uv) - (U8*)s);
+    return
+      SIZE_ONLY
+        ? UNISKIP(uv)
+        : (uvuni_to_utf8_flags((U8*)s, uv, UNICODE_ALLOW_ANY) - (U8*)s);
 }
 
 /*
diff -Nurp blead-63446.base/t/re/beyond_unicode.t blead-63446-utf8-warnings/t/re/beyond_unicode.t
--- blead-63446.base/t/re/beyond_unicode.t	1969-12-31 16:00:00.000000000 -0800
+++ blead-63446-utf8-warnings/t/re/beyond_unicode.t	2010-11-28 05:49:47.000000000 -0800
@@ -0,0 +1,30 @@
+#!./perl -w
+
+# This script tests that the regular expression engine can handle all Perl
+# characters, including those that are not Unicode. Unicode characters are
+# merely a subset of Perl characters.
+
+BEGIN {
+    chdir 't' if -d 't';
+    @INC = '../lib';
+    require './test.pl';
+}
+
+plan 1;
+
+my @bad;
+
+sub report_bad {
+    if(@bad) {
+	diag "Bad ranges: ", join " ", map sprintf("%x00..%x00",$_,$_+1), @bad;
+    }
+}
+
+@bad = ();
+for(0..0x1200) {
+    next if rand > .25;
+    my $c = do { no warnings 'utf8'; join "", map chr, $_<<8 .. $_+1<<8 };
+    push @bad, $_ if $c !~ quotemeta $c;
+}
+ok !@bad, 'quotemeta $foo matches $foo for every character';
+report_bad;
diff -up blead-63446-utf8-warnings2/regcomp.c blead-63446-utf8-warnings3/regcomp.c
--- blead-63446-utf8-warnings2/regcomp.c	2010-11-28 06:27:23.000000000 -0800
+++ blead-63446-utf8-warnings3/regcomp.c	2010-11-28 11:03:40.000000000 -0800
@@ -8313,7 +8313,7 @@ parseit:
 	if (UTF) {
 	    value = utf8n_to_uvchr((U8*)RExC_parse,
 				   RExC_end - RExC_parse,
-				   &numlen, UTF8_ALLOW_DEFAULT);
+				   &numlen, UTF8_ALLOW_ANYUV);
 	    RExC_parse += numlen;
 	}
 	else
diff -up blead-63446-utf8-warnings2/regexec.c blead-63446-utf8-warnings3/regexec.c
--- blead-63446-utf8-warnings2/regexec.c	2010-11-28 06:32:01.000000000 -0800
+++ blead-63446-utf8-warnings3/regexec.c	2010-11-28 11:08:39.000000000 -0800
@@ -6217,10 +6217,8 @@ S_reginclass(pTHX_ const regexp * const
     /* If c is not already the code point, get it */
     if (utf8_target && !UTF8_IS_INVARIANT(c)) {
 	c = utf8n_to_uvchr(p, UTF8_MAXBYTES, &c_len,
-		(UTF8_ALLOW_DEFAULT & UTF8_ALLOW_ANYUV)
-		| UTF8_ALLOW_FFFF | UTF8_CHECK_ONLY);
-		/* see [perl #37836] for UTF8_ALLOW_ANYUV; [perl #38293] for
-		 * UTF8_ALLOW_FFFF */
+		UTF8_ALLOW_ANYUV | UTF8_CHECK_ONLY);
+		/* see [perl #37836], [perl #38293] and [perl #63446] */
 	if (c_len == (STRLEN)-1)
 	    Perl_croak(aTHX_ "Malformed UTF-8 character (fatal)");
     }
diff -up blead-63446-utf8-warnings2/utf8.c blead-63446-utf8-warnings3/utf8.c
--- blead-63446-utf8-warnings2/utf8.c	2010-11-28 06:26:01.000000000 -0800
+++ blead-63446-utf8-warnings3/utf8.c	2010-11-28 12:40:01.000000000 -0800
@@ -2046,8 +2046,7 @@ Perl_swash_fetch(pTHX_ SV *swash, const
 		   Unicode tables, not a native character number.
 		 */
 		const UV code_point = utf8n_to_uvuni(ptr, UTF8_MAXBYTES, 0,
-						ckWARN(WARN_UTF8) ?
-						0 : UTF8_ALLOW_ANY);
+						UTF8_ALLOW_ANYUV);
 		swatch = swash_get(swash,
 		    /* On EBCDIC & ~(0xA0-1) isn't a useful thing to do */
 				(klen) ? (code_point & ~(needents - 1)) : 0,
diff -up blead-63446-utf8-warnings/regcomp.c blead-63446-utf8-warnings2/regcomp.c
--- blead-63446-utf8-warnings/regcomp.c	2010-11-28 05:37:38.000000000 -0800
+++ blead-63446-utf8-warnings2/regcomp.c	2010-11-28 06:27:23.000000000 -0800
@@ -1348,7 +1348,7 @@ S_make_trie(pTHX_ RExC_state_t *pRExC_st
     HV *widecharmap = NULL;
     AV *revcharmap = newAV();
     regnode *cur;
-    const U32 uniflags = UTF8_ALLOW_DEFAULT;
+    const U32 uniflags = UTF8_ALLOW_ANYUV;
     STRLEN len = 0;
     UV uvc = 0;
     U16 curword = 0;
diff -up blead-63446-utf8-warnings/regexec.c blead-63446-utf8-warnings2/regexec.c
--- blead-63446-utf8-warnings/regexec.c	2010-11-24 05:45:11.000000000 -0800
+++ blead-63446-utf8-warnings2/regexec.c	2010-11-28 06:32:01.000000000 -0800
@@ -1752,7 +1752,7 @@ S_find_byclass(pTHX_ regexp * prog, cons
                     */
                     while (s <= last_start) {
-                        const U32 uniflags = UTF8_ALLOW_DEFAULT;
+                        const U32 uniflags = UTF8_ALLOW_ANYUV;
                         U8 *uc = (U8*)s;
                         U16 charid = 0;
                         U32 base = 1;
@@ -2948,7 +2948,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo,
 #endif
     dVAR;
     register const bool utf8_target = PL_reg_match_utf8;
-    const U32 uniflags = UTF8_ALLOW_DEFAULT;
+    const U32 uniflags = UTF8_ALLOW_ANYUV;
     REGEXP *rex_sv = reginfo->prog;
     regexp *rex = (struct regexp *)SvANY(rex_sv);
     RXi_GET_DECL(rex,rexi);
diff -up blead-63446-utf8-warnings/t/re/beyond_unicode.t blead-63446-utf8-warnings2/t/re/beyond_unicode.t
--- blead-63446-utf8-warnings/t/re/beyond_unicode.t	2010-11-28 06:08:59.000000000 -0800
+++ blead-63446-utf8-warnings2/t/re/beyond_unicode.t	2010-11-28 06:09:42.000000000 -0800
@@ -10,7 +10,7 @@ BEGIN {
     require './test.pl';
 }
 
-plan 1;
+plan 2;
 
 my @bad;
 
@@ -26,5 +26,13 @@ for(0..0x1200) {
     my $c = do { no warnings 'utf8'; join "", map chr, $_<<8 .. $_+1<<8 };
     push @bad, $_ if $c !~ quotemeta $c;
 }
-ok !@bad, 'quotemeta $foo matches $foo for every character';
+ok !@bad, '$foo =~ quotemeta $foo for every character';
+report_bad;
+
+for(0..0x1200) {
+    next if rand > .25;
+    my $c = do { no warnings 'utf8'; join "", map chr, $_<<8 .. $_+1<<8 };
+    push @bad, $_ if $c !~ /\Q$c\E|a/;
+}
+ok !@bad, '$foo =~ /\Q$foo\E|a/ for every character';
 report_bad;
CC: perl5-porters [...] perl.org, public [...] khwilliamson.com
Subject: Re: [perl #63446] utf8 fatal warning
Date: Sun, 28 Nov 2010 15:41:08 -0700
To: perlbug-followup [...] perl.org
From: Tom Christiansen <tchrist [...] perl.com>
In-Reply-To: Message from "Father Chrysostomos via RT" <perlbug-followup@perl.org> of "Sun, 28 Nov 2010 13:16:27 PST." <rt-3.6.HEAD-13564-1290978986-1559.63446-15-0@perl.org>
> Furthermore, Perl’s strings are not just Unicode. Unicode strings > are merely a subset of the strings that Perl supports.
Quite.
> Regular expressions are for looking at strings. So it should not warn > or die based on the contents of the string, as long as it is a valid > Perl string.
> perl already warns for "\x{d800}" and chr 0xd800. So if such a string > is passed to a regular expression, we get multiple warnings for the > same character.
It’s true.
> I use Perl’s strings for storing 16-bit binary data. The result is > that not only the code creating such strings, but any code looking at > the strings, has to turn off utf8 warnings. So I can’t use any CPAN > modules such as Data::Dump::Streamer.
That seems unfortunate.
> I propose we stop the regular expression engine from rejecting or > warning about these characters altogether. The only checking should be > for code that creates such characters or for I/O layers.
What you’ve written seems perfectly reasonable, even desirable.

However, I am rather concerned that this could lead to anomalous behavior. Here’s the kind of thing I don’t believe we want to see happen in Perl.

Java’s pattern matching acts completely nutty when presented with the kind of data you’re talking about. It never warns or dies, just gives logically inconsistent results. I hope we do not embark down a road that leads to this kind of nonsense.

To demonstrate, I’ll use U+1F47E, ALIEN MONSTER. That’s a 0xD83D and a 0xDC7E in UTF-16BE. Here are the crazy things the Java pattern matcher does with that. I’ve all-capped results I find most troubling below. A surrogate pair tests as a single character, but so does half of that pair. Plus if you flip the order of the surrogates, you now get utterly illogical results, with the matcher claiming it has neither surrogates nor nonsurrogates in it, nor any characters at all, yet still allowing some things to match nonetheless, but others to senselessly fail.

* Correct surrogate order:

    true    "\uD83D\uDC7E" =~ /./
    true    "\uD83D\uDC7E" =~ /^.$/
    false   "\uD83D\uDC7E" =~ /^..$/
    false   "\uD83D\uDC7E" =~ /\p{Cs}/
    true    "\uD83D\uDC7E" =~ /\P{Cs}/
    true:   "\uD83D\uDC7E" =~ /\uD83D\uDC7E/
    false:  "\uD83D\uDC7E" =~ /\uDC7E/
    false:  "\uD83D\uDC7E" =~ /\uD83D/

* Half a surrogate pair:

    TRUE    "\uD83D" =~ /./
    TRUE    "\uD83D" =~ /^.$/
    false   "\uD83D" =~ /^..$/
    true    "\uD83D" =~ /\p{Cs}/
    false   "\uD83D" =~ /\P{Cs}/
    true    "\uD83D" =~ /\uD83D/

* The other half of a surrogate pair:

    TRUE    "\uDC7E" =~ /./
    TRUE    "\uDC7E" =~ /^.$/
    false   "\uDC7E" =~ /^..$/
    true    "\uDC7E" =~ /\p{Cs}/
    false   "\uDC7E" =~ /\P{Cs}/
    true    "\uDC7E" =~ /\uDC7E/

* Surrogates in backwards order:

    FALSE   "\uDC7E\uD83D" =~ /./
    false   "\uDC7E\uD83D" =~ /^.$/
    true    "\uDC7E\uD83D" =~ /^..$/
    FALSE   "\uDC7E\uD83D" =~ /\p{Cs}/
    FALSE   "\uDC7E\uD83D" =~ /\P{Cs}/
    true:   "\uDC7E\uD83D" =~ /\uDC7E\uD83D/
    FALSE:  "\uDC7E\uD83D" =~ /\uD83D/
    FALSE:  "\uDC7E\uD83D" =~ /\uDC7E/

See what I mean? Isn’t that loony? I’m not sure what you would see done with “raw data”, but I sure do hope it’s nothing at all like *that*! ☹

--tom
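
The correspondence between U+1F47E and the pair 0xD83D 0xDC7E used in the tests above follows from the standard UTF-16 surrogate arithmetic. The sketch below works it through in both directions; the two function names are hypothetical and are not taken from any module.

```perl
use strict;
use warnings;

# Split a supplementary code point (U+10000..U+10FFFF) into its
# UTF-16 high and low surrogate values.
sub to_surrogates {
    my ($cp) = @_;
    my $v = $cp - 0x10000;
    return (0xD800 + ($v >> 10), 0xDC00 + ($v & 0x3FF));
}

# Combine a high and a low surrogate back into one code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

printf "%04X %04X\n", to_surrogates(0x1F47E);            # D83D DC7E
printf "%X\n",        from_surrogates(0xD83D, 0xDC7E);   # 1F47E
```

This mapping is a property of UTF-16 only; a matcher that stores code points directly has no reason to apply it, which is the distinction the rest of the thread turns on.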
RT-Send-CC: perl5-porters [...] perl.org, public [...] khwilliamson.com, tchrist [...] perl.com
On Sun Nov 28 14:42:02 2010, tom christiansen wrote:
> In-Reply-To: > > Message from "Father Chrysostomos via RT" <perlbug-followup@perl.org> > of "Sun, 28 Nov 2010 13:16:27 PST." > <rt-3.6.HEAD-13564-1290978986-1559.63446-15-0@perl.org> >
> > Furthermore, Perl’s strings are not just Unicode. Unicode strings > > are merely a subset of the strings that Perl supports.
> > Quite. >
> > Regular expressions are for looking at strings. So it should not warn > > or die based on the contents of the string, as long as it is a valid > > Perl string.
>
> > perl already warns for "\x{d800}" and chr 0xd800. So if such a string > > is passed to a regular expression, we get multiple warnings for the > > same character.
> > It’s true. >
> > I use Perl’s strings for storing 16-bit binary data. The result is > > that not only the code creating such strings, but any code looking at > > the strings, has to turn off utf8 warnings. So I can’t use any CPAN > > modules such as Data::Dump::Streamer.
> > That seems unfortunate. >
> > I propose we stop the regular expression engine from rejecting or > > warning about these characters altogether. The only checking should be > > for code that creates such characters or for I/O layers.
>
> What you’ve written seems perfectly reasonable, even desirable.
>
> However, I am rather concerned that this could lead to anomalous behavior.
> Here’s the kind of thing I don’t believe we want to see happen in Perl.
>
> Java’s pattern matching acts completely nutty when presented with the kind
> of data you’re talking about. It never warns or dies, just gives logically
> inconsistent results. I hope we do not embark down a road that leads to
> this kind of nonsense.
>
> To demonstrate, I’ll use U+1F47E, ALIEN MONSTER. That’s a 0xD83D and a
> 0xDC7E in UTF-16BE. Here are the crazy things the Java pattern matcher
> does with that. I’ve all-capped results I find most troubling below.
> A surrogate pair tests as a single character, but so does half of that
> pair. Plus if you flip the order of the surrogates, you now get utterly
> illogical results, with the matcher claiming it has neither surrogates nor
> nonsurrogates in it, nor any characters at all, yet still allowing some
> things to match nonetheless, but others to senselessly fail.
None of that will happen in perl, because 0xDC7E and U+1F47E are completely unrelated characters, as far as it is concerned.

$ perl -le' print "yes" if "\x{1F47E}" =~ /\p{Cs}/'
$ perl -le' print "yes" if "\x{DC7E}" =~ /\p{Cs}/'
yes

$ perl -le' print "yes" if "\x{1F47E}" =~ /^.\z/'
yes
$ perl -le' print "yes" if "\x{D83D}\x{DC7E}" =~ /^.\z/'
> I’m not sure what you would > see done with “raw data”, but I sure do hope it’s nothing at all > like *that*! ☹
It will be treated the same way as \x{110000}-\x{ffffffff}, except that \p{Cs} can match a surrogate and there is no such shorthand for \x{110000}-\x{ffffffff}.

I’m just making the utf8-warning implementation the same as the non-utf8-warning implementation.

BTW, here are two more patches.
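
The point that perl sees these as unrelated characters can also be checked without the regexp engine at all. A small sketch, with variable names chosen only for this example:

```perl
use strict;
use warnings;

# One supplementary character versus two separate surrogate code points.
my ($astral, $pair) = do {
    no warnings 'utf8';
    ("\x{1F47E}", "\x{D83D}\x{DC7E}");
};

printf "astral: %d character(s)\n", length $astral;        # 1
printf "pair:   %d character(s)\n", length $pair;          # 2
printf "equal:  %s\n", $astral eq $pair ? 'yes' : 'no';    # no
```

Since a Perl string holds code points rather than UTF-16 code units, the pair is never reinterpreted as U+1F47E.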
From: Father Chrysostomos <sprout@cpan.org>

[perl #63446] "x" =~ /\x/ for all characters

This makes "x" =~ /\x/ work for all characters that are not ASCII
letters or numbers, regardless of utf8 warnings.

diff -up blead-63446-utf8-warnings4/regcomp.c blead-63446-utf8-warnings5/regcomp.c
--- blead-63446-utf8-warnings4/regcomp.c	2010-11-28 11:03:40.000000000 -0800
+++ blead-63446-utf8-warnings5/regcomp.c	2010-11-28 14:24:16.000000000 -0800
@@ -8326,7 +8326,7 @@ parseit:
 		if (UTF) {
 		    value = utf8n_to_uvchr((U8*)RExC_parse,
 				   RExC_end - RExC_parse,
-				   &numlen, UTF8_ALLOW_DEFAULT);
+				   &numlen, UTF8_ALLOW_ANYUV);
 		    RExC_parse += numlen;
 		}
 		else
diff -up blead-63446-utf8-warnings4/t/re/beyond_unicode.t blead-63446-utf8-warnings5/t/re/beyond_unicode.t
--- blead-63446-utf8-warnings4/t/re/beyond_unicode.t	2010-11-28 14:06:55.000000000 -0800
+++ blead-63446-utf8-warnings5/t/re/beyond_unicode.t	2010-11-28 14:44:59.000000000 -0800
@@ -10,7 +10,7 @@ BEGIN {
     require './test.pl';
 }
 
-plan 3;
+plan 4;
 
 my @bad;
 
@@ -18,7 +18,7 @@ sub test_against_many_chars(&$) {
     my($test, $name) = @::_;
     @bad = ();
     for(0..0x1200) {
-	next if rand > .25;
+	next if rand > .125;
 	&$test([ do { no warnings 'utf8'; map chr, $_<<8 .. $_+1<<8 } ]);
     }
     ok !@bad, $name;
@@ -42,3 +42,10 @@ test_against_many_chars {
     my $c = join "", @{$_[0]};
     push @bad, $_ if $c !~ "^[\Q$c\E]+\\z";
 } '$foo =~ /[$foo]/ for every character';
+
+test_against_many_chars {
+    # Skip this for the ASCII range, as "a" =~ /\a/ obviously does not match.
+    return if !$_;
+    my $c = join "", @{$_[0]};
+    push @bad, $_ if $c !~ ("^[" . ($c =~ s/(.)/\\$1/gross) . "]+\\z");
+} '"x" =~ /[\x]/ for every character';
Subject: Re: [perl #63446] utf8 fatal warning
Date: Sun, 28 Nov 2010 15:50:49 -0800
To: Perl5 Porters <perl5-porters [...] perl.org>
From: Father Chrysostomos <sprout [...] cpan.org>
On Sun Nov 28 14:54:51 2010, sprout wrote:
> BTW, here are two more patches.
RT did not like those files. Let’s try this again:

Message body is not shown because sender requested not to inline it.


CC: "OtherRecipients of perl Ticket #63446:;" [...] smtp.indra.com, perl5-porters [...] perl.org, tchrist [...] perl.com
Subject: Re: [perl #63446] utf8 fatal warning
Date: Sun, 28 Nov 2010 19:30:19 -0700
To: perlbug-followup [...] perl.org
From: karl williamson <public [...] khwilliamson.com>
Father Chrysostomos via RT wrote:
> On Sun Nov 28 14:42:02 2010, tom christiansen wrote:
>> In-Reply-To: >> >> Message from "Father Chrysostomos via RT" <perlbug-followup@perl.org> >> of "Sun, 28 Nov 2010 13:16:27 PST." >> <rt-3.6.HEAD-13564-1290978986-1559.63446-15-0@perl.org> >>
>>> Furthermore, Perl’s strings are not just Unicode. Unicode strings >>> are merely a subset of the strings that Perl supports.
>> Quite. >>
>>> Regular expressions are for looking at strings. So it should not warn >>> or die based on the contents of the string, as long as it is a valid >>> Perl string. >>> perl already warns for "\x{d800}" and chr 0xd800. So if such a string >>> is passed to a regular expression, we get multiple warnings for the >>> same character.
>> It’s true. >>
>>> I use Perl’s strings for storing 16-bit binary data. The result is >>> that not only the code creating such strings, but any code looking at >>> the strings, has to turn off utf8 warnings. So I can’t use any CPAN >>> modules such as Data::Dump::Streamer.
>> That seems unfortunate. >>
>>> I propose we stop the regular expression engine from rejecting or >>> warning about these characters altogether. The only checking should be >>> for code that creates such characters or for I/O layers.
>> What you’ve written seems perfectly reasonable, even desirable.
>>
>> However, I am rather concerned that this could lead to anomalous behavior.
>> Here’s the kind of thing I don’t believe we want to see happen in Perl.
>>
>> Java’s pattern matching acts completely nutty when presented with the kind
>> of data you’re talking about. It never warns or dies, just gives logically
>> inconsistent results. I hope we do not embark down a road that leads to
>> this kind of nonsense.
>>
>> To demonstrate, I’ll use U+1F47E, ALIEN MONSTER. That’s a 0xD83D and a
>> 0xDC7E in UTF-16BE. Here are the crazy things the Java pattern matcher
>> does with that. I’ve all-capped results I find most troubling below.
>> A surrogate pair tests as a single character, but so does half of that
>> pair. Plus if you flip the order of the surrogates, you now get utterly
>> illogical results, with the matcher claiming it has neither surrogates nor
>> nonsurrogates in it, nor any characters at all, yet still allowing some
>> things to match nonetheless, but others to senselessly fail.
> > None of that will happen in perl, because 0xDC7E and U+1F47E are > completely unrelated characters, as far as it is concerned. > > $ perl -le' print "yes" if "\x{1F47E}" =~ /\p{Cs}/' > $ perl -le' print "yes" if "\x{DC7E}" =~ /\p{Cs}/' > yes > > $ perl -le' print "yes" if "\x{1F47E}" =~ /^.\z/' > yes > $ perl -le' print "yes" if "\x{D83D}\x{DC7E}" =~ /^.\z/' >
>> I’m not sure what you would >> see done with “raw data”, but I sure do hope it’s nothing at all >> like *that*! ☹
>
> It will be treated the same way as \x{110000}-\x{ffffffff}, except that
> \p{Cs} can match a surrogate and there is no such shorthand for
> \x{110000}-\x{ffffffff}.
>
> I’m just making the utf8-warning implementation the same as the
> non-utf8-warning implementation.
>
> BTW, here are two more patches.
>
I have some uneasiness about this. It needs ample vetting here.

First, to make sure you know, I am planning to shortly change things so that the non-characters and above-Unicode code points do not by default warn except in I/O. The fixes to do that are more minimal than your patches.

I had thought of doing that with surrogates as well, but this met with resistance. This was some months back. So I didn't even propose it with my most recent postings, the last one of which got no response, which I take to mean that I had finally addressed all the concerns expressed earlier.

It seems to me that the best solution would be a way to declare a binary string, and it would be illegal to operate on it using things that require semantics beyond the ordinal. So /i would not be valid, nor uc(), nor /\w/, etc, etc. But that might be construed as being against Perl philosophy.
RT-Send-CC: perl5-porters [...] perl.org, tchrist [...] perl.com, public [...] khwilliamson.com
On Sun Nov 28 18:32:05 2010, public@khwilliamson.com wrote:
> I have some uneasiness about this. It needs ample vetting here. > > First, to make sure you know, I am planning to shortly change things > so > that the non-characters and above-Unicode code points do not by > default > warn except in I/O.
If warnings are on, right?
> The fixes to do that are more minimal than your > patches.
I’d better stop, then. :-)
> > I had thought of doing that with surrogates as well, but this met with > resistance.
Can you give me a reference?
> This was some months back. So I didn't even propose it > with my most recent postings, the last one of which got no response, > which I take to mean that I had finally addressed all the concerns > expressed earlier. > > It seems to me that the best solution would be a way to declare a > binary > string, and it would be illegal to operate on it using things that > require semantics beyond the ordinal. So /i would not be valid, nor > uc(), nor /\w/, etc, etc. But that might be construed as being > against > Perl philosophy.
/i and \x{d800} are orthogonal, so neither one should stop the other from working. Whether I/O, chr and "\x{...}" warn or not, as long as I can turn off the warning with ‘no warnings "utf8"’, does not matter to me. But I reiterate that regular expressions should never warn or die for valid Perl strings. That’s a bit like adding uninitialized warnings to ‘defined’.
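The behaviour being argued about can be probed with a small script. This is only a sketch: `match_surrogate` is a hypothetical helper name, and what it reports (a clean "no match", a warning, or a fatal error) is expected to differ between perl releases, so no particular output is claimed here.

```perl
use strict;
use warnings;

# Hypothetical helper: match a string holding a lone surrogate against a
# character class, recording anything perl warns about or dies with.
sub match_surrogate {
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    my $matched = eval {
        # Build the string with utf8 warnings off, as in the original report.
        my $str = do { no warnings 'utf8'; "\x{d800}" };
        $str =~ /\A[\x{123}]/ ? 1 : 0;
    };
    my $error = $@;
    return { matched => $matched, died => $error, warnings => \@warnings };
}

my $r = match_surrogate();
print "matched: ", (defined $r->{matched} ? $r->{matched} : 'undef'), "\n";
print "died: $r->{died}" if $r->{died};
print "warning: $_" for @{ $r->{warnings} };
```

The regexp itself runs with warnings enabled, which is the situation the ticket complains about; only the creation of the string is shielded by `no warnings 'utf8'`.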
CC: perl5-porters [...] perl.org, public [...] khwilliamson.com
Subject: Re: [perl #63446] utf8 fatal warning
Date: Mon, 29 Nov 2010 11:05:50 +0100
To: perlbug-followup [...] perl.org
From: demerphq <demerphq [...] gmail.com>
On 28 November 2010 22:16, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:
> On Sun Mar 21 19:08:59 2010, khw wrote:
>> On Tue Feb 24 13:27:21 2009, zefram@fysh.org wrote:
>> > $ perl -le 'print do { no warnings "utf8"; "\x{d800}" } =~ >> >    /\A[\x{123}]/ ? "yes" : "no"' >> > no >> > $ perl -lwe 'print do { no warnings "utf8"; "\x{d800}" } =~ >> >    /\A[\x{123}]/ ? "yes" : "no"' >> > Malformed UTF-8 character (fatal) at -e line 1. >> > $ >> > >> > Turning warnings on makes the regexp operation die, whereas with >> >    warnings >> > off it produced the correct answer.  If 'no warnings "utf8"' is in >> >    scope >> > at the regexp op, then the error does not occur.
>> >> I'm not sure what to do about this ticket.  The basics of it anyway are >> behaving as designed, which is that non-characters and surrogates >> generate errors unless warnings are turned off, but then things should >> work.
> > It may be working as designed, but it was not designed very well. >
>> The message in 5.12 for U+FFFF has been clarified that this >> character is illegal for interchange.  This should be extended in a >> later release to the other 65 noncharacters. >> >> Surrogates, on the other hand, should never appear in well-formed utf8, >> and there are security considerations for doing so that I don't fully >> understand but can see why.
> > The regular expression engine is not a security layer. It should not > pretend to be one. If I want to implement a security layer using regular > expressions, then this bug (yes, I do consider it a bug) will get in the > way. > > Furthermore, Perl’s strings are not just Unicode. Unicode strings are > merely a subset of the strings that Perl supports. > > Regular expressions are for looking at strings. So it should not warn or > die based on the contents of the string, as long as it is a valid Perl > string. > > perl already warns for "\x{d800}" and chr 0xd800. So if such a string is > passed to a regular expression, we get multiple warnings for the same > character. > > I use Perl’s strings for storing 16-bit binary data. The result is that > not only the code creating such strings, but any code looking at the > strings, has to turn off utf8 warnings. So I can’t use any CPAN modules > such as Data::Dump::Streamer. > > I propose we stop the regular expression engine from rejecting or > warning about these characters altogether. The only checking should be > for code that creates such characters or for I/O layers.
I agree, except that I would include /i matches.

Using /i on a unicode flagged string implies you want (our brand of) unicode folding semantics.

In order to make that work effectively we need to be able to depend on the utf8 data following the rules.

So I think it's just fine if the case-folding logic warns about something. But I agree that the regex engine should not block case-sensitive matches.

Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
CC: John Gardiner Myers <jgmyers [...] proofpoint.com>, perl5-porters [...] perl.org, tchrist [...] perl.com
Subject: Re: [perl #63446] utf8 fatal warning
Date: Mon, 29 Nov 2010 12:52:09 -0700
To: perlbug-followup [...] perl.org
From: karl williamson <public [...] khwilliamson.com>
Father Chrysostomos via RT wrote:
> On Sun Nov 28 18:32:05 2010, public@khwilliamson.com wrote:
>> I have some uneasiness about this. It needs ample vetting here. >> >> First, to make sure you know, I am planning to shortly change things >> so >> that the non-characters and above-Unicode code points do not by >> default >> warn except in I/O.
> > If warnings are on, right?
Yes, I keep forgetting to say that.
>> The fixes to do that are more minimal than your >> patches.
> > I’d better stop, then. :-)
At least until we see how this resolves, anyway.
>> I had thought of doing that with surrogates as well, but this met with >> resistance.
> > Can you give me a reference?
The only relatively recent one I can find is a really mild comment from Yves saying he would need to think about the ramifications. I thought there were more. There certainly is discussion on the recent thread
http://groups.google.com/group/perl.perl5.porters/browse_thread/thread/501f0059709a973b/2599e7219597cec4?lnk=gst&q=non-character

But I ran across this very similar discussion from two years ago:
http://rt.perl.org/rt3//Public/Bug/Display.html?id=51936

I'm willing to make surrogates internally allowed by default, like non-characters, if the consensus is that it's ok to do so. They most definitely would continue to be warned about on I/O. Part of the problem with them is that the Unicode standard says they should not be in well-formed utf8. John G. Myers can address this (cc'd).
>> This was some months back. So I didn't even propose it >> with my most recent postings, the last one of which got no response, >> which I take to mean that I had finally addressed all the concerns >> expressed earlier. >> >> It seems to me that the best solution would be a way to declare a >> binary >> string, and it would be illegal to operate on it using things that >> require semantics beyond the ordinal. So /i would not be valid, nor >> uc(), nor /\w/, etc, etc. But that might be construed as being >> against >> Perl philosophy.
> > /i and \x{d800} are orthogonal, so neither one should stop the other > from working.
This brings up another question that occurred to me. Didn't you say you were processing binary data? If so, then why is it encoded in utf8?
> Whether I/O, chr and "\x{...}" warn or not, as long as I can turn off > the warning with ‘no warnings "utf8"’, does not matter to me. > > But I reiterate that regular expressions should never warn or die for > valid Perl strings. That’s a bit like adding uninitialized warnings to > ‘defined’. >
I think most of us agree. If Perl stored its strings internally in U32 words instead of U8 utf8 bytes, I don't think there would be this discussion, or the earlier ones.
RT-Send-CC: perl5-porters [...] perl.org, public [...] khwilliamson.com
On Mon Nov 29 11:53:45 2010, public@khwilliamson.com wrote:
> This brings up another question that occurred to me. Didn't you say > you > were processing binary data? If so, then why is it encoded in utf8?
By 16-bit binary data, I mean sequences of unsigned 16-bit integers. It is perl that happens to use utf8 internally to represent it, but I should not have to worry about that.
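A minimal sketch of that usage, assuming one character per 16-bit value; `ints_to_string` and `string_to_ints` are hypothetical names used only for illustration. The sample data deliberately includes surrogate values and U+FFFF, the code points the ticket says trigger the complaint.

```perl
use strict;
use warnings;

# Hypothetical helpers: store unsigned 16-bit integers one per character.
sub ints_to_string {
    no warnings 'utf8';    # some perls warn when chr() makes a surrogate
    return join '', map { chr($_) } @_;
}

sub string_to_ints {
    my ($str) = @_;
    return map { ord($_) } split //, $str;
}

my @data = (0x0041, 0x00FF, 0xD800, 0xDFFF, 0xFFFF);
my $str  = ints_to_string(@data);
my @back = string_to_ints($str);
printf "%d characters: %s\n", length($str),
    join ' ', map { sprintf '%04X', $_ } @back;
```

Nothing here looks at the internal encoding: the code only relies on `chr`, `ord`, `length` and `split` treating the string as a sequence of integers.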
RT-Send-CC: perl5-porters [...] perl.org, demerphq [...] gmail.com
On Mon Nov 29 02:06:16 2010, demerphq wrote:
> On 28 November 2010 22:16, Father Chrysostomos via RT > <perlbug-followup@perl.org> wrote:
> > I propose we stop the regular expression engine from rejecting or > > warning about these characters altogether. The only checking should be > > for code that creates such characters or for I/O layers.
> > I agree, except that I would include /i matches. > > Using /i on a unicode flagged string implies you want (our brand of) > unicode folding semantics. > > In order to make that work effectively we need to be able to depend on > the utf8 data following the rules. > > So i think its just fine if the case-folding logic warns about something.
That makes case-tolerance conceptually more complex than it needs to be. I thought /σ/i was supposed to be equivalent to /[Σσς]/, but you seem to be saying that the former would warn while the latter would not.
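The equivalence assumed above can be checked for the three sigma forms. This is a sketch with a hypothetical helper, `matches_both`; the code points are written as escapes so the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;

my %sigma = (
    'capital sigma' => "\x{3A3}",
    'small sigma'   => "\x{3C3}",
    'final sigma'   => "\x{3C2}",
);

# Hypothetical helper: does the string match both formulations?
sub matches_both {
    my ($str) = @_;
    my $folded = $str =~ /\A\x{3C3}\z/i                ? 1 : 0;
    my $class  = $str =~ /\A[\x{3A3}\x{3C3}\x{3C2}]\z/ ? 1 : 0;
    return ($folded, $class);
}

for my $name (sort keys %sigma) {
    my ($folded, $class) = matches_both($sigma{$name});
    print "$name: /i=$folded class=$class\n";
}
```

If the two columns ever disagreed, or one form warned where the other did not, the two patterns would no longer be interchangeable, which is the concern raised above.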
CC: perl5-porters [...] perl.org
Subject: Re: [perl #63446] utf8 fatal warning
Date: Mon, 13 Dec 2010 22:42:00 +0100
To: perlbug-followup [...] perl.org
From: demerphq <demerphq [...] gmail.com>
On 12 December 2010 22:04, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:
> On Mon Nov 29 02:06:16 2010, demerphq wrote:
>> On 28 November 2010 22:16, Father Chrysostomos via RT >> <perlbug-followup@perl.org> wrote:
>> > I propose we stop the regular expression engine from rejecting or >> > warning about these characters altogether. The only checking should be >> > for code that creates such characters or for I/O layers.
>> >> I agree, except that I would include /i matches. >> >> Using /i on a unicode flagged string implies you want (our brand of) >> unicode folding semantics. >> >> In order to make that work effectively we need to be able to depend on >> the utf8 data following the rules. >> >> So i think its just fine if the case-folding logic warns about something.
> > That makes case-tolerance conceptually more complex than it needs to be. > I thought /σ/i was supposed to be equivalent to /[Σσς]/, but you seem to > be saying that the former would warn while the latter would not.
What I was saying was that if we are doing a case insensitive match and you put a \x{D800} in your string or pattern, then we would be entitled to warn, as that codepoint is reserved for representing high value codepoints that cannot be expressed in 16 bits, and does not represent a "character" at all, and thus cannot be folded. Similar story for codepoints > 10FFFF etc.

In other words, we should NOT warn if someone wants to match a string against \x{D800} or match a string containing \x{D800} against a case-sensitive pattern, as there is no reason to ascribe semantic meaning to the codepoints when doing case-sensitive matching. However when we case fold we must ascribe semantic meaning to the codepoints, and when we encounter one that is illegal it makes sense to say so.

Also, just as a note, in the early days the character class notation was created so that people had a shorthand way to write (a|b|c|d) type constructs. With Unicode folding rules, where the folded version of a string can be longer than the original, this doesn't make quite as much sense. For instance /[\x{DF}-\x{FF}]/i becomes problematic, as \xDF folds to 'ss', which a character class can't match, and even more bizarre, what exactly does it mean to have a range with a multi-codepoint string as the startpoint?

cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
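The multi-character fold mentioned above can be seen with the `fc` built-in, which needs perl 5.16 or later (well after this thread). The helper names `fold_of` and `ss_matches_sharp_s` are hypothetical, and the sketch says nothing about how character classes or ranges treat such folds.

```perl
use v5.16;    # turns on strict, the fc() built-in and Unicode string rules
use warnings;

# Hypothetical helper: the full case fold of a single code point.
sub fold_of {
    my ($cp) = @_;
    return fc(chr $cp);
}

# Hypothetical helper: does "ss" match U+00DF used as a bare /i pattern?
sub ss_matches_sharp_s {
    return "ss" =~ /\A\x{DF}\z/i ? 1 : 0;
}

my $fold = fold_of(0xDF);
say "fold of U+00DF: '$fold' (", length($fold), " characters)";
say "'ss' matches /\\x{DF}/i: ", ss_matches_sharp_s() ? "yes" : "no";
```

A fold that grows from one character to two is what makes a range endpoint such as \x{DF} hard to define under /i.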
Subject: Re: [perl #63446] utf8 fatal warning
Date: Mon, 13 Dec 2010 16:14:26 -0800
To: perl5-porters [...] perl.org
From: Reverend Chip <rev.chip [...] gmail.com>
On 12/12/2010 1:02 PM, Father Chrysostomos via RT wrote:
> On Mon Nov 29 11:53:45 2010, public@khwilliamson.com wrote:
>> This brings up another question that occurred to me. Didn't you say >> you >> were processing binary data? If so, then why is it encoded in utf8?
> By 16-bit binary data, I mean sequences of unsigned 16-bit integers. It > is perl that happens to use utf8 internally to represent it, but I > should not have to worry about that.
I hate to have to disagree, but: "UTF8" means "UCS Translation Format - 8-bit", and "UCS" means "Universal Character Set", i.e. Unicode. Unicode semantics _are_ part of what Perl supports, so Perl is entitled to give Unicode-specific meaning to the code points it finds therein. What you seem to want is for Perl to support "arbitrary integers encoded as variable-length byte strings using the same encoding tricks as UTF8", and of course it is possible that this could have been done, but that's not what Perl actually promises to do. So complaining when Perl takes seriously the "u" in "utf8" seems ill-founded. Unless, that is, I have been grievously misinformed?
CC: perl5-porters [...] perl.org
Subject: Re: [perl #63446] utf8 fatal warning
Date: Mon, 13 Dec 2010 20:24:37 -0500
To: Reverend Chip <rev.chip [...] gmail.com>
From: Eric Brine <ikegami [...] adaelis.com>
On Mon, Dec 13, 2010 at 7:14 PM, Reverend Chip <rev.chip@gmail.com> wrote:
> On 12/12/2010 1:02 PM, Father Chrysostomos via RT wrote:
> > By 16-bit binary data, I mean sequences of unsigned 16-bit integers. It > > is perl that happens to use utf8 internally to represent it, but I > > should not have to worry about that.
> > I hate to have to disagree, but: "UTF8" means "UCS Translation Format - > 8-bit", and "UCS" means "Universal Character Set", i.e. Unicode. >
The name of some internal flag is of very little importance.

Perl currently supports strings of arbitrary 32-bit numbers in 32-bit builds, and strings of arbitrary 64-bit numbers in 64-bit builds. I don't know of any documentation to the contrary (but I'm not familiar with the latest).

$ perl -E'say 0xFFFFFFFFFFFFFFFF'
18446744073709551615
$ perl -E'say ord chr 0xFFFFFFFFFFFFFFFF'
18446744073709551615
$ perl -MEncode -E'$x=chr 0xFFFFFFFFFFFFFFFF; Encode::_utf8_off($x); say length($x)'
13

Despite being named "UTF8", the flag clearly does not imply adherence to UTF-8. (Obviously, uc() and the regex engine will assign meaning to those numbers, but that's unrelated.)

It may be that Perl should be changed so its strings are confined to strings of Unicode characters, but basing your argument on the name of some internal flag makes the argument unconvincing.

- Eric
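The same point can be made as a script rather than one-liners. This sketch uses smaller values than the transcript above, on the assumption that the very largest values may be treated differently by other perl releases; `round_trip` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: round-trip one code point through a Perl string.
sub round_trip {
    my ($cp) = @_;
    no warnings 'utf8';    # some perls warn about values above 0x10FFFF
    my $str = chr($cp);
    return (length($str), ord($str));
}

for my $cp (0x10FFFF, 0x110000, 0x7FFFFFFF) {
    my ($len, $back) = round_trip($cp);
    printf "0x%X -> length %d, ord 0x%X\n", $cp, $len, $back;
}
```

Two of the three sample values lie above the Unicode maximum of 0x10FFFF, yet each is expected to come back as a single character with its ordinal intact.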
CC: perl5-porters [...] perl.org
Subject: Re: [perl #63446] utf8 fatal warning
Date: Mon, 13 Dec 2010 22:22:28 -0800
To: Eric Brine <ikegami [...] adaelis.com>
From: Reverend Chip <rev.chip [...] gmail.com>
On 12/13/2010 5:24 PM, Eric Brine wrote:
> On Mon, Dec 13, 2010 at 7:14 PM, Reverend Chip <rev.chip@gmail.com > <mailto:rev.chip@gmail.com>> wrote: > > On 12/12/2010 1:02 PM, Father Chrysostomos via RT wrote:
> > By 16-bit binary data, I mean sequences of unsigned 16-bit
> integers. It
> > is perl that happens to use utf8 internally to represent it, but I > > should not have to worry about that.
> > I hate to have to disagree, but: "UTF8" means "UCS Translation > Format - > 8-bit", and "UCS" means "Universal Character Set", i.e. Unicode. > > > The name of some internal flag is of very little importance.
That would be true, if it were purely an internal flag. But the flag is very external, both in code ("utf8::upgrade()") and in documentation. Please try harder.
> Perl currently supports strings of arbitrary 32-bit numbers in 32-bit > builds, and strings of arbitrary 64-bit numbers in 64-bit builds. I > don't know of any documentation to the contrary...
Well of course Perl is designed to perform as gracefully as possible as the Unicode committee(s) assign new code points; to do otherwise would be downright stupid. But that forward-looking design is irrelevant to the fact that Perl knows the strings _are_ Unicode. As for documentary evidence, from the many possible choices I pick mostly at random this quotation from perlunicode -- which is referenced from utf8's documentation, natch: "Unless explicitly stated, Perl operators use character semantics for Unicode data and byte semantics for non-Unicode data." The above-quoted objections are hardly worth knocking down. Please, please try harder.
CC: Eric Brine <ikegami [...] adaelis.com>, perl5-porters [...] perl.org
Subject: Re: [perl #63446] utf8 fatal warning
Date: Tue, 14 Dec 2010 08:24:57 +0100
To: Reverend Chip <rev.chip [...] gmail.com>
From: demerphq <demerphq [...] gmail.com>
On 14 December 2010 07:22, Reverend Chip <rev.chip@gmail.com> wrote:
> On 12/13/2010 5:24 PM, Eric Brine wrote:
>> On Mon, Dec 13, 2010 at 7:14 PM, Reverend Chip <rev.chip@gmail.com >> <mailto:rev.chip@gmail.com>> wrote: >> >>     On 12/12/2010 1:02 PM, Father Chrysostomos via RT wrote: >>     > By 16-bit binary data, I mean sequences of unsigned 16-bit >>     integers. It >>     > is perl that happens to use utf8 internally to represent it, but I >>     > should not have to worry about that. >> >>     I hate to have to disagree, but:  "UTF8" means "UCS Translation >>     Format - >>     8-bit", and "UCS" means "Universal Character Set", i.e. Unicode. >> >> >> The name of some internal flag is of very little importance.
> > That's would be true, if it were purely an internal flag.  But the flag > is very external, both in code ("utf8::upgrade()") and in > documentation.  Please try harder. >
>> Perl currently supports strings of arbitrary 32-bit numbers in 32-bit >> builds, and strings of arbitrary 64-bit numbers in 64-bit builds. I >> don't know of any documentation to the contrary...
> > Well of course Perl is designed to perform as gracefully as possible as > the Unicode committee(s) assign new code points; to do otherwise would > be downright stupid.  But that forward-looking design is irrelevant to > the fact that Perl knows the strings _are_ Unicode.   As for documentary > evidence, from the many possible choices I pick mostly at random this > quotation from perlunicode -- which is referenced from utf8's > documentation, natch: > >    "Unless explicitly stated, Perl operators use character semantics > for Unicode data and byte semantics for non-Unicode data."
You are a bit misinformed. The internals specifically contemplated handling the utf8 encoding as a way to implement packed arrays of 32 bit integers.

It is only when we must ascribe meaning to codepoints, such as when we do case change operations or case insensitive matching, that we ascribe semantic meaning to the values.

There is no reason not to allow \x{D800} to be stored in a utf8 string, except if someone wants to treat that string as having meaning under Unicode. It's not Perl's nature to say "you can't do that - Unicode doesn't agree" except when we have no other choice.

cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Eric Brine <ikegami [...] adaelis.com>, perl5-porters [...] perl.org
Subject: Re: [perl #63446] utf8 fatal warning
Date: Tue, 14 Dec 2010 10:39:13 -0800
To: demerphq <demerphq [...] gmail.com>
From: Reverend Chip <rev.chip [...] gmail.com>
On 12/13/2010 11:24 PM, demerphq wrote:
> On 14 December 2010 07:22, Reverend Chip <rev.chip@gmail.com> wrote:
>> On 12/13/2010 5:24 PM, Eric Brine wrote:
>>> Perl currently supports strings of arbitrary 32-bit numbers in 32-bit >>> builds, and strings of arbitrary 64-bit numbers in 64-bit builds. I >>> don't know of any documentation to the contrary...
>> Well of course Perl is designed to perform as gracefully as possible as >> the Unicode committee(s) assign new code points; to do otherwise would >> be downright stupid. But that forward-looking design is irrelevant to >> the fact that Perl knows the strings _are_ Unicode. As for documentary >> evidence, from the many possible choices I pick mostly at random this >> quotation from perlunicode -- which is referenced from utf8's >> documentation, natch: >> >> "Unless explicitly stated, Perl operators use character semantics >> for Unicode data and byte semantics for non-Unicode data."
> You are a bit misinformed. The internals specifically contemplated > handling the utf8 encoding as a way to implement packed arrays of 32 > bit integers.
Code cannot contemplate. What are you trying to say? A hypothetical leveraging of the utf8 support code for some other purpose is off topic.
> It is only when we must ascribe meaning to codepoints, such as when we > do case change operations, or case insensitive matching that we > ascribe semantic meaning to the values.
Well, of course. Unnecessary validation work is unnecessary. Still, Perl knows it's Unicode.
> There is no reason not to allow \x{D800} to be stored in a utf8 > string, except if someone wants to treat that string as having meaning > under unicode.
Perl does treat the string as having meaning under Unicode. This is established. Now if a programmer decides to play a game in which he puts illegal code points into Unicode strings because Perl's validation is lazy, well, that's a game that programmer may win and may lose; but in any case, he has no grounds to complain when Perl's validation catches up with him.
> Its not perls nature to say "you cant do that - unicode > doesn't agree" except when we have no other choice.
Perl's nature includes both compliance and integrity. It's established and documented that Perl's "utf8" is a representation of Unicode; that's not a lie, but a truth that some people are in denial about. Perl interprets your commands within that context. So its compliance has limits and conditions. It has always been thus.
CC: perl5-porters [...] perl.org
Subject: Re: [perl #63446] utf8 fatal warning
Date: Tue, 14 Dec 2010 15:31:15 -0500
To: Reverend Chip <rev.chip [...] gmail.com>
From: Eric Brine <ikegami [...] adaelis.com>
On Tue, Dec 14, 2010 at 1:46 PM, Reverend Chip <rev.chip@gmail.com> wrote:
> Perl is made of its operators, in part. If operators treat the strings as Unicode, then Perl does.
But substr, length and index don't treat strings as Unicode. It doesn't assign any meaning to the characters. That's why I can use substr, length and index on strings of arbitrary integers (e.g. iso-latin-15, JFIF image, etc). I hope you're not saying I'm misusing substr by using it on binary data.

So the only question that leaves is what's the limit in the size of the integers. Nowhere does it mention that the integers are limited to 8-bits, and it's not limited to 8-bits in practice. (It's limited to UV.) That's an assumption you're carrying over from C or something.

I'm not saying we should support more than Unicode or not, just that we currently do support more than Unicode.

- Eric
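A short sketch of the operators named above working on a string of arbitrary integers. `build_string` and `position_of` are hypothetical helper names, and the sample values (a lone surrogate, a noncharacter, a value above 0x10FFFF) are chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: a string mixing ASCII, a lone surrogate, a
# noncharacter and a value above the Unicode range.
sub build_string {
    no warnings 'utf8';
    return join '', map { chr($_) } 0x41, 0xD800, 0xFFFF, 0x110000, 0x42;
}

# Hypothetical helper: position of one code point within the string.
sub position_of {
    my ($haystack, $cp) = @_;
    no warnings 'utf8';
    return index($haystack, chr($cp));
}

my $str = build_string();
printf "length: %d\n", length($str);
printf "index of 0xFFFF: %d\n", position_of($str, 0xFFFF);
printf "substr(1, 2) code points: %s\n",
    join ' ', map { sprintf '0x%X', ord($_) } split //, substr($str, 1, 2);
```

All three operators count in characters, so the positions are the same whatever bytes perl uses internally to store each value.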
CC: perl5-porters [...] perl.org
Subject: Re: [perl #63446] utf8 fatal warning
Date: Tue, 14 Dec 2010 13:44:17 -0800
To: Eric Brine <ikegami [...] adaelis.com>
From: Reverend Chip <rev.chip [...] gmail.com>
On 12/14/2010 12:31 PM, Eric Brine wrote:
> On Tue, Dec 14, 2010 at 1:46 PM, Reverend Chip <rev.chip@gmail.com > <mailto:rev.chip@gmail.com>> wrote: > > Perl is made of its operators, in part. If operators treat the > strings > > as Unicode, then Perl does. > > > But substr, length and index don't treat strings as Unicode.
Yes, they do. But their error checking is minimal for performance reasons. So you're getting away with cheating.
> It doesn't assign any meaning to the characters.
For the value of "it" that is Perl as a whole, I've already proven this point wrong; please spare us the repetition. If you mean each individual operator, then see above.
CC: Eric Brine <ikegami [...] adaelis.com>, perl5-porters [...] perl.org, Father Chrysostomos <sprout [...] cpan.org>, Zefram <zefram [...] fysh.org>, David Golden <xdaveg [...] gmail.com>
Subject: Re: [perl #63446] utf8 fatal warning
Date: Wed, 15 Dec 2010 12:55:28 -0700
To: Reverend Chip <rev.chip [...] gmail.com>
From: karl williamson <public [...] khwilliamson.com>
I'd like to come to some closure on this discussion. Let me start by stepping back and summarizing, first quoting from the Unicode Standard: "2.7 Unicode Strings "A Unicode string data type is simply an ordered sequence of code units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units, a Unicode 16-bit string is an ordered sequence of 16-bit code units, and a Unicode 32-bit string is an ordered sequence of 32-bit code units. "Depending on the programming environment, a Unicode string may or may not be required to be in the corresponding Unicode encoding form. For example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but are not necessarily well-formed UTF-16 sequences. In normal processing, it can be far more efficient to allow such strings to contain code unit sequences that are not well-formed UTF-16—that is, isolated surrogates. Because strings are such a fundamental component of every program, checking for isolated surrogates in every operation that modifies strings can create significant overhead, especially because supplementary characters are extremely rare as a percentage of overall text in programs worldwide. "It is straightforward to design basic string manipulation libraries that handle isolated surrogates in a consistent and straightforward manner. They cannot ever be interpreted as abstract characters, but they can be internally handled the same way as noncharacters where they occur. Typically they occur only ephemerally, such as in dealing with keyboard events. While an ideal protocol would allow keyboard events to contain complete strings, many allow only a single UTF-16 code unit per event. As a sequence of events is transmitted to the application, a string that is being built up by the application in response to those events may contain isolated surrogates at any particular point in time." 
And the definition of "abstract character":

"D7 Abstract character: A unit of information used for the organization, control, or representation of textual data.

"When representing data, the nature of that data is generally symbolic as opposed to some other kind of data (for example, aural or visual). Examples of such symbolic data include letters, ideographs, digits, punctuation, technical symbols, and dingbats.

"An abstract character has no concrete form and should not be confused with a glyph.

"An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a grapheme.

"The abstract characters encoded by the Unicode Standard are known as Unicode abstract characters.

"Abstract characters not directly encoded by the Unicode Standard can often be represented by the use of combining character sequences."

What that bureaucratese comes down to is that an abstract character is a code point that has been assigned a meaning, like LINE FEED or LATIN CAPITAL LETTER A. There are 0x110000 (1,114,112) code points in Unicode, ranging from 0 to 0x10FFFF; somewhat less than a quarter are currently assigned. There are 4 categories of code points that are not abstract characters (private use are separate, and not an issue):

1) Those that may be assigned in the future; they have General Category Cn.

2) Noncharacters, also having Gc=Cn. These are reserved for internal use by an application, and are illegal for interchange between applications. Perl handles these improperly, treating them like it does surrogates, except that for conversion from utf8 to an unsigned value, it knows about only one of the 66 of these. Asymmetrically, going from uv to utf8, it does know about all 66, but splits them into two groups, a distinction not present in the standard.

3) Beyond-Unicode code points. These are the code points above 0x10FFFF but fitting into whatever size word is available. Unicode has said that these will never be used by it.

4) Surrogates, having Gc=Cs, which are reserved for use in pairs in UTF-16 to allow > 16 bit code points to be specified.

My original proposal this time round was to fix the noncharacters to operate as the standard says, by allowing them by default internally. Only during I/O would they be checked for. It seems like a small extension to give this behavior as well to the beyond-Unicode code points. And it has been suggested that this behavior extend as well to the surrogates, so that any unsigned value can be represented as a "Perl string". Note that no one is proposing that any of these values be legal upon I/O.

In this thread, I don't think I've heard what the harm is of allowing surrogates internally. The above text from the standard seems to allow that possibility, as long as they don't represent an abstract character. So why not allow them (as they mostly are now when warnings are off)?
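Three of the four categories listed above can be recognised from the numeric value alone, which the following sketch illustrates. `classify` is a hypothetical helper; the ranges are the standard ones (surrogates U+D800 to U+DFFF, noncharacters U+FDD0 to U+FDEF plus the last two code points of each of the 17 planes). Telling unassigned code points from assigned ones needs the Unicode Character Database and is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: name the special category a code point falls into.
sub classify {
    my ($cp) = @_;
    return 'above-Unicode' if $cp > 0x10FFFF;
    return 'surrogate'     if $cp >= 0xD800 && $cp <= 0xDFFF;
    return 'noncharacter'  if $cp >= 0xFDD0 && $cp <= 0xFDEF;
    return 'noncharacter'  if ($cp & 0xFFFE) == 0xFFFE;   # last two of a plane
    return 'other';
}

printf "U+%04X: %s\n", $_, classify($_)
    for 0x41, 0xD800, 0xFDD0, 0xFFFF, 0x10FFFE, 0x110000;
```

Counting the code points this helper labels as noncharacters over 0 to 0x10FFFF gives the 66 mentioned above: 32 in the U+FDD0 block and 2 at the end of each of 17 planes.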
CC: Eric Brine <ikegami [...] adaelis.com>, perl5-porters [...] perl.org, Father Chrysostomos <sprout [...] cpan.org>, Zefram <zefram [...] fysh.org>, David Golden <xdaveg [...] gmail.com>
Subject: Re: [perl #63446] utf8 fatal warning
Date: Thu, 16 Dec 2010 15:53:32 -0800
To: karl williamson <public [...] khwilliamson.com>
From: Reverend Chip <rev.chip [...] gmail.com>
On 12/15/2010 11:55 AM, karl williamson wrote:
> "As a sequence of events is transmitted to the application, a string > that is being built up by the application in response to those events > may contain isolated surrogates at any particular point in time."
That is a much better explanation than I have previously made as to why Perl is so lax with its code point checking while it still knows, certainly, that its strings are Unicode. It makes sense that Perl must allow arbitrary sequences of code points because Perl can't know what the string is being used for; and too much error checking would render Perl less useful. For example, when translating JS/JSON/Java strings to Perl utf8 strings, it shouldn't be surprising or alarming to find at some point surrogates as 'characters' in the Perl string. Any time Perl code copes with UTF-16 might also have this happen. So Perl can't do much error checking on contained code points.
> My original proposal this time round was to fix the noncharacters to > operate as the standard says, by allowing them by default internally. > Only during I/O would they be checked for. It seems like a small > extension to give this behavior as well to the beyond-Unicode code > points. And it has been suggested that this behavior extend as well > to the surrogates, so that any unsigned value can be represented as a > "Perl string". Note that no one is proposing that any of these values > be legal upon I/o. > > In this thread, I don't think I've heard what the harm is of allowing > surrogates internally. The above text from the standard seems to > allow that possibility, as long as they don't represent an abstract > character. So why not allow them (as they mostly are now when > warnings are off)?
I think "allow" is too overloaded; perhaps I'm misunderstanding what you mean by it. Maybe this is violent agreement. :-(

That said, "harm" isn't the only relevant standard. There's also "correctness." If you ask Perl to do something with data and the data are not properly formed for what you ask it to do -- if Perl can't do it _correctly_ -- then we expect Perl to tell us. Compare C<"xyz" + 1>. A utf8 string where some of the code points are surrogates, which is being processed in a way that requires knowing the code points' semantics, like \p{whatever} or case operations, cannot be processed _correctly_ because there is no _correct_ answer. Therein may lie the harm you seek, btw: Perl silently acting on invalid data and producing invalid results without the programmer getting a warning about it.

So it seems to me, in the end, that the warnings on surrogates in \p{foo}, //i, lc, uc, etc. are important; but that we could document that set of operations that will warn, and guarantee to programmers that if they stay clear of those operators, they can put any pseudo-character in a utf8 string and we will promise to avert our collective gaze.
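The comparison with "xyz" + 1 can be made concrete by collecting warnings at run time. This is a sketch: `warnings_from` is a hypothetical helper, and whether the case operation on a surrogate warns is left for the running perl to report, since that is exactly what this thread is debating.

```perl
use strict;
use warnings;

# Hypothetical helper: run a code block and collect the warnings it emits.
sub warnings_from {
    my ($code) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    $code->();
    return @warnings;
}

# Data that is not properly formed for the operation requested of it.
my @numeric = warnings_from(sub { my $s = "xyz"; my $n = $s + 1; });
print "numeric: $_" for @numeric;

# A lone surrogate passed to a case operation.
my $surrogate = do { no warnings 'utf8'; chr(0xD800) };
my @case = warnings_from(sub { my $u = uc $surrogate; });
print "uc on surrogate: ", scalar(@case), " warning(s)\n";
```

The first call shows the kind of diagnostic being asked for: the operation still returns a result, but the programmer is told the input did not make sense for it.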
CC: Eric Brine <ikegami [...] adaelis.com>, perl5-porters [...] perl.org, Father Chrysostomos <sprout [...] cpan.org>, Zefram <zefram [...] fysh.org>, David Golden <xdaveg [...] gmail.com>
Subject: Re: [perl #63446] utf8 fatal warning
Date: Fri, 17 Dec 2010 14:51:42 -0700
To: Reverend Chip <rev.chip [...] gmail.com>
From: karl williamson <public [...] khwilliamson.com>
Reverend Chip wrote:
> On 12/15/2010 11:55 AM, karl williamson wrote:
>> "As a sequence of events is transmitted to the application, a string >> that is being built up by the application in response to those events >> may contain isolated surrogates at any particular point in time."
> > That is a much better explanation than I have previously made as to why > Perl is so lax with its code point checking while it still knows, > certainly, that its strings are Unicode. It makes sense that Perl must > allow arbitrary sequences of code points because Perl can't know what > the string is being used for; and too much error checking would render > Perl less useful. For example, when translating JS/JSON/Java strings to > Perl utf8 strings, it shouldn't be surprising or alarming to find at > some point surrogates as 'characters' in the Perl string. Any time Perl > code copes with UTF-16 might also have this happen. So Perl can't do > much error checking on contained code points. >
>> My original proposal this time round was to fix the noncharacters to >> operate as the standard says, by allowing them by default internally. >> Only during I/O would they be checked for. It seems like a small >> extension to give this behavior as well to the beyond-Unicode code >> points. And it has been suggested that this behavior extend as well >> to the surrogates, so that any unsigned value can be represented as a >> "Perl string". Note that no one is proposing that any of these values >> be legal upon I/o. >> >> In this thread, I don't think I've heard what the harm is of allowing >> surrogates internally. The above text from the standard seems to >> allow that possibility, as long as they don't represent an abstract >> character. So why not allow them (as they mostly are now when >> warnings are off)?
> > I think "allow" is too overloaded; perhaps I'm misunderstanding what you > mean by it. Maybe this is violent agreement. :-( That said, "harm" > isn't the only relevant standard. There's also "correctness." If you > ask Perl to do something with data and the data are not properly formed > for what you ask it to do -- if Perl can't do it _correctly_ -- then we > expect Perl to tell us. Compare C<"xyz" + 1>. > > A utf8 string where some of the code points are surrogates, which is > being processed in a way that requires knowing the code points' > semantics, like \p{whatever} or case operations, cannot be processed > _correctly_ because there is no _correct_ answer. Therein may lay the > harm you seek, btw: Perl silently acting on invalid data and producing > invalid results without the programmer getting a warning about it.
But surrogates do have semantics. The standard is kind of self-contradictory about these things. It says that surrogate code points are not legal Unicode code points, the same as for those above 10FFFF. But the data files give a property definition for surrogates for every property in Unicode. The upper case of a surrogate is itself. It has general category 'Cs' or 'Surrogate'. It's in one of the surrogate blocks. It case folds to itself. These definitions are not just by-products of having to have place-markers in the data files: In the standard's Section "D.4 Changes from Version 5.0 to Version 5.1", it says: "In UAX #24, “Unicode Script Property,” added surrogates to the list of code points which get the “Unknown” script value ...". So they are actively maintaining the properties for surrogate code points.
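One way to look at these property values from outside Perl is to ask the JDK, whose `Character` class is generated from the Unicode data files. The following is only a minimal sketch: the class name `SurrogateProperties` and the helper `describe` are made up for illustration, and the values reported depend on the Unicode version bundled with the JDK in use.

```java
final class SurrogateProperties {
    /** Formats a few Unicode properties the JDK reports for one code point. */
    static String describe(int cp) {
        return String.format(
                "U+%04X surrogate=%b upper=U+%04X lower=U+%04X block=%s script=%s",
                cp,
                Character.getType(cp) == Character.SURROGATE, // general category Cs
                Character.toUpperCase(cp),                    // case mapping
                Character.toLowerCase(cp),
                Character.UnicodeBlock.of(cp),                // block property
                Character.UnicodeScript.of(cp));              // script property
    }

    public static void main(String[] args) {
        System.out.println(describe(0xD800)); // a high surrogate
        System.out.println(describe(0xDC7E)); // a low surrogate
    }
}
```

The case mappings come back as the surrogate itself, which matches the "upper case of a surrogate is itself" statement above.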
> > So it seems to me, in the end, that the warnings on surrogates in > \p{foo}, //i, lc, uc, etc. are important; but that we could document > that set of operations that will warn, and guarantee to programmers that > if they stay clear of those operators, they can put any pseudo-character > in a utf8 string and we will promise to avert our collective gaze. >
A problem is that it doesn't currently work anything like this when converting from utf8 to code point. Unless warnings are off, or it is called with specific flags, the common function that does this will return 0 instead of the code point for surrogates, one of the non-character code points, and any values that don't fit into 31 bits. All the other non-character code points, and those above 10FFFF but fitting into 31 bits, are unchecked.

(It is beyond me as to why the 31-bit limit, unless there was concern that UV's didn't work and IV's would be required, or I suppose it could be that testing for this was the most efficient, requiring only a single-byte compare, to make sure that it would fit into 32 bits.)
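The behaviour described above can be sketched roughly as follows. This is not Perl's actual C code: `LaxUtf8Decoder`, `decode` and the `checked` flag are hypothetical names, the choice of U+FFFF as the one checked non-character is taken from the original bug report, and the sketch returns -1 rather than 0 so that a rejected sequence stays distinguishable from NUL. It also does no overlong-sequence detection.

```java
final class LaxUtf8Decoder {
    /** Returned when the sequence is rejected (the thread describes a return of 0). */
    static final long REJECTED = -1;

    /**
     * Decodes one UTF-8-style sequence of up to 6 bytes (31 bits).
     * With checked == true, surrogates and U+FFFF are rejected;
     * code points above 10FFFF that fit in 31 bits pass either way.
     */
    static long decode(byte[] b, boolean checked) {
        if (b.length == 0) return REJECTED;
        int first = b[0] & 0xFF;
        int len;
        long cp;
        if (first < 0x80)      { len = 1; cp = first; }
        else if (first < 0xC0) { return REJECTED; }           // stray continuation byte
        else if (first < 0xE0) { len = 2; cp = first & 0x1F; }
        else if (first < 0xF0) { len = 3; cp = first & 0x0F; }
        else if (first < 0xF8) { len = 4; cp = first & 0x07; }
        else if (first < 0xFC) { len = 5; cp = first & 0x03; }
        else if (first < 0xFE) { len = 6; cp = first & 0x01; }
        else                   { return REJECTED; }           // would not fit in 31 bits
        if (b.length < len) return REJECTED;
        for (int i = 1; i < len; i++) {
            int c = b[i] & 0xFF;
            if ((c & 0xC0) != 0x80) return REJECTED;          // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        boolean surrogate = cp >= 0xD800 && cp <= 0xDFFF;
        if (checked && (surrogate || cp == 0xFFFF)) return REJECTED;
        return cp;
    }

    public static void main(String[] args) {
        byte[] d800 = {(byte) 0xED, (byte) 0xA0, (byte) 0x80};
        System.out.println(decode(d800, true));                       // rejected
        System.out.println(Long.toHexString(decode(d800, false)));    // decoded
    }
}
```

The point of the sketch is the asymmetry: the same well-formed byte sequence yields a code point or a failure depending only on the flag.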
CC: Eric Brine <ikegami [...] adaelis.com>, perl5-porters [...] perl.org, Father Chrysostomos <sprout [...] cpan.org>, Zefram <zefram [...] fysh.org>, David Golden <xdaveg [...] gmail.com>
Subject: Re: [perl #63446] utf8 fatal warning
Date: Fri, 17 Dec 2010 14:15:28 -0800
To: karl williamson <public [...] khwilliamson.com>
From: Reverend Chip <rev.chip [...] gmail.com>
On 12/17/2010 1:51 PM, karl williamson wrote:
> Reverend Chip wrote:
>> A utf8 string where some of the code points are surrogates, which is >> being processed in a way that requires knowing the code points' >> semantics, like \p{whatever} or case operations, cannot be processed >> _correctly_ because there is no _correct_ answer. Therein may lay the >> harm you seek, btw: Perl silently acting on invalid data and producing >> invalid results without the programmer getting a warning about it.
> > But surrogates do have semantics. The standard is kind of > self-contradictory about these things. It says that surrogate code > points are not legal Unicode code points, the same as for those above > 10FFFF. But the data files give a property definition for surrogates > for every property in Unicode. The upper case of a surrogate is > itself. It has general category 'Cs' or 'Surrogate'. It's in one of > the surrogate blocks. It case folds to itself. These definitions are > not just by-products of having to have place-markers in the data > files: In the standard's Section "D.4 Changes from Version 5.0 to > Version 5.1", it says: "In UAX #24, “Unicode Script Property,” added > surrogates to the list of code points which get the “Unknown” script > value ...". So they are actively maintaining the properties for > surrogate code points.
Only a committee would declare something illegal, and also specify how it should work.

I left that whole paragraph in place because I can hardly believe it, so I want to remind myself that it's true.

So: Under the "generous in what you accept" principle, if Unicode has gone so far as to maintain the code point properties, then we really should play along, I guess. \p{foo} for everyone!
>> >> So it seems to me, in the end, that the warnings on surrogates in >> \p{foo}, //i, lc, uc, etc. are important; but that we could document >> that set of operations that will warn, and guarantee to programmers that >> if they stay clear of those operators, they can put any pseudo-character >> in a utf8 string and we will promise to avert our collective gaze.
> > A problem is that it doesn't currently work any way like this when > converting from utf8 to code point. Unless warnings are off, or > called with specific flags, the common function that does this will > return 0 instead of the code point for surrogates, one of the > non-character code points, and any values that don't fit into 31 > bits. All the other non-character code points and above 10FFFF but > fitting into 31 bits are unchecked. > > (It is beyond me as to why the 31-bit limit, unless there was concern > that UV's didn't work and IV's would be required, or I suppose it > could be that testing for this was the most efficient, requiring only > a single-byte compare, to make sure that it would fit into 32 bits.)
Erm, that sounds like something that should change, then. If current Unicode defines properties for surrogate code points (and presumably that one non-character code point?) then we need to be able to decode them. I suspect that the 31-bit limit probably does relate to IVs; but given that UV support is quite good I think allowing 32 bits is a good idea, but I imagine that might have cascading implications throughout the code that processes characters. (Why signed integers are the default in C is something I hope never to truly understand.)
CC: karl williamson <public [...] khwilliamson.com>, Eric Brine <ikegami [...] adaelis.com>, perl5-porters [...] perl.org, Father Chrysostomos <sprout [...] cpan.org>, Zefram <zefram [...] fysh.org>, David Golden <xdaveg [...] gmail.com>
Subject: Re: [perl #63446] utf8 fatal warning
Date: Fri, 17 Dec 2010 15:34:50 -0700
To: Reverend Chip <rev.chip [...] gmail.com>
From: Tom Christiansen <tchrist [...] perl.com>
> Only a committee would declare something illegal, and also > specify how it should work.
> I left that whole paragraph in place because I can hardly > believe it, so I want to remind myself that it's true.
I thought the same thing. Remember, Java is the place where a single surrogate (half a character, as it were) tests true for /^.$/, but so does a surrogate pair: you can't distinguish them! There's a disturbing amount of Java mumblese in the Unicode docs. It's disturbing because they're citing people who clearly don't understand important matters, which shows that they themselves are thinking fuzzily. This does not inspire confidence.

--tom
CC: karl williamson <public [...] khwilliamson.com>, Eric Brine <ikegami [...] adaelis.com>, perl5-porters [...] perl.org, Father Chrysostomos <sprout [...] cpan.org>, Zefram <zefram [...] fysh.org>, David Golden <xdaveg [...] gmail.com>
Subject: Re: [perl #63446] utf8 fatal warning
Date: Fri, 17 Dec 2010 15:35:47 -0800
To: Tom Christiansen <tchrist [...] perl.com>
From: Reverend Chip <rev.chip [...] gmail.com>
On 12/17/2010 2:34 PM, Tom Christiansen wrote:
>> Only a committee would declare something illegal, and also >> specify how it should work. >> I left that whole paragraph in place because I can hardly >> believe it, so I want to remind myself that it's true.
> I thought the same thing. Remember, Java is the place where both a single > surrogate (half a character, as it were) tests true for /^.$/, but so does > a surrogate pair: you can't distinguish them!
That's remarkable. I'll bet that a reversed surrogate pair matches /^.{2}$/.
> There's a disturbing amount > of Java mumblese in the Unicode docs. It's disturbing because they're > citing people who clearly don't understand important matters, which shows > that they themselves are thinking fuzzily. This does not inspire confidence.
Semantic bleed back from Java makes sense; has the Java world been a political force in Unicode, perhaps?
CC: karl williamson <public [...] khwilliamson.com>, Eric Brine <ikegami [...] adaelis.com>, perl5-porters [...] perl.org, Father Chrysostomos <sprout [...] cpan.org>, Zefram <zefram [...] fysh.org>, David Golden <xdaveg [...] gmail.com>
Subject: Re: [perl #63446] utf8 fatal warning
Date: Fri, 17 Dec 2010 17:06:31 -0700
To: Reverend Chip <rev.chip [...] gmail.com>
From: Tom Christiansen <tchrist [...] perl.com>
> That's remarkable. I'll bet that a reversed surrogate pair matches > /^.{2}$/.
Clever guess! But it's worse than that:

U+D83D  TRUE:  "?" =~ /./
U+D83D  TRUE:  "?" =~ /^.$/
U+D83D false:  "?" =~ /../
U+D83D false:  "?" =~ /^..$/
U+D83D  TRUE:  "?" =~ /\pC/
U+D83D  TRUE:  "?" =~ /\p{Cs}/
U+D83D  TRUE:  "?" =~ /\p{InHighSurrogates}/
U+D83D false:  "?" =~ /\p{InLowSurrogates}/
U+D83D false:  "?" =~ /[^\pL\pM\pN\pP\pS\pZ\pC]/

U+1F47E  TRUE:  "👾" =~ /./
U+1F47E  TRUE:  "👾" =~ /^.$/
U+1F47E false:  "👾" =~ /../
U+1F47E false:  "👾" =~ /^..$/
U+1F47E  TRUE:  "👾" =~ /\pC/
U+1F47E  TRUE:  "👾" =~ /\p{Cs}/
U+1F47E false:  "👾" =~ /\p{InHighSurrogates}/
U+1F47E  TRUE:  "👾" =~ /\p{InLowSurrogates}/
U+1F47E  TRUE:  "👾" =~ /[^\pL\pM\pN\pP\pS\pZ\pC]/

U+DC7E  TRUE:  "?" =~ /./
U+DC7E  TRUE:  "?" =~ /^.$/
U+DC7E false:  "?" =~ /../
U+DC7E false:  "?" =~ /^..$/
U+DC7E  TRUE:  "?" =~ /\pC/
U+DC7E  TRUE:  "?" =~ /\p{Cs}/
U+DC7E false:  "?" =~ /\p{InHighSurrogates}/
U+DC7E  TRUE:  "?" =~ /\p{InLowSurrogates}/
U+DC7E false:  "?" =~ /[^\pL\pM\pN\pP\pS\pZ\pC]/

U+DC7E.D83D  TRUE:  "??" =~ /./
U+DC7E.D83D false:  "??" =~ /^.$/
U+DC7E.D83D  TRUE:  "??" =~ /../
U+DC7E.D83D  TRUE:  "??" =~ /^..$/
U+DC7E.D83D  TRUE:  "??" =~ /\pC/
U+DC7E.D83D  TRUE:  "??" =~ /\p{Cs}/
U+DC7E.D83D  TRUE:  "??" =~ /\p{InHighSurrogates}/
U+DC7E.D83D  TRUE:  "??" =~ /\p{InLowSurrogates}/
U+DC7E.D83D false:  "??" =~ /[^\pL\pM\pN\pP\pS\pZ\pC]/

U+FDDD  TRUE:  "\x{FDDD}" =~ /./
U+FDDD  TRUE:  "\x{FDDD}" =~ /^.$/
U+FDDD false:  "\x{FDDD}" =~ /../
U+FDDD false:  "\x{FDDD}" =~ /^..$/
U+FDDD false:  "\x{FDDD}" =~ /\pC/
U+FDDD false:  "\x{FDDD}" =~ /\p{Cs}/
U+FDDD false:  "\x{FDDD}" =~ /\p{InHighSurrogates}/
U+FDDD false:  "\x{FDDD}" =~ /\p{InLowSurrogates}/
U+FDDD  TRUE:  "\x{FDDD}" =~ /[^\pL\pM\pN\pP\pS\pZ\pC]/

U+FFFF  TRUE:  "#" =~ /./
U+FFFF  TRUE:  "#" =~ /^.$/
U+FFFF false:  "#" =~ /../
U+FFFF false:  "#" =~ /^..$/
U+FFFF false:  "#" =~ /\pC/
U+FFFF false:  "#" =~ /\p{Cs}/
U+FFFF false:  "#" =~ /\p{InHighSurrogates}/
U+FFFF false:  "#" =~ /\p{InLowSurrogates}/
U+FFFF  TRUE:  "#" =~ /[^\pL\pM\pN\pP\pS\pZ\pC]/

That's with a UTF-8 output encoding.

Anything smell fishy to you? I mean, more than once every few lines? :)

--tom
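The /^.$/ and /^..$/ rows above follow from how `java.util.regex` counts: it works on code points, and an unpaired surrogate counts as one code point by itself. A minimal sketch of just that part, with the hypothetical names `SurrogateRegexCheck`, `wholeMatch` and `codePoints` invented for this illustration:

```java
import java.util.regex.Pattern;

final class SurrogateRegexCheck {
    /** True if the whole string matches the regex. */
    static boolean wholeMatch(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    /** Number of code points Java sees in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String high = "\uD83D";          // lone high surrogate
        String pair = "\uD83D\uDC7E";    // U+1F47E as a proper pair
        String reversed = "\uDC7E\uD83D"; // the same halves in the wrong order

        for (String s : new String[] {high, pair, reversed}) {
            System.out.printf("code points=%d  ^.$=%b  ^.{2}$=%b%n",
                    codePoints(s), wholeMatch("^.$", s), wholeMatch("^.{2}$", s));
        }
    }
}
```

A lone half and a full pair both count as one code point, so /^.$/ cannot tell them apart, while the reversed pair counts as two.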
Subject: Re: [perl #63446] utf8 fatal warning
Date: Sat, 18 Dec 2010 17:27:21 +0100
To: perl5-porters [...] perl.org
From: Aristotle Pagaltzis <pagaltzis [...] gmx.de>
* Reverend Chip <rev.chip@gmail.com> [2010-12-14 01:15]:
> I hate to have to disagree, but:
You are disagreeing with Larry.
> "UTF8" means "UCS Translation Format - 8-bit", and "UCS" means > "Universal Character Set", i.e. Unicode. Unicode semantics > _are_ part of what Perl supports, so Perl is entitled to give > Unicode-specific meaning to the code points it finds therein. > What you seem to want is for Perl to support "arbitrary > integers encoded as variable-length byte strings using the same > encoding tricks as UTF8", and of course it is possible that > this could have been done, but that's not what Perl actually > promises to do.
It actually does, and has documented that it does. http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8-vs.-UTF8
> So complaining when Perl takes seriously the "u" in "utf8" > seems ill-founded. Unless, that is, I have been grievously > misinformed?
Personally I like that Perl is lax there. In fact I want a particular UTF-8 encoder/decoder for Perl at some point, which can fully round-trip binary data and is known as UTF-8b. In <http://bsittler.livejournal.com/10381.html> it is described briefly. It comes from a long and thorough analysis detailed in <http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html>. And <http://hyperreal.org/~est/freeware/> has implementations for C and Python.

It is part of Perl’s heritage and appeal that it allows your code to deal with the outside world in as messy or tidy a manner as necessary to get the job done – the glue language. I have grown some distaste for technologies that fail on that count, e.g. the restriction in XML that you can’t represent control characters in a well-formed document, in any way. I remember the problems this caused at a site I worked on, where we could not provide feeds of user posts without hacky workarounds, since HTTP POSTs can contain anything. When you are forbidden from saying something at all, it will sooner or later become a problem.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
CC: Eric Brine <ikegami [...] adaelis.com>, perl5-porters [...] perl.org
Subject: Re: [perl #63446] utf8 fatal warning
Date: Sat, 18 Dec 2010 22:18:34 +0100
To: Reverend Chip <rev.chip [...] gmail.com>
From: demerphq <demerphq [...] gmail.com>
On 14 December 2010 19:39, Reverend Chip <rev.chip@gmail.com> wrote:
> On 12/13/2010 11:24 PM, demerphq wrote:
>> On 14 December 2010 07:22, Reverend Chip <rev.chip@gmail.com> wrote:
>>> On 12/13/2010 5:24 PM, Eric Brine wrote:
>>>> Perl currently supports strings of arbitrary 32-bit numbers in 32-bit >>>> builds, and strings of arbitrary 64-bit numbers in 64-bit builds. I >>>> don't know of any documentation to the contrary...
>>> Well of course Perl is designed to perform as gracefully as possible as >>> the Unicode committee(s) assign new code points; to do otherwise would >>> be downright stupid.  But that forward-looking design is irrelevant to >>> the fact that Perl knows the strings _are_ Unicode.   As for documentary >>> evidence, from the many possible choices I pick mostly at random this >>> quotation from perlunicode -- which is referenced from utf8's >>> documentation, natch: >>> >>>    "Unless explicitly stated, Perl operators use character semantics >>> for Unicode data and byte semantics for non-Unicode data."
>> You are a bit misinformed. The internals specifically contemplated >> handling the utf8 encoding as a way to implement packed arrays of 32 >> bit integers.
> > Code cannot contemplate.  What are you trying to say?  A hypothetical > leveraging of the utf8 support code for some other purpose is off topic.
The coder can contemplate, and then implement, support outside of pure requirements. So in particular (and this is documented) we use the term "utf8" to represent "Perl's private extension of UTF8". utf8 is equivalent to "UTF8" in that all legal (canonical) UTF8 sequences are legal utf8, however not all legal utf8 sequences are in UTF8 as utf8 supports a larger range of codepoints, and codepoints which are illegal in UTF8, like the codepoints reserved for UTF-16 (surrogate pairs).
>> It is only when we must ascribe meaning to codepoints, such as when we >> do case change operations, or case insensitive matching that we >> ascribe semantic meaning to the values.
> > Well, of course.  Unnecessary validation work is unnecessary.  Still, > Perl knows it's Unicode.
"Perl knows it's Unicode" is an insufficiently well defined expression for this discussion. Flipping the utf8 bit on a SV tells perl that the integers stored therein are to be decoded as utf8, and that if certain operations are performed to use special Unicode routines to do so. When the latter are invoked Perl has the right to complain about problems with the contents of the utf8 string. If they are not it has no business doing so.
>> There is no reason not to allow \x{D800} to be stored in a utf8 >> string, except if someone wants to treat that string as having meaning >> under unicode.
> > Perl does treat the string as having meaning under Unicode.
Only when performing an operation that requires lookup into the Unicode database. Which is actually rarely. The rest of the time it knows it is utf8. Which as I explain above is /not/ Unicode.
> This is established.
No, it is not.
> Now if a programmer decides to play a game in which he > puts illegal code points into Unicode strings because Perl's validation > is lazy,
It is not lazy. It is deliberately designed to do this. Read the code and the comments.
> well, that's a game that programmer may win and may lose; but > in any case, he has no grounds to complain when Perl's validation > catches up with him.
No. Again, you haven't read the docs. We document that utf8 is not UTF8, and that you can do things with utf8 that are strictly speaking illegal in UTF8.
>>  Its not perls nature to say "you cant do that - unicode >> doesn't agree" except when we have no other choice.
> > Perl's nature both includes compliance and integrity.  It's established > and documented that Perl's "utf8" is a representation of Unicode; that's > not a lie, but a truth that some people are in denial about.  Perl > interprets your commands within that context.  So its compliance has > limits and conditions.  It has always been thus.
You are misinformed. See "perlunifaq"

<quote>

=head2 What's the difference between C<UTF-8> and C<utf8>?

C<UTF-8> is the official standard. C<utf8> is Perl's way of being liberal in what it accepts. If you have to communicate with things that aren't so liberal, you may want to consider using C<UTF-8>. If you have to communicate with things that are too liberal, you may have to use C<utf8>. The full explanation is in L<Encode>.

C<UTF-8> is internally known as C<utf-8-strict>. The tutorial uses UTF-8 consistently, even where utf8 is actually used internally, because the distinction can be hard to make, and is mostly irrelevant.

For example, utf8 can be used for code points that don't exist in Unicode, like 9999999, but if you encode that to UTF-8, you get a substitution character (by default; see L<Encode/"Handling Malformed Data"> for more ways of dealing with this.)

Okay, if you insist: the "internal format" is utf8, not UTF-8. (When it's not some other encoding.)

</quote>

And further from Encode:

<quote>

=head1 UTF-8 vs. utf8 vs. UTF8

    ....We now view strings not as sequences of bytes, but as sequences
    of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit
    computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.

That has been the perl's notion of UTF-8 but official UTF-8 is more strict; Its ranges is much narrower (0 .. 10FFFF), some sequences are not allowed (i.e. Those used in the surrogate pair, 0xFFFE, et al).

Now that is overruled by Larry Wall himself.

    From: Larry Wall <larry@wall.org>
    Date: December 04, 2004 11:51:58 JST
    To: perl-unicode@perl.org
    Subject: Re: Make Encode.pm support the real UTF-8
    Message-Id: <20041204025158.GA28754@wall.org>

    On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
    : I've no problem with 'utf8' being perl's unrestricted uft8 encoding,
    : but "UTF-8" is the name of the standard and should give the
    : corresponding behaviour.

    For what it's worth, that's how I've always kept them straight in my head.

    Also for what it's worth, Perl 6 will mostly default to strict but
    make it easy to switch back to lax.

    Larry

Do you copy? As of Perl 5.8.7, B<UTF-8> means strict, official UTF-8 while B<utf8> means liberal, lax, version thereof. And Encode version 2.10 or later thus groks the difference between C<UTF-8> and C"utf8".

    encode("utf8", "\x{FFFF_FFFF}", 1); # okay
    encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks

C<UTF-8> in Encode is actually a canonical name for C<utf-8-strict>. Yes, the hyphen between "UTF" and "8" is important. Without it Encode goes "liberal"

    find_encoding("UTF-8")->name # is 'utf-8-strict'
    find_encoding("utf-8")->name # ditto. names are case insensitive
    find_encoding("utf_8")->name # ditto. "_" are treated as "-"
    find_encoding("UTF8")->name  # is 'utf8'.

The UTF8 flag is internally called UTF8, without a hyphen. It indicates whether a string is internally encoded as utf8, also without a hypen.

</quote>

Seems to me you have some reading to do.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
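The strict/lax split quoted above can be illustrated outside Perl as well. The sketch below is an analogy in Java, not Encode's implementation: `strictAccepts` uses the JDK's real UTF-8 `CharsetEncoder`, which reports unpaired surrogates as errors, while `encodeLax` is a hypothetical hand-written encoder that applies the same bit layout to any value up to 31 bits (Perl's own extension for larger values is not modelled here).

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

final class StrictVsLaxUtf8 {
    /** Strict: true only if the JDK's UTF-8 encoder accepts the string. */
    static boolean strictAccepts(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false; // e.g. an unpaired surrogate
        }
    }

    /** Lax: encodes any value in 0 .. 0x7FFFFFFF with the UTF-8 bit layout. */
    static byte[] encodeLax(long cp) {
        if (cp < 0 || cp > 0x7FFFFFFFL) {
            throw new IllegalArgumentException("sketch only covers 31 bits");
        }
        if (cp < 0x80) return new byte[] {(byte) cp};
        int len = cp < 0x800 ? 2 : cp < 0x10000 ? 3 : cp < 0x200000 ? 4
                : cp < 0x4000000 ? 5 : 6;
        byte[] out = new byte[len];
        for (int i = len - 1; i > 0; i--) {   // fill continuation bytes from the end
            out[i] = (byte) (0x80 | (cp & 0x3F));
            cp >>= 6;
        }
        int[] lead = {0, 0, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC};
        out[0] = (byte) (lead[len] | cp);     // remaining high bits go in the lead byte
        return out;
    }

    public static void main(String[] args) {
        System.out.println("strict accepts U+D800: " + strictAccepts("\uD800"));
        System.out.println("lax bytes for U+D800: " + encodeLax(0xD800).length);
        System.out.println("lax bytes for 9999999: " + encodeLax(9999999).length);
    }
}
```

Every strictly valid encoding is also what the lax encoder produces for that code point; the lax one simply does not refuse the rest.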
CC: karl williamson <public [...] khwilliamson.com>, Eric Brine <ikegami [...] adaelis.com>, perl5-porters [...] perl.org, Father Chrysostomos <sprout [...] cpan.org>, Zefram <zefram [...] fysh.org>, David Golden <xdaveg [...] gmail.com>
Subject: Re: [perl #63446] utf8 fatal warning
Date: Sat, 18 Dec 2010 22:21:54 +0100
To: Reverend Chip <rev.chip [...] gmail.com>
From: demerphq <demerphq [...] gmail.com>
On 17 December 2010 00:53, Reverend Chip <rev.chip@gmail.com> wrote:
> So it seems to me, in the end, that the warnings on surrogates in > \p{foo}, //i, lc, uc, etc. are important; but that we could document > that set of operations that will warn, and guarantee to programmers that > if they stay clear of those operators, they can put any pseudo-character > in a utf8 string and we will promise to avert our collective gaze.
That is what I, and others have been saying all along. And is actually the only way for things to work which complies with existing documentation and code. cheers, Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: karl williamson <public [...] khwilliamson.com>, Eric Brine <ikegami [...] adaelis.com>, perl5-porters [...] perl.org, Zefram <zefram [...] fysh.org>, David Golden <xdaveg [...] gmail.com>
Subject: Re: [perl #63446] utf8 fatal warning
Date: Sun, 19 Dec 2010 14:40:01 -0800
To: Reverend Chip <rev.chip [...] gmail.com>
From: Father Chrysostomos <sprout [...] cpan.org>
On Dec 16, 2010, at 3:53 PM, Reverend Chip wrote:
> So it seems to me, in the end, that the warnings on surrogates in > \p{foo}, //i, lc, uc, etc. are important; but that we could document > that set of operations that will warn, and guarantee to programmers that > if they stay clear of those operators, they can put any pseudo-character > in a utf8 string and we will promise to avert our collective gaze.
Warnings for surrogates may seem logical at first, but they do not solve my problem of not being able to use modules that I didn’t write. So I’ll just have to continue monkey-patching warnings::import. (I’m getting used to that sort of thing.) \p{foo} should definitely be exempt, as we have \p{Cs} specifically for matching surrogates.
CC: karl williamson <public [...] khwilliamson.com>, Eric Brine <ikegami [...] adaelis.com>, perl5-porters [...] perl.org, Father Chrysostomos <sprout [...] cpan.org>, Zefram <zefram [...] fysh.org>, David Golden <xdaveg [...] gmail.com>
Subject: Re: [perl #63446] utf8 fatal warning
Date: Mon, 20 Dec 2010 03:38:13 -0800
To: demerphq <demerphq [...] gmail.com>
From: Reverend Chip <rev.chip [...] gmail.com>
On 12/18/2010 1:21 PM, demerphq wrote:
> On 17 December 2010 00:53, Reverend Chip <rev.chip@gmail.com> wrote:
>> So it seems to me, in the end, that the warnings on surrogates in >> \p{foo}, //i, lc, uc, etc. are important; but that we could document >> that set of operations that will warn, and guarantee to programmers that >> if they stay clear of those operators, they can put any pseudo-character >> in a utf8 string and we will promise to avert our collective gaze.
> That is what I, and others have been saying all along.
Indeed, you told me so.
> And is actually the only way for things to work which complies with > existing documentation and code.
I think the existing docs are ambiguous. No matter now, of course.
CC: karl williamson <public [...] khwilliamson.com>, Eric Brine <ikegami [...] adaelis.com>, perl5-porters [...] perl.org, Zefram <zefram [...] fysh.org>, David Golden <xdaveg [...] gmail.com>
Subject: Re: [perl #63446] utf8 fatal warning
Date: Mon, 20 Dec 2010 03:45:54 -0800
To: Father Chrysostomos <sprout [...] cpan.org>
From: Reverend Chip <rev.chip [...] gmail.com>
On 12/19/2010 2:40 PM, Father Chrysostomos wrote:
> On Dec 16, 2010, at 3:53 PM, Reverend Chip wrote:
>> So it seems to me, in the end, that the warnings on surrogates in >> \p{foo}, //i, lc, uc, etc. are important; but that we could document >> that set of operations that will warn, and guarantee to programmers that >> if they stay clear of those operators, they can put any pseudo-character >> in a utf8 string and we will promise to avert our collective gaze.
> > Warnings for surrogates may seem logical at first, but they do not solve my problem of not being able to use modules that I didn’t write. So I’ll just have to continue monkey-patching warnings::import. (I’m getting used to that sort of thing.)
Sorry for the trouble. This seems like a hack needed whenever warnings are triggered by bad data (or at least data Perl would call bad), e.g. math on non-numbers.
CC: Eric Brine <ikegami [...] adaelis.com>, perl5-porters [...] perl.org
Subject: Re: [perl #63446] utf8 fatal warning
Date: Mon, 20 Dec 2010 03:56:24 -0800
To: demerphq <demerphq [...] gmail.com>
From: Reverend Chip <rev.chip [...] gmail.com>
On 12/18/2010 1:18 PM, demerphq wrote:
> See "perlunifaq" > > <quote> > > =head2 What's the difference between C<UTF-8> and C<utf8>? > > C<UTF-8> is the official standard. C<utf8> is Perl's way of being liberal in > what it accepts. [...]
OK, fine, I was mistaken. All code points are welcome. The fact that "utf8" doesn't mean the same as "UTF-8" is remarkably annoying, and I don't blame myself for falling for it. I will read perlunitut (again), though, just in case there are more land mines I might step on.
CC: Father Chrysostomos <sprout [...] cpan.org>, karl williamson <public [...] khwilliamson.com>, perl5-porters [...] perl.org, Zefram <zefram [...] fysh.org>, David Golden <xdaveg [...] gmail.com>
Subject: Re: [perl #63446] utf8 fatal warning
Date: Sun, 19 Dec 2010 22:56:58 -0500
To: Reverend Chip <rev.chip [...] gmail.com>
From: Eric Brine <ikegami [...] adaelis.com>
On Mon, Dec 20, 2010 at 6:45 AM, Reverend Chip <rev.chip@gmail.com> wrote:
> On 12/19/2010 2:40 PM, Father Chrysostomos wrote:
> > On Dec 16, 2010, at 3:53 PM, Reverend Chip wrote:
> >> So it seems to me, in the end, that the warnings on surrogates in > >> \p{foo}, //i, lc, uc, etc. are important; but that we could document > >> that set of operations that will warn, and guarantee to programmers that > >> if they stay clear of those operators, they can put any pseudo-character > >> in a utf8 string and we will promise to avert our collective gaze.
> > > > Warnings for surrogates may seem logical at first, but they do not solve
> my problem of not being able to use modules that I didn’t write. So I’ll > just have to continue monkey-patching warnings::import. (I’m getting used to > that sort of thing.) > > Sorry for the trouble. This seems like a hack needed whenever warnings > are triggered by bad data (or at least data Perl would call bad), e.g > math on non-numbers. >
It's going to be like math on NaN. There are billions of different NaN, just like there are billions of non-Unicode characters.
CC: Perl5 Porters Mailing List <perl5-porters [...] perl.org>
Subject: Re: [perl #63446] utf8 fatal warning
Date: Mon, 20 Dec 2010 11:48:51 -0700
To: Reverend Chip <rev.chip [...] gmail.com>
From: Tom Christiansen <tchrist [...] perl.com>
>> Anything smell fishy to you? I mean, more than once every few lines? :)
> I'm not sure what it means. I thought we were talking Java but then I > see all the \p patterns; does Java implement those?
Yes, *some* of them. The pre-3.1 ones only, mostly. It's a massive flustercluck, and they don't give a damn. * Like they have no way to convert between code points and character names. * Like they ignore the rules about loose interpretation of properties. * Like they have \p{Alpha} for [A-Za-z] *only*, and no \p{Alphabetic}. * Like they have \p{InGreek} but not \p{IsGreek}. * Like they have \p{javaWhiteSpace}, which they *CLAIM* maps to Unicode whitespace, but they lie lie lie, because they exempt the super-common U+A0 plus U+2007 and U+202F from being white space. * Like the fools use such lovely things as \p{javaJavaIdentifierStart}, which besides its evil name, isn't something they actually use. Meaning, I can put control characters *including NULs and ESCs!!!!* in my Java idents, and everything chugs along just fine until I trigger your terminal's answer-back sequence. * They can't even get casing right when they claim to. * They're canonical equivalence is broken because of the damned Java preprocessor. * You can't get back out the pattern that you originally compiled, and which the regex compiler changed without telling you how. This is just the very tip of a huge festering pool of poorly engineered errors that they're loth even to admit to, and next to intransigent about fixing, for that would require acknowledging their own errors. Java's claims of Unicode support are nothing but that: claims. They are unsubtantiable, and there is no will to fix the *SEVERAL DOZEN* bugs I have discovered. The answer is inevitably one of these: (1) Sure, but does anybody really care? (2) It's documented not to work, so we never have to make it work. (3) Yes, that's a shame, but fixing it would break backward compatibility. (4) That's only required for Level N+1 compliance, so we don't have to fix it. There is an arrogance and insularity, and NIH ignorance, amongst the Sun Java people the likes of which I don't think I've seen since the old IBM days before the DEC wars. 
It is infuriating beyond description. I hope Java dies dies dies.

Since they refuse to act on my double-digit worth of Unicode bugs, I'm seriously considering creating a "tribute site" where I air out their dirty laundry. That's how angry I am about this putrid pile of Certifiably Regressive Agonizing Puke.

Sure, I may not be able to shame them into fixing anything, but I can certainly shame them. That may be some small solace to me as well as warning to others who've been suckered into this insane piece of mindless bloatware.

RESOLVED: DO NOT USE JAVA IF YOU NEED TO DEAL WITH UNICODE

F U L L S T O P
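Two of the complaints in the list above can be checked directly on a stock JDK. The following is a minimal sketch, not part of the original message: the class `PropertyCheck` and its method `matchesProperty` are hypothetical names introduced here, and only the `java.lang.Character` and `java.util.regex.Pattern` calls are real JDK API. It assumes the default regex mode, with no Unicode-related flags passed to `Pattern`. Note that `Pattern` spells the property `\p{javaWhitespace}`.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not from the original thread.
class PropertyCheck {

    // True if \p{<property>} matches the string consisting of exactly
    // the given code point, using Pattern's default (flag-free) mode.
    public static boolean matchesProperty(String property, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches("\\p{" + property + "}", s);
    }

    public static void main(String[] args) {
        // SPACE, the three no-break spaces named above, and EM SPACE.
        int[] spaces = { 0x0020, 0x00A0, 0x2007, 0x202F, 0x2003 };
        for (int cp : spaces) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b javaWhitespace=%b%n",
                cp,
                Character.isWhitespace(cp),
                Character.isSpaceChar(cp),
                matchesProperty("javaWhitespace", cp));
        }

        // LATIN CAPITAL LETTER A versus LATIN SMALL LETTER E WITH ACUTE.
        int[] letters = { 0x0041, 0x00E9 };
        for (int cp : letters) {
            System.out.printf("U+%04X Alpha=%b L=%b%n",
                cp,
                matchesProperty("Alpha", cp),
                matchesProperty("L", cp));
        }
    }
}
```

`Character.isWhitespace` is documented to exclude the no-break spaces U+00A0, U+2007, and U+202F, so those three should report `false` for both `isWhitespace` and `javaWhitespace` while `isSpaceChar` reports `true`. In the default mode `\p{Alpha}` is the POSIX ASCII class, so U+00E9 should fail it even though it matches the general category `\p{L}`.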
> Also, is the UTF-8 output encoding relevant to the match
> results, or just their display?
Just to their display. The internals are all converted into UTFMH-16.

Here: read 'em and weep.

--tom
Attachment: surotest.java (text/plain, 1.5k)
import java.io.*;
import java.util.regex.*;

public class surotest {

    private static PrintStream stdout;

    private static void testmatch(String s, String re) {
        boolean found = Pattern.compile(re).matcher(s).find();
        stdout.printf("U+");
        for (int i = 0; i < s.length(); i++) {
            stdout.printf("%04X", s.codePointAt(i));
            if (s.codePointAt(i) > Character.MAX_VALUE) { i++; } // idiots
            if (i+1 < s.length()) { stdout.printf("."); }
        }
        stdout.printf(" %s: ", found ? " TRUE" : "false", s);
        stdout.printf(" \"%s\" =~ /%s/\n", s, re);
    }

    public static void main(String[ ] args) throws IOException {

// yes, this is intentionally outdented
/*
 * note that encoding-mapping failures are suppressed,
 * as in fact are all errors with this PoS interface
 *
 * What fools these morons be!
 */
        stdout = new PrintStream(System.out, true, "UTF-8");

        String[] slist = {
            "\uD83D",           // high surrogate half - invalid
            "\uD83D\uDC7E",     // U+1F47E
            "\uDC7E",           // low surrogate half - invalid
            "\uDC7E\uD83D",     // wrong order
            "\uFDDD",           // invalid
            "\uFFFF",           // invalid
        };

        for (String s : slist) {
            stdout.println("");
            testmatch(s, ".");
            testmatch(s, "^.$");
            testmatch(s, "..");
            testmatch(s, "^..$");
            testmatch(s, "\\pC");
            testmatch(s, "\\p{Cs}");
            testmatch(s, "\\p{InHighSurrogates}");
            testmatch(s, "\\p{InLowSurrogates}");
            testmatch(s, "[^\\pL\\pM\\pN\\pP\\pS\\pZ\\pC]");
        }
    }
}
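The printing loop in surotest.java walks the string by UTF-16 code unit and skips one extra unit whenever `codePointAt` reports a supplementary character. As a rough illustration of the same walk, here is a minimal sketch that advances by `Character.charCount` instead. The class `CodePointWalk` and its method `describe` are hypothetical names, not part of the attachment; the `String` and `Character` calls are standard JDK API.

```java
// Hypothetical helper class; not part of the original attachment.
class CodePointWalk {

    // Renders a string as "U+XXXX.YYYY...", one entry per code point.
    // An unpaired surrogate is reported as its own code unit value,
    // which is what String.codePointAt returns for it.
    public static String describe(String s) {
        StringBuilder out = new StringBuilder("U+");
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            out.append(String.format("%04X", cp));
            // Advance by 2 code units for a surrogate pair, else by 1.
            i += Character.charCount(cp);
            if (i < s.length()) {
                out.append('.');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String paired = "\uD83D\uDC7E";    // well-formed pair
        String reversed = "\uDC7E\uD83D";  // low half first, then high half
        System.out.println(describe(paired) + " length=" + paired.length()
            + " codePoints=" + paired.codePointCount(0, paired.length()));
        System.out.println(describe(reversed) + " length=" + reversed.length()
            + " codePoints=" + reversed.codePointCount(0, reversed.length()));
    }
}
```

The well-formed pair occupies two `char` units but counts as one code point, while the reversed pair stays as two separate unpaired surrogates. That difference is what the `.` versus `..` patterns in the test program probe.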
RT-Send-CC: perl5-porters [...] perl.org
Acceptance of surrogates no longer depends on whether warnings are enabled. They are now accepted except under strict input rules.

--Karl Williamson

