Report information
Id: 60156
Status: resolved
Priority: 0/
Queue: perl5

Owner: khw <khw [at] cpan.org>
Requestors: public [at] khwilliamson.com
rmbarker <Robin.Barker [at] npl.co.uk>
Cc:
AdminCc:

Operating System: All
PatchStatus: (no value)
Severity: medium
Type: core
Perl Version:
  • 5.6.0
  • 5.6.1
  • 5.6.2
  • 5.7.0
  • 5.7.1
  • 5.7.2
  • 5.8.0
  • 5.8.1
  • 5.8.2
  • 5.8.3
  • 5.8.4
  • 5.8.5
  • 5.8.6
  • 5.8.7
  • 5.8.8
  • 5.8.9
  • 5.9.0
  • 5.9.1
  • 5.9.2
  • 5.9.3
  • 5.9.4
  • 5.9.5
  • 5.10.0
  • 5.10.1
  • 1.0
  • 5.000
  • 5.002
  • 5.003
  • 5.004
  • 5.004_00
  • 5.004_01
  • 5.004_02
  • 5.004_03
  • 5.004_04
  • 5.004_05
  • 5.005
  • 5.005_01
  • 5.005_02
  • 5.005_03
  • 5.005_04
Fixed In: 5.11.0



Subject: [[:print:]] v \p{Print}
Date: Wed, 2 Jan 2008 11:56:08 -0000
To: <perlbug [...] perl.org>
From: "Robin Barker" <Robin.Barker [...] npl.co.uk>

This is a bug report for perl from robin.barker@npl.co.uk,
generated with the help of perlbug 1.36 running under perl 5.10.0.

-----------------------------------------------------------------
[Please enter your report here]

As I read the documentation, the pairs in perlre.pod of
[[:...:]] and \p{Is....} are supposed to match the same.

But
\p{IsPrint} matches \011 \012 \013 \014 \015
which [[:print:]] does not

\p{IsPunct} does not match $ + < = > ^ ` | ~
which [[:punct:]] does match

Various \p{Is...} match characters in the range 128-256
but no [[:...:]] match characters in that range.
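
The discrepancies listed above can be checked mechanically. The following is a minimal sketch rather than part of the original report: the helper name `ascii_disagreements` is invented for illustration, it only covers the ASCII range, and what it prints depends on the perl version in use, since the behaviour discussed in this ticket changed in later releases.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the code points in 0x00-0x7F on which a
# bracketed POSIX class and a \p{Is...} property give different answers.
sub ascii_disagreements {
    my ($posix_re, $prop_re) = @_;
    return grep {
        my $c = chr;    # chr() defaults to $_
        ($c =~ $posix_re ? 1 : 0) != ($c =~ $prop_re ? 1 : 0)
    } 0x00 .. 0x7f;
}

# The two pairs named in the report.
my %pairs = (
    print => [ qr/[[:print:]]/, qr/\p{IsPrint}/ ],
    punct => [ qr/[[:punct:]]/, qr/\p{IsPunct}/ ],
);

for my $name (sort keys %pairs) {
    my @cps = ascii_disagreements(@{ $pairs{$name} });
    printf "%-5s: %s\n", $name,
        @cps ? join(' ', map { sprintf '0x%02X', $_ } @cps) : '(none)';
}
```

Widening the `0x00 .. 0x7f` range to `0x00 .. 0xff` would be one way to look at the third point (characters 128-255), although the result there also depends on how the string is stored internally.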

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=medium
---
Site configuration information for perl 5.10.0:

Configured by Robin Barker at Tue Dec 18 11:23:28 GMT 2007.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.18-1.2798.fc6, archname=i686-linux-64int
    uname='linux rain.npl.co.uk 2.6.18-1.2798.fc6 #1 smp mon oct 16 14:54:20 edt 2006 i686 i686 i386 gnulinux '
    config_args='-des -Dcc=gcc -Uinstallusrbinperl -Dcf_email=Robin.Barker@npl.co.uk -Dcf_by=Robin Barker -Dman1dir=none -Dman3dir=none -Doptimize=-O2 -g -Duse64bitint -Dotherlibdirs=/usr/lib/perl5:/usr/lib/perl5/site_perl:/usr/lib/perl5/vendor_perl -Dinc_version_list=5.8.8 5.8.7 5.8.6 5.8.5 5.8.4 5.8.3 5.8.2 5.8.1 5.8.0 5.6.1 5.6.0 5.005'

    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',

    optimize='-O2 -g',
    cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='4.2.2', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.5.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'

Locally applied patches:
   

---
@INC for perl 5.10.0:
    /usr/local/lib/perl5/5.10.0/i686-linux-64int
    /usr/local/lib/perl5/5.10.0
    /usr/local/lib/perl5/site_perl/5.10.0/i686-linux-64int
    /usr/local/lib/perl5/site_perl/5.10.0
    /usr/local/lib/perl5/site_perl/5.8.8
    /usr/local/lib/perl5/site_perl/5.005
    /usr/local/lib/perl5/site_perl
    /usr/lib/perl5/5.8.8
    /usr/lib/perl5/5.8.7
    /usr/lib/perl5/5.8.6
    /usr/lib/perl5/5.8.5
    /usr/lib/perl5
    /usr/lib/perl5/site_perl/5.8.8
    /usr/lib/perl5/site_perl/5.8.7
    /usr/lib/perl5/site_perl/5.8.6
    /usr/lib/perl5/site_perl/5.8.5
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.8
    /usr/lib/perl5/vendor_perl/5.8.7
    /usr/lib/perl5/vendor_perl/5.8.6
    /usr/lib/perl5/vendor_perl/5.8.5
    /usr/lib/perl5/vendor_perl
    .

---
Environment for perl 5.10.0:
    HOME=/home/rmb1
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/usr/lib:/usr/local/lib
    LOGDIR (unset)
    PATH=/home/rmb1/appl/script:/opt/perl/bin:/opt/gcc/bin:/opt/SUNWspci/bin/:/usr/rain.npl.co.uk/bin:/usr/local/bin:/usr/local/Admigration/exec:/usr/local/hotjava/bin:/usr/openwin/bin:/usr/dt/bin:/usr/bin:/bin

    PERL5LIB=
    PERL_BADLANG (unset)
    SHELL=/bin/tcsh

Robin Barker
Mathematics and Scientific Computing group
F1-A8
Ext: 7090


-------------------------------------------------------------------
This e-mail and any attachments may contain confidential and/or
privileged material; it is for the intended addressee(s) only.
If you are not a named addressee, you must not use, retain or
disclose such information.

NPL Management Ltd cannot guarantee that the e-mail or any
attachments are free from viruses.

NPL Management Ltd. Registered in England and Wales. No: 2937881
Registered Office: Serco House, 16 Bartley Wood Business Park,
                   Hook, Hampshire, United Kingdom  RG27 9UY
-------------------------------------------------------------------
Subject: [PATCH] RE: [perl #49302] [[:print:]] v \p{Print}
Date: Mon, 31 Mar 2008 21:42:05 +0100
To: <perl5-porters [...] perl.org>, <bugs-bitbucket [...] rt.perl.org>
From: "Robin Barker" <Robin.Barker [...] npl.co.uk>
No progress in resolving this in code, so here is documentation patch.

Robin

--- pod/perlre.pod.orig 2008-01-30 20:41:06.000000000 +0000
+++ pod/perlre.pod
@@ -375,8 +375,8 @@
     digit       IsDigit        \d
     graph       IsGraph
     lower       IsLower
-    print       IsPrint
-    punct       IsPunct
+    print       IsPrint        (but see 2. below)
+    punct       IsPunct        (but see 3. below)
     space       IsSpace
                 IsSpacePerl    \s
     upper       IsUpper
@@ -385,6 +385,41 @@
 
 For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent.
 
+However, the equivalence between C<[[:xxxxx:]]> and C<\p{Xxxxx}> is not exact.
+
+=over 4
+
+=item 1.
+
+C<[[:xxxxx:]]> only matches characters in the range 0x00-0x7F.
+
+=item 2.
+
+C<\p{IsPrint}> matches characters 0x09-0x0d but C<[[:print:]]> does not.
+
+=item 3.
+
+C<[[:punct::]]> matches the following but C<\p{IsPunct}> does not,
+because they are classed as symbols in Unicode.
+
+=over 4
+
+=item C<$>
+
+Currency symbol
+
+=item C<+> C<< < >> C<=> C<< > >> C<|> C<~>
+
+Mathematical symbols
+
+=item C<^> C<`>
+
+Modifier symbols (accents)
+
+=back
+
+=back
+
 If the C<utf8> pragma is not used but the C<locale> pragma is, the
 classes correlate with the usual isalpha(3) interface (except for
 "word" and "blank").
Download perlre.patch
text/plain 1.1k

Message body is not shown because sender requested not to inline it.

CC: perl5-porters [...] perl.org, bugs-bitbucket [...] rt.perl.org
Subject: Re: [PATCH] RE: [perl #49302] [[:print:]] v \p{Print}
Date: Mon, 07 Apr 2008 22:49:30 +0200
To: Robin Barker <Robin.Barker [...] npl.co.uk>
From: David Landgren <david [...] landgren.net>
Robin Barker wrote:
> No progress in resolving this in code, so here is documentation patch.
>
> Robin
>
> --- pod/perlre.pod.orig 2008-01-30 20:41:06.000000000 +0000
> +++ pod/perlre.pod
> @@ -375,8 +375,8 @@
>      digit       IsDigit        \d
>      graph       IsGraph
>      lower       IsLower
> -    print       IsPrint
> -    punct       IsPunct
> +    print       IsPrint        (but see 2. below)
> +    punct       IsPunct        (but see 3. below)
>      space       IsSpace
>                  IsSpacePerl    \s
>      upper       IsUpper
> @@ -385,6 +385,41 @@
>
>  For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent.
>
> +However, the equivalence between C<[[:xxxxx:]]> and C<\p{Xxxxx}> is not exact.
> +
> +=over 4
> +
> +=item 1.
> +
> +C<[[:xxxxx:]]> only matches characters in the range 0x00-0x7F.
> +
> +=item 2.
> +
> +C<\p{IsPrint}> matches characters 0x09-0x0d but C<[[:print:]]> does not.
> +
> +=item 3.
> +
> +C<[[:punct::]]> matches the following but C<\p{IsPunct}> does not,
> +because they are classed as symbols in Unicode.

s/classed/classified/ considered?
Subject: Re: [PATCH] RE: [perl #49302] [[:print:]] v \p{Print}
Date: Sun, 13 Apr 2008 18:55:21 +0200
To: perl5-porters [...] perl.org
From: Juerd Waalboer <juerd [...] convolution.nl>
Robin Barker skribis 2008-03-31 21:42 (+0100):
> +C<[[:xxxxx:]]> only matches characters in the range 0x00-0x7F.

Not always true.

juerd@lanova:~$ perl -CO -e'printf "[%s]\n", "foo\x{123}" =~ /([[:print:]]+)/'
[fooģ]

See also Unicode::Semantics on CPAN.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <#####@juerd.nl>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <sales@convolution.nl>
Subject: [PATCH] another go; was RE: [perl #49302] [[:print:]] v \p{Print}
Date: Fri, 25 Apr 2008 14:21:06 +0100
To: <perl5-porters [...] perl.org>, <bugs-bitbucket [...] rt.perl.org>
From: "Robin Barker" <Robin.Barker [...] npl.co.uk>
I've taken in to account comments about the last patch,
and increased my understanding of what is going on, and
present a revised documentation patch.  With a patch to
t/op/pat.t to capture the current behaviour in a test.

Robin

diff -ur ../perl-current/pod/perlre.pod ./pod/perlre.pod
--- ../perl-current/pod/perlre.pod 2008-01-30 20:41:06.000000000 +0000
+++ ./pod/perlre.pod
@@ -375,20 +375,60 @@
     digit       IsDigit        \d
     graph       IsGraph
     lower       IsLower
-    print       IsPrint
-    punct       IsPunct
+    print       IsPrint        (but see [2] below)
+    punct       IsPunct        (but see [3] below)
     space       IsSpace
                 IsSpacePerl    \s
     upper       IsUpper
-    word        IsWord
+    word        IsWord         \w
     xdigit      IsXDigit
 
 For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent.
 
+However, the equivalence between C<[[:xxxxx:]]> and C<\p{IsXxxxx}>
+is not exact.
+
+=over 4
+
+=item [1]
+
 If the C<utf8> pragma is not used but the C<locale> pragma is, the
 classes correlate with the usual isalpha(3) interface (except for
 "word" and "blank").
 
+But if the C<locale> or C<encoding> pragmas are not used and
+the string is not C<utf8>, then C<[[:xxxxx:]]> (and C<\w>, etc.)
+will not match characters 0x80-0xff; whereas C<\p{IsXxxxx}> will
+force the string to C<utf8> and can match these characters
+(as Unicode).
+
+=item [2]
+
+C<\p{IsPrint}> matches characters 0x09-0x0d but C<[[:print:]]> does not.
+
+=item [3]
+
+C<[[:punct::]]> matches the following but C<\p{IsPunct}> does not,
+because they are classed as symbols (not punctuation) in Unicode.
+
+=over 4
+
+=item C<$>
+
+Currency symbol
+
+=item C<+> C<< < >> C<=> C<< > >> C<|> C<~>
+
+Mathematical symbols
+
+=item C<^> C<`>
+
+Modifier symbols (accents)
+
+=back
+
+=back
+
 The other named classes are:
 
 =over 4
diff -ur ../perl-current/t/op/pat.t ./t/op/pat.t
--- ../perl-current/t/op/pat.t 2008-04-15 13:46:40.000000000 +0100
+++ ./t/op/pat.t
@@ -4604,6 +4604,32 @@
     iseq($te[0], '../');
 }
 
+SKIP: {
+    if (ordA == 193) { skip("Assumes ASCII", 4) }
+
+    my @notIsPunct = grep {/[[:punct:]]/ and not /\p{IsPunct}/}
+                        map {chr} 0x20..0x7f;
+    iseq( join('', @notIsPunct), '$+<=>^`|~',
+        '[:punct:] disagress with IsPunct on Symbols');
+
+    my @isPrint = grep {not/[[:print:]]/ and /\p{IsPrint}/}
+                        map {chr} 0..0x1f, 0x7f..0x9f;
+    iseq( join('', @isPrint), "\x09\x0a\x0b\x0c\x0d\x85",
+        'IsPrint disagrees with [:print:] on control characters');
+
+    my @isPunct = grep {/[[:punct:]]/ != /\p{IsPunct}/}
+                        map {chr} 0x80..0xff;
+    iseq( join('', @isPunct), "\xa1\xab\xb7\xbb\xbf", # ¡ « · » ¿
+        'IsPunct disagrees with [:punct:] outside ASCII');
+
+    my @isPunctLatin1 = eval q{
+        use encoding 'latin1';
+        grep {/[[:punct:]]/ != /\p{IsPunct}/} map {chr} 0x80..0xff;
+    };
+    if( $@ ){ skip( $@, 1); }
+    iseq( join('', @isPunctLatin1), '',
+        'IsPunct agrees with [:punct:] with explicit Latin1');
+}
 
 
 # Test counter is at bottom of file. Put new tests above here.
@@ -4667,7 +4693,7 @@
 
 # Don't forget to update this!
 BEGIN {
-    $::TestCount = 4031;
+    $::TestCount = 4035;
     print "1..$::TestCount\n";
 }
Download perlre-3.patch
text/plain 2.9k

Message body is not shown because sender requested not to inline it.

CC: perl5-porters [...] perl.org, bugs-bitbucket [...] rt.perl.org
Subject: Re: [PATCH] another go; was RE: [perl #49302] [[:print:]] v \p{Print}
Date: Sun, 27 Apr 2008 00:06:39 +0200
To: "Robin Barker" <Robin.Barker [...] npl.co.uk>
From: "Rafael Garcia-Suarez" <rgarciasuarez [...] gmail.com>
2008/4/25 Robin Barker <Robin.Barker@npl.co.uk>:
> I've taken in to account comments about the last patch,
> and increased my understanding of what is going on, and
> present a revised documentation patch. With a patch to
> t/op/pat.t to capture the current behaviour in a test.

Thanks, applied as #33752
Subject: A number of characters match both a posix class and its complement
Date: Sun, 26 Oct 2008 16:32:13 -0600
To: perlbug [...] perl.org
From: karl williamson <public [...] khwilliamson.com>
This is a bug report for perl from public@khwilliamson.com,
generated with the help of perlbug 1.39 running under perl 5.11.0.

-----------------------------------------------------------------
use utf8;

print '©' =~ /[[:graph:]]/, "\n";
print '©' =~ /[[:^graph:]]/, "\n";

both print 1.  This happens for various posix classes, and various
utf8 characters.  Also for perl 5.8
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=medium
---
This perlbug was built using Perl 5.11.0 - Wed Oct 22 19:16:44 MDT 2008
It is being executed now by  Perl 5.11.0 - Fri Oct 24 11:08:58 MDT 2008.

Site configuration information for perl 5.11.0:

Configured by khw at Fri Oct 24 11:08:58 MDT 2008.

Summary of my perl5 (revision 5 version 11 subversion 0 patch 34566) configuration:
  Platform:
    osname=linux, osvers=2.6.24-21-generic, archname=i686-linux
    uname='linux karl 2.6.24-21-generic #1 smp mon aug 25 17:32:09 utc 2008 i686 gnulinux '
    config_args='-d -Dmksymlinks -Dprefix=/home/khw/myperl -Dusedevel -DDEBUGGING=both'
    hint=previous, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    ccversion='', gccversion='4.2.3 (Ubuntu 4.2.3-2ubuntu7)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.7.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.7'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib -fstack-protector'

Locally applied patches:
    DEVEL

---
@INC for perl 5.11.0:
    /home/khw/myperl/lib/5.11.0/i686-linux
    /home/khw/myperl/lib/5.11.0
    /home/khw/myperl/lib/site_perl/5.11.0/i686-linux
    /home/khw/myperl/lib/site_perl/5.11.0
    .

---
Environment for perl 5.11.0:
    HOME=/home/khw
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/khw/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/home/khw/cxoffice/bin
    PERL_BADLANG (unset)
    SHELL=/bin/ksh
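
The two-line reproduction in this report can be wrapped into a small check. This is only a sketch: the `classify` helper is a made-up name, and `utf8::upgrade` is used here as a stand-in for the `use utf8` source literal so that the script itself stays pure ASCII. On an affected perl the report implies it would print `both`; a perl without the bug should never do so.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: say whether one character matches [[:graph:]],
# its complement [[:^graph:]], both, or neither. Only "class" and
# "complement" are logically consistent answers.
sub classify {
    my ($char) = @_;
    my $in  = $char =~ /\A[[:graph:]]\z/  ? 1 : 0;
    my $out = $char =~ /\A[[:^graph:]]\z/ ? 1 : 0;
    return $in && $out ? 'both'
         : $in         ? 'class'
         : $out        ? 'complement'
         :               'neither';
}

my $copyright = chr(0xA9);    # U+00A9 COPYRIGHT SIGN
utf8::upgrade($copyright);    # force the UTF-8 internal representation
print classify($copyright), "\n";
```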
CC: perl5-porters [...] perl.org
Subject: Re: program to look at char class complements [perl #60156]
Date: Thu, 30 Oct 2008 11:02:21 -0600
To: Tom Christiansen <tchrist [...] perl.com>
From: karl williamson <public [...] khwilliamson.com>
Tom Christiansen wrote:
> In-Reply-To: Message from karl williamson <public@khwilliamson.com>
>    of "Wed, 29 Oct 2008 10:21:33 MDT." <49088D8D.3050202@khwilliamson.com>
>
>> Tom Christiansen wrote:
>
>>> PRESCRIPT: Karl, my program at the end below should be good for
>>>            sniffing out \p and \P mutual-exclusion failure bugs
>>>            such as I believe you recently reported.
>>>
>
>> Thanks for the program.  My bug reports have mostly come from
>> running something similar, but not with nearly as many of the
>> classes as you did.
>
>> I'm guessing you didn't run it past 127, because it fails immediately
>> with a Malformed UTF-8 character (fatal) at (eval 3) line 2.
>
> Yes, that's because it intentionally assumed ASCII only.  The amended
> program and results follow.  You should now be able to use this program
> directly for your testing.
>
>> I actually don't think there are any issues with the \p and \P,
>> because those go out and use auto-constructed files.  The
>> problems I have found have been in the posix classes and
>> entirely in the 128-255 range.
>
> It's not just there.  It's a bug in negating of POSIX char classes.
> I can elicit errors at codepoints <128, even with Unicode semantics on.
>
> Specifically, just running POSIX charclass tests alone:
>
>     REPORT: ranging from U+00 (0) .. U+00007F (127) [128 codepoints]
>         failed 28 tests, 3300 out of 3328 tests successful (0.991587%)
>
> The problems are with the [:print:] and [:punct:] properties:
>
>     Trouble w/U+007E: TILDE: Property "[:punct:]" failed
>     Trouble w/U+007C: VERTICAL LINE: Property "[:punct:]" failed
>     Trouble w/U+0060: GRAVE ACCENT: Property "[:punct:]" failed
>     Trouble w/U+005E: CIRCUMFLEX ACCENT: Property "[:punct:]" failed
>     Trouble w/U+003E: GREATER-THAN SIGN: Property "[:punct:]" failed
>     Trouble w/U+003D: EQUALS SIGN: Property "[:punct:]" failed
>     Trouble w/U+003C: LESS-THAN SIGN: Property "[:punct:]" failed
>     Trouble w/U+002B: PLUS SIGN: Property "[:punct:]" failed
>     Trouble w/U+0024: DOLLAR SIGN: Property "[:punct:]" failed
>     Trouble w/U+000D: CARRIAGE RETURN (CR): Property "[:print:]" failed
>     Trouble w/U+000C: FORM FEED (FF): Property "[:print:]" failed
>     Trouble w/U+000B: LINE TABULATION: Property "[:print:]" failed
>     Trouble w/U+000A: LINE FEED (LF): Property "[:print:]" failed
>     Trouble w/U+0009: CHARACTER TABULATION: Property "[:print:]" failed
>
> The positive char class test is like this:
>
>     (
>         ( $U_char =~ /\A[[:print:]]\z/ )
>             ==
>         ( $U_char !~ /\A[[:^print:]]\z/ )
>     )
>         &&
>     (
>         ( $U_char =~ /\A[[:print:]]\z/ )
>             ==
>         ( $U_char !~ /\A[^[:print:]]\z/ )
>     )
>
> The negative char class test is this:
>
>     ( $U_char =~ /\A[[:^punct:]]\z/ )
>         ==
>     ( $U_char =~ /\A[^[:punct:]]\z/ )
>
> ...

I wondered why you were getting errors when my own (which I thought were
rather extensive) test cases were getting none in the ASCII range.  It
turns out that it's because I hadn't thought to try the negation ^ in the
outer set of braces.  Also, these problems don't occur when the characters
aren't utf8 (when they are packed as C instead of U).  And they fail in
exactly the places that the posix classes don't match the unicode ones.
It is documented that [:graph:] includes the precise set of symbols that
have failures that \p{IsGraph} does not.  Similarly for [:print:] and the
ones it has failures for.

I haven't looked at the code, but what is likely going on is that the
complement of a complement loses these differences from unicode (only
when the utf8 flag is on, because otherwise the unicode classes aren't
looked at).

My goal is to fix all these problems in the 128-255 range.  It's turning
out to be more work than I thought, essentially because there are a lot
of pre-existing errors and inconsistencies.
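
The quoted test expressions compare three spellings: the class itself, the inner negation `[[:^name:]]`, and the outer negation `[^[:name:]]`. As a rough sketch of how such a check might be driven over the ASCII range (this is not Tom Christiansen's program, which is not reproduced in this ticket; `negation_consistent` and the class list are assumptions made for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true when exactly one of "class" and "negated class"
# matches, for both ways of writing the negation.
sub negation_consistent {
    my ($char, $class) = @_;

    # Build the three patterns from plain strings so the class name is
    # substituted unambiguously.
    my $pos_re   = qr/${\ sprintf('\A[[:%s:]]\z',  $class) }/;
    my $inner_re = qr/${\ sprintf('\A[[:^%s:]]\z', $class) }/;
    my $outer_re = qr/${\ sprintf('\A[^[:%s:]]\z', $class) }/;

    my $pos   = $char =~ $pos_re   ? 1 : 0;
    my $inner = $char =~ $inner_re ? 1 : 0;
    my $outer = $char =~ $outer_re ? 1 : 0;

    return $pos != $inner && $pos != $outer;
}

for my $class (qw(print punct graph)) {
    my @bad;
    for my $cp (0x00 .. 0x7f) {
        my $char = chr $cp;
        utf8::upgrade($char);    # test the UTF-8 representation, as with pack "U"
        push @bad, sprintf('U+%04X', $cp)
            unless negation_consistent($char, $class);
    }
    printf "[:%s:] %s\n", $class, @bad ? "inconsistent at: @bad" : 'consistent';
}
```

Dropping the `utf8::upgrade` call would test the byte-string case, which according to the reply above did not show the failures.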
Subject: Re: [perl #60156] A number of characters match both a posix class and its complement
Date: Mon, 3 Nov 2008 20:06:30 +1100
To: perl5-porters [...] perl.org
From: Tony Cook <tony [...] develop-help.com>
On Sun, Oct 26, 2008 at 03:32:53PM -0700, karl williamson wrote:
> -----------------------------------------------------------------
> use utf8;
>
> print '©' =~ /[[:graph:]]/, "\n";
> print '©' =~ /[[:^graph:]]/, "\n";
>
> both print 1.  This happens for various posix classes, and various
> utf8 characters.  Also for perl 5.8
> -----------------------------------------------------------------

This problem also occurs for \s vs \S inside a character class:

sh-3.1$ ~/perl/blead34613/bin/perl5.11.0 -le '$x = chr(0xa0).chr(0x100); chop $x; print $x =~ /[\s]/; print $x =~ /[\S]/;'
1
1
sh-3.1$

The problem is that the character classes are compiled to a
combination of a bitmap and a Unicode property check, eg:

  [\s] becomes ANYOF[\11\12\14\15 +utf8::IsSpacePerl]
  [\S] becomes ANYOF[\0-\10\13\16-\37!-\377!utf8::IsSpacePerl]

and similarly for [[:foo:]]:

  [[:graph:]] becomes ANYOF[!-~+utf8::IsGraph]

  [[:^graph:]] becomes ANYOF[\0- \177-\377!utf8::IsGraph]

and for UTF scalars both the bitmap and the unicode property are
checked (S_reginclass in regexec.c).

The bitmap is generated using isSPACE, isGRAPH, etc, which without a
locale, return false for characters \200-\377.

I suspect a solution is going to involve not generating the bitmap for
named classes in character classes and using ANYOF_SPACE,
ANYOF_GRAPH etc in the character class node.

But I don't understand enough of the regular expression engine to want
to delve that deeply.

Tony
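
The one-liner above forces the UTF-8 representation by appending and then chopping U+0100. A sketch of the same experiment using `utf8::upgrade`, comparing the byte and UTF-8 forms of U+00A0 side by side, might look like this (the layout and labels are illustrative only; on the perl in the transcript the UTF-8 row would show a match for both `[\s]` and `[\S]`):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $bytes = chr(0xA0);    # NO-BREAK SPACE held as a single byte
my $utf8  = chr(0xA0);
utf8::upgrade($utf8);     # same character, UTF-8 internal representation

for my $case ([ 'byte string', $bytes ], [ 'utf8 string', $utf8 ]) {
    my ($label, $str) = @$case;
    # A 1 in both columns means the class and its complement both matched.
    printf "%-11s  [\\s]=%d  [\\S]=%d\n", $label,
        ($str =~ /[\s]/ ? 1 : 0),
        ($str =~ /[\S]/ ? 1 : 0);
}
```

The compiled form of a class, such as the ANYOF nodes quoted above, can be inspected with `perl -Mre=debug -e '/[\s]/'`; the exact dump format differs between perl versions.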
CC: perl5-porters [...] perl.org
Subject: Re: [perl #60156] A number of characters match both a posix class and its complement
Date: Tue, 4 Nov 2008 19:04:54 +0100
To: "Tony Cook" <tony [...] develop-help.com>
From: demerphq <demerphq [...] gmail.com>
I expected the attached patch to work out. Unfortunately it doesn't. I
get the following failures when I do a make test-reonly:

Test Summary Report
-------------------
op/pat              (Wstat: 0 Tests: 4035 Failed: 5)
  Failed tests:  1871, 1901-1904
op/regexp           (Wstat: 0 Tests: 1359 Failed: 4)
  Failed tests:  603, 608, 611, 1310
op/regexp_noamp     (Wstat: 0 Tests: 1359 Failed: 4)
  Failed tests:  603, 608, 611, 1310
op/regexp_notrie    (Wstat: 0 Tests: 1359 Failed: 4)
  Failed tests:  603, 608, 611, 1310
op/regexp_qr        (Wstat: 0 Tests: 1359 Failed: 4)
  Failed tests:  603, 608, 611, 1310
op/regexp_qr_embed  (Wstat: 0 Tests: 1359 Failed: 4)
  Failed tests:  603, 608, 611, 1310
op/regexp_trielist  (Wstat: 0 Tests: 1359 Failed: 4)
  Failed tests:  603, 608, 611, 1310
Files=29, Tests=21383, 59 wallclock secs ( 8.70 usr 0.22 sys + 49.73 cusr 0.76 csys = 59.41 CPU)
Result: FAIL
Failed 7/29 test programs. 29/21383 subtests failed.

Which seems quite odd, like the swash logic is failing somehow, in ways
that I personally find surprising.

The issue here is that the bitmaps are constructed for non-utf8
semantics, and utf8 and non-utf8 semantics differ in the codepoint range
128-255. The logic I've added is to only use the bitmaps for codepoints
0..127 when doutf8 and then rely on the unicode charclass "swash" logic
to handle the rest, but this fails.

I haven't had sufficient time, nor access to a debugger I know well
enough*, to figure out why. However I'm still poking.

Yves

* I'd love to attend a 'gdb for perl core hackers' course should someone
put one together for the next YAPC::EU or YAPC Europe Perl Workshop.
*HINT* HINT. :-)

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

Message body is not shown because sender requested not to inline it.

CC: "Tom Christiansen" <tchrist [...] perl.com>, perl5-porters [...] perl.org, bugs-bitbucket [...] netlabs.develooper.com
Subject: Re: program to look at char class complements [perl #60156]
Date: Thu, 6 Nov 2008 17:29:27 +0100
To: "karl williamson" <public [...] khwilliamson.com>
From: demerphq <demerphq [...] gmail.com>
Just for the record in the bug ticket, I used the attached file to
determine that the affected patterns and strings are as follows:

/[\w][\W]/
    matches unicode codepoints: 170 178..179 181 185..186 188..190 192..214 216..246 248

/[\s][\S]/
    matches unicode codepoints: 133 160

/[[:alnum:]][[:^alnum:]]/
    matches unicode codepoints: 170 181 186 192..214 216..246 248

/[[:alpha:]][[:^alpha:]]/
    matches unicode codepoints: 170 181 186 192..214 216..246 248

/[[:cntrl:]][[:^cntrl:]]/
    matches unicode codepoints: 128..159 173

/[[:graph:]][[:^graph:]]/
    matches unicode codepoints: 161

/[[:lower:]][[:^lower:]]/
    matches unicode codepoints: 170 181 186 223..246 248

/[[:print:]][[:^print:]]/
    matches unicode codepoints: 133 160

/[[:punct:]][[:^punct:]]/
    matches unicode codepoints: 36 43 60..62 94 96 124 126 161 171 183 187 191

/[[:upper:]][[:^upper:]]/
    matches unicode codepoints: 192..214 216

/[[:space:]][[:^space:]]/
    matches unicode codepoints: 133 160

/[[:blank:]][[:^blank:]]/
    matches unicode codepoints: 160

That's a lot of characters. Sadly.

I'm trying to figure out a workaround that doesn't make non-unicode
slower while at the same time not involving a complete rewrite of how
character classes are stored. The problem is that as a speed
optimisation we expand these to their ASCII bitmaps at compile time
(when not under use locale), which means we merge them with whatever
non-special chars are in the charclass. We could not do that, but it
would slow down non-unicode a lot. Or I guess we could change the
charclass representation (essentially doubling it) and build both a
unicode and a non-unicode equivalent. That would double the size of a
charclass, and be a reasonable amount of work to fix.

There are some "out of band" options though. We could implement the
"match semantics flags" as has been discussed in the past, and leave
the current implementation as is when the flags are omitted. And note
the problem. Anybody that cares would have to use the appropriate
modifier. We could do a more extreme version of this and just say that
ASCII semantics apply for these items only, and remove the unicode
char class property logic when they are used, forcing people to use
explicit unicode semantics. (Or vice versa, but I don't think THAT is a
good idea.)

The bottom line is that regex metapatterns whose semantics change
under unicode and otherwise are evil. They have led to all kinds of
trouble at all kinds of levels.

I think fixing this properly requires the pumpking / larry to make a
policy decision. Do we stick with the weird bifurcated behaviour
depending on string representation or not?

Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
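The attached test_all_cc.pl is not shown on this page, so the following is only a guess at the kind of check behind the list above: a code point is reported when its UTF-8-upgraded form matches a class immediately followed by that class's complement. The sub name `both_sides` and the trick of repeating the character twice are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not the attached test_all_cc.pl.
# $class and $complement are bracketed classes such as '[\w]' and '[\W]'.
# Returns (as an array ref) the code points in 0..255 whose upgraded form
# matches the class immediately followed by its complement.
sub both_sides {
    my ($class, $complement) = @_;
    my $re = qr/\A${class}${complement}\z/;
    my @hits;
    for my $cp (0 .. 255) {
        my $s = chr($cp) x 2;    # the same character twice
        utf8::upgrade($s);       # force the UTF-8 internal representation
        push @hits, $cp if $s =~ $re;
    }
    return \@hits;
}

for my $pair (['[\w]', '[\W]'], ['[\s]', '[\S]'],
              ['[[:punct:]]', '[[:^punct:]]']) {
    my $hits = both_sides(@$pair);
    printf "/%s%s/ matches code points: %s\n",
        @$pair, (@$hits ? "@$hits" : '(none)');
}
```

On a perl where the class and its complement are truly disjoint, every line should report "(none)".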
Download test_all_cc.pl
text/x-perl 1.6k

Message body is not shown because sender requested not to inline it.

CC: Tom Christiansen <tchrist [...] perl.com>, perl5-porters [...] perl.org, bugs-bitbucket [...] netlabs.develooper.com
Subject: Re: program to look at char class complements [perl #60156]
Date: Thu, 06 Nov 2008 20:40:34 -0700
To: demerphq <demerphq [...] gmail.com>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 6.8k
demerphq wrote:
> 2008/10/30 karl williamson <public@khwilliamson.com>:
>> Tom Christiansen wrote:
>>> In-Reply-To: Message from karl williamson <public@khwilliamson.com> of >>> "Wed, 29 Oct 2008 10:21:33 MDT." <49088D8D.3050202@khwilliamson.com>
>>>> Tom Christiansen wrote:
>>>>> PRESCRIPT: Karl, my program at the end below should be good for >>>>> sniffing out \p and \P mutual-exclusion failure bugs such as I >>>>> believe you recently reported. >>>>>
>>>> Thanks for the program. My bug reports have mostly come from >>>> running something similar, but not with nearly as many of the >>>> classes as you did. >>>> I'm guessing you didn't run it past 127, because it fails immediately >>>> with a Malformed UTF-8 character (fatal) at (eval 3) line 2.
>>> Yes, that's because it intentionally assumed ASCII only. The amended >>> program and results follow. You should now be to use this program >>> directly for your testing. >>>
>>>> I actually don't think there are any issues with the \p and \P, >>>> because those go out and use auto-constructed files. The >>>> problems I have found have been in the posix classes and >>>> entirely in the 128-255 range.
>>> It's not just there. It's a bug in negating of POSIX char classes. >>> I can elicit errors at codepoints <128, even with Unicode semantics on. >>> >>> Specifically, just runniing POSIX charclass tests alone: >>> >>> REPORT: ranging from U+00 (0) .. U+00007F (127) [128 codepoints] >>> failed 28 tests, 3300 out of 3328 tests successful (0.991587%) >>> >>> The problems are with the [:print:] and [:punct:] properties: >>> >>> Trouble w/U+007E: TILDE: Property "[:punct:]" failed >>> Trouble w/U+007C: VERTICAL LINE: Property "[:punct:]" failed >>> Trouble w/U+0060: GRAVE ACCENT: Property "[:punct:]" failed >>> Trouble w/U+005E: CIRCUMFLEX ACCENT: Property "[:punct:]" failed >>> Trouble w/U+003E: GREATER-THAN SIGN: Property "[:punct:]" failed >>> Trouble w/U+003D: EQUALS SIGN: Property "[:punct:]" failed >>> Trouble w/U+003C: LESS-THAN SIGN: Property "[:punct:]" failed >>> Trouble w/U+002B: PLUS SIGN: Property "[:punct:]" failed >>> Trouble w/U+0024: DOLLAR SIGN: Property "[:punct:]" failed >>> Trouble w/U+000D: CARRIAGE RETURN (CR): Property "[:print:]" failed >>> Trouble w/U+000C: FORM FEED (FF): Property "[:print:]" failed >>> Trouble w/U+000B: LINE TABULATION: Property "[:print:]" failed >>> Trouble w/U+000A: LINE FEED (LF): Property "[:print:]" failed >>> Trouble w/U+0009: CHARACTER TABULATION: Property "[:print:]" failed >>> >>> The positive char class test is like this: >>> >>> ( >>> ( $U_char =~ /\A[[:print:]]\z/ ) >>> == >>> ( $U_char !~ /\A[[:^print:]]\z/ ) >>> ) >>> && >>> ( >>> ( $U_char =~ /\A[[:print:]]\z/ ) >>> == >>> ( $U_char !~ /\A[^[:print:]]\z/ ) >>> ) >>> >>> The negative char class test is this: >>> >>> ( $U_char =~ /\A[[:^punct:]]\z/ ) >>> == >>> ( $U_char =~ /\A[^[:punct:]]\z/ ) >>> >>> ...
>> I wondered why you were getting errors when my own (which I thought were >> rather extensive) test cases were getting none in the ASCII range. It turns >> out that it's because I hadn't thought to try the negation ^ in the outer >> set of braces. Also, these problems don't occur when the characters aren't >> utf8 (when they are packed as C instead of U). And they fail in exactly the >> places that the posix classes don't match the unicode ones. It is >> documented that [:graph:] includes the precise set of symbols that have >> failures that \p{IsGraph} does not. Similarly for [:print:] and the ones it >> has failures for. >> >> I haven't looked at the code, but what likely what is going on is that the >> complement of a complement loses these differences from unicode (only when >> the utf8 flag is on, because otherwise the unicode classes aren't looked at) >> >> My goal is to fix all these problems in the 128-255 range. It's turning out >> to be more work than I thought, essentially because there are a lot of >> pre-existing errors and inconsistencies. >>
> > Just for the record in the bug ticket, I used the attached file to > determine that the affected patterns and strings are as follows: > > /[\w][\W]/ > matches unicode codepoints: 170 178..179 181 185..186 188..190 > 192..214 216..246 248 > > /[\s][\S]/ > matches unicode codepoints: 133 160 > > /[[:alnum:]][[:^alnum:]]/ > matches unicode codepoints: 170 181 186 192..214 216..246 248 > > /[[:alpha:]][[:^alpha:]]/ > matches unicode codepoints: 170 181 186 192..214 216..246 248 > > /[[:cntrl:]][[:^cntrl:]]/ > matches unicode codepoints: 128..159 173 > > /[[:graph:]][[:^graph:]]/ > matches unicode codepoints: 161 > > /[[:lower:]][[:^lower:]]/ > matches unicode codepoints: 170 181 186 223..246 248 > > /[[:print:]][[:^print:]]/ > matches unicode codepoints: 133 160 > > /[[:punct:]][[:^punct:]]/ > matches unicode codepoints: 36 43 60..62 94 96 124 126 161 171 183 187 191 > > /[[:upper:]][[:^upper:]]/ > matches unicode codepoints: 192..214 216 > > /[[:space:]][[:^space:]]/ > matches unicode codepoints: 133 160 > > /[[:blank:]][[:^blank:]]/ > matches unicode codepoints: 160 > > Thats a lot of characters. Sadly. > > Im trying to figure out a work around that doesnt make non-unicode > slower while at the same time not involving a complete rewrite of how > character classes are stored. The problem is that as a speed > optimisation we expand these to their ascii bitmaps at compile time > (when not under use locale), which means we merge them with whatever > non special chars are in the charclass. We could not do that, but it > would slow down non-uncode a lot. Or i guess we could change the > charclass representation (essentially doubling it) and build both a > unciode and non-uncode equivalent. That would double the size of a > charclass, and be a reasonable amount of work to fix. > > There are some "out of band" options tho. 
We chould implement the > "match semantics flags" as has been discussed in the past, and leave > the current implementation as is when the flags are omitted. And note > the problem. Anybody that cares would have to use the apropriate > modifier. We could do a more extreme version of this and just say that > ascii semantics apply for these items only, and remove the unicode > char class property logic when they are used, forcing people to use > explicit unicode semantics. (Or vice versa, but I dont think THAT is a > good idea.) > > The bottom line is that regex metapatterns whose semantics change > under unicode and otherwise are evil. They have lead to all kinds of > trouble at all kinds of levels. > > I think fixing this properly requires the pumpking / larry to make a > policy decision. Do we stick with the weird bifurcated behaviour > depending on string representation or not. > > Yves > > >
I don't know if this casts any light on the issue or not, but the attached simple program shows what may be another type of failure.
Download tabtest
text/plain 210b
$v = "\t";
my $re = qr/[^[:print:]]/;
my $before_upgrade = $v =~ $re;
utf8::upgrade($v);
my $after_upgrade = $v =~ $re;
print "upgrading changes whether tab matches $re\n"
    if $before_upgrade != $after_upgrade;
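The same before/after comparison can be run over every POSIX class rather than just [:print:]. This is a sketch only: the helper name `upgrade_changes_match` and the choice to scan the negated form of each class over code points 0..127 are assumptions, not part of the attachment.

```perl
use strict;
use warnings;

# Hypothetical helper: true if matching $char against $re gives a
# different answer once the string is upgraded to UTF-8 internally.
sub upgrade_changes_match {
    my ($re, $char) = @_;
    my $before = ($char =~ $re) ? 1 : 0;
    my $copy = $char;
    utf8::upgrade($copy);
    my $after = ($copy =~ $re) ? 1 : 0;
    return $before != $after;
}

my @classes = qw(alnum alpha blank cntrl digit graph lower
                 print punct space upper xdigit);

for my $class (@classes) {
    my $re = qr/[^[:${class}:]]/;
    my @changed = grep { upgrade_changes_match($re, chr $_) } 0 .. 127;
    print "[^[:$class:]] changes on upgrade for: @changed\n" if @changed;
}
```

Any line printed names the ASCII code points whose answer depends on the internal representation; silence means none were found.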
CC: "Tom Christiansen" <tchrist [...] perl.com>, perl5-porters [...] perl.org, bugs-bitbucket [...] netlabs.develooper.com
Subject: Re: program to look at char class complements [perl #60156]
Date: Fri, 7 Nov 2008 14:00:02 +0100
To: "karl williamson" <public [...] khwilliamson.com>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 9.4k
2008/11/7 karl williamson <public@khwilliamson.com>:
> demerphq wrote:
>> >> 2008/10/30 karl williamson <public@khwilliamson.com>:
>>> >>> Tom Christiansen wrote:
>>>> >>>> In-Reply-To: Message from karl williamson <public@khwilliamson.com> of >>>> "Wed, 29 Oct 2008 10:21:33 MDT." <49088D8D.3050202@khwilliamson.com>
>>>>> >>>>> Tom Christiansen wrote:
>>>>>> >>>>>> PRESCRIPT: Karl, my program at the end below should be good for >>>>>> sniffing out \p and \P mutual-exclusion failure bugs such >>>>>> as I >>>>>> believe you recently reported. >>>>>>
>>>>> Thanks for the program. My bug reports have mostly come from >>>>> running something similar, but not with nearly as many of the >>>>> classes as you did. >>>>> I'm guessing you didn't run it past 127, because it fails immediately >>>>> with a Malformed UTF-8 character (fatal) at (eval 3) line 2.
>>>> >>>> Yes, that's because it intentionally assumed ASCII only. The amended >>>> program and results follow. You should now be to use this program >>>> directly for your testing. >>>>
>>>>> I actually don't think there are any issues with the \p and \P, >>>>> because those go out and use auto-constructed files. The >>>>> problems I have found have been in the posix classes and >>>>> entirely in the 128-255 range.
>>>> >>>> It's not just there. It's a bug in negating of POSIX char classes. >>>> I can elicit errors at codepoints <128, even with Unicode semantics on. >>>> >>>> Specifically, just runniing POSIX charclass tests alone: >>>> >>>> REPORT: ranging from U+00 (0) .. U+00007F (127) [128 codepoints] >>>> failed 28 tests, 3300 out of 3328 tests successful (0.991587%) >>>> >>>> The problems are with the [:print:] and [:punct:] properties: >>>> >>>> Trouble w/U+007E: TILDE: Property "[:punct:]" failed >>>> Trouble w/U+007C: VERTICAL LINE: Property "[:punct:]" failed >>>> Trouble w/U+0060: GRAVE ACCENT: Property "[:punct:]" failed >>>> Trouble w/U+005E: CIRCUMFLEX ACCENT: Property "[:punct:]" failed >>>> Trouble w/U+003E: GREATER-THAN SIGN: Property "[:punct:]" failed >>>> Trouble w/U+003D: EQUALS SIGN: Property "[:punct:]" failed >>>> Trouble w/U+003C: LESS-THAN SIGN: Property "[:punct:]" failed >>>> Trouble w/U+002B: PLUS SIGN: Property "[:punct:]" failed >>>> Trouble w/U+0024: DOLLAR SIGN: Property "[:punct:]" failed >>>> Trouble w/U+000D: CARRIAGE RETURN (CR): Property "[:print:]" failed >>>> Trouble w/U+000C: FORM FEED (FF): Property "[:print:]" failed >>>> Trouble w/U+000B: LINE TABULATION: Property "[:print:]" failed >>>> Trouble w/U+000A: LINE FEED (LF): Property "[:print:]" failed >>>> Trouble w/U+0009: CHARACTER TABULATION: Property "[:print:]" failed >>>> >>>> The positive char class test is like this: >>>> >>>> ( >>>> ( $U_char =~ /\A[[:print:]]\z/ ) >>>> == >>>> ( $U_char !~ /\A[[:^print:]]\z/ ) >>>> ) >>>> && >>>> ( >>>> ( $U_char =~ /\A[[:print:]]\z/ ) >>>> == >>>> ( $U_char !~ /\A[^[:print:]]\z/ ) >>>> ) >>>> >>>> The negative char class test is this: >>>> >>>> ( $U_char =~ /\A[[:^punct:]]\z/ ) >>>> == >>>> ( $U_char =~ /\A[^[:punct:]]\z/ ) >>>> >>>> ...
>>> >>> I wondered why you were getting errors when my own (which I thought were >>> rather extensive) test cases were getting none in the ASCII range. It >>> turns >>> out that it's because I hadn't thought to try the negation ^ in the outer >>> set of braces. Also, these problems don't occur when the characters >>> aren't >>> utf8 (when they are packed as C instead of U). And they fail in exactly >>> the >>> places that the posix classes don't match the unicode ones. It is >>> documented that [:graph:] includes the precise set of symbols that have >>> failures that \p{IsGraph} does not. Similarly for [:print:] and the ones >>> it >>> has failures for. >>> >>> I haven't looked at the code, but what likely what is going on is that >>> the >>> complement of a complement loses these differences from unicode (only >>> when >>> the utf8 flag is on, because otherwise the unicode classes aren't looked >>> at) >>> >>> My goal is to fix all these problems in the 128-255 range. It's turning >>> out >>> to be more work than I thought, essentially because there are a lot of >>> pre-existing errors and inconsistencies. >>>
>> >> Just for the record in the bug ticket, I used the attached file to >> determine that the affected patterns and strings are as follows: >> >> /[\w][\W]/ >> matches unicode codepoints: 170 178..179 181 185..186 188..190 >> 192..214 216..246 248 >> >> /[\s][\S]/ >> matches unicode codepoints: 133 160 >> >> /[[:alnum:]][[:^alnum:]]/ >> matches unicode codepoints: 170 181 186 192..214 216..246 248 >> >> /[[:alpha:]][[:^alpha:]]/ >> matches unicode codepoints: 170 181 186 192..214 216..246 248 >> >> /[[:cntrl:]][[:^cntrl:]]/ >> matches unicode codepoints: 128..159 173 >> >> /[[:graph:]][[:^graph:]]/ >> matches unicode codepoints: 161 >> >> /[[:lower:]][[:^lower:]]/ >> matches unicode codepoints: 170 181 186 223..246 248 >> >> /[[:print:]][[:^print:]]/ >> matches unicode codepoints: 133 160 >> >> /[[:punct:]][[:^punct:]]/ >> matches unicode codepoints: 36 43 60..62 94 96 124 126 161 171 >> 183 187 191 >> >> /[[:upper:]][[:^upper:]]/ >> matches unicode codepoints: 192..214 216 >> >> /[[:space:]][[:^space:]]/ >> matches unicode codepoints: 133 160 >> >> /[[:blank:]][[:^blank:]]/ >> matches unicode codepoints: 160 >> >> Thats a lot of characters. Sadly. >> >> Im trying to figure out a work around that doesnt make non-unicode >> slower while at the same time not involving a complete rewrite of how >> character classes are stored. The problem is that as a speed >> optimisation we expand these to their ascii bitmaps at compile time >> (when not under use locale), which means we merge them with whatever >> non special chars are in the charclass. We could not do that, but it >> would slow down non-uncode a lot. Or i guess we could change the >> charclass representation (essentially doubling it) and build both a >> unciode and non-uncode equivalent. That would double the size of a >> charclass, and be a reasonable amount of work to fix. >> >> There are some "out of band" options tho. 
We chould implement the >> "match semantics flags" as has been discussed in the past, and leave >> the current implementation as is when the flags are omitted. And note >> the problem. Anybody that cares would have to use the apropriate >> modifier. We could do a more extreme version of this and just say that >> ascii semantics apply for these items only, and remove the unicode >> char class property logic when they are used, forcing people to use >> explicit unicode semantics. (Or vice versa, but I dont think THAT is a >> good idea.) >> >> The bottom line is that regex metapatterns whose semantics change >> under unicode and otherwise are evil. They have lead to all kinds of >> trouble at all kinds of levels. >> >> I think fixing this properly requires the pumpking / larry to make a >> policy decision. Do we stick with the weird bifurcated behaviour >> depending on string representation or not. >> >> Yves >> >> >>
> I don't know if this casts any light on the issue or not, but the attached > simple program shows what may be another type of failure. > > > $v = "\t"; > my $re = qr/[^[:print:]]/; > my $before_upgrade = $v =~ $re; > utf8::upgrade($v); > my $after_upgrade = $v =~ $re; > print "upgrading changes whether tab matches $re\n" if $before_upgrade != > $after_upgrade;
This is essentially the same bug as we have been discussing.

Basically POSIX specifies that a horizontal tab "\t" is a member of
the following POSIX character classes: cntrl, space, blank.

http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

However, it would appear that mktables has different ideas:

lib/unicore/mktables line 874:

    my $isspace =
        ($cat =~ /Zs|Zl|Zp/ &&
         $code != 0x200B) # 200B is ZWSP which is for line break control
         # and therefore it is not part of "space" even while it is "Zs".
            || $code == 0x0009 # 0009: HORIZONTAL TAB
            || $code == 0x000A # 000A: LINE FEED
            || $code == 0x000B # 000B: VERTICAL TAB
            || $code == 0x000C # 000C: FORM FEED
            || $code == 0x000D # 000D: CARRIAGE RETURN
            || $code == 0x0085 # 0085: NEL
        ;
    [snip to line 917]
    $Cat{Print}->$op($code) if $isgraph || $isspace;

So it would appear that at least some of our problems are of our own making.

I'm not really sure what to do about this; however, I will say that I
*strongly* feel that we cannot have different interpretations for \w,
\d, \s and the POSIX charclasses under unicode for codepoints 0-255.
What happens outside of these codepoints is less important as it
doesn't introduce logical inconsistencies like the ones we have seen
here.

\w should be [A-Za-z_]
\s should be [\r\n\t ]
\d should be [0-9]

and the POSIX charclasses unsurprisingly should be EXACTLY as defined
by the POSIX standard, and not pay ANY attention to what the Unicode
standard says.

My feeling is that there are perfectly acceptable ways to get a
"unicode word character", using the \P notation (and if there isn't
there should be) and that the special character classes should be
defined according to ASCII.

Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
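To make the POSIX membership of tab concrete, here is a minimal sketch that spells out a few ASCII-only sets by hand, so the answer cannot depend on the UTF-8 flag. The sets are assumed to follow the "POSIX" (C) locale definitions; the hash `%ascii_class` and the sub `ascii_classes_of` are hypothetical names, and nothing here comes from mktables.

```perl
use strict;
use warnings;

# Hand-written ASCII-only sets, assumed to follow the POSIX (C) locale.
# The hash name and the selection of classes are illustrative only.
my %ascii_class = (
    cntrl => qr/\A[\x00-\x1F\x7F]\z/,
    space => qr/\A[ \t\n\x0B\f\r]\z/,
    blank => qr/\A[ \t]\z/,
    print => qr/\A[\x20-\x7E]\z/,
);

# Return the names (sorted) of the sets that contain $char.
sub ascii_classes_of {
    my ($char) = @_;
    my @names = grep { $char =~ $ascii_class{$_} } sort keys %ascii_class;
    return @names;
}

my $tab = "\t";
print "tab before upgrade: @{[ ascii_classes_of($tab) ]}\n";
utf8::upgrade($tab);
print "tab after upgrade:  @{[ ascii_classes_of($tab) ]}\n";
```

Both lines list blank, cntrl and space, and neither lists print, which is the membership POSIX gives for horizontal tab.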
CC: "karl williamson" <public [...] khwilliamson.com>, "Tom Christiansen" <tchrist [...] perl.com>, perl5-porters [...] perl.org, bugs-bitbucket [...] netlabs.develooper.com
Subject: Re: program to look at char class complements [perl #60156]
Date: Fri, 7 Nov 2008 14:44:38 +0100
To: demerphq <demerphq [...] gmail.com>
From: "Rafael Garcia-Suarez" <rgarciasuarez [...] gmail.com>
Download (untitled) / with headers
text/plain 3.8k
2008/11/7 demerphq <demerphq@gmail.com>:
>>> The bottom line is that regex metapatterns whose semantics change >>> under unicode and otherwise are evil. They have lead to all kinds of >>> trouble at all kinds of levels. >>> >>> I think fixing this properly requires the pumpking / larry to make a >>> policy decision. Do we stick with the weird bifurcated behaviour >>> depending on string representation or not.
perltodo says:

=head2 UTF-8 revamp

The handling of Unicode is unclean in many places. For example, the regexp
engine matches in Unicode semantics whenever the string or the pattern is
flagged as UTF-8, but that should not be dependent on an internal storage
detail of the string. Likewise, case folding behaviour is dependent on the
UTF8 internal flag being on or off.

=cut

So, no weird bifurcated behaviour.
>> I don't know if this casts any light on the issue or not, but the attached >> simple program shows what may be another type of failure. >> >> >> $v = "\t"; >> my $re = qr/[^[:print:]]/; >> my $before_upgrade = $v =~ $re; >> utf8::upgrade($v); >> my $after_upgrade = $v =~ $re; >> print "upgrading changes whether tab matches $re\n" if $before_upgrade != >> $after_upgrade;
> > This is essentially the same bug as we have been discussing. > > Basically POSIX specifies that a horizontal tab "\t" is a member of > the following POSIX character classes: cntrl, space, blank. > > http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html > > However, it would appear that mktables has different ideas: > lib/unicore/mktables line 874: > my $isspace = > ($cat =~ /Zs|Zl|Zp/ && > $code != 0x200B) # 200B is ZWSP which is for line break control > # and therefore it is not part of "space" even while it is "Zs". > || $code == 0x0009 # 0009: HORIZONTAL TAB > || $code == 0x000A # 000A: LINE FEED > || $code == 0x000B # 000B: VERTICAL TAB > || $code == 0x000C # 000C: FORM FEED > || $code == 0x000D # 000D: CARRIAGE RETURN > || $code == 0x0085 # 0085: NEL > > ; > [snip to line 917] > $Cat{Print}->$op($code) if $isgraph || $isspace; > > > So it would appear that at least some of our problems are of our own making. > > Im not really sure what to do about this, however I will say that I > *strongly* feel that we cannot have different interpretations for \w, > \d, \s and the POSIX charclasses under unicode for codepoints 0-255. > What happens outside of these codepoints is less important as it > doesn't introduce logical inconsistencies like the ones we have seen > here. > > \w should be [A-Za-z_] > \s should be [\r\n\t ] > \d should be [0-9]
regardless of whether "use locale" is in effect?

This is an incompatible change: "use locale" has historically been
the preferred way to modify the meaning of \w (and \s). I'm not sure we
can change that (short of completely deprecating locales...). On the
other hand, since "use locale" is more or less discouraged, we can
leave the old locale behaviour in its small ghetto.

(\d is less problematic, since locales don't affect it.)
> and the POSIX charclasses unsurprisingly should be EXACTLY as defined > by the POSIX standard, and not pay ANY attention to what the Unicode > standard says.
As a spoiled C programmer, I expect POSIX charclasses to behave as in C (for range 0-255, that is, since C doesn't know more.)
> My feeling is that there are perfectly acceptable ways to get a > "unicode word character", using the \P notation (and if there isnt > there should be) and that the special character classes should be > defined according to ascii.
Indeed, if \w starts matching Unicode word chars, and if we don't add a new regexp flag /u, we can't have anymore a behaviour of \w independent of the internal string representation.
CC: "karl williamson" <public [...] khwilliamson.com>, "Tom Christiansen" <tchrist [...] perl.com>, perl5-porters [...] perl.org, bugs-bitbucket [...] netlabs.develooper.com
Subject: Re: program to look at char class complements [perl #60156]
Date: Fri, 7 Nov 2008 16:37:06 +0100
To: "Rafael Garcia-Suarez" <rgarciasuarez [...] gmail.com>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 6.9k
2008/11/7 Rafael Garcia-Suarez <rgarciasuarez@gmail.com>:
> 2008/11/7 demerphq <demerphq@gmail.com>:
>>>> The bottom line is that regex metapatterns whose semantics change >>>> under unicode and otherwise are evil. They have lead to all kinds of >>>> trouble at all kinds of levels. >>>> >>>> I think fixing this properly requires the pumpking / larry to make a >>>> policy decision. Do we stick with the weird bifurcated behaviour >>>> depending on string representation or not.
> > perltodo says: > > =head2 UTF-8 revamp > > The handling of Unicode is unclean in many places. For example, the regexp > engine matches in Unicode semantics whenever the string or the pattern is > flagged as UTF-8, but that should not be dependent on an internal storage > detail of the string. Likewise, case folding behaviour is dependent on the > UTF8 internal flag being on or off. > > =cut > > So, no weird bifurcated behaviour.
Ok. Cool. So eventually we decide on a single behaviour. Good good.
>>> I don't know if this casts any light on the issue or not, but the attached >>> simple program shows what may be another type of failure. >>> >>> >>> $v = "\t"; >>> my $re = qr/[^[:print:]]/; >>> my $before_upgrade = $v =~ $re; >>> utf8::upgrade($v); >>> my $after_upgrade = $v =~ $re; >>> print "upgrading changes whether tab matches $re\n" if $before_upgrade != >>> $after_upgrade;
>> >> This is essentially the same bug as we have been discussing. >> >> Basically POSIX specifies that a horizontal tab "\t" is a member of >> the following POSIX character classes: cntrl, space, blank. >> >> http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html >> >> However, it would appear that mktables has different ideas: >> lib/unicore/mktables line 874: >> my $isspace = >> ($cat =~ /Zs|Zl|Zp/ && >> $code != 0x200B) # 200B is ZWSP which is for line break control >> # and therefore it is not part of "space" even while it is "Zs". >> || $code == 0x0009 # 0009: HORIZONTAL TAB >> || $code == 0x000A # 000A: LINE FEED >> || $code == 0x000B # 000B: VERTICAL TAB >> || $code == 0x000C # 000C: FORM FEED >> || $code == 0x000D # 000D: CARRIAGE RETURN >> || $code == 0x0085 # 0085: NEL >> >> ; >> [snip to line 917] >> $Cat{Print}->$op($code) if $isgraph || $isspace; >> >> >> So it would appear that at least some of our problems are of our own making. >> >> Im not really sure what to do about this, however I will say that I >> *strongly* feel that we cannot have different interpretations for \w, >> \d, \s and the POSIX charclasses under unicode for codepoints 0-255. >> What happens outside of these codepoints is less important as it >> doesn't introduce logical inconsistencies like the ones we have seen >> here. >> >> \w should be [A-Za-z_] >> \s should be [\r\n\t ] >> \d should be [0-9]
> > regardless of whether "use locale" is in effect ?
No, "use locale" would do the same things it has always done. I have no plans to change that. At an implementation level, "use locale" causes char-class tests to be deferred to match time, with the appropriately educated functions doing the membership tests. Do a search for "How's that for a conditional?" in regcomp.c :-) This makes them MUCH slower.
> This is an incompatible change, "use locale" has been historically > the preferred way to modify the meaning of \w (and \s). I'm not sure we > can change that (short of completely deprecating locales...) On the > other hand, since "use locale" is more or less decouraged, we can > leave the old locale behaviour in its small ghetto.
Yes, but I'm not proposing changing this.
> (\d is less problematic, since locale don't affect it.)
Right, and it would actually close up some security holes.
>> and the POSIX charclasses unsurprisingly should be EXACTLY as defined >> by the POSIX standard, and not pay ANY attention to what the Unicode >> standard says.
> > As a spoiled C programmer, I expect POSIX charclasses to behave as in C > (for range 0-255, that is, since C doesn't know more.)
Agreed, there should be no deviation from what the POSIX standard dictates, and it doesn't dictate that we should include unicode semantics.
>> My feeling is that there are perfectly acceptable ways to get a >> "unicode word character", using the \P notation (and if there isnt >> there should be) and that the special character classes should be >> defined according to ascii.
> > Indeed, if \w starts matching Unicode word chars, and if we don't add a
[speaking to Rafael offline, he meant 'STOPS' here]
> new regexp flag /u, we can't have anymore a behaviour of \w independent > of the internal string representation.
We already have a way to make a word char match the unicode
definition, that is by using the unicode definition \p{IsWord}.

The whole problem (as far as the regex engine is concerned) is that we
have mixed our Perl/TraditionalRegex and POSIX definitions up with
Unicode definitions. Unicode has a set of property definitions which
should not have been confused with the Perl/Regex/POSIX ones.

So \w has traditionally meant [A-Za-z_] but we went and mapped it to
the unicode property IsWord, instead of defining our own synthetic
IsWordPerl that kept the traditional meaning. The same goes, for
instance, for the [:print:] POSIX definition, which specifies that \t
is a 'cntrl' character, and thus excluded from 'print'. We mapped it to
something that includes IsSpace, which is wrong, as pretty much any
non-synthetic unicode property definition is going to include things
that POSIX doesn't deal with, as far as I can tell. And in so doing we
basically broke our character class logic.

So if we were to do the minimum required to fix our character class
logic we would define a new set of synthetic properties, which exactly
match their traditional complements and use them. All of a sudden all
the complement bugs in the various forms would vanish.

Of course this would be a backwards incompatible change. IMO one for
the better I might add, as it would fix a bunch of clear bugs, and
would close what I consider to be security holes.

However, assuming that we can make the mapping controllable by a pragma
we can fix this in a backwards compatible way. Code that wants sane
semantics would do something like:

    use re 'ascii_charclass';   # default for 5.12
    use re 'unicode_charclass';
    use re 'legacy_charclass';  # default for 5.10

At the same time in 5.12 (or even maybe 5.10) we could introduce new
modifiers /A /U /L for this as well.

Hypothetically we could even support something like:

    use re 'charclass_overload' '\s' => 'IsPerlSpace', '\w' => 'IsPerlWord', '\d' => 'IsTrueDigit';

Currently, and especially because of the complementary bugs we are
seeing, and the inconsistencies like the one pointed out by Karl, I
consider it a clear bug that these symbols are supposed to behave
differently in unicode and otherwise.

Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
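For comparison only: the pragma names and the /A /U /L modifiers above are proposals, but later perls (5.14 onward, as far as I know) offer a per-pattern choice through the lowercase /a and /u modifiers. The sketch below assumes such a perl and is not part of anything applied in this ticket; the sub name `is_word` is made up.

```perl
use strict;
use warnings;
use 5.014;    # the /a and /u regex modifiers need perl 5.14 or later

# Hypothetical helper: does $s consist of exactly one \w character
# under ASCII-restricted ('a') or Unicode ('u') rules?
sub is_word {
    my ($s, $rules) = @_;
    return $rules eq 'a' ? scalar($s =~ /\A\w\z/a)
                         : scalar($s =~ /\A\w\z/u);
}

my $e_acute = chr(0xE9);    # LATIN SMALL LETTER E WITH ACUTE

for my $upgraded (0, 1) {
    my $s = $e_acute;
    utf8::upgrade($s) if $upgraded;
    printf "upgraded=%d  \\w under /a: %s  \\w under /u: %s\n",
        $upgraded,
        (is_word($s, 'a') ? 'match' : 'no match'),
        (is_word($s, 'u') ? 'match' : 'no match');
}
```

With an explicit modifier the answer is the same for both representations of the string, which is the property the thread is asking for.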
RT-Send-CC: perl5-porters [...] perl.org
I've merged in RT #49302 as it involves basically the same thing.
CC: "karl williamson" <public [...] khwilliamson.com>, "Tom Christiansen" <tchrist [...] perl.com>, perl5-porters [...] perl.org, bugs-bitbucket [...] netlabs.develooper.com
Subject: Re: program to look at char class complements [perl #60156]
Date: Fri, 7 Nov 2008 21:27:39 +0100
To: "Rafael Garcia-Suarez" <rgarciasuarez [...] gmail.com>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 1.8k
2008/11/7 demerphq <demerphq@gmail.com>:
> So if we were to do the minimum required to fix our character class > logic we would define a new set of synthetic properties, which exactly > match their traditional complements and use them. All of a sudden all > the complement bugs in the various forms would vanish. > > Of course this would be a backwards incompatible change. IMO one for > the better I might add, as it would fix a bunch of clear bugs, and > would close what I consider to be security holes. > > However assuming that we can make the mapping controllable by a pragma > we can fix this in a backwards compatible way. Code that wants sane > semantics would do something like:
I just applied the following:

34769 on 2008/11/07 by demerphq@demerphq-fresh

create new unicode props as defined in POSIX spec (optionally use
them in the regex engine)

Perlbug #60156 and #49302 (and probably others) resolve down to the
problem that the definition of \s and \w and \d and the POSIX
charclasses are different for unicode strings and for non-unicode
strings. This broke the character class logic in the regex engine.

The easiest fix to make the character class logic sane again is to
define new properties which do match.

This change creates new property classes that can be used instead of
the traditional ones (it does not change the previously defined ones).

If the define in regcomp.h:

    #define PERL_LEGACY_UNICODE_CHARCLASS_MAPPINGS 1

is changed to 0, then the new mappings will be used. This will fix a
bunch of bugs that are reported as TODO items in the new reg_posixcc.t
test file.

This is the first step I guess in fixing this problem. The next step
is figuring out how to make controlling which is used a pragma instead
of a build time configure setting.

--
perl -Mre=debug -e "/just|another|perl|hacker/"
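The message above does not name the new properties. Current perls do provide POSIX-shaped properties such as \p{PosixCntrl}, \p{PosixPrint} and \p{PosixPunct}; whether those are exactly the ones this change created is an assumption. With that caveat, a small sketch of checking them against both string representations (the sub name `posix_props_of` is hypothetical):

```perl
use strict;
use warnings;

# \p{PosixCntrl}, \p{PosixPrint} and \p{PosixPunct} exist in current perls;
# that they correspond to the properties created by this change is an
# assumption.
sub posix_props_of {
    my ($char) = @_;
    my @props;
    push @props, 'PosixCntrl' if $char =~ /\A\p{PosixCntrl}\z/;
    push @props, 'PosixPrint' if $char =~ /\A\p{PosixPrint}\z/;
    push @props, 'PosixPunct' if $char =~ /\A\p{PosixPunct}\z/;
    return @props;
}

for my $char ("\t", '$', 'a') {
    my $upgraded = $char;
    utf8::upgrade($upgraded);
    printf "U+%04X plain: [%s] upgraded: [%s]\n", ord($char),
        join(',', posix_props_of($char)),
        join(',', posix_props_of($upgraded));
}
```

Tab should come out as a control character only, and "$" as both printable and punctuation, which are the two cases the original report complained about.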
Subject: Re: program to look at char class complements [perl #60156]
Date: Sat, 8 Nov 2008 15:51:07 +0100
To: perl5-porters [...] perl.org
From: "Dr.Ruud" <rvtol+news [...] isolution.nl>
demerphq wrote:
> \w should be [A-Za-z_]
ITYM: [0-9A-Za-z_]

--
Affijn, Ruud "Gewoon is een tijger."
CC: perl5-porters [...] perl.org
Subject: Re: program to look at char class complements [perl #60156]
Date: Sat, 08 Nov 2008 14:56:59 +0000
To: "Dr.Ruud" <rvtol+news [...] isolution.nl>
From: Alberto Simões <albie [...] alfarrabio.di.uminho.pt>
Dr.Ruud wrote:
> demerphq wrote:
>> \w should be [A-Za-z_]
> > ITYM: [0-9A-Za-z_] >
for non US/UK folks: [0-9[:alpha:]_]

--
Alberto Simões - Departamento de Informática - Universidade do Minho
Campus de Gualtar - 4710-057 Braga - Portugal
CC: perl5-porters [...] perl.org
Subject: Re: program to look at char class complements [perl #60156]
Date: Sat, 8 Nov 2008 16:34:00 +0100
To: Dr.Ruud <rvtol+news [...] isolution.nl>
From: demerphq <demerphq [...] gmail.com>
2008/11/8 Dr.Ruud <rvtol+news@isolution.nl>:
> demerphq wrote:
>> \w should be [A-Za-z_]
> > ITYM: [0-9A-Za-z_]
Yes, that is what I meant. *blush* :-)

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Rafael Garcia-Suarez <rgarciasuarez [...] gmail.com>, Tom Christiansen <tchrist [...] perl.com>, perl5-porters [...] perl.org, bugs-bitbucket [...] netlabs.develooper.com
Subject: Re: program to look at char class complements [perl #60156]
Date: Sat, 08 Nov 2008 15:20:02 -0700
To: demerphq <demerphq [...] gmail.com>
From: karl williamson <public [...] khwilliamson.com>
demerphq wrote:
> [snip] > > Of course this would be a backwards incompatible change. IMO one for > the better I might add, as it would fix a bunch of clear bugs, and > would close what I consider to be security holes.
It would help me to know some of these holes, so I know what to watch for in doing my own coding.
> > [snip]
Subject: Re: program to look at char class complements [perl #60156]
Date: Sun, 9 Nov 2008 11:00:16 +0100
To: perl5-porters [...] perl.org
From: "Dr.Ruud" <rvtol+news [...] isolution.nl>
Alberto Simões wrote:
> Dr.Ruud:
>> demerphq:
>>> \w should be [A-Za-z_]
>> >> ITYM: [0-9A-Za-z_]
> > for non US/UK folks: [0-9[:alpha:]_]
That assumes that [[:alpha:]] is equivalent to [A-Za-z]. Thus that both [A-Za-z] and [a-zA-Z] map to [[:alpha:]]. Then [a-z] would act exactly as [[:lower:]]. But do they?

I don't think so, already because the POSIX [:alpha:] and [:lower:] etc. are locale dependent.

--
Affijn, Ruud "Gewoon is een tijger."
Subject: Re: program to look at char class complements [perl #60156]
Date: Sat, 08 Nov 2008 15:34:27 -0800
To: perl5-porters [...] perl.org
From: "John W. Krahn" <krahnj [...] telus.net>
Alberto Simões wrote:
> > > Dr.Ruud wrote:
>> demerphq wrote:
>>> \w should be [A-Za-z_]
>> >> ITYM: [0-9A-Za-z_]
> > for non US/UK folks: [0-9[:alpha:]_]
ITYM: [[:digit:][:alpha:]_]

Or: [[:alnum:]_]

John

--
Perl isn't a toolbox, but a small machine shop where you can special-order certain sorts of tools at low cost and in short order. -- Larry Wall
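The spellings traded in this sub-thread can be compared directly. What follows is only a sketch, assuming byte (non-UTF-8) strings, no `use locale`, and a scan limited to the ASCII range; `count_matches` is a hypothetical helper named here for illustration. Under those assumptions every spelling, and `\w` itself, covers the same 63 code points (10 digits, 52 letters, and the underscore).

```perl
use strict;
use warnings;

# Hypothetical helper: count the ASCII code points (0..127) whose
# single-character string matches $re.
sub count_matches {
    my ($re) = @_;
    return scalar grep { chr($_) =~ $re } 0 .. 127;
}

# The candidate definitions of \w proposed in the thread, plus \w itself.
# \A and \z anchor the match so that "\n" cannot slip past a trailing '$'.
my %spelling = (
    '[0-9A-Za-z_]'          => qr/\A[0-9A-Za-z_]\z/,
    '[0-9[:alpha:]_]'       => qr/\A[0-9[:alpha:]_]\z/,
    '[[:digit:][:alpha:]_]' => qr/\A[[:digit:][:alpha:]_]\z/,
    '[[:alnum:]_]'          => qr/\A[[:alnum:]_]\z/,
    '\w'                    => qr/\A\w\z/,
);

for my $name (sort keys %spelling) {
    printf "%-24s %d\n", $name, count_matches($spelling{$name});
}
```

The spellings only diverge once locale rules or UTF-8 encoded strings come into play, which is the case the rest of the thread argues about.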
CC: perl5-porters [...] perl.org
Subject: Re: program to look at char class complements [perl #60156]
Date: Sun, 9 Nov 2008 14:49:16 +0100
To: Dr.Ruud <rvtol+news [...] isolution.nl>
From: demerphq <demerphq [...] gmail.com>
2008/11/9 Dr.Ruud <rvtol+news@isolution.nl>:
> Alberto Simões wrote:
>> Dr.Ruud:
>>> demerphq:
>
>>>> \w should be [A-Za-z_]
>>> >>> ITYM: [0-9A-Za-z_]
>> >> for non US/UK folks: [0-9[:alpha:]_]
> > That assumes that [[:alpha:]] is equivalent to [A-Za-z]. Thus that both > [A-Za-z] and [a-zA-Z] map to [[:alpha:]]. > Then [a-z] would act exactly as [[:lower:]]. But do they?
In the POSIX locale yes they are.
> I don't think so, already because the POSIX [:alpha:] and [:lower:] etc. > are locale dependent.
Under use locale they are yes, otherwise they sort of assume POSIX semantics, unless the string is unicode, in which case they behave quite strangely.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
Subject: Re: program to look at char class complements [perl #60156]
Date: Sun, 9 Nov 2008 17:20:16 +0100
To: perl5-porters [...] perl.org
From: "Dr.Ruud" <rvtol+news [...] isolution.nl>
demerphq wrote:
> Dr.Ruud:
>> Alberto Simões:
>>> Dr.Ruud:
>>>> demerphq:
>>>>> \w should be [A-Za-z_]
>>>> >>>> ITYM: [0-9A-Za-z_]
>>> >>> for non US/UK folks: [0-9[:alpha:]_]
>> >> That assumes that [[:alpha:]] is equivalent to [A-Za-z]. Thus that >> both [A-Za-z] and [a-zA-Z] map to [[:alpha:]]. >> Then [a-z] would act exactly as [[:lower:]]. But do they?
> > In the POSIX locale yes they are.
Of course, but in what I wrote there was no context such as "the POSIX locale" involved. Alberto, who went from \w should be [0-9A-Za-z_] to [0-9[:alpha:]_] was I think missing the point that \w should be [0-9A-Za-z_] (so containing exactly 63 codepoints).
>> I don't think so, already because the POSIX [:alpha:] and [:lower:] >> etc. are locale dependent.
> > Under use locale they are yes, otherwise they sortof assume POSIX > semantics, unless the string is unicode, in which case they behave > quite strangely.
The "[:alpha:]" and "[:lower:]" (and such) are "POSIX character classes". Each of them can be different per locale. They are defined that way, no Perl involved.

--
Affijn, Ruud "Gewoon is een tijger."
Subject: Re: program to look at char class complements [perl #60156]
Date: Sun, 9 Nov 2008 17:45:53 +0100
To: perl5-porters [...] perl.org
From: "Dr.Ruud" <rvtol+news [...] isolution.nl>
demerphq wrote:
> Dr.Ruud:
>> That assumes that [[:alpha:]] is equivalent to [A-Za-z]. Thus that >> both [A-Za-z] and [a-zA-Z] map to [[:alpha:]]. >> Then [a-z] would act exactly as [[:lower:]]. But do they?
> > In the POSIX locale yes they are.
I don't know exactly which of my above statements you refer to, but from the "are" I guess it is the first.

Allowing my general statements to be combined with the POSIX locale limits that you attached to them, I think it is like this:

the RE character sets [A-Za-z] and [a-zA-Z] contain exactly 52 codepoints, independent of locale, and the POSIX characterset [:alpha:] for the POSIX locale contains *at minimum* 52 codepoints, namely the combination of [:lower:] and [:upper:].

--
Affijn, Ruud "Gewoon is een tijger."
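The counts being argued about here are easy to check for the ASCII range. The following is a minimal sketch, assuming byte strings and no `use locale`; `ascii_members` is a hypothetical helper, and the scan deliberately stops at code point 127, so it says nothing about how a non-POSIX locale or a UTF-8 string would behave.

```perl
use strict;
use warnings;

# Hypothetical helper: list the ASCII code points (0..127) matched by $re.
sub ascii_members {
    my ($re) = @_;
    return grep { chr($_) =~ $re } 0 .. 127;
}

my @range = ascii_members(qr/\A[A-Za-z]\z/);
my @alpha = ascii_members(qr/\A[[:alpha:]]\z/);
my @cased = ascii_members(qr/\A[[:upper:][:lower:]]\z/);

printf "[A-Za-z]: %d  [[:alpha:]]: %d  upper+lower: %d\n",
    scalar @range, scalar @alpha, scalar @cased;

# Compare the three member lists element by element via their joined form.
print "all three sets are identical over ASCII\n"
    if "@range" eq "@alpha" and "@alpha" eq "@cased";
```

Within ASCII the three sets coincide at 52 members; the disagreement in the thread is about what happens outside that range.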
CC: perl5-porters [...] perl.org
Subject: Re: program to look at char class complements [perl #60156]
Date: Sun, 9 Nov 2008 18:19:30 +0100
To: Dr.Ruud <rvtol+news [...] isolution.nl>
From: demerphq <demerphq [...] gmail.com>
2008/11/9 Dr.Ruud <rvtol+news@isolution.nl>:
> demerphq wrote:
>> Dr.Ruud:
>
>>> That assumes that [[:alpha:]] is equivalent to [A-Za-z]. Thus that >>> both [A-Za-z] and [a-zA-Z] map to [[:alpha:]]. >>> Then [a-z] would act exactly as [[:lower:]]. But do they?
>> >> In the POSIX locale yes they are.
> > I don't know exactly which of my above statements you refer to, but from > the "are" I guess it is the first.
See the bottomish part of
http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

Where it says:

-------8<--------8<---------8<--------
upper
    Define characters to be classified as uppercase letters.

    In the POSIX locale, the 26 uppercase letters shall be included:

    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

    In a locale definition file, no character specified for the keywords cntrl, digit, punct, or space shall be specified. The uppercase letters <A> to <Z>, as defined in Character Set Description File (the portable character set), are automatically included in this class.

lower
    Define characters to be classified as lowercase letters.

    In the POSIX locale, the 26 lowercase letters shall be included:

    a b c d e f g h i j k l m n o p q r s t u v w x y z

    In a locale definition file, no character specified for the keywords cntrl, digit, punct, or space shall be specified. The lowercase letters <a> to <z> of the portable character set are automatically included in this class.

alpha
    Define characters to be classified as letters.

    In the POSIX locale, all characters in the classes upper and lower shall be included.

    In a locale definition file, no character specified for the keywords cntrl, digit, punct, or space shall be specified. Characters classified as either upper or lower are automatically included in this class.

digit
    Define the characters to be classified as numeric digits.

    In the POSIX locale, only:

    0 1 2 3 4 5 6 7 8 9

    shall be included.

    In a locale definition file, only the digits <zero>, <one>, <two>, <three>, <four>, <five>, <six>, <seven>, <eight>, and <nine> shall be specified, and in contiguous ascending sequence by numerical value. The digits <zero> to <nine> of the portable character set are automatically included in this class.

alnum
    Define characters to be classified as letters and numeric digits. Only the characters specified for the alpha and digit keywords shall be specified. Characters specified for the keywords alpha and digit are automatically included in this class.

space
    Define characters to be classified as white-space characters.

    In the POSIX locale, at a minimum, the <space>, <form-feed>, <newline>, <carriage-return>, <tab>, and <vertical-tab> shall be included.

    In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, graph, or xdigit shall be specified. The <space>, <form-feed>, <newline>, <carriage-return>, <tab>, and <vertical-tab> of the portable character set, and any characters included in the class blank are automatically included in this class.

cntrl
    Define characters to be classified as control characters.

    In the POSIX locale, no characters in classes alpha or print shall be included.

    In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, punct, graph, print, or xdigit shall be specified.

punct
    Define characters to be classified as punctuation characters.

    In the POSIX locale, neither the <space> nor any characters in classes alpha, digit, or cntrl shall be included.

    In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, cntrl, xdigit, or as the <space> shall be specified.

graph
    Define characters to be classified as printable characters, not including the <space>.

    In the POSIX locale, all characters in classes alpha, digit, and punct shall be included; no characters in class cntrl shall be included.

    In a locale definition file, characters specified for the keywords upper, lower, alpha, digit, xdigit, and punct are automatically included in this class. No character specified for the keyword cntrl shall be specified.

print
    Define characters to be classified as printable characters, including the <space>.

    In the POSIX locale, all characters in class graph shall be included; no characters in class cntrl shall be included.

    In a locale definition file, characters specified for the keywords upper, lower, alpha, digit, xdigit, punct, graph, and the <space> are automatically included in this class. No character specified for the keyword cntrl shall be specified.

xdigit
    Define the characters to be classified as hexadecimal digits.

    In the POSIX locale, only:

    0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f

    shall be included.

    In a locale definition file, only the characters defined for the class digit shall be specified, in contiguous ascending sequence by numerical value, followed by one or more sets of six characters representing the hexadecimal digits 10 to 15 inclusive, with each set in ascending order (for example, <A>, <B>, <C>, <D>, <E>, <F>, <a>, <b>, <c>, <d>, <e>, <f>). The digits <zero> to <nine>, the uppercase letters <A> to <F>, and the lowercase letters <a> to <f> of the portable character set are automatically included in this class.

blank
    Define characters to be classified as <blank>s.

    In the POSIX locale, only the <space> and <tab> shall be included.

    In a locale definition file, the <space> and <tab> are automatically included in this class.
------->8------->8--------->8--------

One important point to keep in mind, which for me is a real issue for things like making [:print:] map to some unicode based definition is that the rules are VERY specific about what types of characters are allowed to be in a given class.

So for instance the comment for [:print:] says "In a locale definition file, characters specified for the keywords upper, lower, alpha, digit, xdigit, punct, graph, and the <space> are automatically included in this class. No character specified for the keyword cntrl shall be specified."

This says to me that if unicode wants to include tab in both "control" and "space" that it is then impossible to correctly use either to define [:print:].
> Allowing my general statements to be combined with the POSIX locale > limits that you attached to them, I think it is like this: > > the RE character sets [A-Za-z] and [a-zA-Z] contain exactly 52 > codepoints, independent of locale, and > the POSIX characterset [:alpha:] for the POSIX locale contains *at > minimum* 52 codepoints, namely the combination of [:lower:] and > [:upper:].
The last part is wrong. Under the POSIX locale (which is a locale defined in the POSIX standard as being the locale that is used unless additional locale data is provided via the standard interfaces) the character class [:alpha:] contains exactly 52 characters.

Under some other locale the POSIX character class [:alpha:] /may/ have other meanings, however this behavior is only enabled in Perl under 'use locale;' (and will make regex matching much slower).

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
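Several of the POSIX-locale constraints quoted in this message can be checked mechanically for the ASCII range. This is a sketch only, assuming byte strings and no `use locale`; the `%in` table and the `has` helper are hypothetical names introduced for illustration. It checks that print and cntrl are disjoint, that graph is alpha plus digit plus punct, that print is graph plus the space character, and it reports where tab falls.

```perl
use strict;
use warnings;

# Record, for each POSIX class, which ASCII code points (0..127) it matches.
my %in;
for my $class (qw(alpha digit punct graph print cntrl space)) {
    my $re = qr/\A[[:${class}:]]\z/;
    $in{$class} = { map { $_ => 1 } grep { chr($_) =~ $re } 0 .. 127 };
}

# Hypothetical helper: 1 if code point $cp is in $class, else 0.
sub has {
    my ($class, $cp) = @_;
    return $in{$class}{$cp} ? 1 : 0;
}

# "no characters in class cntrl shall be included" in print.
my @print_and_cntrl = grep { has('print', $_) && has('cntrl', $_) } 0 .. 127;

# graph compared with alpha + digit + punct.
my @graph_mismatch = grep {
    has('graph', $_) != (has('alpha', $_) || has('digit', $_) || has('punct', $_))
} 0 .. 127;

# print compared with graph plus the space character.
my @print_mismatch = grep {
    has('print', $_) != (has('graph', $_) || ($_ == ord(' ') ? 1 : 0))
} 0 .. 127;

printf "print/cntrl overlap: %d\n", scalar @print_and_cntrl;
printf "graph mismatches:    %d\n", scalar @graph_mismatch;
printf "print mismatches:    %d\n", scalar @print_mismatch;

# Tab is the awkward case raised above: white space and control at once.
my $tab = ord("\t");
printf "tab: space=%d cntrl=%d print=%d\n",
    has('space', $tab), has('cntrl', $tab), has('print', $tab);
```

For ASCII, tab is in both space and cntrl and is excluded from print, which is why a definition of print built from "everything in space" cannot satisfy the rule that print contains no control characters.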
resolved in perl 5.11
CC: Zefram <zefram [...] fysh.org>, Juerd Waalboer <juerd [...] convolution.nl>, Tom Christiansen <tchrist [...] perl.com>, Glenn Linderman <perl [...] NevCal.com>
Subject: RFC: [perl #60156] What to do about [[:posix:]] ?
Date: Sat, 14 Aug 2010 11:09:42 -0600
To: Perl5 Porters <perl5-porters [...] perl.org>, demerphq <demerphq [...] gmail.com>
From: karl williamson <public [...] khwilliamson.com>
There are a number of problems with the [[:posix:]] character classes. I thought we had what to do about this settled, but that was before there was more of an emphasis on strict backwards compatibility, and before I did some more investigation, so I thought I had better air it again.

Here are the problems:

1) They do not match the Posix standard. In our attempt to DWIM, we violate it. For example, [[:alpha:]] is only supposed to match A-Za-z, unless in a locale that has other alphabetics. But, if the target string or pattern indicate a utf8 match, it matches \p{alpha}. I suppose we could argue that we have created a new locale, the Unicode locale. I don't know if that argument holds water or not.

2) They suffer from "The Unicode Bug", in which the utf8ness of the pattern or string affects the semantics of the match. [[:alpha:]] will match "\xe1" if and only if the pattern or target string are in utf8.

3) A number of characters in utf8 match both a class and the complement of the class. Here's a list from bug #60156:
 [[:alnum:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
 [[:alpha:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
 [[:blank:]] U+A0
 [[:cntrl:]] U+80
 [[:graph:]] U+A1
 [[:lower:]] U+AA U+B5 U+BA U+DF..F6 U+F8
 [[:print:]] U+A0
 [[:punct:]] U+24 U+2B U+3C..3E U+5E U+60 U+7C U+7E U+A1 U+AB U+B7 U+BB U+BF
 [[:space:]] U+85 U+A0
 [[:upper:]] U+C0..D6 U+D8
 [[:word:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8

Note that some of these are ASCII. The root cause of these is mostly from the same causes as the Unicode bug, but also because when they are stored in utf8 the code re-uses an existing, but not quite corresponding, \p{} property.

4) Extending the posix definitions was not done consistently. This is especially noticeable in punct. Unicode splits what Posix considers punctuation into two classes: punctuation and symbol. But in extending [[:punct:]] to beyond ASCII, Perl doesn't include the Unicode symbols. The result is inconsistent, the ASCII range symbols are included, but no other.

It is less clear about other extensions. Should [[:cntrl:]] include other things that Unicode considers control-like, namely the surrogates, the formats (soft hyphen et al.), and private use characters? What about title case, fractions, super and subscripts?

Before, it seemed like the obvious solution to all this was to just go back to the formal Posix definition of what they should match, not having a "Unicode locale", and that was done via #ifdefs for a while in 5.11. But it was part of a larger patch that it was decided to revert. Now the #ifdefs remain defined the other direction, and perlrecharclass.pod in 5.12 says that it is proposed to make these match the Posix standard exactly, asking anyone who disagrees to notify us. There has so far been none.

If we were to just reinstate those #ifdefs, it would fix all the above problems in one fell swoop. But it seems to me that we will break too much existing code. I think it was a mistake extending these definitions to a made-up "Unicode locale" in the first place, but that ship has sailed, I think, in spite of what we thought we had decided earlier.

I have done some investigation, and it appears that I can easily solve problem 3) by creating more properties in mktables tailored just for these posix character classes; and easily solve 2) for regexes compiled under feature unicode_strings, by extending what I'm already about to submit a patch for regarding [\w\s]. I think I should do this, ripping out the #ifdefs.

If we want to restrict the posix classes to strict posix definitions, I think it probably should be done with a pragma: 'use feature "strict_posix"' or 'use re "strict_posix"'. This is not as high-priority in my view; and I'm not certain it even needs to be done at all if 2) and 3) are fixed.

I think, for consistency, especially if we don't add the strict posix interpretations, that punct should change to include the Unicode symbols as well; I think the other inconsistencies are not something to worry about; but am less confident in this.

Comments?
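Problem 2) can be observed with a few lines of Perl. The sketch below matches the same character, U+00E1, against [[:alpha:]] once as a single-byte string and once as a UTF-8 encoded string. It is written under some assumptions that go beyond this message: it needs Perl 5.14 or later, because it also uses the `/u` and `/a` pattern modifiers (which postdate this thread) for comparison, and the `report` helper is a hypothetical name.

```perl
use strict;
use warnings;

# The same character, U+00E1, in the two internal representations.
my $bytes = "\xe1";
utf8::downgrade($bytes);    # single-byte internal representation
my $wide = "\xe1";
utf8::upgrade($wide);       # UTF-8 internal representation

# Hypothetical helper: show whether each representation matches $re.
sub report {
    my ($label, $re) = @_;
    printf "%-16s bytes=%-5s utf8=%s\n", $label,
        ($bytes =~ $re ? 'match' : 'no'),
        ($wide  =~ $re ? 'match' : 'no');
}

# No modifier: the legacy rules, where the answer may depend on the
# string's internal encoding (the "Unicode Bug" described above).
report('default rules', qr/[[:alpha:]]/);

# Assumption: Perl 5.14+ modifiers, not mentioned in the thread.
report('/u (Unicode)',  qr/[[:alpha:]]/u);   # Unicode rules for both forms
report('/a (ASCII)',    qr/[[:alpha:]]/a);   # ASCII-only rules for both forms
```

With an explicit modifier the two representations give the same answer; the first row is the one to watch on any given perl, since it is where the utf8ness of the string can leak into the result.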
CC: Perl5 Porters <perl5-porters [...] perl.org>, demerphq <demerphq [...] gmail.com>, Zefram <zefram [...] fysh.org>, Tom Christiansen <tchrist [...] perl.com>, Glenn Linderman <perl [...] NevCal.com>
Subject: Re: RFC: [perl #60156] What to do about [[:posix:]] ?
Date: Sat, 14 Aug 2010 19:34:13 +0200
To: karl williamson <public [...] khwilliamson.com>
From: Juerd Waalboer <juerd [...] tnx.nl>
karl williamson wrote 2010-08-14 11:09 (-0600):
> 1) They do not match the Posix standard. In our attempt to DWIM, we > violate it.
This has been the case for many years and I think it's a good idea to keep it dwimmy. A strict POSIX mode could perhaps be added, but breaking backwards compatibility just to comply with an outdated standard feels wrong. Perl has, in a way, set a new standard for these character classes even though they're still referred to as "POSIX".

Perhaps the bug is that we're still calling them POSIX character classes, and the fix is to just rename them to POSIX-like, POSIX-ish, or something entirely different.
> 2) They suffer from "The Unicode Bug", in which the utf8ness of the > pattern or string affects the semantics of the match.
This is bad. I'm strongly convinced that all text operations should be unicody by default, regardless of internal representation.
> 3) A number of characters in utf8 match both a class and the complement > of the class.
This is not necessarily a problem, depending on the reasons for the dual matching. It's certainly debatable whether the non-breaking space is whitespace, non-whitespace, both, or perhaps even neither.
> 4) Extending the posix definitions was not done consistently. This is > especially noticeable in punct. (...) > It is less clear about other extensions.
This one's really tough. I'd be in favor of fixing consistency. That'll require thorough investigation.
> Before, it seemed like the obvious solution to all this was to just go > back to the formal Posix definition of what they should match, not > having a "Unicode locale", and that was done via #ifdefs for a while in > 5.11. But it was part of a larger patch that was it decided to revert.
I was not particularly happy with this specific change. Going back never felt right to me, although I appreciate that it is the only easy solution and perhaps even the only thing that deserves to be called a solution.
> If we want to restrict the posix classes to strict posix definitions, I > think it probably should be done with a pragma: 'use feature > "strict_posix"' or 'use re "strict_posix"'.
It could be argued that this should belong in "use POSIX", maybe implied when loading it with the default exports, maybe only enabled when explicitly mentioned in the import list.
> I think, for consistency, especially if we don't add the strict posix > interpretations that punct should change to include the Unicode symbols > as well; I think the other inconsistencies are not something to worry > about; but am less confident in this.
Agreed.

--
Met vriendelijke groet, // Kind regards, // Korajn salutojn,
Juerd Waalboer <juerd@tnx.nl>
TNX
CC: Perl5 Porters <perl5-porters [...] perl.org>, demerphq <demerphq [...] gmail.com>, Zefram <zefram [...] fysh.org>, Tom Christiansen <tchrist [...] perl.com>, Glenn Linderman <perl [...] NevCal.com>
Subject: Re: RFC: [perl #60156] What to do about [[:posix:]] ?
Date: Tue, 17 Aug 2010 16:31:01 -0600
To: Juerd Waalboer <juerd [...] tnx.nl>
From: karl williamson <public [...] khwilliamson.com>
Juerd Waalboer wrote:
> karl williamson wrote 2010-08-14 11:09 (-0600):
>> 1) They do not match the Posix standard. In our attempt to DWIM, we >> violate it.
> > This has been the case for many years and I think it's a good idea to > keep it dwimmy. A strict POSIX mode could perhaps be added, but breaking > backwards compatibility just to comply with an outdated standard feels > wrong. Perl has, in a way, set a new standard for these character > classes even though they're still referred to as "POSIX". > > Perhaps the bug is that we're still calling them POSIX character > classes, and the fix is to just rename them to POSIX-like, POSIX-ish, or > something entirely different. >
>> 2) They suffer from "The Unicode Bug", in which the utf8ness of the >> pattern or string affects the semantics of the match.
> > This is bad. I'm strongly convinced that all text operations should be > unicody by default, regardless of internal representation.
That will be the case eventually if you 'use 5.12.0' or greater.
>
>> 3) A number of characters in utf8 match both a class and the complement >> of the class.
> > This is not necessarily a problem, depending on the reasons for the dual > matching. It's certainly debatable whether the non-breaking space is > whitespace, non-whitespace, both, or perhaps even neither.
But the definition of complement is the set of all characters that aren't in the original set. We don't allow for fuzziness.
>
>> 4) Extending the posix definitions was not done consistently. This is >> especially noticeable in punct. (...) >> It is less clear about other extensions.
> > This one's really tough. I'd be in favor of fixing consistency. That'll > require thorough investigation.
It turns out that punct is the only one that has this problem.
>
>> Before, it seemed like the obvious solution to all this was to just go >> back to the formal Posix definition of what they should match, not >> having a "Unicode locale", and that was done via #ifdefs for a while in >> 5.11. But it was part of a larger patch that was it decided to revert.
> > I was not particularly happy with this specific change. Going back > never felt right to me, although I appreciate that it is the only easy > solution and perhaps even the only thing that deserves to be called a > solution. >
>> If we want to restrict the posix classes to strict posix definitions, I >> think it probably should be done with a pragma: 'use feature >> "strict_posix"' or 'use re "strict_posix"'.
> > It could be argued that this should belong in "use POSIX", maybe implied > when loading it with the default exports, maybe only enabled when > explicitly mentioned in the import list.
OK, if we do this, it makes sense to use the existing pragma.
>
>> I think, for consistency, especially if we don't add the strict posix >> interpretations that punct should change to include the Unicode symbols >> as well; I think the other inconsistencies are not something to worry >> about; but am less confident in this.
> > Agreed.
Having investigated further, I've implemented things so that the bugs go away without having to go back to strict Posix. So that's what I intend to do, unless there is sufficient complaint. I also intend to separately change the extended definition of [[:punct:]] to include the Unicode symbols, as that also has to be what the intent was, and should be, unless there is sufficient complaint.
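The complement property at stake in problem 3) can be tested exhaustively for code points 0..255: each character should match exactly one of a class and its negation, whichever way the string is stored. The scan below is a sketch under stated assumptions, not the program from the original bug report; the variable names are illustrative, and it relies only on Perl's `[[:^class:]]` negation syntax and the core `utf8::upgrade` / `utf8::downgrade` functions.

```perl
use strict;
use warnings;

my @classes = qw(alnum alpha blank cntrl digit graph lower
                 print punct space upper word xdigit);

# Collect every (class, code point, representation) triple for which the
# character matches both the class and its complement, or neither.
my @violations;
for my $class (@classes) {
    my $pos = qr/\A[[:${class}:]]\z/;
    my $neg = qr/\A[[:^${class}:]]\z/;
    for my $cp (0 .. 255) {
        for my $form (qw(bytes utf8)) {
            my $ch = chr $cp;
            if ($form eq 'utf8') { utf8::upgrade($ch) }
            else                 { utf8::downgrade($ch) }
            my $in_class      = $ch =~ $pos ? 1 : 0;
            my $in_complement = $ch =~ $neg ? 1 : 0;
            push @violations,
                sprintf("[[:%s:]] U+%02X (%s)", $class, $cp, $form)
                if $in_class == $in_complement;
        }
    }
}

print @violations
    ? join("\n", @violations, '')
    : "no complement violations found\n";
```

On a perl that includes the fix the list is expected to be empty; on an affected 5.10-era perl it would presumably reproduce entries like those in the list from bug #60156.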
CC: Perl5 Porters <perl5-porters [...] perl.org>, Zefram <zefram [...] fysh.org>, Juerd Waalboer <juerd [...] convolution.nl>, Tom Christiansen <tchrist [...] perl.com>, Glenn Linderman <perl [...] nevcal.com>
Subject: Re: RFC: [perl #60156] What to do about [[:posix:]] ?
Date: Fri, 20 Aug 2010 16:41:48 +0200
To: karl williamson <public [...] khwilliamson.com>
From: demerphq <demerphq [...] gmail.com>
On 14 August 2010 19:09, karl williamson <public@khwilliamson.com> wrote:
> There are a number of problems with the [[:posix:]] character classes. I > thought we had what to do about this settled, but that was before there was > more of an emphasis on strict backwards compatibility, and before I did some > more investigation, so I thought I had better air it again. > > Here are the problems: > > 1) They do not match the Posix standard.  In our attempt to DWIM, we violate > it.  For example, [[:alpha:]] is only supposed to match A-Za-z, unless in a > locale that has other alphabetics.  But, if the target string or pattern > indicate a utf8 match, it matches \p{alpha}.  I suppose we could argue that > we have created a new locale, the Unicode locale.  I don't know if that > argument holds water or not. > > 2) They suffer from "The Unicode Bug", in which the utf8ness of the pattern > or string affects the semantics of the match.  [[:alpha:]] will match "\xe1" > if and only if the pattern or target string are in utf8. > > 3) A number of characters in utf8 match both a class and the complement of > the class.  Here's a list from bug #60156: >  [[:alnum:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8 >  [[:alpha:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8 >  [[:blank:]] U+A0 >  [[:cntrl:]] U+80 >  [[:graph:]] U+A1 >  [[:lower:]] U+AA U+B5 U+BA U+DF..F6 U+F8 >  [[:print:]] U+A0 >  [[:punct:]] U+24 U+2B U+3C..3E U+5E U+60 U+7C U+7E U+A1 U+AB U+B7 U+BB U+BF >  [[:space:]] U+85 U+A0 >  [[:upper:]] U+C0..D6 U+D8 >  [[:word:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8 > > Note that some of these are ASCII.  The root cause of these is mostly from > the same causes as the Unicode bug, but also because when they are stored in > utf8 the code re-uses an existing, but not quite corresponding, \p{} > property > > 4) Extending the posix definitions was not done consistently.  This is > especially noticeable in punct.  Unicode splits what Posix considers > punctuation into two classes: punctuation and symbol.  
But in extending > [[:punct:]] to beyond ASCII, Perl doesn't include the Unicode symbols. The > result is inconsistent, the ASCII range symbols are included, but no other. > > It is less clear about other extensions.  Should [[:cntrl:]] include other > things that Unicode considers control-like, namely the surrogates, the > formats (soft hyphen et.al), and private use characters?  What about title > case, fractions, super and subscripts? > > Before, it seemed like the obvious solution to all this was to just go back > to the formal Posix definition of what they should match, not having a > "Unicode locale", and that was done via #ifdefs for a while in 5.11.  But it > was part of a larger patch that was it decided to revert.  Now the #ifdefs > remain defined the other direction, and perlrecharclass.pod in 5.12 says > that it is proposed to make these match the Posix standard exactly, asking > anyone who disagrees to notify us. There has so far been none. > > If we were to just reinstate those #ifdefs, it would fix all the above > problems in one fell swoop.  But it seems to me that we will break too much > existing code.  I think it was a mistake extending these definitions to a > made-up "Unicode locale" in the first place, but that ship has sailed, I > think, in spite of what we thought we had decided earlier. > > I have done some investigation, and it appears that I can easily solve > problem 3) by creating more properties in mktables tailored just for these > posix character classes; and easily solve 3) for regexes compiled under > feature unicode_strings, by extending what I'm already about to submit a > patch for regarding [\w\s].  I think I should do this, ripping out the > #ifdefs > > If we want to restrict the posix classes to strict posix definitions, I > think it probably should be done with a pragma: 'use feature "strict_posix"' > or 'use re "strict_posix"'.  
This is not as high-priority in my view; and > I'm not certain it even needs to be done at all if 2) and 3) are fixed. > > I think, for consistency, especially if we don't add the strict posix > interpretations that punct should change to include the Unicode symbols as > well; I think the other inconsistencies are not something to worry about; > but am less confident in this. > > Comments?
POSIX is a standard. It is NOT up to us to redefine that standard.

Had we realized that we were breaking the standard and the massive can of worms involved at the time I do not think we would have gone the way we did.

I think it would be a HUGE benefit to return to the correct interpretation of POSIX charclasses and I do not think that backcompat will be impacted any more than a bunch of buggy programs stop being buggy.

Anyway, that's my view.

cheers
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
CC: karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>, Zefram <zefram [...] fysh.org>, Juerd Waalboer <juerd [...] convolution.nl>, Tom Christiansen <tchrist [...] perl.com>, Glenn Linderman <perl [...] nevcal.com>
Subject: Re: RFC: [perl #60156] What to do about [[:posix:]] ?
Date: Tue, 30 Nov 2010 15:08:49 +0100
To: demerphq <demerphq [...] gmail.com>
From: Abigail <abigail [...] abigail.be>
On Fri, Aug 20, 2010 at 04:41:48PM +0200, demerphq wrote:
> On 14 August 2010 19:09, karl williamson <public@khwilliamson.com> wrote:
> > There are a number of problems with the [[:posix:]] character classes. I
> > thought we had what to do about this settled, but that was before there was
> > more of an emphasis on strict backwards compatibility, and before I did some
> > more investigation, so I thought I had better air it again.
> >
> > Here are the problems:
> >
> > 1) They do not match the Posix standard.  In our attempt to DWIM, we violate
> > it.  For example, [[:alpha:]] is only supposed to match A-Za-z, unless in a
> > locale that has other alphabetics.  But, if the target string or pattern
> > indicate a utf8 match, it matches \p{alpha}.  I suppose we could argue that
> > we have created a new locale, the Unicode locale.  I don't know if that
> > argument holds water or not.
> >
> > 2) They suffer from "The Unicode Bug", in which the utf8ness of the pattern
> > or string affects the semantics of the match.  [[:alpha:]] will match "\xe1"
> > if and only if the pattern or target string are in utf8.
> >
> > 3) A number of characters in utf8 match both a class and the complement of
> > the class.  Here's a list from bug #60156:
> >  [[:alnum:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
> >  [[:alpha:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
> >  [[:blank:]] U+A0
> >  [[:cntrl:]] U+80
> >  [[:graph:]] U+A1
> >  [[:lower:]] U+AA U+B5 U+BA U+DF..F6 U+F8
> >  [[:print:]] U+A0
> >  [[:punct:]] U+24 U+2B U+3C..3E U+5E U+60 U+7C U+7E U+A1 U+AB U+B7 U+BB U+BF
> >  [[:space:]] U+85 U+A0
> >  [[:upper:]] U+C0..D6 U+D8
> >  [[:word:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
> >
> > Note that some of these are ASCII.  The root cause of these is mostly from
> > the same causes as the Unicode bug, but also because when they are stored in
> > utf8 the code re-uses an existing, but not quite corresponding, \p{}
> > property.
> >
> > 4) Extending the posix definitions was not done consistently.  This is
> > especially noticeable in punct.  Unicode splits what Posix considers
> > punctuation into two classes: punctuation and symbol.  But in extending
> > [[:punct:]] beyond ASCII, Perl doesn't include the Unicode symbols. The
> > result is inconsistent: the ASCII-range symbols are included, but no others.
> >
> > It is less clear about other extensions.  Should [[:cntrl:]] include other
> > things that Unicode considers control-like, namely the surrogates, the
> > formats (soft hyphen et al.), and private use characters?  What about title
> > case, fractions, super and subscripts?
> >
> > Before, it seemed like the obvious solution to all this was to just go back
> > to the formal Posix definition of what they should match, not having a
> > "Unicode locale", and that was done via #ifdefs for a while in 5.11.  But it
> > was part of a larger patch that it was decided to revert.  Now the #ifdefs
> > remain defined the other direction, and perlrecharclass.pod in 5.12 says
> > that it is proposed to make these match the Posix standard exactly, asking
> > anyone who disagrees to notify us. There have so far been none.
> >
> > If we were to just reinstate those #ifdefs, it would fix all the above
> > problems in one fell swoop.  But it seems to me that we will break too much
> > existing code.  I think it was a mistake extending these definitions to a
> > made-up "Unicode locale" in the first place, but that ship has sailed, I
> > think, in spite of what we thought we had decided earlier.
> >
> > I have done some investigation, and it appears that I can easily solve
> > problem 3) by creating more properties in mktables tailored just for these
> > posix character classes; and easily solve 2) for regexes compiled under
> > feature unicode_strings, by extending what I'm already about to submit a
> > patch for regarding [\w\s].  I think I should do this, ripping out the
> > #ifdefs.
> >
> > If we want to restrict the posix classes to strict posix definitions, I
> > think it probably should be done with a pragma: 'use feature "strict_posix"'
> > or 'use re "strict_posix"'.  This is not as high-priority in my view; and
> > I'm not certain it even needs to be done at all if 2) and 3) are fixed.
> >
> > I think, for consistency, especially if we don't add the strict posix
> > interpretations, that punct should change to include the Unicode symbols as
> > well; I think the other inconsistencies are not something to worry about;
> > but am less confident in this.
> >
> > Comments?
>
> POSIX is a standard. It is NOT up to us to redefine that standard. Had
> we realized that we were breaking the standard and the massive can of
> worms involved at the time I do not think we would have gone the way
> we did. I think it would be a HUGE benefit to return to the correct
> interpretation of POSIX charclasses and I do not think that backcompat
> will be impacted any more than a bunch of buggy programs stop being
> buggy.
>
It's a bit late, but I agree with Yves. POSIX is a standard. It defines
what goes in [[:posix:]]. Unicode may be a newer standard, but it has its
own set of properties, \p{Property}. By "extending" POSIX character classes
so they are (more or less) equivalent to Unicode properties, we've actually
taken away functionality.

Abigail
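
For anyone who wants to see the POSIX/Unicode split concretely, here is a minimal sketch. It takes the nine ASCII characters the original report lists for [[:punct:]] and tests each against the POSIX class and against the two Unicode general categories, Punctuation and Symbol. The helper name classify() and the hash keys are made up for illustration; the only Perl features assumed are the [[:punct:]] class and the \p{General_Category=...} property syntax.

```perl
use strict;
use warnings;

# The nine ASCII characters the original report lists as matched by
# [[:punct:]] but not by \p{IsPunct}.
my @chars = ('$', '+', '<', '=', '>', '^', '`', '|', '~');

# Hypothetical helper: test one character against the POSIX class and
# against the two Unicode general categories discussed in the thread.
sub classify {
    my ($ch) = @_;
    return +{
        posix_punct => ($ch =~ /[[:punct:]]/                     ? 1 : 0),
        gc_punct    => ($ch =~ /\p{General_Category=Punctuation}/ ? 1 : 0),
        gc_symbol   => ($ch =~ /\p{General_Category=Symbol}/      ? 1 : 0),
    };
}

for my $ch (@chars) {
    my $r = classify($ch);
    printf "%s  [[:punct:]]=%d  gc=Punctuation:%d  gc=Symbol:%d\n",
        $ch, @{$r}{qw(posix_punct gc_punct gc_symbol)};
}
```

Spelling out General_Category avoids relying on what the shorter \p{Punct} or \p{IsPunct} names mean in a particular perl release.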
This has finally been fixed in blead.
--Karl Williamson
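
A rough way to check a given perl for problem 3) from the quoted message is to scan code points 0..255 and report any that match both a POSIX class and its complement once the string is stored as UTF-8. This is only a sketch: the helper name overlaps() is invented here, and the class list is the one quoted in this ticket. On a perl that includes the fix, every class should report (none).

```perl
use strict;
use warnings;

# POSIX class names as listed in the quoted message.
my @classes = qw(alnum alpha blank cntrl graph lower print punct space upper word);

# Hypothetical helper: return the code points in 0..255 that match both
# [[:CLASS:]] and [[:^CLASS:]] when the target string is held as UTF-8.
sub overlaps {
    my ($class) = @_;
    my $pos = qr/[[:${class}:]]/;
    my $neg = qr/[[:^${class}:]]/;
    my @hits;
    for my $cp (0 .. 255) {
        my $s = chr $cp;
        utf8::upgrade($s);    # force the UTF-8 internal representation
        push @hits, $cp if $s =~ $pos && $s =~ $neg;
    }
    return @hits;
}

for my $class (@classes) {
    my @hits = overlaps($class);
    printf "[[:%s:]] %s\n", $class,
        @hits ? join(' ', map { sprintf 'U+%02X', $_ } @hits) : '(none)';
}
```

Dropping the utf8::upgrade() call gives the native-string behaviour for comparison, which is the other half of problem 2).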

