
A number of characters match both a posix class and its complement #9543

Closed
p5pRT opened this issue Oct 26, 2008 · 45 comments

Comments


p5pRT commented Oct 26, 2008

Migrated from rt.perl.org#60156 (status was 'resolved')

Searchable as RT60156$


p5pRT commented Jan 2, 2008

From Robin.Barker@npl.co.uk

Created by robin.barker@npl.co.uk

As I read the documentation, the pairs in perlre.pod of
[[:...:]] and \p{Is....} are supposed to match the same.

But
\p{IsPrint} matches \011 \012 \013 \014 \015
which [[:print:]] does not

\p{IsPunct} does not match $ + < = > ^ ` | ~
which [[:punct:]] does match

Various \p{Is...} match characters in the range 128-256
but no [[:...:]] match characters in that range.
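[Editorial note: the mismatches reported above line up with the Unicode general categories of the characters involved. The following Python check, added here for illustration and not part of the original report, shows that the disputed punctuation characters are filed under the Symbol categories and that \011-\015 are controls:]

```python
import unicodedata

# Characters [[:punct:]] matches but \p{IsPunct} does not: Unicode files
# them under the Symbol categories (Sc/Sm/Sk), not Punctuation (P*).
for ch in "$+<=>^`|~":
    print(ch, unicodedata.category(ch))  # Sc Sm Sm Sm Sm Sk Sk Sm Sm

# The characters \011-\015 that \p{IsPrint} matched are controls (Cc),
# which is why POSIX [[:print:]] rejects them.
for ch in "\t\n\x0b\x0c\r":
    print(hex(ord(ch)), unicodedata.category(ch))  # all Cc
```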

Perl Info
---
Flags:
    category=core
    severity=medium
---
Site configuration information for perl 5.10.0:

Configured by Robin Barker at Tue Dec 18 11:23:28 GMT 2007.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.18-1.2798.fc6, archname=i686-linux-64int
    uname='linux rain.npl.co.uk 2.6.18-1.2798.fc6 #1 smp mon oct 16 14:54:20 edt 2006 i686 i686 i386 gnulinux '
    config_args='-des -Dcc=gcc -Uinstallusrbinperl -Dcf_email=Robin.Barker@npl.co.uk -Dcf_by=Robin Barker -Dman1dir=none -Dman3dir=none -Doptimize=-O2 -g -Duse64bitint -Dotherlibdirs=/usr/lib/perl5:/usr/lib/perl5/site_perl:/usr/lib/perl5/vendor_perl -Dinc_version_list=5.8.8 5.8.7 5.8.6 5.8.5 5.8.4 5.8.3 5.8.2 5.8.1 5.8.0 5.6.1 5.6.0 5.005'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-O2 -g',
    cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='4.2.2', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.5.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'

Locally applied patches:
    

---
@INC for perl 5.10.0:
    /usr/local/lib/perl5/5.10.0/i686-linux-64int
    /usr/local/lib/perl5/5.10.0
    /usr/local/lib/perl5/site_perl/5.10.0/i686-linux-64int
    /usr/local/lib/perl5/site_perl/5.10.0
    /usr/local/lib/perl5/site_perl/5.8.8
    /usr/local/lib/perl5/site_perl/5.005
    /usr/local/lib/perl5/site_perl
    /usr/lib/perl5/5.8.8
    /usr/lib/perl5/5.8.7
    /usr/lib/perl5/5.8.6
    /usr/lib/perl5/5.8.5
    /usr/lib/perl5
    /usr/lib/perl5/site_perl/5.8.8
    /usr/lib/perl5/site_perl/5.8.7
    /usr/lib/perl5/site_perl/5.8.6
    /usr/lib/perl5/site_perl/5.8.5
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.8
    /usr/lib/perl5/vendor_perl/5.8.7
    /usr/lib/perl5/vendor_perl/5.8.6
    /usr/lib/perl5/vendor_perl/5.8.5
    /usr/lib/perl5/vendor_perl
    .

---
Environment for perl 5.10.0:
    HOME=/home/rmb1
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/usr/lib:/usr/local/lib
    LOGDIR (unset)
    PATH=/home/rmb1/appl/script:/opt/perl/bin:/opt/gcc/bin:/opt/SUNWspci/bin/:/usr/rain.npl.co.uk/bin:/usr/local/bin:/usr/local/Admigration/exec:/usr/local/hotjava/bin:/usr/openwin/bin:/usr/dt/bin:/usr/bin:/bin
    PERL5LIB=
    PERL_BADLANG (unset)
    SHELL=/bin/tcsh

Robin Barker
Mathematics and Scientific Computing group
F1-A8
Ext: 7090





p5pRT commented Mar 31, 2008

From Robin.Barker@npl.co.uk

No progress in resolving this in code, so here is a documentation patch.

Robin



p5pRT commented Mar 31, 2008

From Robin.Barker@npl.co.uk

perlre.patch
--- pod/perlre.pod.orig	2008-01-30 20:41:06.000000000 +0000
+++ pod/perlre.pod
@@ -375,8 +375,8 @@
     digit       IsDigit        \d
     graph       IsGraph
     lower       IsLower
-    print       IsPrint
-    punct       IsPunct
+    print       IsPrint		(but see 2. below)
+    punct       IsPunct		(but see 3. below)
     space       IsSpace
                 IsSpacePerl    \s
     upper       IsUpper
@@ -385,6 +385,41 @@
 
 For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent.
 
+However, the equivalence between C<[[:xxxxx:]]> and C<\p{Xxxxx}> is not exact.
+
+=over 4
+
+=item 1.
+
+C<[[:xxxxx:]]> only matches characters in the range 0x00-0x7F.
+
+=item 2.
+
+C<\p{IsPrint}> matches characters 0x09-0x0d but C<[[:print:]]> does not.
+
+=item 3.
+
+C<[[:punct:]]> matches the following but C<\p{IsPunct}> does not,
+because they are classed as symbols in Unicode.
+
+=over 4
+
+=item C<$>
+
+Currency symbol
+
+=item C<+> C<< < >> C<=> C<< > >> C<|> C<~>
+
+Mathematical symbols
+
+=item C<^> C<`>
+
+Modifier symbols (accents)
+
+=back
+
+=back
+
 If the C<utf8> pragma is not used but the C<locale> pragma is, the
 classes correlate with the usual isalpha(3) interface (except for
 "word" and "blank").


p5pRT commented Apr 7, 2008

From david@landgren.net

Robin Barker wrote:

No progress in resolving this in code, so here is a documentation patch.

Robin

--- pod/perlre.pod.orig 2008-01-30 20:41:06.000000000 +0000
+++ pod/perlre.pod
@@ -375,8 +375,8 @@
  digit IsDigit \d
  graph IsGraph
  lower IsLower
- print IsPrint
- punct IsPunct
+ print IsPrint (but see 2. below)
+ punct IsPunct (but see 3. below)
  space IsSpace
  IsSpacePerl \s
  upper IsUpper
@@ -385,6 +385,41 @@

For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent.

+However, the equivalence between C<[[:xxxxx:]]> and C<\p{Xxxxx}> is not exact.
+
+=over 4
+
+=item 1.
+
+C<[[:xxxxx:]]> only matches characters in the range 0x00-0x7F.
+
+=item 2.
+
+C<\p{IsPrint}> matches characters 0x09-0x0d but C<[[:print:]]> does not.
+
+=item 3.
+
+C<[[:punct:]]> matches the following but C<\p{IsPunct}> does not,
+because they are classed as symbols in Unicode.

s/classed/classified/

considered?


p5pRT commented Apr 7, 2008

The RT System itself - Status changed from 'new' to 'open'


p5pRT commented Apr 13, 2008

From @Juerd

Robin Barker wrote, 2008-03-31 21:42 (+0100):

+C<[[:xxxxx:]]> only matches characters in the range 0x00-0x7F.

Not always true.

  juerd@lanova:~$ perl -CO -e'printf "[%s]\n", "foo\x{123}" =~ /([[:print:]]+)/'
  [fooģ]

See also Unicode::Semantics on CPAN.


p5pRT commented Apr 25, 2008

From Robin.Barker@npl.co.uk

I've taken into account the comments about the last patch,
and increased my understanding of what is going on, and
present a revised documentation patch, with a patch to
t/op/pat.t to capture the current behaviour in a test.

Robin



p5pRT commented Apr 25, 2008

From Robin.Barker@npl.co.uk

perlre-3.patch
diff -ur ../perl-current/pod/perlre.pod ./pod/perlre.pod
--- ../perl-current/pod/perlre.pod	2008-01-30 20:41:06.000000000 +0000
+++ ./pod/perlre.pod
@@ -375,20 +375,60 @@
     digit       IsDigit        \d
     graph       IsGraph
     lower       IsLower
-    print       IsPrint
-    punct       IsPunct
+    print       IsPrint		(but see [2] below)
+    punct       IsPunct		(but see [3] below)
     space       IsSpace
                 IsSpacePerl    \s
     upper       IsUpper
-    word        IsWord
+    word        IsWord         \w
     xdigit      IsXDigit
 
 For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent.
 
+However, the equivalence between C<[[:xxxxx:]]> and C<\p{IsXxxxx}>
+is not exact.
+
+=over 4
+
+=item [1]
+
 If the C<utf8> pragma is not used but the C<locale> pragma is, the
 classes correlate with the usual isalpha(3) interface (except for
 "word" and "blank").
 
+But if the C<locale> or C<encoding> pragmas are not used and
+the string is not C<utf8>, then C<[[:xxxxx:]]> (and C<\w>, etc.)
+will not match characters 0x80-0xff; whereas C<\p{IsXxxxx}> will
+force the string to C<utf8> and can match these characters
+(as Unicode).
+
+=item [2]
+
+C<\p{IsPrint}> matches characters 0x09-0x0d but C<[[:print:]]> does not.
+
+=item [3]
+
+C<[[:punct:]]> matches the following but C<\p{IsPunct}> does not,
+because they are classed as symbols (not punctuation) in Unicode.
+
+=over 4
+
+=item C<$>
+
+Currency symbol
+
+=item C<+> C<< < >> C<=> C<< > >> C<|> C<~>
+
+Mathematical symbols
+
+=item C<^> C<`>
+
+Modifier symbols (accents)
+
+=back
+
+=back
+
 The other named classes are:
 
 =over 4
diff -ur ../perl-current/t/op/pat.t ./t/op/pat.t
--- ../perl-current/t/op/pat.t	2008-04-15 13:46:40.000000000 +0100
+++ ./t/op/pat.t
@@ -4604,6 +4604,32 @@
     iseq($te[0], '../');
 }
 
+SKIP: {
+    if (ord("A") == 193) { skip("Assumes ASCII", 4) }
+
+    my @notIsPunct = grep {/[[:punct:]]/ and not /\p{IsPunct}/}
+			map {chr} 0x20..0x7f;
+    iseq( join('', @notIsPunct), '$+<=>^`|~',
+	'[:punct:] disagrees with IsPunct on Symbols');
+
+    my @isPrint = grep {not/[[:print:]]/ and /\p{IsPrint}/}
+			map {chr} 0..0x1f, 0x7f..0x9f;
+    iseq( join('', @isPrint), "\x09\x0a\x0b\x0c\x0d\x85",
+	'IsPrint disagrees with [:print:] on control characters');
+
+    my @isPunct = grep {/[[:punct:]]/ != /\p{IsPunct}/}
+			map {chr} 0x80..0xff;
+    iseq( join('', @isPunct), "\xa1\xab\xb7\xbb\xbf",		# ¡ « · » ¿
+	'IsPunct disagrees with [:punct:] outside ASCII');
+
+    my @isPunctLatin1 = eval q{
+	use encoding 'latin1';
+	grep {/[[:punct:]]/ != /\p{IsPunct}/} map {chr} 0x80..0xff;
+    };
+    if( $@ ){ skip( $@, 1); }
+    iseq( join('', @isPunctLatin1), '', 
+	'IsPunct agrees with [:punct:] with explicit Latin1');
+} 
 
 
 # Test counter is at bottom of file. Put new tests above here.
@@ -4667,7 +4693,7 @@
 
 # Don't forget to update this!
 BEGIN {
-    $::TestCount = 4031;
+    $::TestCount = 4035;
     print "1..$::TestCount\n";
 }
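[Editorial note: the five Latin-1 codepoints the test above flags can be checked against the Unicode character database. This small Python check, added for illustration (the patch itself is Perl), confirms they are genuine punctuation:]

```python
import unicodedata

# \xa1 \xab \xb7 \xbb \xbf all carry a Punctuation (P*) general category,
# so \p{IsPunct} matches them while the ASCII-only [[:punct:]] cannot.
for ch in "\xa1\xab\xb7\xbb\xbf":
    print("U+%04X" % ord(ch), unicodedata.category(ch))  # Po Pi Po Pf Po
```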
 


p5pRT commented Apr 26, 2008

From @rgs

2008/4/25 Robin Barker <Robin.Barker@npl.co.uk>:

I've taken into account the comments about the last patch,
and increased my understanding of what is going on, and
present a revised documentation patch, with a patch to
t/op/pat.t to capture the current behaviour in a test.

Thanks, applied as #33752


p5pRT commented Apr 26, 2008

@rgs - Status changed from 'open' to 'resolved'


p5pRT commented Oct 26, 2008

From @khwilliamson

This is a bug report for perl from public@khwilliamson.com,
generated with the help of perlbug 1.39 running under perl 5.11.0.


use utf8;

print '©' =~ /[[:graph:]]/, "\n";
print '©' =~ /[[:^graph:]]/, "\n";

both print 1. This happens for various posix classes, and various
utf8 characters. Also for perl 5.8
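[Editorial note: an added Python check, not part of the report. U+00A9 is a graphic character in Unicode, so it should match [[:graph:]] and must not also match the complement:]

```python
import unicodedata

ch = "\xa9"  # © COPYRIGHT SIGN
print(unicodedata.category(ch))  # So: Symbol, other (a graphic character)
print(ch.isprintable())          # True
```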



Flags:
  category=core
  severity=medium


This perlbug was built using Perl 5.11.0 - Wed Oct 22 19:16:44 MDT 2008
It is being executed now by Perl 5.11.0 - Fri Oct 24 11:08:58 MDT 2008.

Site configuration information for perl 5.11.0:

Configured by khw at Fri Oct 24 11:08:58 MDT 2008.

Summary of my perl5 (revision 5 version 11 subversion 0 patch 34566)
configuration:
  Platform:
  osname=linux, osvers=2.6.24-21-generic, archname=i686-linux
  uname='linux karl 2.6.24-21-generic #1 smp mon aug 25 17:32:09 utc
2008 i686 gnulinux '
  config_args='-d -Dmksymlinks -Dprefix=/home/khw/myperl -Dusedevel
-DDEBUGGING=both'
  hint=previous, useposix=true, d_sigaction=define
  useithreads=undef, usemultiplicity=undef
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=undef, use64bitall=undef, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler:
  cc='cc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe
-fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64',
  optimize='-O2 -g',
  cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector
-I/usr/local/include -DDEBUGGING -fno-strict-aliasing -pipe
-fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64'
  ccversion='', gccversion='4.2.3 (Ubuntu 4.2.3-2ubuntu7)',
gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries:
  ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /lib /usr/lib
  libs=-lnsl -ldl -lm -lcrypt -lutil -lc
  perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
  libc=/lib/libc-2.7.so, so=so, useshrplib=false, libperl=libperl.a
  gnulibc_version='2.7'
  Dynamic Linking:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib
-fstack-protector'

Locally applied patches:
  DEVEL


@INC for perl 5.11.0:
  /home/khw/myperl/lib/5.11.0/i686-linux
  /home/khw/myperl/lib/5.11.0
  /home/khw/myperl/lib/site_perl/5.11.0/i686-linux
  /home/khw/myperl/lib/site_perl/5.11.0
  .


Environment for perl 5.11.0:
  HOME=/home/khw
  LANG=en_US.UTF-8
  LANGUAGE (unset)
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)

PATH=/home/khw/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/home/khw/cxoffice/bin
  PERL_BADLANG (unset)
  SHELL=/bin/ksh


p5pRT commented Oct 30, 2008

From @khwilliamson

Tom Christiansen wrote:

In-Reply-To: Message from karl williamson <public@khwilliamson.com>
of "Wed, 29 Oct 2008 10:21:33 MDT." <49088D8D.3050202@khwilliamson.com>

Tom Christiansen wrote:

PRESCRIPT: Karl, my program at the end below should be good for
sniffing out \p and \P mutual-exclusion failure bugs
such as I believe you recently reported.

Thanks for the program. My bug reports have mostly come from
running something similar, but not with nearly as many of the
classes as you did.

I'm guessing you didn't run it past 127, because it fails immediately
with a Malformed UTF-8 character (fatal) at (eval 3) line 2.

Yes, that's because it intentionally assumed ASCII only. The amended
program and results follow. You should now be able to use this program
directly for your testing.

I actually don't think there are any issues with the \p and \P,
because those go out and use auto-constructed files. The
problems I have found have been in the posix classes and
entirely in the 128-255 range.

It's not just there. It's a bug in negating of POSIX char classes.
I can elicit errors at codepoints <128, even with Unicode semantics on.

Specifically, just running POSIX charclass tests alone:

REPORT: ranging from U+00 (0) .. U+00007F (127) [128 codepoints]
failed 28 tests, 3300 out of 3328 tests successful (0.991587%)

The problems are with the [:print:] and [:punct:] properties:

Trouble w/U+007E: TILDE: Property "[:punct:]" failed
Trouble w/U+007C: VERTICAL LINE: Property "[:punct:]" failed
Trouble w/U+0060: GRAVE ACCENT: Property "[:punct:]" failed
Trouble w/U+005E: CIRCUMFLEX ACCENT: Property "[:punct:]" failed
Trouble w/U+003E: GREATER-THAN SIGN: Property "[:punct:]" failed
Trouble w/U+003D: EQUALS SIGN: Property "[:punct:]" failed
Trouble w/U+003C: LESS-THAN SIGN: Property "[:punct:]" failed
Trouble w/U+002B: PLUS SIGN: Property "[:punct:]" failed
Trouble w/U+0024: DOLLAR SIGN: Property "[:punct:]" failed
Trouble w/U+000D: CARRIAGE RETURN (CR): Property "[:print:]" failed
Trouble w/U+000C: FORM FEED (FF): Property "[:print:]" failed
Trouble w/U+000B: LINE TABULATION: Property "[:print:]" failed
Trouble w/U+000A: LINE FEED (LF): Property "[:print:]" failed
Trouble w/U+0009: CHARACTER TABULATION: Property "[:print:]" failed

The positive char class test is like this:

(
( $U_char =~ /\A[[:print:]]\z/ )
==
( $U_char !~ /\A[[:^print:]]\z/ )
)
&&
(
( $U_char =~ /\A[[:print:]]\z/ )
==
( $U_char !~ /\A[^[:print:]]\z/ )
)

The negative char class test is this:

( $U_char =~ /\A[[:^punct:]]\z/ )
==
( $U_char =~ /\A[^[:punct:]]\z/ )

...

I wondered why you were getting errors when my own (which I thought were
rather extensive) test cases were getting none in the ASCII range. It
turns out that it's because I hadn't thought to try the negation ^ in
the outer set of braces. Also, these problems don't occur when the
characters aren't utf8 (when they are packed as C instead of U). And
they fail in exactly the places that the posix classes don't match the
Unicode ones. It is documented that [:punct:] includes the precise set
of symbols, which have failures, that \p{IsPunct} does not. Similarly for
[:print:] and the ones it has failures for.

I haven't looked at the code, but likely what is going on is that
the complement of a complement loses these differences from Unicode
(only when the utf8 flag is on, because otherwise the Unicode classes
aren't looked at).

My goal is to fix all these problems in the 128-255 range. It's turning
out to be more work than I thought, essentially because there are a lot
of pre-existing errors and inconsistencies.


p5pRT commented Nov 3, 2008

From @tonycoz

On Sun, Oct 26, 2008 at 03:32:53PM -0700, karl williamson wrote:

-----------------------------------------------------------------
use utf8;

print '©' =~ /[[:graph:]]/, "\n";
print '©' =~ /[[:^graph:]]/, "\n";

both print 1. This happens for various posix classes, and various
utf8 characters. Also for perl 5.8
-----------------------------------------------------------------

This problem also occurs for \s vs \S inside a character class:

sh-3.1$ /perl/blead34613/bin/perl5.11.0 -le '$x = chr(0xa0).chr(0x100); chop $x; print $x =~ /[\s]/; print $x =~ /[\S]/;'
1
1
sh-3.1$

The problem is that the character classes are compiled to a
combination of a bitmap and a Unicode property check, eg:

[\s] becomes ANYOF[\11\12\14\15 +utf8::IsSpacePerl]
[\S] becomes ANYOF[\0-\10\13\16-\37!-\377!utf8::IsSpacePerl]

and similarly for [[:foo:]]:

[[:graph:]] becomes ANYOF[!-~+utf8::IsGraph]

[[:^graph:]] becomes ANYOF[\0- \177-\377!utf8::IsGraph]

and for UTF scalars both the bitmap and the Unicode property are
checked (S_reginclass in regexec.c).

The bitmap is generated using isSPACE, isGRAPH, etc, which without a
locale return false for characters \200-\377.

I suspect a solution is going to involve not generating the bitmap for
named classes in character classes and using ANYOF_SPACE,
ANYOF_GRAPH etc in the character class node.

But I don't understand enough of the regular expression engine to want
to delve that deeply.

Tony
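[Editorial note: Tony's bitmap-plus-property description can be modelled in a few lines. The following toy sketch, a Python illustration with simplified stand-ins for isSPACE and utf8::IsSpacePerl rather than the actual regexec.c logic, shows how a bitmap built with non-UTF-8 semantics lets U+00A0 satisfy both [\s] and [\S]:]

```python
# Toy model: a compiled ANYOF node is a 0-255 bitmap built with non-UTF-8
# (C locale) semantics, plus a Unicode property consulted for UTF-8 strings.
# The bitmap for the negated class is the complement of the *ASCII* space
# bitmap, so NO-BREAK SPACE (U+00A0) hits the [\S] bitmap too.

def is_space_ascii(c):
    # stand-in for what isSPACE() returns without a locale
    return chr(c) in "\t\n\x0c\r "

def is_space_unicode(ch):
    # stand-in for utf8::IsSpacePerl
    return ch.isspace()

def match_s(ch, utf8):  # models ANYOF[\11\12\14\15 +utf8::IsSpacePerl]
    if ord(ch) < 256 and is_space_ascii(ord(ch)):
        return True
    return utf8 and is_space_unicode(ch)

def match_S(ch, utf8):  # models ANYOF[... !utf8::IsSpacePerl]
    if ord(ch) < 256 and not is_space_ascii(ord(ch)):
        return True  # bitmap hit: the property is never consulted
    return utf8 and not is_space_unicode(ch)

nbsp = "\xa0"
print(match_s(nbsp, utf8=True), match_S(nbsp, utf8=True))  # True True
```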


p5pRT commented Nov 3, 2008

The RT System itself - Status changed from 'new' to 'open'


p5pRT commented Nov 4, 2008

From @demerphq


I expected the attached patch to work out. Unfortunately it doesn't. I
get the following failures when I do a

make test-reonly

Test Summary Report

op/pat (Wstat: 0 Tests: 4035 Failed: 5)
  Failed tests: 1871, 1901-1904
op/regexp (Wstat: 0 Tests: 1359 Failed: 4)
  Failed tests: 603, 608, 611, 1310
op/regexp_noamp (Wstat: 0 Tests: 1359 Failed: 4)
  Failed tests: 603, 608, 611, 1310
op/regexp_notrie (Wstat: 0 Tests: 1359 Failed: 4)
  Failed tests: 603, 608, 611, 1310
op/regexp_qr (Wstat: 0 Tests: 1359 Failed: 4)
  Failed tests: 603, 608, 611, 1310
op/regexp_qr_embed (Wstat: 0 Tests: 1359 Failed: 4)
  Failed tests: 603, 608, 611, 1310
op/regexp_trielist (Wstat: 0 Tests: 1359 Failed: 4)
  Failed tests: 603, 608, 611, 1310
Files=29, Tests=21383, 59 wallclock secs ( 8.70 usr 0.22 sys + 49.73
cusr 0.76 csys = 59.41 CPU)
Result: FAIL
Failed 7/29 test programs. 29/21383 subtests failed.

Which seems quite odd, like the swash logic is failing somehow, in
ways that I personally find surprising.

The issue here is that the bitmaps are constructed for non-utf8, and
semantics differ for utf8 and non-utf8 in the codepoint range
128-255. The logic I've added is to only use the bitmaps for codepoints
0..127 when do_utf8, and then rely on the unicode charclass "swash"
logic to handle the rest, but this fails. I haven't had sufficient
time, nor access to a debugger I know well enough*, to figure out why.

However I'm still poking.

Yves
* I'd love to attend a 'gdb for perl core hackers' course, should
someone put one together for the next YAPC::EU or YAPC Europe Perl
Workshop. *HINT* HINT. :-)

--
perl -Mre=debug -e "/just|another|perl|hacker/"
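[Editorial note: the restriction described above (trust the bitmap only for codepoints 0..127 when matching a UTF-8 string, and let the Unicode "swash" decide 128-255) can be sketched in miniature. This Python toy model, an editorial illustration with simplified stand-ins rather than the actual patch logic, shows it restoring mutual exclusion of [\s]/[\S] for U+00A0:]

```python
# Toy model of the proposed fix: for UTF-8 strings, consult the
# latin1-built bitmap only below 128; above that, use only the
# Unicode property (the "swash"), for both the class and its negation.

def is_space_ascii(c):
    # stand-in for isSPACE() without a locale
    return chr(c) in "\t\n\x0c\r "

def is_space_unicode(ch):
    # stand-in for utf8::IsSpacePerl
    return ch.isspace()

def match_s_fixed(ch, utf8):
    c = ord(ch)
    limit = 128 if utf8 else 256
    if c < limit and is_space_ascii(c):
        return True
    return utf8 and is_space_unicode(ch)

def match_S_fixed(ch, utf8):
    c = ord(ch)
    limit = 128 if utf8 else 256
    if c < limit and not is_space_ascii(c):
        return True
    return utf8 and not is_space_unicode(ch)

nbsp = "\xa0"
# now exactly one of [\s]/[\S] matches NBSP in a UTF-8 string
print(match_s_fixed(nbsp, True), match_S_fixed(nbsp, True))  # True False
```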


p5pRT commented Nov 4, 2008

From @demerphq

not_quite_working.patch
diff --git a/regexec.c b/regexec.c
index 16b1495..0d1c608 100644
--- a/regexec.c
+++ b/regexec.c
@@ -5751,8 +5751,11 @@ S_reginclass(pTHX_ const regexp *prog, register const regnode *n, register const
         if (lenp)
 	    *lenp = 0;
 	if (do_utf8 && !ANYOF_RUNTIME(n)) {
-	    if (len != (STRLEN)-1 && c < 256 && ANYOF_BITMAP_TEST(n, c))
+	    /* XXX: this can't be 256 as codepoints 128-256 have different semantics in utf8 and otherwise,
+	       and the bitmaps are built with non-utf8 in mind. See related comment below. */ 
+	    if (c < 128 && len != (STRLEN)-1 && ANYOF_BITMAP_TEST(n, c))
 		match = TRUE;
+	    
 	}
 	if (!match && do_utf8 && (flags & ANYOF_UNICODE_ALL) && c >= 256)
 	    match = TRUE;
@@ -5792,22 +5795,30 @@ S_reginclass(pTHX_ const regexp *prog, register const regnode *n, register const
 	if (match && lenp && *lenp == 0)
 	    *lenp = UNISKIP(NATIVE_TO_UNI(c));
     }
-    if (!match && c < 256) {
-	if (ANYOF_BITMAP_TEST(n, c))
-	    match = TRUE;
-	else if (flags & ANYOF_FOLD) {
-	    U8 f;
+    if (!match && c < 256 ) {
+	if (!do_utf8 || c < 128) {
+	    /* XXX:
+	       Codepoints 128-256 have different semantics in utf8  and otherwise, and the bitmaps
+	       are built with non-utf8 in mind.
+	       BUT, it would be nice if this conditional was simpler, as its in a "hot" codepath
+	       dmq. See above related comment.    
+	    */ 
+
+	    if (ANYOF_BITMAP_TEST(n, c))
+		match = TRUE;
+	    else if (flags & ANYOF_FOLD) {
+		U8 f;
 
-	    if (flags & ANYOF_LOCALE) {
-		PL_reg_flags |= RF_tainted;
-		f = PL_fold_locale[c];
+		if (flags & ANYOF_LOCALE) {
+		    PL_reg_flags |= RF_tainted;
+		    f = PL_fold_locale[c];
+		}
+		else
+		    f = PL_fold[c];
+		if (f != c && ANYOF_BITMAP_TEST(n, f))
+		    match = TRUE;
 	    }
-	    else
-		f = PL_fold[c];
-	    if (f != c && ANYOF_BITMAP_TEST(n, f))
-		match = TRUE;
 	}
-	
 	if (!match && (flags & ANYOF_CLASS)) {
 	    PL_reg_flags |= RF_tainted;
 	    if (
diff --git a/t/op/regexp.t b/t/op/regexp.t
index 147e4cc..f64dbb7 100755
--- a/t/op/regexp.t
+++ b/t/op/regexp.t
@@ -191,7 +191,7 @@ EOFCODE
 		else { # better diagnostics
 		    my $s = Data::Dumper->new([$subject],['subject'])->Useqq(1)->Dump;
 		    my $g = Data::Dumper->new([$got],['got'])->Useqq(1)->Dump;
-		    print "not ok $test ($study) $input => `$got', match=$match\n$s\n$g\n$code\n";
+		    print "not ok $test ($study) $input => '$got', match=$match\n$s\n$g\n$code\n";
 		}
 		next TEST;
 	    }


p5pRT commented Nov 6, 2008

From @demerphq

2008/10/30 karl williamson <public@khwilliamson.com>:

Tom Christiansen wrote​:

In-Reply-To​: Message from karl williamson <public@​khwilliamson.com> of
"Wed, 29 Oct 2008 10​:21​:33 MDT." <49088D8D.3050202@​khwilliamson.com>

Tom Christiansen wrote​:

PRESCRIPT​: Karl, my program at the end below should be good for
sniffing out \p and \P mutual-exclusion failure bugs such as I
believe you recently reported.

Thanks for the program. My bug reports have mostly come from
running something similar, but not with nearly as many of the
classes as you did.

I'm guessing you didn't run it past 127, because it fails immediately
with a Malformed UTF-8 character (fatal) at (eval 3) line 2.

Yes, that's because it intentionally assumed ASCII only. The amended
program and results follow. You should now be to use this program
directly for your testing.

I actually don't think there are any issues with the \p and \P,
because those go out and use auto-constructed files. The
problems I have found have been in the posix classes and
entirely in the 128-255 range.

It's not just there. It's a bug in negating of POSIX char classes.
I can elicit errors at codepoints <128, even with Unicode semantics on.

Specifically, just running POSIX charclass tests alone:

REPORT​: ranging from U+00 (0) .. U+00007F (127) [128 codepoints]
failed 28 tests, 3300 out of 3328 tests successful (0.991587%)

The problems are with the [​:print​:] and [​:punct​:] properties​:

Trouble w/U+007E​: TILDE​: Property "[​:punct​:]" failed
Trouble w/U+007C​: VERTICAL LINE​: Property "[​:punct​:]" failed
Trouble w/U+0060​: GRAVE ACCENT​: Property "[​:punct​:]" failed
Trouble w/U+005E​: CIRCUMFLEX ACCENT​: Property "[​:punct​:]" failed
Trouble w/U+003E​: GREATER-THAN SIGN​: Property "[​:punct​:]" failed
Trouble w/U+003D​: EQUALS SIGN​: Property "[​:punct​:]" failed
Trouble w/U+003C​: LESS-THAN SIGN​: Property "[​:punct​:]" failed
Trouble w/U+002B​: PLUS SIGN​: Property "[​:punct​:]" failed
Trouble w/U+0024​: DOLLAR SIGN​: Property "[​:punct​:]" failed
Trouble w/U+000D​: CARRIAGE RETURN (CR)​: Property "[​:print​:]" failed
Trouble w/U+000C​: FORM FEED (FF)​: Property "[​:print​:]" failed
Trouble w/U+000B​: LINE TABULATION​: Property "[​:print​:]" failed
Trouble w/U+000A​: LINE FEED (LF)​: Property "[​:print​:]" failed
Trouble w/U+0009​: CHARACTER TABULATION​: Property "[​:print​:]" failed

The positive char class test is like this​:

(
( $U_char =~ /\A[[​:print​:]]\z/ )
==
( $U_char !~ /\A[[​:^print​:]]\z/ )
)
&&
(
( $U_char =~ /\A[[​:print​:]]\z/ )
==
( $U_char !~ /\A[^[​:print​:]]\z/ )
)

The negative char class test is this​:

( $U_char =~ /\A[[​:^punct​:]]\z/ )
==
( $U_char =~ /\A[^[​:punct​:]]\z/ )

...

I wondered why you were getting errors when my own (which I thought were
rather extensive) test cases were getting none in the ASCII range. It turns
out that it's because I hadn't thought to try the negation ^ in the outer
set of braces. Also, these problems don't occur when the characters aren't
utf8 (when they are packed as C instead of U). And they fail in exactly the
places that the posix classes don't match the unicode ones. It is
documented that [​:graph​:] includes the precise set of symbols that have
failures that \p{IsGraph} does not. Similarly for [​:print​:] and the ones it
has failures for.

I haven't looked at the code, but likely what is going on is that the
complement of a complement loses these differences from Unicode (only when
the utf8 flag is on, because otherwise the Unicode classes aren't looked at).

My goal is to fix all these problems in the 128-255 range. It's turning out
to be more work than I thought, essentially because there are a lot of
pre-existing errors and inconsistencies.

Just for the record in the bug ticket, I used the attached file to
determine that the affected patterns and strings are as follows​:

/[\w][\W]/
  matches unicode codepoints​: 170 178..179 181 185..186 188..190
192..214 216..246 248

/[\s][\S]/
  matches unicode codepoints​: 133 160

/[[​:alnum​:]][[​:^alnum​:]]/
  matches unicode codepoints​: 170 181 186 192..214 216..246 248

/[[​:alpha​:]][[​:^alpha​:]]/
  matches unicode codepoints​: 170 181 186 192..214 216..246 248

/[[​:cntrl​:]][[​:^cntrl​:]]/
  matches unicode codepoints​: 128..159 173

/[[​:graph​:]][[​:^graph​:]]/
  matches unicode codepoints​: 161

/[[​:lower​:]][[​:^lower​:]]/
  matches unicode codepoints​: 170 181 186 223..246 248

/[[​:print​:]][[​:^print​:]]/
  matches unicode codepoints​: 133 160

/[[​:punct​:]][[​:^punct​:]]/
  matches unicode codepoints​: 36 43 60..62 94 96 124 126 161 171 183 187 191

/[[​:upper​:]][[​:^upper​:]]/
  matches unicode codepoints​: 192..214 216

/[[​:space​:]][[​:^space​:]]/
  matches unicode codepoints​: 133 160

/[[​:blank​:]][[​:^blank​:]]/
  matches unicode codepoints​: 160
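
The double-class probe behind these lists can be reproduced with a short
script along these lines (a sketch; pack "U" forces the string into the
UTF-8 representation, and which codepoints get reported depends on the
perl version):

```perl
use strict;
use warnings;

# For each POSIX class, find codepoints 0..255 whose UTF-8-encoded
# character matches both the class and its complement.
for my $class (qw(alnum alpha cntrl graph lower print punct upper space blank)) {
    my @both;
    for my $cp (0 .. 255) {
        my $str = pack("U", $cp) x 2;   # the same character twice, upgraded
        push @both, $cp if $str =~ /[[:$class:]][[:^$class:]]/;
    }
    print "[:$class:] matches both at: @both\n" if @both;
}
```

On a perl where the complement logic is consistent this should print
nothing; on the perls discussed here it lists the codepoints shown above.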

That's a lot of characters. Sadly.

I'm trying to figure out a workaround that doesn't make non-Unicode
matching slower while also not requiring a complete rewrite of how
character classes are stored. The problem is that as a speed
optimisation we expand these to their ASCII bitmaps at compile time
(when not under use locale), which means we merge them with whatever
non-special chars are in the charclass. We could skip that expansion,
but it would slow down non-Unicode matching a lot. Or I guess we could
change the charclass representation (essentially doubling it) and
build both a Unicode and a non-Unicode equivalent. That would double
the size of a charclass, and be a reasonable amount of work to fix.

There are some "out of band" options though. We could implement the
"match semantics flags" that have been discussed in the past, leave
the current implementation as-is when the flags are omitted, and
document the problem. Anybody who cares would have to use the
appropriate modifier. We could do a more extreme version of this and
just say that ASCII semantics apply for these items only, and remove
the Unicode char-class property logic when they are used, forcing
people to use explicit Unicode semantics. (Or vice versa, but I don't
think THAT is a good idea.)

The bottom line is that regex metapatterns whose semantics change
under Unicode and otherwise are evil. They have led to all kinds of
trouble at all kinds of levels.

I think fixing this properly requires the pumpking / Larry to make a
policy decision: do we stick with the weird bifurcated behaviour
depending on string representation, or not?

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"


p5pRT commented Nov 6, 2008

From @demerphq

test_all_cc.pl


p5pRT commented Nov 7, 2008

From @khwilliamson

demerphq wrote​:

2008/10/30 karl williamson <public@​khwilliamson.com>​:

Tom Christiansen wrote​:

In-Reply-To​: Message from karl williamson <public@​khwilliamson.com> of
"Wed, 29 Oct 2008 10​:21​:33 MDT." <49088D8D.3050202@​khwilliamson.com>

Tom Christiansen wrote​:

PRESCRIPT​: Karl, my program at the end below should be good for
sniffing out \p and \P mutual-exclusion failure bugs such as I
believe you recently reported.

Thanks for the program. My bug reports have mostly come from
running something similar, but not with nearly as many of the
classes as you did.
I'm guessing you didn't run it past 127, because it fails immediately
with a Malformed UTF-8 character (fatal) at (eval 3) line 2.
Yes, that's because it intentionally assumed ASCII only. The amended
program and results follow. You should now be to use this program
directly for your testing.

I actually don't think there are any issues with the \p and \P,
because those go out and use auto-constructed files. The
problems I have found have been in the posix classes and
entirely in the 128-255 range.
It's not just there. It's a bug in negating of POSIX char classes.
I can elicit errors at codepoints <128, even with Unicode semantics on.

Specifically, just runniing POSIX charclass tests alone​:

REPORT​: ranging from U+00 (0) .. U+00007F (127) [128 codepoints]
failed 28 tests, 3300 out of 3328 tests successful (0.991587%)

The problems are with the [​:print​:] and [​:punct​:] properties​:

Trouble w/U+007E​: TILDE​: Property "[​:punct​:]" failed
Trouble w/U+007C​: VERTICAL LINE​: Property "[​:punct​:]" failed
Trouble w/U+0060​: GRAVE ACCENT​: Property "[​:punct​:]" failed
Trouble w/U+005E​: CIRCUMFLEX ACCENT​: Property "[​:punct​:]" failed
Trouble w/U+003E​: GREATER-THAN SIGN​: Property "[​:punct​:]" failed
Trouble w/U+003D​: EQUALS SIGN​: Property "[​:punct​:]" failed
Trouble w/U+003C​: LESS-THAN SIGN​: Property "[​:punct​:]" failed
Trouble w/U+002B​: PLUS SIGN​: Property "[​:punct​:]" failed
Trouble w/U+0024​: DOLLAR SIGN​: Property "[​:punct​:]" failed
Trouble w/U+000D​: CARRIAGE RETURN (CR)​: Property "[​:print​:]" failed
Trouble w/U+000C​: FORM FEED (FF)​: Property "[​:print​:]" failed
Trouble w/U+000B​: LINE TABULATION​: Property "[​:print​:]" failed
Trouble w/U+000A​: LINE FEED (LF)​: Property "[​:print​:]" failed
Trouble w/U+0009​: CHARACTER TABULATION​: Property "[​:print​:]" failed

The positive char class test is like this​:

(
( $U_char =~ /\A[[​:print​:]]\z/ )
==
( $U_char !~ /\A[[​:^print​:]]\z/ )
)
&&
(
( $U_char =~ /\A[[​:print​:]]\z/ )
==
( $U_char !~ /\A[^[​:print​:]]\z/ )
)

The negative char class test is this​:

( $U_char =~ /\A[[​:^punct​:]]\z/ )
==
( $U_char =~ /\A[^[​:punct​:]]\z/ )

...
I wondered why you were getting errors when my own (which I thought were
rather extensive) test cases were getting none in the ASCII range. It turns
out that it's because I hadn't thought to try the negation ^ in the outer
set of braces. Also, these problems don't occur when the characters aren't
utf8 (when they are packed as C instead of U). And they fail in exactly the
places that the posix classes don't match the unicode ones. It is
documented that [​:graph​:] includes the precise set of symbols that have
failures that \p{IsGraph} does not. Similarly for [​:print​:] and the ones it
has failures for.

I haven't looked at the code, but what likely what is going on is that the
complement of a complement loses these differences from unicode (only when
the utf8 flag is on, because otherwise the unicode classes aren't looked at)

My goal is to fix all these problems in the 128-255 range. It's turning out
to be more work than I thought, essentially because there are a lot of
pre-existing errors and inconsistencies.

Just for the record in the bug ticket, I used the attached file to
determine that the affected patterns and strings are as follows​:

/[\w][\W]/
matches unicode codepoints​: 170 178..179 181 185..186 188..190
192..214 216..246 248

/[\s][\S]/
matches unicode codepoints​: 133 160

/[[​:alnum​:]][[​:^alnum​:]]/
matches unicode codepoints​: 170 181 186 192..214 216..246 248

/[[​:alpha​:]][[​:^alpha​:]]/
matches unicode codepoints​: 170 181 186 192..214 216..246 248

/[[​:cntrl​:]][[​:^cntrl​:]]/
matches unicode codepoints​: 128..159 173

/[[​:graph​:]][[​:^graph​:]]/
matches unicode codepoints​: 161

/[[​:lower​:]][[​:^lower​:]]/
matches unicode codepoints​: 170 181 186 223..246 248

/[[​:print​:]][[​:^print​:]]/
matches unicode codepoints​: 133 160

/[[​:punct​:]][[​:^punct​:]]/
matches unicode codepoints​: 36 43 60..62 94 96 124 126 161 171 183 187 191

/[[​:upper​:]][[​:^upper​:]]/
matches unicode codepoints​: 192..214 216

/[[​:space​:]][[​:^space​:]]/
matches unicode codepoints​: 133 160

/[[​:blank​:]][[​:^blank​:]]/
matches unicode codepoints​: 160

Thats a lot of characters. Sadly.

Im trying to figure out a work around that doesnt make non-unicode
slower while at the same time not involving a complete rewrite of how
character classes are stored. The problem is that as a speed
optimisation we expand these to their ascii bitmaps at compile time
(when not under use locale), which means we merge them with whatever
non special chars are in the charclass. We could not do that, but it
would slow down non-uncode a lot. Or i guess we could change the
charclass representation (essentially doubling it) and build both a
unciode and non-uncode equivalent. That would double the size of a
charclass, and be a reasonable amount of work to fix.

There are some "out of band" options tho. We chould implement the
"match semantics flags" as has been discussed in the past, and leave
the current implementation as is when the flags are omitted. And note
the problem. Anybody that cares would have to use the apropriate
modifier. We could do a more extreme version of this and just say that
ascii semantics apply for these items only, and remove the unicode
char class property logic when they are used, forcing people to use
explicit unicode semantics. (Or vice versa, but I dont think THAT is a
good idea.)

The bottom line is that regex metapatterns whose semantics change
under unicode and otherwise are evil. They have lead to all kinds of
trouble at all kinds of levels.

I think fixing this properly requires the pumpking / larry to make a
policy decision. Do we stick with the weird bifurcated behaviour
depending on string representation or not.

Yves

I don't know if this casts any light on the issue or not, but the
attached simple program shows what may be another type of failure.


p5pRT commented Nov 7, 2008

From @khwilliamson

$v = "\t";
my $re = qr/[^[​:print​:]]/;
my $before_upgrade = $v =~ $re;
utf8​::upgrade($v);
my $after_upgrade = $v =~ $re;
print "upgrading changes whether tab matches $re\n" if $before_upgrade != $after_upgrade;


p5pRT commented Nov 7, 2008

From @demerphq

2008/11/7 karl williamson <public@​khwilliamson.com>​:

demerphq wrote​:

2008/10/30 karl williamson <public@​khwilliamson.com>​:

Tom Christiansen wrote​:

In-Reply-To​: Message from karl williamson <public@​khwilliamson.com> of
"Wed, 29 Oct 2008 10​:21​:33 MDT." <49088D8D.3050202@​khwilliamson.com>

Tom Christiansen wrote​:

PRESCRIPT​: Karl, my program at the end below should be good for
sniffing out \p and \P mutual-exclusion failure bugs such
as I
believe you recently reported.

Thanks for the program. My bug reports have mostly come from
running something similar, but not with nearly as many of the
classes as you did.
I'm guessing you didn't run it past 127, because it fails immediately
with a Malformed UTF-8 character (fatal) at (eval 3) line 2.

Yes, that's because it intentionally assumed ASCII only. The amended
program and results follow. You should now be to use this program
directly for your testing.

I actually don't think there are any issues with the \p and \P,
because those go out and use auto-constructed files. The
problems I have found have been in the posix classes and
entirely in the 128-255 range.

It's not just there. It's a bug in negating of POSIX char classes.
I can elicit errors at codepoints <128, even with Unicode semantics on.

Specifically, just runniing POSIX charclass tests alone​:

REPORT​: ranging from U+00 (0) .. U+00007F (127) [128 codepoints]
failed 28 tests, 3300 out of 3328 tests successful (0.991587%)

The problems are with the [​:print​:] and [​:punct​:] properties​:

Trouble w/U+007E​: TILDE​: Property "[​:punct​:]" failed
Trouble w/U+007C​: VERTICAL LINE​: Property "[​:punct​:]" failed
Trouble w/U+0060​: GRAVE ACCENT​: Property "[​:punct​:]" failed
Trouble w/U+005E​: CIRCUMFLEX ACCENT​: Property "[​:punct​:]" failed
Trouble w/U+003E​: GREATER-THAN SIGN​: Property "[​:punct​:]" failed
Trouble w/U+003D​: EQUALS SIGN​: Property "[​:punct​:]" failed
Trouble w/U+003C​: LESS-THAN SIGN​: Property "[​:punct​:]" failed
Trouble w/U+002B​: PLUS SIGN​: Property "[​:punct​:]" failed
Trouble w/U+0024​: DOLLAR SIGN​: Property "[​:punct​:]" failed
Trouble w/U+000D​: CARRIAGE RETURN (CR)​: Property "[​:print​:]" failed
Trouble w/U+000C​: FORM FEED (FF)​: Property "[​:print​:]" failed
Trouble w/U+000B​: LINE TABULATION​: Property "[​:print​:]" failed
Trouble w/U+000A​: LINE FEED (LF)​: Property "[​:print​:]" failed
Trouble w/U+0009​: CHARACTER TABULATION​: Property "[​:print​:]" failed

The positive char class test is like this​:

(
( $U_char =~ /\A[[​:print​:]]\z/ )
==
( $U_char !~ /\A[[​:^print​:]]\z/ )
)
&&
(
( $U_char =~ /\A[[​:print​:]]\z/ )
==
( $U_char !~ /\A[^[​:print​:]]\z/ )
)

The negative char class test is this​:

( $U_char =~ /\A[[​:^punct​:]]\z/ )
==
( $U_char =~ /\A[^[​:punct​:]]\z/ )

...

I wondered why you were getting errors when my own (which I thought were
rather extensive) test cases were getting none in the ASCII range. It
turns
out that it's because I hadn't thought to try the negation ^ in the outer
set of braces. Also, these problems don't occur when the characters
aren't
utf8 (when they are packed as C instead of U). And they fail in exactly
the
places that the posix classes don't match the unicode ones. It is
documented that [​:graph​:] includes the precise set of symbols that have
failures that \p{IsGraph} does not. Similarly for [​:print​:] and the ones
it
has failures for.

I haven't looked at the code, but what likely what is going on is that
the
complement of a complement loses these differences from unicode (only
when
the utf8 flag is on, because otherwise the unicode classes aren't looked
at)

My goal is to fix all these problems in the 128-255 range. It's turning
out
to be more work than I thought, essentially because there are a lot of
pre-existing errors and inconsistencies.

Just for the record in the bug ticket, I used the attached file to
determine that the affected patterns and strings are as follows​:

/[\w][\W]/
matches unicode codepoints​: 170 178..179 181 185..186 188..190
192..214 216..246 248

/[\s][\S]/
matches unicode codepoints​: 133 160

/[[​:alnum​:]][[​:^alnum​:]]/
matches unicode codepoints​: 170 181 186 192..214 216..246 248

/[[​:alpha​:]][[​:^alpha​:]]/
matches unicode codepoints​: 170 181 186 192..214 216..246 248

/[[​:cntrl​:]][[​:^cntrl​:]]/
matches unicode codepoints​: 128..159 173

/[[​:graph​:]][[​:^graph​:]]/
matches unicode codepoints​: 161

/[[​:lower​:]][[​:^lower​:]]/
matches unicode codepoints​: 170 181 186 223..246 248

/[[​:print​:]][[​:^print​:]]/
matches unicode codepoints​: 133 160

/[[​:punct​:]][[​:^punct​:]]/
matches unicode codepoints​: 36 43 60..62 94 96 124 126 161 171
183 187 191

/[[​:upper​:]][[​:^upper​:]]/
matches unicode codepoints​: 192..214 216

/[[​:space​:]][[​:^space​:]]/
matches unicode codepoints​: 133 160

/[[​:blank​:]][[​:^blank​:]]/
matches unicode codepoints​: 160

Thats a lot of characters. Sadly.

Im trying to figure out a work around that doesnt make non-unicode
slower while at the same time not involving a complete rewrite of how
character classes are stored. The problem is that as a speed
optimisation we expand these to their ascii bitmaps at compile time
(when not under use locale), which means we merge them with whatever
non special chars are in the charclass. We could not do that, but it
would slow down non-uncode a lot. Or i guess we could change the
charclass representation (essentially doubling it) and build both a
unciode and non-uncode equivalent. That would double the size of a
charclass, and be a reasonable amount of work to fix.

There are some "out of band" options tho. We chould implement the
"match semantics flags" as has been discussed in the past, and leave
the current implementation as is when the flags are omitted. And note
the problem. Anybody that cares would have to use the apropriate
modifier. We could do a more extreme version of this and just say that
ascii semantics apply for these items only, and remove the unicode
char class property logic when they are used, forcing people to use
explicit unicode semantics. (Or vice versa, but I dont think THAT is a
good idea.)

The bottom line is that regex metapatterns whose semantics change
under unicode and otherwise are evil. They have lead to all kinds of
trouble at all kinds of levels.

I think fixing this properly requires the pumpking / larry to make a
policy decision. Do we stick with the weird bifurcated behaviour
depending on string representation or not.

Yves

I don't know if this casts any light on the issue or not, but the attached
simple program shows what may be another type of failure.

$v = "\t";
my $re = qr/[^[​:print​:]]/;
my $before_upgrade = $v =~ $re;
utf8​::upgrade($v);
my $after_upgrade = $v =~ $re;
print "upgrading changes whether tab matches $re\n" if $before_upgrade !=
$after_upgrade;

This is essentially the same bug as we have been discussing.

Basically POSIX specifies that a horizontal tab "\t" is a member of
the following POSIX character classes​: cntrl, space, blank.

http​://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

However, it would appear that mktables has different ideas​:
lib/unicore/mktables line 874​:
  my $isspace =
      ($cat =~ /Zs|Zl|Zp/ &&
       $code != 0x200B)   # 200B is ZWSP which is for line break control
                          # and therefore it is not part of "space" even while it is "Zs".
      || $code == 0x0009  # 0009: HORIZONTAL TAB
      || $code == 0x000A  # 000A: LINE FEED
      || $code == 0x000B  # 000B: VERTICAL TAB
      || $code == 0x000C  # 000C: FORM FEED
      || $code == 0x000D  # 000D: CARRIAGE RETURN
      || $code == 0x0085  # 0085: NEL
      ;

  [snip to line 917]

  $Cat{Print}->$op($code) if $isgraph || $isspace;

So it would appear that at least some of our problems are of our own making.
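
The mismatch is easy to observe directly; a minimal check (TAB is the
clearest case, since POSIX puts it in cntrl and therefore outside print;
what \p{IsPrint} reports depends on whether the perl at hand still uses
the mktables logic quoted above, or a later corrected version):

```perl
use strict;
use warnings;

# POSIX: \t is a control character, so it is in [:cntrl:] and NOT in
# [:print:]. Under the 2008-era mktables logic quoted above, the
# Unicode-derived \p{IsPrint} included \t via $isspace, so the two
# notions of "printable" disagreed on this character.
my $tab = "\t";
printf "[[:cntrl:]]: %s\n", $tab =~ /[[:cntrl:]]/  ? "match" : "no match";
printf "[[:print:]]: %s\n", $tab =~ /[[:print:]]/  ? "match" : "no match";
printf "\\p{IsPrint}: %s\n", $tab =~ /\p{IsPrint}/ ? "match" : "no match";
```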

I'm not really sure what to do about this; however, I will say that I
*strongly* feel that we cannot have different interpretations for \w,
\d, \s and the POSIX charclasses under Unicode for codepoints 0-255.
What happens outside those codepoints is less important, as it
doesn't introduce logical inconsistencies like the ones we have seen
here.

\w should be [A-Za-z_]
\s should be [\r\n\t ]
\d should be [0-9]
and the POSIX charclasses unsurprisingly should be EXACTLY as defined
by the POSIX standard, and not pay ANY attention to what the Unicode
standard says.

My feeling is that there are perfectly acceptable ways to get a
"Unicode word character" using the \p{...} property notation (and if
there isn't one, there should be), and that the special character
classes should be defined according to ASCII.
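
The representation dependence being objected to can be shown with
NO-BREAK SPACE (a sketch; the split behaviour on any given perl is
exactly the complaint):

```perl
use strict;
use warnings;

# U+00A0 NO-BREAK SPACE: stored as a single latin-1 byte it fails \s
# under the byte-oriented rules, but the identical character upgraded
# to the UTF-8 representation matches \s under the Unicode rules.
my $nbsp = chr 0xA0;
my $as_bytes = $nbsp =~ /\s/ ? "match" : "no match";
utf8::upgrade($nbsp);           # same character, different internal encoding
my $as_utf8  = $nbsp =~ /\s/ ? "match" : "no match";
print "bytes: $as_bytes, utf8: $as_utf8\n";
```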

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Nov 7, 2008

From @rgs

2008/11/7 demerphq <demerphq@​gmail.com>​:

The bottom line is that regex metapatterns whose semantics change
under unicode and otherwise are evil. They have lead to all kinds of
trouble at all kinds of levels.

I think fixing this properly requires the pumpking / larry to make a
policy decision. Do we stick with the weird bifurcated behaviour
depending on string representation or not.

perltodo says​:

=head2 UTF-8 revamp

The handling of Unicode is unclean in many places. For example, the regexp
engine matches in Unicode semantics whenever the string or the pattern is
flagged as UTF-8, but that should not be dependent on an internal storage
detail of the string. Likewise, case folding behaviour is dependent on the
UTF8 internal flag being on or off.

=cut

So, no weird bifurcated behaviour.

I don't know if this casts any light on the issue or not, but the attached
simple program shows what may be another type of failure.

$v = "\t";
my $re = qr/[^[​:print​:]]/;
my $before_upgrade = $v =~ $re;
utf8​::upgrade($v);
my $after_upgrade = $v =~ $re;
print "upgrading changes whether tab matches $re\n" if $before_upgrade !=
$after_upgrade;

This is essentially the same bug as we have been discussing.

Basically POSIX specifies that a horizontal tab "\t" is a member of
the following POSIX character classes​: cntrl, space, blank.

http​://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

However, it would appear that mktables has different ideas​:
lib/unicore/mktables line 874​:
my $isspace =
($cat =~ /Zs|Zl|Zp/ &&
$code != 0x200B) # 200B is ZWSP which is for line break control
# and therefore it is not part of "space" even while it is "Zs".
|| $code == 0x0009 # 0009​: HORIZONTAL TAB
|| $code == 0x000A # 000A​: LINE FEED
|| $code == 0x000B # 000B​: VERTICAL TAB
|| $code == 0x000C # 000C​: FORM FEED
|| $code == 0x000D # 000D​: CARRIAGE RETURN
|| $code == 0x0085 # 0085​: NEL

       ;

[snip to line 917]
$Cat{Print}->$op($code) if $isgraph || $isspace;

So it would appear that at least some of our problems are of our own making.

Im not really sure what to do about this, however I will say that I
*strongly* feel that we cannot have different interpretations for \w,
\d, \s and the POSIX charclasses under unicode for codepoints 0-255.
What happens outside of these codepoints is less important as it
doesn't introduce logical inconsistencies like the ones we have seen
here.

\w should be [A-Za-z_]
\s should be [\r\n\t ]
\d should be [0-9]

regardless of whether "use locale" is in effect?
This is an incompatible change; "use locale" has historically been
the preferred way to modify the meaning of \w (and \s). I'm not sure we
can change that (short of completely deprecating locales...). On the
other hand, since "use locale" is more or less discouraged, we can
leave the old locale behaviour in its small ghetto.

(\d is less problematic, since locales don't affect it.)

and the POSIX charclasses unsurprisingly should be EXACTLY as defined
by the POSIX standard, and not pay ANY attention to what the Unicode
standard says.

As a spoiled C programmer, I expect POSIX charclasses to behave as in C
(for the range 0-255, that is, since C doesn't know any more than that).

My feeling is that there are perfectly acceptable ways to get a
"unicode word character", using the \P notation (and if there isnt
there should be) and that the special character classes should be
defined according to ascii.

Indeed, if \w starts matching Unicode word chars, and if we don't add a
new regexp flag /u, we can no longer have a behaviour of \w independent
of the internal string representation.

@p5pRT
Copy link
Author

p5pRT commented Nov 7, 2008

From @demerphq

2008/11/7 Rafael Garcia-Suarez <rgarciasuarez@​gmail.com>​:

2008/11/7 demerphq <demerphq@​gmail.com>​:

The bottom line is that regex metapatterns whose semantics change
under unicode and otherwise are evil. They have lead to all kinds of
trouble at all kinds of levels.

I think fixing this properly requires the pumpking / larry to make a
policy decision. Do we stick with the weird bifurcated behaviour
depending on string representation or not.

perltodo says​:

=head2 UTF-8 revamp

The handling of Unicode is unclean in many places. For example, the regexp
engine matches in Unicode semantics whenever the string or the pattern is
flagged as UTF-8, but that should not be dependent on an internal storage
detail of the string. Likewise, case folding behaviour is dependent on the
UTF8 internal flag being on or off.

=cut

So, no weird bifurcated behaviour.

Ok. Cool. So eventually we decide on a single behaviour. Good good.

I don't know if this casts any light on the issue or not, but the attached
simple program shows what may be another type of failure.

$v = "\t";
my $re = qr/[^[​:print​:]]/;
my $before_upgrade = $v =~ $re;
utf8​::upgrade($v);
my $after_upgrade = $v =~ $re;
print "upgrading changes whether tab matches $re\n" if $before_upgrade !=
$after_upgrade;

This is essentially the same bug as we have been discussing.

Basically POSIX specifies that a horizontal tab "\t" is a member of
the following POSIX character classes​: cntrl, space, blank.

http​://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

However, it would appear that mktables has different ideas​:
lib/unicore/mktables line 874​:
my $isspace =
($cat =~ /Zs|Zl|Zp/ &&
$code != 0x200B) # 200B is ZWSP which is for line break control
# and therefore it is not part of "space" even while it is "Zs".
|| $code == 0x0009 # 0009​: HORIZONTAL TAB
|| $code == 0x000A # 000A​: LINE FEED
|| $code == 0x000B # 000B​: VERTICAL TAB
|| $code == 0x000C # 000C​: FORM FEED
|| $code == 0x000D # 000D​: CARRIAGE RETURN
|| $code == 0x0085 # 0085​: NEL

       ;

[snip to line 917]
$Cat{Print}->$op($code) if $isgraph || $isspace;

So it would appear that at least some of our problems are of our own making.

Im not really sure what to do about this, however I will say that I
*strongly* feel that we cannot have different interpretations for \w,
\d, \s and the POSIX charclasses under unicode for codepoints 0-255.
What happens outside of these codepoints is less important as it
doesn't introduce logical inconsistencies like the ones we have seen
here.

\w should be [A-Za-z_]
\s should be [\r\n\t ]
\d should be [0-9]

regardless of whether "use locale" is in effect ?

No, use locale would do the same things it has always done. I have no
plans to change that. At an implementation level use locale causes
char-class tests to be deferred to match time with the appropriately
educated functions for doing the membership tests. Do a search for
"How's that for a conditional?" in regcomp.c :-) This makes them MUCH
slower.

This is an incompatible change, "use locale" has been historically
the preferred way to modify the meaning of \w (and \s). I'm not sure we
can change that (short of completely deprecating locales...) On the
other hand, since "use locale" is more or less decouraged, we can
leave the old locale behaviour in its small ghetto.

Yes, but I'm not proposing to change this.

(\d is less problematic, since locale don't affect it.)

Right, and it would actually close up some security holes.

and the POSIX charclasses unsurprisingly should be EXACTLY as defined
by the POSIX standard, and not pay ANY attention to what the Unicode
standard says.

As a spoiled C programmer, I expect POSIX charclasses to behave as in C
(for range 0-255, that is, since C doesn't know more.)

Agreed: there should be no deviation from what the POSIX standard
dictates, and it doesn't dictate that we include Unicode semantics.

My feeling is that there are perfectly acceptable ways to get a
"unicode word character", using the \P notation (and if there isnt
there should be) and that the special character classes should be
defined according to ascii.

Indeed, if \w starts [speaking to Rafael offline, he meant 'stops' here]
matching Unicode word chars, and if we don't add a new regexp flag /u,
we can no longer have a behaviour of \w independent of the internal
string representation.

We already have a way to make a word char match the Unicode
definition: use the Unicode property \p{IsWord}.

The whole problem (as far as the regex engine is concerned) is that we
have mixed our Perl/traditional-regex and POSIX definitions up with
Unicode definitions. Unicode has a set of property definitions which
should not have been confused with the Perl/regex/POSIX ones.

So \w has traditionally meant [A-Za-z_], but we went and mapped it to
the Unicode property IsWord, instead of defining our own synthetic
IsWordPerl that kept the traditional meaning. The same goes, for
instance, for the [:print:] POSIX definition: POSIX specifies that \t
is a 'cntrl' character, and thus excluded from 'print'. We mapped it
to something that includes IsSpace, which is wrong, since pretty much
any non-synthetic Unicode property definition is going to include
things that POSIX doesn't deal with, as far as I can tell. In so
doing we basically broke our character-class logic.

So if we were to do the minimum required to fix our character-class
logic, we would define a new set of synthetic properties which exactly
match the traditional definitions, and use them. All of a sudden, all
the complement bugs in their various forms would vanish.
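
Such synthetic properties can already be prototyped with user-defined
character properties (a sub whose name starts with Is or In and returns
hex ranges is picked up by \p{...}); the names IsPerlWord and
IsPerlSpace below are illustrative, not existing properties:

```perl
use strict;
use warnings;

# User-defined properties pinning the traditional ASCII-only meanings,
# independent of the string's internal representation. Each line of the
# returned string is "START\tEND" (or a single codepoint) in hex.
sub IsPerlWord  { "0030\t0039\n0041\t005A\n005F\n0061\t007A\n" }  # [0-9A-Za-z_]
sub IsPerlSpace { "0009\t000A\n000C\t000D\n0020\n" }              # [\t\n\f\r ]

my $str = "abc_123";
print $str =~ /\A\p{IsPerlWord}+\z/ ? "all traditional word chars\n"
                                    : "something else present\n";
```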

Of course this would be a backwards incompatible change. IMO one for
the better I might add, as it would fix a bunch of clear bugs, and
would close what I consider to be security holes.

However assuming that we can make the mapping controllable by a pragma
we can fix this in a backwards compatible way. Code that wants sane
semantics would do something like​:

use re 'ascii_charclass'; #default for 5.12
use re 'unicode_charclass';
use re 'legacy_charclass'; #default for 5.10

at the same time in 5.12 (or even maybe 5.10) we could introduce new
modifiers /A /U /L for this as well.

Hypothetically we could even support something like​:

  use re 'charclass_overload', '\s' => 'IsPerlSpace', '\w' => 'IsPerlWord',
      '\d' => 'IsTrueDigit';

Currently, especially because of the complement bugs we are seeing
and inconsistencies like the one pointed out by Karl, I consider it a
clear bug that these symbols are supposed to behave differently under
Unicode and otherwise.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"


p5pRT commented Nov 7, 2008

From @demerphq

I've merged in RT #49302 as it involves basically the same thing.


p5pRT commented Nov 7, 2008

From @demerphq

2008/11/7 demerphq <demerphq@​gmail.com>​:

So if we were to do the minimum required to fix our character class
logic, we would define a new set of synthetic properties which exactly
match their traditional counterparts, and use them. All of a sudden all
the complement bugs in the various forms would vanish.

Of course this would be a backwards incompatible change. IMO one for
the better I might add, as it would fix a bunch of clear bugs, and
would close what I consider to be security holes.

However assuming that we can make the mapping controllable by a pragma
we can fix this in a backwards compatible way. Code that wants sane
semantics would do something like​:

I just applied the following​:

34769 on 2008/11/07 by demerphq@​demerphq-fresh

  create new unicode props as defined in POSIX spec (optionally use
  them in the regex engine)

  Perlbug #60156 and #49302 (and probably others) resolve down to the problem
  that the definition of \s and \w and \d and the POSIX charclasses are
  different for unicode strings and for non-unicode strings. This broke the
  character class logic in the regex engine. The easiest fix to make the
  character class logic sane again is to define new properties which do match.
 
  This change creates new property classes that can be used instead of the
  traditional ones (it does not change the previously defined ones). If the
  define in regcomp.h​:
 
  #define PERL_LEGACY_UNICODE_CHARCLASS_MAPPINGS 1
 
  is changed to 0, then the new mappings will be used. This will fix a bunch
  of bugs that are reported as TODO items in the new reg_posixcc.t test file.

This is the first step, I guess, in fixing this problem. The next step
is figuring out how to make which mapping is used controllable by a
pragma instead of a build-time configure setting.

--
perl -Mre=debug -e "/just|another|perl|hacker/"


p5pRT commented Nov 8, 2008

From @druud62

demerphq wrote:

\w should be [A-Za-z_]

ITYM​: [0-9A-Za-z_]

--
Affijn, Ruud

"Gewoon is een tijger."


p5pRT commented Nov 8, 2008

From ambs@zbr.pt

Dr.Ruud wrote​:

demerphq wrote:

\w should be [A-Za-z_]

ITYM​: [0-9A-Za-z_]

for non US/UK folks​: [0-9[​:alpha​:]_]

--
Alberto Simões - Departamento de Informática - Universidade do Minho
  Campus de Gualtar - 4710-057 Braga - Portugal


p5pRT commented Nov 8, 2008

From @demerphq

2008/11/8 Dr.Ruud <rvtol+news@​isolution.nl>​:

demerphq wrote:

\w should be [A-Za-z_]

ITYM​: [0-9A-Za-z_]

Yes, that is what I meant. *blush*

:-)

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"


p5pRT commented Nov 8, 2008

From @khwilliamson

demerphq wrote​:

[snip]

Of course this would be a backwards incompatible change. IMO one for
the better I might add, as it would fix a bunch of clear bugs, and
would close what I consider to be security holes.

It would help me to know some of these holes, so I know what to watch
for in doing my own coding.

[snip]


p5pRT commented Nov 9, 2008

From @druud62

Alberto Simões wrote:

Dr.Ruud​:

demerphq​:

\w should be [A-Za-z_]

ITYM​: [0-9A-Za-z_]

for non US/UK folks​: [0-9[​:alpha​:]_]

That assumes that [[​:alpha​:]] is equivalent to [A-Za-z]. Thus that both
[A-Za-z] and [a-zA-Z] map to [[​:alpha​:]].
Then [a-z] would act exactly as [[​:lower​:]]. But do they?

I don't think so, already because the POSIX [​:alpha​:] and [​:lower​:] etc.
are locale dependent.

--
Affijn, Ruud

"Gewoon is een tijger."


p5pRT commented Nov 9, 2008

From krahnj@telus.net

Alberto Simões wrote​:

Dr.Ruud wrote​:

demerphq schreef​:

\w should be [A-Za-z_]

ITYM​: [0-9A-Za-z_]

for non US/UK folks​: [0-9[​:alpha​:]_]

ITYM​: [[​:digit​:][​:alpha​:]_] Or​: [[​:alnum​:]_]

John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall


p5pRT commented Nov 9, 2008

From @demerphq

2008/11/9 Dr.Ruud <rvtol+news@​isolution.nl>​:

Alberto Simões wrote:

Dr.Ruud​:

demerphq​:

\w should be [A-Za-z_]

ITYM​: [0-9A-Za-z_]

for non US/UK folks​: [0-9[​:alpha​:]_]

That assumes that [[​:alpha​:]] is equivalent to [A-Za-z]. Thus that both
[A-Za-z] and [a-zA-Z] map to [[​:alpha​:]].
Then [a-z] would act exactly as [[​:lower​:]]. But do they?

In the POSIX locale yes they are.

I don't think so, already because the POSIX [​:alpha​:] and [​:lower​:] etc.
are locale dependent.

Under use locale they are, yes; otherwise they sort of assume POSIX
semantics, unless the string is unicode, in which case they behave
quite strangely.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"


p5pRT commented Nov 9, 2008

From @druud62

demerphq wrote:

Dr.Ruud​:

Alberto Simões​:

Dr.Ruud​:

demerphq​:

\w should be [A-Za-z_]

ITYM​: [0-9A-Za-z_]

for non US/UK folks​: [0-9[​:alpha​:]_]

That assumes that [[​:alpha​:]] is equivalent to [A-Za-z]. Thus that
both [A-Za-z] and [a-zA-Z] map to [[​:alpha​:]].
Then [a-z] would act exactly as [[​:lower​:]]. But do they?

In the POSIX locale yes they are.

Of course, but in what I wrote there was no context such as "the POSIX
locale" involved.

Alberto, who went from

  \w should be [0-9A-Za-z_]

to

  [0-9[​:alpha​:]_]

was I think missing the point that \w should be [0-9A-Za-z_] (so
containing exactly 63 codepoints).

I don't think so, already because the POSIX [​:alpha​:] and [​:lower​:]
etc. are locale dependent.

Under use locale they are, yes; otherwise they sort of assume POSIX
semantics, unless the string is unicode, in which case they behave
quite strangely.

The "[​:alpha​:]" and "[​:lower​:]" (and such) are "POSIX character
classes".
Each of them can be different per locale. They are defined that way, no
Perl involved.

--
Affijn, Ruud

"Gewoon is een tijger."


p5pRT commented Nov 9, 2008

From @druud62

demerphq wrote:

Dr.Ruud​:

That assumes that [[​:alpha​:]] is equivalent to [A-Za-z]. Thus that
both [A-Za-z] and [a-zA-Z] map to [[​:alpha​:]].
Then [a-z] would act exactly as [[​:lower​:]]. But do they?

In the POSIX locale yes they are.

I don't know exactly which of my above statements you refer to, but from
the "are" I guess it is the first.

Allowing my general statements to be combined with the POSIX locale
limits that you attached to them, I think it is like this​:

the RE character sets [A-Za-z] and [a-zA-Z] contain exactly 52
codepoints, independent of locale, and
the POSIX character set [:alpha:] for the POSIX locale contains *at
minimum* 52 codepoints, namely the combination of [:lower:] and
[:upper:].

--
Affijn, Ruud

"Gewoon is een tijger."


p5pRT commented Nov 9, 2008

From @demerphq

2008/11/9 Dr.Ruud <rvtol+news@​isolution.nl>​:

demerphq wrote:

Dr.Ruud​:

That assumes that [[​:alpha​:]] is equivalent to [A-Za-z]. Thus that
both [A-Za-z] and [a-zA-Z] map to [[​:alpha​:]].
Then [a-z] would act exactly as [[​:lower​:]]. But do they?

In the POSIX locale yes they are.

I don't know exactly which of my above statements you refer to, but from
the "are" I guess it is the first.

See the bottomish part of

http​://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

Where it says​:

-------8<--------8<---------8<--------
upper
  Define characters to be classified as uppercase letters.

  In the POSIX locale, the 26 uppercase letters shall be included​:

  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

  In a locale definition file, no character specified for the
keywords cntrl, digit, punct, or space shall be specified. The
uppercase letters <A> to <Z>, as defined in Character Set Description
File (the portable character set), are automatically included in this
class.

lower
  Define characters to be classified as lowercase letters.

  In the POSIX locale, the 26 lowercase letters shall be included​:

  a b c d e f g h i j k l m n o p q r s t u v w x y z

  In a locale definition file, no character specified for the
keywords cntrl, digit, punct, or space shall be specified. The
lowercase letters <a> to <z> of the portable character set are
automatically included in this class.

alpha
  Define characters to be classified as letters.

  In the POSIX locale, all characters in the classes upper and lower
shall be included.

  In a locale definition file, no character specified for the
keywords cntrl, digit, punct, or space shall be specified. Characters
classified as either upper or lower are automatically included in this
class.

digit
  Define the characters to be classified as numeric digits.

  In the POSIX locale, only​:

  0 1 2 3 4 5 6 7 8 9

  shall be included.

  In a locale definition file, only the digits <zero>, <one>, <two>,
<three>, <four>, <five>, <six>, <seven>, <eight>, and <nine> shall be
specified, and in contiguous ascending sequence by numerical value.
The digits <zero> to <nine> of the portable character set are
automatically included in this class.

alnum
  Define characters to be classified as letters and numeric digits.
Only the characters specified for the alpha and digit keywords shall
be specified. Characters specified for the keywords alpha and digit
are automatically included in this class.

space
  Define characters to be classified as white-space characters.

  In the POSIX locale, at a minimum, the <space>, <form-feed>,
<newline>, <carriage-return>, <tab>, and <vertical-tab> shall be
included.

  In a locale definition file, no character specified for the
keywords upper, lower, alpha, digit, graph, or xdigit shall be
specified. The <space>, <form-feed>, <newline>, <carriage-return>,
<tab>, and <vertical-tab> of the portable character set, and any
characters included in the class blank are automatically included in
this class.

cntrl
  Define characters to be classified as control characters.

  In the POSIX locale, no characters in classes alpha or print shall
be included.

  In a locale definition file, no character specified for the
keywords upper, lower, alpha, digit, punct, graph, print, or xdigit
shall be specified.

punct
  Define characters to be classified as punctuation characters.

  In the POSIX locale, neither the <space> nor any characters in
classes alpha, digit, or cntrl shall be included.

  In a locale definition file, no character specified for the
keywords upper, lower, alpha, digit, cntrl, xdigit, or as the <space>
shall be specified.

graph
  Define characters to be classified as printable characters, not
including the <space>.

  In the POSIX locale, all characters in classes alpha, digit, and
punct shall be included; no characters in class cntrl shall be
included.

  In a locale definition file, characters specified for the keywords
upper, lower, alpha, digit, xdigit, and punct are automatically
included in this class. No character specified for the keyword cntrl
shall be specified.

print
  Define characters to be classified as printable characters,
including the <space>.

  In the POSIX locale, all characters in class graph shall be
included; no characters in class cntrl shall be included.

  In a locale definition file, characters specified for the keywords
upper, lower, alpha, digit, xdigit, punct, graph, and the <space> are
automatically included in this class. No character specified for the
keyword cntrl shall be specified.

xdigit
  Define the characters to be classified as hexadecimal digits.

  In the POSIX locale, only​:

  0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f

  shall be included.

  In a locale definition file, only the characters defined for the
class digit shall be specified, in contiguous ascending sequence by
numerical value, followed by one or more sets of six characters
representing the hexadecimal digits 10 to 15 inclusive, with each set
in ascending order (for example, <A>, <B>, <C>, <D>, <E>, <F>, <a>,
<b>, <c>, <d>, <e>, <f>). The digits <zero> to <nine>, the uppercase
letters <A> to <F>, and the lowercase letters <a> to <f> of the
portable character set are automatically included in this class.

blank
  Define characters to be classified as <blank>s.

  In the POSIX locale, only the <space> and <tab> shall be included.

  In a locale definition file, the <space> and <tab> are
automatically included in this class.
------->8------->8--------->8--------

One important point to keep in mind, which for me is a real issue for
things like making [​:print​:] map to some unicode based definition is
that the rules are VERY specific about what types of characters are
allowed to be in a given class. So for instance the comment for
[​:print​:] says "In a locale definition file, characters specified for
the keywords upper, lower, alpha, digit, xdigit, punct, graph, and the
<space> are automatically included in this class. No character
specified for the keyword cntrl shall be specified." This says to me
that if unicode wants to include tab in both "control" and "space"
that it is then impossible to correctly use either to define
[​:print​:].

Allowing my general statements to be combined with the POSIX locale
limits that you attached to them, I think it is like this​:

the RE character sets [A-Za-z] and [a-zA-Z] contain exactly 52
codepoints, independent of locale, and
the POSIX characterset [​:alpha​:] for the POSIX locale contains *at
minimum* 52 codepoints, namely the combination of [​:lower​:] and
[​:upper​:].

The last part is wrong. Under the POSIX locale (which is a locale
defined in the POSIX standard as being the locale that is used unless
additional locale data is provided via the standard interfaces) the
character class [:alpha:] contains exactly 52 characters. Under some
other locale the POSIX character class [:alpha:] /may/ have other
meanings; however, this behavior is only enabled in Perl under 'use
locale;' (and will make regex matching much slower).

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"


p5pRT commented Sep 2, 2009

From @demerphq

resolved in perl 5.11


p5pRT commented Sep 2, 2009

@demerphq - Status changed from 'open' to 'resolved'


p5pRT commented Aug 11, 2010

@khwilliamson - Status changed from 'resolved' to 'open'


p5pRT commented Aug 14, 2010

From @khwilliamson

There are a number of problems with the [[​:posix​:]] character classes.
I thought we had what to do about this settled, but that was before
there was more of an emphasis on strict backwards compatibility, and
before I did some more investigation, so I thought I had better air it
again.

Here are the problems​:

1) They do not match the Posix standard. In our attempt to DWIM, we
violate it. For example, [[​:alpha​:]] is only supposed to match A-Za-z,
unless in a locale that has other alphabetics. But, if the target
string or pattern indicate a utf8 match, it matches \p{alpha}. I
suppose we could argue that we have created a new locale, the Unicode
locale. I don't know if that argument holds water or not.

2) They suffer from "The Unicode Bug", in which the utf8ness of the
pattern or string affects the semantics of the match. [[​:alpha​:]] will
match "\xe1" if and only if the pattern or target string are in utf8.

3) A number of characters in utf8 match both a class and the complement
of the class. Here's a list from bug #60156​:
  [[​:alnum​:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
  [[​:alpha​:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
  [[​:blank​:]] U+A0
  [[​:cntrl​:]] U+80
  [[​:graph​:]] U+A1
  [[​:lower​:]] U+AA U+B5 U+BA U+DF..F6 U+F8
  [[​:print​:]] U+A0
  [[​:punct​:]] U+24 U+2B U+3C..3E U+5E U+60 U+7C U+7E U+A1 U+AB U+B7 U+BB
U+BF
  [[​:space​:]] U+85 U+A0
  [[​:upper​:]] U+C0..D6 U+D8
  [[​:word​:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8

Note that some of these are ASCII. The root cause of these is mostly
from the same causes as the Unicode bug, but also because when they are
stored in utf8 the code re-uses an existing, but not quite
corresponding, \p{} property

4) Extending the posix definitions was not done consistently. This is
especially noticeable in punct. Unicode splits what Posix considers
punctuation into two classes​: punctuation and symbol. But in extending
[[​:punct​:]] to beyond ASCII, Perl doesn't include the Unicode symbols.
The result is inconsistent, the ASCII range symbols are included, but no
other.

It is less clear about other extensions. Should [[​:cntrl​:]] include
other things that Unicode considers control-like, namely the surrogates,
the formats (soft hyphen et al.), and private use characters? What about
title case, fractions, super and subscripts?

Before, it seemed like the obvious solution to all this was to just go
back to the formal Posix definition of what they should match, not
having a "Unicode locale", and that was done via #ifdefs for a while in
5.11. But it was part of a larger patch that it was decided to revert.
Now the #ifdefs remain defined in the other direction, and
perlrecharclass.pod in 5.12 says that it is proposed to make these match
the Posix standard exactly, asking anyone who disagrees to notify us.
There has so far been none.

If we were to just reinstate those #ifdefs, it would fix all the above
problems in one fell swoop. But it seems to me that we will break too
much existing code. I think it was a mistake extending these
definitions to a made-up "Unicode locale" in the first place, but that
ship has sailed, I think, in spite of what we thought we had decided
earlier.

I have done some investigation, and it appears that I can easily solve
problem 3) by creating more properties in mktables tailored just for
these posix character classes; and easily solve 2) for regexes compiled
under feature unicode_strings, by extending what I'm already about to
submit a patch for regarding [\w\s]. I think I should do this, ripping
out the #ifdefs.

If we want to restrict the posix classes to strict posix definitions, I
think it probably should be done with a pragma​: 'use feature
"strict_posix"' or 'use re "strict_posix"'. This is not as
high-priority in my view; and I'm not certain it even needs to be done
at all if 2) and 3) are fixed.

I think, for consistency, especially if we don't add the strict posix
interpretations that punct should change to include the Unicode symbols
as well; I think the other inconsistencies are not something to worry
about; but am less confident in this.

Comments?


p5pRT commented Aug 14, 2010

From juerd@tnx.nl

karl williamson wrote 2010-08-14 11:09 (-0600):

1) They do not match the Posix standard. In our attempt to DWIM, we
violate it.

This has been the case for many years and I think it's a good idea to
keep it dwimmy. A strict POSIX mode could perhaps be added, but breaking
backwards compatibility just to comply with an outdated standard feels
wrong. Perl has, in a way, set a new standard for these character
classes even though they're still referred to as "POSIX".

Perhaps the bug is that we're still calling them POSIX character
classes, and the fix is to just rename them to POSIX-like, POSIX-ish, or
something entirely different.

2) They suffer from "The Unicode Bug", in which the utf8ness of the
pattern or string affects the semantics of the match.

This is bad. I'm strongly convinced that all text operations should be
unicody by default, regardless of internal representation.

3) A number of characters in utf8 match both a class and the complement
of the class.

This is not necessarily a problem, depending on the reasons for the dual
matching. It's certainly debatable whether the non-breaking space is
whitespace, non-whitespace, both, or perhaps even neither.

4) Extending the posix definitions was not done consistently. This is
especially noticeable in punct. (...)
It is less clear about other extensions.

This one's really tough. I'd be in favor of fixing consistency. That'll
require thorough investigation.

Before, it seemed like the obvious solution to all this was to just go
back to the formal Posix definition of what they should match, not
having a "Unicode locale", and that was done via #ifdefs for a while in
5.11. But it was part of a larger patch that it was decided to revert.

I was not particularly happy with this specific change. Going back
never felt right to me, although I appreciate that it is the only easy
solution and perhaps even the only thing that deserves to be called a
solution.

If we want to restrict the posix classes to strict posix definitions, I
think it probably should be done with a pragma​: 'use feature
"strict_posix"' or 'use re "strict_posix"'.

It could be argued that this should belong in "use POSIX", maybe implied
when loading it with the default exports, maybe only enabled when
explicitly mentioned in the import list.

I think, for consistency, especially if we don't add the strict posix
interpretations that punct should change to include the Unicode symbols
as well; I think the other inconsistencies are not something to worry
about; but am less confident in this.

Agreed.
--
Met vriendelijke groet, // Kind regards, // Korajn salutojn,

Juerd Waalboer <juerd@​tnx.nl>
TNX


p5pRT commented Aug 17, 2010

From @khwilliamson

Juerd Waalboer wrote​:

karl williamson wrote 2010-08-14 11:09 (-0600):

1) They do not match the Posix standard. In our attempt to DWIM, we
violate it.

This has been the case for many years and I think it's a good idea to
keep it dwimmy. A strict POSIX mode could perhaps be added, but breaking
backwards compatibility just to comply with an outdated standard feels
wrong. Perl has, in a way, set a new standard for these character
classes even though they're still referred to as "POSIX".

Perhaps the bug is that we're still calling them POSIX character
classes, and the fix is to just rename them to POSIX-like, POSIX-ish, or
something entirely different.

2) They suffer from "The Unicode Bug", in which the utf8ness of the
pattern or string affects the semantics of the match.

This is bad. I'm strongly convinced that all text operations should be
unicody by default, regardless of internal representation.

That will be the case eventually if you 'use 5.12.0' or greater.

3) A number of characters in utf8 match both a class and the complement
of the class.

This is not necessarily a problem, depending on the reasons for the dual
matching. It's certainly debatable whether the non-breaking space is
whitespace, non-whitespace, both, or perhaps even neither.

But the definition of the complement of a set is all characters that
aren't in the original set. We don't allow for fuzziness.

4) Extending the posix definitions was not done consistently. This is
especially noticeable in punct. (...)
It is less clear about other extensions.

This one's really tough. I'd be in favor of fixing consistency. That'll
require thorough investigation.

It turns out that punct is the only one that has this problem.

Before, it seemed like the obvious solution to all this was to just go
back to the formal Posix definition of what they should match, not
having a "Unicode locale", and that was done via #ifdefs for a while in
5.11. But it was part of a larger patch that it was decided to revert.

I was not particularly happy with this specific change. Going back
never felt right to me, although I appreciate that it is the only easy
solution and perhaps even the only thing that deserves to be called a
solution.

If we want to restrict the posix classes to strict posix definitions, I
think it probably should be done with a pragma​: 'use feature
"strict_posix"' or 'use re "strict_posix"'.

It could be argued that this should belong in "use POSIX", maybe implied
when loading it with the default exports, maybe only enabled when
explicitly mentioned in the import list.

OK, if we do this, it makes sense to use the existing pragma.

I think, for consistency, especially if we don't add the strict posix
interpretations that punct should change to include the Unicode symbols
as well; I think the other inconsistencies are not something to worry
about; but am less confident in this.

Agreed.

Having investigated further, I've implemented things so that the bugs go
away without having to go back to strict Posix. So that's what I intend
to do, unless there is sufficient complaint.

I also intend to separately change the extended definition of
[[:punct:]] to include the Unicode symbols, as that must also have been
the original intent, and should be the behavior, unless there is
sufficient complaint.


p5pRT commented Aug 20, 2010

From @demerphq

On 14 August 2010 19​:09, karl williamson <public@​khwilliamson.com> wrote​:

There are a number of problems with the [[​:posix​:]] character classes. I
thought we had what to do about this settled, but that was before there was
more of an emphasis on strict backwards compatibility, and before I did some
more investigation, so I thought I had better air it again.

Here are the problems​:

1) They do not match the Posix standard.  In our attempt to DWIM, we violate
it.  For example, [[​:alpha​:]] is only supposed to match A-Za-z, unless in a
locale that has other alphabetics.  But, if the target string or pattern
indicate a utf8 match, it matches \p{alpha}.  I suppose we could argue that
we have created a new locale, the Unicode locale.  I don't know if that
argument holds water or not.

2) They suffer from "The Unicode Bug", in which the utf8ness of the pattern
or string affects the semantics of the match.  [[​:alpha​:]] will match "\xe1"
if and only if the pattern or target string are in utf8.

3) A number of characters in utf8 match both a class and the complement of
the class.  Here's a list from bug #60156​:
 [[​:alnum​:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
 [[​:alpha​:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
 [[​:blank​:]] U+A0
 [[​:cntrl​:]] U+80
 [[​:graph​:]] U+A1
 [[​:lower​:]] U+AA U+B5 U+BA U+DF..F6 U+F8
 [[​:print​:]] U+A0
 [[​:punct​:]] U+24 U+2B U+3C..3E U+5E U+60 U+7C U+7E U+A1 U+AB U+B7 U+BB U+BF
 [[​:space​:]] U+85 U+A0
 [[​:upper​:]] U+C0..D6 U+D8
 [[​:word​:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8

Note that some of these are ASCII.  The root cause of these is mostly from
the same causes as the Unicode bug, but also because when they are stored in
utf8 the code re-uses an existing, but not quite corresponding, \p{}
property

4) Extending the posix definitions was not done consistently.  This is
especially noticeable in punct.  Unicode splits what Posix considers
punctuation into two classes​: punctuation and symbol.  But in extending
[[​:punct​:]] to beyond ASCII, Perl doesn't include the Unicode symbols. The
result is inconsistent, the ASCII range symbols are included, but no other.

It is less clear about other extensions.  Should [[​:cntrl​:]] include other
things that Unicode considers control-like, namely the surrogates, the
formats (soft hyphen et.al), and private use characters?  What about title
case, fractions, super and subscripts?

Before, it seemed like the obvious solution to all this was to just go back
to the formal Posix definition of what they should match, not having a
"Unicode locale", and that was done via #ifdefs for a while in 5.11.  But it
was part of a larger patch that it was decided to revert. Now the #ifdefs
remain defined in the other direction, and perlrecharclass.pod in 5.12 says
that it is proposed to make these match the Posix standard exactly, asking
anyone who disagrees to notify us. There has so far been none.

If we were to just reinstate those #ifdefs, it would fix all the above
problems in one fell swoop.  But it seems to me that we will break too much
existing code.  I think it was a mistake extending these definitions to a
made-up "Unicode locale" in the first place, but that ship has sailed, I
think, in spite of what we thought we had decided earlier.

I have done some investigation, and it appears that I can easily solve
problem 3) by creating more properties in mktables tailored just for these
posix character classes; and easily solve 2) for regexes compiled under
feature unicode_strings, by extending what I'm already about to submit a
patch for regarding [\w\s].  I think I should do this, ripping out the
#ifdefs

If we want to restrict the posix classes to strict posix definitions, I
think it probably should be done with a pragma​: 'use feature "strict_posix"'
or 'use re "strict_posix"'.  This is not as high-priority in my view; and
I'm not certain it even needs to be done at all if 2) and 3) are fixed.

I think, for consistency, especially if we don't add the strict posix
interpretations that punct should change to include the Unicode symbols as
well; I think the other inconsistencies are not something to worry about;
but am less confident in this.

Comments?

POSIX is a standard. It is NOT up to us to redefine that standard. Had
we realized at the time that we were breaking the standard, and the
massive can of worms involved, I do not think we would have gone the
way we did. I think it would be a HUGE benefit to return to the correct
interpretation of POSIX charclasses, and I do not think that backcompat
will be impacted beyond a bunch of buggy programs ceasing to be buggy.

Anyway, thats my view.

cheers
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"


p5pRT commented Nov 30, 2010

From @Abigail

On Fri, Aug 20, 2010 at 04​:41​:48PM +0200, demerphq wrote​:

On 14 August 2010 19​:09, karl williamson <public@​khwilliamson.com> wrote​:

There are a number of problems with the [[​:posix​:]] character classes. I
thought we had what to do about this settled, but that was before there was
more of an emphasis on strict backwards compatibility, and before I did some
more investigation, so I thought I had better air it again.

Here are the problems​:

1) They do not match the Posix standard.  In our attempt to DWIM, we violate
it.  For example, [[​:alpha​:]] is only supposed to match A-Za-z, unless in a
locale that has other alphabetics.  But, if the target string or pattern
indicate a utf8 match, it matches \p{alpha}.  I suppose we could argue that
we have created a new locale, the Unicode locale.  I don't know if that
argument holds water or not.

2) They suffer from "The Unicode Bug", in which the utf8ness of the pattern
or string affects the semantics of the match.  [[​:alpha​:]] will match "\xe1"
if and only if the pattern or target string are in utf8.

3) A number of characters in utf8 match both a class and the complement of
the class.  Here's a list from bug #60156​:
 [[​:alnum​:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
 [[​:alpha​:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8
 [[​:blank​:]] U+A0
 [[​:cntrl​:]] U+80
 [[​:graph​:]] U+A1
 [[​:lower​:]] U+AA U+B5 U+BA U+DF..F6 U+F8
 [[​:print​:]] U+A0
 [[​:punct​:]] U+24 U+2B U+3C..3E U+5E U+60 U+7C U+7E U+A1 U+AB U+B7 U+BB U+BF
 [[​:space​:]] U+85 U+A0
 [[​:upper​:]] U+C0..D6 U+D8
 [[​:word​:]] U+AA U+B5 U+BA U+C0..D6 U+D8..F6 U+F8

Note that some of these are ASCII.  The root cause of these is mostly from
the same causes as the Unicode bug, but also because when they are stored in
utf8 the code re-uses an existing, but not quite corresponding, \p{}
property

4) Extending the posix definitions was not done consistently.  This is
especially noticeable in punct.  Unicode splits what Posix considers
punctuation into two classes​: punctuation and symbol.  But in extending
[[​:punct​:]] to beyond ASCII, Perl doesn't include the Unicode symbols. The
result is inconsistent, the ASCII range symbols are included, but no other.

It is less clear about other extensions.  Should [[​:cntrl​:]] include other
things that Unicode considers control-like, namely the surrogates, the
formats (soft hyphen et.al), and private use characters?  What about title
case, fractions, super and subscripts?

Before, it seemed like the obvious solution to all this was to just go back
to the formal Posix definition of what they should match, not having a
"Unicode locale", and that was done via #ifdefs for a while in 5.11.  But it
was part of a larger patch that it was decided to revert.  Now the #ifdefs
remain defined the other direction, and perlrecharclass.pod in 5.12 says
that it is proposed to make these match the Posix standard exactly, asking
anyone who disagrees to notify us.  So far there have been no objections.

If we were to just reinstate those #ifdefs, it would fix all the above
problems in one fell swoop.  But it seems to me that we will break too much
existing code.  I think it was a mistake extending these definitions to a
made-up "Unicode locale" in the first place, but that ship has sailed, I
think, in spite of what we thought we had decided earlier.

I have done some investigation, and it appears that I can easily solve
problem 3) by creating more properties in mktables tailored just for these
posix character classes; and easily solve 2) for regexes compiled under
feature unicode_strings, by extending what I'm already about to submit a
patch for regarding [\w\s].  I think I should do this, ripping out the
#ifdefs.

If we want to restrict the posix classes to strict posix definitions, I
think it probably should be done with a pragma: 'use feature "strict_posix"'
or 'use re "strict_posix"'.  This is not as high-priority in my view; and
I'm not certain it even needs to be done at all if 2) and 3) are fixed.

I think, for consistency, especially if we don't add the strict posix
interpretations, that punct should change to include the Unicode symbols as
well.  The other inconsistencies are not, I think, something to worry
about; but I am less confident in this.

Comments?

POSIX is a standard.  It is NOT up to us to redefine that standard.  Had
we realized at the time that we were breaking the standard, and the
massive can of worms involved, I do not think we would have gone the way
we did.  I think it would be a HUGE benefit to return to the correct
interpretation of POSIX charclasses, and I do not think that backcompat
will be impacted any more than a bunch of buggy programs stop being
buggy.

It's a bit late, but I agree with Yves.  POSIX is a standard.  It defines
what goes in [[:posix:]].  Unicode may be a newer standard, but it has
its own set of properties, \p{Property}.  By "extending" POSIX character
classes so they are (more or less) equivalent to Unicode properties,
we've actually taken away functionality.

Abigail

p5pRT commented Jan 19, 2011

From @khwilliamson

This has finally been fixed in blead
--Karl Williamson

p5pRT commented Jan 19, 2011

@khwilliamson - Status changed from 'open' to 'resolved'
