UTF-8 bug: latin1 character both [\w] and [\W] #6065

p5pRT · 2002-11-08T15:11:29Z

Migrated from rt.perl.org#18281 (status was 'resolved')

Searchable as RT18281$

p5pRT · 2002-11-08T15:11:29Z

From @andk

Created by @andk

The terse test would be:

$_ = "\x{df}"; utf8::upgrade($_);
print /[\w]/ ^ /[\W]/ ? "ok\n" : "not ok\n";

A more illustrative test to wake you up:

% /usr/local/perl-5.8.0@18125/bin/perl -le '
$_ = "\x{df}"; utf8::upgrade($_);
if (/([\w])/){
    warn "a letter";
}
if (/([\W])/){
    warn "not a letter";
}
'
a letter at -e line 4.
not a letter at -e line 7.

I've tested with all perls since patch 9340 and they all misbehave.

Perl Info


Flags:
    category=core
    severity=high

This perlbug was built using Perl v5.8.0 - Mon Sep  9 18:12:37 UTC 2002
It is being executed now by  Perl v5.8.0 - Mon Sep  9 18:02:36 UTC 2002.

Site configuration information for perl v5.8.0:

Configured by root at Mon Sep  9 18:02:36 UTC 2002.

Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.4.19, archname=i586-linux-thread-multi
    uname='linux bloembergen 2.4.19 #1 mon apr 15 08:57:26 gmt 2002 i686 unknown '
    config_args='-ds -e -Dprefix=/usr -Dusethreads -Di_db -Di_dbm -Di_ndbm -Di_gdbm -Duseshrplib=true'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O3 --pipe',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing'
    ccversion='', gccversion='3.2', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =''
    libpth=/lib /usr/lib /usr/local/lib
    libs=-lnsl -ldl -lm -lpthread -lc -lcrypt -lutil
    perllibs=-lnsl -ldl -lm -lpthread -lc -lcrypt -lutil
    libc=, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.2.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5/5.8.0/i586-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared'

Locally applied patches:
    


@INC for perl v5.8.0:
    /usr/lib/perl5/5.8.0/i586-linux-thread-multi
    /usr/lib/perl5/5.8.0
    /usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.0
    /usr/lib/perl5/site_perl
    .


Environment for perl v5.8.0:
    HOME=/home/k
    LANG=C
    LANGUAGE (unset)
    LC_COLLATE=POSIX
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/k/bin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/kde3/bin:/usr/lib/java/jre/bin:.:/usr/local/perl/bin:/usr/X11/bin:/sbin:/usr/sbin
    PERL_BADLANG (unset)
    SHELL=/usr/bin/zsh

p5pRT · 2002-12-11T23:13:03Z

From @jhi

I now took a look at this and it is surprisingly tricky (or so I think).
The problem comes from the dual model of Perl: bytes and (Unicode)
characters. The [\W] falsely matches because \xDF is *on* in the
*byte* side of the \W. My quick, straightforward attempts at fixing
this break one of the split.t tests, where the split pattern
includes \x80.
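
To make the byte/character split concrete, here is a minimal sketch (the variable names are illustrative, not taken from the report). It shows only that utf8::upgrade changes the internal representation of a string and not its value, which is the precondition for the two sides of the regex engine disagreeing about the same character:

```perl
use strict;
use warnings;

# The same one-character string in both internal representations.
my $native   = "\x{df}";      # stored as a single byte
my $upgraded = $native;
utf8::upgrade($upgraded);     # stored as UTF-8, same character

my $same_value    = $native eq $upgraded     ? 1 : 0;
my $native_flag   = utf8::is_utf8($native)   ? 1 : 0;
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;

printf "same value: %d\n", $same_value;
printf "lengths:    %d and %d\n", length($native), length($upgraded);
printf "UTF-8 flag: %d and %d\n", $native_flag, $upgraded_flag;
```

Both strings compare equal and have length 1; only the internal UTF-8 flag differs.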

p5pRT · 2003-01-03T11:50:48Z

From @hvds

(Andreas J. Koenig) (via RT) <perlbug@perl.org> wrote:
:A more illustrative test to wake you up:
:
:% /usr/local/perl-5.8.0@18125/bin/perl -le '
:$_ = "\x{df}"; utf8::upgrade($_);
:if (/([\w])/){
: warn "a letter";
:}
:if (/([\W])/){
: warn "not a letter";
:}
:'
:a letter at -e line 4.
:not a letter at -e line 7.
:
:I've tested with all perls since patch 9340 and they all misbehave.

I've talked a bit with Jarkko about this, and we can't at this time
come up with any fix other than to document the behaviour.

The core issue is that in the old 'bytes' world, \xdf (and in fact
all of \x80-\xff) are not treated as letters, except when subject
to the vagaries of locale. In the Unicode world, \xdf maps to
a character that is defined under Unicode as being a letter. And
we are still trying to support both definitions.

There doesn't seem to be any consistent way of redefining the cases
that doesn't break some existing tests.

Note that when used outside of a character class, \xdf does not
match /\W/; I don't currently understand why it is different in this
case.

Jarkko's suggested workarounds:
(1) \w and \W: with bytes \xDF is not a letter (except if using locale
and the locale thinks \xDF is a letter), with Unicode \xDF is a letter
(2) \pL and \PL: input can be either byte or Unicode (and \xDF is a letter)
(3) \p{Word} and \P{Word}: ditto

Hugo
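
As a rough illustration of workarounds (2) and (3) above, the following sketch checks \xDF in both internal representations against the \p forms. The %result hash and the labels are only there for illustration; the expectation, per the list above, is that the \p properties treat \xDF as a letter regardless of how the string is stored:

```perl
use strict;
use warnings;

my $native   = "\x{df}";
my $upgraded = "\x{df}";
utf8::upgrade($upgraded);

my %result;
for my $case ([native => $native], [upgraded => $upgraded]) {
    my ($label, $str) = @$case;
    $result{$label} = {
        pL       => ($str =~ /\pL/      ? 1 : 0),   # letter
        PL       => ($str =~ /\PL/      ? 1 : 0),   # not a letter
        Word     => ($str =~ /\p{Word}/ ? 1 : 0),   # word character
        NotWord  => ($str =~ /\P{Word}/ ? 1 : 0),   # not a word character
    };
    printf "%-8s \\pL=%d \\PL=%d \\p{Word}=%d \\P{Word}=%d\n",
        $label, @{ $result{$label} }{qw(pL PL Word NotWord)};
}
```

Each property and its complement should give opposite answers for both strings, which is exactly what [\w] and [\W] fail to do in the original report.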

p5pRT · 2004-01-01T16:54:42Z

From @rgs

I documented this in perlunicode.pod as change #22031.

p5pRT · 2010-10-30T23:16:33Z

From @khwilliamson

I think there are other tickets fixed by this, but I couldn't find any. I
know I wrote one myself a couple of years ago.

regen required

This is part of the Unicode Bug.

The bug is fixed only for /u regexes. Yves' analysis was that it wasn't
fixable otherwise, and this is part of the reason we are adding /u.

This patch requires [perl #78722] to be applied. Both are also
available at git://github.com/khwilliamson/perl.git
branch matching.

Essentially, the patch just uses Unicode semantics if that is called
for. The macros that do this have been applied earlier, but there was a
bug in one of them that didn't surface until this patch.
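
For readers who want to check the /u behaviour described above, here is a small sketch adapted from the test in the original report. It assumes a perl that supports the /u modifier (5.14 or later); the %match hash is purely illustrative. It deliberately avoids "use 5.014", since that would also switch on the unicode_strings feature and blur what /u itself is doing:

```perl
use strict;
use warnings;

# Requires a perl with the /u regex modifier (5.14 or later).
my $native   = "\x{df}";
my $upgraded = "\x{df}";
utf8::upgrade($upgraded);

my %match;
for my $case ([native => $native], [upgraded => $upgraded]) {
    my ($label, $str) = @$case;
    my $w     = $str =~ /[\w]/u ? 1 : 0;   # bracketed \w, Unicode rules
    my $not_w = $str =~ /[\W]/u ? 1 : 0;   # bracketed \W, Unicode rules
    $match{$label} = [$w, $not_w];
    printf "%-8s [\\w]=%d [\\W]=%d %s\n",
        $label, $w, $not_w, ($w xor $not_w) ? "ok" : "not ok";
}
```

Under /u, exactly one of the two classes should match, whichever way the string is stored.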

p5pRT · 2010-10-30T23:16:33Z

From @khwilliamson

0001-mktables-Add-tests-for-wrong-equivalence-attempts.patch

From 8c41aef5491314eef781463bac8abb96cfb46a0e Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sat, 30 Oct 2010 09:43:50 -0600
Subject: [PATCH] mktables: Add tests for wrong equivalence attempts

mktables allows for multiple tables to be made equivalent, which in Unix
terminology means that they are essentially symbolic links.  However
this should happen only when they have the same code points in them to
begin with.  This adds a little more error checking.
---
 lib/unicore/mktables |   21 +++++++++++++++------
 1 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/lib/unicore/mktables b/lib/unicore/mktables
index b7cda64..b13fe0e 100644
--- a/lib/unicore/mktables
+++ b/lib/unicore/mktables
@@ -6560,12 +6560,21 @@ sub trace { return main::trace(@_); }
         my $addr = do { no overloading; pack 'J', $self; };
         my $current_leader = ($related) ? $parent{$addr} : $leader{$addr};
 
-        if ($related &&
-            ! $other->perl_extension
-            && ! $current_leader->perl_extension)
-        {
-            Carp::my_carp_bug("set_equivalent_to should have 'Related => 0 for equivalencing two Unicode properties.  Assuming $self is not related to $other");
-            $related = 0;
+        if ($related) {
+            if ($current_leader->perl_extension) {
+                if ($other->perl_extension) {
+                    Carp::my_carp_bug("Use add_alias() to set two Perl tables '$self' and '$other', equivalent.");
+                    return;
+                }
+            } elsif (! $other->perl_extension) {
+                Carp::my_carp_bug("set_equivalent_to should have 'Related => 0 for equivalencing two Unicode properties.  Assuming $self is not related to $other");
+                $related = 0;
+            }
+        }
+
+        if (! $self->is_empty && ! $self->matches_identically_to($other)) {
+            Carp::my_carp_bug("$self should be empty or match identically to $other.  Not setting equivalent");
+            return;
         }
 
         my $leader = do { no overloading; pack 'J', $current_leader; };
-- 
1.5.6.3

p5pRT · 2010-10-30T23:16:33Z

From @khwilliamson

0002-mktables-Clarify-d-description-for-perluniprops.patch

From 4f04e7d11054b1792ca26651b6fdbf4806e2bc46 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sat, 30 Oct 2010 09:53:06 -0600
Subject: [PATCH] mktables: Clarify \d description for perluniprops

---
 lib/unicore/mktables |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/lib/unicore/mktables b/lib/unicore/mktables
index b13fe0e..c432809 100644
--- a/lib/unicore/mktables
+++ b/lib/unicore/mktables
@@ -11423,7 +11423,7 @@ sub compile_perl() {
         );
 
     my $Digit = $perl->add_match_table('Digit',
-                            Description => '\d, extended beyond just [0-9]');
+                            Description => '[0-9] + all other decimal digits');
     $Digit->set_equivalent_to($gc->table('Decimal_Number'), Related => 1);
     my $PosixDigit = $perl->add_match_table("PosixDigit",
                                             Description => '[0-9]',
-- 
1.5.6.3

p5pRT · 2010-10-30T23:16:33Z

From @khwilliamson

0003-perlrecharclass-Nits.patch

From fbce0c5d241837f0038b187e3f7f73f1ae271cc3 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sat, 30 Oct 2010 10:13:35 -0600
Subject: [PATCH] perlrecharclass: Nits

---
 pod/perlrecharclass.pod |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index 7cb2f78..0b88cc4 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -595,7 +595,7 @@ of all the alphanumerical characters and all punctuation characters.
 All printable characters, which is the set of all the graphical characters
 plus whitespace characters that are not also controls.
 
-=item [5] (punct)
+=item [5]
 
 C<\p{PosixPunct}> and C<[[:punct:]]> in the ASCII range match all the
 non-controls, non-alphanumeric, non-space characters:
@@ -683,7 +683,8 @@ A regular expression is marked for Unicode semantics if it is encoded in
 utf8 (usually as a result of including a literal character whose code
 point is above 255), or if it contains a C<\N{U+...}> or C<\N{I<name>}>
 construct, or (starting in Perl 5.14) if it was compiled in the scope of a
-C<S<use feature "unicode_strings">> pragma.
+C<S<use feature "unicode_strings">> pragma, or has the C<"u"> regular
+expression modifier.
 
 The differences in behavior between locale and non-locale semantics
 can affect any character whose code point is 255 or less.  The
-- 
1.5.6.3

p5pRT · 2010-10-30T23:16:33Z

From @khwilliamson

0004-Add-consistent-synonyms-for-p-PosxFOO.patch

From 59603337717194779be9b5d448857ec991baee11 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sat, 30 Oct 2010 10:13:48 -0600
Subject: [PATCH] Add consistent synonyms for \p{PosxFOO}

This patch adds a set of synonyms \p{XPosixFOO} for the full extended
Unicode version of \p{PosixFOO}, so only one rule need be remembered.
Similarly, \p{XPerlSpace} is added to preserve the rule for the one
similar class that doesn't have Posix in its name.

Prior to this patch there was no exact equivalent to \p{PosixPunct}
extended beyond ASCII.
---
 lib/unicore/mktables    |   38 +++++++++++++++++++++-----
 pod/perlrecharclass.pod |   64 +++++++++++++++++++++++++------------------
 pod/perlreref.pod       |   69 +++++++++++++++++++++-------------------------
 3 files changed, 100 insertions(+), 71 deletions(-)

diff --git a/lib/unicore/mktables b/lib/unicore/mktables
index c432809..8a5c89a 100644
--- a/lib/unicore/mktables
+++ b/lib/unicore/mktables
@@ -11130,7 +11130,8 @@ sub compile_perl() {
     # range, with their names prefaced by 'Posix', to signify that these match
     # what the Posix standard says they should match.  A couple are
     # effectively this, but the name doesn't have 'Posix' in it because there
-    # just isn't any Posix equivalent.
+    # just isn't any Posix equivalent.  'XPosix' are the Posix tables extended
+    # to the full Unicode range, by our guesses as to what is appropriate.
 
     # 'Any' is all code points.  As an error check, instead of just setting it
     # to be that, construct it to be the union of all the major categories
@@ -11195,6 +11196,7 @@ sub compile_perl() {
         $Lower->set_equivalent_to($gc->table('Lowercase_Letter'),
                                                                 Related => 1);
     }
+    $Lower->add_alias('XPosixLower');
     $perl->add_match_table("PosixLower",
                             Description => "[a-z]",
                             Initialize => $Lower & $ASCII,
@@ -11209,6 +11211,7 @@ sub compile_perl() {
         $Upper->set_equivalent_to($gc->table('Uppercase_Letter'),
                                                                 Related => 1);
     }
+    $Upper->add_alias('XPosixUpper');
     $perl->add_match_table("PosixUpper",
                             Description => "[A-Z]",
                             Initialize => $Upper & $ASCII,
@@ -11303,6 +11306,7 @@ sub compile_perl() {
         $Alpha += $gc->table('Nl') if defined $gc->table('Nl');
         $Alpha->add_description('Alphabetic');
     }
+    $Alpha->add_alias('XPosixAlpha');
     $perl->add_match_table("PosixAlpha",
                             Description => "[A-Za-z]",
                             Initialize => $Alpha & $ASCII,
@@ -11312,6 +11316,7 @@ sub compile_perl() {
                         Description => 'Alphabetic and (Decimal) Numeric',
                         Initialize => $Alpha + $gc->table('Decimal_Number'),
                         );
+    $Alnum->add_alias('XPosixAlnum');
     $perl->add_match_table("PosixAlnum",
                             Description => "[A-Za-z0-9]",
                             Initialize => $Alnum & $ASCII,
@@ -11321,14 +11326,16 @@ sub compile_perl() {
                                 Description => '\w, including beyond ASCII',
                                 Initialize => $Alnum + $gc->table('Mark'),
                                 );
+    $Word->add_alias('XPosixWord');
     my $Pc = $gc->table('Connector_Punctuation'); # 'Pc' Not in release 1
     $Word += $Pc if defined $Pc;
 
     # This is a Perl extension, so the name doesn't begin with Posix.
-    $perl->add_match_table('PerlWord',
+    my $PerlWord = $perl->add_match_table('PerlWord',
                     Description => '\w, restricted to ASCII = [A-Za-z0-9_]',
                     Initialize => $Word & $ASCII,
                     );
+    $PerlWord->add_alias('PosixWord');
 
     my $Blank = $perl->add_match_table('Blank',
                                 Description => '\h, Horizontal white space',
@@ -11341,6 +11348,7 @@ sub compile_perl() {
                                             -   0x200B, # ZWSP
                                 );
     $Blank->add_alias('HorizSpace');        # Another name for it.
+    $Blank->add_alias('XPosixBlank');
     $perl->add_match_table("PosixBlank",
                             Description => "\\t and ' '",
                             Initialize => $Blank & $ASCII,
@@ -11362,24 +11370,28 @@ sub compile_perl() {
                 Description => '\s including beyond ASCII plus vertical tab',
                 Initialize => $Blank + $VertSpace,
     );
+    $Space->add_alias('XPosixSpace');
     $perl->add_match_table("PosixSpace",
                             Description => "\\t, \\n, \\cK, \\f, \\r, and ' '.  (\\cK is vertical tab)",
                             Initialize => $Space & $ASCII,
                             );
 
     # Perl's traditional space doesn't include Vertical Tab
-    my $SpacePerl = $perl->add_match_table('SpacePerl',
+    my $XPerlSpace = $perl->add_match_table('XPerlSpace',
                                   Description => '\s, including beyond ASCII',
                                   Initialize => $Space - 0x000B,
                                 );
-    $perl->add_match_table('PerlSpace',
+    $XPerlSpace->add_alias('SpacePerl');    # A pre-existing synonym
+    my $PerlSpace = $perl->add_match_table('PerlSpace',
                             Description => '\s, restricted to ASCII',
-                            Initialize => $SpacePerl & $ASCII,
+                            Initialize => $XPerlSpace & $ASCII,
                             );
 
+
     my $Cntrl = $perl->add_match_table('Cntrl',
                                         Description => 'Control characters');
     $Cntrl->set_equivalent_to($gc->table('Cc'), Related => 1);
+    $Cntrl->add_alias('XPosixCntrl');
     $perl->add_match_table("PosixCntrl",
                             Description => "ASCII control characters: NUL, SOH, STX, ETX, EOT, ENQ, ACK, BEL, BS, HT, LF, VT, FF, CR, SO, SI, DLE, DC1, DC2, DC3, DC4, NAK, SYN, ETB, CAN, EOM, SUB, ESC, FS, GS, RS, US, and DEL",
                             Initialize => $Cntrl & $ASCII,
@@ -11396,6 +11408,7 @@ sub compile_perl() {
                         Description => 'Characters that are graphical',
                         Initialize => ~ ($Space + $controls),
                         );
+    $Graph->add_alias('XPosixGraph');
     $perl->add_match_table("PosixGraph",
                             Description =>
                                 '[-!"#$%&\'()*+,./:;<>?@[\\\]^_`{|}~0-9A-Za-z]',
@@ -11406,6 +11419,7 @@ sub compile_perl() {
                         Description => 'Characters that are graphical plus space characters (but no controls)',
                         Initialize => $Blank + $Graph - $gc->table('Control'),
                         );
+    $print->add_alias('XPosixPrint');
     $perl->add_match_table("PosixPrint",
                             Description =>
                               '[- 0-9A-Za-z!"#$%&\'()*+,./:;<>?@[\\\]^_`{|}~]',
@@ -11416,15 +11430,20 @@ sub compile_perl() {
     $Punct->set_equivalent_to($gc->table('Punctuation'), Related => 1);
 
     # \p{punct} doesn't include the symbols, which posix does
+    my $XPosixPunct = $perl->add_match_table('XPosixPunct',
+                    Description => '\p{Punct} + ASCII-range \p{Symbol}',
+                    Initialize => $gc->table('Punctuation')
+                                + ($ASCII & $gc->table('Symbol')),
+        );
     $perl->add_match_table('PosixPunct',
         Description => '[-!"#$%&\'()*+,./:;<>?@[\\\]^_`{|}~]',
-        Initialize => $ASCII & ($gc->table('Punctuation')
-                                + $gc->table('Symbol')),
+        Initialize => $ASCII & $XPosixPunct,
         );
 
     my $Digit = $perl->add_match_table('Digit',
                             Description => '[0-9] + all other decimal digits');
     $Digit->set_equivalent_to($gc->table('Decimal_Number'), Related => 1);
+    $Digit->add_alias('XPosixDigit');
     my $PosixDigit = $perl->add_match_table("PosixDigit",
                                             Description => '[0-9]',
                                             Initialize => $Digit & $ASCII,
@@ -11432,6 +11451,7 @@ sub compile_perl() {
 
     # Hex_Digit was not present in first release
     my $Xdigit = $perl->add_match_table('XDigit');
+    $Xdigit->add_alias('XPosixXDigit');
     my $Hex = property_ref('Hex_Digit');
     if (defined $Hex && ! $Hex->is_empty) {
         $Xdigit->set_equivalent_to($Hex->table('Y'), Related => 1);
@@ -11443,6 +11463,10 @@ sub compile_perl() {
                               0xFF10..0xFF19, 0xFF21..0xFF26, 0xFF41..0xFF46]);
         $Xdigit->add_description('[0-9A-Fa-f] and corresponding fullwidth versions, like U+FF10: FULLWIDTH DIGIT ZERO');
     }
+    $perl->add_match_table('PosixXDigit',
+                            Initialize => $ASCII & $Xdigit,
+                            Description => '[0-9A-Fa-f]',
+                        );
 
     my $dt = property_ref('Decomposition_Type');
     $dt->add_match_table('Non_Canon', Full_Name => 'Non_Canonical',
diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index 0b88cc4..1a6fd31 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -522,7 +522,8 @@ The other counterpart, in the column labelled "Full-range Unicode", matches any
 appropriate characters in the full Unicode character set.  For example,
 C<\p{Alpha}> will match not just the ASCII alphabetic characters, but any
 character in the entire Unicode character set that is considered to be
-alphabetic.
+alphabetic.  The backslash sequence column is a (short) synonym for
+the Full-range Unicode form.
 
 (Each of the counterparts has various synonyms as well.
 L<perluniprops/Properties accessible through \p{} and \P{}> lists all the
@@ -533,8 +534,8 @@ and any C<\p> property name can be prefixed with "Is" such as C<\p{IsAlpha}>.)
 Both the C<\p> forms are unaffected by any locale that is in effect, or whether
 the string is in UTF-8 format or not, or whether the platform is EBCDIC or not.
 In contrast, the POSIX character classes are affected.  If the source string is
-in UTF-8 format, the POSIX classes (with the exception of C<[[:punct:]]>, see
-Note [5] below) behave like their "Full-range" Unicode counterparts.  If the
+in UTF-8 format, the POSIX classes behave like their "Full-range"
+Unicode counterparts.  If the
 source string is not in UTF-8 format, and no locale is in effect, and the
 platform is not EBCDIC, all the POSIX classes behave like their ASCII-range
 counterparts.  Otherwise, they behave based on the rules of the locale or
@@ -548,25 +549,25 @@ EBCDIC code page is present, they will behave in accordance with those; if
 absent, the classes will match only their ASCII-range counterparts.  If you
 disagree with this proposal, send email to C<perl5-porters@perl.org>.
 
- [[:...:]]      ASCII-range        Full-range  backslash  Note
-                 Unicode            Unicode    sequence
+ [[:...:]]      ASCII-range          Full-range  backslash  Note
+                 Unicode              Unicode     sequence
  -----------------------------------------------------
-   alpha      \p{PosixAlpha}       \p{Alpha}
-   alnum      \p{PosixAlnum}       \p{Alnum}
+   alpha      \p{PosixAlpha}       \p{XPosixAlpha}
+   alnum      \p{PosixAlnum}       \p{XPosixAlnum}
    ascii      \p{ASCII}          
-   blank      \p{PosixBlank}       \p{Blank} =             [1]
-                                   \p{HorizSpace}  \h      [1]
-   cntrl      \p{PosixCntrl}       \p{Cntrl}               [2]
-   digit      \p{PosixDigit}       \p{Digit}       \d
-   graph      \p{PosixGraph}       \p{Graph}               [3]
-   lower      \p{PosixLower}       \p{Lower}
-   print      \p{PosixPrint}       \p{Print}               [4]
-   punct      \p{PosixPunct}       \p{Punct}               [5]
-              \p{PerlSpace}        \p{SpacePerl}   \s      [6]
-   space      \p{PosixSpace}       \p{Space}               [6]
-   upper      \p{PosixUpper}       \p{Upper}
-   word       \p{PerlWord}         \p{Word}        \w
-   xdigit     \p{ASCII_Hex_Digit}  \p{XDigit}
+   blank      \p{PosixBlank}       \p{XPosixBlank}  \h      [1]
+                                   or \p{HorizSpace}        [1]
+   cntrl      \p{PosixCntrl}       \p{XPosixCntrl}          [2]
+   digit      \p{PosixDigit}       \p{XPosixDigit}  \d
+   graph      \p{PosixGraph}       \p{XPosixGraph}          [3]
+   lower      \p{PosixLower}       \p{XPosixLower}
+   print      \p{PosixPrint}       \p{XPosixPrint}          [4]
+   punct      \p{PosixPunct}       \p{XPosixPunct}          [5]
+              \p{PerlSpace}        \p{XPerlSpace}   \s      [6]
+   space      \p{PosixSpace}       \p{XPosixSpace}          [6]
+   upper      \p{PosixUpper}       \p{XPosixUpper}
+   word       \p{PosixWord}        \p{XPosixWord}   \w
+   xdigit     \p{ASCII_Hex_Digit}  \p{XPosixXDigit}
 
 =over 4
 
@@ -602,13 +603,15 @@ non-controls, non-alphanumeric, non-space characters:
 C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in effect,
 it could alter the behavior of C<[[:punct:]]>).
 
-C<\p{Punct}> matches a somewhat different set in the ASCII range, namely
+The similarly named property, C<\p{Punct}>, matches a somewhat different
+set in the ASCII range, namely
 C<[-!"#%&'()*,./:;?@[\\\]_{}]>.  That is, it is missing C<[$+E<lt>=E<gt>^`|~]>.
 This is because Unicode splits what POSIX considers to be punctuation into two
 categories, Punctuation and Symbols.
 
-When the matching string is in UTF-8 format, C<[[:punct:]]> matches what it
-matches in the ASCII range, plus what C<\p{Punct}> matches.  This is different
+C<\p{PosixPunct>, and when the matching string is in UTF-8 format,
+C<[[:punct:]]>, match what they match in the ASCII range, plus what
+C<\p{Punct}> matches.  This is different
 than strictly matching according to C<\p{Punct}>.  Another way to say it is that
 for a UTF-8 string, C<[[:punct:]]> matches all the characters that Unicode
 considers to be punctuation, plus all the ASCII-range characters that Unicode
@@ -621,6 +624,11 @@ matches the vertical tab, C<\cK>.   Same for the two ASCII-only range forms.
 
 =back
 
+There are various other synonyms that can be used for these besides
+C<\p{HorizSpace}> and \C<\p{XPosixBlank}>.  For example
+C<\p{PosixAlpha}> can be written as C<\p{Alpha}>.  All are listed
+in L<perluniprops/Properties accessible through \p{} and \P{}>.
+
 =head4 Negation
 X<character class, negation>
 
@@ -631,10 +639,12 @@ Some examples:
      POSIX         ASCII-range     Full-range  backslash
                     Unicode         Unicode    sequence
  -----------------------------------------------------
- [[:^digit:]]   \P{PosixDigit}     \P{Digit}      \D
- [[:^space:]]   \P{PosixSpace}     \P{Space}
-                \P{PerlSpace}      \P{SpacePerl}  \S
- [[:^word:]]    \P{PerlWord}       \P{Word}       \W
+ [[:^digit:]]   \P{PosixDigit}  \P{XPosixDigit}   \D
+ [[:^space:]]   \P{PosixSpace}  \P{XPosixSpace}
+                \P{PerlSpace}   \P{XPerlSpace}    \S
+ [[:^word:]]    \P{PerlWord}    \P{XPosixWord}    \W
+
+Again, the backslash sequence means Full-range Unicode.
 
 =head4 [= =] and [. .]
 
diff --git a/pod/perlreref.pod b/pod/perlreref.pod
index 6e028ee..4805c9b 100644
--- a/pod/perlreref.pod
+++ b/pod/perlreref.pod
@@ -145,44 +145,39 @@ and L<perlunicode> for details.
 
 POSIX character classes and their Unicode and Perl equivalents:
 
-           ASCII-         Full-
-           range          range   backslash
- POSIX    \p{...}         \p{}    sequence       Description
+            ASCII-         Full-
+   POSIX    range          range    backslash
+ [[:...:]]  \p{...}        \p{...}   sequence    Description
+
  -----------------------------------------------------------------------
- alnum   PosixAlnum       Alnum               Alpha plus Digit
- alpha   PosixAlpha       Alpha               Alphabetic characters
- ascii   ASCII                                Any ASCII character
- blank   PosixBlank       Blank     \h        Horizontal whitespace;
-                                                full-range also written
-                                                as \p{HorizSpace} (GNU
-                                                extension)
- cntrl   PosixCntrl       Cntrl               Control characters
- digit   PosixDigit       Digit     \d        Decimal digits
- graph   PosixGraph       Graph               Alnum plus Punct
- lower   PosixLower       Lower               Lowercase characters
- print   PosixPrint       Print               Graph plus Print, but not
-                                                any Cntrls
- punct   PosixPunct       Punct               These aren't precisely
-                                                equivalent.  See NOTE,
-                                                below.
- space   PosixSpace       Space     [\s\cK]   Whitespace
-         PerlSpace        SpacePerl \s        Perl's whitespace
-                                                definition
- upper   PosixUpper       Upper               Uppercase characters
- word    PerlWord         Word      \w        Alnum plus '_' (Perl
-                                                extension)
- xdigit  ASCII_Hex_Digit  XDigit              Hexadecimal digit,
-                                                ASCII-range is
-                                                [0-9A-Fa-f]
-
-NOTE on C<[[:punct:]]>, C<\p{PosixPunct}> and C<\p{Punct}>:
-In the ASCII range, C<[[:punct:]]> and C<\p{PosixPunct}> match
-C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in
-effect, it could alter the behavior of C<[[:punct:]]>); and C<\p{Punct}>
-matches C<[-!"#%&'()*,./:;?@[\\\]_{}]>.  When matching a UTF-8 string,
-C<[[:punct:]]> matches what it does in the ASCII range, plus what
-C<\p{Punct}> matches.  C<\p{Punct}> matches, anything that isn't a
-control, an alphanumeric, a space, nor a symbol.
+ alnum   PosixAlnum       XPosixAlnum            Alpha plus Digit
+ alpha   PosixAlpha       XPosixAlpha            Alphabetic characters
+ ascii   ASCII                                   Any ASCII character
+ blank   PosixBlank       XPosixBlank   \h       Horizontal whitespace;
+                                                   full-range also
+                                                   written as
+                                                   \p{HorizSpace} (GNU
+                                                   extension)
+ cntrl   PosixCntrl       XPosixCntrl            Control characters
+ digit   PosixDigit       XPosixDigit   \d       Decimal digits
+ graph   PosixGraph       XPosixGraph            Alnum plus Punct
+ lower   PosixLower       XPosixLower            Lowercase characters
+ print   PosixPrint       XPosixPrint            Graph plus Print, but
+                                                   not any Cntrls
+ punct   PosixPunct       XPosixPunct            Punctuation and Symbols
+                                                   in ASCII-range; just
+                                                   punct outside it
+ space   PosixSpace       XPosixSpace   [\s\cK]  Whitespace
+         PerlSpace        XPerlSpace    \s       Perl's whitespace def'n
+ upper   PosixUpper       XPosixUpper            Uppercase characters
+ word    PerlWord         XPosixWord    \w       Alnum + '_' (Perl
+                                                   extension)
+ xdigit  ASCII_Hex_Digit  XPosixDigit            Hexadecimal digit,
+                                                    ASCII-range is
+                                                    [0-9A-Fa-f]
+
+Also, various synonyms like C<\p{Alpha}> for C<\p{XPosixAlpha}>; all listed
+in L<perluniprops/Properties accessible through \p{} and \P{}>
 
 Within a character class:
 
-- 
1.5.6.3
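
The difference between the Posix, XPosix and plain Unicode property names introduced by this patch can be checked with a sketch like the one below. It assumes a perl in which these property names are available (they are listed in perluniprops in later releases); the sample characters and the %seen hash are chosen purely for illustration:

```perl
use strict;
use warnings;

# Sample characters: [key, character, description]
my @samples = (
    [dollar   => '$',        'DOLLAR SIGN (ASCII symbol)'],
    [inverted => "\x{bf}",   'INVERTED QUESTION MARK (Latin-1 punctuation)'],
    [euro     => "\x{20ac}", 'EURO SIGN (non-ASCII symbol)'],
);

my %seen;
for my $sample (@samples) {
    my ($key, $char, $desc) = @$sample;
    $seen{$key} = {
        Punct       => ($char =~ /\p{Punct}/       ? 1 : 0),
        PosixPunct  => ($char =~ /\p{PosixPunct}/  ? 1 : 0),
        XPosixPunct => ($char =~ /\p{XPosixPunct}/ ? 1 : 0),
    };
    printf "%-46s Punct=%d PosixPunct=%d XPosixPunct=%d\n",
        $desc, @{ $seen{$key} }{qw(Punct PosixPunct XPosixPunct)};
}

# The same ASCII-range / full-range split for alphabetics, using U+00DF.
my $posix_alpha  = "\x{df}" =~ /\p{PosixAlpha}/  ? 1 : 0;
my $xposix_alpha = "\x{df}" =~ /\p{XPosixAlpha}/ ? 1 : 0;
printf "U+00DF: PosixAlpha=%d XPosixAlpha=%d\n", $posix_alpha, $xposix_alpha;
```

The dollar sign shows why \p{Punct} is not an exact full-range counterpart of \p{PosixPunct}: Unicode classes it as a Symbol, so only the Posix and XPosix forms match it.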

p5pRT · 2010-10-30T23:16:33Z

From @khwilliamson

0005-DOCs-Clarify-that-w-matches-marks-and-Pc.patch

From 0cb7e216f4302e2d427e2d048dc3414a10c3b65a Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sat, 30 Oct 2010 15:20:24 -0600
Subject: [PATCH] DOCs: Clarify that \w matches marks and \Pc

The previous documentation really didn't specify what \w is.  It matches
the underscore, but also all other connector punctuation, plus any
marks, such as diacritical accents that occur within a word.
---
 lib/unicore/mktables    |    3 ++-
 pod/perlre.pod          |    4 +++-
 pod/perlrebackslash.pod |    7 ++++---
 pod/perlrecharclass.pod |   11 ++++++++---
 pod/perlreref.pod       |    5 +++--
 5 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/lib/unicore/mktables b/lib/unicore/mktables
index 8a5c89a..73dea61 100644
--- a/lib/unicore/mktables
+++ b/lib/unicore/mktables
@@ -11323,7 +11323,8 @@ sub compile_perl() {
                             );
 
     my $Word = $perl->add_match_table('Word',
-                                Description => '\w, including beyond ASCII',
+                                Description => '\w, including beyond ASCII;'
+                                            . ' = \p{Alnum} + \pM + \p{Pc}',
                                 Initialize => $Alnum + $gc->table('Mark'),
                                 );
     $Word->add_alias('XPosixWord');
diff --git a/pod/perlre.pod b/pod/perlre.pod
index d4e6599..acc1ad5 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -256,7 +256,9 @@ X<\g> X<\k> X<\K> X<backreference>
                    character class "..." within the outer bracketed
                    character class.  Example: [[:upper:]] matches any
                    uppercase character.
-  \w        [3]  Match a "word" character (alphanumeric plus "_")
+  \w        [3]  Match a "word" character (alphanumeric plus "_", plus
+                   other connector punctuation chars plus Unicode
+                   marks)
   \W        [3]  Match a non-"word" character
   \s        [3]  Match a whitespace character
   \S        [3]  Match a non-whitespace character
diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index b75c1e4..642acd6 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -359,9 +359,10 @@ the character classes are written as a backslash sequence. We will briefly
 discuss those here; full details of character classes can be found in
 L<perlrecharclass>.
 
-C<\w> is a character class that matches any single I<word> character (letters,
-digits, underscore). C<\d> is a character class that matches any decimal digit,
-while the character class C<\s> matches any whitespace character.
+C<\w> is a character class that matches any single I<word> character
+(letters, digits, Unicode marks, and connector punctuation (like the
+underscore)).  C<\d> is a character class that matches any decimal
+digit, while the character class C<\s> matches any whitespace character.
 New in perl 5.10.0 are the classes C<\h> and C<\v> which match horizontal
 and vertical whitespace characters.
 
diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index 1a6fd31..1b7c6cf 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -113,14 +113,19 @@ Any character that isn't matched by C<\d> will be matched by C<\D>.
 =head3 Word characters
 
 A C<\w> matches a single alphanumeric character (an alphabetic character, or a
-decimal digit) or an underscore (C<_>), not a whole word.  To match a whole
+decimal digit) or a connecting punctuation character, such as an
+underscore ("_").  It does not match a whole word.  To match a whole
 word, use C<\w+>.  This isn't the same thing as matching an English word, but 
-is the same as a string of Perl-identifier characters.  What is considered a
+in the ASCII range is the same as a string of Perl-identifier
+characters.  What is considered a
 word character depends on several factors, detailed below in L</Locale,
 EBCDIC, Unicode and UTF-8>.  If those factors indicate a Unicode
 interpretation, C<\w> matches the characters that are considered word
 characters in the Unicode database. That is, it not only matches ASCII letters,
-but also Thai letters, Greek letters, etc.  If a Unicode interpretation
+but also Thai letters, Greek letters, etc.  This includes connector
+punctuation (like the underscore) which connect two words together, or
+marks, such as a C<COMBINING TILDE>, which are generally used to add
+diacritical marks to letters.   If a Unicode interpretation
 is not indicated, C<\w> matches those characters that are considered
 word characters by the current locale or EBCDIC code page.  Without a
 locale or EBCDIC code page, C<\w> matches the ASCII letters, digits and
diff --git a/pod/perlreref.pod b/pod/perlreref.pod
index 4805c9b..5247a63 100644
--- a/pod/perlreref.pod
+++ b/pod/perlreref.pod
@@ -170,8 +170,9 @@ POSIX character classes and their Unicode and Perl equivalents:
  space   PosixSpace       XPosixSpace   [\s\cK]  Whitespace
          PerlSpace        XPerlSpace    \s       Perl's whitespace def'n
  upper   PosixUpper       XPosixUpper            Uppercase characters
- word    PerlWord         XPosixWord    \w       Alnum + '_' (Perl
-                                                   extension)
+ word    PerlWord         XPosixWord    \w       Alnum + Unicode marks +
+                                                   connectors, like '_'
+                                                   (Perl extension)
  xdigit  ASCII_Hex_Digit  XPosixDigit            Hexadecimal digit,
                                                     ASCII-range is
                                                     [0-9A-Fa-f]
-- 
1.5.6.3
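
A small sketch, not part of the patch above, of what the clarified documentation says: C<\w> matches connector punctuation and marks as well as alphanumerics. The sample characters are my own choices; the two non-ASCII ones are above U+00FF, so the strings get Unicode semantics without any pragma.

```perl
use strict;
use warnings;

# Name / character pairs; the General_Category is noted per entry.
my @samples = (
    [ 'LOW LINE',        "\x{5F}"   ],  # Pc (connector punctuation)
    [ 'UNDERTIE',        "\x{203F}" ],  # Pc
    [ 'COMBINING TILDE', "\x{303}"  ],  # Mn (a mark)
    [ 'HYPHEN-MINUS',    "\x{2D}"   ],  # Pd (dash, not a connector)
);

my %is_word;
for my $sample (@samples) {
    my ($name, $char) = @$sample;
    $is_word{$name} = $char =~ /\A\w\z/ ? 1 : 0;
    printf "%-16s %s\n", $name,
        $is_word{$name} ? 'matches \w' : 'does not match \w';
}
```

The first three should be reported as word characters and the hyphen should not, which is why "alphanumeric plus underscore" undersold the class.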

p5pRT · 2010-10-30T23:16:33Z

From @khwilliamson

0006-Nits-in-re-pods.patch

From 8fcefd7d2ae9dff5f605c1ab1f813e2d6b5266b2 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sat, 30 Oct 2010 15:23:11 -0600
Subject: [PATCH] Nits in re pods

---
 pod/perlrebackslash.pod |    5 +++--
 pod/perlrecharclass.pod |    2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index 642acd6..8c532c1 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -367,8 +367,9 @@ New in perl 5.10.0 are the classes C<\h> and C<\v> which match horizontal
 and vertical whitespace characters.
 
 The uppercase variants (C<\W>, C<\D>, C<\S>, C<\H>, and C<\V>) are
-character classes that match any character that isn't a word character,
-digit, whitespace, horizontal whitespace nor vertical whitespace.
+character classes that match, respectively, any character that isn't a
+word character, digit, whitespace, horizontal whitespace, or vertical
+whitespace.
 
 Mnemonics: I<w>ord, I<d>igit, I<s>pace, I<h>orizontal, I<v>ertical.
 
diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index 1b7c6cf..3329d60 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -234,7 +234,7 @@ UTF-8 format, or the locale or EBCDIC code page that is in effect includes them.
 =back
 
 It is worth noting that C<\d>, C<\w>, etc, match single characters, not
-complete numbers or words. To match a number (that consists of integers),
+complete numbers or words. To match a number (that consists of digits),
 use C<\d+>; to match a word, use C<\w+>.
 
 =head3 \N
-- 
1.5.6.3

p5pRT · 2010-10-30T23:16:33Z

From @khwilliamson

0007-l1_char_class_tab.h-Wrong-for-ALNUMC.patch

From f77c7a1173b1c45effa7d07d5bdc43a3796f8588 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sat, 30 Oct 2010 15:33:34 -0600
Subject: [PATCH] l1_char_class_tab.h: Wrong for ALNUMC

The generated table was wrong in the Latin1 range for characters with
the ALNUMC property
---
 Porting/mk_PL_charclass.pl |    2 +-
 l1_char_class_tab.h        |  130 ++++++++++++++++++++++----------------------
 2 files changed, 66 insertions(+), 66 deletions(-)

diff --git a/Porting/mk_PL_charclass.pl b/Porting/mk_PL_charclass.pl
index a23d611..64599e0 100644
--- a/Porting/mk_PL_charclass.pl
+++ b/Porting/mk_PL_charclass.pl
@@ -90,7 +90,7 @@ for my $ord (0..255) {
             $re = qr/\w/;
         } elsif ($name eq 'ALNUMC') {
             # Like \w, but no underscore
-            $re = qr/[^_\W]/;
+            $re = qr/\p{Alnum}/;
         } elsif ($name eq 'OCTAL') {
             $re = qr/[0-7]/;
         } else {    # The remainder have the same name and values as Unicode
diff --git a/l1_char_class_tab.h b/l1_char_class_tab.h
index f5c8870..f50b8d2 100644
--- a/l1_char_class_tab.h
+++ b/l1_char_class_tab.h
@@ -209,7 +209,7 @@ EXTCONST  U32 PL_charclass[] = {
 /* U+A7 SECTION SIGN */ _CC_GRAPH_L1|_CC_PRINT_L1,
 /* U+A8 DIAERESIS */ _CC_GRAPH_L1|_CC_PRINT_L1,
 /* U+A9 COPYRIGHT SIGN */ _CC_GRAPH_L1|_CC_PRINT_L1,
-/* U+AA FEMININE ORDINAL INDICATOR */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+AA FEMININE ORDINAL INDICATOR */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
 /* U+AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK */ _CC_GRAPH_L1|_CC_PRINT_L1|_CC_PUNCT_L1,
 /* U+AC NOT SIGN */ _CC_GRAPH_L1|_CC_PRINT_L1,
 /* U+AD SOFT HYPHEN */ _CC_GRAPH_L1|_CC_PRINT_L1,
@@ -220,81 +220,81 @@ EXTCONST  U32 PL_charclass[] = {
 /* U+B2 SUPERSCRIPT TWO */ _CC_GRAPH_L1|_CC_PRINT_L1,
 /* U+B3 SUPERSCRIPT THREE */ _CC_GRAPH_L1|_CC_PRINT_L1,
 /* U+B4 ACUTE ACCENT */ _CC_GRAPH_L1|_CC_PRINT_L1,
-/* U+B5 MICRO SIGN */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+B5 MICRO SIGN */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
 /* U+B6 PILCROW SIGN */ _CC_GRAPH_L1|_CC_PRINT_L1,
 /* U+B7 MIDDLE DOT */ _CC_GRAPH_L1|_CC_PRINT_L1|_CC_PUNCT_L1,
 /* U+B8 CEDILLA */ _CC_GRAPH_L1|_CC_PRINT_L1,
 /* U+B9 SUPERSCRIPT ONE */ _CC_GRAPH_L1|_CC_PRINT_L1,
-/* U+BA MASCULINE ORDINAL INDICATOR */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+BA MASCULINE ORDINAL INDICATOR */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
 /* U+BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK */ _CC_GRAPH_L1|_CC_PRINT_L1|_CC_PUNCT_L1,
 /* U+BC VULGAR FRACTION ONE QUARTER */ _CC_GRAPH_L1|_CC_PRINT_L1,
 /* U+BD VULGAR FRACTION ONE HALF */ _CC_GRAPH_L1|_CC_PRINT_L1,
 /* U+BE VULGAR FRACTION THREE QUARTERS */ _CC_GRAPH_L1|_CC_PRINT_L1,
 /* U+BF INVERTED QUESTION MARK */ _CC_GRAPH_L1|_CC_PRINT_L1|_CC_PUNCT_L1,
-/* U+C0 A WITH GRAVE */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+C1 A WITH ACUTE */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+C2 A WITH CIRCUMFLEX */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+C3 A WITH TILDE */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+C4 A WITH DIAERESIS */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+C5 A WITH RING ABOVE */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+C6 AE */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+C7 C WITH CEDILLA */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+C8 E WITH GRAVE */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+C9 E WITH ACUTE */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+CA E WITH CIRCUMFLEX */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+CB E WITH DIAERESIS */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+CC I WITH GRAVE */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+CD I WITH ACUTE */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+CE I WITH CIRCUMFLEX */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+CF I WITH DIAERESIS */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+D0 ETH */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+D1 N WITH TILDE */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+D2 O WITH GRAVE */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+D3 O WITH ACUTE */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+D4 O WITH CIRCUMFLEX */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+D5 O WITH TILDE */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+D6 O WITH DIAERESIS */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+C0 A WITH GRAVE */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+C1 A WITH ACUTE */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+C2 A WITH CIRCUMFLEX */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+C3 A WITH TILDE */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+C4 A WITH DIAERESIS */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+C5 A WITH RING ABOVE */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+C6 AE */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+C7 C WITH CEDILLA */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+C8 E WITH GRAVE */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+C9 E WITH ACUTE */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+CA E WITH CIRCUMFLEX */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+CB E WITH DIAERESIS */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+CC I WITH GRAVE */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+CD I WITH ACUTE */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+CE I WITH CIRCUMFLEX */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+CF I WITH DIAERESIS */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+D0 ETH */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+D1 N WITH TILDE */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+D2 O WITH GRAVE */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+D3 O WITH ACUTE */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+D4 O WITH CIRCUMFLEX */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+D5 O WITH TILDE */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+D6 O WITH DIAERESIS */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
 /* U+D7 MULTIPLICATION SIGN */ _CC_GRAPH_L1|_CC_PRINT_L1,
-/* U+D8 O WITH STROKE */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+D9 U WITH GRAVE */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+DA U WITH ACUTE */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+DB U WITH CIRCUMFLEX */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+DC U WITH DIAERESIS */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+DD Y WITH ACUTE */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+DE THORN */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
-/* U+DF sharp s */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+E0 a with grave */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+E1 a with acute */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+E2 a with circumflex */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+E3 a with tilde */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+E4 a with diaeresis */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+E5 a with ring above */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+E6 ae */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+E7 c with cedilla */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+E8 e with grave */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+E9 e with acute */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+EA e with circumflex */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+EB e with diaeresis */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+EC i with grave */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+ED i with acute */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+EE i with circumflex */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+EF i with diaeresis */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+F0 eth */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+F1 n with tilde */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+F2 o with grave */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+F3 o with acute */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+F4 o with circumflex */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+F5 o with tilde */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+F6 o with diaeresis */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+D8 O WITH STROKE */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+D9 U WITH GRAVE */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+DA U WITH ACUTE */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+DB U WITH CIRCUMFLEX */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+DC U WITH DIAERESIS */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+DD Y WITH ACUTE */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+DE THORN */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_PRINT_L1|_CC_UPPER_L1|_CC_WORDCHAR_L1,
+/* U+DF sharp s */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+E0 a with grave */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+E1 a with acute */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+E2 a with circumflex */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+E3 a with tilde */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+E4 a with diaeresis */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+E5 a with ring above */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+E6 ae */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+E7 c with cedilla */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+E8 e with grave */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+E9 e with acute */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+EA e with circumflex */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+EB e with diaeresis */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+EC i with grave */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+ED i with acute */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+EE i with circumflex */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+EF i with diaeresis */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+F0 eth */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+F1 n with tilde */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+F2 o with grave */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+F3 o with acute */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+F4 o with circumflex */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+F5 o with tilde */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+F6 o with diaeresis */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
 /* U+F7 DIVISION SIGN */ _CC_GRAPH_L1|_CC_PRINT_L1,
-/* U+F8 o with stroke */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+F9 u with grave */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+FA u with acute */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+FB u with circumflex */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+FC u with diaeresis */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+FD y with acute */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+FE thorn */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
-/* U+FF y with diaeresis */ _CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+F8 o with stroke */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+F9 u with grave */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+FA u with acute */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+FB u with circumflex */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+FC u with diaeresis */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+FD y with acute */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+FE thorn */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
+/* U+FF y with diaeresis */ _CC_ALNUMC_L1|_CC_ALPHA_L1|_CC_CHARNAME_CONT|_CC_GRAPH_L1|_CC_IDFIRST_L1|_CC_LOWER_L1|_CC_PRINT_L1|_CC_WORDCHAR_L1,
 };
 
 #else /* ! DOINIT */
-- 
1.5.6.3
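
A hedged sketch, not part of the patch above, of why a negated class such as C<[^_\W]> is a fragile way to compute ALNUMC for Latin-1 characters: its answer depends on which rules C<\W> follows, whereas C<\p{Alnum}> always uses Unicode rules. The explicit C</u> and C</a> modifiers (perl 5.14 and later) are stand-ins I chose to make the two rule sets visible; the generator script itself did not use them, and that it ran under non-Unicode rules is my inference from the old table entries.

```perl
use strict;
use warnings;

# LATIN SMALL LETTER E WITH ACUTE, held as a single byte (not UTF-8 encoded).
my $e_acute = "\xE9";

# \p{} always applies Unicode rules, whatever the string's encoding.
my $by_property = $e_acute =~ /\A\p{Alnum}\z/ ? 1 : 0;

# Under /u, \W excludes U+E9, so the negated class accepts it.
my $negated_u = $e_acute =~ /\A[^_\W]\z/u ? 1 : 0;

# Under /a, \W covers everything outside ASCII, so the class rejects it.
my $negated_a = $e_acute =~ /\A[^_\W]\z/a ? 1 : 0;

printf "\\p{Alnum}: %d  [^_\\W]/u: %d  [^_\\W]/a: %d\n",
    $by_property, $negated_u, $negated_a;
```

The property form gives the same answer in both settings, which matches the direction of the one-line change in Porting/mk_PL_charclass.pl.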

p5pRT · 2010-10-30T23:16:33Z

From @khwilliamson

0008-Add-l1_char_class_tab.h-to-Make-dependencies.patch

From 90bd692cb62ccbdc8915d009857ba6ffdf1acfc4 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sat, 30 Oct 2010 15:35:00 -0600
Subject: [PATCH] Add l1_char_class_tab.h to Make dependencies

---
 Makefile.SH |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/Makefile.SH b/Makefile.SH
index 5c4e9b7..5f08790 100755
--- a/Makefile.SH
+++ b/Makefile.SH
@@ -454,7 +454,7 @@ h1 = EXTERN.h INTERN.h XSUB.h av.h $(CONFIGH) cop.h cv.h dosish.h
 h2 = embed.h form.h gv.h handy.h hv.h keywords.h mg.h op.h opcode.h
 h3 = pad.h patchlevel.h perl.h perlapi.h perly.h pp.h proto.h regcomp.h
 h4 = regexp.h scope.h sv.h unixish.h util.h iperlsys.h thread.h
-h5 = utf8.h warnings.h mydtrace.h op_reg_common.h
+h5 = utf8.h warnings.h mydtrace.h op_reg_common.h l1_char_class_tab.h
 h = $(h1) $(h2) $(h3) $(h4) $(h5)
 
 c1 = av.c scope.c op.c doop.c doio.c dump.c gv.c hv.c mg.c reentr.c mro.c perl.c
-- 
1.5.6.3

p5pRT · 2010-10-30T23:16:33Z

From @khwilliamson

0009--posix-now-works-under-u.patch

From ef661e6d8cafe55bba78d4e48a4a2dae29eeba47 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sat, 30 Oct 2010 16:48:55 -0600
Subject: [PATCH] [:posix:] now works under /u

This patch is part of fixing the Unicode bug.  The /u regex modifier now
applies to posix character classes.  This resolves [perl #18281].

The Todo tests in reg_posixcc.t have all been made not todo.
---
 pod/perldelta.pod   |   20 ++++++++++++++++++++
 pod/perlunicode.pod |    5 +++--
 regcomp.c           |   22 +++++++++++-----------
 t/re/reg_posixcc.t  |   29 ++++++++++-------------------
 4 files changed, 44 insertions(+), 32 deletions(-)

diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 3d00cce..6b4ef0f 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -65,6 +65,18 @@ See L<re/'/flags' mode> for details.
 Statement labels can now occur before any type of statement or declaration,
 such as C<package>.
 
+=head2 C<use feature "unicode_strings"> now applies to more regex matching
+
+Another chunk of the L<perlunicode/The "Unicode Bug"> is fixed in this
+release.  Now, regular expressions compiled within the scope of the
+"unicode_strings" feature (or under the "u" regex modifier, currently
+specifiable only with infix notation C<(?u:...)> or via C<use re '/u'>)
+will match the same whether or not the target string is encoded in utf8,
+with regard to C<[[:posix:]]> character classes.
+
+Work is underway to add the case sensitive matching to the control of
+this feature, but was not complete in time for this dot release.
+
 =head1 Security
 
 XXX Any security-related notices go here.  In particular, any security
@@ -617,6 +629,14 @@ L<[perl #77498]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=77498>.
 C<sprintf> was ignoring locales when called with constant arguments
 L<[perl #78632]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=78632>.
 
+=item *
+
+A non-ASCII character in the Latin-1 range could match both a Posix
+class, such as C<[[:alnum:]]>, and its inverse C<[[:^alnum:]]>.  This is
+now fixed for regular expressions compiled under the C<"u"> modifier.
+See L</C<use feature "unicode_strings"> now applies to more regex matching>.
+L<[perl #18281]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=18281>.
+
 =back
 
 =head1 Known Problems
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 8ff5bb0..dfd6d42 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -1510,8 +1510,9 @@ support seamlessly.  The result wasn't seamless: these characters were
 orphaned.
 
 Work is being done to correct this, but only some of it is complete.
-What has been finished is the matching of C<\b>, C<\s>, C<\w> and their
-complements in regular expressions, and the important part of the case
+What has been finished is the matching of C<\b>, C<\s>, C<\w> and the Posix
+character classes and their complements in regular expressions, and the
+important part of the case
 changing component.  Due to concerns, and some evidence, that older code might
 have come to rely on the existing behavior, the new behavior must be explicitly
 enabled by the feature C<unicode_strings> in the L<feature> pragma, even though
diff --git a/regcomp.c b/regcomp.c
index 0d469c1..0489cc9 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -8471,16 +8471,16 @@ parseit:
 		 * --jhi */
 		switch ((I32)namedclass) {
 		
-		case _C_C_T_(ALNUMC, isALNUMC(value), POSIX_CC_UNI_NAME("Alnum"));
-		case _C_C_T_(ALPHA, isALPHA(value), POSIX_CC_UNI_NAME("Alpha"));
-		case _C_C_T_(BLANK, isBLANK(value), POSIX_CC_UNI_NAME("Blank"));
-		case _C_C_T_(CNTRL, isCNTRL(value), POSIX_CC_UNI_NAME("Cntrl"));
-		case _C_C_T_(GRAPH, isGRAPH(value), POSIX_CC_UNI_NAME("Graph"));
-		case _C_C_T_(LOWER, isLOWER(value), POSIX_CC_UNI_NAME("Lower"));
-		case _C_C_T_(PRINT, isPRINT(value), POSIX_CC_UNI_NAME("Print"));
-		case _C_C_T_(PSXSPC, isPSXSPC(value), POSIX_CC_UNI_NAME("Space"));
-		case _C_C_T_(PUNCT, isPUNCT(value), POSIX_CC_UNI_NAME("Punct"));
-		case _C_C_T_(UPPER, isUPPER(value), POSIX_CC_UNI_NAME("Upper"));
+		case _C_C_T_UNI_8_BIT(ALNUMC, isALNUMC_L1(value), isALNUMC(value), "XPosixAlnum");
+		case _C_C_T_UNI_8_BIT(ALPHA, isALPHA_L1(value), isALPHA(value), "XPosixAlpha");
+		case _C_C_T_UNI_8_BIT(BLANK, isBLANK_L1(value), isBLANK(value), "XPosixBlank");
+		case _C_C_T_UNI_8_BIT(CNTRL, isCNTRL_L1(value), isCNTRL(value), "XPosixCntrl");
+		case _C_C_T_UNI_8_BIT(GRAPH, isGRAPH_L1(value), isGRAPH(value), "XPosixGraph");
+		case _C_C_T_UNI_8_BIT(LOWER, isLOWER_L1(value), isLOWER(value), "XPosixLower");
+		case _C_C_T_UNI_8_BIT(PRINT, isPRINT_L1(value), isPRINT(value), "XPosixPrint");
+		case _C_C_T_UNI_8_BIT(PSXSPC, isPSXSPC_L1(value), isPSXSPC(value), "XPosixSpace");
+		case _C_C_T_UNI_8_BIT(PUNCT, isPUNCT_L1(value), isPUNCT(value), "XPosixPunct");
+		case _C_C_T_UNI_8_BIT(UPPER, isUPPER_L1(value), isUPPER(value), "XPosixUpper");
 #ifdef BROKEN_UNICODE_CHARCLASS_MAPPINGS
                 /* \s, \w match all unicode if utf8. */
                 case _C_C_T_UNI_8_BIT(SPACE, isSPACE_L1(value), isSPACE(value), "SpacePerl");
@@ -8490,7 +8490,7 @@ parseit:
                 case _C_C_T_UNI_8_BIT(SPACE, isSPACE_L1(value), isSPACE(value), "PerlSpace");
                 case _C_C_T_UNI_8_BIT(ALNUM, isWORDCHAR_L1(value), isALNUM(value), "PerlWord");
 #endif		
-		case _C_C_T_(XDIGIT, isXDIGIT(value), "XDigit");
+		case _C_C_T_UNI_8_BIT(XDIGIT, isXDIGIT_L1(value), isXDIGIT(value), "XPosixXDigit");
 		case _C_C_T_NOLOC_(VERTWS, is_VERTWS_latin1(&value), "VertSpace");
 		case _C_C_T_NOLOC_(HORIZWS, is_HORIZWS_latin1(&value), "HorizSpace");
 		case ANYOF_ASCII:
diff --git a/t/re/reg_posixcc.t b/t/re/reg_posixcc.t
index cd3890c..aa7f445 100644
--- a/t/re/reg_posixcc.t
+++ b/t/re/reg_posixcc.t
@@ -41,9 +41,6 @@ my @pats=(
 	    "[:^space:]",
 	    "[:blank:]",
 	    "[:^blank:]" );
-if (1 or $ENV{PERL_TEST_LEGACY_POSIX_CC}) {
-    $::TODO = "Only works under PERL_LEGACY_UNICODE_CHARCLASS_MAPPINGS = 0";
-}
 
 sub rangify {
     my $ary= shift;
@@ -72,6 +69,9 @@ sub rangify {
     return $ret;
 }
 
+# The bug is only fixed for /u
+use feature 'unicode_strings';
+
 my $description = "";
 while (@pats) {
     my ($yes,$no)= splice @pats,0,2;
@@ -81,6 +81,7 @@ while (@pats) {
     my %complements;
     foreach my $b (0..255) {
         my %got;
+        my $display_b = sprintf("\\x%02X", $b);
         for my $type ('unicode','not-unicode') {
             my $str=chr($b).chr($b);
             if ($type eq 'unicode') {
@@ -88,10 +89,8 @@ while (@pats) {
                 chop $str;
             }
             if ($str=~/[$yes][$no]/){
-                TODO: {
-                    unlike($str,qr/[$yes][$no]/,
-                        "chr($b)=~/[$yes][$no]/ should not match under $type");
-                }
+                unlike($str,qr/[$yes][$no]/,
+                    "chr($display_b) X 2 =~/[$yes][$no]/ should not match under $type");
                 push @{$err_by_type{$type}},$b;
             }
             $got{"[$yes]"}{$type} = $str=~/[$yes]/ ? 1 : 0;
@@ -101,20 +100,16 @@ while (@pats) {
         }
         foreach my $which ("[$yes]","[$no]","[^$yes]","[^$no]") {
             if ($got{$which}{'unicode'} != $got{$which}{'not-unicode'}){
-                TODO: {
-                    is($got{$which}{'unicode'},$got{$which}{'not-unicode'},
-                        "chr($b)=~/$which/ should have the same results regardless of internal string encoding");
-                }
+                is($got{$which}{'unicode'},$got{$which}{'not-unicode'},
+                    "chr($display_b) X 2=~ /$which/ should have the same results regardless of internal string encoding");
                 push @{$singles{$which}},$b;
             }
         }
         foreach my $which ($yes,$no) {
             foreach my $strtype ('unicode','not-unicode') {
                 if ($got{"[$which]"}{$strtype} == $got{"[^$which]"}{$strtype}) {
-                    TODO: {
-                        isnt($got{"[$which]"}{$strtype},$got{"[^$which]"}{$strtype},
-                            "chr($b)=~/[$which]/ should not have the same result as chr($b)=~/[^$which]/");
-                    }
+                    isnt($got{"[$which]"}{$strtype},$got{"[^$which]"}{$strtype},
+                        "chr($display_b) X 2 =~ /[$which]/ should not have the same result as chr($display_b)=~/[^$which]/");
                     push @{$complements{$which}{$strtype}},$b;
                 }
             }
@@ -153,8 +148,4 @@ while (@pats) {
         }
     }
 }
-TODO: {
-    is( $description, "", "POSIX and perl charclasses should not depend on string type");
-}
-
 __DATA__
-- 
1.5.6.3

p5pRT · 2010-10-30T23:16:33Z

From @khwilliamson

0010-regcomp.c-No-longer-need-_C_C_T_-and-variant-macro.patch

From 7534f7e8a2c8997b999b241b970745bb23858961 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sat, 30 Oct 2010 16:55:42 -0600
Subject: [PATCH] regcomp.c: No longer need _C_C_T_ and variant macro

Now, all calls have been converted to the more general case; can remove
the old one, and rename the new one to have the same name as the old one
---
 regcomp.c |   59 +++++++++++++++++------------------------------------------
 1 files changed, 17 insertions(+), 42 deletions(-)

diff --git a/regcomp.c b/regcomp.c
index 0489cc9..cbba23d 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -8056,32 +8056,7 @@ S_checkposixcc(pTHX_ RExC_state_t *pRExC_state)
     }
 }
 
-
-#define _C_C_T_(NAME,TEST,WORD)                         \
-ANYOF_##NAME:                                           \
-    if (LOC)                                            \
-	ANYOF_CLASS_SET(ret, ANYOF_##NAME);             \
-    else {                                              \
-	for (value = 0; value < 256; value++)           \
-	    if (TEST)                                   \
-		ANYOF_BITMAP_SET(ret, value);           \
-    }                                                   \
-    yesno = '+';                                        \
-    what = WORD;                                        \
-    break;                                              \
-case ANYOF_N##NAME:                                     \
-    if (LOC)                                            \
-	ANYOF_CLASS_SET(ret, ANYOF_N##NAME);            \
-    else {                                              \
-	for (value = 0; value < 256; value++)           \
-	    if (!TEST)                                  \
-		ANYOF_BITMAP_SET(ret, value);           \
-    }                                                   \
-    yesno = '!';                                        \
-    what = WORD;                                        \
-    break
-
-/* Like above, but no locale test */
+/* No locale test */
 #define _C_C_T_NOLOC_(NAME,TEST,WORD)                   \
 ANYOF_##NAME:                                           \
 	for (value = 0; value < 256; value++)           \
@@ -8102,7 +8077,7 @@ case ANYOF_N##NAME:                                     \
  * there are two tests passed in, to use depending on that. There aren't any
  * cases where the label is different from the name, so no need for that
  * parameter */
-#define _C_C_T_UNI_8_BIT(NAME,TEST_8,TEST_7,WORD)       \
+#define _C_C_T_(NAME,TEST_8,TEST_7,WORD)       \
 ANYOF_##NAME:                                           \
     if (LOC) ANYOF_CLASS_SET(ret, ANYOF_##NAME);        \
     else if (UNI_SEMANTICS) {                           \
@@ -8471,26 +8446,26 @@ parseit:
 		 * --jhi */
 		switch ((I32)namedclass) {
 		
-		case _C_C_T_UNI_8_BIT(ALNUMC, isALNUMC_L1(value), isALNUMC(value), "XPosixAlnum");
-		case _C_C_T_UNI_8_BIT(ALPHA, isALPHA_L1(value), isALPHA(value), "XPosixAlpha");
-		case _C_C_T_UNI_8_BIT(BLANK, isBLANK_L1(value), isBLANK(value), "XPosixBlank");
-		case _C_C_T_UNI_8_BIT(CNTRL, isCNTRL_L1(value), isCNTRL(value), "XPosixCntrl");
-		case _C_C_T_UNI_8_BIT(GRAPH, isGRAPH_L1(value), isGRAPH(value), "XPosixGraph");
-		case _C_C_T_UNI_8_BIT(LOWER, isLOWER_L1(value), isLOWER(value), "XPosixLower");
-		case _C_C_T_UNI_8_BIT(PRINT, isPRINT_L1(value), isPRINT(value), "XPosixPrint");
-		case _C_C_T_UNI_8_BIT(PSXSPC, isPSXSPC_L1(value), isPSXSPC(value), "XPosixSpace");
-		case _C_C_T_UNI_8_BIT(PUNCT, isPUNCT_L1(value), isPUNCT(value), "XPosixPunct");
-		case _C_C_T_UNI_8_BIT(UPPER, isUPPER_L1(value), isUPPER(value), "XPosixUpper");
+		case _C_C_T_(ALNUMC, isALNUMC_L1(value), isALNUMC(value), "XPosixAlnum");
+		case _C_C_T_(ALPHA, isALPHA_L1(value), isALPHA(value), "XPosixAlpha");
+		case _C_C_T_(BLANK, isBLANK_L1(value), isBLANK(value), "XPosixBlank");
+		case _C_C_T_(CNTRL, isCNTRL_L1(value), isCNTRL(value), "XPosixCntrl");
+		case _C_C_T_(GRAPH, isGRAPH_L1(value), isGRAPH(value), "XPosixGraph");
+		case _C_C_T_(LOWER, isLOWER_L1(value), isLOWER(value), "XPosixLower");
+		case _C_C_T_(PRINT, isPRINT_L1(value), isPRINT(value), "XPosixPrint");
+		case _C_C_T_(PSXSPC, isPSXSPC_L1(value), isPSXSPC(value), "XPosixSpace");
+		case _C_C_T_(PUNCT, isPUNCT_L1(value), isPUNCT(value), "XPosixPunct");
+		case _C_C_T_(UPPER, isUPPER_L1(value), isUPPER(value), "XPosixUpper");
 #ifdef BROKEN_UNICODE_CHARCLASS_MAPPINGS
                 /* \s, \w match all unicode if utf8. */
-                case _C_C_T_UNI_8_BIT(SPACE, isSPACE_L1(value), isSPACE(value), "SpacePerl");
-                case _C_C_T_UNI_8_BIT(ALNUM, isWORDCHAR_L1(value), isALNUM(value), "Word");
+                case _C_C_T_(SPACE, isSPACE_L1(value), isSPACE(value), "SpacePerl");
+                case _C_C_T_(ALNUM, isWORDCHAR_L1(value), isALNUM(value), "Word");
 #else
                 /* \s, \w match ascii and locale only */
-                case _C_C_T_UNI_8_BIT(SPACE, isSPACE_L1(value), isSPACE(value), "PerlSpace");
-                case _C_C_T_UNI_8_BIT(ALNUM, isWORDCHAR_L1(value), isALNUM(value), "PerlWord");
+                case _C_C_T_(SPACE, isSPACE_L1(value), isSPACE(value), "PerlSpace");
+                case _C_C_T_(ALNUM, isWORDCHAR_L1(value), isALNUM(value), "PerlWord");
 #endif		
-		case _C_C_T_UNI_8_BIT(XDIGIT, isXDIGIT_L1(value), isXDIGIT(value), "XPosixXDigit");
+		case _C_C_T_(XDIGIT, isXDIGIT_L1(value), isXDIGIT(value), "XPosixXDigit");
 		case _C_C_T_NOLOC_(VERTWS, is_VERTWS_latin1(&value), "VertSpace");
 		case _C_C_T_NOLOC_(HORIZWS, is_HORIZWS_latin1(&value), "HorizSpace");
 		case ANYOF_ASCII:
-- 
1.5.6.3

p5pRT · 2010-10-31T01:11:36Z

From @khwilliamson

karl williamson (via RT) wrote:

> # New Ticket Created by karl williamson
> # Please include the string: [perl #78726]
> # in the subject line of all future correspondence about this issue.
> # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=78726 >
>
> I think there are other tickets fixed by this, but couldn't find any. I
> know I wrote one myself a couple years ago.
>
> regen required
>
> This is part of the Unicode Bug.
>
> The bug is fixed only for /u regexes. Yves' analysis was that it wasn't
> fixable otherwise, and this is part of the reason we are adding /u.
>
> This patch requires [perl #78722] to be applied. Both are also
> available at git://github.com/khwilliamson/perl.git
> branch matching.
>
> Essentially, the patch just uses Unicode semantics if that is called
> for. The macros that do this have been applied earlier, but there was a
> bug in one of them that didn't surface until this patch.

I shouldn't have created a new ticket for this. I've merged #78726 into
#18281.

In one of the commit messages, I say this is fixed for /u regexes only,
and is not fixable for non-/u ones. However, in the work I'm doing to solve
the /iu problem, I've discovered several wrong things in the old code
that, when corrected, will also solve this for non-/u regexes.
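
For readers following the macro discussion, here is a minimal standalone sketch of the idea behind the two-test form of the macro: pick the Latin-1 test when Unicode semantics are in effect, otherwise the ASCII test, and build the complement class from the negation of the same test. It is not the code in regcomp.c. The names `fill_alpha_bitmap`, `bitmap_test`, `is_alpha_ascii` and `is_alpha_latin1` are hypothetical stand-ins, and the locale branch (`LOC`) of the real macro is left out.

```c
#include <string.h>

/* Hypothetical stand-in for the ASCII-only test (isALPHA in the patch).
 * Assumes an ASCII-compatible execution character set. */
static int is_alpha_ascii(unsigned int c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Hypothetical stand-in for the Latin-1 test (isALPHA_L1 in the patch):
 * ASCII letters plus the alphabetic code points in 0x80..0xFF. */
static int is_alpha_latin1(unsigned int c)
{
    if (c < 0x80)
        return is_alpha_ascii(c);
    return c == 0xAA || c == 0xB5 || c == 0xBA
        || (c >= 0xC0 && c <= 0xFF && c != 0xD7 && c != 0xF7);
}

/* Fill a 256-bit bitmap for [[:alpha:]] (negate == 0) or [[:^alpha:]]
 * (negate != 0).  The same test drives both cases, so a code point can
 * never land in both the class and its complement. */
void fill_alpha_bitmap(unsigned char bitmap[32], int uni_semantics, int negate)
{
    unsigned int value;

    memset(bitmap, 0, 32);
    for (value = 0; value < 256; value++) {
        int hit = uni_semantics ? is_alpha_latin1(value)
                                : is_alpha_ascii(value);
        if (negate ? !hit : hit)
            bitmap[value >> 3] |= (unsigned char)(1u << (value & 7));
    }
}

/* Return 1 if the bit for `value` (0..255) is set in the bitmap. */
int bitmap_test(const unsigned char bitmap[32], unsigned int value)
{
    return (bitmap[value >> 3] >> (value & 7)) & 1;
}
```

With `uni_semantics` set, 0xDF (the character from the original report) lands in the positive bitmap only. Without it, 0xDF lands in the complement only. In both cases it is in exactly one of the two, which is the property the reg_posixcc.t checks exercise.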

p5pRT · 2010-10-31T19:44:12Z

From @cpansprout

On Sat Oct 30 16:16:33 2010, public@khwilliamson.com wrote:

> I think there are other tickets fixed by this, but couldn't find any. I
> know I wrote one myself a couple years ago.
>
> regen required
>
> This is part of the Unicode Bug.
>
> The bug is fixed only for /u regexes. Yves' analysis was that it wasn't
> fixable otherwise, and this is part of the reason we are adding /u.
>
> This patch requires [perl #78722] to be applied. Both are also
> available at git://github.com/khwilliamson/perl.git
> branch matching.
>
> Essentially, the patch just uses Unicode semantics if that is called
> for. The macros that do this have been applied earlier, but there was a
> bug in one of them that didn't surface until this patch.

Patches 4-10 have been applied as:
cbc24f9
d35dd6c
e486b3c
aedd44b
9b7c43b
0399b21
7bbf947

I added a comma before ‘respectively’ to patch number 6.

p5pRT · 2011-03-25T19:42:33Z

From @khwilliamson

This is now fixed
--Karl Williamson

p5pRT · 2011-03-25T19:42:33Z

@khwilliamson - Status changed from 'open' to 'resolved'

p5pRT closed this as completed Mar 25, 2011

p5pRT added Severity High distro-Linux type-Unicode type-core labels Oct 18, 2019
