Report information
Id: 18281
Status: resolved
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors: andk <andreas.koenig [at] anima.de>
public [at] khwilliamson.com
Cc:
AdminCc:

Operating System: Linux
PatchStatus: (no value)
Severity: High
Type:
Perl Version: 5.8.0
Fixed In: (no value)



To: perlbug [...] perl.org
Subject: UTF-8 bug: latin1 character both [\w] and [\W]
Cc: andreas.koenig [...] anima.de
Date: Fri, 8 Nov 2002 16:11:15 +0100 (CET)
From: k [...] neo.ub.uni-dortmund.de
This is a bug report for perl from andreas.koenig@anima.de,
generated with the help of perlbug 1.34 running under perl v5.8.0.

-----------------------------------------------------------------
[Please enter your report here]

The terse test would be:

  $_ = "\x{df}"; utf8::upgrade($_);
  print /[\w]/ ^ /[\W]/ ? "ok\n" : "not ok\n";

A more illustrative test to wake you up:

% /usr/local/perl-5.8.0@18125/bin/perl -le '
$_ = "\x{df}"; utf8::upgrade($_);
if (/([\w])/){
  warn "a letter";
}
if (/([\W])/){
  warn "not a letter";
}
'
a letter at -e line 4.
not a letter at -e line 7.

I've tested with all perls since patch 9340 and they all misbehave.

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=high
---
This perlbug was built using Perl v5.8.0 - Mon Sep 9 18:12:37 UTC 2002
It is being executed now by Perl v5.8.0 - Mon Sep 9 18:02:36 UTC 2002.

Site configuration information for perl v5.8.0:

Configured by root at Mon Sep 9 18:02:36 UTC 2002.
Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.4.19, archname=i586-linux-thread-multi
    uname='linux bloembergen 2.4.19 #1 mon apr 15 08:57:26 gmt 2002 i686 unknown '
    config_args='-ds -e -Dprefix=/usr -Dusethreads -Di_db -Di_dbm -Di_ndbm -Di_gdbm -Duseshrplib=true'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O3 --pipe',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing'
    ccversion='', gccversion='3.2', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =''
    libpth=/lib /usr/lib /usr/local/lib
    libs=-lnsl -ldl -lm -lpthread -lc -lcrypt -lutil
    perllibs=-lnsl -ldl -lm -lpthread -lc -lcrypt -lutil
    libc=, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.2.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5/5.8.0/i586-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared'

Locally applied patches:

---
@INC for perl v5.8.0:
    /usr/lib/perl5/5.8.0/i586-linux-thread-multi
    /usr/lib/perl5/5.8.0
    /usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.0
    /usr/lib/perl5/site_perl
    .
---
Environment for perl v5.8.0:
    HOME=/home/k
    LANG=C
    LANGUAGE (unset)
    LC_COLLATE=POSIX
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/k/bin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/kde3/bin:/usr/lib/java/jre/bin:.:/usr/local/perl/bin:/usr/X11/bin:/sbin:/usr/sbin
    PERL_BADLANG (unset)
    SHELL=/usr/bin/zsh
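
For anyone who wants to re-run the report on their own perl, here is a minimal sketch of the same check as a standalone script. The helper name class_match_count is hypothetical (it is not part of the report), and what the script prints for \xDF depends on the perl version used, so no expected output is shown.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many of the two complementary classes
# match the string. 1 is the only consistent answer; 2 is the reported bug.
sub class_match_count {
    my ($str) = @_;
    my $count = 0;
    $count++ if $str =~ /[\w]/;
    $count++ if $str =~ /[\W]/;
    return $count;
}

my $bytes = "\x{df}";           # stored internally as a single byte
my $chars = "\x{df}";
utf8::upgrade($chars);          # same character, UTF-8 internal encoding

for my $case ([ bytes => $bytes ], [ upgraded => $chars ]) {
    my ($label, $str) = @$case;
    my $n = class_match_count($str);
    printf "%-8s matches %d of [\\w] and [\\W]: %s\n",
        $label, $n, ($n == 1 ? "ok" : "not ok");
}
```

A count of 2 for the upgraded string is what the original report describes.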
I have now taken a look at this, and it is surprisingly tricky (or so I think). The problem comes from Perl's dual model: bytes and (Unicode) characters. The [\W] falsely matches because \xDF is *on* in the *byte* side of the \W. My quick, straightforward attempts at fixing this break one of the split.t tests, in which the split pattern includes \x80.
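
The "dual model" mentioned above can be made visible with a short sketch. It assumes an ASCII-based platform; storage_info is a hypothetical helper name. The same one-character string is shown once in its native byte form and once after utf8::upgrade.

```perl
use strict;
use warnings;

# Hypothetical helper: report the internal UTF-8 flag and the number of
# octets used to store the string internally.
sub storage_info {
    my ($str) = @_;
    my $flag   = utf8::is_utf8($str) ? 1 : 0;
    my $octets = do { use bytes; length($str) };
    return ($flag, $octets);
}

my $native   = "\x{df}";
my $upgraded = "\x{df}";
utf8::upgrade($upgraded);

printf "native:   utf8 flag=%d, internal octets=%d\n", storage_info($native);
printf "upgraded: utf8 flag=%d, internal octets=%d\n", storage_info($upgraded);
print  "same characters: ", ($native eq $upgraded ? "yes" : "no"), "\n";
```

The two strings compare equal, yet they are stored differently, and that is the distinction the regex engine was keying on in this ticket.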
To: perl5-porters [...] perl.org
Subject: Re: [perl #18281] UTF-8 bug: latin1 character both [\w] and [\W]
From: hv [...] crypt.org
Date: Fri, 03 Jan 2003 11:51:54 +0000
RT-Send-Cc:
(Andreas J. Koenig) (via RT) <perlbug@perl.org> wrote:
:A more illustrative test to wake you up:
:
:% /usr/local/perl-5.8.0@18125/bin/perl -le '
:$_ = "\x{df}"; utf8::upgrade($_);
:if (/([\w])/){
:  warn "a letter";
:}
:if (/([\W])/){
:  warn "not a letter";
:}
:'
:a letter at -e line 4.
:not a letter at -e line 7.
:
:I've tested with all perls since patch 9340 and they all misbehave.

I've talked a bit with Jarkko about this, and we can't at this time
come up with any fix other than to document the behaviour.

The core issue is that in the old 'bytes' world, \xdf (and in fact
all of \x80-\xff) are not treated as letters, except when subject to
the vagaries of locale. In the Unicode world, \xdf maps to a character
that is defined under Unicode as being a letter. And we are still
trying to support both definitions.

There doesn't seem to be any consistent way of redefining the cases
that doesn't break some existing tests.

Note that when used outside of a character class, \xdf does not match
/\W/; I don't currently understand why it is different in this case.

Jarkko's suggested workarounds:
(1) \w and \W: with bytes \xDF is not a letter (except if using locale
    and the locale thinks \xDF is a letter), with Unicode \xDF is a letter
(2) \pL and \PL: input can be either byte or Unicode (and \xDF is a letter)
(3) \p{Word} and \P{Word}: ditto

Hugo
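
Workarounds (2) and (3) can be sketched as follows. This is only an illustration: match_count is a hypothetical helper, and the sketch assumes a perl on which \p{...} properties are matched with Unicode rules whatever the internal encoding of the target string.

```perl
use strict;
use warnings;

# Hypothetical helper: how many of a property and its complement match.
# Exactly 1 is the consistent answer.
sub match_count {
    my ($str, $yes, $no) = @_;
    my $count = 0;
    $count++ if $str =~ $yes;
    $count++ if $str =~ $no;
    return $count;
}

my $native   = "\x{df}";
my $upgraded = "\x{df}";
utf8::upgrade($upgraded);

my @pairs = (
    [ '\pL',      qr/\pL/,      qr/\PL/      ],
    [ '\p{Word}', qr/\p{Word}/, qr/\P{Word}/ ],
);

for my $pair (@pairs) {
    my ($name, $yes, $no) = @$pair;
    for my $case ([ native => $native ], [ upgraded => $upgraded ]) {
        my ($label, $str) = @$case;
        printf "%-9s %-8s matched=%s, classes matching=%d\n",
            $name, $label, ($str =~ $yes ? "yes" : "no"),
            match_count($str, $yes, $no);
    }
}
```

The point of these workarounds is that the property escapes do not switch meaning with the string's internal representation, unlike plain \w and \W in this report.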
I documented this in perlunicode.pod as change #22031.
Subject: PATCH: [perl#18281]: latin1 char matches both posix class and its complement; Unicode Bug, partial
Date: Sat, 30 Oct 2010 17:15:25 -0600
To: perlbug [...] perl.org, Juerd Waalboer <juerd [...] convolution.nl>
From: karl williamson <public [...] khwilliamson.com>
I think there are other tickets fixed by this, but couldn't find any. I
know I wrote one myself a couple years ago.

regen required

This is part of the Unicode Bug.

The bug is fixed only for /u regexes. Yves' analysis was that it wasn't
fixable otherwise, and this is part of the reason we are adding /u.

This patch requires [perl #78722] to be applied. Both are also
available at git://github.com/khwilliamson/perl.git
branch matching.

Essentially, the patch just uses Unicode semantics if that is called
for. The macros that do this have been applied earlier, but there was a
bug in one of them that didn't surface until this patch.
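
As a rough illustration of what the /u fix means for POSIX classes, the sketch below checks [[:alpha:]] and its complement against \xDF in both internal encodings. It assumes a perl recent enough to accept /u as a trailing modifier (5.14 or later, as far as I know); posix_results is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: match a POSIX class and its complement under /u.
sub posix_results {
    my ($str) = @_;
    return (
        alpha     => ($str =~ /[[:alpha:]]/u  ? 1 : 0),
        not_alpha => ($str =~ /[[:^alpha:]]/u ? 1 : 0),
    );
}

my $native   = "\x{df}";
my $upgraded = "\x{df}";
utf8::upgrade($upgraded);

for my $case ([ native => $native ], [ upgraded => $upgraded ]) {
    my ($label, $str) = @$case;
    my %r = posix_results($str);
    printf "%-8s [[:alpha:]]=%d [[:^alpha:]]=%d\n",
        $label, $r{alpha}, $r{not_alpha};
}
```

Under /u the class and its complement should disagree for every character, and the answer should not depend on whether the string has been upgraded.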
From 8c41aef5491314eef781463bac8abb96cfb46a0e Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Sat, 30 Oct 2010 09:43:50 -0600 Subject: [PATCH] mktables: Add tests for wrong equivalence attempts mktables allows for multiple tables to be made equivalent, which in Unix terminology means that they are essentially symbolic links. However this should happen only when they have the same code points in them to begin with. This adds a little more error checking. --- lib/unicore/mktables | 21 +++++++++++++++------ 1 files changed, 15 insertions(+), 6 deletions(-) diff --git a/lib/unicore/mktables b/lib/unicore/mktables index b7cda64..b13fe0e 100644 --- a/lib/unicore/mktables +++ b/lib/unicore/mktables @@ -6560,12 +6560,21 @@ sub trace { return main::trace(@_); } my $addr = do { no overloading; pack 'J', $self; }; my $current_leader = ($related) ? $parent{$addr} : $leader{$addr}; - if ($related && - ! $other->perl_extension - && ! $current_leader->perl_extension) - { - Carp::my_carp_bug("set_equivalent_to should have 'Related => 0 for equivalencing two Unicode properties. Assuming $self is not related to $other"); - $related = 0; + if ($related) { + if ($current_leader->perl_extension) { + if ($other->perl_extension) { + Carp::my_carp_bug("Use add_alias() to set two Perl tables '$self' and '$other', equivalent."); + return; + } + } elsif (! $other->perl_extension) { + Carp::my_carp_bug("set_equivalent_to should have 'Related => 0 for equivalencing two Unicode properties. Assuming $self is not related to $other"); + $related = 0; + } + } + + if (! $self->is_empty && ! $self->matches_identically_to($other)) { + Carp::my_carp_bug("$self should be empty or match identically to $other. Not setting equivalent"); + return; } my $leader = do { no overloading; pack 'J', $current_leader; }; -- 1.5.6.3
From 4f04e7d11054b1792ca26651b6fdbf4806e2bc46 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sat, 30 Oct 2010 09:53:06 -0600
Subject: [PATCH] mktables: Clarify \d description for perluniprops

---
 lib/unicore/mktables | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/lib/unicore/mktables b/lib/unicore/mktables
index b13fe0e..c432809 100644
--- a/lib/unicore/mktables
+++ b/lib/unicore/mktables
@@ -11423,7 +11423,7 @@ sub compile_perl() {
     );
 
     my $Digit = $perl->add_match_table('Digit',
-                            Description => '\d, extended beyond just [0-9]');
+                            Description => '[0-9] + all other decimal digits');
     $Digit->set_equivalent_to($gc->table('Decimal_Number'), Related => 1);
     my $PosixDigit = $perl->add_match_table("PosixDigit",
                                             Description => '[0-9]',
-- 
1.5.6.3
From fbce0c5d241837f0038b187e3f7f73f1ae271cc3 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Sat, 30 Oct 2010 10:13:35 -0600 Subject: [PATCH] perlrecharclass: Nits --- pod/perlrecharclass.pod | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod index 7cb2f78..0b88cc4 100644 --- a/pod/perlrecharclass.pod +++ b/pod/perlrecharclass.pod @@ -595,7 +595,7 @@ of all the alphanumerical characters and all punctuation characters. All printable characters, which is the set of all the graphical characters plus whitespace characters that are not also controls. -=item [5] (punct) +=item [5] C<\p{PosixPunct}> and C<[[:punct:]]> in the ASCII range match all the non-controls, non-alphanumeric, non-space characters: @@ -683,7 +683,8 @@ A regular expression is marked for Unicode semantics if it is encoded in utf8 (usually as a result of including a literal character whose code point is above 255), or if it contains a C<\N{U+...}> or C<\N{I<name>}> construct, or (starting in Perl 5.14) if it was compiled in the scope of a -C<S<use feature "unicode_strings">> pragma. +C<S<use feature "unicode_strings">> pragma, or has the C<"u"> regular +expression modifier. The differences in behavior between locale and non-locale semantics can affect any character whose code point is 255 or less. The -- 1.5.6.3

Message body is not shown because it is too large.

From 0cb7e216f4302e2d427e2d048dc3414a10c3b65a Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Sat, 30 Oct 2010 15:20:24 -0600 Subject: [PATCH] DOCs: Clarify that \w matches marks and \Pc The previous documentation really didn't specify what \w is. It matches the underscore, but also all other connector punctuation, plus any marks, such as diacritical accents that occur within a word. --- lib/unicore/mktables | 3 ++- pod/perlre.pod | 4 +++- pod/perlrebackslash.pod | 7 ++++--- pod/perlrecharclass.pod | 11 ++++++++--- pod/perlreref.pod | 5 +++-- 5 files changed, 20 insertions(+), 10 deletions(-) diff --git a/lib/unicore/mktables b/lib/unicore/mktables index 8a5c89a..73dea61 100644 --- a/lib/unicore/mktables +++ b/lib/unicore/mktables @@ -11323,7 +11323,8 @@ sub compile_perl() { ); my $Word = $perl->add_match_table('Word', - Description => '\w, including beyond ASCII', + Description => '\w, including beyond ASCII;' + . ' = \p{Alnum} + \pM + \p{Pc}', Initialize => $Alnum + $gc->table('Mark'), ); $Word->add_alias('XPosixWord'); diff --git a/pod/perlre.pod b/pod/perlre.pod index d4e6599..acc1ad5 100644 --- a/pod/perlre.pod +++ b/pod/perlre.pod @@ -256,7 +256,9 @@ X<\g> X<\k> X<\K> X<backreference> character class "..." within the outer bracketed character class. Example: [[:upper:]] matches any uppercase character. - \w [3] Match a "word" character (alphanumeric plus "_") + \w [3] Match a "word" character (alphanumeric plus "_", plus + other connector punctuation chars plus Unicode + marks \W [3] Match a non-"word" character \s [3] Match a whitespace character \S [3] Match a non-whitespace character diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod index b75c1e4..642acd6 100644 --- a/pod/perlrebackslash.pod +++ b/pod/perlrebackslash.pod @@ -359,9 +359,10 @@ the character classes are written as a backslash sequence. We will briefly discuss those here; full details of character classes can be found in L<perlrecharclass>. 
-C<\w> is a character class that matches any single I<word> character (letters, -digits, underscore). C<\d> is a character class that matches any decimal digit, -while the character class C<\s> matches any whitespace character. +C<\w> is a character class that matches any single I<word> character +(letters, digits, Unicode marks, and connector punctuation (like the +underscore)). C<\d> is a character class that matches any decimal +digit, while the character class C<\s> matches any whitespace character. New in perl 5.10.0 are the classes C<\h> and C<\v> which match horizontal and vertical whitespace characters. diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod index 1a6fd31..1b7c6cf 100644 --- a/pod/perlrecharclass.pod +++ b/pod/perlrecharclass.pod @@ -113,14 +113,19 @@ Any character that isn't matched by C<\d> will be matched by C<\D>. =head3 Word characters A C<\w> matches a single alphanumeric character (an alphabetic character, or a -decimal digit) or an underscore (C<_>), not a whole word. To match a whole +decimal digit) or a connecting punctuation character, such as an +underscore ("_"). It does not match a whole word. To match a whole word, use C<\w+>. This isn't the same thing as matching an English word, but -is the same as a string of Perl-identifier characters. What is considered a +in the ASCII range is the same as a string of Perl-identifier +characters. What is considered a word character depends on several factors, detailed below in L</Locale, EBCDIC, Unicode and UTF-8>. If those factors indicate a Unicode interpretation, C<\w> matches the characters that are considered word characters in the Unicode database. That is, it not only matches ASCII letters, -but also Thai letters, Greek letters, etc. If a Unicode interpretation +but also Thai letters, Greek letters, etc. 
This includes connector +punctuation (like the underscore) which connect two words together, or +marks, such as a C<COMBINING TILDE>, which are generally used to add +diacritical marks to letters. If a Unicode interpretation is not indicated, C<\w> matches those characters that are considered word characters by the current locale or EBCDIC code page. Without a locale or EBCDIC code page, C<\w> matches the ASCII letters, digits and diff --git a/pod/perlreref.pod b/pod/perlreref.pod index 4805c9b..5247a63 100644 --- a/pod/perlreref.pod +++ b/pod/perlreref.pod @@ -170,8 +170,9 @@ POSIX character classes and their Unicode and Perl equivalents: space PosixSpace XPosixSpace [\s\cK] Whitespace PerlSpace XPerlSpace \s Perl's whitespace def'n upper PosixUpper XPosixUpper Uppercase characters - word PerlWord XPosixWord \w Alnum + '_' (Perl - extension) + word PerlWord XPosixWord \w Alnum + Unicode marks + + connectors, like '_' + (Perl extension) xdigit ASCII_Hex_Digit XPosixDigit Hexadecimal digit, ASCII-range is [0-9A-Fa-f] -- 1.5.6.3
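
The documentation change above says that \w covers letters, digits, marks, and connector punctuation. A minimal sketch that spot-checks one character from each group follows; is_word_char is a hypothetical helper, and the sample characters are my own choice rather than anything taken from the patch.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the single character is matched by \w.
sub is_word_char {
    my ($char) = @_;
    return $char =~ /\A\w\z/ ? 1 : 0;
}

my @samples = (
    [ 'LATIN SMALL LETTER A (letter)',    "a"        ],
    [ 'DIGIT SEVEN (decimal digit)',      "7"        ],
    [ 'LOW LINE (connector punctuation)', "_"        ],
    [ 'UNDERTIE (connector punctuation)', "\x{203F}" ],
    [ 'COMBINING TILDE (mark)',           "\x{0303}" ],
    [ 'HYPHEN-MINUS (dash punctuation)',  "-"        ],
);

for my $sample (@samples) {
    my ($name, $char) = @$sample;
    printf "%-36s %s\n", $name, (is_word_char($char) ? '\w' : '\W');
}
```

The two non-ASCII samples have code points above 255, so the strings holding them are always UTF-8 encoded internally and Unicode rules apply to them.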
From 8fcefd7d2ae9dff5f605c1ab1f813e2d6b5266b2 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Sat, 30 Oct 2010 15:23:11 -0600 Subject: [PATCH] Nits in re pods --- pod/perlrebackslash.pod | 5 +++-- pod/perlrecharclass.pod | 2 +- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod index 642acd6..8c532c1 100644 --- a/pod/perlrebackslash.pod +++ b/pod/perlrebackslash.pod @@ -367,8 +367,9 @@ New in perl 5.10.0 are the classes C<\h> and C<\v> which match horizontal and vertical whitespace characters. The uppercase variants (C<\W>, C<\D>, C<\S>, C<\H>, and C<\V>) are -character classes that match any character that isn't a word character, -digit, whitespace, horizontal whitespace nor vertical whitespace. +character classes that match respectively, any character that isn't a +word character, digit, whitespace, horizontal whitespace, or vertical +whitespace. Mnemonics: I<w>ord, I<d>igit, I<s>pace, I<h>orizontal, I<v>ertical. diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod index 1b7c6cf..3329d60 100644 --- a/pod/perlrecharclass.pod +++ b/pod/perlrecharclass.pod @@ -234,7 +234,7 @@ UTF-8 format, or the locale or EBCDIC code page that is in effect includes them. =back It is worth noting that C<\d>, C<\w>, etc, match single characters, not -complete numbers or words. To match a number (that consists of integers), +complete numbers or words. To match a number (that consists of digits), use C<\d+>; to match a word, use C<\w+>. =head3 \N -- 1.5.6.3

Message body is not shown because it is too large.

From 90bd692cb62ccbdc8915d009857ba6ffdf1acfc4 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sat, 30 Oct 2010 15:35:00 -0600
Subject: [PATCH] Add l1_char_class_tab.h to Make dependencies

---
 Makefile.SH | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/Makefile.SH b/Makefile.SH
index 5c4e9b7..5f08790 100755
--- a/Makefile.SH
+++ b/Makefile.SH
@@ -454,7 +454,7 @@ h1 = EXTERN.h INTERN.h XSUB.h av.h $(CONFIGH) cop.h cv.h dosish.h
 h2 = embed.h form.h gv.h handy.h hv.h keywords.h mg.h op.h opcode.h
 h3 = pad.h patchlevel.h perl.h perlapi.h perly.h pp.h proto.h regcomp.h
 h4 = regexp.h scope.h sv.h unixish.h util.h iperlsys.h thread.h
-h5 = utf8.h warnings.h mydtrace.h op_reg_common.h
+h5 = utf8.h warnings.h mydtrace.h op_reg_common.h l1_char_class_tab.h
 h = $(h1) $(h2) $(h3) $(h4) $(h5)
 
 c1 = av.c scope.c op.c doop.c doio.c dump.c gv.c hv.c mg.c reentr.c mro.c perl.c
-- 
1.5.6.3
From ef661e6d8cafe55bba78d4e48a4a2dae29eeba47 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Sat, 30 Oct 2010 16:48:55 -0600 Subject: [PATCH] [:posix:] now works under /u This patch is part of fixing the Unicode bug. The /u regex modifier now applies to posix character classes. This resolves [perl #18281]. The Todo tests in reg_posicc.t have all been made not todo. --- pod/perldelta.pod | 20 ++++++++++++++++++++ pod/perlunicode.pod | 5 +++-- regcomp.c | 22 +++++++++++----------- t/re/reg_posixcc.t | 29 ++++++++++------------------- 4 files changed, 44 insertions(+), 32 deletions(-) diff --git a/pod/perldelta.pod b/pod/perldelta.pod index 3d00cce..6b4ef0f 100644 --- a/pod/perldelta.pod +++ b/pod/perldelta.pod @@ -65,6 +65,18 @@ See L<re/'/flags' mode> for details. Statement labels can now occur before any type of statement or declaration, such as C<package>. +=head2 C<use feature "unicode_strings"> now applies to more regex matching + +Another chunk of the L<perlunicode/The "Unicode Bug"> is fixed in this +release. Now, regular expressions compiled within the scope of the +"unicode_strings" feature (or under the "u" regex modifier (specifiable +currently only with infix notation C<(?u:...)> or via C<use re '/u'>) +will match the same whether or not the target string is encoded in utf8, +with regard to C<[[:posix:]]> character classes + +Work is underway to add the case sensitive matching to the control of +this feature, but was not complete in time for this dot release. + =head1 Security XXX Any security-related notices go here. In particular, any security @@ -617,6 +629,14 @@ L<[perl #77498]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=77498>. C<sprintf> was ignoring locales when called with constant arguments L<[perl #78632]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=78632>. +=item * + +A non-ASCII character in the Latin-1 range could match both a Posix +class, such as C<[[:alnum:]]>, and its inverse C<[[:^alnum:]]>. 
This is +now fixed for regular expressions compiled under the C<"u"> modifier. +See L</C<use feature "unicode_strings"> now applies to more regex matching>. +L<[perl #18281]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=18281>. + =back =head1 Known Problems diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod index 8ff5bb0..dfd6d42 100644 --- a/pod/perlunicode.pod +++ b/pod/perlunicode.pod @@ -1510,8 +1510,9 @@ support seamlessly. The result wasn't seamless: these characters were orphaned. Work is being done to correct this, but only some of it is complete. -What has been finished is the matching of C<\b>, C<\s>, C<\w> and their -complements in regular expressions, and the important part of the case +What has been finished is the matching of C<\b>, C<\s>, C<\w> and the Posix +character classes and their complements in regular expressions, and the +important part of the case changing component. Due to concerns, and some evidence, that older code might have come to rely on the existing behavior, the new behavior must be explicitly enabled by the feature C<unicode_strings> in the L<feature> pragma, even though diff --git a/regcomp.c b/regcomp.c index 0d469c1..0489cc9 100644 --- a/regcomp.c +++ b/regcomp.c @@ -8471,16 +8471,16 @@ parseit: * --jhi */ switch ((I32)namedclass) { - case _C_C_T_(ALNUMC, isALNUMC(value), POSIX_CC_UNI_NAME("Alnum")); - case _C_C_T_(ALPHA, isALPHA(value), POSIX_CC_UNI_NAME("Alpha")); - case _C_C_T_(BLANK, isBLANK(value), POSIX_CC_UNI_NAME("Blank")); - case _C_C_T_(CNTRL, isCNTRL(value), POSIX_CC_UNI_NAME("Cntrl")); - case _C_C_T_(GRAPH, isGRAPH(value), POSIX_CC_UNI_NAME("Graph")); - case _C_C_T_(LOWER, isLOWER(value), POSIX_CC_UNI_NAME("Lower")); - case _C_C_T_(PRINT, isPRINT(value), POSIX_CC_UNI_NAME("Print")); - case _C_C_T_(PSXSPC, isPSXSPC(value), POSIX_CC_UNI_NAME("Space")); - case _C_C_T_(PUNCT, isPUNCT(value), POSIX_CC_UNI_NAME("Punct")); - case _C_C_T_(UPPER, isUPPER(value), POSIX_CC_UNI_NAME("Upper")); + case 
_C_C_T_UNI_8_BIT(ALNUMC, isALNUMC_L1(value), isALNUMC(value), "XPosixAlnum"); + case _C_C_T_UNI_8_BIT(ALPHA, isALPHA_L1(value), isALPHA(value), "XPosixAlpha"); + case _C_C_T_UNI_8_BIT(BLANK, isBLANK_L1(value), isBLANK(value), "XPosixBlank"); + case _C_C_T_UNI_8_BIT(CNTRL, isCNTRL_L1(value), isCNTRL(value), "XPosixCntrl"); + case _C_C_T_UNI_8_BIT(GRAPH, isGRAPH_L1(value), isGRAPH(value), "XPosixGraph"); + case _C_C_T_UNI_8_BIT(LOWER, isLOWER_L1(value), isLOWER(value), "XPosixLower"); + case _C_C_T_UNI_8_BIT(PRINT, isPRINT_L1(value), isPRINT(value), "XPosixPrint"); + case _C_C_T_UNI_8_BIT(PSXSPC, isPSXSPC_L1(value), isPSXSPC(value), "XPosixSpace"); + case _C_C_T_UNI_8_BIT(PUNCT, isPUNCT_L1(value), isPUNCT(value), "XPosixPunct"); + case _C_C_T_UNI_8_BIT(UPPER, isUPPER_L1(value), isUPPER(value), "XPosixUpper"); #ifdef BROKEN_UNICODE_CHARCLASS_MAPPINGS /* \s, \w match all unicode if utf8. */ case _C_C_T_UNI_8_BIT(SPACE, isSPACE_L1(value), isSPACE(value), "SpacePerl"); @@ -8490,7 +8490,7 @@ parseit: case _C_C_T_UNI_8_BIT(SPACE, isSPACE_L1(value), isSPACE(value), "PerlSpace"); case _C_C_T_UNI_8_BIT(ALNUM, isWORDCHAR_L1(value), isALNUM(value), "PerlWord"); #endif - case _C_C_T_(XDIGIT, isXDIGIT(value), "XDigit"); + case _C_C_T_UNI_8_BIT(XDIGIT, isXDIGIT_L1(value), isXDIGIT(value), "XPosixXDigit"); case _C_C_T_NOLOC_(VERTWS, is_VERTWS_latin1(&value), "VertSpace"); case _C_C_T_NOLOC_(HORIZWS, is_HORIZWS_latin1(&value), "HorizSpace"); case ANYOF_ASCII: diff --git a/t/re/reg_posixcc.t b/t/re/reg_posixcc.t index cd3890c..aa7f445 100644 --- a/t/re/reg_posixcc.t +++ b/t/re/reg_posixcc.t @@ -41,9 +41,6 @@ my @pats=( "[:^space:]", "[:blank:]", "[:^blank:]" ); -if (1 or $ENV{PERL_TEST_LEGACY_POSIX_CC}) { - $::TODO = "Only works under PERL_LEGACY_UNICODE_CHARCLASS_MAPPINGS = 0"; -} sub rangify { my $ary= shift; @@ -72,6 +69,9 @@ sub rangify { return $ret; } +# The bug is only fixed for /u +use feature 'unicode_strings'; + my $description = ""; while (@pats) { my ($yes,$no)= splice 
@pats,0,2; @@ -81,6 +81,7 @@ while (@pats) { my %complements; foreach my $b (0..255) { my %got; + my $display_b = sprintf("\\x%02X", $b); for my $type ('unicode','not-unicode') { my $str=chr($b).chr($b); if ($type eq 'unicode') { @@ -88,10 +89,8 @@ while (@pats) { chop $str; } if ($str=~/[$yes][$no]/){ - TODO: { - unlike($str,qr/[$yes][$no]/, - "chr($b)=~/[$yes][$no]/ should not match under $type"); - } + unlike($str,qr/[$yes][$no]/, + "chr($display_b) X 2 =~/[$yes][$no]/ should not match under $type"); push @{$err_by_type{$type}},$b; } $got{"[$yes]"}{$type} = $str=~/[$yes]/ ? 1 : 0; @@ -101,20 +100,16 @@ while (@pats) { } foreach my $which ("[$yes]","[$no]","[^$yes]","[^$no]") { if ($got{$which}{'unicode'} != $got{$which}{'not-unicode'}){ - TODO: { - is($got{$which}{'unicode'},$got{$which}{'not-unicode'}, - "chr($b)=~/$which/ should have the same results regardless of internal string encoding"); - } + is($got{$which}{'unicode'},$got{$which}{'not-unicode'}, + "chr($display_b) X 2=~ /$which/ should have the same results regardless of internal string encoding"); push @{$singles{$which}},$b; } } foreach my $which ($yes,$no) { foreach my $strtype ('unicode','not-unicode') { if ($got{"[$which]"}{$strtype} == $got{"[^$which]"}{$strtype}) { - TODO: { - isnt($got{"[$which]"}{$strtype},$got{"[^$which]"}{$strtype}, - "chr($b)=~/[$which]/ should not have the same result as chr($b)=~/[^$which]/"); - } + isnt($got{"[$which]"}{$strtype},$got{"[^$which]"}{$strtype}, + "chr($display_b) X 2 =~ /[$which]/ should not have the same result as chr($display_b)=~/[^$which]/"); push @{$complements{$which}{$strtype}},$b; } } @@ -153,8 +148,4 @@ while (@pats) { } } } -TODO: { - is( $description, "", "POSIX and perl charclasses should not depend on string type"); -} - __DATA__ -- 1.5.6.3
From 7534f7e8a2c8997b999b241b970745bb23858961 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Sat, 30 Oct 2010 16:55:42 -0600 Subject: [PATCH] regcomp.c: No longer need _C_C_T_ and variant macro Now, all calls have been converted to the more general case; can remove the old one, and rename the new one to have the same name as the old one --- regcomp.c | 59 +++++++++++++++++------------------------------------------ 1 files changed, 17 insertions(+), 42 deletions(-) diff --git a/regcomp.c b/regcomp.c index 0489cc9..cbba23d 100644 --- a/regcomp.c +++ b/regcomp.c @@ -8056,32 +8056,7 @@ S_checkposixcc(pTHX_ RExC_state_t *pRExC_state) } } - -#define _C_C_T_(NAME,TEST,WORD) \ -ANYOF_##NAME: \ - if (LOC) \ - ANYOF_CLASS_SET(ret, ANYOF_##NAME); \ - else { \ - for (value = 0; value < 256; value++) \ - if (TEST) \ - ANYOF_BITMAP_SET(ret, value); \ - } \ - yesno = '+'; \ - what = WORD; \ - break; \ -case ANYOF_N##NAME: \ - if (LOC) \ - ANYOF_CLASS_SET(ret, ANYOF_N##NAME); \ - else { \ - for (value = 0; value < 256; value++) \ - if (!TEST) \ - ANYOF_BITMAP_SET(ret, value); \ - } \ - yesno = '!'; \ - what = WORD; \ - break - -/* Like above, but no locale test */ +/* No locale test */ #define _C_C_T_NOLOC_(NAME,TEST,WORD) \ ANYOF_##NAME: \ for (value = 0; value < 256; value++) \ @@ -8102,7 +8077,7 @@ case ANYOF_N##NAME: \ * there are two tests passed in, to use depending on that. 
There aren't any * cases where the label is different from the name, so no need for that * parameter */ -#define _C_C_T_UNI_8_BIT(NAME,TEST_8,TEST_7,WORD) \ +#define _C_C_T_(NAME,TEST_8,TEST_7,WORD) \ ANYOF_##NAME: \ if (LOC) ANYOF_CLASS_SET(ret, ANYOF_##NAME); \ else if (UNI_SEMANTICS) { \ @@ -8471,26 +8446,26 @@ parseit: * --jhi */ switch ((I32)namedclass) { - case _C_C_T_UNI_8_BIT(ALNUMC, isALNUMC_L1(value), isALNUMC(value), "XPosixAlnum"); - case _C_C_T_UNI_8_BIT(ALPHA, isALPHA_L1(value), isALPHA(value), "XPosixAlpha"); - case _C_C_T_UNI_8_BIT(BLANK, isBLANK_L1(value), isBLANK(value), "XPosixBlank"); - case _C_C_T_UNI_8_BIT(CNTRL, isCNTRL_L1(value), isCNTRL(value), "XPosixCntrl"); - case _C_C_T_UNI_8_BIT(GRAPH, isGRAPH_L1(value), isGRAPH(value), "XPosixGraph"); - case _C_C_T_UNI_8_BIT(LOWER, isLOWER_L1(value), isLOWER(value), "XPosixLower"); - case _C_C_T_UNI_8_BIT(PRINT, isPRINT_L1(value), isPRINT(value), "XPosixPrint"); - case _C_C_T_UNI_8_BIT(PSXSPC, isPSXSPC_L1(value), isPSXSPC(value), "XPosixSpace"); - case _C_C_T_UNI_8_BIT(PUNCT, isPUNCT_L1(value), isPUNCT(value), "XPosixPunct"); - case _C_C_T_UNI_8_BIT(UPPER, isUPPER_L1(value), isUPPER(value), "XPosixUpper"); + case _C_C_T_(ALNUMC, isALNUMC_L1(value), isALNUMC(value), "XPosixAlnum"); + case _C_C_T_(ALPHA, isALPHA_L1(value), isALPHA(value), "XPosixAlpha"); + case _C_C_T_(BLANK, isBLANK_L1(value), isBLANK(value), "XPosixBlank"); + case _C_C_T_(CNTRL, isCNTRL_L1(value), isCNTRL(value), "XPosixCntrl"); + case _C_C_T_(GRAPH, isGRAPH_L1(value), isGRAPH(value), "XPosixGraph"); + case _C_C_T_(LOWER, isLOWER_L1(value), isLOWER(value), "XPosixLower"); + case _C_C_T_(PRINT, isPRINT_L1(value), isPRINT(value), "XPosixPrint"); + case _C_C_T_(PSXSPC, isPSXSPC_L1(value), isPSXSPC(value), "XPosixSpace"); + case _C_C_T_(PUNCT, isPUNCT_L1(value), isPUNCT(value), "XPosixPunct"); + case _C_C_T_(UPPER, isUPPER_L1(value), isUPPER(value), "XPosixUpper"); #ifdef BROKEN_UNICODE_CHARCLASS_MAPPINGS /* \s, \w match all unicode if 
utf8. */ - case _C_C_T_UNI_8_BIT(SPACE, isSPACE_L1(value), isSPACE(value), "SpacePerl"); - case _C_C_T_UNI_8_BIT(ALNUM, isWORDCHAR_L1(value), isALNUM(value), "Word"); + case _C_C_T_(SPACE, isSPACE_L1(value), isSPACE(value), "SpacePerl"); + case _C_C_T_(ALNUM, isWORDCHAR_L1(value), isALNUM(value), "Word"); #else /* \s, \w match ascii and locale only */ - case _C_C_T_UNI_8_BIT(SPACE, isSPACE_L1(value), isSPACE(value), "PerlSpace"); - case _C_C_T_UNI_8_BIT(ALNUM, isWORDCHAR_L1(value), isALNUM(value), "PerlWord"); + case _C_C_T_(SPACE, isSPACE_L1(value), isSPACE(value), "PerlSpace"); + case _C_C_T_(ALNUM, isWORDCHAR_L1(value), isALNUM(value), "PerlWord"); #endif - case _C_C_T_UNI_8_BIT(XDIGIT, isXDIGIT_L1(value), isXDIGIT(value), "XPosixXDigit"); + case _C_C_T_(XDIGIT, isXDIGIT_L1(value), isXDIGIT(value), "XPosixXDigit"); case _C_C_T_NOLOC_(VERTWS, is_VERTWS_latin1(&value), "VertSpace"); case _C_C_T_NOLOC_(HORIZWS, is_HORIZWS_latin1(&value), "HorizSpace"); case ANYOF_ASCII: -- 1.5.6.3
Subject: Re: [perl #78726] PATCH: [perl#18281]: latin1 char matches both posix class and its complement; Unicode Bug, partial
Date: Sat, 30 Oct 2010 19:11:00 -0600
To: perl5-porters [...] perl.org
From: karl williamson <public [...] khwilliamson.com>
karl williamson (via RT) wrote:
> # New Ticket Created by karl williamson
> # Please include the string: [perl #78726]
> # in the subject line of all future correspondence about this issue.
> # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=78726 >
>
>
> I think there are other tickets fixed by this, but couldn't find any. I
> know I wrote one myself a couple years ago.
>
> regen required
>
> This is part of the Unicode Bug.
>
> The bug is fixed only for /u regexes. Yves' analysis was that it wasn't
> fixable otherwise, and this is part of the reason we are adding /u.
>
> This patch requires [perl #78722] to be applied. Both are also
> available at git://github.com/khwilliamson/perl.git
> branch matching.
>
> Essentially, the patch just uses Unicode semantics if that is called
> for. The macros that do this have been applied earlier, but there was a
> bug in one of them that didn't surface until this patch.
>
I shouldn't have created a new ticket for this. I've merged #78726 into #18281.

In one of the commit messages, I say this is fixed for /u regexes only, and is not fixable for non-/u ones. However, in the work I'm doing to solve the /iu problem, I've discovered several things wrong in the old code that, when corrected, will also solve this for non-/u regexes.
RT-Send-CC: perl5-porters [...] perl.org, public [...] khwilliamson.com
On Sat Oct 30 16:16:33 2010, public@khwilliamson.com wrote:
> I think there are other tickets fixed by this, but couldn't find any. I
> know I wrote one myself a couple years ago.
>
> regen required
>
> This is part of the Unicode Bug.
>
> The bug is fixed only for /u regexes. Yves' analysis was that it wasn't
> fixable otherwise, and this is part of the reason we are adding /u.
>
> This patch requires [perl #78722] to be applied. Both are also
> available at git://github.com/khwilliamson/perl.git
> branch matching.
>
> Essentially, the patch just uses Unicode semantics if that is called
> for. The macros that do this have been applied earlier, but there was a
> bug in one of them that didn't surface until this patch.
Patches 4-10 have been applied as:

cbc24f92709e23449028ec3036bda16c0af294fb
d35dd6c678badc24d545f8b7b7a3ebdf0fb0b355
e486b3ccda3754fd159530607148c92cbfcbddf8
aedd44b501ab1196eeb3ebe56ef7647debb77eab
9b7c43baf09d4c57d5cd6c9a052ce398d1626a6a
0399b2152e23eb6ce1f09562d53b87be7fe30924
7bbf947b84c2a0700fd31acb7a31342cd0b8f796

I added a comma before ‘respectively’ to patch number 6.
This is now fixed --Karl Williamson

