A number of characters match both a posix class and its complement #9543
From Robin.Barker@npl.co.uk
Created by robin.barker@npl.co.uk
As I read the documentation, the pairs in perlre.pod of
But \p{IsPunct} does not match $ + < = > ^ ` | ~
Various \p{Is...} match characters in the range 128-255
Perl Info
|
From Robin.Barker@npl.co.uk
No progress in resolving this in code, so here is a documentation patch.
Robin |
From Robin.Barker@npl.co.uk
perlre.patch
--- pod/perlre.pod.orig 2008-01-30 20:41:06.000000000 +0000
+++ pod/perlre.pod
@@ -375,8 +375,8 @@
digit IsDigit \d
graph IsGraph
lower IsLower
- print IsPrint
- punct IsPunct
+ print IsPrint (but see 2. below)
+ punct IsPunct (but see 3. below)
space IsSpace
IsSpacePerl \s
upper IsUpper
@@ -385,6 +385,41 @@
For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent.
+However, the equivalence between C<[[:xxxxx:]]> and C<\p{Xxxxx}> is not exact.
+
+=over 4
+
+=item 1.
+
+C<[[:xxxxx:]]> only matches characters in the range 0x00-0x7F.
+
+=item 2.
+
+C<\p{IsPrint}> matches characters 0x09-0x0d but C<[[:print:]]> does not.
+
+=item 3.
+
+C<[[:punct:]]> matches the following but C<\p{IsPunct}> does not,
+because they are classed as symbols in Unicode.
+
+=over 4
+
+=item C<$>
+
+Currency symbol
+
+=item C<+> C<< < >> C<=> C<< > >> C<|> C<~>
+
+Mathematical symbols
+
+=item C<^> C<`>
+
+Modifier symbols (accents)
+
+=back
+
+=back
+
If the C<utf8> pragma is not used but the C<locale> pragma is, the
classes correlate with the usual isalpha(3) interface (except for
"word" and "blank").
|
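The list in item 3 of the patch can be cross-checked outside Perl. What follows is a minimal sketch in Python, not Perl's implementation: it assumes `string.punctuation` as a stand-in for the ASCII `[[:punct:]]` set and the Unicode general category reported by `unicodedata` as a stand-in for `\p{IsPunct}`. The helper names are hypothetical.

```python
import string
import unicodedata


def ascii_punct_not_unicode_punct():
    """Return ASCII punct characters whose Unicode category is not P*."""
    # string.punctuation holds the 32 ASCII punctuation characters;
    # it is used here as a stand-in for [[:punct:]] (an assumption).
    return "".join(
        ch for ch in sorted(string.punctuation)
        if not unicodedata.category(ch).startswith("P")
    )


def group_by_category(chars):
    """Map each general category (Sc, Sm, Sk, ...) to its characters."""
    groups = {}
    for ch in chars:
        groups.setdefault(unicodedata.category(ch), []).append(ch)
    return groups


if __name__ == "__main__":
    odd = ascii_punct_not_unicode_punct()
    print(odd)
    for cat, members in sorted(group_by_category(odd).items()):
        print(cat, " ".join(members))
```

The three categories that come out (Sc, Sm, Sk) correspond to the currency, mathematical and modifier symbol groups named in the patch.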
From david@landgren.net
Robin Barker wrote:
s/classed/classified/ considered? |
The RT System itself - Status changed from 'new' to 'open' |
From @Juerd
Robin Barker wrote on 2008-03-31 21:42 (+0100):
Not always true.
juerd@lanova:~$ perl -CO -e'printf "[%s]\n", "foo\x{123}" =~ /([[:print:]]+)/'
See also Unicode::Semantics on CPAN.
Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig> |
From Robin.Barker@npl.co.uk
I've taken into account comments about the last patch,
Robin |
From Robin.Barker@npl.co.uk
perlre-3.patch
diff -ur ../perl-current/pod/perlre.pod ./pod/perlre.pod
--- ../perl-current/pod/perlre.pod 2008-01-30 20:41:06.000000000 +0000
+++ ./pod/perlre.pod
@@ -375,20 +375,60 @@
digit IsDigit \d
graph IsGraph
lower IsLower
- print IsPrint
- punct IsPunct
+ print IsPrint (but see [2] below)
+ punct IsPunct (but see [3] below)
space IsSpace
IsSpacePerl \s
upper IsUpper
- word IsWord
+ word IsWord \w
xdigit IsXDigit
For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent.
+However, the equivalence between C<[[:xxxxx:]]> and C<\p{IsXxxxx}>
+is not exact.
+
+=over 4
+
+=item [1]
+
If the C<utf8> pragma is not used but the C<locale> pragma is, the
classes correlate with the usual isalpha(3) interface (except for
"word" and "blank").
+But if the C<locale> or C<encoding> pragmas are not used and
+the string is not C<utf8>, then C<[[:xxxxx:]]> (and C<\w>, etc.)
+will not match characters 0x80-0xff; whereas C<\p{IsXxxxx}> will
+force the string to C<utf8> and can match these characters
+(as Unicode).
+
+=item [2]
+
+C<\p{IsPrint}> matches characters 0x09-0x0d but C<[[:print:]]> does not.
+
+=item [3]
+
+C<[[:punct:]]> matches the following but C<\p{IsPunct}> does not,
+because they are classed as symbols (not punctuation) in Unicode.
+
+=over 4
+
+=item C<$>
+
+Currency symbol
+
+=item C<+> C<< < >> C<=> C<< > >> C<|> C<~>
+
+Mathematical symbols
+
+=item C<^> C<`>
+
+Modifier symbols (accents)
+
+=back
+
+=back
+
The other named classes are:
=over 4
diff -ur ../perl-current/t/op/pat.t ./t/op/pat.t
--- ../perl-current/t/op/pat.t 2008-04-15 13:46:40.000000000 +0100
+++ ./t/op/pat.t
@@ -4604,6 +4604,32 @@
iseq($te[0], '../');
}
+SKIP: {
+ if (ordA == 193) { skip("Assumes ASCII", 4) }
+
+ my @notIsPunct = grep {/[[:punct:]]/ and not /\p{IsPunct}/}
+ map {chr} 0x20..0x7f;
+ iseq( join('', @notIsPunct), '$+<=>^`|~',
+ '[:punct:] disagrees with IsPunct on Symbols');
+
+ my @isPrint = grep {not/[[:print:]]/ and /\p{IsPrint}/}
+ map {chr} 0..0x1f, 0x7f..0x9f;
+ iseq( join('', @isPrint), "\x09\x0a\x0b\x0c\x0d\x85",
+ 'IsPrint disagrees with [:print:] on control characters');
+
+ my @isPunct = grep {/[[:punct:]]/ != /\p{IsPunct}/}
+ map {chr} 0x80..0xff;
+ iseq( join('', @isPunct), "\xa1\xab\xb7\xbb\xbf", # ¡ « · » ¿
+ 'IsPunct disagrees with [:punct:] outside ASCII');
+
+ my @isPunctLatin1 = eval q{
+ use encoding 'latin1';
+ grep {/[[:punct:]]/ != /\p{IsPunct}/} map {chr} 0x80..0xff;
+ };
+ if( $@ ){ skip( $@, 1); }
+ iseq( join('', @isPunctLatin1), '',
+ 'IsPunct agrees with [:punct:] with explicit Latin1');
+}
# Test counter is at bottom of file. Put new tests above here.
@@ -4667,7 +4693,7 @@
# Don't forget to update this!
BEGIN {
- $::TestCount = 4031;
+ $::TestCount = 4035;
print "1..$::TestCount\n";
}
|
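The third test in this patch (disagreement outside ASCII) can be approximated the same way. This is a hedged sketch in Python, assuming that a Unicode general category starting with P stands in for `\p{IsPunct}` and that an ASCII-only `[[:punct:]]` matches nothing in 0x80..0xFF. The function name is hypothetical, and the exact list depends on the Unicode version bundled with the interpreter, so it may be longer than the five characters the 2008 test expects.

```python
import unicodedata


def latin1_unicode_punct():
    """Characters in 0x80..0xFF whose Unicode general category is P*."""
    return [
        chr(cp) for cp in range(0x80, 0x100)
        if unicodedata.category(chr(cp)).startswith("P")
    ]


if __name__ == "__main__":
    # Print code point, category and name for each disagreeing character.
    for ch in latin1_unicode_punct():
        print(f"\\x{ord(ch):02x} {unicodedata.category(ch)} {unicodedata.name(ch)}")
```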
From @rgs
2008/4/25 Robin Barker <Robin.Barker@npl.co.uk>:
Thanks, applied as #33752 |
@rgs - Status changed from 'open' to 'resolved' |
From @khwilliamson
This is a bug report for perl from public@khwilliamson.com,
use utf8; print '©' =~ /[[:graph:]]/, "\n";
both print 1. This happens for various posix classes, and various
Flags:
This perlbug was built using Perl 5.11.0 - Wed Oct 22 19:16:44 MDT 2008
Site configuration information for perl 5.11.0:
Configured by khw at Fri Oct 24 11:08:58 MDT 2008.
Summary of my perl5 (revision 5 version 11 subversion 0 patch 34566)
Locally applied patches:
@INC for perl 5.11.0:
Environment for perl 5.11.0:
PATH=/home/khw/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/home/khw/cxoffice/bin |
From @khwilliamson
Tom Christiansen wrote:
I wondered why you were getting errors when my own (which I thought were
I haven't looked at the code, but likely what is going on is that
My goal is to fix all these problems in the 128-255 range. It's turning |
From @tonycoz
On Sun, Oct 26, 2008 at 03:32:53PM -0700, karl williamson wrote:
This problem also occurs for \s vs \S inside a character class:
sh-3.1$
The problem is that the character classes are compiled to a
[\s] becomes ANYOF[\11\12\14\15 +utf8::IsSpacePerl]
and similarly for [[:foo:]]:
[[:graph:]] becomes ANYOF[!-~+utf8::IsGraph]
[[:^graph:]] becomes ANYOF[\0- \177-\377!utf8::IsGraph]
and for UTF scalars both the bitmap and the unicode property is
The bitmap is generated using isSPACE, isGRAPH, etc, which without a
I suspect a solution is going to involve not generating the bitmap for
But I don't understand enough of the regular expression engine to want
Tony |
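The compiled forms quoted above suggest why a character can match both a class and its complement. Below is a simplified model in Python, not the regex engine's actual code: it assumes a 256-entry bitmap built with ASCII-only rules plus a Unicode property that is consulted only for UTF-8 strings. `is_graph_unicode` is a rough, hypothetical stand-in for `utf8::IsGraph`.

```python
import unicodedata


def is_graph_unicode(ch):
    """Rough stand-in for utf8::IsGraph (an assumption, not Perl's table)."""
    return unicodedata.category(ch) not in {"Zs", "Zl", "Zp", "Cc", "Cs", "Cn"}


# Bitmaps cover code points 0..255 and use ASCII-only rules, mirroring
# ANYOF[!-~ ...] and ANYOF[\0- \177-\377 ...] from the comment above.
GRAPH_BITMAP = {cp for cp in range(256) if 0x21 <= cp <= 0x7E}
NOT_GRAPH_BITMAP = set(range(256)) - GRAPH_BITMAP


def match_graph(ch, is_utf8):
    """Model of [[:graph:]]: bitmap first, then the property for UTF-8."""
    if ord(ch) < 256 and ord(ch) in GRAPH_BITMAP:
        return True
    return is_utf8 and is_graph_unicode(ch)


def match_not_graph(ch, is_utf8):
    """Model of [[:^graph:]]: complemented bitmap, then negated property."""
    if ord(ch) < 256 and ord(ch) in NOT_GRAPH_BITMAP:
        return True
    return is_utf8 and not is_graph_unicode(ch)


if __name__ == "__main__":
    c = "\u00a9"  # COPYRIGHT SIGN, the character from the bug report
    print(match_graph(c, is_utf8=True), match_not_graph(c, is_utf8=True))
    print(match_graph(c, is_utf8=False), match_not_graph(c, is_utf8=False))
```

In this model the copyright sign matches the class through the property and the complement through the bitmap when the string is UTF-8, which is the double match reported in the ticket.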
The RT System itself - Status changed from 'new' to 'open' |
From @demerphq
2008/11/3 Tony Cook <tony@develop-help.com>:
I expected the attached patch to work out. Unfortunately it doesn't. I
make test-reonly
Test Summary Report
op/pat (Wstat: 0 Tests: 4035 Failed: 5)
Which seems quite odd, like the swash logic is failing somehow, in
The issue here is that the bitmaps are constructed for non-utf8 and
However I'm still poking.
Yves -- |
From @demerphq
not_quite_working.patch
diff --git a/regexec.c b/regexec.c
index 16b1495..0d1c608 100644
--- a/regexec.c
+++ b/regexec.c
@@ -5751,8 +5751,11 @@ S_reginclass(pTHX_ const regexp *prog, register const regnode *n, register const
if (lenp)
*lenp = 0;
if (do_utf8 && !ANYOF_RUNTIME(n)) {
- if (len != (STRLEN)-1 && c < 256 && ANYOF_BITMAP_TEST(n, c))
+ /* XXX: the limit here can't be 256, as codepoints 128-255 have different semantics in utf8 and non-utf8 strings,
+ and the bitmaps are built with non-utf8 in mind. See the related comment below. */
+ if (c < 128 && len != (STRLEN)-1 && ANYOF_BITMAP_TEST(n, c))
match = TRUE;
+
}
if (!match && do_utf8 && (flags & ANYOF_UNICODE_ALL) && c >= 256)
match = TRUE;
@@ -5792,22 +5795,30 @@ S_reginclass(pTHX_ const regexp *prog, register const regnode *n, register const
if (match && lenp && *lenp == 0)
*lenp = UNISKIP(NATIVE_TO_UNI(c));
}
- if (!match && c < 256) {
- if (ANYOF_BITMAP_TEST(n, c))
- match = TRUE;
- else if (flags & ANYOF_FOLD) {
- U8 f;
+ if (!match && c < 256 ) {
+ if (!do_utf8 || c < 128) {
+ /* XXX:
+ Codepoints 128-255 have different semantics in utf8 and non-utf8 strings, and the bitmaps
+ are built with non-utf8 in mind.
+ BUT, it would be nice if this conditional were simpler, as it's in a "hot" codepath.
+ dmq. See the related comment above.
+ */
+
+ if (ANYOF_BITMAP_TEST(n, c))
+ match = TRUE;
+ else if (flags & ANYOF_FOLD) {
+ U8 f;
- if (flags & ANYOF_LOCALE) {
- PL_reg_flags |= RF_tainted;
- f = PL_fold_locale[c];
+ if (flags & ANYOF_LOCALE) {
+ PL_reg_flags |= RF_tainted;
+ f = PL_fold_locale[c];
+ }
+ else
+ f = PL_fold[c];
+ if (f != c && ANYOF_BITMAP_TEST(n, f))
+ match = TRUE;
}
- else
- f = PL_fold[c];
- if (f != c && ANYOF_BITMAP_TEST(n, f))
- match = TRUE;
}
-
if (!match && (flags & ANYOF_CLASS)) {
PL_reg_flags |= RF_tainted;
if (
diff --git a/t/op/regexp.t b/t/op/regexp.t
index 147e4cc..f64dbb7 100755
--- a/t/op/regexp.t
+++ b/t/op/regexp.t
@@ -191,7 +191,7 @@ EOFCODE
else { # better diagnostics
my $s = Data::Dumper->new([$subject],['subject'])->Useqq(1)->Dump;
my $g = Data::Dumper->new([$got],['got'])->Useqq(1)->Dump;
- print "not ok $test ($study) $input => `$got', match=$match\n$s\n$g\n$code\n";
+ print "not ok $test ($study) $input => '$got', match=$match\n$s\n$g\n$code\n";
}
next TEST;
}
|
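The intent of the guard in the patch above (trust the bitmap for UTF-8 strings only below 128) can be sketched in Python. This models the idea only. The real patch is labelled "not quite working" and failed five tests in op/pat, so nothing here describes the final fix. All names are hypothetical, and `is_graph_unicode` is a rough stand-in for `utf8::IsGraph`.

```python
import unicodedata


def is_graph_unicode(ch):
    """Rough stand-in for utf8::IsGraph (an assumption, not Perl's table)."""
    return unicodedata.category(ch) not in {"Zs", "Zl", "Zp", "Cc", "Cs", "Cn"}


def is_not_graph_unicode(ch):
    return not is_graph_unicode(ch)


GRAPH_BITMAP = set(range(0x21, 0x7F))            # '!' .. '~'
NOT_GRAPH_BITMAP = set(range(256)) - GRAPH_BITMAP


def in_class(ch, is_utf8, bitmap, prop):
    """Guarded lookup: for UTF-8 strings the bitmap is used only below 128."""
    cp = ord(ch)
    if cp < 256 and (not is_utf8 or cp < 128):
        if cp in bitmap:
            return True
    # The Unicode property is consulted only for UTF-8 strings.
    return is_utf8 and prop(ch)


def both_match(is_utf8):
    """Code points 0..255 accepted by both the class and its complement."""
    return [
        cp for cp in range(256)
        if in_class(chr(cp), is_utf8, GRAPH_BITMAP, is_graph_unicode)
        and in_class(chr(cp), is_utf8, NOT_GRAPH_BITMAP, is_not_graph_unicode)
    ]


if __name__ == "__main__":
    print(both_match(True), both_match(False))
```

With the guard in place, the class and its complement are disjoint in this model for both UTF-8 and non-UTF-8 input. The cost, as the patch comment notes, is an extra condition in a hot code path.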
From @demerphq
2008/10/30 karl williamson <public@khwilliamson.com>:
Just for the record in the bug ticket, I used the attached file to
/[\w][\W]/
/[\s][\S]/
/[[:alnum:]][[:^alnum:]]/
/[[:alpha:]][[:^alpha:]]/
/[[:cntrl:]][[:^cntrl:]]/
/[[:graph:]][[:^graph:]]/
/[[:lower:]][[:^lower:]]/
/[[:print:]][[:^print:]]/
/[[:punct:]][[:^punct:]]/
/[[:upper:]][[:^upper:]]/
/[[:space:]][[:^space:]]/
/[[:blank:]][[:^blank:]]/
That's a lot of characters. Sadly.
I'm trying to figure out a workaround that doesn't make non-unicode
There are some "out of band" options tho. We could implement the
The bottom line is that regex metapatterns whose semantics change
I think fixing this properly requires the pumpking / Larry to make a
Yves -- |
From @khwilliamson
demerphq wrote:
I don't know if this casts any light on the issue or not, but the |
From @khwilliamson
$v = "\t"; |
From @demerphq
2008/11/7 karl williamson <public@khwilliamson.com>:
This is essentially the same bug as we have been discussing. Basically
POSIX specifies that a horizontal tab "\t" is a member of
http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
However, it would appear that mktables has different ideas:
;
So it would appear that at least some of our problems are of our own making.
I'm not really sure what to do about this, however I will say that I
\w should be [A-Za-z_]
My feeling is that there are perfectly acceptable ways to get a
Yves -- |
From @rgs
2008/11/7 demerphq <demerphq@gmail.com>:
perltodo says:
=head2 UTF-8 revamp
The handling of Unicode is unclean in many places. For example, the regexp
=cut
So, no weird bifurcated behaviour.
regardless of whether "use locale" is in effect ? (\d is less problematic, since locale don't affect it.)
As a spoiled C programmer, I expect POSIX charclasses to behave as in C
Indeed, if \w starts matching Unicode word chars, and if we don't add a |
From @demerphq
2008/11/7 Rafael Garcia-Suarez <rgarciasuarez@gmail.com>:
Ok. Cool. So eventually we decide on a single behaviour. Good good.
No, use locale would do the same things it has always done. I have no
Yes, but I'm not proposing changing this.
Right, and actually would close up some security holes.
Agreed there should be no deviation from what the POSIX standard
We already have a way to make a word char match the unicode
The whole problem (as far as the regex engine is concerned) is that we
So \w has traditionally meant [A-Za-z_] but we went and mapped it to
So if we were to do the minimum required to fix our character class
Of course this would be a backwards incompatible change. IMO one for
However assuming that we can make the mapping controllable by a pragma
use re 'ascii_charclass'; #default for 5.12
at the same time in 5.12 (or even maybe 5.10) we could introduce new
Hypothetically we could even support something like:
use re 'charclass_overload' '\s' => 'IsPerlSpace','\w' =>
Currently, and especially because of the complementary bugs we are
Yves -- |
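The idea of a switch that pins `\w` to an ASCII definition has a rough analogue in another regex engine. As an illustration only, and not a description of any Perl pragma, Python's `re` module offers the `re.ASCII` flag, under which `\w` matches only `[a-zA-Z0-9_]`. The helper name below is hypothetical.

```python
import re


def word_chars(flags=0):
    """Code points below 0x180 that a single \\w matches under the given flags."""
    pattern = re.compile(r"\w", flags)
    return {chr(cp) for cp in range(0x180) if pattern.fullmatch(chr(cp))}


if __name__ == "__main__":
    ascii_only = word_chars(re.ASCII)
    unicode_aware = word_chars()
    # ASCII mode: 26 + 26 letters, 10 digits and the underscore.
    print(len(ascii_only))
    # A few characters matched only under Unicode rules.
    print(sorted(unicode_aware - ascii_only)[:5])
```

The point of the comparison is that one pattern can carry two stable meanings when the choice is made explicitly by the programmer, instead of depending on the internal encoding of the string.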
From @demerphq
I've merged in RT #49302 as it involves basically the same thing. |
From @demerphq
2008/11/7 demerphq <demerphq@gmail.com>:
I just applied the following:
34769 on 2008/11/07 by demerphq@demerphq-fresh
create new unicode props as defined in POSIX spec (optionally use
This is the first step I guess in fixing this problem. The next step
-- |
From @druud62
demerphq wrote:
ITYM: [0-9A-Za-z_]
-- "Gewoon is een tijger." |
From ambs@zbr.pt
Dr.Ruud wrote:
for non US/UK folks: [0-9[:alpha:]_]
-- |
From @demerphq
2008/11/8 Dr.Ruud <rvtol+news@isolution.nl>:
Yes, that is what I meant. *blush* :-) Yves -- |
From @khwilliamson
demerphq wrote:
It would help me to know some of these holes, so I know what to watch
|
From @druud62
Alberto Simões wrote:
That assumes that [[:alpha:]] is equivalent to [A-Za-z]. Thus that both
I don't think so, already because the POSIX [:alpha:] and [:lower:] etc.
-- "Gewoon is een tijger." |
From krahnj@telus.net
Alberto Simões wrote:
ITYM: [[:digit:][:alpha:]_] Or: [[:alnum:]_] John |
From @demerphq
2008/11/9 Dr.Ruud <rvtol+news@isolution.nl>:
In the POSIX locale yes they are.
Under use locale they are yes, otherwise they sort of assume POSIX
Yves -- |
From @druud62
demerphq wrote:
Of course, but in what I wrote there was no context such as "the POSIX
Alberto, who went from \w should be [0-9A-Za-z_] to [0-9[:alpha:]_] was
I think missing the point that \w should be [0-9A-Za-z_] (so
The "[:alpha:]" and "[:lower:]" (and such) are "POSIX character
-- "Gewoon is een tijger." |
From @druud62
demerphq wrote:
I don't know exactly which of my above statements you refer to, but from
Allowing my general statements to be combined with the POSIX locale
the RE character sets [A-Za-z] and [a-zA-Z] contain exactly 52
-- "Gewoon is een tijger." |
From @demerphq
2008/11/9 Dr.Ruud <rvtol+news@isolution.nl>:
See the bottomish part of
http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
Where it says:
-------8<--------8<---------8<--------
In the POSIX locale, the 26 uppercase letters shall be included:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
In a locale definition file, no character specified for the
lower
In the POSIX locale, the 26 lowercase letters shall be included:
a b c d e f g h i j k l m n o p q r s t u v w x y z
In a locale definition file, no character specified for the
alpha
In the POSIX locale, all characters in the classes upper and lower
In a locale definition file, no character specified for the
digit
In the POSIX locale, only:
0 1 2 3 4 5 6 7 8 9
shall be included.
In a locale definition file, only the digits <zero>, <one>, <two>,
alnum
space
In the POSIX locale, at a minimum, the <space>, <form-feed>,
In a locale definition file, no character specified for the
cntrl
In the POSIX locale, no characters in classes alpha or print shall
In a locale definition file, no character specified for the
punct
In the POSIX locale, neither the <space> nor any characters in
In a locale definition file, no character specified for the
graph
In the POSIX locale, all characters in classes alpha, digit, and
In a locale definition file, characters specified for the keywords
print
In the POSIX locale, all characters in class graph shall be
In a locale definition file, characters specified for the keywords
xdigit
In the POSIX locale, only:
0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
shall be included.
In a locale definition file, only the characters defined for the
blank
In the POSIX locale, only the <space> and <tab> shall be included.
In a locale definition file, the <space> and <tab> are
One important point to keep in mind, which for me is a real issue for
The last part is wrong. Under the POSIX locale (which is a local
Yves -- |
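The POSIX-locale definitions quoted above can be written down as plain sets, which makes the relations between classes easy to check. This is a sketch in Python under stated assumptions: a 7-bit ASCII charmap, `cntrl` taken as 0x00-0x1F plus 0x7F, and `punct` taken from `string.punctuation`. The `POSIX` table and `classes_of` helper are hypothetical names, not part of any library.

```python
import string

# Character classes of the POSIX locale, restricted to 7-bit ASCII
# (the cntrl and punct memberships are assumptions noted in the lead-in).
POSIX = {
    "upper": set(string.ascii_uppercase),
    "lower": set(string.ascii_lowercase),
    "digit": set(string.digits),
    "xdigit": set(string.hexdigits),
    "blank": {" ", "\t"},
    "space": set(" \t\n\v\f\r"),
    "cntrl": {chr(cp) for cp in range(0x20)} | {"\x7f"},
    "punct": set(string.punctuation),
}
POSIX["alpha"] = POSIX["upper"] | POSIX["lower"]
POSIX["alnum"] = POSIX["alpha"] | POSIX["digit"]
POSIX["graph"] = POSIX["alnum"] | POSIX["punct"]
POSIX["print"] = POSIX["graph"] | {" "}


def classes_of(ch):
    """Names of all classes in the table that contain the character."""
    return sorted(name for name, members in POSIX.items() if ch in members)


if __name__ == "__main__":
    for ch in ("\t", " ", "$", "a"):
        print(repr(ch), classes_of(ch))
```

Under these definitions a horizontal tab belongs to `space` and `blank` (and `cntrl`) but not to `print`, which is the property the thread contrasts with the Unicode tables.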
From @demerphq
resolved in perl 5.11 |
@demerphq - Status changed from 'open' to 'resolved' |
@khwilliamson - Status changed from 'resolved' to 'open' |
From @khwilliamson
There are a number of problems with the [[:posix:]] character classes.
Here are the problems:
1) They do not match the Posix standard. In our attempt to DWIM, we
2) They suffer from "The Unicode Bug", in which the utf8ness of the
3) A number of characters in utf8 match both a class and the complement
Note that some of these are ASCII. The root cause of these is mostly
4) Extending the posix definitions was not done consistently. This is
It is less clear about other extensions. Should [[:cntrl:]] include
Before, it seemed like the obvious solution to all this was to just go
If we were to just reinstate those #ifdefs, it would fix all the above
I have done some investigation, and it appears that I can easily solve
If we want to restrict the posix classes to strict posix definitions, I
I think, for consistency, especially if we don't add the strict posix
Comments? |
From juerd@tnx.nl
karl williamson wrote on 2010-08-14 11:09 (-0600):
This has been the case for many years and I think it's a good idea to Perhaps the bug is that we're still calling them POSIX character
This is bad. I'm strongly convinced that all text operations should be
This is not necessarily a problem, depending on the reasons for the dual
This one's really tough. I'd be in favor of fixing consistency. That'll
I was not particularly happy with this specific change. Going back
It could be argued that this should belong in "use POSIX", maybe implied
Agreed. Juerd Waalboer <juerd@tnx.nl> |
From @khwilliamson
Juerd Waalboer wrote:
That will be the case eventually if you 'use 5.12.0' or greater
But the definition of complement is the set is all characters that
It turns out that punct is the only one that has this problem.
OK, if we do this, it makes sense to use the existing pragma.
Having investigated further, I've implemented things so that the bugs go
I also intend to separately change the extended definition of |
From @demerphq
On 14 August 2010 19:09, karl williamson <public@khwilliamson.com> wrote:
POSIX is a standard. It is NOT up to us to redefine that standard. Had
Anyway, that's my view.
cheers -- |
From @Abigail
On Fri, Aug 20, 2010 at 04:41:48PM +0200, demerphq wrote:
It's a bit late, but I agree with Yves. POSIX is a standard. It defines
Abigail |
From @khwilliamson
This has finally been fixed in blead |
@khwilliamson - Status changed from 'open' to 'resolved' |
Migrated from rt.perl.org#60156 (status was 'resolved')
Searchable as RT60156$