Skip Menu |
Report information
Id: 78464
Status: resolved
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors: public [at] khwilliamson.com
Cc:
AdminCc:

Operating System: (no value)
PatchStatus: HasPatch
Severity: low
Type:
Perl Version: (no value)
Fixed In: (no value)



Subject: regex CURLYs fail on utf8 latin1 chars when pattern not utf8
Date: Tue, 16 Feb 2010 14:40:04 -0700
To: perlbug [...] perl.org
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 3.4k
This is a bug report for perl from khw@khw-desktop.nonet, generated with the help of perlbug 1.39 running under perl 5.11.4. ----------------------------------------------------------------- Neither pattern nor string utf8: ./perl -Ilib -E 'my $c = "\xF7"; say 0 + $c =~ /((\x{00F7})+,?)/' 1 Upgrade string to utf8 but leave pattern non-utf8: ./perl -Ilib -E 'my $c = "\xF7"; utf8::upgrade($c); say 0 + $c =~ /((\x{00F7})+,?)/' 0 As above, but force pattern to utf8 by adding a char above latin1 that can match 0 times: ./perl -Ilib -E 'my $c = "\xF7"; utf8::upgrade($c); say 0 + $c =~ /((\x{00F7})+\x{100}?)/' 1 As above, but remove upgrade so string is not utf8, pattern is: ./perl -Ilib -E 'my $c = "\xF7"; say 0 + $c =~ /((\x{00F7})+\x{100}?)/' 1 To summarize, when the string is utf8 and the pattern isn't, it fails to match itself. ----------------------------------------------------------------- --- Flags: category=core severity=medium --- Site configuration information for perl 5.11.4: Configured by khw at Sat Jan 23 13:32:28 MST 2010. Summary of my perl5 (revision 5 version 11 subversion 4) configuration: Commit id: d0a76353c03de55a7ec9f5f128afee8d3c83e23f Platform: osname=linux, osvers=2.6.27-16-generic, archname=i686-linux uname='linux khw-desktop 2.6.27-16-generic #1 smp tue dec 1 17:56:54 utc 2009 i686 gnulinux ' config_args='-s -d -Dprefix=/home/khw/blead -Dusedevel -D'optimize=-g3' -A'optimize=-g3' -A'optimize=-O0'' hint=recommended, useposix=true, d_sigaction=define useithreads=undef, usemultiplicity=undef useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=undef, use64bitall=undef, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='cc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-O0 -g3', cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include' ccversion='', gccversion='4.3.2', gccosandvers='' intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12 ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=4, prototype=define Linker and Libraries: ld='cc', ldflags =' -fstack-protector -L/usr/local/lib' libpth=/usr/local/lib /lib /usr/lib libs=-lnsl -ldl -lm -lcrypt -lutil -lc perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc libc=/lib/libc-2.8.90.so, so=so, useshrplib=false, libperl=libperl.a gnulibc_version='2.8.90' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E' cccdlflags='-fPIC', lddlflags='-shared -g3 -g3 -O0 -L/usr/local/lib -fstack-protector' Locally applied patches: --- @INC for perl 5.11.4: /home/khw/blead/lib/perl5/site_perl/5.11.4/i686-linux /home/khw/blead/lib/perl5/site_perl/5.11.4 /home/khw/blead/lib/perl5/5.11.4/i686-linux /home/khw/blead/lib/perl5/5.11.4 /home/khw/blead/lib/perl5/site_perl . --- Environment for perl 5.11.4: HOME=/home/khw LANG=en_US.UTF-8 LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/khw/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/opt/real/RealPlayer:/home/khw/cxoffice/bin PERL_BADLANG (unset) SHELL=/bin/ksh
Subject: Re: [perl #72888] regex CURLYs fail on utf8 latin1 chars when pattern not utf8
Date: Wed, 17 Feb 2010 10:42:57 +0100
To: perl5-porters [...] perl.org
From: Abigail <abigail [...] abigail.be>
Download (untitled) / with headers
text/plain 3.1k
On Tue, Feb 16, 2010 at 01:41:02PM -0800, karl williamson wrote: Show quoted text
> # New Ticket Created by karl williamson > # Please include the string: [perl #72888] > # in the subject line of all future correspondence about this issue. > # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=72888 > > > > This is a bug report for perl from khw@khw-desktop.nonet, > generated with the help of perlbug 1.39 running under perl 5.11.4. > > > ----------------------------------------------------------------- > Neither pattern nor string utf8: > ./perl -Ilib -E 'my $c = "\xF7"; say 0 + $c =~ /((\x{00F7})+,?)/' > 1 > > Upgrade string to utf8 but leave pattern non-utf8: > ./perl -Ilib -E 'my $c = "\xF7"; utf8::upgrade($c); say 0 + $c =~ > /((\x{00F7})+,?)/' > 0 > > As above, but force pattern to utf8 by adding a char above latin1 that > can match 0 times: > ./perl -Ilib -E 'my $c = "\xF7"; utf8::upgrade($c); say 0 + $c =~ > /((\x{00F7})+\x{100}?)/' > 1 > > As above, but remove upgrade so string is not utf8, pattern is: > ./perl -Ilib -E 'my $c = "\xF7"; say 0 + $c =~ /((\x{00F7})+\x{100}?)/' > 1 > > To summarize, when the string is utf8 and the pattern isn't, it fails to > match itself.
But only in a combination of capturing and a quantifier: #!/usr/bin/perl use 5.010; use strict; use warnings; no warnings 'syntax'; use Test::More; $_ = "\xF7"; utf8::upgrade ($_); ok /\x{00F7}/, "No capture, no quantifier, no anchors"; ok /(\x{00F7})/, "Capture, no quantifier, no anchors"; ok /\x{00F7}+/, "No capture, + quantifier, no anchors"; ok /(\x{00F7})+/, "Capture, + quantifier, no anchors"; ok /^\x{00F7}$/, "No capture, no quantifier, anchors"; ok /^(\x{00F7})$/, "Capture, no quantifier, anchors"; ok /^\x{00F7}+$/, "No capture, + quantifier, anchors"; ok /^\x{00F7}*$/, "No capture, * quantifier, anchors"; ok /^\x{00F7}?$/, "No capture, ? quantifier, anchors"; ok /^(\x{00F7})+$/, "Capture, + quantifier, anchors"; ok /^(\x{00F7})*$/, "Capture, * quantifier, anchors"; ok /^(\x{00F7})?$/, "Capture, ? quantifier, anchors"; done_testing; __END__ ok 1 - No capture, no quantifier, no anchors ok 2 - Capture, no quantifier, no anchors ok 3 - No capture, + quantifier, no anchors not ok 4 - Capture, + quantifier, no anchors # Failed test 'Capture, + quantifier, no anchors' # at /tmp/eep line 18. ok 5 - No capture, no quantifier, anchors ok 6 - Capture, no quantifier, anchors ok 7 - No capture, + quantifier, anchors ok 8 - No capture, * quantifier, anchors ok 9 - No capture, ? quantifier, anchors not ok 10 - Capture, + quantifier, anchors # Failed test 'Capture, + quantifier, anchors' # at /tmp/eep line 25. not ok 11 - Capture, * quantifier, anchors # Failed test 'Capture, * quantifier, anchors' # at /tmp/eep line 26. not ok 12 - Capture, ? quantifier, anchors # Failed test 'Capture, ? quantifier, anchors' # at /tmp/eep line 27. 1..12 # Looks like you failed 4 tests of 12. I get these failures in 5.8.[89] and 5.10.1 as well.
Subject: PATCH: Some constructs have a Latin1 string in utf8 string not match itself in a non-utf8 pattern
Date: Tue, 19 Oct 2010 20:53:28 -0600
To: perlbug [...] perl.org
From: karl williamson <public [...] khwilliamson.com>
I thought there was already a ticket for this, but couldn't find it. Patch attached.
From 9ca5f258da774908a47d340499d1bc8e20e6eeb0 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Tue, 19 Oct 2010 20:45:16 -0600 Subject: [PATCH] regexec.c: utf8 strings don't match non-utf8 self Some constructs with characters that are variant under utf8 have a utf8 string not matching it's non-utf8 self. The solution is to test for this case and change to use both in utf8 while doing the comparison. --- regexec.c | 45 +++++++++++++++++++++++++++++++++++++++++---- t/re/pat.t | 14 +++++++++++++- 2 files changed, 54 insertions(+), 5 deletions(-) diff --git a/regexec.c b/regexec.c index 901703f..a1a1bc2 100644 --- a/regexec.c +++ b/regexec.c @@ -5750,10 +5750,47 @@ S_regrepeat(pTHX_ const regexp *prog, const regnode *p, I32 max, int depth) case CANY: scan = loceol; break; - case EXACT: /* length of string is 1 */ - c = (U8)*STRING(p); - while (scan < loceol && UCHARAT(scan) == c) - scan++; + case EXACT: + /* To get here, EXACT nodes must have *byte* length == 1. That means + * they match only characters in the string that can be expressed as a + * single byte. For non-utf8 strings, that means a simple match. For + * utf8 strings, the character matched must be an invariant, or + * downgradable to a single byte. The pattern's utf8ness is + * irrelevant, as it must be a single byte, so either it isn't utf8, or + * if it is it's an invariant */ + + assert(! UTF_PATTERN || UNI_IS_INVARIANT(c)); + { + U8* pattern = (U8*) STRING(p); + U8 c = UCHARAT(pattern); + if ((! utf8_target) || UNI_IS_INVARIANT(c)) { + + /* Here, the string isn't utf8, or the character in the EXACT + * node is the same in utf8 as not, so can just do equality. + * Each matching char must be 1 byte long */ + while (scan < loceol && UCHARAT(scan) == c) { + scan++; + } + } + else { + + /* Here, the string is utf8, and the char to match is different + * in utf8 than not. Fastest to find the two utf8 bytes that + * represent c, and then look for those in sequence in the utf8 + * string */ + U8 high = UTF8_TWO_BYTE_HI(c); + U8 low = UTF8_TWO_BYTE_LO(c); + loceol = PL_regeol; + while (hardcount < max + && scan + 1 < loceol + && UCHARAT(scan) == high + && UCHARAT(scan + 1) == low) + { + scan += 2; + hardcount++; + } + } + } break; case EXACTF: /* length of string is 1 */ c = (U8)*STRING(p); diff --git a/t/re/pat.t b/t/re/pat.t index c007880..4668104 100644 --- a/t/re/pat.t +++ b/t/re/pat.t @@ -23,7 +23,7 @@ BEGIN { } -plan tests => 398; # Update this when adding/deleting tests. +plan tests => 402; # Update this when adding/deleting tests. run_tests() unless caller; @@ -1072,6 +1072,18 @@ sub run_tests { } + { # Some constructs with Latin1 characters cause a utf8 string not to + # match itself in non-utf8 + my $c = "\xc0"; + my $pattern = my $utf8_pattern = qr/((\xc0)+,?)/; + utf8::upgrade($utf8_pattern); + ok $c =~ $pattern, "\\xc0 =~ $pattern; Neither pattern nor target utf8"; + ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; pattern utf8, target not"; + utf8::upgrade($c); + ok $c =~ $pattern, "\\xc0 =~ $pattern; target utf8, pattern not"; + ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; Both target and pattern utf8"; + } + { # Test that a regex followed by an operator and/or a statement modifier work # These tests use string-eval so that it reports a clean error when it fails -- 1.5.6.3
Subject: Revised: [perl #78464] PATCH: Some constructs have a Latin1 string in utf8 string not match itself in a non-utf8 pattern
Date: Tue, 19 Oct 2010 22:39:12 -0600
To: perl5-porters [...] perl.org
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 525b
karl williamson (via RT) wrote: Show quoted text
> # New Ticket Created by karl williamson > # Please include the string: [perl #78464] > # in the subject line of all future correspondence about this issue. > # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=78464 > > > > I thought there was already a ticket for this, but couldn't find it. > Patch attached. >
I realized that I had put the assert before the value it was testing was set, so it was only chance that it passed the test suite. Here is a slightly revised patch
From 8701660d127695c1e77d8dea57cb23ef8ef16cd8 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Tue, 19 Oct 2010 20:45:16 -0600 Subject: [PATCH] regexec.c: utf8 strings don't match non-utf8 self Some constructs with characters that are variant under utf8 have a utf8 string not matching it's non-utf8 self. The solution is to test for this case and change to use both in utf8 while doing the comparison. --- regexec.c | 40 +++++++++++++++++++++++++++++++++++++--- t/re/pat.t | 14 +++++++++++++- 2 files changed, 50 insertions(+), 4 deletions(-) diff --git a/regexec.c b/regexec.c index 901703f..662cf7d 100644 --- a/regexec.c +++ b/regexec.c @@ -5750,10 +5750,44 @@ S_regrepeat(pTHX_ const regexp *prog, const regnode *p, I32 max, int depth) case CANY: scan = loceol; break; - case EXACT: /* length of string is 1 */ + case EXACT: + /* To get here, EXACT nodes must have *byte* length == 1. That means + * they match only characters in the string that can be expressed as a + * single byte. For non-utf8 strings, that means a simple match. For + * utf8 strings, the character matched must be an invariant, or + * downgradable to a single byte. The pattern's utf8ness is + * irrelevant, as it must be a single byte, so either it isn't utf8, or + * if it is it's an invariant */ + c = (U8)*STRING(p); - while (scan < loceol && UCHARAT(scan) == c) - scan++; + assert(! UTF_PATTERN || UNI_IS_INVARIANT(c)); + if ((! utf8_target) || UNI_IS_INVARIANT(c)) { + + /* Here, the string isn't utf8, or the character in the EXACT + * node is the same in utf8 as not, so can just do equality. + * Each matching char must be 1 byte long */ + while (scan < loceol && UCHARAT(scan) == c) { + scan++; + } + } + else { + + /* Here, the string is utf8, and the char to match is different + * in utf8 than not. Fastest to find the two utf8 bytes that + * represent c, and then look for those in sequence in the utf8 + * string */ + U8 high = UTF8_TWO_BYTE_HI(c); + U8 low = UTF8_TWO_BYTE_LO(c); + loceol = PL_regeol; + while (hardcount < max + && scan + 1 < loceol + && UCHARAT(scan) == high + && UCHARAT(scan + 1) == low) + { + scan += 2; + hardcount++; + } + } break; case EXACTF: /* length of string is 1 */ c = (U8)*STRING(p); diff --git a/t/re/pat.t b/t/re/pat.t index c007880..4668104 100644 --- a/t/re/pat.t +++ b/t/re/pat.t @@ -23,7 +23,7 @@ BEGIN { } -plan tests => 398; # Update this when adding/deleting tests. +plan tests => 402; # Update this when adding/deleting tests. run_tests() unless caller; @@ -1072,6 +1072,18 @@ sub run_tests { } + { # Some constructs with Latin1 characters cause a utf8 string not to + # match itself in non-utf8 + my $c = "\xc0"; + my $pattern = my $utf8_pattern = qr/((\xc0)+,?)/; + utf8::upgrade($utf8_pattern); + ok $c =~ $pattern, "\\xc0 =~ $pattern; Neither pattern nor target utf8"; + ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; pattern utf8, target not"; + utf8::upgrade($c); + ok $c =~ $pattern, "\\xc0 =~ $pattern; target utf8, pattern not"; + ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; Both target and pattern utf8"; + } + { # Test that a regex followed by an operator and/or a statement modifier work # These tests use string-eval so that it reports a clean error when it fails -- 1.5.6.3
Subject: Re: [perl #78464] Revised PATCH: Some constructs have a Latin1 string in utf8 string not match itself in a non-utf8 pattern
Date: Wed, 20 Oct 2010 11:38:01 -0600
To: perl5-porters [...] perl.org
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 608b
karl williamson (via RT) wrote: Show quoted text
> # New Ticket Created by karl williamson > # Please include the string: [perl #78464] > # in the subject line of all future correspondence about this issue. > # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=78464 > > > > I thought there was already a ticket for this, but couldn't find it. > Patch attached. >
This slightly revises the first patch to remove an unnecessary block and consequent indent. The second patch extends the fix to the /i flag. With both patches, the affected cases will have a character match its variant self with and without /i
From 3c117e8450fa50df5f635aaa4b20fa3eda8dc66e Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Wed, 20 Oct 2010 10:20:29 -0600 Subject: [PATCH] Subject: regexec.c: utf8 doesn't match non-utf8 self Some regex patterns don't match a character with itself when the target string is in utf8 and the pattern isn't, and the character is variant under utf8. (This means only Latin1-range characters in the pattern are affected.) The solution is to test for this case and use the utf8 representation of the pattern character for the comparison. --- regexec.c | 40 +++++++++++++++++++++++++++++++++++++--- t/re/pat.t | 14 +++++++++++++- 2 files changed, 50 insertions(+), 4 deletions(-) diff --git a/regexec.c b/regexec.c index 901703f..f87c2fa 100644 --- a/regexec.c +++ b/regexec.c @@ -5750,10 +5750,44 @@ S_regrepeat(pTHX_ const regexp *prog, const regnode *p, I32 max, int depth) case CANY: scan = loceol; break; - case EXACT: /* length of string is 1 */ + case EXACT: + /* To get here, EXACT nodes must have *byte* length == 1. That means + * they match only characters in the string that can be expressed as a + * single byte. For non-utf8 strings, that means a simple match. For + * utf8 strings, the character matched must be an invariant, or + * downgradable to a single byte. The pattern's utf8ness is + * irrelevant, as it must be a single byte, so either it isn't utf8, or + * if it is it's an invariant */ + c = (U8)*STRING(p); - while (scan < loceol && UCHARAT(scan) == c) - scan++; + assert(! UTF_PATTERN || UNI_IS_INVARIANT(c)); + if ((! utf8_target) || UNI_IS_INVARIANT(c)) { + + /* Here, the string isn't utf8, or the character in the EXACT + * node is the same in utf8 as not, so can just do equality. + * Each matching char must be 1 byte long */ + while (scan < loceol && UCHARAT(scan) == c) { + scan++; + } + } + else { + + /* Here, the string is utf8, and the char to match is different + * in utf8 than not. Fastest to find the two utf8 bytes that + * represent c, and then look for those in sequence in the utf8 + * string */ + U8 high = UTF8_TWO_BYTE_HI(c); + U8 low = UTF8_TWO_BYTE_LO(c); + loceol = PL_regeol; + while (hardcount < max + && scan + 1 < loceol + && UCHARAT(scan) == high + && UCHARAT(scan + 1) == low) + { + scan += 2; + hardcount++; + } + } break; case EXACTF: /* length of string is 1 */ c = (U8)*STRING(p); diff --git a/t/re/pat.t b/t/re/pat.t index c007880..4668104 100644 --- a/t/re/pat.t +++ b/t/re/pat.t @@ -23,7 +23,7 @@ BEGIN { } -plan tests => 398; # Update this when adding/deleting tests. +plan tests => 402; # Update this when adding/deleting tests. run_tests() unless caller; @@ -1072,6 +1072,18 @@ sub run_tests { } + { # Some constructs with Latin1 characters cause a utf8 string not to + # match itself in non-utf8 + my $c = "\xc0"; + my $pattern = my $utf8_pattern = qr/((\xc0)+,?)/; + utf8::upgrade($utf8_pattern); + ok $c =~ $pattern, "\\xc0 =~ $pattern; Neither pattern nor target utf8"; + ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; pattern utf8, target not"; + utf8::upgrade($c); + ok $c =~ $pattern, "\\xc0 =~ $pattern; target utf8, pattern not"; + ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; Both target and pattern utf8"; + } + { # Test that a regex followed by an operator and/or a statement modifier work # These tests use string-eval so that it reports a clean error when it fails -- 1.5.6.3
From 36a48cf433c58efc5d7c81e082c42d4c21423fa6 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Wed, 20 Oct 2010 11:11:13 -0600 Subject: [PATCH] regexec.c: utf8 doesn't match /i nonutf8 self This is a continuation of [perl #78464]. It fixes it also for the /i flag. After this, a character should match itself in the regrepeat function, even if one is in utf8 and the other isn't, for both /i and not. The solution is to move the code for handling /i into the non-i structure so that the decisions about utf8 are all in one place. When the string is in utf8, it uses the utf8-fold function. This has the added effect of fixing a few cases where a utf8 string did not match a fold in a non-utf8 pattern. I haven't added tests for these, as it only fixes a few cases where this is a problem, and I'm working on a comprehensive solution to the problem, accompanied by extensive tests. --- regexec.c | 109 +++++++++++++++++++++++++++++++++++++++++++---------------- t/re/pat.t | 4 ++ 2 files changed, 83 insertions(+), 30 deletions(-) diff --git a/regexec.c b/regexec.c index f87c2fa..842afaf 100644 --- a/regexec.c +++ b/regexec.c @@ -5750,8 +5750,12 @@ S_regrepeat(pTHX_ const regexp *prog, const regnode *p, I32 max, int depth) case CANY: scan = loceol; break; + case EXACTFL: + PL_reg_flags |= RF_tainted; + /* FALL THROUGH */ case EXACT: - /* To get here, EXACT nodes must have *byte* length == 1. That means + case EXACTF: + /* To get here, EXACTish nodes must have *byte* length == 1. That means * they match only characters in the string that can be expressed as a * single byte. For non-utf8 strings, that means a simple match. For * utf8 strings, the character matched must be an invariant, or @@ -5761,47 +5765,92 @@ S_regrepeat(pTHX_ const regexp *prog, const regnode *p, I32 max, int depth) c = (U8)*STRING(p); assert(! UTF_PATTERN || UNI_IS_INVARIANT(c)); + if ((! utf8_target) || UNI_IS_INVARIANT(c)) { /* Here, the string isn't utf8, or the character in the EXACT * node is the same in utf8 as not, so can just do equality. * Each matching char must be 1 byte long */ - while (scan < loceol && UCHARAT(scan) == c) { - scan++; + switch (OP(p)) { + case EXACT: + while (scan < loceol && UCHARAT(scan) == c) { + scan++; + } + break; + case EXACTF: + while (scan < loceol && + (UCHARAT(scan) == c || UCHARAT(scan) == PL_fold[c])) + { + scan++; + } + break; + case EXACTFL: + while (scan < loceol && + (UCHARAT(scan) == c || UCHARAT(scan) == PL_fold_locale[c])) + { + scan++; + } + break; + default: + Perl_croak(aTHX_ "panic: Unexpected op %u", OP(p)); } } else { - /* Here, the string is utf8, and the char to match is different - * in utf8 than not. Fastest to find the two utf8 bytes that - * represent c, and then look for those in sequence in the utf8 - * string */ - U8 high = UTF8_TWO_BYTE_HI(c); - U8 low = UTF8_TWO_BYTE_LO(c); - loceol = PL_regeol; - while (hardcount < max - && scan + 1 < loceol - && UCHARAT(scan) == high - && UCHARAT(scan + 1) == low) - { - scan += 2; - hardcount++; + /* Here, the string is utf8, and the pattern char is different + * in utf8 than not. */ + + switch (OP(p)) { + case EXACT: + { + /* Fastest to find the two utf8 bytes that represent c, and + * then look for those in sequence in the utf8 string */ + U8 high = UTF8_TWO_BYTE_HI(c); + U8 low = UTF8_TWO_BYTE_LO(c); + loceol = PL_regeol; + + while (hardcount < max + && scan + 1 < loceol + && UCHARAT(scan) == high + && UCHARAT(scan + 1) == low) + { + scan += 2; + hardcount++; + } + } + break; + case EXACTFL: /* Doesn't really make sense, but is best we can + do. The documents warn against mixing locale + and utf8 */ + case EXACTF: + { /* utf8 string, so use utf8 foldEQ */ + char *tmpeol = loceol; + while (hardcount < max + && foldEQ_utf8(scan, &tmpeol, 0, utf8_target, + STRING(p), NULL, 1, UTF_PATTERN)) + { + scan = tmpeol; + tmpeol = loceol; + hardcount++; + } + + /* XXX Note that the above handles properly the German + * sharp ss in the pattern matching ss in the string. But + * it doesn't handle properly cases where the string + * contains say 'LIGATURE ff' and the pattern is 'f+'. + * This would require, say, a new function or revised + * interface to foldEQ_utf8(), in which the maximum number + * of characters to match could be passed and it would + * return how many actually did. This is just one of many + * cases where multi-char folds don't work properly, and so + * the fix is being deferred */ + } + break; + default: + Perl_croak(aTHX_ "panic: Unexpected op %u", OP(p)); } } break; - case EXACTF: /* length of string is 1 */ - c = (U8)*STRING(p); - while (scan < loceol && - (UCHARAT(scan) == c || UCHARAT(scan) == PL_fold[c])) - scan++; - break; - case EXACTFL: /* length of string is 1 */ - PL_reg_flags |= RF_tainted; - c = (U8)*STRING(p); - while (scan < loceol && - (UCHARAT(scan) == c || UCHARAT(scan) == PL_fold_locale[c])) - scan++; - break; case ANYOF: if (utf8_target) { loceol = PL_regeol; diff --git a/t/re/pat.t b/t/re/pat.t index 4668104..d4bbbb8 100644 --- a/t/re/pat.t +++ b/t/re/pat.t @@ -1078,10 +1078,14 @@ sub run_tests { my $pattern = my $utf8_pattern = qr/((\xc0)+,?)/; utf8::upgrade($utf8_pattern); ok $c =~ $pattern, "\\xc0 =~ $pattern; Neither pattern nor target utf8"; + ok $c =~ /$pattern/i, "\\xc0 =~ /$pattern/i; Neither pattern nor target utf8"; ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; pattern utf8, target not"; + ok $c =~ /$utf8_pattern/i, "\\xc0 =~ /$pattern/i; pattern utf8, target not"; utf8::upgrade($c); ok $c =~ $pattern, "\\xc0 =~ $pattern; target utf8, pattern not"; + ok $c =~ /$pattern/i, "\\xc0 =~ /$pattern/i; target utf8, pattern not"; ok $c =~ $utf8_pattern, "\\xc0 =~ $pattern; Both target and pattern utf8"; + ok $c =~ /$utf8_pattern/i, "\\xc0 =~ /$pattern/i; Both target and pattern utf8"; } { -- 1.5.6.3
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 808b
On Wed Oct 20 10:38:45 2010, public@khwilliamson.com wrote: Show quoted text
> karl williamson (via RT) wrote:
> > # New Ticket Created by karl williamson > > # Please include the string: [perl #78464] > > # in the subject line of all future correspondence about this issue. > > # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=78464 > > > > > > > I thought there was already a ticket for this, but couldn't find it. > > Patch attached. > >
> > This slightly revises the first patch to remove an unnecessary block and > consequent indent. The second patch extends the fix to the /i flag. > With both patches, the affected cases will have a character match its > variant self with and without /i
Thank you. Applied as 634c83a2672252257e360eb1939b7ec762ef6308 and d4e0b827d7de0f9f86fa8b9dde76e6d722b0cf2d.
CC: perl5-porters [...] perl.org
Subject: Re: [perl #78464] Revised PATCH: Some constructs have a Latin1 string in utf8 string not match itself in a non-utf8 pattern
Date: Thu, 28 Oct 2010 19:28:13 +0200
To: karl williamson <public [...] khwilliamson.com>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 793b
is this applied? seems reasonable On 20 October 2010 19:38, karl williamson <public@khwilliamson.com> wrote: Show quoted text
> karl williamson (via RT) wrote:
>> >> # New Ticket Created by  karl williamson # Please include the string: >>  [perl #78464] >> # in the subject line of all future correspondence about this issue. # >> <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=78464 > >> >> >> I thought there was already a ticket for this, but couldn't find it. Patch >> attached. >>
> > This slightly revises the first patch to remove an unnecessary block and >  consequent indent.  The second patch extends the fix to the /i flag. With > both patches, the affected cases will have a character match its variant > self with and without /i >
-- perl -Mre=debug -e "/just|another|perl|hacker/"


This service is sponsored and maintained by Best Practical Solutions and runs on Perl.org infrastructure.

For issues related to this RT instance (aka "perlbug"), please contact perlbug-admin at perl.org