New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UTF8/Latin 1/i regexp "Malformed character" warning #7957
Comments
From @nwc10Created by @nwc10Stig came across this: ./perl -Ilib -we '$term = "\xe9"; $target = "\xe9\x{100}"; chop $target; $target =~ /$term/i' I'm not sure if this is similar or the same to a bug previously reported. Perl Info
|
From @iabynOn Tue, Jun 07, 2005 at 04:28:08PM -0000, Nicholas Clark wrote:
it turns out perl is totally borked for $utf8 =~ /latin1/i $latin1 =~ /$utf8/i unless all the chars happen to be < 0x7f. The patch below fixes the first case: the fault was in ibcmp_utf8(), The second case fails in S_find_byclass() in regexec.c: while (s <= e) { where it calls ibcmp() but s is a pointer to a latin1 while m is pointer Dave. -- Change 25095 by davem@davem-splatty on 2005/07/08 01:43:24 [perl #36207] UTF8/Latin 1/i regexp "Malformed character" warning Affected files ... ... //depot/perl/t/op/pat.t#222 edit Differences ... ==== //depot/perl/t/op/pat.t#222 (xtext) ==== @@ -6,7 +6,7 @@ $| = 1; -print "1..1178\n"; BEGIN { ==== //depot/perl/utf8.c#239 (text) ==== @@ -2037,7 +2037,7 @@ |
The RT System itself - Status changed from 'new' to 'open' |
From @jhi
That bugs of this level are found only now is telling something about
Looks good, off-hand. (Feel free to quote me on haggis, though.)
I won't have time to look into this anytime soon, but I think the fix You might want to extend the tests also to run-time: my $utf8 = "\x{e9}\x{100}"; chop $utf8; ok($utf8 =~ /\xe9/i); ok($utf8 =~ /$latin1/i); |
From @demerphqOn 7/8/05, Dave Mitchell <davem@iabyn.com> wrote:
The case where the pattern is /(foo|bar)/ is handled by a totally cheers, -- |
From @nwc10On Fri, Jul 08, 2005 at 09:45:32AM +0300, Jarkko Hietaniemi wrote:
The bugs are actually subtle in that they are in the mixed bytes (mumble Nicholas Clark |
From @demerphqOn 7/8/05, Dave Mitchell <davem@iabyn.com> wrote:
Well, i guess half wrong is better than all wrong but that still thanks tho. -- |
From @iabynOn Fri, Jul 08, 2005 at 09:24:42AM +0200, demerphq wrote:
$ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if $x=~/(abc|\xe9)/i' $ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if "\xe9"=~/(abc|$x)/i' -- |
From @demerphqOn 7/8/05, Dave Mitchell <davem@iabyn.com> wrote:
Attached patch fixes it in the TRIE(FL?)? code afaict. D:\dev\perl\live>perl -Ilib -wle "$x=qq[\xe9\x{100}]; chop $x; print 1 D:\dev\perl\live>perl -Ilib -wle "$x=qq[\xe9\x{100}]; chop $x; print 1 Yves -- |
From @demerphqfix_utf.patch--- regexec.c.bak 2005-07-08 17:02:45.171875000 +0200
+++ regexec.c 2005-07-08 17:03:02.156250000 +0200
@@ -2612,7 +2612,7 @@
if ( base ) {
- if ( do_utf8 || UTF ) {
+ if ( do_utf8 ) {
if ( foldlen>0 ) {
uvc = utf8n_to_uvuni( uscan, UTF8_MAXLEN, &len, uniflags );
foldlen -= len;
@@ -2678,7 +2678,7 @@
if ( base ) {
- if ( do_utf8 || UTF ) {
+ if ( do_utf8 ) {
uvc = utf8n_to_uvuni( (U8*)uc, UTF8_MAXLEN, &len, uniflags );
} else {
uvc = (U32)*uc;
|
From @iabynOn Fri, Jul 08, 2005 at 09:45:32AM +0300, Jarkko Hietaniemi wrote:
if ibcmp is replaced with ibcmp_utf8, what should ibcmp_locale be replaced -- |
From @iabynOn Fri, Jul 08, 2005 at 05:07:26PM +0200, demerphq wrote:
thanks, applied as change #25106 -- |
From @jhiDave Mitchell wrote:
It depends, and not "replaced with". If it is a UTF-8 locale, |
From @demerphqHi, this patch should resolve the bugs that Slaven found in the regex The original bug was that $ord=181; $u=$c=chr($ord); utf8::upgrade $u; was not working. This then lead to the discovery that $u=~/$c/i was not working for $ord>127. This in turn lead to the discovery that when ibcmp_utf8() is called Following this I found that the FOLDCHAR regop was not being produced And lastly I discovered some unnecessary code and added some comments The file 3030-up.patch is the full deal as one. The others are each The only thing im not feeling confident about is whether i was too Ive included a single patch that has all the changes and a tar.gz of Yves -- |
From @demerphq3030-up.patchIndex: regcomp.c
===================================================================
--- regcomp.c (revision 3030)
+++ regcomp.c (revision 3037)
@@ -2791,7 +2791,15 @@
last = cur;
}
} else {
- if ( last ) {
+/*
+ Currently we assume that the trie can handle unicode and ascii
+ matches fold cased matches. If this proves true then the following
+ define will prevent tries in this situation.
+
+ #define TRIE_TYPE_IS_SAFE (UTF || optype==EXACT)
+*/
+#define TRIE_TYPE_IS_SAFE 1
+ if ( last && TRIE_TYPE_IS_SAFE ) {
make_trie( pRExC_state,
startbranch, first, cur, tail, count,
optype, depth+1 );
@@ -2819,7 +2827,8 @@
"", SvPV_nolen_const( mysv ),REG_NODE_NUM(cur));
});
- if ( last ) {
+
+ if ( last && TRIE_TYPE_IS_SAFE ) {
made= make_trie( pRExC_state, startbranch, first, scan, tail, count, optype, depth+1 );
#ifdef TRIE_STUDY_OPT
if ( ((made == MADE_EXACT_TRIE &&
@@ -6867,6 +6876,7 @@
case 0xDF:
case 0xC3:
case 0xCE:
+ do_foldchar:
if (!LOC && FOLD) {
U32 len,cp;
len=0; /* silence a spurious compiler warning */
@@ -6893,7 +6903,11 @@
required, as the default for this switch is to jump to the
literal text handling code.
*/
- switch (*++RExC_parse) {
+ switch ((U8)*++RExC_parse) {
+ case 0xDF:
+ case 0xC3:
+ case 0xCE:
+ goto do_foldchar;
/* Special Escapes */
case 'A':
RExC_seen_zerolen++;
@@ -7211,8 +7225,13 @@
an unescaped equivalent literal.
*/
- switch (*++p) {
+ switch ((U8)*++p) {
/* These are all the special escapes. */
+ case 0xDF:
+ case 0xC3:
+ case 0xCE:
+ if (LOC || !FOLD || !is_TRICKYFOLD_safe(p,RExC_end,UTF))
+ goto normal_default;
case 'A': /* Start assertion */
case 'b': case 'B': /* Word-boundary assertion*/
case 'C': /* Single char !DANGEROUS! */
Index: utf8.c
===================================================================
--- utf8.c (revision 3030)
+++ utf8.c (revision 3037)
@@ -2228,6 +2228,11 @@
that define an explicit length must reach their goal pointers for
a match to succeed).
+It is mandatory to provide an endpoint for both strings being compared,
+although this may be either through the pe1/pe2 or l1/l2 parameters.
+Otherwise under debugging it will fail an assert and under
+release it will silently return "does not match".
+
For case-insensitiveness, the "casefolding" of Unicode is used
instead of upper/lowercasing both the characters, see
http://www.unicode.org/unicode/reports/tr21/ (Case Mappings).
@@ -2254,13 +2259,16 @@
if (pe1)
e1 = *(U8**)pe1;
+ assert(e1 || l1);
if (e1 == 0 || (l1 && l1 < (UV)(e1 - (const U8*)s1)))
f1 = (const U8*)s1 + l1;
if (pe2)
e2 = *(U8**)pe2;
+ assert(e2 || l2);
if (e2 == 0 || (l2 && l2 < (UV)(e2 - (const U8*)s2)))
f2 = (const U8*)s2 + l2;
+ assert((e1 == 0 && f1 == 0) || (e2 == 0 && f2 == 0) || (f1 == 0 && f2 == 0));
if ((e1 == 0 && f1 == 0) || (e2 == 0 && f2 == 0) || (f1 == 0 && f2 == 0))
return 1; /* mismatch; possible infinite loop or false positive */
Index: t/op/reg_fold.t
===================================================================
--- t/op/reg_fold.t (revision 0)
+++ t/op/reg_fold.t (revision 3037)
@@ -0,0 +1,37 @@
+use strict;
+use warnings;
+use Test::More;
+my $count=1;
+my @tests;
+use Cwd;
+
+my $file="../lib/unicore/CaseFolding.txt";
+open my $fh,"<",$file
+ or die "Failed to read '$file' from '".cwd()."': $!";
+while (<$fh>) {
+ chomp;
+ my ($line,$comment)= split/\s+#\s+/, $_;
+ my ($cp,$type,@fc)=split/[\s;]+/,$line||'';
+ next unless $type and ($type eq 'F' or $type eq 'C');
+ $_="\\x{$_}" for @fc;
+ my $cpv=hex("0x$cp");
+ my $chr="chr(0x$cp)";
+ my @str;
+ push @str,$chr if $cpv<128 or $cpv>256;
+ if ($cpv<256) {
+ push @str,"do{my \$c=$chr; utf8::upgrade(\$c); \$c}"
+ }
+
+ foreach my $str ( @str ) {
+ my $expr="$str=~/@fc/ix";
+ my $t=($cpv > 256 || $str=~/^do/) ? "unicode" : "latin";
+ push @tests,
+ qq[ok($expr,'$chr=~/@fc/ix - $comment ($t string)')];
+ $tests[-1]="TODO: { local \$TODO='[13:41] <BinGOs> cue *It is all Greek to me* joke.';\n$tests[-1] }"
+ if $cp eq '0390' or $cp eq '03B0';
+ $count++;
+ }
+}
+eval join ";\n","plan tests=>".($count-1),@tests,"1"
+ or die $@;
+__DATA__
Index: t/op/pat.t
===================================================================
--- t/op/pat.t (revision 3030)
+++ t/op/pat.t (revision 3037)
@@ -3406,9 +3406,9 @@
ok($utf8 =~ /(abc|\xe9)/i, "utf8/latin trie");
ok($utf8 =~ /(abc|$latin1)/i, "utf8/latin trie runtime");
- ok("\xe9" =~ /$utf8/i, "# TODO latin/utf8");
+ ok("\xe9" =~ /$utf8/i, "# latin/utf8");
ok("\xe9" =~ /(abc|$utf8)/i, "# latin/utf8 trie");
- ok($latin1 =~ /$utf8/i, "# TODO latin/utf8 runtime");
+ ok($latin1 =~ /$utf8/i, "# latin/utf8 runtime");
ok($latin1 =~ /(abc|$utf8)/i, "# latin/utf8 trie runtime");
}
@@ -4487,6 +4487,23 @@
iseq($1,"\xd6","#45605");
}
+{
+ # Regardless of utf8ness any character matches itself when
+ # doing a case insensitive match. See also [perl #36207]
+ for my $o (0..255) {
+ my @ch=(chr($o),chr($o));
+ utf8::upgrade($ch[1]);
+ for my $u_str (0,1) {
+ for my $u_pat (0,1) {
+ ok( $ch[$u_str]=~/\Q$ch[$u_pat]\E/i,
+ "\$c=~/\$c/i : chr($o) : u_str=$u_str u_pat=$u_pat");
+ ok( $ch[$u_str]=~/\Q$ch[$u_pat]\E|xyz/i,
+ "# \$c=~/\$c|xyz/i : chr($o) : u_str=$u_str u_pat=$u_pat");
+ }
+ }
+ }
+}
+
# Test counter is at bottom of file. Put new tests above here.
#-------------------------------------------------------------------
# Keep the following tests last -- they may crash perl
@@ -4545,6 +4562,6 @@
iseq(0+$::test,$::TestCount,"Got the right number of tests!");
# Don't forget to update this!
BEGIN {
- $::TestCount = 1965;
+ $::TestCount = 4013;
print "1..$::TestCount\n";
}
Index: regexec.c
===================================================================
--- regexec.c (revision 3030)
+++ regexec.c (revision 3037)
@@ -989,8 +989,12 @@
return NULL;
}
+#define DECL_TRIE_TYPE(scan) \
+ const enum { trie_plain, trie_utf8, trie_utf8_fold, trie_latin_utf8_fold } \
+ trie_type = (scan->flags != EXACT) \
+ ? (do_utf8 ? trie_utf8_fold : (UTF ? trie_latin_utf8_fold : trie_plain)) \
+ : (do_utf8 ? trie_utf8 : trie_plain)
-
#define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uscan, len, \
uvc, charid, foldlen, foldbuf, uniflags) STMT_START { \
switch (trie_type) { \
@@ -1007,6 +1011,19 @@
uscan = foldbuf + UNISKIP( uvc ); \
} \
break; \
+ case trie_latin_utf8_fold: \
+ if ( foldlen>0 ) { \
+ uvc = utf8n_to_uvuni( uscan, UTF8_MAXLEN, &len, uniflags ); \
+ foldlen -= len; \
+ uscan += len; \
+ len=0; \
+ } else { \
+ len = 1; \
+ uvc = to_uni_fold( *(U8*)uc, foldbuf, &foldlen ); \
+ foldlen -= UNISKIP( uvc ); \
+ uscan = foldbuf + UNISKIP( uvc ); \
+ } \
+ break; \
case trie_utf8: \
uvc = utf8n_to_uvuni( (U8*)uc, UTF8_MAXLEN, &len, uniflags ); \
break; \
@@ -1029,12 +1046,14 @@
} \
} STMT_END
-#define REXEC_FBC_EXACTISH_CHECK(CoNd) \
+#define REXEC_FBC_EXACTISH_CHECK(CoNd) \
+{ \
+ char *my_strend= (char *)strend; \
if ( (CoNd) \
&& (ln == len || \
- ibcmp_utf8(s, NULL, 0, do_utf8, \
+ !ibcmp_utf8(s, &my_strend, 0, do_utf8, \
m, NULL, ln, (bool)UTF)) \
- && (!reginfo || regtry(reginfo, &s)) ) \
+ && (!reginfo || regtry(reginfo, &s)) ) \
goto got_it; \
else { \
U8 foldbuf[UTF8_MAXBYTES_CASE+1]; \
@@ -1042,15 +1061,14 @@
f = to_utf8_fold(tmpbuf, foldbuf, &foldlen); \
if ( f != c \
&& (f == c1 || f == c2) \
- && (ln == foldlen || \
- !ibcmp_utf8((char *) foldbuf, \
- NULL, foldlen, do_utf8, \
- m, \
- NULL, ln, (bool)UTF)) \
- && (!reginfo || regtry(reginfo, &s)) ) \
+ && (ln == len || \
+ !ibcmp_utf8(s, &my_strend, 0, do_utf8,\
+ m, NULL, ln, (bool)UTF)) \
+ && (!reginfo || regtry(reginfo, &s)) ) \
goto got_it; \
} \
- s += len
+} \
+s += len
#define REXEC_FBC_EXACTISH_SCAN(CoNd) \
STMT_START { \
@@ -1210,14 +1228,26 @@
U8 tmpbuf1[UTF8_MAXBYTES_CASE+1];
U8 tmpbuf2[UTF8_MAXBYTES_CASE+1];
const U32 uniflags = UTF8_ALLOW_DEFAULT;
+
+ /* XXX: Since the node will be case folded at compile
+ time this logic is a little odd, although im not
+ sure that its actually wrong. --dmq */
+
+ c1 = to_utf8_lower((U8*)m, tmpbuf1, &ulen1);
+ c2 = to_utf8_upper((U8*)m, tmpbuf2, &ulen2);
- to_utf8_lower((U8*)m, tmpbuf1, &ulen1);
- to_utf8_upper((U8*)m, tmpbuf2, &ulen2);
-
+ /* XXX: This is kinda strange. to_utf8_XYZ returns the
+ codepoint of the first character in the converted
+ form, yet originally we did the extra step.
+ No tests fail by commenting this code out however
+ so Ive left it out. -- dmq.
+
c1 = utf8n_to_uvchr(tmpbuf1, UTF8_MAXBYTES_CASE,
0, uniflags);
c2 = utf8n_to_uvchr(tmpbuf2, UTF8_MAXBYTES_CASE,
0, uniflags);
+ */
+
lnc = 0;
while (sm < ((U8 *) m + ln)) {
lnc++;
@@ -1252,24 +1282,33 @@
* matching (called "loose matching" in Unicode).
* ibcmp_utf8() will do just that. */
- if (do_utf8) {
+ if (do_utf8 || UTF) {
UV c, f;
U8 tmpbuf [UTF8_MAXBYTES+1];
- STRLEN len, foldlen;
+ STRLEN len = 1;
+ STRLEN foldlen;
const U32 uniflags = UTF8_ALLOW_DEFAULT;
if (c1 == c2) {
/* Upper and lower of 1st char are equal -
* probably not a "letter". */
while (s <= e) {
- c = utf8n_to_uvchr((U8*)s, UTF8_MAXBYTES, &len,
+ if (do_utf8) {
+ c = utf8n_to_uvchr((U8*)s, UTF8_MAXBYTES, &len,
uniflags);
+ } else {
+ c = *((U8*)s);
+ }
REXEC_FBC_EXACTISH_CHECK(c == c1);
}
}
else {
while (s <= e) {
- c = utf8n_to_uvchr((U8*)s, UTF8_MAXBYTES, &len,
+ if (do_utf8) {
+ c = utf8n_to_uvchr((U8*)s, UTF8_MAXBYTES, &len,
uniflags);
+ } else {
+ c = *((U8*)s);
+ }
/* Handle some of the three Greek sigmas cases.
* Note that not all the possible combinations
@@ -1287,6 +1326,7 @@
}
}
else {
+ /* Neither pattern nor string are UTF8 */
if (c1 == c2)
REXEC_FBC_EXACTISH_SCAN(*(U8*)s == c1);
else
@@ -1461,10 +1501,7 @@
case AHOCORASICKC:
case AHOCORASICK:
{
- const enum { trie_plain, trie_utf8, trie_utf8_fold }
- trie_type = do_utf8 ?
- (c->flags == EXACT ? trie_utf8 : trie_utf8_fold)
- : trie_plain;
+ DECL_TRIE_TYPE(c);
/* what trie are we using right now */
reg_ac_data *aho
= (reg_ac_data*)progi->data->data[ ARG( c ) ];
@@ -2872,10 +2909,7 @@
case TRIE:
{
/* what type of TRIE am I? (utf8 makes this contextual) */
- const enum { trie_plain, trie_utf8, trie_utf8_fold }
- trie_type = do_utf8 ?
- (scan->flags == EXACT ? trie_utf8 : trie_utf8_fold)
- : trie_plain;
+ DECL_TRIE_TYPE(scan);
/* what trie are we using right now */
reg_trie_data * const trie
Index: MANIFEST
===================================================================
--- MANIFEST (revision 3030)
+++ MANIFEST (revision 3037)
@@ -3786,6 +3786,7 @@
t/op/recurse.t See if deep recursion works
t/op/ref.t See if refs and objects work
t/op/reg_email.t See if regex recursion works by parsing email addresses
+t/op/reg_fold.t See if case folding works properly
t/op/regexp_noamp.t See if regular expressions work with optimizations
t/op/regexp_notrie.t See if regular expressions work without trie optimisation
t/op/regexp_qr_embed.t See if regular expressions work with embedded qr//
|
From @rgsOn 17/12/2007, demerphq <demerphq@gmail.com> wrote:
Thanks, applied as change #32628, except your patch 3032, where I Well, so, that's the last showstopper going off, right ? Andreas, if you could kick a BBC smoke, that would probably be helpful... |
From @druud62"Rafael Garcia-Suarez" schreef:
Ah, so maybe "Perl is 5.10<<2" can still be tomorrow? -- "Gewoon is een tijger." |
From @andk
> Andreas, if you could kick a BBC smoke, that would probably be helpful... My usual 1500 distribution smoke with stock options (no threads, no I have a few BBC articles on the backburner which I could not finish. A bit hot seems to be RT #31556 at the moment but it can probably Happy birthday everybody, |
@smpeters - Status changed from 'open' to 'resolved' |
Migrated from rt.perl.org#36207 (status was 'resolved')
Searchable as RT36207$
The text was updated successfully, but these errors were encountered: