Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF8/Latin 1/i regexp "Malformed character" warning #7957

Closed
p5pRT opened this issue Jun 7, 2005 · 20 comments
Closed

UTF8/Latin 1/i regexp "Malformed character" warning #7957

p5pRT opened this issue Jun 7, 2005 · 20 comments

Comments

@p5pRT
Copy link

p5pRT commented Jun 7, 2005

Migrated from rt.perl.org#36207 (status was 'resolved')

Searchable as RT36207$

@p5pRT
Copy link
Author

p5pRT commented Jun 7, 2005

From @nwc10

Created by @nwc10

Stig came across this​:

./perl -Ilib -we '$term = "\xe9"; $target = "\xe9\x{100}"; chop $target; $target =~ /$term/i'
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xe9) in pattern match (m//) at -e line 1.

I'm not sure if this is similar or the same to a bug previously reported.

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl v5.9.3:

Configured by nick at Tue Jun  7 15:04:19 BST 2005.

Summary of my perl5 (revision 5 version 9 subversion 3 patch 24148) configuration:
  Platform:
    osname=linux, osvers=2.4.21-15.0.3.elsmp, archname=i686-linux
    uname='linux switch 2.4.21-15.0.3.elsmp #1 smp wed jul 7 09:34:05 edt 2004 i686 i686 i386 gnulinux '
    config_args='-Dusedevel=y -Dcc=ccache gcc -Dld=gcc -Ubincompat5005 -Uinstallusrbinperl -Dcf_email=nick@ccl4.org -Dperladmin=nick@ccl4.org -Dinc_version_list=  -Dinc_version_list_init=0 -Doptimize=-g -Uusethreads -Uuse64bitint -Duseperlio -Dusemymalloc -Dprefix=~/Sandpit/snap5.9.x-24728 -Dinstallman1dir=none -Dinstallman3dir=none -de'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=y, bincompat5005=undef
  Compiler:
    cc='ccache gcc', ccflags ='-DDEBUGGING -DPERL_COPY_ON_WRITE -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-g',
    cppflags='-DDEBUGGING -DPERL_COPY_ON_WRITE -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include'
    ccversion='', gccversion='3.2.3 20030502 (Red Hat Linux 3.2.3-49)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.3.2.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.3.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    


@INC for perl v5.9.3:
    lib
    /home/nick/Sandpit/snap5.9.x-24728/lib/perl5/5.9.3/i686-linux
    /home/nick/Sandpit/snap5.9.x-24728/lib/perl5/5.9.3
    /home/nick/Sandpit/snap5.9.x-24728/lib/perl5/site_perl/5.9.3/i686-linux
    /home/nick/Sandpit/snap5.9.x-24728/lib/perl5/site_perl/5.9.3
    /home/nick/Sandpit/snap5.9.x-24728/lib/perl5/site_perl
    .


Environment for perl v5.9.3:
    HOME=/home/nick
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/nick/bin:/usr/kerberos/bin:/usr/lib/ccache/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/sbin:/sbin:/usr/sbin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Jul 8, 2005

From @iabyn

On Tue, Jun 07, 2005 at 04​:28​:08PM -0000, Nicholas Clark wrote​:

./perl -Ilib -we '$term = "\xe9"; $target = "\xe9\x{100}"; chop $target; $target =~ /$term/i'
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xe9) in pattern match (m//) at -e line 1.

it turns out perl is totally borked for

  $utf8 =~ /latin1/i
and

  $latin1 =~ /$utf8/i

unless all the chars happen to be < 0x7f.

The patch below fixes the first case​: the fault was in ibcmp_utf8(),
which was calling to_utf8_fold() without first converting the char to
utf8.

The second case fails in S_find_byclass() in regexec.c​:

  while (s <= e) {
  if ( (*(U8*)s == c1 || *(U8*)s == c2)
  && (ln == 1 || !(OP(c) == EXACTF
  ? ibcmp(s, m, ln)
  : ibcmp_locale(s, m, ln)))
  && (norun || regtry(prog, s)) )
  goto got_it;
  s++;
  }

where it calls ibcmp() but s is a pointer to a latin1 while m is pointer
to a utf8 char. I don't really understand this code well enough to feel
confident fixing it. In fact I'm not even totally confident in my fix for
the first case, hence I'm Ccing everyone's Favourite Finn.

Dave.

--
"But Sidley Park is already a picture, and a most amiable picture too.
The slopes are green and gentle. The trees are companionably grouped at
intervals that show them to advantage. The rill is a serpentine ribbon
unwound from the lake peaceably contained by meadows on which the right
amount of sheep are tastefully arranged." -- Lady Croom - Arcadia

Change 25095 by davem@​davem-splatty on 2005/07/08 01​:43​:24

  [perl #36207] UTF8/Latin 1/i regexp "Malformed character" warning
  $utf8 =~ /latin/i didn't match.
  Also added TODO for $latin =~ /utf8/i which also fails

Affected files ...

... //depot/perl/t/op/pat.t#222 edit
... //depot/perl/utf8.c#239 edit

Differences ...

==== //depot/perl/t/op/pat.t#222 (xtext) ====

@​@​ -6,7 +6,7 @​@​

$| = 1;

-print "1..1178\n";
+print "1..1180\n";

BEGIN {
  chdir 't' if -d 't';
@​@​ -3364,4 +3364,14 @​@​
  my $psycho=join "|",@​normal,map chr $_,255..20000;
  ok(('these'=/($psycho)/) && $1 eq 'these','Pyscho');
}
-# last test 1178
+
+# [perl #36207] mixed utf8 / latin-1 and case folding
+
+{
+ my $u = "\xe9\x{100}";
+ chop $u;
+ ok($u =
/\xe9/i, "utf8/latin");
+ ok("\xe9" =~ /$u/i, "# TODO latin/utf8");
+}
+
+# last test 1180

==== //depot/perl/utf8.c#239 (text) ====

@​@​ -2037,7 +2037,7 @​@​
  if (u1)
  to_utf8_fold(p1, foldbuf1, &foldlen1);
  else {
- natbuf[0] = *p1;
+ uvuni_to_utf8(natbuf, (UV) NATIVE_TO_UNI(((UV)*p1)));
  to_utf8_fold(natbuf, foldbuf1, &foldlen1);
  }
  q1 = foldbuf1;
@​@​ -2047,7 +2047,7 @​@​
  if (u2)
  to_utf8_fold(p2, foldbuf2, &foldlen2);
  else {
- natbuf[0] = *p2;
+ uvuni_to_utf8(natbuf, (UV) NATIVE_TO_UNI(((UV)*p2)));
  to_utf8_fold(natbuf, foldbuf2, &foldlen2);
  }
  q2 = foldbuf2;

@p5pRT
Copy link
Author

p5pRT commented Jul 8, 2005

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Jul 8, 2005

From @jhi

it turns out perl is totally borked for

That bugs of this level are found only now is telling something about
how much people actually use the Unicode support of Perl. Well, and
something about how suckily I tested it.

The patch below fixes the first case​: the fault was in ibcmp_utf8(),

Looks good, off-hand. (Feel free to quote me on haggis, though.)

which was calling to_utf8_fold() without first converting the char to
utf8.

The second case fails in S_find_byclass() in regexec.c​:

        while \(s \<= e\) \{
        if \( \(\*\(U8\*\)s == c1 || \*\(U8\*\)s == c2\)
             && \(ln == 1 || \!\(OP\(c\) == EXACTF
                      ? ibcmp\(s\, m\, ln\)
                      : ibcmp\_locale\(s\, m\, ln\)\)\)
             && \(norun || regtry\(prog\, s\)\) \)
            goto got\_it;
        s\+\+;
        \}

where it calls ibcmp() but s is a pointer to a latin1 while m is pointer
to a utf8 char. I don't really understand this code well enough to feel
confident fixing it.

I won't have time to look into this anytime soon, but I think the fix
for the second case shouldn't be too hard to find. First of all, if
either the matcher or the matchee are UTF-8, we should never ever ever
end up calling the bare ibcmp(), but instead ibcmp_utf8() with the right
combination of flags to indicate which or both is UTF-8.

You might want to extend the tests also to run-time​:

my $utf8 = "\x{e9}\x{100}"; chop $utf8;
my $latin1 = "\x{e9}";

ok($utf8 =~ /\xe9/i);
ok("\xe9" =~ /$utf8/i);

ok($utf8 =~ /$latin1/i);
ok($latin1 =~ /$utf8/i);

@p5pRT
Copy link
Author

p5pRT commented Jul 8, 2005

From @demerphq

On 7/8/05, Dave Mitchell <davem@​iabyn.com> wrote​:

On Tue, Jun 07, 2005 at 04​:28​:08PM -0000, Nicholas Clark wrote​:

./perl -Ilib -we '$term = "\xe9"; $target = "\xe9\x{100}"; chop $target; $target =~ /$term/i'
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xe9) in pattern match (m//) at -e line 1.

it turns out perl is totally borked for

$utf8 =~ /latin1/i

and

$latin1 =~ /$utf8/i

unless all the chars happen to be < 0x7f.

The case where the pattern is /(foo|bar)/ is handled by a totally
different codepath in blead, does it also fail there? I seem to recall
that I put in tests for this, but possibly im wrong. Im flying on
holiday in less than 24 hours and i doubt Ill be able to check until i
return at the end of the month.

cheers,
yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Jul 8, 2005

From @nwc10

On Fri, Jul 08, 2005 at 09​:45​:32AM +0300, Jarkko Hietaniemi wrote​:

it turns out perl is totally borked for

That bugs of this level are found only now is telling something about
how much people actually use the Unicode support of Perl. Well, and
something about how suckily I tested it.

The bugs are actually subtle in that they are in the mixed bytes (mumble
Latin 1)/UTF-8 code. People could be quite happily using the Unicode support,
where all their input data is already in UTF-8, or is in (say) Shift-JIS and
hence has to be converted on input, and hence never have cause to go down
these code paths.

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Jul 8, 2005

From @demerphq

On 7/8/05, Dave Mitchell <davem@​iabyn.com> wrote​:

On Fri, Jul 08, 2005 at 09​:24​:42AM +0200, demerphq wrote​:

it turns out perl is totally borked for

$utf8 =~ /latin1/i

and

$latin1 =~ /$utf8/i

unless all the chars happen to be < 0x7f.

The case where the pattern is /(foo|bar)/ is handled by a totally
different codepath in blead, does it also fail there? I seem to recall
that I put in tests for this, but possibly im wrong. Im flying on
holiday in less than 24 hours and i doubt Ill be able to check until i
return at the end of the month.

$ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if $x=~/(abc|\xe9)/i'
1

$ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if "\xe9"=~/(abc|$x)/i'
Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xe9) in pattern match (m//) at -e line 1.

Well, i guess half wrong is better than all wrong but that still
sucks. Maybe ill get some time to put together a patch on the plane.

thanks tho.

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Jul 8, 2005

From @iabyn

On Fri, Jul 08, 2005 at 09​:24​:42AM +0200, demerphq wrote​:

it turns out perl is totally borked for

$utf8 =~ /latin1/i

and

$latin1 =~ /$utf8/i

unless all the chars happen to be < 0x7f.

The case where the pattern is /(foo|bar)/ is handled by a totally
different codepath in blead, does it also fail there? I seem to recall
that I put in tests for this, but possibly im wrong. Im flying on
holiday in less than 24 hours and i doubt Ill be able to check until i
return at the end of the month.

$ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if $x=~/(abc|\xe9)/i'
1

$ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if "\xe9"=~/(abc|$x)/i'
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xe9) in pattern match (m//) at -e line 1.

--
Never do today what you can put off till tomorrow.

@p5pRT
Copy link
Author

p5pRT commented Jul 8, 2005

From @demerphq

On 7/8/05, Dave Mitchell <davem@​iabyn.com> wrote​:

On Fri, Jul 08, 2005 at 09​:24​:42AM +0200, demerphq wrote​:

it turns out perl is totally borked for

$utf8 =~ /latin1/i

and

$latin1 =~ /$utf8/i

unless all the chars happen to be < 0x7f.

The case where the pattern is /(foo|bar)/ is handled by a totally
different codepath in blead, does it also fail there? I seem to recall
that I put in tests for this, but possibly im wrong. Im flying on
holiday in less than 24 hours and i doubt Ill be able to check until i
return at the end of the month.

$ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if $x=~/(abc|\xe9)/i'
1

$ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if "\xe9"=~/(abc|$x)/i'
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xe9) in pattern match (m//) at -e line 1.

Attached patch fixes it in the TRIE(FL?)? code afaict.

D​:\dev\perl\live>perl -Ilib -wle "$x=qq[\xe9\x{100}]; chop $x; print 1
if qq[\xe9]=~/(abc|$x)/i"
1

D​:\dev\perl\live>perl -Ilib -wle "$x=qq[\xe9\x{100}]; chop $x; print 1
if $x=~/(abc|\xe9)/i"
1

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Jul 8, 2005

From @demerphq

fix_utf.patch
--- regexec.c.bak	2005-07-08 17:02:45.171875000 +0200
+++ regexec.c	2005-07-08 17:03:02.156250000 +0200
@@ -2612,7 +2612,7 @@
 
 		    if ( base ) {
 
-			if ( do_utf8 || UTF ) {
+			if ( do_utf8 ) {
 			    if ( foldlen>0 ) {
 				uvc = utf8n_to_uvuni( uscan, UTF8_MAXLEN, &len, uniflags );
 				foldlen -= len;
@@ -2678,7 +2678,7 @@
 
 		    if ( base ) {
 
-			if ( do_utf8 || UTF ) {
+			if ( do_utf8 ) {
 			    uvc = utf8n_to_uvuni( (U8*)uc, UTF8_MAXLEN, &len, uniflags );
 			} else {
 			    uvc = (U32)*uc;

@p5pRT
Copy link
Author

p5pRT commented Jul 10, 2005

From @iabyn

On Fri, Jul 08, 2005 at 09​:45​:32AM +0300, Jarkko Hietaniemi wrote​:

I won't have time to look into this anytime soon, but I think the fix
for the second case shouldn't be too hard to find. First of all, if
either the matcher or the matchee are UTF-8, we should never ever ever
end up calling the bare ibcmp(), but instead ibcmp_utf8() with the right
combination of flags to indicate which or both is UTF-8.

if ibcmp is replaced with ibcmp_utf8, what should ibcmp_locale be replaced
with?

--
The Enterprise is captured by a vastly superior alien intelligence which
does not put them on trial.
  -- Things That Never Happen in "Star Trek" #10

@p5pRT
Copy link
Author

p5pRT commented Jul 10, 2005

From @iabyn

On Fri, Jul 08, 2005 at 05​:07​:26PM +0200, demerphq wrote​:

Attached patch fixes it in the TRIE(FL?)? code afaict.

D​:\dev\perl\live>perl -Ilib -wle "$x=qq[\xe9\x{100}]; chop $x; print 1
if qq[\xe9]=~/(abc|$x)/i"
1

D​:\dev\perl\live>perl -Ilib -wle "$x=qq[\xe9\x{100}]; chop $x; print 1
if $x=~/(abc|\xe9)/i"
1

thanks, applied as change #25106
I also added some tests.

--
That he said that that that that is is is debatable, is debatable.

@p5pRT
Copy link
Author

p5pRT commented Jul 10, 2005

From @jhi

Dave Mitchell wrote​:

On Fri, Jul 08, 2005 at 09​:45​:32AM +0300, Jarkko Hietaniemi wrote​:

I won't have time to look into this anytime soon, but I think the fix
for the second case shouldn't be too hard to find. First of all, if
either the matcher or the matchee are UTF-8, we should never ever ever
end up calling the bare ibcmp(), but instead ibcmp_utf8() with the right
combination of flags to indicate which or both is UTF-8.

if ibcmp is replaced with ibcmp_utf8, what should ibcmp_locale be replaced
with?

It depends, and not "replaced with". If it is a UTF-8 locale,
the ibcmp_utf8() should be used, otherwise I *guess* the ibcmp_locale()
as it is now should be used.

@p5pRT
Copy link
Author

p5pRT commented Dec 17, 2007

From @demerphq

Hi, this patch should resolve the bugs that Slaven found in the regex
engine with case insensitive matches. There were actually several bugs
all interacting with each other, one of which is quite old. These were
all interacting to produce strange results.

The original bug was that

$ord=181; $u=$c=chr($ord); utf8​::upgrade $u;
$u=~/$c|xyz/i;

was not working. This then lead to the discovery that

$u=~/$c/i

was not working for $ord>127.

This in turn lead to the discovery that when ibcmp_utf8() is called
without a defined endpoint for each string it automatically and
silently returns no match, which the regex was doing. I added asserts
to catch this case under DEBUGGING.

Following this I found that the FOLDCHAR regop was not being produced
when a codepoint it is supposed to match is escaped. This was also
fixed.

And lastly I discovered some unnecessary code and added some comments
about strangeness in the code.

The file 3030-up.patch is the full deal as one. The others are each
stage of the process.

The only thing im not feeling confident about is whether i was too
aggressive with the asserts. Its possible i have set it up to assert
in a legitimate use case. Im still waiting to see.

Ive included a single patch that has all the changes and a tar.gz of
each of my local commits.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Dec 17, 2007

From @demerphq

3030-up.patch
Index: regcomp.c
===================================================================
--- regcomp.c	(revision 3030)
+++ regcomp.c	(revision 3037)
@@ -2791,7 +2791,15 @@
                                     last = cur;
                                 }
                             } else {
-                                if ( last ) {
+/* 
+    Currently we assume that the trie can handle unicode and ascii
+    matches fold cased matches. If this proves true then the following
+    define will prevent tries in this situation. 
+    
+    #define TRIE_TYPE_IS_SAFE (UTF || optype==EXACT)
+*/
+#define TRIE_TYPE_IS_SAFE 1
+                                if ( last && TRIE_TYPE_IS_SAFE ) {
                                     make_trie( pRExC_state, 
                                             startbranch, first, cur, tail, count, 
                                             optype, depth+1 );
@@ -2819,7 +2827,8 @@
                               "", SvPV_nolen_const( mysv ),REG_NODE_NUM(cur));
 
                         });
-                        if ( last ) {
+                        
+                        if ( last && TRIE_TYPE_IS_SAFE ) {
                             made= make_trie( pRExC_state, startbranch, first, scan, tail, count, optype, depth+1 );
 #ifdef TRIE_STUDY_OPT	
                             if ( ((made == MADE_EXACT_TRIE && 
@@ -6867,6 +6876,7 @@
     case 0xDF:
     case 0xC3:
     case 0xCE:
+        do_foldchar:
         if (!LOC && FOLD) {
             U32 len,cp;
 	    len=0; /* silence a spurious compiler warning */
@@ -6893,7 +6903,11 @@
 	   required, as the default for this switch is to jump to the
 	   literal text handling code.
 	*/
-	switch (*++RExC_parse) {
+	switch ((U8)*++RExC_parse) {
+	case 0xDF:
+	case 0xC3:
+	case 0xCE:
+	           goto do_foldchar;	    
 	/* Special Escapes */
 	case 'A':
 	    RExC_seen_zerolen++;
@@ -7211,8 +7225,13 @@
 		       an unescaped equivalent literal.
 		    */
 
-		    switch (*++p) {
+		    switch ((U8)*++p) {
 		    /* These are all the special escapes. */
+    		    case 0xDF:
+    		    case 0xC3:
+    		    case 0xCE:
+    		           if (LOC || !FOLD || !is_TRICKYFOLD_safe(p,RExC_end,UTF))
+    		                goto normal_default;		    
 		    case 'A':             /* Start assertion */
 		    case 'b': case 'B':   /* Word-boundary assertion*/
 		    case 'C':             /* Single char !DANGEROUS! */
Index: utf8.c
===================================================================
--- utf8.c	(revision 3030)
+++ utf8.c	(revision 3037)
@@ -2228,6 +2228,11 @@
 that define an explicit length must reach their goal pointers for
 a match to succeed).
 
+It is mandatory to provide an endpoint for both strings being compared, 
+although this may be either through the pe1/pe2 or l1/l2 parameters. 
+Otherwise under debugging it will fail an assert and under
+release it will silently return "does not match".
+
 For case-insensitiveness, the "casefolding" of Unicode is used
 instead of upper/lowercasing both the characters, see
 http://www.unicode.org/unicode/reports/tr21/ (Case Mappings).
@@ -2254,13 +2259,16 @@
      
      if (pe1)
 	  e1 = *(U8**)pe1;
+     assert(e1 || l1);
      if (e1 == 0 || (l1 && l1 < (UV)(e1 - (const U8*)s1)))
 	  f1 = (const U8*)s1 + l1;
      if (pe2)
 	  e2 = *(U8**)pe2;
+     assert(e2 || l2);
      if (e2 == 0 || (l2 && l2 < (UV)(e2 - (const U8*)s2)))
 	  f2 = (const U8*)s2 + l2;
 
+     assert((e1 == 0 && f1 == 0) || (e2 == 0 && f2 == 0) || (f1 == 0 && f2 == 0));
      if ((e1 == 0 && f1 == 0) || (e2 == 0 && f2 == 0) || (f1 == 0 && f2 == 0))
 	  return 1; /* mismatch; possible infinite loop or false positive */
 
Index: t/op/reg_fold.t
===================================================================
--- t/op/reg_fold.t	(revision 0)
+++ t/op/reg_fold.t	(revision 3037)
@@ -0,0 +1,37 @@
+use strict;
+use warnings;
+use Test::More;
+my $count=1;
+my @tests;
+use Cwd;
+
+my $file="../lib/unicore/CaseFolding.txt";
+open my $fh,"<",$file 
+    or die "Failed to read '$file' from '".cwd()."': $!";
+while (<$fh>) {
+    chomp;
+    my ($line,$comment)= split/\s+#\s+/, $_;
+    my ($cp,$type,@fc)=split/[\s;]+/,$line||'';
+    next unless $type and ($type eq 'F' or $type eq 'C');
+    $_="\\x{$_}" for @fc;
+    my $cpv=hex("0x$cp");
+    my $chr="chr(0x$cp)";
+    my @str;
+    push @str,$chr if $cpv<128 or $cpv>256;
+    if ($cpv<256) {
+        push @str,"do{my \$c=$chr; utf8::upgrade(\$c); \$c}"
+    }
+    
+    foreach my $str ( @str ) {    
+        my $expr="$str=~/@fc/ix";
+        my $t=($cpv > 256 || $str=~/^do/) ? "unicode" : "latin";
+        push @tests,
+            qq[ok($expr,'$chr=~/@fc/ix - $comment ($t string)')];
+        $tests[-1]="TODO: { local \$TODO='[13:41] <BinGOs> cue *It is all Greek to me* joke.';\n$tests[-1] }"
+            if $cp eq '0390' or $cp eq '03B0';
+        $count++;
+    }
+}    
+eval join ";\n","plan tests=>".($count-1),@tests,"1"
+    or die $@;
+__DATA__
Index: t/op/pat.t
===================================================================
--- t/op/pat.t	(revision 3030)
+++ t/op/pat.t	(revision 3037)
@@ -3406,9 +3406,9 @@
     ok($utf8 =~ /(abc|\xe9)/i, "utf8/latin trie");
     ok($utf8 =~ /(abc|$latin1)/i, "utf8/latin trie runtime");
 
-    ok("\xe9" =~ /$utf8/i, "# TODO latin/utf8");
+    ok("\xe9" =~ /$utf8/i, "# latin/utf8");
     ok("\xe9" =~ /(abc|$utf8)/i, "# latin/utf8 trie");
-    ok($latin1 =~ /$utf8/i, "# TODO latin/utf8 runtime");
+    ok($latin1 =~ /$utf8/i, "# latin/utf8 runtime");
     ok($latin1 =~ /(abc|$utf8)/i, "# latin/utf8 trie runtime");
 }
 
@@ -4487,6 +4487,23 @@
     iseq($1,"\xd6","#45605");
 }
 
+{
+    # Regardless of utf8ness any character matches itself when 
+    # doing a case insensitive match. See also [perl #36207] 
+    for my $o (0..255) {
+        my @ch=(chr($o),chr($o));
+        utf8::upgrade($ch[1]);
+        for my $u_str (0,1) {
+            for my $u_pat (0,1) {
+                ok( $ch[$u_str]=~/\Q$ch[$u_pat]\E/i,
+                    "\$c=~/\$c/i : chr($o) : u_str=$u_str u_pat=$u_pat");
+                ok( $ch[$u_str]=~/\Q$ch[$u_pat]\E|xyz/i,
+                "# \$c=~/\$c|xyz/i : chr($o) : u_str=$u_str u_pat=$u_pat");
+            }
+        }
+    }
+}
+
 # Test counter is at bottom of file. Put new tests above here.
 #-------------------------------------------------------------------
 # Keep the following tests last -- they may crash perl
@@ -4545,6 +4562,6 @@
 iseq(0+$::test,$::TestCount,"Got the right number of tests!");
 # Don't forget to update this!
 BEGIN {
-    $::TestCount = 1965;
+    $::TestCount = 4013;
     print "1..$::TestCount\n";
 }
Index: regexec.c
===================================================================
--- regexec.c	(revision 3030)
+++ regexec.c	(revision 3037)
@@ -989,8 +989,12 @@
     return NULL;
 }
 
+#define DECL_TRIE_TYPE(scan) \
+    const enum { trie_plain, trie_utf8, trie_utf8_fold, trie_latin_utf8_fold } \
+		    trie_type = (scan->flags != EXACT) \
+		              ? (do_utf8 ? trie_utf8_fold : (UTF ? trie_latin_utf8_fold : trie_plain)) \
+                              : (do_utf8 ? trie_utf8 : trie_plain)
 
-
 #define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uscan, len,  \
 uvc, charid, foldlen, foldbuf, uniflags) STMT_START {                       \
     switch (trie_type) {                                                    \
@@ -1007,6 +1011,19 @@
 	    uscan = foldbuf + UNISKIP( uvc );                               \
 	}                                                                   \
 	break;                                                              \
+    case trie_latin_utf8_fold:                                              \
+	if ( foldlen>0 ) {                                                  \
+	    uvc = utf8n_to_uvuni( uscan, UTF8_MAXLEN, &len, uniflags );     \
+	    foldlen -= len;                                                 \
+	    uscan += len;                                                   \
+	    len=0;                                                          \
+	} else {                                                            \
+	    len = 1;                                                        \
+	    uvc = to_uni_fold( *(U8*)uc, foldbuf, &foldlen );               \
+	    foldlen -= UNISKIP( uvc );                                      \
+	    uscan = foldbuf + UNISKIP( uvc );                               \
+	}                                                                   \
+	break;                                                              \
     case trie_utf8:                                                         \
 	uvc = utf8n_to_uvuni( (U8*)uc, UTF8_MAXLEN, &len, uniflags );       \
 	break;                                                              \
@@ -1029,12 +1046,14 @@
     }                                                                       \
 } STMT_END
 
-#define REXEC_FBC_EXACTISH_CHECK(CoNd)                  \
+#define REXEC_FBC_EXACTISH_CHECK(CoNd)                 \
+{                                                      \
+    char *my_strend= (char *)strend;                   \
     if ( (CoNd)                                        \
 	 && (ln == len ||                              \
-	     ibcmp_utf8(s, NULL, 0,  do_utf8,          \
+	     !ibcmp_utf8(s, &my_strend, 0,  do_utf8,   \
 			m, NULL, ln, (bool)UTF))       \
-	 && (!reginfo || regtry(reginfo, &s)) )         \
+	 && (!reginfo || regtry(reginfo, &s)) )        \
 	goto got_it;                                   \
     else {                                             \
 	 U8 foldbuf[UTF8_MAXBYTES_CASE+1];             \
@@ -1042,15 +1061,14 @@
 	 f = to_utf8_fold(tmpbuf, foldbuf, &foldlen);  \
 	 if ( f != c                                   \
 	      && (f == c1 || f == c2)                  \
-	      && (ln == foldlen ||                     \
-		  !ibcmp_utf8((char *) foldbuf,        \
-			      NULL, foldlen, do_utf8,  \
-			      m,                       \
-			      NULL, ln, (bool)UTF))    \
-	      && (!reginfo || regtry(reginfo, &s)) )    \
+	      && (ln == len ||                         \
+	        !ibcmp_utf8(s, &my_strend, 0,  do_utf8,\
+			      m, NULL, ln, (bool)UTF)) \
+	      && (!reginfo || regtry(reginfo, &s)) )   \
 	      goto got_it;                             \
     }                                                  \
-    s += len
+}                                                      \
+s += len
 
 #define REXEC_FBC_EXACTISH_SCAN(CoNd)                     \
 STMT_START {                                              \
@@ -1210,14 +1228,26 @@
 		U8 tmpbuf1[UTF8_MAXBYTES_CASE+1];
 		U8 tmpbuf2[UTF8_MAXBYTES_CASE+1];
 		const U32 uniflags = UTF8_ALLOW_DEFAULT;
+		
+                /* XXX: Since the node will be case folded at compile
+                   time this logic is a little odd, although im not 
+                   sure that its actually wrong. --dmq */
+                   
+		c1 = to_utf8_lower((U8*)m, tmpbuf1, &ulen1);
+		c2 = to_utf8_upper((U8*)m, tmpbuf2, &ulen2);
 
-		to_utf8_lower((U8*)m, tmpbuf1, &ulen1);
-		to_utf8_upper((U8*)m, tmpbuf2, &ulen2);
-
+		/* XXX: This is kinda strange. to_utf8_XYZ returns the 
+                   codepoint of the first character in the converted
+                   form, yet originally we did the extra step. 
+                   No tests fail by commenting this code out however
+                   so Ive left it out. -- dmq.
+                   
 		c1 = utf8n_to_uvchr(tmpbuf1, UTF8_MAXBYTES_CASE, 
 				    0, uniflags);
 		c2 = utf8n_to_uvchr(tmpbuf2, UTF8_MAXBYTES_CASE,
 				    0, uniflags);
+                */
+                
 		lnc = 0;
 		while (sm < ((U8 *) m + ln)) {
 		    lnc++;
@@ -1252,24 +1282,33 @@
 	     * matching (called "loose matching" in Unicode).
 	     * ibcmp_utf8() will do just that. */
 
-	    if (do_utf8) {
+	    if (do_utf8 || UTF) {
 	        UV c, f;
 	        U8 tmpbuf [UTF8_MAXBYTES+1];
-		STRLEN len, foldlen;
+		STRLEN len = 1;
+		STRLEN foldlen;
 		const U32 uniflags = UTF8_ALLOW_DEFAULT;
 		if (c1 == c2) {
 		    /* Upper and lower of 1st char are equal -
 		     * probably not a "letter". */
 		    while (s <= e) {
-		        c = utf8n_to_uvchr((U8*)s, UTF8_MAXBYTES, &len,
+		        if (do_utf8) {
+		            c = utf8n_to_uvchr((U8*)s, UTF8_MAXBYTES, &len,
 					   uniflags);
+                        } else {
+                            c = *((U8*)s);
+                        }					  
 			REXEC_FBC_EXACTISH_CHECK(c == c1);
 		    }
 		}
 		else {
 		    while (s <= e) {
-		      c = utf8n_to_uvchr((U8*)s, UTF8_MAXBYTES, &len,
+		        if (do_utf8) {
+		            c = utf8n_to_uvchr((U8*)s, UTF8_MAXBYTES, &len,
 					   uniflags);
+                        } else {
+                            c = *((U8*)s);
+                        }
 
 			/* Handle some of the three Greek sigmas cases.
 			 * Note that not all the possible combinations
@@ -1287,6 +1326,7 @@
 		}
 	    }
 	    else {
+	        /* Neither pattern nor string are UTF8 */
 		if (c1 == c2)
 		    REXEC_FBC_EXACTISH_SCAN(*(U8*)s == c1);
 		else
@@ -1461,10 +1501,7 @@
 	case AHOCORASICKC:
 	case AHOCORASICK: 
 	    {
-	        const enum { trie_plain, trie_utf8, trie_utf8_fold }
-		    trie_type = do_utf8 ?
-			  (c->flags == EXACT ? trie_utf8 : trie_utf8_fold)
-			: trie_plain;
+	        DECL_TRIE_TYPE(c);
                 /* what trie are we using right now */
         	reg_ac_data *aho
         	    = (reg_ac_data*)progi->data->data[ ARG( c ) ];
@@ -2872,10 +2909,7 @@
 	case TRIE:
 	    {
                 /* what type of TRIE am I? (utf8 makes this contextual) */
-	        const enum { trie_plain, trie_utf8, trie_utf8_fold }
-		    trie_type = do_utf8 ?
-			  (scan->flags == EXACT ? trie_utf8 : trie_utf8_fold)
-			: trie_plain;
+                DECL_TRIE_TYPE(scan);
 
                 /* what trie are we using right now */
 		reg_trie_data * const trie
Index: MANIFEST
===================================================================
--- MANIFEST	(revision 3030)
+++ MANIFEST	(revision 3037)
@@ -3786,6 +3786,7 @@
 t/op/recurse.t			See if deep recursion works
 t/op/ref.t			See if refs and objects work
 t/op/reg_email.t		See if regex recursion works by parsing email addresses
+t/op/reg_fold.t			See if case folding works properly
 t/op/regexp_noamp.t		See if regular expressions work with optimizations
 t/op/regexp_notrie.t		See if regular expressions work without trie optimisation
 t/op/regexp_qr_embed.t		See if regular expressions work with embedded qr//

@p5pRT
Copy link
Author

p5pRT commented Dec 17, 2007

From @demerphq

3030-3037-patch.tar.gz

@p5pRT
Copy link
Author

p5pRT commented Dec 17, 2007

From @rgs

On 17/12/2007, demerphq <demerphq@​gmail.com> wrote​:

Hi, this patch should resolve the bugs that Slaven found in the regex
engine with case insensitive matches. There were actually several bugs
all interacting with each other, one of which is quite old. These were
all interacting to produce strange results.

The original bug was that

$ord=181; $u=$c=chr($ord); utf8​::upgrade $u;
$u=~/$c|xyz/i;

was not working. This then lead to the discovery that

$u=~/$c/i

was not working for $ord>127.

This in turn lead to the discovery that when ibcmp_utf8() is called
without a defined endpoint for each string it automatically and
silently returns no match, which the regex was doing. I added asserts
to catch this case under DEBUGGING.

Following this I found that the FOLDCHAR regop was not being produced
when a codepoint it is supposed to match is escaped. This was also
fixed.

And lastly I discovered some unnecessary code and added some comments
about strangeness in the code.

The file 3030-up.patch is the full deal as one. The others are each
stage of the process.

The only thing im not feeling confident about is whether i was too
aggressive with the asserts. Its possible i have set it up to assert
in a legitimate use case. Im still waiting to see.

Ive included a single patch that has all the changes and a tar.gz of
each of my local commits.

Thanks, applied as change #32628, except your patch 3032, where I
commented out the asserts.

Well, so, that's the last showstopper going off, right ?

Andreas, if you could kick a BBC smoke, that would probably be helpful...

@p5pRT
Copy link
Author

p5pRT commented Dec 17, 2007

From @druud62

"Rafael Garcia-Suarez" schreef​:

demerphq wrote​:

Hi, this patch should resolve the bugs that Slaven found in the regex
engine with case insensitive matches. There were actually several
bugs all interacting with each other, one of which is quite old.
These were all interacting to produce strange results. [...]

Thanks, applied as change #32628, except your patch 3032, where I
commented out the asserts.

Well, so, that's the last showstopper going off, right ?

Ah, so maybe "Perl is 5.10<<2" can still be tomorrow?

--
Affijn, Ruud

"Gewoon is een tijger."

@p5pRT
Copy link
Author

p5pRT commented Dec 18, 2007

From @andk

On Mon, 17 Dec 2007 17​:05​:43 +0100, "Rafael Garcia-Suarez" <rgarciasuarez@​gmail.com> said​:

  > Andreas, if you could kick a BBC smoke, that would probably be helpful...

My usual 1500 distribution smoke with stock options (no threads, no
debugging, no 64bitint) completed successfully without surprises.

I have a few BBC articles on the backburner which I could not finish.
But as far as I can see, no showstoppers.

A bit hot seems to be RT #31556 at the moment but it can probably
wait.

Happy birthday everybody,
--
andreas

@p5pRT
Copy link
Author

p5pRT commented Nov 25, 2008

@smpeters - Status changed from 'open' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant