Warning: Malformed UTF-8 character in substitution operation #10000

Closed
p5pRT opened this issue Dec 2, 2009 · 8 comments

p5pRT commented Dec 2, 2009

Migrated from rt.perl.org#70998 (status was 'resolved')

Searchable as RT70998$

p5pRT commented Dec 2, 2009

From slavenr@iconmobile.com

Created by slavenr@devpc01-debian.iconmobile.de

The following code worked fine with 5.8.8, but outputs a number of warnings
with 5.10.0 and 5.10.1:

#!/usr/bin/perl

use strict;
use warnings;

my %conv = (
  "\xab" => "<",
  "\xa9" => "(c)",
  );
my $conv_rx = '(' . join('|', map { quotemeta } keys %conv) . ')';
$conv_rx = qr{$conv_rx};

my $x = qq{\x{3042}\x{304b}\x{3055}\x{305f}\x{306a}\x{306f}\x{307e}\x{3084}\x{3089}\x{308f}\x{3093}\x{3042}\x{304b}\x{3055}\x{305f}\x{306a}\x{306f}\x{307e}\x{3084}\x{3089}\x{308f}\x{3093}\x{30a2}\x{30ab}\x{30b5}\x{30bf}\x{30ca}\x{30cf}\x{30de}\x{30e4}\x{30e9}\x{30ef}\x{30f3}\x{30a2}\x{30ab}\x{30b5}\x{30bf}\x{30ca}\x{30cf}\x{30de}\x{30e4}\x{30e9}\x{30ef}\x{30f3}\x{30a2}\x{30ab}\x{30b5}\x{30bf}\x{30ca}\x{30cf}\x{30de}\x{30e4}\x{30e9}\x{30ef}\x{30f3}};

$x =~ s{$conv_rx}{$conv{$1}}eg;

__END__

The warnings are:

Malformed UTF-8 character (unexpected continuation byte 0xab, with no preceding start byte) in substitution (s///) at /tmp/bla.pl line 15.
Malformed UTF-8 character (unexpected continuation byte 0xa9, with no preceding start byte) in substitution (s///) at /tmp/bla.pl line 15.
(etc)

Regards,
  Slaven

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.10.1:

Configured by slavenr at Wed Sep  2 17:15:52 CEST 2009.

Summary of my perl5 (revision 5 version 10 subversion 1) configuration:
   
  Platform:
    osname=linux, osvers=2.6.18-4-486, archname=i686-linux
    uname='linux devpc01-debian 2.6.18-4-486 #1 wed feb 21 15:25:16 utc 2007 i686 gnulinux '
    config_args='-ds -e -Dprefix=/usr/perl5.10.1'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.1.2 20061115 (prerelease) (Debian 4.1.1-21)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.9.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.9'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'

Locally applied patches:
    


@INC for perl 5.10.1:
    /usr/perl5.10.1/lib/5.10.1/i686-linux
    /usr/perl5.10.1/lib/5.10.1
    /usr/perl5.10.1/lib/site_perl/5.10.1/i686-linux
    /usr/perl5.10.1/lib/site_perl/5.10.1
    .


Environment for perl 5.10.1:
    HOME=/home/slavenr
    LANG=hr_HR.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/mysql-max-5.0.15-linux-i686/bin:/usr/local/netpbm/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/slavenr/bin/linux-gnu:/home/slavenr/bin/sh:/home/slavenr/bin:/usr/X11R6/bin:/usr/X11/bin:/usr/games:/home/slavenr/devel:/services/mobile/framework/bin/:/epoc/nokia60/epoc32/tools:/usr/local/er6/bin/:/services/mobile/clientframework/perl/bin/:/usr/lib/jvm/java-1.5.0-sun/bin:/usr/java/ant/bin
    PERLDOC=-MPod::Perldoc::ToTextOverstrike
    PERL_BADLANG (unset)
    PERL_HTML_DISPLAY_CLASS=HTML::Display::Mozilla
    SHELL=/usr/bin/zsh

p5pRT commented Dec 21, 2009

From @cpansprout

This bug (which was caused by change 28373/07be1b8) can be reduced to:

qq{\x{30ab}} =~ /\xab|\xa9/;

Malformed UTF-8 character (unexpected continuation byte 0xab, with no
preceding start byte) in pattern match (m//) at
/Users/sprout/Perl/5.8.7-regressions/70998.d/70998 copy line 2.

30ab is e3 82 ab in UTF-8.
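
To make the byte arithmetic concrete, here is a minimal C sketch of UTF-8 encoding for code points up to U+FFFF. The function name utf8_encode_bmp is made up for illustration and is not part of the perl source; surrogates and code points above U+FFFF are deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative helper (not from perl): encode one code point in the
 * range U+0000..U+FFFF as UTF-8. Returns the number of bytes written
 * to out (1 to 3), or 0 if cp is outside the range handled here.
 * Surrogate code points are not rejected, to keep the sketch short. */
static size_t utf8_encode_bmp(uint32_t cp, unsigned char out[3])
{
    if (cp < 0x80) {                 /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {              /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;
}
```

For cp = 0x30AB this writes the three bytes e3 82 ab, so the last byte of the encoded character has the same value as the Latin-1 character \xab that the pattern is looking for.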

The trie optimisation in S_find_byclass (added by change
28373/07be1b8) searches in e3 82 ab for a char matching [\xab\xa9],
and sets 2 (\xab) as its starting position. The code it passes control
to then stumbles across this ‘lone’ \xab.
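
The scan described above can be modelled outside perl. The following is only a sketch, not the code in regexec.c: is_start_byte is a hypothetical stand-in for the trie's bitmap test, and the two scan functions mimic the loop before and after the patch proposed in this ticket.

```c
#include <stddef.h>

/* Hypothetical stand-in for BITMAP_TEST: does this byte start one of
 * the alternatives \xab or \xa9? */
static int is_start_byte(unsigned char b)
{
    return b == 0xAB || b == 0xA9;
}

/* UTF-8 continuation bytes have the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Byte-wise scan, as in the loop the patch replaces: returns the
 * offset of the first candidate byte, or len if there is none. */
static size_t scan_bytewise(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len && !is_start_byte(s[i]))
        i++;
    return i;
}

/* Scan in the style of the proposed patch for UTF-8 strings: a
 * continuation byte is never accepted as a starting position. */
static size_t scan_skip_continuations(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len && (is_continuation(s[i]) || !is_start_byte(s[i])))
        i++;
    return i;
}
```

On the bytes e3 82 ab, scan_bytewise returns offset 2, which is the middle of a character, while scan_skip_continuations runs to the end of the buffer without reporting a candidate.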

The attached patch fixes this.

p5pRT commented Dec 21, 2009

From @cpansprout

Inline Patch
diff -Nurp blead/regexec.c blead-70998/regexec.c
--- blead/regexec.c	2009-12-12 00:57:11.000000000 -0800
+++ blead-70998/regexec.c	2009-12-13 23:26:41.000000000 -0800
@@ -1722,7 +1722,17 @@ S_find_byclass(pTHX_ regexp * prog, cons
                                             " Scanning for legal start char...\n");
                                     }
                                 );            
-                                while ( uc <= (U8*)last_start  && !BITMAP_TEST(bitmap,*uc) ) {
+                                if(do_utf8) /* Skip continuation octets */
+                                 while (    /* for UTF8 strings */
+                                  uc <= (U8*)last_start
+                                  &&(
+                                   UTF8_IS_CONTINUATION(*uc)
+                                    ||
+                                   !BITMAP_TEST(bitmap,*uc)
+                                  )
+                                 ) uc++;
+                                else while ( uc <= (U8*)last_start
+                                          && !BITMAP_TEST(bitmap,*uc) ) {
                                     uc++;
                                 }
                                 s= (char *)uc;
diff -Nurp blead/t/re/pat_rt_report.t blead-70998/t/re/pat_rt_report.t
--- blead/t/re/pat_rt_report.t	2009-11-19 08:51:40.000000000 -0800
+++ blead-70998/t/re/pat_rt_report.t	2009-12-13 21:54:00.000000000 -0800
@@ -21,7 +21,7 @@ BEGIN {
 }
 
 
-plan tests => 2511;  # Update this when adding/deleting tests.
+plan tests => 2512;  # Update this when adding/deleting tests.
 
 run_tests() unless caller;
 
@@ -1076,6 +1076,42 @@ sub run_tests {
            '(?>) does not cause wrongness on long string with UTF-8';
     }
 
+    {
+	local $BugId = 70998;
+	local $Message
+	 = 'utf8 =~ /trie/ where trie matches a continuation octet';
+
+	# Catch warnings:
+	my $w;
+	local $SIG{__WARN__} = sub { $w .= shift };
+
+	# This bug can be reduced to
+	qq{\x{30ab}} =~ /\xab|\xa9/;
+	# but it's nice to have a more 'real-world' test. The original test
+	# case from the RT ticket follows:
+
+	my %conv = (
+		    "\xab"     => "&lt;",
+		    "\xa9"     => "(c)",
+		   );
+	my $conv_rx = '(' . join('|', map { quotemeta } keys %conv) . ')';
+	$conv_rx = qr{$conv_rx};
+	
+	my $x
+	 = qq{\x{3042}\x{304b}\x{3055}\x{305f}\x{306a}\x{306f}\x{307e}}
+	 . qq{\x{3084}\x{3089}\x{308f}\x{3093}\x{3042}\x{304b}\x{3055}}
+	 . qq{\x{305f}\x{306a}\x{306f}\x{307e}\x{3084}\x{3089}\x{308f}}
+	 . qq{\x{3093}\x{30a2}\x{30ab}\x{30b5}\x{30bf}\x{30ca}\x{30cf}}
+	 . qq{\x{30de}\x{30e4}\x{30e9}\x{30ef}\x{30f3}\x{30a2}\x{30ab}}
+	 . qq{\x{30b5}\x{30bf}\x{30ca}\x{30cf}\x{30de}\x{30e4}\x{30e9}}
+	 . qq{\x{30ef}\x{30f3}\x{30a2}\x{30ab}\x{30b5}\x{30bf}\x{30ca}}
+	 . qq{\x{30cf}\x{30de}\x{30e4}\x{30e9}\x{30ef}\x{30f3}};
+	
+	$x =~ s{$conv_rx}{$conv{$1}}eg;
+
+	iseq($w,undef);
+    }
+
 
     #
     # Keep the following tests last -- they may crash perl

p5pRT commented Dec 21, 2009

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Dec 21, 2009

From @demerphq

2009/12/20 Father Chrysostomos <sprout@cpan.org>:

> This bug (which was caused by change 28373/07be1b8) can be reduced to:
>
> qq{\x{30ab}} =~ /\xab|\xa9/;
>
> Malformed UTF-8 character (unexpected continuation byte 0xab, with no
> preceding start byte) in pattern match (m//) at
> /Users/sprout/Perl/5.8.7-regressions/70998.d/70998 copy line 2.
>
> 30ab is e3 82 ab in UTF-8.
>
> The trie optimisation in S_find_byclass (added by change 28373/07be1b8)
> searches in e3 82 ab for a char matching [\xab\xa9], and sets 2 (\xab) as
> its starting position. The code it passes control to then stumbles across
> this ‘lone’ \xab.
>
> The attached patch fixes this.

Is this a problem/tested in blead?

I have a feeling that when I disabled certain trie functionality I
"fixed" this one.

Before this gets applied I'd like to know more.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented Dec 28, 2009

From @cpansprout

On Dec 21, 2009, at 2:46 AM, demerphq wrote:

> 2009/12/20 Father Chrysostomos <sprout@cpan.org>:
>
>> This bug (which was caused by change 28373/07be1b8) can be reduced to:
>>
>> qq{\x{30ab}} =~ /\xab|\xa9/;
>>
>> Malformed UTF-8 character (unexpected continuation byte 0xab, with no
>> preceding start byte) in pattern match (m//) at
>> /Users/sprout/Perl/5.8.7-regressions/70998.d/70998 copy line 2.
>>
>> 30ab is e3 82 ab in UTF-8.
>>
>> The trie optimisation in S_find_byclass (added by change 28373/07be1b8)
>> searches in e3 82 ab for a char matching [\xab\xa9], and sets 2 (\xab) as
>> its starting position. The code it passes control to then stumbles across
>> this ‘lone’ \xab.
>>
>> The attached patch fixes this.
>
> Is this a problem/tested in blead?
>
> I have a feeling that when I disabled certain trie functionality I
> "fixed" this one.
>
> Before this gets applied I'd like to know more.

I hope this is a sufficient answer:

$ perl5.11.3 -e 'qq{\x{30ab}} =~ /\xab|\xa9/'
Malformed UTF-8 character (unexpected continuation byte 0xab, with no
preceding start byte) in pattern match (m//) at -e line 1.

Whether mine is the best way to fix this I cannot say.

p5pRT commented Nov 2, 2010

From @demerphq

2009/12/27 Father Chrysostomos <sprout@cpan.org>:

> On Dec 21, 2009, at 2:46 AM, demerphq wrote:
>
>> 2009/12/20 Father Chrysostomos <sprout@cpan.org>:
>>
>>> This bug (which was caused by change 28373/07be1b8) can be reduced to:
>>>
>>> qq{\x{30ab}} =~ /\xab|\xa9/;
>>>
>>> Malformed UTF-8 character (unexpected continuation byte 0xab, with no
>>> preceding start byte) in pattern match (m//) at
>>> /Users/sprout/Perl/5.8.7-regressions/70998.d/70998 copy line 2.
>>>
>>> 30ab is e3 82 ab in UTF-8.
>>>
>>> The trie optimisation in S_find_byclass (added by change 28373/07be1b8)
>>> searches in e3 82 ab for a char matching [\xab\xa9], and sets 2 (\xab) as
>>> its starting position. The code it passes control to then stumbles across
>>> this ‘lone’ \xab.
>>>
>>> The attached patch fixes this.
>>
>> Is this a problem/tested in blead?
>>
>> I have a feeling that when I disabled certain trie functionality I
>> "fixed" this one.
>>
>> Before this gets applied I'd like to know more.
>
> I hope this is a sufficient answer:
>
> $ perl5.11.3 -e 'qq{\x{30ab}} =~ /\xab|\xa9/'
> Malformed UTF-8 character (unexpected continuation byte 0xab, with no
> preceding start byte) in pattern match (m//) at -e line 1.
>
> Whether mine is the best way to fix this I cannot say.

Thanks. I decided to fix it a different way, but I applied your test
code. Patches pushed to blead:

commit d085b49
Author: Yves Orton <demerphq@gmail.com>
Date: Tue Nov 2 11:29:18 2010 +0100

  Fix RT-70998: qq{\x{30ab}} =~ /\xab|\xa9/ produces warnings

commit aca5303
Author: Father Chrysostomos <sprout@cpan.org>
Date: Tue Nov 2 11:28:33 2010 +0100

  Add test for rt-70998: qq{\x{30ab}} =~ /\xab|\xa9/ produces warnings

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented Nov 2, 2010

@cpansprout - Status changed from 'open' to 'resolved'
