Warning: Malformed UTF-8 character in substitution operation #10000
From slavenr@iconmobile.com
Created by slavenr@devpc01-debian.iconmobile.de
The following code worked fine with 5.8.8, but outputs a number of warnings
#!/usr/bin/perl
use strict;
my %conv = (
my $x = qq{\x{3042}\x{304b}\x{3055}\x{305f}\x{306a}\x{306f}\x{307e}\x{3084}\x{3089}\x{308f}\x{3093}\x{3042}\x{304b}\x{3055}\x{305f}\x{306a}\x{306f}\x{307e}\x{3084}\x{3089}\x{308f}\x{3093}\x{30a2}\x{30ab}\x{30b5}\x{30bf}\x{30ca}\x{30cf}\x{30de}\x{30e4}\x{30e9}\x{30ef}\x{30f3}\x{30a2}\x{30ab}\x{30b5}\x{30bf}\x{30ca}\x{30cf}\x{30de}\x{30e4}\x{30e9}\x{30ef}\x{30f3}\x{30a2}\x{30ab}\x{30b5}\x{30bf}\x{30ca}\x{30cf}\x{30de}\x{30e4}\x{30e9}\x{30ef}\x{30f3}};
$x =~ s{$conv_rx}{$conv{$1}}eg;
__END__
The warnings are:
Malformed UTF-8 character (unexpected continuation byte 0xab, with no preceding start byte) in substitution (s///) at /tmp/bla.pl line 15.
Regards,
Perl Info
From @cpansprout
This bug (which was caused by change 28373/07be1b8) can be reduced to:
qq{\x{30ab}} =~ /\xab|\xa9/;
Malformed UTF-8 character (unexpected continuation byte 0xab, with no
30ab is e3 82 ab in UTF-8. The trie optimisation in S_find_byclass (added by change
The attached patch fixes this.
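The byte arithmetic behind "30ab is e3 82 ab in UTF-8" can be checked with a minimal sketch. The helper names `encode_utf8_bmp` and `is_continuation` are hypothetical, written for this illustration only; they are not part of Perl's source. The encoder handles only code points below U+10000 and does not reject surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode a code point below U+10000 as UTF-8.
 * Returns the number of bytes written (1 to 3), or 0 if cp is out of range.
 * Surrogates are not rejected; this is only an illustration. */
static size_t encode_utf8_bmp(uint32_t cp, uint8_t out[3])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;
}

/* A UTF-8 continuation octet has the bit pattern 10xxxxxx. */
static int is_continuation(uint8_t b)
{
    return (b & 0xC0) == 0x80;
}
```

Encoding U+30AB with this helper yields the three octets e3 82 ab, and the last one, 0xab, is a continuation octet. A byte-wise scan that looks for a raw 0xab can therefore land in the middle of the character, which is consistent with the warning quoted above.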
From @cpansprout
Inline Patch
diff -Nurp blead/regexec.c blead-70998/regexec.c
--- blead/regexec.c 2009-12-12 00:57:11.000000000 -0800
+++ blead-70998/regexec.c 2009-12-13 23:26:41.000000000 -0800
@@ -1722,7 +1722,17 @@ S_find_byclass(pTHX_ regexp * prog, cons
" Scanning for legal start char...\n");
}
);
- while ( uc <= (U8*)last_start && !BITMAP_TEST(bitmap,*uc) ) {
+ if(do_utf8) /* Skip continuation octets */
+ while ( /* for UTF8 strings */
+ uc <= (U8*)last_start
+ &&(
+ UTF8_IS_CONTINUATION(*uc)
+ ||
+ !BITMAP_TEST(bitmap,*uc)
+ )
+ ) uc++;
+ else while ( uc <= (U8*)last_start
+ && !BITMAP_TEST(bitmap,*uc) ) {
uc++;
}
s= (char *)uc;
diff -Nurp blead/t/re/pat_rt_report.t blead-70998/t/re/pat_rt_report.t
--- blead/t/re/pat_rt_report.t 2009-11-19 08:51:40.000000000 -0800
+++ blead-70998/t/re/pat_rt_report.t 2009-12-13 21:54:00.000000000 -0800
@@ -21,7 +21,7 @@ BEGIN {
}
-plan tests => 2511; # Update this when adding/deleting tests.
+plan tests => 2512; # Update this when adding/deleting tests.
run_tests() unless caller;
@@ -1076,6 +1076,42 @@ sub run_tests {
'(?>) does not cause wrongness on long string with UTF-8';
}
+ {
+ local $BugId = 70998;
+ local $Message
+ = 'utf8 =~ /trie/ where trie matches a continuation octet';
+
+ # Catch warnings:
+ my $w;
+ local $SIG{__WARN__} = sub { $w .= shift };
+
+ # This bug can be reduced to
+ qq{\x{30ab}} =~ /\xab|\xa9/;
+ # but it's nice to have a more 'real-world' test. The original test
+ # case from the RT ticket follows:
+
+ my %conv = (
+ "\xab" => "<",
+ "\xa9" => "(c)",
+ );
+ my $conv_rx = '(' . join('|', map { quotemeta } keys %conv) . ')';
+ $conv_rx = qr{$conv_rx};
+
+ my $x
+ = qq{\x{3042}\x{304b}\x{3055}\x{305f}\x{306a}\x{306f}\x{307e}}
+ . qq{\x{3084}\x{3089}\x{308f}\x{3093}\x{3042}\x{304b}\x{3055}}
+ . qq{\x{305f}\x{306a}\x{306f}\x{307e}\x{3084}\x{3089}\x{308f}}
+ . qq{\x{3093}\x{30a2}\x{30ab}\x{30b5}\x{30bf}\x{30ca}\x{30cf}}
+ . qq{\x{30de}\x{30e4}\x{30e9}\x{30ef}\x{30f3}\x{30a2}\x{30ab}}
+ . qq{\x{30b5}\x{30bf}\x{30ca}\x{30cf}\x{30de}\x{30e4}\x{30e9}}
+ . qq{\x{30ef}\x{30f3}\x{30a2}\x{30ab}\x{30b5}\x{30bf}\x{30ca}}
+ . qq{\x{30cf}\x{30de}\x{30e4}\x{30e9}\x{30ef}\x{30f3}};
+
+ $x =~ s{$conv_rx}{$conv{$1}}eg;
+
+ iseq($w,undef);
+ }
+
#
# Keep the following tests last -- they may crash perl
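The idea of the regexec.c hunk in the patch above, skipping continuation octets while scanning for a legal start byte, can be modelled outside Perl with a toy sketch. Everything here is an assumption made for illustration: `byte_bitmap`, `bitmap_set`, `bitmap_test`, `is_continuation` and `find_start` are hypothetical stand-ins for the `bitmap`, `BITMAP_TEST` and `UTF8_IS_CONTINUATION` names in the diff, not Perl's actual definitions, and the bitmap contents are chosen only to reproduce the symptom.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy 256-bit bitmap with one bit per possible start byte (hypothetical). */
typedef struct { uint8_t bits[32]; } byte_bitmap;

static void bitmap_set(byte_bitmap *bm, uint8_t b)
{
    bm->bits[b >> 3] |= (uint8_t)(1u << (b & 7));
}

static int bitmap_test(const byte_bitmap *bm, uint8_t b)
{
    return (bm->bits[b >> 3] >> (b & 7)) & 1;
}

/* A UTF-8 continuation octet has the bit pattern 10xxxxxx. */
static int is_continuation(uint8_t b)
{
    return (b & 0xC0) == 0x80;
}

/* Return the index of the first byte whose bit is set in bm, or len if
 * there is none. When is_utf8 is non-zero, continuation octets are never
 * accepted as a start position, mirroring the patched loop. */
static size_t find_start(const uint8_t *s, size_t len,
                         const byte_bitmap *bm, int is_utf8)
{
    size_t i = 0;
    assert(s != NULL && bm != NULL);
    while (i < len
           && ((is_utf8 && is_continuation(s[i])) || !bitmap_test(bm, s[i])))
        i++;
    return i;
}
```

With a bitmap that has 0xab and 0xa9 set and the input bytes e3 82 ab, the byte-wise scan (`is_utf8` of 0) stops at index 2, inside the character, while the UTF-8 aware scan runs to the end of the buffer without reporting a start position.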
The RT System itself - Status changed from 'new' to 'open'
From @demerphq
2009/12/20 Father Chrysostomos <sprout@cpan.org>:
Is this a problem/tested in blead?
I have a feeling that when I disabled certain trie functionality i
Before this gets applied I'd like to know more.
Yves
--
From @cpansprout
On Dec 21, 2009, at 2:46 AM, demerphq wrote:
I hope this is a sufficient answer:
$ perl5.11.3 -e 'qq{\x{30ab}} =~ /\xab|\xa9/'
Whether mine is the best way to fix this I cannot say.
From @demerphq
2009/12/27 Father Chrysostomos <sprout@cpan.org>:
Thanks. I decided to fix it a different way, but I applied your test.
commit d085b49 Fix RT-70998: qq{\x{30ab}} =~ /\xab|\xa9/ produces warnings
commit aca5303 Add test for rt-70998: qq{\x{30ab}} =~ /\xab|\xa9/ produces warnings
--
@cpansprout - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#70998 (status was 'resolved')
Searchable as RT70998$