Report information
Id: 56902
Status: resolved
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors: BKB <benkasminbullock [at] gmail.com>
Cc:
AdminCc:

Operating System: Linux
PatchStatus: (no value)
Severity: low
Type: core
Perl Version: 5.10.0
Fixed In: (no value)



Subject: regex utf8 "uninitialized value" error
Date: Mon, 14 Jul 2008 07:29:51 +0900
To: perlbug [...] perl.org
From: "Ben Bullock" <benkasminbullock [...] gmail.com>
This is a bug report for perl from benkasminbullock@gmail.com,
generated with the help of perlbug 1.36 running under perl 5.10.0.

-----------------------------------------------------------------
[Please enter your report here]

The following script prints lots of erroneous "uninitialized value"
warnings depending on whether UTF-8 is switched on or off.

#!/usr/bin/perl
use warnings;
use strict;

my $regex = "([\x{ff10}-\x{ff19}0-9]{4}|[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]?\x{5343}[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]*)\\s*\x{5e74}";
my $test = "ABCDEFG";
if ($test =~ /($regex)/) {
    print "m:<$1>\n";
}
__END__

If the last character ("\x{5e74}") is removed from the regexp, the
warning vanishes. But if the capturing () is removed (leaving just
"\\s*\x{5e74}"), the warning vanishes, too - so it's not just \x{5e74}
which triggers the warning, only that combined with something else.

The above is a condensed version, which was originally as follows:

#!/usr/local/bin/perl -lw
use strict;
use Encode 'decode';
use Lingua::JA::FindDates 'subsjdate';
binmode STDERR,"utf8";
binmode STDOUT,"utf8";
print STDERR "first try\n";
my $test = "ABCDEFG";
print subsjdate($test);
print STDERR "now try again\n";
$test = decode ('utf8', $test);
print subsjdate($test);

See also this discussion:

http://groups.google.co.jp/group/comp.lang.perl.misc/browse_frm/thread/e487e48569c928b7?hl=en#

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=low
---
Site configuration information for perl 5.10.0:

Configured by ben at Sun Mar 23 08:50:32 JST 2008.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.22-14-generic, archname=i686-linux
    uname='linux lemon 2.6.22-14-generic #1 smp tue feb 12 07:42:25 utc 2008 i686 gnulinux '
    config_args=''
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.6.1.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.6.1'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib'

Locally applied patches:

---
@INC for perl 5.10.0:
    /usr/local/lib/perl5/5.10.0/i686-linux
    /usr/local/lib/perl5/5.10.0
    /usr/local/lib/perl5/site_perl/5.10.0/i686-linux
    /usr/local/lib/perl5/site_perl/5.10.0
    .

---
Environment for perl 5.10.0:
    HOME=/home/ben
    LANG=en_GB.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/ben/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11:/usr/games
    PERL_BADLANG (unset)
    SHELL=/bin/bash

RT-Send-CC: perl5-porters [...] perl.org
Should "ABCDEFG" match or not?

I got rid of the warning via:
a) use qr// instead of ""
b) using my $test = "ABCDEFG\x{5e74}";

Kind regards,

Bram

Subject: Re: [perl #56902] regex utf8 "uninitialized value" error
Date: Sun, 27 Jul 2008 09:05:58 +0900
To: perlbug-followup [...] perl.org
From: "Ben Bullock" <benkasminbullock [...] gmail.com>
2008/7/27 Bram via RT <perlbug-followup@perl.org>:
> Should "ABCDEFG" match or not?
>
> I got rid of the warning via:
> a) use qr// instead of ""
> b) using my $test = "ABCDEFG\x{5e74}";

Thanks for your reply, Bram.

No, ABCDEFG does not match. The bug is that, when Perl is sent a pure
ASCII string and asked to match it against a regex which contains UTF-8
characters, Perl prints out these warnings about uninitialized
variables. There may be cases when the regex would match the pure ASCII
string, but I don't have an example of that.

What you have done in b) is added a non-ASCII character to the string
$test, which prevents $test from tripping the bug. As for a), this may
also remove the warning, but that information is of more use to people
tracking down the bug than me.

Ben Bullock

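[Editor's sketch] The behaviour described above can be checked by counting warnings for the two targets discussed in the thread: the pure ASCII string and Bram's variant b). The helper name `count_warnings` and the `$kanji` variable are illustrative assumptions, not part of the original report; the pattern itself is the reporter's, with the repeated character class factored out. The counts depend on the perl version, so no expected output is given.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run a code block and return how many warnings it emitted.
sub count_warnings {
    my ($code) = @_;
    my $count = 0;
    local $SIG{__WARN__} = sub { $count++ };
    $code->();
    return $count;
}

# The reporter's pattern, with the repeated kanji class factored out.
my $kanji = "\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}"
          . "\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}";
my $regex = "([\x{ff10}-\x{ff19}0-9]{4}|[$kanji]?\x{5343}[$kanji]*)\\s*\x{5e74}";

my $ascii_target = "ABCDEFG";           # pure ASCII, stored as bytes
my $wide_target  = "ABCDEFG\x{5e74}";   # variant b): one wide character added

for my $case ( [ 'ASCII target' => $ascii_target ],
               [ 'wide target'  => $wide_target  ] ) {
    my ($label, $target) = @$case;
    my $n = count_warnings(sub { $target =~ /($regex)/ });
    printf "%-12s %d warning(s)\n", $label, $n;
}
```

On a perl affected by this ticket, the ASCII target should be the one that reports warnings; on a perl containing the fix identified later in the thread, both counts are expected to be zero.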
RT-Send-CC: perl5-porters [...] perl.org
This appears to be fixed in blead. I suppose it needs a bisect, though.

-- 
Father Chrysostomos

Subject: Re: [perl #56902] regex utf8 "uninitialized value" error
Date: Sat, 27 Jul 2013 10:29:11 +0200
To: "Father Chrysostomos via RT" <perlbug-followup [...] perl.org>
From: Andreas Koenig <andreas.koenig.7os6VVqR [...] franz.ak.mind.de>
"Father Chrysostomos via RT" <perlbug-followup@perl.org> writes:

> This appears to be fixed in blead. I suppose it needs a bisect,
> though.

I suppose between 5.10.0 and 5.16.3 there were other relevant fixes, but
the last warning remaining was still in 5.16.3 and was fixed in:

c72077c4fff72b66cdde1621c62fb4fd383ce093 is the first bad commit
commit c72077c4fff72b66cdde1621c62fb4fd383ce093
Author: Aaron Crane <arc@cpan.org>
Date:   Wed Sep 12 16:04:38 2012 +0100

    Fix spurious "uninitialized value" warning in regex match

    The warning appeared if the pattern contains a floating substring for
    which utf8 is needed, and the target string isn't in utf8. In this
    situation, downgrading the floating substring yields undef, which
    triggers the warning.

    Matching can't succeed in this situation, because it's impossible for
    the non-utf8 target string to contain any string which needs utf8 for
    its own representation. So the warning is quelled by aborting the match
    early.

    Anchored substrings already have a check of this form; this commit makes
    the corresponding change in the floating-substring case.

:100644 100644 989affafb428e9413b824ed1ca53ab6096025af7 350f2937aec6ab7709c0a62ad9988422871ef097 M regexec.c
:040000 040000 cb93b9c520722293f705f323320a1d26868973ab 315ed1765bf80a14b5f04b07f2386588efa09144 M t
bisect run success
That took 789 seconds

-- 
andreas

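[Editor's sketch] The reasoning in the commit message can be illustrated at the Perl level. This is only an analogy and not the C code in regexec.c: the function name `can_possibly_match` is hypothetical, and `utf8::downgrade` with a true second argument (FAIL_OK) stands in for the engine's internal attempt to downgrade the required substring.

```perl
use strict;
use warnings;

# Hypothetical pre-check: can a target string possibly contain the
# substring that the pattern requires?
sub can_possibly_match {
    my ($target, $required) = @_;

    # A utf8 target can hold any character, so nothing can be ruled out.
    return 1 if utf8::is_utf8($target);

    # A non-utf8 target only holds characters up to 0xFF. If the required
    # substring cannot be downgraded to bytes, it cannot occur in the target.
    my $copy = $required;   # downgrade works in place, so use a copy
    return utf8::downgrade($copy, 1) ? 1 : 0;   # FAIL_OK: false instead of die
}

# The case from the regression test quoted in this thread.
print can_possibly_match("aa", "\x{100}") ? "try match\n" : "abort early\n";

# A substring that fits in bytes does not allow an early abort.
print can_possibly_match("aa", "a") ? "try match\n" : "abort early\n";
```

The first call corresponds to the situation the commit describes: the substring needs utf8, the target is not utf8, and so the match can be abandoned before the undefined downgraded value is ever used.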
Subject: Re: [perl #56902] regex utf8 "uninitialized value" error
Date: Sat, 27 Jul 2013 11:21:53 +0100
To: Father Chrysostomos via RT <perlbug-followup [...] perl.org>
From: Nicholas Clark <nick [...] ccl4.org>
For reference, the added test is:

diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t
index 05cc191..9502928 100644
--- a/t/re/pat_advanced.t
+++ b/t/re/pat_advanced.t
@@ -789,6 +789,12 @@ sub run_tests {
     }
 
     {
+        # The second half of RT #114808
+        warning_is(sub {'aa' =~ /.+\x{100}/}, undef,
+                   'utf8-only floating substr, non-utf8 target, no warning');
+    }
+
+    {
         my $message = "qr /.../x";
         my $R = qr / A B C # D E/x;
         ok("ABCDE" =~ $R && $& eq "ABC", $message);

Nicholas Clark

RT-Send-CC: perl5-porters [...] perl.org
Thank you again!

-- 
Father Chrysostomos

