regex utf8 "uninitialized value" error #9416

Closed
p5pRT opened this issue Jul 13, 2008 · 9 comments
Comments


p5pRT commented Jul 13, 2008

Migrated from rt.perl.org#56902 (status was 'resolved')

Searchable as RT56902$


p5pRT commented Jul 13, 2008

From @benkasminbullock

Created by @benkasminbullock

The following script prints lots of erroneous "uninitialized value"
warnings depending on whether UTF-8 is switched on or off

#!/usr/bin/perl
use warnings;
use strict;

my $regex =
"([\x{ff10}-\x{ff19}0-9]{4}|[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]?\x{5343}[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]*)\\s*\x{5e74}";
my $test = "ABCDEFG";
if ($test =~ /($regex)/) {
  print "m:<$1>\n";
}
__END__

If the last character ("\x{5e74}") is removed from the regexp, the
warning vanishes. But if the capturing () group is removed (leaving just
"\\s*\x{5e74}"), the warning vanishes too, so it is not \x{5e74} alone
that triggers the warning, only \x{5e74} combined with something else.

The above is a condensed version of the original script, which was as follows:
#!/usr/local/bin/perl -lw
use strict;
use Encode 'decode';
use Lingua::JA::FindDates 'subsjdate';
binmode STDERR,"utf8";
binmode STDOUT,"utf8";
print STDERR "first try\n";
my $test = "ABCDEFG";
print subsjdate($test);
print STDERR "now try again\n";
$test = decode ('utf8', $test);
print subsjdate($test);

See also this discussion:

http://groups.google.co.jp/group/comp.lang.perl.misc/browse_frm/thread/e487e48569c928b7?hl=en#

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.10.0:

Configured by ben at Sun Mar 23 08:50:32 JST 2008.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.22-14-generic, archname=i686-linux
    uname='linux lemon 2.6.22-14-generic #1 smp tue feb 12 07:42:25 utc 2008 i686 gnulinux '
    config_args=''
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.6.1.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.6.1'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib'

Locally applied patches:



@INC for perl 5.10.0:
    /usr/local/lib/perl5/5.10.0/i686-linux
    /usr/local/lib/perl5/5.10.0
    /usr/local/lib/perl5/site_perl/5.10.0/i686-linux
    /usr/local/lib/perl5/site_perl/5.10.0
    .


Environment for perl 5.10.0:
    HOME=/home/ben
    LANG=en_GB.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/ben/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11:/usr/games
    PERL_BADLANG (unset)
    SHELL=/bin/bash


p5pRT commented Jul 26, 2008

From p5p@spam.wizbit.be

Should "ABCDEFG" match or not?

I got rid of the warning via:
a) use qr// instead of ""
b) using my $test = "ABCDEFG\x{5e74}";

Kind regards,

Bram


p5pRT commented Jul 26, 2008

The RT System itself - Status changed from 'new' to 'open'


p5pRT commented Jul 27, 2008

From @benkasminbullock

2008/7/27 Bram via RT <perlbug-followup@perl.org>:

> Should "ABCDEFG" match or not?
>
> I got rid of the warning via:
> a) use qr// instead of ""
> b) using my $test = "ABCDEFG\x{5e74}";

Thanks for your reply, Bram.

No, ABCDEFG does not match. The bug is that, when Perl is given a pure
ASCII string and asked to match it against a regex which contains
non-ASCII (UTF-8) characters, Perl prints these warnings about
uninitialized values. There may be cases where such a regex would match
a pure ASCII string, but I don't have an example of that.

What you have done in b) is added a non-ASCII character to the string
$test, which prevents $test from tripping the bug. As for a), this may
also remove the warning, but that information is of more use to people
tracking down the bug than me.

Ben Bullock
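
For readers following along, the difference between the strings discussed above can be inspected directly. This is only a minimal sketch: the variable names are illustrative and not from the report, and `utf8::is_utf8` and `utf8::upgrade` are the core functions used to look at and change Perl's internal UTF-8 flag. It shows that the plain "ABCDEFG" target carries no UTF-8 flag, while the string from workaround b) does.

```perl
use strict;
use warnings;

my $ascii    = "ABCDEFG";            # ASCII only, internal UTF-8 flag off
my $wide     = "ABCDEFG\x{5e74}";    # contains a character above 0xFF, flag on
my $upgraded = "ABCDEFG";
utf8::upgrade($upgraded);            # same characters, but stored with the flag on

# Report the internal UTF-8 flag of each string (names are illustrative).
for my $pair ([ascii => $ascii], [wide => $wide], [upgraded => $upgraded]) {
    my ($name, $string) = @$pair;
    printf "%-8s utf8 flag: %s\n", $name, utf8::is_utf8($string) ? 'on' : 'off';
}
```

Whether upgrading the target also silences the warning on an affected perl is an assumption based on the description in this thread, not something verified here.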


p5pRT commented Jul 27, 2013

From @cpansprout

This appears to be fixed in blead. I suppose it needs a bisect, though.

--

Father Chrysostomos


p5pRT commented Jul 27, 2013

From @andk

"Father Chrysostomos via RT" <perlbug-followup@​perl.org> writes​:

> This appears to be fixed in blead. I suppose it needs a bisect,
> though.

I suppose there were other relevant fixes between 5.10.0 and 5.16.3, but
the last remaining warning was still present in 5.16.3 and was fixed in:

c72077c is the first bad commit
commit c72077c
Author: Aaron Crane <arc@cpan.org>
Date: Wed Sep 12 16:04:38 2012 +0100

  Fix spurious "uninitialized value" warning in regex match
 
  The warning appeared if the pattern contains a floating substring for
  which utf8 is needed, and the target string isn't in utf8. In this
  situation, downgrading the floating substring yields undef, which
  triggers the warning.
 
  Matching can't succeed in this situation, because it's impossible for
  the non-utf8 target string to contain any string which needs utf8 for
  its own representation. So the warning is quelled by aborting the match
  early.
 
  Anchored substrings already have a check of this form; this commit makes
  the corresponding change in the floating-substring case.

:100644 100644 989affafb428e9413b824ed1ca53ab6096025af7 350f2937aec6ab7709c0a62ad9988422871ef097 M regexec.c
:040000 040000 cb93b9c520722293f705f323320a1d26868973ab 315ed1765bf80a14b5f04b07f2386588efa09144 M t
bisect run success
That took 789 seconds

--
andreas
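
The case described in the commit message can be checked in isolation with a small script. This is a sketch, not part of the commit: it reuses the pattern and target from the test added for RT #114808 (`'aa' =~ /.+\x{100}/`), and the warning-collecting handler and variable names are illustrative assumptions. Going by this thread, a perl that includes c72077c should report no warnings, while 5.16.3 and earlier may report one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect warnings emitted during the match instead of printing them.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Target is plain ASCII, so it is not stored as UTF-8 internally.
my $target = 'aa';

# \x{100} is a floating substring that can only be represented in UTF-8.
my $matched = ($target =~ /.+\x{100}/) ? 1 : 0;

printf "perl version:        %vd\n", $^V;
printf "utf8 flag on target: %s\n", utf8::is_utf8($target) ? 'on' : 'off';
printf "matched:             %d\n", $matched;
printf "warnings:            %d\n", scalar @warnings;
print  "  $_" for @warnings;
```

The match fails in either case, since a target without any character above 0xFF cannot contain \x{100}; only the warning count is expected to differ between perl versions.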


p5pRT commented Jul 27, 2013

From @nwc10

For reference, the added test is:

diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t
index 05cc191..9502928 100644
--- a/t/re/pat_advanced.t
+++ b/t/re/pat_advanced.t
@@ -789,6 +789,12 @@ sub run_tests {
     }
 
     {
+        # The second half of RT #114808
+        warning_is(sub {'aa' =~ /.+\x{100}/}, undef,
+                   'utf8-only floating substr, non-utf8 target, no warning');
+    }
+
+    {
         my $message = "qr /.../x";
         my $R = qr / A B C # D E/x;
         ok("ABCDE" =~    $R   && $& eq "ABC", $message);

Nicholas Clark


p5pRT commented Jul 27, 2013

From @cpansprout

Thank you again!

--

Father Chrysostomos


p5pRT commented Jul 27, 2013

@cpansprout - Status changed from 'open' to 'resolved'
