Skip Menu |
 
Report information
Id: 114808
Status: resolved
Priority: 0/
Queue: perl5

Owner: khw <khw [at] cpan.org>
Requestors: mobrule [at] gmail.com
Cc:
AdminCc:

Operating System: Linux
PatchStatus: (no value)
Severity: medium
Type: core
Perl Version: 5.12.4
Fixed In: (no value)

Attachments
0001-Fix-spurious-uninitialized-value-warning-in-regex-ma.patch
0001-Fix-spurious-uninitialized-value-warning-in-split.patch



CC: mobrule [...] gmail.com
Subject: split output empty when PATTERN and EXPR have "wide" characters
Date: Sun, 9 Sep 2012 19:00:08 GMT
To: perlbug [...] perl.org
From: root <root [...] li128-189.members.linode.com>
Download (untitled) / with headers
text/plain 6.1k
This is a bug report for perl from mobrule@gmail.com, generated with the help of perlbug 1.39 running under perl 5.12.4. =head1 BUG REPORT split output empty when PATTERN and EXPR have "wide" characters =head1 DESCRIPTION In a call to the builtin C<split PATTERN, EXPR> function: if the C<PATTERN> contains one or more "wide" characters (that is, greater than C<chr(255)>), and if C<EXPR> contains a different set of wide characters, then in some cases the C<split> call will not produce any output. In addition, when C<warnings> are enabled, Perl will produce one and sometimes several spurious C<Use of uninitialized value in split at ...> messages. This bug has been observed on many different versions of Perl, from v5.8 to v5.15, on Linux, Cygwin, and Windows. =head2 ADDITIONAL INFORMATION A C<split> statement will only exhibit this problem after the first time it is executed. =head2 DEMONSTRATION The code in this script demonstrates the issue. The tests check that C<split> produced output, and since the cases are designed so that the C<PATTERN>s do not match the C<EXPR>s, that the output are one-element lists that contain the original C<EXPR>. Tests pass when the characters in C<EXPR> are all below C<chr(256)>, or after a C<split> statement is executed for the first time. Tests fail when C<EXPR> contains a "wide" character, and the output is not from the first time a C<split> statement has been executed. =cut use strict; use warnings; use Test::More; use Encode; binmode *STDOUT, ":encoding(UTF-8)"; binmode *STDERR, ":encoding(UTF-8)"; sub toUTF8 ($) { Encode::encode("utf-8",$_[0]) }; # for output to Test::Builder my $text0 = "normal\x{ee}"; # 8-bit string my $text1 = "\x{444}"; # single wide char my $text2 = "ab\x{ccc}de\x{999}gh"; # string containing wide char my $pattern1 = chr(0xabc); # single wide char my $pattern2 = "\x{abc}\x{def}ghi"; # more than one wide char for ( $text1, # the first call to each split function is ok $text0, # ok $text1, # tests fail, one warning $text2) { # tests fail, more than one warning print STDERR "--------------------\ntext is $_\n"; print STDERR "pattern is /$pattern1/\n"; my @list1 = split /$pattern1/, $_; ok(@list1 > 0, toUTF8 "split had results for text $_, pattern $pattern1"); ok($list1[0] eq $_, toUTF8 "correct result for text $_, pattern $pattern1"); print STDERR "pattern is /$pattern2/\n"; my @list2 = split /$pattern2/, $_; ok(@list2 > 0, toUTF8 "split had results for text $_, pattern $pattern2"); ok($list2[0] eq $_, toUTF8 "correct result for text $_, pattern $pattern2"); } done_testing; =head2 workaround A workaround is to call C<split> indirectly in a way that insures the C<PATTERN> is recompiled before each call. For the tests above, we could define an indirect function sub SPLIT { my ($PATTERN, $EXPR) = @_; return split /$PATTERN/,$EXPR } and call my @list1 = SPLIT $pattern1, $_; ... my @list2 = SPLIT $pattern2, $_; instead. =head1 SUBMITTED BY Marty O'Brien, E<lt>mob@cpan.orgE<gt> =cut --- Flags: category=core severity=medium --- Site configuration information for perl 5.12.4: Configured by Debian Project at Tue Sep 6 08:07:52 UTC 2011. Summary of my perl5 (revision 5 version 12 subversion 4) configuration: Platform: osname=linux, osvers=2.6.24-28-server, archname=i686-linux-gnu-thread-multi-64int uname='linux roseapple 2.6.24-28-server #1 smp wed aug 18 21:17:51 utc 2010 i686 i686 i386 gnulinux ' config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i686-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.12 -Darchlib=/usr/lib/perl/5.12 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.12.4 -Dsitearch=/usr/local/lib/perl/5.12.4 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Duse64bitint -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Ui_libutil -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.12.4 -des' hint=recommended, useposix=true, d_sigaction=define useithreads=define, usemultiplicity=define useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=define, use64bitall=undef, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-O2 -g', cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include' ccversion='', gccversion='4.6.1', gccosandvers='' intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12 ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=4, prototype=define Linker and Libraries: ld='cc', ldflags =' -fstack-protector -L/usr/local/lib' libpth=/usr/local/lib /lib/i386-linux-gnu /lib/../lib /usr/lib/i386-linux-gnu /usr/lib/../lib /lib /usr/lib libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt perllibs=-ldl -lm -lpthread -lc -lcrypt libc=, so=so, useshrplib=true, libperl=libperl.so.5.12.4 gnulibc_version='2.13' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E' cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib -fstack-protector' Locally applied patches: --- @INC for perl 5.12.4: /etc/perl /usr/local/lib/perl/5.12.4 /usr/local/share/perl/5.12.4 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.12 /usr/share/perl/5.12 /usr/local/lib/site_perl . --- Environment for perl 5.12.4: HOME=/root LANG=en_US.UTF-8 LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/root/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games PERL_BADLANG (unset) SHELL=/bin/bash
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 467b
Possibly related: http://stackoverflow.com/questions/12358714 Why does this unicode char cause perl regex match to emit “uninitialized” warning? The following code: #!/usr/bin/env perl use utf8; use strict; use warnings; use 5.012; # implicitly turn on feature unicode_strings my $test = "some string"; $test =~ m/.+\x{2013}/x; Yields: Use of uninitialized value $test in pattern match (m//) at test.pl line 9.
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 1.2k
On Mon Sep 10 13:21:50 2012, mob wrote: Show quoted text
> Possibly related: http://stackoverflow.com/questions/12358714 > > Why does this unicode char cause perl regex match to emit > “uninitialized” warning? > > The following code: > > #!/usr/bin/env perl > use utf8; > use strict; > use warnings; > use 5.012; # implicitly turn on feature unicode_strings > my $test = "some string"; > $test =~ m/.+\x{2013}/x; > > Yields: > > Use of uninitialized value $test in pattern match (m//) at test.pl > line 9. > >
I looked at this a little. It comes from these lines (in today's blead) regexec.c S_regtry(): 2454 if (!(utf8_target ? prog->float_utf8 : prog->float_substr)) 2455 utf8_target ? to_utf8_substr(prog) : to_byte_substr(prog); 2456 float_real = utf8_target ? prog->float_utf8 : prog->float_substr; 2457 2458 little = SvPV_const(float_real, len); The target is not utf8, but float_substr is set to NULL. The code tries to guess the name of the uninitialized SV, and comes up erroneously with $temp, instead of the internal Perl variable, float_substr. This appears to be a regex optimizer bug, and I think others are better qualified to look into that. -- Karl Williamson
CC: perl5-porters [...] perl.org
Subject: Re: [perl #114808] split output empty when PATTERN and EXPR have "wide" characters
Date: Tue, 11 Sep 2012 19:55:35 +0100
To: perlbug-followup [...] perl.org
From: Aaron Crane <perl [...] aaroncrane.co.uk>
Download (untitled) / with headers
text/plain 1.4k
Karl Williamson via RT <perlbug-followup@perl.org> wrote: Show quoted text
> On Mon Sep 10 13:21:50 2012, mob wrote:
>> #!/usr/bin/env perl >> use utf8; >> use strict; >> use warnings; >> use 5.012; # implicitly turn on feature unicode_strings >> my $test = "some string"; >> $test =~ m/.+\x{2013}/x; >> >> Yields: >> >> Use of uninitialized value $test in pattern match (m//) at test.pl line 9.
That can be reduced further: $ perl -we '"aa" =~ /.+\x{100}/' Use of uninitialized value in pattern match (m//) at -e line 1. Show quoted text
> This appears to be a regex optimizer bug
I think I agree, more or less. This quells the warning, and everything in t/re still passes with it (on my machine): diff --git a/regexec.c b/regexec.c index 2dc2314..5ef493b 100644 --- a/regexec.c +++ b/regexec.c @@ -2453,6 +2453,9 @@ Perl_regexec_flags(pTHX_ REGEXP * const rx, char *stringarg, registe if (!(utf8_target ? prog->float_utf8 : prog->float_substr)) utf8_target ? to_utf8_substr(prog) : to_byte_substr(prog); + if (!utf8_target && (!prog->float_substr || !SvOK(prog->float_substr))) + /* can't match: floating substr needs utf8, but target is not utf8 */ + goto phooey; float_real = utf8_target ? prog->float_utf8 : prog->float_substr; little = SvPV_const(float_real, len); I have to go out just now, but I'll put together a proper patch later this evening, including a test. -- Aaron Crane ** http://aaroncrane.co.uk/
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 1.4k
Show quoted text
> I think I agree, more or less. This quells the warning, and > everything in t/re still passes with it (on my machine): > > diff --git a/regexec.c b/regexec.c > index 2dc2314..5ef493b 100644 > --- a/regexec.c > +++ b/regexec.c > @@ -2453,6 +2453,9 @@ Perl_regexec_flags(pTHX_ REGEXP * const rx, char > *stringarg, registe > > if (!(utf8_target ? prog->float_utf8 : prog-
> >float_substr))
> utf8_target ? to_utf8_substr(prog) : > to_byte_substr(prog); > + if (!utf8_target && (!prog->float_substr || > !SvOK(prog->float_substr))) > + /* can't match: floating substr needs utf8, but > target is not utf8 */ > + goto phooey; > float_real = utf8_target ? prog->float_utf8 : prog-
> >float_substr;
> > little = SvPV_const(float_real, len); > > I have to go out just now, but I'll put together a proper patch later > this evening, including a test. >
The patch fixes the warnings from this one-liner $ perl -we '"aa" =~ /.+\x{100}/' but not this one $ perl -we '$p=chr(0x100); for (".","ab\x{101}def") { @q = split /$p/ }' Name "main::q" used only once: possible typo at -e line 1. Use of uninitialized value in split at -e line 1. Use of uninitialized value in split at -e line 1. Use of uninitialized value in split at -e line 1. Use of uninitialized value in split at -e line 1. Use of uninitialized value in split at -e line 1. Use of uninitialized value in split at -e line 1.
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 1.7k
On Tue Sep 11 20:23:24 2012, mob wrote: Show quoted text
> > I think I agree, more or less. This quells the warning, and > > everything in t/re still passes with it (on my machine): > > > > diff --git a/regexec.c b/regexec.c > > index 2dc2314..5ef493b 100644 > > --- a/regexec.c > > +++ b/regexec.c > > @@ -2453,6 +2453,9 @@ Perl_regexec_flags(pTHX_ REGEXP * const rx, char > > *stringarg, registe > > > > if (!(utf8_target ? prog->float_utf8 : prog-
> > >float_substr))
> > utf8_target ? to_utf8_substr(prog) : > > to_byte_substr(prog); > > + if (!utf8_target && (!prog->float_substr || > > !SvOK(prog->float_substr))) > > + /* can't match: floating substr needs utf8, but > > target is not utf8 */ > > + goto phooey; > > float_real = utf8_target ? prog->float_utf8 : prog-
> > >float_substr;
> > > > little = SvPV_const(float_real, len); > > > > I have to go out just now, but I'll put together a proper patch later > > this evening, including a test. > >
> > The patch fixes the warnings from this one-liner > > $ perl -we '"aa" =~ /.+\x{100}/' > > but not this one > > $ perl -we '$p=chr(0x100); for (".","ab\x{101}def") { @q = split
/$p/ }' Show quoted text
> Name "main::q" used only once: possible typo at -e line 1. > Use of uninitialized value in split at -e line 1. > Use of uninitialized value in split at -e line 1. > Use of uninitialized value in split at -e line 1. > Use of uninitialized value in split at -e line 1. > Use of uninitialized value in split at -e line 1. > Use of uninitialized value in split at -e line 1. >
Any successful patch of this would require analyzing regcomp.c to find out why float_substr is NULL. The fault may not lie in regexec.c at all. -- Karl Williamson
CC: perl5-porters [...] perl.org
Subject: Re: [perl #114808] split output empty when PATTERN and EXPR have "wide" characters
Date: Wed, 12 Sep 2012 09:47:35 +0100
To: perlbug-followup [...] perl.org
From: Aaron Crane <perl [...] aaroncrane.co.uk>
Download (untitled) / with headers
text/plain 1.6k
Karl Williamson via RT <perlbug-followup@perl.org> wrote: Show quoted text
> On Tue Sep 11 20:23:24 2012, mob wrote:
>> The patch fixes the warnings from this one-liner >> >> $ perl -we '"aa" =~ /.+\x{100}/' >> >> but not this one >> >> $ perl -we '$p=chr(0x100); for (".","ab\x{101}def") { @q = split
> /$p/ }'
>> Name "main::q" used only once: possible typo at -e line 1. >> Use of uninitialized value in split at -e line 1. >> Use of uninitialized value in split at -e line 1. >> Use of uninitialized value in split at -e line 1. >> Use of uninitialized value in split at -e line 1. >> Use of uninitialized value in split at -e line 1. >> Use of uninitialized value in split at -e line 1.
> > Any successful patch of this would require analyzing regcomp.c to find > out why float_substr is NULL. The fault may not lie in regexec.c at all.
For the /.+\x{100}/ case, I believe the logic is equivalent to this, for the anchored-substring case, round line 2366: if (must == &PL_sv_undef) /* could not downgrade utf8 check substring, so must fail */ goto phooey; That is: the reason we don't have float_substr is that the floating substring was utf8 in the pattern, but it can't be downgraded to the non-utf8 representation. So if the target string is non-utf8, we know there can't be a match, and we can skip the regex engine proper entirely. I haven't investigated the warnings from split, but I don't think they're related, because you need both loop iterations (and in that order, non-utf8 before utf8) to trigger them. Does that ring a bell for anyone who's familiar with how split works? -- Aaron Crane ** http://aaroncrane.co.uk/
CC: perl5-porters [...] perl.org
Subject: Re: [perl #114808] split output empty when PATTERN and EXPR have "wide" characters
Date: Wed, 12 Sep 2012 19:44:52 +0100
To: perlbug-followup [...] perl.org
From: Aaron Crane <perl [...] aaroncrane.co.uk>
Download (untitled) / with headers
text/plain 2.5k
I wrote: Show quoted text
> Karl Williamson via RT <perlbug-followup@perl.org> wrote:
>> Any successful patch of this would require analyzing regcomp.c to find >> out why float_substr is NULL. The fault may not lie in regexec.c at all.
> > For the /.+\x{100}/ case, I believe the logic is equivalent to this, > for the anchored-substring case, round line 2366: > > if (must == &PL_sv_undef) > /* could not downgrade utf8 check substring, so must fail */ > goto phooey;
I'm still happy with my analysis of that case; see the attached patch. In particular, the similarity to the anchored-substring case increases my confidence that this is a reasonable fix. However… Show quoted text
> I haven't investigated the warnings from split, but I don't think > they're related, because you need both loop iterations (and in that > order, non-utf8 before utf8) to trigger them. Does that ring a bell > for anyone who's familiar with how split works?
… after a little more investigation, I've come to the conclusion that the use of split is a red herring here, because I can trigger what seems to be the same bug using only regex matches: $ perl -we '$c = "\x{100}"; $r = /$c/;' -e '"a" =~ $r;' -e '"\x{101}" =~ $r' Use of uninitialized value $_ in pattern match (m//) at -e line 1. Before I realised that, I worked up this patch: diff --git a/regexec.c b/regexec.c index 7f81fea..76c261f 100644 --- a/regexec.c +++ b/regexec.c @@ -629,6 +629,15 @@ Perl_re_intuit_start(pTHX_ REGEXP * const rx, SV *sv, char *strpos, if (check == &PL_sv_undef) { DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, "Non-utf8 string cannot match utf8 check string\n")); + /* At this point, to_byte_substr() has set prog->check_substr to + * &PL_sv_undef. But this function (and pregexec()) use the + * non-nullity of prog->check_substr (rather than the definedness of + * the sv it points to) to decide whether to use it. So should this + * regex be reused in the future, pregexec() will *try* using + * check_substr, but emit an "uninitialized value" warning every + * time it consults it. We work around this by setting the useless, + * undefined check_substr to a null pointer. */ + prog->check_substr = 0; goto fail; } if (prog->extflags & RXf_ANCH) { /* Match at beg-of-str or after \n */ But that clearly *isn't* the right fix, because while it does fix the warnings in split, it doesn't help with my split-free test case. Does anyone have any ideas? -- Aaron Crane ** http://aaroncrane.co.uk/

Message body is not shown because sender requested not to inline it.

CC: perlbug-followup [...] perl.org, perl5-porters [...] perl.org
Subject: Re: [perl #114808] split output empty when PATTERN and EXPR have "wide" characters
Date: Wed, 12 Sep 2012 22:10:36 +0200
To: Aaron Crane <perl [...] aaroncrane.co.uk>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 2.9k
On 12 September 2012 20:44, Aaron Crane <perl@aaroncrane.co.uk> wrote: Show quoted text
> I wrote:
>> Karl Williamson via RT <perlbug-followup@perl.org> wrote:
>>> Any successful patch of this would require analyzing regcomp.c to find >>> out why float_substr is NULL. The fault may not lie in regexec.c at all.
>> >> For the /.+\x{100}/ case, I believe the logic is equivalent to this, >> for the anchored-substring case, round line 2366: >> >> if (must == &PL_sv_undef) >> /* could not downgrade utf8 check substring, so must fail */ >> goto phooey;
> > I'm still happy with my analysis of that case; see the attached patch. > In particular, the similarity to the anchored-substring case > increases my confidence that this is a reasonable fix. However… >
>> I haven't investigated the warnings from split, but I don't think >> they're related, because you need both loop iterations (and in that >> order, non-utf8 before utf8) to trigger them. Does that ring a bell >> for anyone who's familiar with how split works?
> > … after a little more investigation, I've come to the conclusion that > the use of split is a red herring here, because I can trigger what > seems to be the same bug using only regex matches: > > $ perl -we '$c = "\x{100}"; $r = /$c/;' -e '"a" =~ $r;' -e '"\x{101}" =~ $r' > Use of uninitialized value $_ in pattern match (m//) at -e line 1.
Are you sure that tests what you think it does? Shouldn't that be $r= qr/$c/;? As written I would have said it is going to match /$c/ against $_ which is actually undefined and stick the result of the match (0) in $r. Show quoted text
> Before I realised that, I worked up this patch: > > diff --git a/regexec.c b/regexec.c > index 7f81fea..76c261f 100644 > --- a/regexec.c > +++ b/regexec.c > @@ -629,6 +629,15 @@ Perl_re_intuit_start(pTHX_ REGEXP * const rx, SV > *sv, char *strpos, > if (check == &PL_sv_undef) { > DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, > "Non-utf8 string cannot match utf8 check string\n")); > + /* At this point, to_byte_substr() has set prog->check_substr to > + * &PL_sv_undef. But this function (and pregexec()) use the > + * non-nullity of prog->check_substr (rather than the definedness of > + * the sv it points to) to decide whether to use it. So should this > + * regex be reused in the future, pregexec() will *try* using > + * check_substr, but emit an "uninitialized value" warning every > + * time it consults it. We work around this by setting the useless, > + * undefined check_substr to a null pointer. */ > + prog->check_substr = 0; > goto fail; > } > if (prog->extflags & RXf_ANCH) { /* Match at beg-of-str or after \n */ > > But that clearly *isn't* the right fix, because while it does fix the > warnings in split, it doesn't help with my split-free test case. > > Does anyone have any ideas?
I need to review this more closely. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: perlbug-followup [...] perl.org, perl5-porters [...] perl.org
Subject: Re: [perl #114808] split output empty when PATTERN and EXPR have "wide" characters
Date: Wed, 12 Sep 2012 21:56:44 +0100
To: demerphq <demerphq [...] gmail.com>
From: Aaron Crane <perl [...] aaroncrane.co.uk>
Download (untitled) / with headers
text/plain 839b
demerphq <demerphq@gmail.com> wrote: Show quoted text
> On 12 September 2012 20:44, Aaron Crane <perl@aaroncrane.co.uk> wrote:
>> $ perl -we '$c = "\x{100}"; $r = /$c/;' -e '"a" =~ $r;' -e '"\x{101}" =~ $r' >> Use of uninitialized value $_ in pattern match (m//) at -e line 1.
> > Are you sure that tests what you think it does? Shouldn't that be $r= qr/$c/;?
D'oh. Thanks for pointing out my mistake. With the missing qr inserted, that code is warning-free in blead, so I'm back to thinking that my suggested fix for the original bug report (the split case) is, at least, not overwhelmingly wrong. Patch attached. Show quoted text
> I need to review this more closely.
Thanks, I'd be most grateful for the review. If you also have chance to consider my other patch (the one for the /.+\x{100}/ report), that'd be great. -- Aaron Crane ** http://aaroncrane.co.uk/

Message body is not shown because sender requested not to inline it.

CC: perlbug-followup [...] perl.org, perl5-porters [...] perl.org
Subject: Re: [perl #114808] split output empty when PATTERN and EXPR have "wide" characters
Date: Wed, 12 Sep 2012 23:19:03 +0200
To: Aaron Crane <perl [...] aaroncrane.co.uk>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 1.3k
On 12 September 2012 22:56, Aaron Crane <perl@aaroncrane.co.uk> wrote: Show quoted text
> demerphq <demerphq@gmail.com> wrote:
>> On 12 September 2012 20:44, Aaron Crane <perl@aaroncrane.co.uk> wrote:
>>> $ perl -we '$c = "\x{100}"; $r = /$c/;' -e '"a" =~ $r;' -e '"\x{101}" =~ $r' >>> Use of uninitialized value $_ in pattern match (m//) at -e line 1.
>> >> Are you sure that tests what you think it does? Shouldn't that be $r= qr/$c/;?
> > D'oh. Thanks for pointing out my mistake. > > With the missing qr inserted, that code is warning-free in blead, so > I'm back to thinking that my suggested fix for the original bug report > (the split case) is, at least, not overwhelmingly wrong. Patch > attached. >
>> I need to review this more closely.
> > Thanks, I'd be most grateful for the review. > > If you also have chance to consider my other patch (the one for the > /.+\x{100}/ report), that'd be great.
You said at one point "fails to downgrade, so it returns undef", and that patch adds a SvOK() check, and then this patch will set it to null right? But doesn't that mean that the substring optimization would be lost permanently? So shouldnt we fix that logic so it can handle this case without destroying the match substr? IOW, doesnt this patch series (which I appreciate) solve this problem a bit too far down the stack? (Am i incorrectly entangling these two patches?) Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: perlbug-followup [...] perl.org, perl5-porters [...] perl.org
Subject: Re: [perl #114808] split output empty when PATTERN and EXPR have "wide" characters
Date: Thu, 13 Sep 2012 15:44:37 +0100
To: demerphq <demerphq [...] gmail.com>
From: Aaron Crane <perl [...] aaroncrane.co.uk>
Download (untitled) / with headers
text/plain 6.1k
demerphq <demerphq@gmail.com> wrote: Show quoted text
> On 12 September 2012 22:56, Aaron Crane <perl@aaroncrane.co.uk> wrote:
>> Thanks, I'd be most grateful for the review. >> >> If you also have chance to consider my other patch (the one for the >> /.+\x{100}/ report), that'd be great.
> > You said at one point "fails to downgrade, so it returns undef", and > that patch adds a SvOK() check, and then this patch will set it to > null right?
The submitted version of my patch for /.+\x{100}/ just compared `== &PL_sv_undef`, rather than using SvOK(). But that case doesn't seem to be the same as this split case: my fix for the split warnings works regardless of whether the earlier patch has been applied. For clarity, this is the earlier patch I'm talking about: https://rt.perl.org/rt3/Ticket/Attachment/1153646/588206/0001-Fix-spurious-uninitialized-value-warning-in-regex-ma.patch Show quoted text
> But doesn't that mean that the substring optimization would be lost > permanently? So shouldnt we fix that logic so it can handle this case > without destroying the match substr?
I don't *think* that's the case. This is just the version of the floating-or-anchored substring that's copied into check_substr from anchored_substr or float_substr as appropriate; the original is left where it is. For added tricksiness, though, those apparent fields of struct regexp are actually preprocessor macros; struct regexp contains struct reg_substr_data *substrs, which points to a struct containing a three-element array "data" where element 0 is anchored, 1 is floating, and 2 is "check". (See line 573 in regcomp.h.) This gdb session seems to support my position that we aren't destroying the match substr: $ gdb --args ./miniperl -we '$p = chr 0x100; for (".", "ab\x{100}cd") { @q = split /$p/ }' GNU gdb 6.3.50-20050815 (Apple version gdb-1515) (Sat Jan 15 08:33:48 UTC 2011) […] (gdb) b regexec.c:640 Breakpoint 1 at 0x10015168a: file regexec.c, line 640. That's the `prog->check_substr = 0` line my patch adds. (gdb) run Starting program: ./miniperl -we \$p\ =\ chr\ 0x100\;\ for\ \(\".\",\ \"ab\\x\{100\}cd\"\)\ \{\ @q\ =\ split\ /\$p/\ \} Reading symbols for shared libraries ++.. done Name "main::q" used only once: possible typo at -e line 1. Breakpoint 1, Perl_re_intuit_start (rx=0x10082e108, sv=0x100826610, strpos=0x1003098f0 ".", strend=0x1003098f1 "", flags=0, data=0x7fff5fbfe640) at regexec.c:640 640 prog->check_substr = 0; (gdb) p prog->substrs->data $1 = {{ min_offset = 0, max_offset = 0, substr = 0x1001d4230, utf8_substr = 0x10082e120, end_shift = 0 }, { min_offset = 0, max_offset = 0, substr = 0x0, utf8_substr = 0x0, end_shift = 0 }, { min_offset = 0, max_offset = 0, substr = 0x1001d4230, utf8_substr = 0x10082e120, end_shift = 0 }} So that's the substring data we would leave if my change weren't there. 0x1001d4230 is &PL_sv_undef in this build, btw. (gdb) s 641 goto fail; (gdb) p prog->substrs->data $2 = {{ min_offset = 0, max_offset = 0, substr = 0x1001d4230, utf8_substr = 0x10082e120, end_shift = 0 }, { min_offset = 0, max_offset = 0, substr = 0x0, utf8_substr = 0x0, end_shift = 0 }, { min_offset = 0, max_offset = 0, substr = 0x0, utf8_substr = 0x10082e120, end_shift = 0 }} Now we've set check_substr (aka prog->substrs->data[2]->substr) to a null pointer, but check_utf8 is still there. The next line is `goto fail`, which goes to the "Match rejected by optimiser" exit. Let's continue this split invocation, but set a breakpoint in split for the next one: (gdb) b Perl_pp_split Breakpoint 2 at 0x10010e467: file pp.c, line 5246. (gdb) cont Continuing. Breakpoint 2, Perl_pp_split () at pp.c:5246 5246 dVAR; dSP; dTARG; At this point, if we're going to successfully use check_utf8, we'll end up in fbm_instr(). So make sure that we do: (gdb) b Perl_fbm_instr Breakpoint 3 at 0x10009df20: file util.c, line 655. (gdb) cont Continuing. Breakpoint 3, Perl_fbm_instr (big=0x100309980 "abĀcd", bigend=0x100309986 "", littlestr=0x10082e120, flags=0) at util.c:655 655 const unsigned char *little = (const unsigned char *)SvPV_const(littlestr,l); (gdb) s 656 STRLEN littlelen = l; (gdb) p little $3 = (const unsigned char *) 0x100306390 "Ā" So, yes, the optimiser is using the \x{100} substring to strength-reduce full regex matching to an FBM search, which means we didn't destroy the match substr. Phew. Show quoted text
> IOW, doesnt this patch series (which I appreciate) solve this problem > a bit too far down the stack?
That's a good question, and I really don't grok the regex engine well enough to have an answer to it. Also, I've now realised the split warnings are hiding an actual bug. Under 5.16 (or under current blead): $ perl -CS -le '$p = chr 0x100; for (".", "ab\x{100}cd") { print for split /$p/ }' . $ perl -CS -le '$p = chr 0x100; for (".", "ab\x{101}cd") { print for split /$p/ }' . Note no output from the second split in each run. Whereas in blead with my second patch (the "assign a null pointer" patch): $ ./perl -CS -le '$p = chr 0x100; for (".", "ab\x{100}cd") { print for split /$p/ }' . ab cd $ ./perl -CS -le '$p = chr 0x100; for (".", "ab\x{101}cd") { print for split /$p/ }' . abācd I don't have an analysis for what's causing that bug, nor why my proposed change fixes it. Which doesn't give me any confidence that my second patch (for the split bug) should be applied. Show quoted text
> (Am i incorrectly entangling these two patches?)
Probably, I think. The warnings quelled by the first patch happen when matching a utf8-needing pattern against a non-utf8 SV; the split warnings happen when splitting a utf8 SV on a utf8-needing pattern (but only after previously splitting a non-utf8 SV on the same pattern). So my two patches are independent, afaict. Is it worth splitting the match warning out to a separate RT ticket? -- Aaron Crane ** http://aaroncrane.co.uk/
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 1.2k
Resolved by commitsc72077c4fff72b66cdde1621c62fb4fd383ce093 and 7e0d5ad7c9cdb21b681e611b888acd41d34c4d05 I became convinced that Aaron's first patch was the right direction to go, and applied it, I was concerned about Yves' comment about potentially losing things, and so came used a somewhat different approach to the second patch. Instead of losing the data, the function that did so now instead just returns failure, and the callers check that return. It may be that Aaron's approach in the second patch is fine, but this way, we are sure that nothing is lost. What we do lose is the knowledge that this pattern can't be tried against a non-UTF8 string, so it may be that there are cases where we re-try that, only to fail again. I doubt that that leads to any more inefficiencies than what the existing situation is, where applying the pattern against both utf8 and non-utf8 targets changes the substrs each time. I also didn't see the test cases in the 2nd patch until I had generated my own from Aaron's data. If someone thinks those should be used as well, we can add them. I should note that there was yet another case in regexec.c that needed to check for the downgrade failure, and that is now done. -- Karl Williamson


This service is sponsored and maintained by Best Practical Solutions and runs on Perl.org infrastructure.

For issues related to this RT instance (aka "perlbug"), please contact perlbug-admin at perl.org