Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

split output empty when PATTERN and EXPR have "wide" characters #12390

Closed
p5pRT opened this issue Sep 9, 2012 · 18 comments
Closed

split output empty when PATTERN and EXPR have "wide" characters #12390

p5pRT opened this issue Sep 9, 2012 · 18 comments

Comments

@p5pRT
Copy link

p5pRT commented Sep 9, 2012

Migrated from rt.perl.org#114808 (status was 'resolved')

Searchable as RT114808$

@p5pRT
Copy link
Author

p5pRT commented Sep 9, 2012

From mobrule@gmail.com

This is a bug report for perl from mobrule@​gmail.com,
generated with the help of perlbug 1.39 running under perl 5.12.4.

=head1 BUG REPORT

split output empty when PATTERN and EXPR have "wide" characters

=head1 DESCRIPTION

In a call to the builtin C<split PATTERN, EXPR> function​:

if the C<PATTERN> contains one or more "wide" characters
(that is, greater than C<chr(255)>), and if C<EXPR> contains
a different set of wide characters, then in some cases the
C<split> call will not produce any output.

In addition, when C<warnings> are enabled, Perl will produce
one and sometimes several spurious C<Use of uninitialized value
in split at ...> messages.

This bug has been observed on many different versions of Perl,
from v5.8 to v5.15, on Linux, Cygwin, and Windows.

=head2 ADDITIONAL INFORMATION

A C<split> statement will only exhibit this problem after
the first time it is executed.

=head2 DEMONSTRATION

The code in this script demonstrates the issue. The tests
check that C<split> produced output, and since the cases
are designed so that the C<PATTERN>s do not match the
C<EXPR>s, that the output are one-element lists that contain
the original C<EXPR>.

Tests pass when the characters in C<EXPR> are all below
C<chr(256)>, or after a C<split> statement is executed for
the first time. Tests fail when C<EXPR> contains a "wide"
character, and the output is not from the first time a
C<split> statement has been executed.

=cut

use strict;
use warnings;
use Test​::More;
use Encode;
binmode *STDOUT, "​:encoding(UTF-8)";
binmode *STDERR, "​:encoding(UTF-8)";
sub toUTF8 ($) { Encode​::encode("utf-8",$_[0]) }; # for output to Test​::Builder

my $text0 = "normal\x{ee}"; # 8-bit string
my $text1 = "\x{444}"; # single wide char
my $text2 = "ab\x{ccc}de\x{999}gh"; # string containing wide char

my $pattern1 = chr(0xabc); # single wide char
my $pattern2 = "\x{abc}\x{def}ghi"; # more than one wide char

for ( $text1, # the first call to each split function is ok
  $text0, # ok
  $text1, # tests fail, one warning
  $text2) { # tests fail, more than one warning

  print STDERR "--------------------\ntext is $_\n";

  print STDERR "pattern is /$pattern1/\n";
  my @​list1 = split /$pattern1/, $_;
  ok(@​list1 > 0, toUTF8 "split had results for text $_, pattern $pattern1");
  ok($list1[0] eq $_, toUTF8 "correct result for text $_, pattern $pattern1");

  print STDERR "pattern is /$pattern2/\n";
  my @​list2 = split /$pattern2/, $_;
  ok(@​list2 > 0, toUTF8 "split had results for text $_, pattern $pattern2");
  ok($list2[0] eq $_, toUTF8 "correct result for text $_, pattern $pattern2");
}

done_testing;

=head2 workaround

A workaround is to call C<split> indirectly in a way that insures
the C<PATTERN> is recompiled before each call. For the tests above,
we could define an indirect function

  sub SPLIT { my ($PATTERN, $EXPR) = @​_; return split /$PATTERN/,$EXPR }

and call

  my @​list1 = SPLIT $pattern1, $_;
  ...
  my @​list2 = SPLIT $pattern2, $_;

instead.

=head1 SUBMITTED BY

Marty O'Brien, E<lt>mob@​cpan.orgE<gt>

=cut


Flags​:
  category=core
  severity=medium


Site configuration information for perl 5.12.4​:

Configured by Debian Project at Tue Sep 6 08​:07​:52 UTC 2011.

Summary of my perl5 (revision 5 version 12 subversion 4) configuration​:
 
  Platform​:
  osname=linux, osvers=2.6.24-28-server, archname=i686-linux-gnu-thread-multi-64int
  uname='linux roseapple 2.6.24-28-server #1 smp wed aug 18 21​:17​:51 utc 2010 i686 i686 i386 gnulinux '
  config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i686-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.12 -Darchlib=/usr/lib/perl/5.12 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.12.4 -Dsitearch=/usr/local/lib/perl/5.12.4 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Duse64bitint -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Ui_libutil -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.12.4 -des'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=define, usemultiplicity=define
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=define, use64bitall=undef, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O2 -g',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
  ccversion='', gccversion='4.6.1', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /lib/i386-linux-gnu /lib/../lib /usr/lib/i386-linux-gnu /usr/lib/../lib /lib /usr/lib
  libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
  perllibs=-ldl -lm -lpthread -lc -lcrypt
  libc=, so=so, useshrplib=true, libperl=libperl.so.5.12.4
  gnulibc_version='2.13'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib -fstack-protector'

Locally applied patches​:
 


@​INC for perl 5.12.4​:
  /etc/perl
  /usr/local/lib/perl/5.12.4
  /usr/local/share/perl/5.12.4
  /usr/lib/perl5
  /usr/share/perl5
  /usr/lib/perl/5.12
  /usr/share/perl/5.12
  /usr/local/lib/site_perl
  .


Environment for perl 5.12.4​:
  HOME=/root
  LANG=en_US.UTF-8
  LANGUAGE (unset)
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/root/bin​:/usr/local/sbin​:/usr/local/bin​:/usr/sbin​:/usr/bin​:/sbin​:/bin​:/usr/games
  PERL_BADLANG (unset)
  SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Sep 10, 2012

From mob@cpan.org

Possibly related​: http​://stackoverflow.com/questions/12358714

  Why does this unicode char cause perl regex match to emit
“uninitialized” warning?

  The following code​:

  #!/usr/bin/env perl
  use utf8;
  use strict;
  use warnings;
  use 5.012; # implicitly turn on feature unicode_strings
  my $test = "some string";
  $test =~ m/.+\x{2013}/x;

  Yields​:

  Use of uninitialized value $test in pattern match (m//) at test.pl
line 9.

@p5pRT
Copy link
Author

p5pRT commented Sep 10, 2012

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Sep 11, 2012

From @khwilliamson

On Mon Sep 10 13​:21​:50 2012, mob wrote​:

Possibly related​: http​://stackoverflow.com/questions/12358714

Why does this unicode char cause perl regex match to emit

“uninitialized” warning?

The following code&#8203;:

\#\!/usr/bin/env perl
use utf8;
use strict;
use warnings;
use 5\.012; \# implicitly turn on feature unicode\_strings
my $test = "some string";
$test =~ m/\.\+\\x\{2013\}/x;

Yields&#8203;:

Use of uninitialized value $test in pattern match \(m//\) at test\.pl

line 9.

I looked at this a little. It comes from these lines (in today's blead)
regexec.c S_regtry()​:

2454 if (!(utf8_target ? prog->float_utf8 : prog->float_substr))
2455 utf8_target ? to_utf8_substr(prog) :
to_byte_substr(prog);
2456 float_real = utf8_target ? prog->float_utf8 :
prog->float_substr;
2457
2458 little = SvPV_const(float_real, len);

The target is not utf8, but float_substr is set to NULL. The code tries
to guess the name of the uninitialized SV, and comes up erroneously with
$temp, instead of the internal Perl variable, float_substr.

This appears to be a regex optimizer bug, and I think others are better
qualified to look into that.
--
Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Sep 11, 2012

From @arc

Karl Williamson via RT <perlbug-followup@​perl.org> wrote​:

On Mon Sep 10 13​:21​:50 2012, mob wrote​:

\#\!/usr/bin/env perl
use utf8;
use strict;
use warnings;
use 5\.012; \# implicitly turn on feature unicode\_strings
my $test = "some string";
$test =~ m/\.\+\\x\{2013\}/x;

Yields&#8203;:

Use of uninitialized value $test in pattern match \(m//\) at test\.pl line 9\.

That can be reduced further​:

$ perl -we '"aa" =~ /.+\x{100}/'
Use of uninitialized value in pattern match (m//) at -e line 1.

This appears to be a regex optimizer bug

I think I agree, more or less. This quells the warning, and
everything in t/re still passes with it (on my machine)​:

Inline Patch
diff --git a/regexec.c b/regexec.c
index 2dc2314..5ef493b 100644
--- a/regexec.c
+++ b/regexec.c
@@ -2453,6 +2453,9 @@ Perl_regexec_flags(pTHX_ REGEXP * const rx, char
*stringarg, registe

  if (!(utf8_target ? prog->float_utf8 : prog->float_substr))
  utf8_target ? to_utf8_substr(prog) : to_byte_substr(prog);
+ if (!utf8_target && (!prog->float_substr ||
!SvOK(prog->float_substr)))
+ /* can't match​: floating substr needs utf8, but
target is not utf8 */
+ goto phooey;
  float_real = utf8_target ? prog->float_utf8 : prog->float_substr;

  little = SvPV_const(float_real, len);

I have to go out just now, but I'll put together a proper patch later
this evening, including a test.

--
Aaron Crane ** http​://aaroncrane.co.uk/

@p5pRT
Copy link
Author

p5pRT commented Sep 12, 2012

From mob@cpan.org

I think I agree, more or less. This quells the warning, and
everything in t/re still passes with it (on my machine)​:

diff --git a/regexec.c b/regexec.c
index 2dc2314..5ef493b 100644
--- a/regexec.c
+++ b/regexec.c
@​@​ -2453,6 +2453,9 @​@​ Perl_regexec_flags(pTHX_ REGEXP * const rx, char
*stringarg, registe

        if \(\!\(utf8\_target ? prog\->float\_utf8 : prog\-

float_substr))
utf8_target ? to_utf8_substr(prog) :
to_byte_substr(prog);
+ if (!utf8_target && (!prog->float_substr ||
!SvOK(prog->float_substr)))
+ /* can't match​: floating substr needs utf8, but
target is not utf8 */
+ goto phooey;
float_real = utf8_target ? prog->float_utf8 : prog-
float_substr;

         little = SvPV\_const\(float\_real\, len\);

I have to go out just now, but I'll put together a proper patch later
this evening, including a test.

The patch fixes the warnings from this one-liner

  $ perl -we '"aa" =~ /.+\x{100}/'

but not this one

  $ perl -we '$p=chr(0x100); for (".","ab\x{101}def") { @​q = split /$p/ }'
  Name "main​::q" used only once​: possible typo at -e line 1.
  Use of uninitialized value in split at -e line 1.
  Use of uninitialized value in split at -e line 1.
  Use of uninitialized value in split at -e line 1.
  Use of uninitialized value in split at -e line 1.
  Use of uninitialized value in split at -e line 1.
  Use of uninitialized value in split at -e line 1.

@p5pRT
Copy link
Author

p5pRT commented Sep 12, 2012

From @khwilliamson

On Tue Sep 11 20​:23​:24 2012, mob wrote​:

I think I agree, more or less. This quells the warning, and
everything in t/re still passes with it (on my machine)​:

diff --git a/regexec.c b/regexec.c
index 2dc2314..5ef493b 100644
--- a/regexec.c
+++ b/regexec.c
@​@​ -2453,6 +2453,9 @​@​ Perl_regexec_flags(pTHX_ REGEXP * const rx, char
*stringarg, registe

        if \(\!\(utf8\_target ? prog\->float\_utf8 : prog\-

float_substr))
utf8_target ? to_utf8_substr(prog) :
to_byte_substr(prog);
+ if (!utf8_target && (!prog->float_substr ||
!SvOK(prog->float_substr)))
+ /* can't match​: floating substr needs utf8, but
target is not utf8 */
+ goto phooey;
float_real = utf8_target ? prog->float_utf8 : prog-
float_substr;

         little = SvPV\_const\(float\_real\, len\);

I have to go out just now, but I'll put together a proper patch later
this evening, including a test.

The patch fixes the warnings from this one-liner

$ perl \-we '"aa" =~ /\.\+\\x\{100\}/'

but not this one

$ perl \-we '$p=chr\(0x100\); for \("\."\,"ab\\x\{101\}def"\) \{ @&#8203;q = split

/$p/ }'

Name "main&#8203;::q" used only once&#8203;: possible typo at \-e line 1\.
Use of uninitialized value in split at \-e line 1\.
Use of uninitialized value in split at \-e line 1\.
Use of uninitialized value in split at \-e line 1\.
Use of uninitialized value in split at \-e line 1\.
Use of uninitialized value in split at \-e line 1\.
Use of uninitialized value in split at \-e line 1\.

Any successful patch of this would require analyzing regcomp.c to find
out why float_substr is NULL. The fault may not lie in regexec.c at all.
--
Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Sep 12, 2012

From @arc

Karl Williamson via RT <perlbug-followup@​perl.org> wrote​:

On Tue Sep 11 20​:23​:24 2012, mob wrote​:

The patch fixes the warnings from this one-liner

$ perl \-we '"aa" =~ /\.\+\\x\{100\}/'

but not this one

$ perl \-we '$p=chr\(0x100\); for \("\."\,"ab\\x\{101\}def"\) \{ @&#8203;q = split

/$p/ }'

Name "main&#8203;::q" used only once&#8203;: possible typo at \-e line 1\.
Use of uninitialized value in split at \-e line 1\.
Use of uninitialized value in split at \-e line 1\.
Use of uninitialized value in split at \-e line 1\.
Use of uninitialized value in split at \-e line 1\.
Use of uninitialized value in split at \-e line 1\.
Use of uninitialized value in split at \-e line 1\.

Any successful patch of this would require analyzing regcomp.c to find
out why float_substr is NULL. The fault may not lie in regexec.c at all.

For the /.+\x{100}/ case, I believe the logic is equivalent to this,
for the anchored-substring case, round line 2366​:

  if (must == &PL_sv_undef)
  /* could not downgrade utf8 check substring, so must fail */
  goto phooey;

That is​: the reason we don't have float_substr is that the floating
substring was utf8 in the pattern, but it can't be downgraded to the
non-utf8 representation. So if the target string is non-utf8, we know
there can't be a match, and we can skip the regex engine proper
entirely.

I haven't investigated the warnings from split, but I don't think
they're related, because you need both loop iterations (and in that
order, non-utf8 before utf8) to trigger them. Does that ring a bell
for anyone who's familiar with how split works?

--
Aaron Crane ** http​://aaroncrane.co.uk/

@p5pRT
Copy link
Author

p5pRT commented Sep 12, 2012

From @arc

I wrote​:

Karl Williamson via RT <perlbug-followup@​perl.org> wrote​:

Any successful patch of this would require analyzing regcomp.c to find
out why float_substr is NULL. The fault may not lie in regexec.c at all.

For the /.+\x{100}/ case, I believe the logic is equivalent to this,
for the anchored-substring case, round line 2366​:

if \(must == &PL\_sv\_undef\)
    /\* could not downgrade utf8 check substring\, so must fail \*/
    goto phooey;

I'm still happy with my analysis of that case; see the attached patch.
In particular, the similarity to the anchored-substring case
increases my confidence that this is a reasonable fix. However…

I haven't investigated the warnings from split, but I don't think
they're related, because you need both loop iterations (and in that
order, non-utf8 before utf8) to trigger them. Does that ring a bell
for anyone who's familiar with how split works?

… after a little more investigation, I've come to the conclusion that
the use of split is a red herring here, because I can trigger what
seems to be the same bug using only regex matches​:

$ perl -we '$c = "\x{100}"; $r = /$c/;' -e '"a" =~ $r;' -e '"\x{101}" =~ $r'
Use of uninitialized value $_ in pattern match (m//) at -e line 1.

Before I realised that, I worked up this patch​:

Inline Patch
diff --git a/regexec.c b/regexec.c
index 7f81fea..76c261f 100644
--- a/regexec.c
+++ b/regexec.c
@@ -629,6 +629,15 @@ Perl_re_intuit_start(pTHX_ REGEXP * const rx, SV
*sv, char *strpos,   if \(check == &PL\_sv\_undef\) \{   DEBUG\_EXECUTE\_r\(PerlIO\_printf\(Perl\_debug\_log\,   "Non\-utf8 string cannot match utf8 check string\\n"\)\); \+ /\* At this point\, to\_byte\_substr\(\) has set prog\->check\_substr to \+ \* &PL\_sv\_undef\. But this function \(and pregexec\(\)\) use the \+ \* non\-nullity of prog\->check\_substr \(rather than the definedness of \+ \* the sv it points to\) to decide whether to use it\. So should this \+ \* regex be reused in the future\, pregexec\(\) will \*try\* using \+ \* check\_substr\, but emit an "uninitialized value" warning every \+ \* time it consults it\. We work around this by setting the useless\, \+ \* undefined check\_substr to a null pointer\. \*/ \+ prog\->check\_substr = 0;   goto fail;   \}   if \(prog\->extflags & RXf\_ANCH\) \{ /\* Match at beg\-of\-str or after \\n \*/

But that clearly *isn't* the right fix, because while it does fix the
warnings in split, it doesn't help with my split-free test case.

Does anyone have any ideas?

--
Aaron Crane ** http​://aaroncrane.co.uk/

@p5pRT
Copy link
Author

p5pRT commented Sep 12, 2012

From @arc

0001-Fix-spurious-uninitialized-value-warning-in-regex-ma.patch
From ab1b85c5c22c572d44d72acb4c74baca6e8d0fa3 Mon Sep 17 00:00:00 2001
From: Aaron Crane <arc@cpan.org>
Date: Wed, 12 Sep 2012 16:04:38 +0100
Subject: [PATCH] Fix spurious "uninitialized value" warning in regex match

The warning appeared if the pattern contains a floating substring for which
utf8 is needed, and the target string isn't in utf8.  In this situation,
downgrading the floating substring yields undef, which triggers the warning.

Matching can't succeed in this situation, because it's impossible for the
non-utf8 target string to contain any string which needs utf8 for its own
representation.  So the warning is quelled by aborting the match early.

Anchored substrings already have a check of this form; this commit makes the
corresponding change in the floating-substring case.
---
 regexec.c           | 11 +++++++++--
 t/re/pat_advanced.t |  6 ++++++
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/regexec.c b/regexec.c
index 2dc2314..7f81fea 100644
--- a/regexec.c
+++ b/regexec.c
@@ -2451,8 +2451,15 @@ Perl_regexec_flags(pTHX_ REGEXP * const rx, char *stringarg, register char *stre
 	    STRLEN len;
 	    const char *little;
 
-	    if (!(utf8_target ? prog->float_utf8 : prog->float_substr))
-		utf8_target ? to_utf8_substr(prog) : to_byte_substr(prog);
+	    if (utf8_target && !prog->float_utf8)
+		to_utf8_substr(prog);
+	    else if (!utf8_target && !prog->float_substr) {
+		to_byte_substr(prog);
+		if (prog->float_substr == &PL_sv_undef)
+		    /* downgrading failed, but target is not utf8, so
+		     * matching must fail */
+		    goto phooey;
+	    }
 	    float_real = utf8_target ? prog->float_utf8 : prog->float_substr;
 
             little = SvPV_const(float_real, len);
diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t
index 05cc191..9502928 100644
--- a/t/re/pat_advanced.t
+++ b/t/re/pat_advanced.t
@@ -789,6 +789,12 @@ sub run_tests {
     }
 
     {
+        # The second half of RT #114808
+        warning_is(sub {'aa' =~ /.+\x{100}/}, undef,
+                   'utf8-only floating substr, non-utf8 target, no warning');
+    }
+
+    {
         my $message = "qr /.../x";
         my $R = qr / A B C # D E/x;
         ok("ABCDE" =~    $R   && $& eq "ABC", $message);
-- 
1.7.11.3

@p5pRT
Copy link
Author

p5pRT commented Sep 12, 2012

From @demerphq

On 12 September 2012 20​:44, Aaron Crane <perl@​aaroncrane.co.uk> wrote​:

I wrote​:

Karl Williamson via RT <perlbug-followup@​perl.org> wrote​:

Any successful patch of this would require analyzing regcomp.c to find
out why float_substr is NULL. The fault may not lie in regexec.c at all.

For the /.+\x{100}/ case, I believe the logic is equivalent to this,
for the anchored-substring case, round line 2366​:

if \(must == &PL\_sv\_undef\)
    /\* could not downgrade utf8 check substring\, so must fail \*/
    goto phooey;

I'm still happy with my analysis of that case; see the attached patch.
In particular, the similarity to the anchored-substring case
increases my confidence that this is a reasonable fix. However…

I haven't investigated the warnings from split, but I don't think
they're related, because you need both loop iterations (and in that
order, non-utf8 before utf8) to trigger them. Does that ring a bell
for anyone who's familiar with how split works?

… after a little more investigation, I've come to the conclusion that
the use of split is a red herring here, because I can trigger what
seems to be the same bug using only regex matches​:

$ perl -we '$c = "\x{100}"; $r = /$c/;' -e '"a" =~ $r;' -e '"\x{101}" =~ $r'
Use of uninitialized value $_ in pattern match (m//) at -e line 1.

Are you sure that tests what you think it does? Shouldn't that be $r= qr/$c/;?

As written I would have said it is going to match /$c/ against $_
which is actually undefined and stick the result of the match (0) in
$r.

Before I realised that, I worked up this patch​:

diff --git a/regexec.c b/regexec.c
index 7f81fea..76c261f 100644
--- a/regexec.c
+++ b/regexec.c
@​@​ -629,6 +629,15 @​@​ Perl_re_intuit_start(pTHX_ REGEXP * const rx, SV
*sv, char *strpos,
if (check == &PL_sv_undef) {
DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log,
"Non-utf8 string cannot match utf8 check string\n"));
+ /* At this point, to_byte_substr() has set prog->check_substr to
+ * &PL_sv_undef. But this function (and pregexec()) use the
+ * non-nullity of prog->check_substr (rather than the definedness of
+ * the sv it points to) to decide whether to use it. So should this
+ * regex be reused in the future, pregexec() will *try* using
+ * check_substr, but emit an "uninitialized value" warning every
+ * time it consults it. We work around this by setting the useless,
+ * undefined check_substr to a null pointer. */
+ prog->check_substr = 0;
goto fail;
}
if (prog->extflags & RXf_ANCH) { /* Match at beg-of-str or after \n */

But that clearly *isn't* the right fix, because while it does fix the
warnings in split, it doesn't help with my split-free test case.

Does anyone have any ideas?

I need to review this more closely.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Sep 12, 2012

From @arc

demerphq <demerphq@​gmail.com> wrote​:

On 12 September 2012 20​:44, Aaron Crane <perl@​aaroncrane.co.uk> wrote​:

$ perl -we '$c = "\x{100}"; $r = /$c/;' -e '"a" =~ $r;' -e '"\x{101}" =~ $r'
Use of uninitialized value $_ in pattern match (m//) at -e line 1.

Are you sure that tests what you think it does? Shouldn't that be $r= qr/$c/;?

D'oh. Thanks for pointing out my mistake.

With the missing qr inserted, that code is warning-free in blead, so
I'm back to thinking that my suggested fix for the original bug report
(the split case) is, at least, not overwhelmingly wrong. Patch
attached.

I need to review this more closely.

Thanks, I'd be most grateful for the review.

If you also have chance to consider my other patch (the one for the
/.+\x{100}/ report), that'd be great.

--
Aaron Crane ** http​://aaroncrane.co.uk/

@p5pRT
Copy link
Author

p5pRT commented Sep 12, 2012

From @arc

0001-Fix-spurious-uninitialized-value-warning-in-split.patch
From a4888818c4611ffd116095857322cc1f3001dec3 Mon Sep 17 00:00:00 2001
From: Aaron Crane <arc@cpan.org>
Date: Wed, 12 Sep 2012 21:49:23 +0100
Subject: [PATCH] Fix spurious "uninitialized value" warning in split

This happened when:

* The split is invoked in a loop
* The pattern being split on is interpolated from a string
* The pattern being split on has a floating substring which needs utf8
* One iteration of the loop has a non-utf8 target string
* A subsequent iteration has a utf8 target string

In execution of the non-utf8-target iteration, the check_substr field of the
pattern was being set to &PL_sv_undef (where its default value is a null
pointer); that caused the subsequent utf8-target iteration to attempt to use
the undefined value as the floating substring.  Fix by setting it back to a
null pointer.
---
 regexec.c    |  9 +++++++++
 t/op/split.t | 18 +++++++++++++++++-
 2 files changed, 26 insertions(+), 1 deletion(-)
 mode change 100644 => 100755 t/op/split.t

diff --git a/regexec.c b/regexec.c
index 7f81fea..76c261f 100644
--- a/regexec.c
+++ b/regexec.c
@@ -629,6 +629,15 @@ Perl_re_intuit_start(pTHX_ REGEXP * const rx, SV *sv, char *strpos,
     if (check == &PL_sv_undef) {
 	DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log,
 		"Non-utf8 string cannot match utf8 check string\n"));
+        /* At this point, to_byte_substr() has set prog->check_substr to
+         * &PL_sv_undef.  But this function (and pregexec()) use the
+         * non-nullity of prog->check_substr (rather than the definedness of
+         * the sv it points to) to decide whether to use it.  So should this
+         * regex be reused in the future, pregexec() will *try* using
+         * check_substr, but emit an "uninitialized value" warning every
+         * time it consults it.  We work around this by setting the useless
+         * undef check_substr to a null pointer. */
+        prog->check_substr = 0;
 	goto fail;
     }
     if (prog->extflags & RXf_ANCH) {	/* Match at beg-of-str or after \n */
diff --git a/t/op/split.t b/t/op/split.t
old mode 100644
new mode 100755
index 6903503..feccf87
--- a/t/op/split.t
+++ b/t/op/split.t
@@ -6,7 +6,7 @@ BEGIN {
     require './test.pl';
 }
 
-plan tests => 102;
+plan tests => 108;
 
 $FS = ':';
 
@@ -417,3 +417,19 @@ is($cnt, scalar(@ary));
            # 'my' doesn't trigger the bug
     is "@PATH", "Font GlyphNames", "hybrid scalar-and-array context";
 }
+
+{
+    # RT #114808
+    use warnings;
+    my @warn;
+    local $SIG{__WARN__} = sub { push @warn, join '', @_ };
+    my $char = "\x{100}";
+    for (".", "\x{101}") {
+        my $which = $_ eq '.' ? 'ascii' : 'utf8';
+        my @x = split /$char/;
+        is(scalar @x, 1,    "split $which on interpolated utf8: one field");
+        is($x[0], $_,       "split $which on interpolated utf8: field value");
+        is(scalar @warn, 0, "split $which on interpolated utf8: no warnings");
+        @warn = ();
+    }
+}
-- 
1.7.11.3

@p5pRT
Copy link
Author

p5pRT commented Sep 12, 2012

From @demerphq

On 12 September 2012 22​:56, Aaron Crane <perl@​aaroncrane.co.uk> wrote​:

demerphq <demerphq@​gmail.com> wrote​:

On 12 September 2012 20​:44, Aaron Crane <perl@​aaroncrane.co.uk> wrote​:

$ perl -we '$c = "\x{100}"; $r = /$c/;' -e '"a" =~ $r;' -e '"\x{101}" =~ $r'
Use of uninitialized value $_ in pattern match (m//) at -e line 1.

Are you sure that tests what you think it does? Shouldn't that be $r= qr/$c/;?

D'oh. Thanks for pointing out my mistake.

With the missing qr inserted, that code is warning-free in blead, so
I'm back to thinking that my suggested fix for the original bug report
(the split case) is, at least, not overwhelmingly wrong. Patch
attached.

I need to review this more closely.

Thanks, I'd be most grateful for the review.

If you also have chance to consider my other patch (the one for the
/.+\x{100}/ report), that'd be great.

You said at one point "fails to downgrade, so it returns undef", and
that patch adds a SvOK() check, and then this patch will set it to
null right?

But doesn't that mean that the substring optimization would be lost
permanently? So shouldnt we fix that logic so it can handle this case
without destroying the match substr?

IOW, doesnt this patch series (which I appreciate) solve this problem
a bit too far down the stack? (Am i incorrectly entangling these two
patches?)

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Sep 13, 2012

From @arc

demerphq <demerphq@​gmail.com> wrote​:

On 12 September 2012 22​:56, Aaron Crane <perl@​aaroncrane.co.uk> wrote​:

Thanks, I'd be most grateful for the review.

If you also have chance to consider my other patch (the one for the
/.+\x{100}/ report), that'd be great.

You said at one point "fails to downgrade, so it returns undef", and
that patch adds a SvOK() check, and then this patch will set it to
null right?

The submitted version of my patch for /.+\x{100}/ just compared `==
&PL_sv_undef`, rather than using SvOK(). But that case doesn't seem
to be the same as this split case​: my fix for the split warnings works
regardless of whether the earlier patch has been applied.

For clarity, this is the earlier patch I'm talking about​:

https://rt-archive.perl.org/perl5/Ticket/Attachment/1153646/588206/0001-Fix-spurious-uninitialized-value-warning-in-regex-ma.patch

But doesn't that mean that the substring optimization would be lost
permanently? So shouldnt we fix that logic so it can handle this case
without destroying the match substr?

I don't *think* that's the case. This is just the version of the
floating-or-anchored substring that's copied into check_substr from
anchored_substr or float_substr as appropriate; the original is left
where it is. For added tricksiness, though, those apparent fields of
struct regexp are actually preprocessor macros; struct regexp contains
struct reg_substr_data *substrs, which points to a struct containing a
three-element array "data" where element 0 is anchored, 1 is floating,
and 2 is "check". (See line 573 in regcomp.h.) This gdb session
seems to support my position that we aren't destroying the match
substr​:

  $ gdb --args ./miniperl -we '$p = chr 0x100; for (".",
"ab\x{100}cd") { @​q = split /$p/ }'
  GNU gdb 6.3.50-20050815 (Apple version gdb-1515) (Sat Jan 15
08​:33​:48 UTC 2011)
  […]
  (gdb) b regexec.c​:640
  Breakpoint 1 at 0x10015168a​: file regexec.c, line 640.

That's the `prog->check_substr = 0` line my patch adds.

  (gdb) run
  Starting program​: ./miniperl -we \$p\ =\ chr\ 0x100\;\ for\
\(\".\",\ \"ab\\x\{100\}cd\"\)\ \{\ @​q\ =\ split\ /\$p/\ \}
  Reading symbols for shared libraries ++.. done
  Name "main​::q" used only once​: possible typo at -e line 1.

  Breakpoint 1, Perl_re_intuit_start (rx=0x10082e108,
sv=0x100826610, strpos=0x1003098f0 ".", strend=0x1003098f1 "",
flags=0, data=0x7fff5fbfe640) at regexec.c​:640
  640 prog->check_substr = 0;
  (gdb) p prog->substrs->data
  $1 = {{
  min_offset = 0,
  max_offset = 0,
  substr = 0x1001d4230,
  utf8_substr = 0x10082e120,
  end_shift = 0
  }, {
  min_offset = 0,
  max_offset = 0,
  substr = 0x0,
  utf8_substr = 0x0,
  end_shift = 0
  }, {
  min_offset = 0,
  max_offset = 0,
  substr = 0x1001d4230,
  utf8_substr = 0x10082e120,
  end_shift = 0
  }}

So that's the substring data we would leave if my change weren't
there. 0x1001d4230 is &PL_sv_undef in this build, btw.

  (gdb) s
  641 goto fail;
  (gdb) p prog->substrs->data
  $2 = {{
  min_offset = 0,
  max_offset = 0,
  substr = 0x1001d4230,
  utf8_substr = 0x10082e120,
  end_shift = 0
  }, {
  min_offset = 0,
  max_offset = 0,
  substr = 0x0,
  utf8_substr = 0x0,
  end_shift = 0
  }, {
  min_offset = 0,
  max_offset = 0,
  substr = 0x0,
  utf8_substr = 0x10082e120,
  end_shift = 0
  }}

Now we've set check_substr (aka prog->substrs->data[2]->substr) to a
null pointer, but check_utf8 is still there.

The next line is `goto fail`, which goes to the "Match rejected by
optimiser" exit. Let's continue this split invocation, but set a
breakpoint in split for the next one​:

  (gdb) b Perl_pp_split
  Breakpoint 2 at 0x10010e467​: file pp.c, line 5246.
  (gdb) cont
  Continuing.

  Breakpoint 2, Perl_pp_split () at pp.c​:5246
  5246 dVAR; dSP; dTARG;

At this point, if we're going to successfully use check_utf8, we'll
end up in fbm_instr(). So make sure that we do​:

  (gdb) b Perl_fbm_instr
  Breakpoint 3 at 0x10009df20​: file util.c, line 655.
  (gdb) cont
  Continuing.

  Breakpoint 3, Perl_fbm_instr (big=0x100309980 "abĀcd",
bigend=0x100309986 "", littlestr=0x10082e120, flags=0) at util.c​:655
  655 const unsigned char *little = (const unsigned char
*)SvPV_const(littlestr,l);
  (gdb) s
  656 STRLEN littlelen = l;
  (gdb) p little
  $3 = (const unsigned char *) 0x100306390 "Ā"

So, yes, the optimiser is using the \x{100} substring to
strength-reduce full regex matching to an FBM search, which means we
didn't destroy the match substr. Phew.

IOW, doesnt this patch series (which I appreciate) solve this problem
a bit too far down the stack?

That's a good question, and I really don't grok the regex engine well
enough to have an answer to it.

Also, I've now realised the split warnings are hiding an actual bug.
Under 5.16 (or under current blead)​:

$ perl -CS -le '$p = chr 0x100; for (".", "ab\x{100}cd") { print for
split /$p/ }'
.
$ perl -CS -le '$p = chr 0x100; for (".", "ab\x{101}cd") { print for
split /$p/ }'
.

Note no output from the second split in each run. Whereas in blead
with my second patch (the "assign a null pointer" patch)​:

$ ./perl -CS -le '$p = chr 0x100; for (".", "ab\x{100}cd") { print for
split /$p/ }'
.
ab
cd
$ ./perl -CS -le '$p = chr 0x100; for (".", "ab\x{101}cd") { print for
split /$p/ }'
.
abācd

I don't have an analysis for what's causing that bug, nor why my
proposed change fixes it. Which doesn't give me any confidence that
my second patch (for the split bug) should be applied.

(Am i incorrectly entangling these two patches?)

Probably, I think. The warnings quelled by the first patch happen
when matching a utf8-needing pattern against a non-utf8 SV; the split
warnings happen when splitting a utf8 SV on a utf8-needing pattern
(but only after previously splitting a non-utf8 SV on the same
pattern). So my two patches are independent, afaict. Is it worth
splitting the match warning out to a separate RT ticket?

--
Aaron Crane ** http​://aaroncrane.co.uk/

@p5pRT
Copy link
Author

p5pRT commented Oct 6, 2012

From @khwilliamson

Resolved by commitsc72077c4fff72b66cdde1621c62fb4fd383ce093
and 7e0d5ad

I became convinced that Aaron's first patch was the right direction to
go, and applied it, I was concerned about Yves' comment about
potentially losing things, and so came used a somewhat different
approach to the second patch. Instead of losing the data, the function
that did so now instead just returns failure, and the callers check that
return. It may be that Aaron's approach in the second patch is fine,
but this way, we are sure that nothing is lost. What we do lose is the
knowledge that this pattern can't be tried against a non-UTF8 string, so
it may be that there are cases where we re-try that, only to fail again.
I doubt that that leads to any more inefficiencies than what the
existing situation is, where applying the pattern against both utf8 and
non-utf8 targets changes the substrs each time. I also didn't see the
test cases in the 2nd patch until I had generated my own from Aaron's
data. If someone thinks those should be used as well, we can add them.

I should note that there was yet another case in regexec.c that needed
to check for the downgrade failure, and that is now done.

--
Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Oct 6, 2012

From [Unknown Contact. See original ticket]

Resolved by commitsc72077c4fff72b66cdde1621c62fb4fd383ce093
and 7e0d5ad

I became convinced that Aaron's first patch was the right direction to
go, and applied it, I was concerned about Yves' comment about
potentially losing things, and so came used a somewhat different
approach to the second patch. Instead of losing the data, the function
that did so now instead just returns failure, and the callers check that
return. It may be that Aaron's approach in the second patch is fine,
but this way, we are sure that nothing is lost. What we do lose is the
knowledge that this pattern can't be tried against a non-UTF8 string, so
it may be that there are cases where we re-try that, only to fail again.
I doubt that that leads to any more inefficiencies than what the
existing situation is, where applying the pattern against both utf8 and
non-utf8 targets changes the substrs each time. I also didn't see the
test cases in the 2nd patch until I had generated my own from Aaron's
data. If someone thinks those should be used as well, we can add them.

I should note that there was yet another case in regexec.c that needed
to check for the downgrade failure, and that is now done.

--
Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Oct 6, 2012

@khwilliamson - Status changed from 'open' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant