New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
split output empty when PATTERN and EXPR have "wide" characters #12390
Comments
From mobrule@gmail.comThis is a bug report for perl from mobrule@gmail.com, =head1 BUG REPORT split output empty when PATTERN and EXPR have "wide" characters =head1 DESCRIPTION In a call to the builtin C<split PATTERN, EXPR> function: if the C<PATTERN> contains one or more "wide" characters In addition, when C<warnings> are enabled, Perl will produce This bug has been observed on many different versions of Perl, =head2 ADDITIONAL INFORMATION A C<split> statement will only exhibit this problem after =head2 DEMONSTRATION The code in this script demonstrates the issue. The tests Tests pass when the characters in C<EXPR> are all below =cut use strict; my $text0 = "normal\x{ee}"; # 8-bit string my $pattern1 = chr(0xabc); # single wide char for ( $text1, # the first call to each split function is ok print STDERR "--------------------\ntext is $_\n"; print STDERR "pattern is /$pattern1/\n"; print STDERR "pattern is /$pattern2/\n"; done_testing; =head2 workaround A workaround is to call C<split> indirectly in a way that insures sub SPLIT { my ($PATTERN, $EXPR) = @_; return split /$PATTERN/,$EXPR } and call my @list1 = SPLIT $pattern1, $_; instead. =head1 SUBMITTED BY Marty O'Brien, E<lt>mob@cpan.orgE<gt> =cut Flags: Site configuration information for perl 5.12.4: Configured by Debian Project at Tue Sep 6 08:07:52 UTC 2011. Summary of my perl5 (revision 5 version 12 subversion 4) configuration: Locally applied patches: @INC for perl 5.12.4: Environment for perl 5.12.4: |
From mob@cpan.orgPossibly related: http://stackoverflow.com/questions/12358714 Why does this unicode char cause perl regex match to emit The following code: #!/usr/bin/env perl Yields: Use of uninitialized value $test in pattern match (m//) at test.pl |
The RT System itself - Status changed from 'new' to 'open' |
From @khwilliamsonOn Mon Sep 10 13:21:50 2012, mob wrote:
I looked at this a little. It comes from these lines (in today's blead) 2454 if (!(utf8_target ? prog->float_utf8 : prog->float_substr)) The target is not utf8, but float_substr is set to NULL. The code tries This appears to be a regex optimizer bug, and I think others are better |
From @arcKarl Williamson via RT <perlbug-followup@perl.org> wrote:
That can be reduced further: $ perl -we '"aa" =~ /.+\x{100}/'
I think I agree, more or less. This quells the warning, and Inline Patchdiff --git a/regexec.c b/regexec.c
index 2dc2314..5ef493b 100644
--- a/regexec.c
+++ b/regexec.c
@@ -2453,6 +2453,9 @@ Perl_regexec_flags(pTHX_ REGEXP * const rx, char
if (!(utf8_target ? prog->float_utf8 : prog->float_substr)) little = SvPV_const(float_real, len); I have to go out just now, but I'll put together a proper patch later -- |
From mob@cpan.org
The patch fixes the warnings from this one-liner $ perl -we '"aa" =~ /.+\x{100}/' but not this one $ perl -we '$p=chr(0x100); for (".","ab\x{101}def") { @q = split /$p/ }' |
From @khwilliamsonOn Tue Sep 11 20:23:24 2012, mob wrote:
/$p/ }'
Any successful patch of this would require analyzing regcomp.c to find |
From @arcKarl Williamson via RT <perlbug-followup@perl.org> wrote:
For the /.+\x{100}/ case, I believe the logic is equivalent to this, if (must == &PL_sv_undef) That is: the reason we don't have float_substr is that the floating I haven't investigated the warnings from split, but I don't think -- |
From @arcI wrote:
I'm still happy with my analysis of that case; see the attached patch.
… after a little more investigation, I've come to the conclusion that $ perl -we '$c = "\x{100}"; $r = /$c/;' -e '"a" =~ $r;' -e '"\x{101}" =~ $r' Before I realised that, I worked up this patch: Inline Patchdiff --git a/regexec.c b/regexec.c
index 7f81fea..76c261f 100644
--- a/regexec.c
+++ b/regexec.c
@@ -629,6 +629,15 @@ Perl_re_intuit_start(pTHX_ REGEXP * const rx, SV
But that clearly *isn't* the right fix, because while it does fix the Does anyone have any ideas? -- |
From @arc0001-Fix-spurious-uninitialized-value-warning-in-regex-ma.patchFrom ab1b85c5c22c572d44d72acb4c74baca6e8d0fa3 Mon Sep 17 00:00:00 2001
From: Aaron Crane <arc@cpan.org>
Date: Wed, 12 Sep 2012 16:04:38 +0100
Subject: [PATCH] Fix spurious "uninitialized value" warning in regex match
The warning appeared if the pattern contains a floating substring for which
utf8 is needed, and the target string isn't in utf8. In this situation,
downgrading the floating substring yields undef, which triggers the warning.
Matching can't succeed in this situation, because it's impossible for the
non-utf8 target string to contain any string which needs utf8 for its own
representation. So the warning is quelled by aborting the match early.
Anchored substrings already have a check of this form; this commit makes the
corresponding change in the floating-substring case.
---
regexec.c | 11 +++++++++--
t/re/pat_advanced.t | 6 ++++++
2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/regexec.c b/regexec.c
index 2dc2314..7f81fea 100644
--- a/regexec.c
+++ b/regexec.c
@@ -2451,8 +2451,15 @@ Perl_regexec_flags(pTHX_ REGEXP * const rx, char *stringarg, register char *stre
STRLEN len;
const char *little;
- if (!(utf8_target ? prog->float_utf8 : prog->float_substr))
- utf8_target ? to_utf8_substr(prog) : to_byte_substr(prog);
+ if (utf8_target && !prog->float_utf8)
+ to_utf8_substr(prog);
+ else if (!utf8_target && !prog->float_substr) {
+ to_byte_substr(prog);
+ if (prog->float_substr == &PL_sv_undef)
+ /* downgrading failed, but target is not utf8, so
+ * matching must fail */
+ goto phooey;
+ }
float_real = utf8_target ? prog->float_utf8 : prog->float_substr;
little = SvPV_const(float_real, len);
diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t
index 05cc191..9502928 100644
--- a/t/re/pat_advanced.t
+++ b/t/re/pat_advanced.t
@@ -789,6 +789,12 @@ sub run_tests {
}
{
+ # The second half of RT #114808
+ warning_is(sub {'aa' =~ /.+\x{100}/}, undef,
+ 'utf8-only floating substr, non-utf8 target, no warning');
+ }
+
+ {
my $message = "qr /.../x";
my $R = qr / A B C # D E/x;
ok("ABCDE" =~ $R && $& eq "ABC", $message);
--
1.7.11.3
|
From @demerphqOn 12 September 2012 20:44, Aaron Crane <perl@aaroncrane.co.uk> wrote:
Are you sure that tests what you think it does? Shouldn't that be $r= qr/$c/;? As written I would have said it is going to match /$c/ against $_
I need to review this more closely. Yves -- |
From @arcdemerphq <demerphq@gmail.com> wrote:
D'oh. Thanks for pointing out my mistake. With the missing qr inserted, that code is warning-free in blead, so
Thanks, I'd be most grateful for the review. If you also have chance to consider my other patch (the one for the -- |
From @arc0001-Fix-spurious-uninitialized-value-warning-in-split.patchFrom a4888818c4611ffd116095857322cc1f3001dec3 Mon Sep 17 00:00:00 2001
From: Aaron Crane <arc@cpan.org>
Date: Wed, 12 Sep 2012 21:49:23 +0100
Subject: [PATCH] Fix spurious "uninitialized value" warning in split
This happened when:
* The split is invoked in a loop
* The pattern being split on is interpolated from a string
* The pattern being split on has a floating substring which needs utf8
* One iteration of the loop has a non-utf8 target string
* A subsequent iteration has a utf8 target string
In execution of the non-utf8-target iteration, the check_substr field of the
pattern was being set to &PL_sv_undef (where its default value is a null
pointer); that caused the subsequent utf8-target iteration to attempt to use
the undefined value as the floating substring. Fix by setting it back to a
null pointer.
---
regexec.c | 9 +++++++++
t/op/split.t | 18 +++++++++++++++++-
2 files changed, 26 insertions(+), 1 deletion(-)
mode change 100644 => 100755 t/op/split.t
diff --git a/regexec.c b/regexec.c
index 7f81fea..76c261f 100644
--- a/regexec.c
+++ b/regexec.c
@@ -629,6 +629,15 @@ Perl_re_intuit_start(pTHX_ REGEXP * const rx, SV *sv, char *strpos,
if (check == &PL_sv_undef) {
DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log,
"Non-utf8 string cannot match utf8 check string\n"));
+ /* At this point, to_byte_substr() has set prog->check_substr to
+ * &PL_sv_undef. But this function (and pregexec()) use the
+ * non-nullity of prog->check_substr (rather than the definedness of
+ * the sv it points to) to decide whether to use it. So should this
+ * regex be reused in the future, pregexec() will *try* using
+ * check_substr, but emit an "uninitialized value" warning every
+ * time it consults it. We work around this by setting the useless
+ * undef check_substr to a null pointer. */
+ prog->check_substr = 0;
goto fail;
}
if (prog->extflags & RXf_ANCH) { /* Match at beg-of-str or after \n */
diff --git a/t/op/split.t b/t/op/split.t
old mode 100644
new mode 100755
index 6903503..feccf87
--- a/t/op/split.t
+++ b/t/op/split.t
@@ -6,7 +6,7 @@ BEGIN {
require './test.pl';
}
-plan tests => 102;
+plan tests => 108;
$FS = ':';
@@ -417,3 +417,19 @@ is($cnt, scalar(@ary));
# 'my' doesn't trigger the bug
is "@PATH", "Font GlyphNames", "hybrid scalar-and-array context";
}
+
+{
+ # RT #114808
+ use warnings;
+ my @warn;
+ local $SIG{__WARN__} = sub { push @warn, join '', @_ };
+ my $char = "\x{100}";
+ for (".", "\x{101}") {
+ my $which = $_ eq '.' ? 'ascii' : 'utf8';
+ my @x = split /$char/;
+ is(scalar @x, 1, "split $which on interpolated utf8: one field");
+ is($x[0], $_, "split $which on interpolated utf8: field value");
+ is(scalar @warn, 0, "split $which on interpolated utf8: no warnings");
+ @warn = ();
+ }
+}
--
1.7.11.3
|
From @demerphqOn 12 September 2012 22:56, Aaron Crane <perl@aaroncrane.co.uk> wrote:
You said at one point "fails to downgrade, so it returns undef", and But doesn't that mean that the substring optimization would be lost IOW, doesnt this patch series (which I appreciate) solve this problem Yves -- |
From @arcdemerphq <demerphq@gmail.com> wrote:
The submitted version of my patch for /.+\x{100}/ just compared `== For clarity, this is the earlier patch I'm talking about:
I don't *think* that's the case. This is just the version of the $ gdb --args ./miniperl -we '$p = chr 0x100; for (".", That's the `prog->check_substr = 0` line my patch adds. (gdb) run Breakpoint 1, Perl_re_intuit_start (rx=0x10082e108, So that's the substring data we would leave if my change weren't (gdb) s Now we've set check_substr (aka prog->substrs->data[2]->substr) to a The next line is `goto fail`, which goes to the "Match rejected by (gdb) b Perl_pp_split Breakpoint 2, Perl_pp_split () at pp.c:5246 At this point, if we're going to successfully use check_utf8, we'll (gdb) b Perl_fbm_instr Breakpoint 3, Perl_fbm_instr (big=0x100309980 "abĀcd", So, yes, the optimiser is using the \x{100} substring to
That's a good question, and I really don't grok the regex engine well Also, I've now realised the split warnings are hiding an actual bug. $ perl -CS -le '$p = chr 0x100; for (".", "ab\x{100}cd") { print for Note no output from the second split in each run. Whereas in blead $ ./perl -CS -le '$p = chr 0x100; for (".", "ab\x{100}cd") { print for I don't have an analysis for what's causing that bug, nor why my
Probably, I think. The warnings quelled by the first patch happen -- |
From @khwilliamsonResolved by commitsc72077c4fff72b66cdde1621c62fb4fd383ce093 I became convinced that Aaron's first patch was the right direction to I should note that there was yet another case in regexec.c that needed -- |
From [Unknown Contact. See original ticket]Resolved by commitsc72077c4fff72b66cdde1621c62fb4fd383ce093 I became convinced that Aaron's first patch was the right direction to I should note that there was yet another case in regexec.c that needed -- |
@khwilliamson - Status changed from 'open' to 'resolved' |
Migrated from rt.perl.org#114808 (status was 'resolved')
Searchable as RT114808$
The text was updated successfully, but these errors were encountered: