Skip Menu |
Report information
Id: 123135
Status: resolved
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors: thoughtstream <damian [at] conway.org>
Cc:
AdminCc:

Operating System: (no value)
PatchStatus: (no value)
Severity: low
Type: unknown
Perl Version: (no value)
Fixed In: 5.22.0



Subject: UTF8-ness no longer propagating through assignments within regexes
Download (untitled) / with headers
text/plain 458b
From 5.20, matching and capturing against a utf8 string no longer reliably produces $^N captures that are also utf8 strings. Or, possibly, 5.20+ simply fails to propagate the original utf8-ness through assignments from $^N within the regex. The behaviour was first reported against Regexp::Grammars, but is not specific to that module. The attached tests pass (as expected) under 5.10, 5.12, 5.14, 5.16, and 5.18; they fail under 5.20 and 5.21. Damian
Subject: utf8_bug.t
Download utf8_bug.t
text/plain 599b
use utf8; use Test::More; our $result; my $parser = qr{ ( ( (?&TOP) ) (?{ $result = $^N }) (?(DEFINE) (?<TOP> .* ) ) ) }xms; 'za?—??_g??l?_ja??' =~ $parser; ok utf8::is_utf8( $2 ), 'UTF-8 (Polish) propagated through match to $2'; ok utf8::is_utf8( $result ), 'UTF-8 (Polish) propagated through match to $^N'; 'שלום העולם' =~ $parser; ok utf8::is_utf8( $2 ), 'UTF-8 (Hebrew) propagated through match to $2'; ok utf8::is_utf8( $result ), 'UTF-8 (Hebrew) propagated through match to $^N'; done_testing();
RT-Send-CC: perl5-porters [...] perl.org
Looks like test file broke during upload. Attaching correct one.
Subject: utf8_bug.t
Download utf8_bug.t
text/plain 589b
use utf8; use Test::More; our $result; my $parser = qr{ ( ( (?&TOP) ) (?{ $result = $^N }) (?(DEFINE) (?<TOP> .* ) ) ) }xms; 'zażółć gęślą jaźń' =~ $parser; ok utf8::is_utf8( $2 ), 'UTF-8 (Polish) propagated through match to $2'; ok utf8::is_utf8( $result ), 'UTF-8 (Polish) propagated through match to $^N'; 'שלום העולם' =~ $parser; ok utf8::is_utf8( $2 ), 'UTF-8 (Hebrew) propagated through match to $2'; ok utf8::is_utf8( $result ), 'UTF-8 (Hebrew) propagated through match to $^N'; done_testing();
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 1.3k
On Wed Nov 05 12:50:31 2014, thoughtstream wrote: Show quoted text
> From 5.20, matching and capturing against a utf8 string no longer > reliably produces $^N captures that are also utf8 strings. Or, possibly, > 5.20+ simply fails to propagate the original utf8-ness through > assignments from $^N within the regex. > > The behaviour was first reported against Regexp::Grammars, but is not > specific to that module. > > The attached tests pass (as expected) under 5.10, 5.12, 5.14, > 5.16, and 5.18; they fail under 5.20 and 5.21.
Bisects to: bad - non-zero exit from ./perl -Ilib -e "\x{101c}" =~ /((?&TOP))(?{ $result = $^N })(?(DEFINE)(?<TOP>.))/; $result eq "\x{101c}" or die 0254aed965cd47adab9025a192546e4a5e63873a is the first bad commit commit 0254aed965cd47adab9025a192546e4a5e63873a Author: David Mitchell <davem@iabyn.com> Date: Mon May 20 17:04:44 2013 +0100 Rationalise RX_MATCH_UTF8_set() Now that the RXf_MATCH_UTF8 flag on a regex is just used to indicate whether the captures on a successful match are utf8, only set this flag at the end of a successful match, rather than at the start of the match. This should make no functional difference the way things stand at the moment, but makes things conceptually cleaner. :100644 100644 bcf2b80cce11b32936a84c2b317ac6bbbd294beb 4aa6de05ab79798138d010fc679e88644b066e00 M regexec.c bisect run success That took 1094 seconds
RT-Send-CC: perl5-porters [...] perl.org
On Wed Nov 05 18:32:18 2014, tonyc wrote: Show quoted text
> Bisects to:
Argh! You beat me to it. My test case is simpler: -e 'chr(256) =~ /(.)(?{ die unless $^N eq chr 256})/' Show quoted text
> bad - non-zero exit from ./perl -Ilib -e "\x{101c}" =~ /((?&TOP))(?{ > $result = $^N })(?(DEFINE)(?<TOP>.))/; $result eq "\x{101c}" or die > 0254aed965cd47adab9025a192546e4a5e63873a is the first bad commit > commit 0254aed965cd47adab9025a192546e4a5e63873a > Author: David Mitchell <davem@iabyn.com> > Date: Mon May 20 17:04:44 2013 +0100 > > Rationalise RX_MATCH_UTF8_set() > > Now that the RXf_MATCH_UTF8 flag on a regex is just used to indicate > whether the captures on a successful match are utf8, only set > this flag at the end of a successful match, rather than at the start > of > the match. > > This should make no functional difference the way things stand at the > moment, but makes things conceptually cleaner.
That logic is clearly wrong, though I would not have known any better without this failing test case. The commit should simple be reverted, and tests should be added. I plan to do both shortly. -- Father Chrysostomos
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 1.2k
On Wed Nov 05 20:06:13 2014, sprout wrote: Show quoted text
> On Wed Nov 05 18:32:18 2014, tonyc wrote:
> > Bisects to:
> > Argh! You beat me to it. My test case is simpler: > > -e 'chr(256) =~ /(.)(?{ die unless $^N eq chr 256})/' >
> > bad - non-zero exit from ./perl -Ilib -e "\x{101c}" =~ /((?&TOP))(?{ > > $result = $^N })(?(DEFINE)(?<TOP>.))/; $result eq "\x{101c}" or die > > 0254aed965cd47adab9025a192546e4a5e63873a is the first bad commit > > commit 0254aed965cd47adab9025a192546e4a5e63873a > > Author: David Mitchell <davem@iabyn.com> > > Date: Mon May 20 17:04:44 2013 +0100 > > > > Rationalise RX_MATCH_UTF8_set() > > > > Now that the RXf_MATCH_UTF8 flag on a regex is just used to indicate > > whether the captures on a successful match are utf8, only set > > this flag at the end of a successful match, rather than at the start > > of > > the match. > > > > This should make no functional difference the way things stand at the > > moment, but makes things conceptually cleaner.
> > That logic is clearly wrong, though I would not have known any better > without this failing test case. > > The commit should simple be reverted, and tests should be added. I > plan to do both shortly.
Oops, I had a test and manual revert done before I noticed this. Tony
Subject: 0001-perl-123135-ensure-utf8-ness-propagated-to-N.patch
From 7cc4874d20d36acc8db4f0ce73dff62c1d044394 Mon Sep 17 00:00:00 2001 From: Tony Cook <tony@develop-help.com> Date: Thu, 6 Nov 2014 15:17:08 +1100 Subject: [PATCH] [perl #123135] ensure utf8-ness propagated to $^N Reverts 0254aed965cd47adab9025a192546e4a5e63873a and adds a test --- regexec.c | 6 ++++-- t/re/pat.t | 9 ++++++++- 2 files changed, 12 insertions(+), 3 deletions(-) diff --git a/regexec.c b/regexec.c index 37018ac..9f6291b 100644 --- a/regexec.c +++ b/regexec.c @@ -670,6 +670,8 @@ Perl_re_intuit_start(pTHX_ DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, "Intuit: trying to determine minimum start position...\n")); + RX_MATCH_UTF8_set(rx,utf8_target); + /* for now, assume that all substr offsets are positive. If at some point * in the future someone wants to do clever things with look-behind and * -ve offsets, they'll need to fix up any code in this function @@ -2630,6 +2632,8 @@ Perl_regexec_flags(pTHX_ REGEXP * const rx, char *stringarg, char *strend, /* see how far we have to get to not match where we matched before */ reginfo->till = stringarg + minend; + RX_MATCH_UTF8_set(rx,utf8_target); + if (prog->extflags & RXf_EVAL_SEEN && SvPADTMP(sv)) { /* SAVEFREESV, not sv_mortalcopy, as this SV must last until after S_cleanup_regmatch_info_aux has executed (registered by @@ -3137,8 +3141,6 @@ got_it: if (RXp_PAREN_NAMES(prog)) (void)hv_iterinit(RXp_PAREN_NAMES(prog)); - RX_MATCH_UTF8_set(rx, utf8_target); - /* make sure $`, $&, $', and $digit will work later */ if ( !(flags & REXEC_NOT_FIRST) ) S_reg_set_capture_string(aTHX_ rx, diff --git a/t/re/pat.t b/t/re/pat.t index e532054..b3806f2 100644 --- a/t/re/pat.t +++ b/t/re/pat.t @@ -22,7 +22,7 @@ BEGIN { skip_all_without_unicode_tables(); } -plan tests => 755; # Update this when adding/deleting tests. +plan tests => 756; # Update this when adding/deleting tests. run_tests() unless caller; @@ -1620,6 +1620,13 @@ EOP use utf8; like("ÿ", qr/[ÿ-ÿ]/, "\"ÿ\" should match [ÿ-ÿ]"); } + + { + # [perl #123135] + my $result; + "\x{101c}" =~ /(.)(?{ $result = $^N })/; + is($result, "\x{101c}", 'check utf8 propagated to $^N'); + } } # End of sub run_tests 1; -- 1.7.10.4
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 204b
Fixed in ab4e48c10a. Tests added in 9c6c921b. (Oops. I just noticed a mistake in the commit message. I typed %^N intsead of $^N.) This should be a candidate for maint-5.20. -- Father Chrysostomos
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 247b
On Wed Nov 05 21:24:26 2014, sprout wrote: Show quoted text
> Fixed in ab4e48c10a. Tests added in 9c6c921b. (Oops. I just noticed > a mistake in the commit message. I typed %^N intsead of $^N.) > > This should be a candidate for maint-5.20.
+1 from me. Tony
From: Karl Williamson <public [...] khwilliamson.com>
To: perlbug-followup [...] perl.org
Subject: Re: [perl #123135] UTF8-ness no longer propagating through assignments within regexes
Date: Thu, 06 Nov 2014 16:09:36 -0700
CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 321b
On 11/06/2014 03:36 PM, Tony Cook via RT wrote: Show quoted text
> On Wed Nov 05 21:24:26 2014, sprout wrote:
>> Fixed in ab4e48c10a. Tests added in 9c6c921b. (Oops. I just noticed >> a mistake in the commit message. I typed %^N intsead of $^N.) >> >> This should be a candidate for maint-5.20.
> > +1 from me. > > Tony >
and me too
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 476b
On Thu Nov 06 15:09:56 2014, public@khwilliamson.com wrote: Show quoted text
> On 11/06/2014 03:36 PM, Tony Cook via RT wrote:
> > On Wed Nov 05 21:24:26 2014, sprout wrote:
> >> Fixed in ab4e48c10a. Tests added in 9c6c921b. (Oops. I just noticed > >> a mistake in the commit message. I typed %^N intsead of $^N.) > >> > >> This should be a candidate for maint-5.20.
> > > > +1 from me. > > > > Tony > >
> and me too >
I have backported it in commit 63100fadb. -- Father Chrysostomos
Subject: Your ticket against Perl 5 has been resolved
Download (untitled) / with headers
text/plain 263b
Thanks for submitting this ticket The issue should be resolved with the release today of Perl v5.22, available at http://www.perl.org/get.html If you find that the problem persists, feel free to reopen this ticket -- Karl Williamson for the Perl 5 porters team


This service is sponsored and maintained by Best Practical Solutions and runs on Perl.org infrastructure.

For issues related to this RT instance (aka "perlbug"), please contact perlbug-admin at perl.org