Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF8-ness no longer propagating through assignments within regexes #14211

Closed
p5pRT opened this issue Nov 5, 2014 · 18 comments
Closed

UTF8-ness no longer propagating through assignments within regexes #14211

p5pRT opened this issue Nov 5, 2014 · 18 comments

Comments

@p5pRT
Copy link

p5pRT commented Nov 5, 2014

Migrated from rt.perl.org#123135 (status was 'resolved')

Searchable as RT123135$

@p5pRT
Copy link
Author

p5pRT commented Nov 5, 2014

From damian@conway.org

From 5.20, matching and capturing against a utf8 string no longer
reliably produces $^N captures that are also utf8 strings. Or, possibly,
5.20+ simply fails to propagate the original utf8-ness through
assignments from $^N within the regex.

The behaviour was first reported against Regexp​::Grammars, but is not
specific to that module.

The attached tests pass (as expected) under 5.10, 5.12, 5.14,
5.16, and 5.18; they fail under 5.20 and 5.21.

Damian

@p5pRT
Copy link
Author

p5pRT commented Nov 5, 2014

From damian@conway.org

use utf8;
use Test​::More;

our $result;

my $parser = qr{
  (
  ( (?&TOP) )
  (?{ $result = $^N })

  (?(DEFINE)
  (?<TOP> .* )
  )
  )
}xms;

'za?��??_g??l?_ja??' =~ $parser;
ok utf8​::is_utf8( $2 ), 'UTF-8 (Polish) propagated through match to $2';
ok utf8​::is_utf8( $result ), 'UTF-8 (Polish) propagated through match to $^N';

'���������������� ��������������������' =~ $parser;
ok utf8​::is_utf8( $2 ), 'UTF-8 (Hebrew) propagated through match to $2';
ok utf8​::is_utf8( $result ), 'UTF-8 (Hebrew) propagated through match to $^N';

done_testing();

@p5pRT
Copy link
Author

p5pRT commented Nov 5, 2014

From bbkr@post.pl

Looks like test file broke during upload.
Attaching correct one.

@p5pRT
Copy link
Author

p5pRT commented Nov 5, 2014

From bbkr@post.pl

use utf8;
use Test​::More;

our $result;

my $parser = qr{
  (
  ( (?&TOP) )
  (?{ $result = $^N })

  (?(DEFINE)
  (?<TOP> .* )
  )
  )
}xms;

'za�������� g����l�� ja����' =~ $parser;
ok utf8​::is_utf8( $2 ), 'UTF-8 (Polish) propagated through match to $2';
ok utf8​::is_utf8( $result ), 'UTF-8 (Polish) propagated through match to $^N';

'�������� ����������' =~ $parser;
ok utf8​::is_utf8( $2 ), 'UTF-8 (Hebrew) propagated through match to $2';
ok utf8​::is_utf8( $result ), 'UTF-8 (Hebrew) propagated through match to $^N';

done_testing();

@p5pRT
Copy link
Author

p5pRT commented Nov 5, 2014

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Nov 6, 2014

From @tonycoz

On Wed Nov 05 12​:50​:31 2014, thoughtstream wrote​:

From 5.20, matching and capturing against a utf8 string no longer
reliably produces $^N captures that are also utf8 strings. Or, possibly,
5.20+ simply fails to propagate the original utf8-ness through
assignments from $^N within the regex.

The behaviour was first reported against Regexp​::Grammars, but is not
specific to that module.

The attached tests pass (as expected) under 5.10, 5.12, 5.14,
5.16, and 5.18; they fail under 5.20 and 5.21.

Bisects to​:

bad - non-zero exit from ./perl -Ilib -e "\x{101c}" =~ /((?&TOP))(?{ $result = $^N })(?(DEFINE)(?<TOP>.))/; $result eq "\x{101c}" or die
0254aed is the first bad commit
commit 0254aed
Author​: David Mitchell <davem@​iabyn.com>
Date​: Mon May 20 17​:04​:44 2013 +0100

  Rationalise RX_MATCH_UTF8_set()
 
  Now that the RXf_MATCH_UTF8 flag on a regex is just used to indicate
  whether the captures on a successful match are utf8, only set
  this flag at the end of a successful match, rather than at the start of
  the match.
 
  This should make no functional difference the way things stand at the
  moment, but makes things conceptually cleaner.

:100644 100644 bcf2b80cce11b32936a84c2b317ac6bbbd294beb 4aa6de05ab79798138d010fc679e88644b066e00 M regexec.c
bisect run success
That took 1094 seconds

@p5pRT
Copy link
Author

p5pRT commented Nov 6, 2014

From @cpansprout

On Wed Nov 05 18​:32​:18 2014, tonyc wrote​:

Bisects to​:

Argh! You beat me to it. My test case is simpler​:

-e 'chr(256) =~ /(.)(?{ die unless $^N eq chr 256})/'

bad - non-zero exit from ./perl -Ilib -e "\x{101c}" =~ /((?&TOP))(?{
$result = $^N })(?(DEFINE)(?<TOP>.))/; $result eq "\x{101c}" or die
0254aed is the first bad commit
commit 0254aed
Author​: David Mitchell <davem@​iabyn.com>
Date​: Mon May 20 17​:04​:44 2013 +0100

Rationalise RX_MATCH_UTF8_set()

Now that the RXf_MATCH_UTF8 flag on a regex is just used to indicate
whether the captures on a successful match are utf8, only set
this flag at the end of a successful match, rather than at the start
of
the match.

This should make no functional difference the way things stand at the
moment, but makes things conceptually cleaner.

That logic is clearly wrong, though I would not have known any better without this failing test case.

The commit should simple be reverted, and tests should be added. I plan to do both shortly.

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Nov 6, 2014

From @tonycoz

On Wed Nov 05 20​:06​:13 2014, sprout wrote​:

On Wed Nov 05 18​:32​:18 2014, tonyc wrote​:

Bisects to​:

Argh! You beat me to it. My test case is simpler​:

-e 'chr(256) =~ /(.)(?{ die unless $^N eq chr 256})/'

bad - non-zero exit from ./perl -Ilib -e "\x{101c}" =~ /((?&TOP))(?{
$result = $^N })(?(DEFINE)(?<TOP>.))/; $result eq "\x{101c}" or die
0254aed is the first bad commit
commit 0254aed
Author​: David Mitchell <davem@​iabyn.com>
Date​: Mon May 20 17​:04​:44 2013 +0100

Rationalise RX_MATCH_UTF8_set()

Now that the RXf_MATCH_UTF8 flag on a regex is just used to indicate
whether the captures on a successful match are utf8, only set
this flag at the end of a successful match, rather than at the start
of
the match.

This should make no functional difference the way things stand at the
moment, but makes things conceptually cleaner.

That logic is clearly wrong, though I would not have known any better
without this failing test case.

The commit should simple be reverted, and tests should be added. I
plan to do both shortly.

Oops, I had a test and manual revert done before I noticed this.

Tony

@p5pRT
Copy link
Author

p5pRT commented Nov 6, 2014

From @tonycoz

0001-perl-123135-ensure-utf8-ness-propagated-to-N.patch
From 7cc4874d20d36acc8db4f0ce73dff62c1d044394 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Thu, 6 Nov 2014 15:17:08 +1100
Subject: [PATCH] [perl #123135] ensure utf8-ness propagated to $^N

Reverts 0254aed965cd47adab9025a192546e4a5e63873a and adds a test
---
 regexec.c  |    6 ++++--
 t/re/pat.t |    9 ++++++++-
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/regexec.c b/regexec.c
index 37018ac..9f6291b 100644
--- a/regexec.c
+++ b/regexec.c
@@ -670,6 +670,8 @@ Perl_re_intuit_start(pTHX_
     DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log,
                 "Intuit: trying to determine minimum start position...\n"));
 
+    RX_MATCH_UTF8_set(rx,utf8_target);
+
     /* for now, assume that all substr offsets are positive. If at some point
      * in the future someone wants to do clever things with look-behind and
      * -ve offsets, they'll need to fix up any code in this function
@@ -2630,6 +2632,8 @@ Perl_regexec_flags(pTHX_ REGEXP * const rx, char *stringarg, char *strend,
     /* see how far we have to get to not match where we matched before */
     reginfo->till = stringarg + minend;
 
+    RX_MATCH_UTF8_set(rx,utf8_target);
+
     if (prog->extflags & RXf_EVAL_SEEN && SvPADTMP(sv)) {
         /* SAVEFREESV, not sv_mortalcopy, as this SV must last until after
            S_cleanup_regmatch_info_aux has executed (registered by
@@ -3137,8 +3141,6 @@ got_it:
     if (RXp_PAREN_NAMES(prog)) 
         (void)hv_iterinit(RXp_PAREN_NAMES(prog));
 
-    RX_MATCH_UTF8_set(rx, utf8_target);
-
     /* make sure $`, $&, $', and $digit will work later */
     if ( !(flags & REXEC_NOT_FIRST) )
         S_reg_set_capture_string(aTHX_ rx,
diff --git a/t/re/pat.t b/t/re/pat.t
index e532054..b3806f2 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -22,7 +22,7 @@ BEGIN {
     skip_all_without_unicode_tables();
 }
 
-plan tests => 755;  # Update this when adding/deleting tests.
+plan tests => 756;  # Update this when adding/deleting tests.
 
 run_tests() unless caller;
 
@@ -1620,6 +1620,13 @@ EOP
         use utf8;
         like("ÿ", qr/[ÿ-ÿ]/, "\"ÿ\" should match [ÿ-ÿ]");
     }
+
+    {
+        # [perl #123135]
+        my $result;
+	"\x{101c}" =~ /(.)(?{ $result = $^N })/;
+	is($result, "\x{101c}", 'check utf8 propagated to $^N');
+    }
 } # End of sub run_tests
 
 1;
-- 
1.7.10.4

@p5pRT
Copy link
Author

p5pRT commented Nov 6, 2014

From @cpansprout

Fixed in ab4e48c. Tests added in 9c6c921. (Oops. I just noticed a mistake in the commit message. I typed %^N intsead of $^N.)

This should be a candidate for maint-5.20.

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Nov 6, 2014

@cpansprout - Status changed from 'open' to 'pending release'

@p5pRT
Copy link
Author

p5pRT commented Nov 6, 2014

@cpansprout - Status changed from 'pending release' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Nov 6, 2014

From @tonycoz

On Wed Nov 05 21​:24​:26 2014, sprout wrote​:

Fixed in ab4e48c. Tests added in 9c6c921. (Oops. I just noticed
a mistake in the commit message. I typed %^N intsead of $^N.)

This should be a candidate for maint-5.20.

+1 from me.

Tony

@p5pRT
Copy link
Author

p5pRT commented Nov 6, 2014

From @khwilliamson

On 11/06/2014 03​:36 PM, Tony Cook via RT wrote​:

On Wed Nov 05 21​:24​:26 2014, sprout wrote​:

Fixed in ab4e48c. Tests added in 9c6c921. (Oops. I just noticed
a mistake in the commit message. I typed %^N intsead of $^N.)

This should be a candidate for maint-5.20.

+1 from me.

Tony

and me too

@p5pRT
Copy link
Author

p5pRT commented Nov 7, 2014

From @cpansprout

On Thu Nov 06 15​:09​:56 2014, public@​khwilliamson.com wrote​:

On 11/06/2014 03​:36 PM, Tony Cook via RT wrote​:

On Wed Nov 05 21​:24​:26 2014, sprout wrote​:

Fixed in ab4e48c. Tests added in 9c6c921. (Oops. I just noticed
a mistake in the commit message. I typed %^N intsead of $^N.)

This should be a candidate for maint-5.20.

+1 from me.

Tony

and me too

I have backported it in commit 63100fa.

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Nov 7, 2014

@cpansprout - Status changed from 'open' to 'pending release'

@p5pRT
Copy link
Author

p5pRT commented Jun 2, 2015

From @khwilliamson

Thanks for submitting this ticket

The issue should be resolved with the release today of Perl v5.22, available at http​://www.perl.org/get.html
If you find that the problem persists, feel free to reopen this ticket

--
Karl Williamson for the Perl 5 porters team

@p5pRT
Copy link
Author

p5pRT commented Jun 2, 2015

@khwilliamson - Status changed from 'pending release' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant