Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(?{...}) not executed in many regexes that do subpattern recursion #13550

Open
p5pRT opened this issue Jan 23, 2014 · 14 comments
Open

(?{...}) not executed in many regexes that do subpattern recursion #13550

p5pRT opened this issue Jan 23, 2014 · 14 comments

Comments

@p5pRT
Copy link

p5pRT commented Jan 23, 2014

Migrated from rt.perl.org#121070 (status was 'open')

Searchable as RT121070$

@p5pRT
Copy link
Author

p5pRT commented Jan 23, 2014

From smls75@gmail.com

I've been experimenting with using embedded (?{...}) blocks in regexes that also make use of the (?&NAME) "Recurse to a named subpattern" feature, and found that for many such regexes, the embedded code blocks fail to take effect when they should.

Here's a simple test case that demonstrates the bug​:

  use Data​::Dumper;
 
  "abcd" =~ /(?&Char) (?{ 42 })
  (?(DEFINE)
  (?<Char> . )
  )/x;
 
  print Dumper $&amp;, $^R;

Output​:
  $VAR1 = 'a';
  $VAR2 = undef;

Expected output​: 42 instead of undef

Curiously, the above test case suddenly *does* work when a superfluous {1,1} is added to the regex like so​:

  "abcd" =~ /(?&Char){1,1} (?{ 42 })
  (?(DEFINE)
  (?<Char> . )
  )/x;

The output is now​:
  $VAR1 = 'a';
  $VAR2 = 42;

This is just one example of how fragile & unpredictable the combination of these two regexp features seems to be; Many cases don't work, although some do.


PS​: I'm using Perl 5.18.2 on Arch Linux, installed using the official distro-provided packages. Output of "perlbug -d" is attached.

@p5pRT
Copy link
Author

p5pRT commented Jan 23, 2014

From smls75@gmail.com


Flags​:
  category=core
  severity=low


Site configuration information for perl 5.18.2​:

Configured by nobody at Sun Jan 12 12​:54​:37 CET 2014.

Summary of my perl5 (revision 5 version 18 subversion 2) configuration​:
 
  Platform​:
  osname=linux, osvers=3.12.6-1-arch, archname=x86_64-linux-thread-multi
  uname='linux mnt-chroots-arch-extra-x86_64-flo-64 3.12.6-1-arch #1 smp preempt fri dec 20 19​:39​:00 cet 2013 x86_64 gnulinux '
  config_args='-des -Dusethreads -Duseshrplib -Doptimize=-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector --param=ssp-buffer-size=4 -Dprefix=/usr -Dvendorprefix=/usr -Dprivlib=/usr/share/perl5/core_perl -Darchlib=/usr/lib/perl5/core_perl -Dsitelib=/usr/share/perl5/site_perl -Dsitearch=/usr/lib/perl5/site_perl -Dvendorlib=/usr/share/perl5/vendor_perl -Dvendorarch=/usr/lib/perl5/vendor_perl -Dscriptdir=/usr/bin/core_perl -Dsitescript=/usr/bin/site_perl -Dvendorscript=/usr/bin/vendor_perl -Dinc_version_list=none -Dman1ext=1perl -Dman3ext=3perl -Dcccdlflags='-fPIC' -Dlddlflags=-shared -Wl,-O1,--sort-common,--as-needed,-z,relro -Dldflags=-Wl,-O1,--sort-common,--as-needed,-z,relro'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=define, usemultiplicity=define
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=define, use64bitall=define, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector --param=ssp-buffer-size=4',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
  ccversion='', gccversion='4.8.2 20131219 (prerelease)', gccosandvers=''
  intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
  ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags ='-Wl,-O1,--sort-common,--as-needed,-z,relro -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /lib/../lib /usr/lib/../lib /lib /usr/lib /lib64 /usr/lib64
  libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat
  perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
  libc=/lib/libc-2.18.so, so=so, useshrplib=true, libperl=libperl.so
  gnulibc_version='2.18'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/core_perl/CORE'
  cccdlflags='-fPIC', lddlflags='-shared -Wl,-O1,--sort-common,--as-needed,-z,relro -L/usr/local/lib -fstack-protector'

Locally applied patches​:
 


@​INC for perl 5.18.2​:
  /usr/lib/perl5/site_perl
  /usr/share/perl5/site_perl
  /usr/lib/perl5/vendor_perl
  /usr/share/perl5/vendor_perl
  /usr/lib/perl5/core_perl
  /usr/share/perl5/core_perl
  .


Environment for perl 5.18.2​:
  HOME=/home/sam
  LANG=en_US.UTF-8
  LANGUAGE=
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/home/sam/bin​:/usr/local/sbin​:/usr/local/bin​:/usr/bin​:/usr/bin/vendor_perl​:/usr/bin/core_perl
  PERL_BADLANG (unset)
  SHELL=/bin/zsh

@p5pRT
Copy link
Author

p5pRT commented Jan 24, 2014

From @Abigail

On Thu, Jan 23, 2014 at 01​:28​:00PM -0800, Sam S. wrote​:

# New Ticket Created by Sam S.
# Please include the string​: [perl #121070]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121070 >

I've been experimenting with using embedded (?{...}) blocks in regexes that also make use of the (?&NAME) "Recurse to a named subpattern" feature, and found that for many such regexes, the embedded code blocks fail to take effect when they should.

Here's a simple test case that demonstrates the bug​:

use Data​::Dumper;

"abcd" =~ /(?&Char) (?{ 42 })
(?(DEFINE)
(?<Char> . )
)/x;

print Dumper $&amp;, $^R;

Output​:
$VAR1 = 'a';
$VAR2 = undef;

Expected output​: 42 instead of undef

Curiously, the above test case suddenly *does* work when a superfluous {1,1} is added to the regex like so​:

"abcd" =~ /(?&Char){1,1} (?{ 42 })
(?(DEFINE)
(?<Char> . )
)/x;

The output is now​:
$VAR1 = 'a';
$VAR2 = 42;

This is just one example of how fragile & unpredictable the combination of these two regexp features seems to be; Many cases don't work, although some do.

The bug doesn't seem that (?{ }) isn't executed, but that the result isn't
assigned to $^R. Changing your first example to​:

  use Data​::Dumper;
 
  "abcd" =~ /(?&Char) (?{ say 42 })
  (?(DEFINE)
  (?<Char> . )
  )/x;
 
  print Dumper $&amp;, $^R;

gives as output​:

  42
  $VAR1 = 'a';
  $VAR2 = undef;

showing that (?{ }) *is* executed, but it's the assignment to $^R that fails.

Abigail

@p5pRT
Copy link
Author

p5pRT commented Jan 24, 2014

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Jan 24, 2014

From @demerphq

On 24 January 2014 10​:50, Abigail <abigail@​abigail.be> wrote​:

On Thu, Jan 23, 2014 at 01​:28​:00PM -0800, Sam S. wrote​:

# New Ticket Created by Sam S.
# Please include the string​: [perl #121070]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121070 >

I've been experimenting with using embedded (?{...}) blocks in regexes that also make use of the (?&NAME) "Recurse to a named subpattern" feature, and found that for many such regexes, the embedded code blocks fail to take effect when they should.

Here's a simple test case that demonstrates the bug​:

use Data​::Dumper;

"abcd" =~ /(?&Char) (?{ 42 })
(?(DEFINE)
(?<Char> . )
)/x;

print Dumper $&amp;, $^R;

Output​:
$VAR1 = 'a';
$VAR2 = undef;

Expected output​: 42 instead of undef

Curiously, the above test case suddenly *does* work when a superfluous {1,1} is added to the regex like so​:

"abcd" =~ /(?&Char){1,1} (?{ 42 })
(?(DEFINE)
(?<Char> . )
)/x;

The output is now​:
$VAR1 = 'a';
$VAR2 = 42;

This is just one example of how fragile & unpredictable the combination of these two regexp features seems to be; Many cases don't work, although some do.

The bug doesn't seem that (?{ }) isn't executed, but that the result isn't
assigned to $^R. Changing your first example to​:

use Data&#8203;::Dumper;

"abcd" =~ /\(?&Char\) \(?\{ say 42 \}\)
           \(?\(DEFINE\)
               \(?\<Char> \. \)
           \)/x;

print Dumper $&\, $^R;

gives as output​:

42
$VAR1 = 'a';
$VAR2 = undef;

showing that (?{ }) *is* executed, but it's the assignment to $^R that fails.

Thanks for the analysis. I will try to find time to poke into this.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Jan 24, 2014

From @Leont

On Fri, Jan 24, 2014 at 10​:50 AM, Abigail <abigail@​abigail.be> wrote​:

The bug doesn't seem that (?{ }) isn't executed, but that the result isn't
assigned to $^R. Changing your first example to​:

use Data&#8203;::Dumper;

"abcd" =~ /\(?&Char\) \(?\{ say 42 \}\)
           \(?\(DEFINE\)
               \(?\<Char> \. \)
           \)/x;

print Dumper $&\, $^R;

gives as output​:

42
$VAR1 = 'a';
$VAR2 = undef;

showing that (?{ }) *is* executed, but it's the assignment to $^R that
fails.

Not quite​:

"abcd" =~ /(?&Char) (?{ 42 }) (?{ say $^R }) (?(DEFINE) (?<Char> . ))/x;
print Dumper $&amp;, $^R;

Does print 42. It is assigned to, but it's apparently not available outside
of the regexp.

Leon

@p5pRT
Copy link
Author

p5pRT commented Jan 25, 2014

From @demerphq

On 23 January 2014 22​:28, Sam S. <perlbug-followup@​perl.org> wrote​:

# New Ticket Created by Sam S.
# Please include the string​: [perl #121070]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121070 >

I've been experimenting with using embedded (?{...}) blocks in regexes that also make use of the (?&NAME) "Recurse to a named subpattern" feature, and found that for many such regexes, the embedded code blocks fail to take effect when they should.

Here's a simple test case that demonstrates the bug​:

use Data​::Dumper;

"abcd" =~ /(?&Char) (?{ 42 })
(?(DEFINE)
(?<Char> . )
)/x;

print Dumper $&amp;, $^R;

Output​:
$VAR1 = 'a';
$VAR2 = undef;

Expected output​: 42 instead of undef

Curiously, the above test case suddenly *does* work when a superfluous {1,1} is added to the regex like so​:

"abcd" =~ /(?&Char){1,1} (?{ 42 })
(?(DEFINE)
(?<Char> . )
)/x;

The output is now​:
$VAR1 = 'a';
$VAR2 = 42;

This is just one example of how fragile & unpredictable the combination of these two regexp features seems to be; Many cases don't work, although some do.

This is fixed in 4b22688

If you wanted to create tests for the various bugs you are finding it
would be useful.

Yves

@p5pRT
Copy link
Author

p5pRT commented Jan 28, 2014

From @cpansprout

On Sat Jan 25 08​:15​:29 2014, demerphq wrote​:

This is fixed in 4b22688

Unfortunately, that causes this to crash​:

$ ./miniperl -Ilib -e '"" =~ /(??{undef *^R;""})(?{42})/; '

What happens is that the undef frees the scalar pointed to by oreplsv in
regtry, Then that scalar is reused for the regexp object created from
the return value of the ?? block. sv_setsv(oreplsv,...) ends up trying to
set the IV slot of a regexp, overwriting the engine field (sv_setsv prob-
ably needs an assertion). Then later accesses crash because r->engine
is now a bad pointer.

I do not understand why it only started crashing at 4b22688.

In any case this diff seems to fix it (part of a local commit that also
includes tests), but I do not fully understand what order everything
happens inside S_regtry, so I do not feel comfortable applying it
without feedback from a regexp expert​:

Inline Patch
diff --git a/regexec.c b/regexec.c
index 0109370..8e0bdd2 100644
--- a/regexec.c
+++ b/regexec.c
@@ -3666,6 +3666,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, r
     U32 lastopen = 0;       /* last open we saw */
     bool has_cutgroup = RX_HAS_CUTGROUP(rex) ? 1 : 0;   
     SV* const oreplsv = GvSVn(PL_replgv);
+    SAVEFREESV(SvREFCNT_inc_simple_NN(oreplsv));
     /* these three flags are set by various ops to signal information to
      * the very next op. They have a useful lifetime of exactly one loop
      * iteration, and are not preserved or restored by state pushes/pops


-- 

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Jan 28, 2014

From @demerphq

On 28 January 2014 13​:43, Father Chrysostomos via RT
<perlbug-followup@​perl.org> wrote​:

On Sat Jan 25 08​:15​:29 2014, demerphq wrote​:

This is fixed in 4b22688

Unfortunately, that causes this to crash​:

$ ./miniperl -Ilib -e '"" =~ /(??{undef *^R;""})(?{42})/; '

Umm, what cauldron of evil did you find that in? :-)

What happens is that the undef frees the scalar pointed to by oreplsv in
regtry, Then that scalar is reused for the regexp object created from
the return value of the ?? block. sv_setsv(oreplsv,...) ends up trying to
set the IV slot of a regexp, overwriting the engine field (sv_setsv prob-
ably needs an assertion). Then later accesses crash because r->engine
is now a bad pointer.

Yeah, ugly.

I do not understand why it only started crashing at 4b22688.

Hrm. Well I think somewhere along the line something changed in the
regexp engine[1] that meant that $^R updating was pretty broken, and I
suspect that it just never got assigned to at all. That was actually
the bug I fixed.

In any case this diff seems to fix it (part of a local commit that also
includes tests), but I do not fully understand what order everything
happens inside S_regtry, so I do not feel comfortable applying it
without feedback from a regexp expert​:

It looks fine to me.

The only thought I have is maybe this is wrong solution overall.

I fixed bugs related to this in the past, if you look at the comments
in 4b22688 there is a block that I expected to fire to handle
this problem but that didn't, and im not even sure if that block is
used anymore.

I think that we dont handle (?{ }) properly in general. We use the
internal equivalent to local to try to handle backtracking, but my
spidey sense says that we should actually handle it in reginfo. I also
would not be surprised that the return from (?{ }) leaks, but I havent
checked.

Consider this pattern​:

"abc" =~ /(?{41}) (?​:(??{print "$^R\n"; ""})a (?{ 42 })(??{print
"$^R\n"; ""}) x | (??{print "$^R\n"; ""}) ab (?{43}) (??{print
"$^R\n"; ""}) )/x

We expect that to print

41
42
41
43

Notice how we restore the value of '41'. This behaviour on backtrack
is complicated by the fact that we may have "localized" $^R many times
during matching, but we want the last to "stick" in the case of a
successful match. Which is what the patch I did tries to solve.

The fact that you might undef *^R at any time during matching
complicates this quite a bit.

So I think there is a reasonable argument to say that we should handle
this quite differently.

Instead of using local like this, we could store the return of (?{ })
in the regmatch struct (the one storing the current state of
matching), this would automatically make backtracking work. Prior to
executing the pattern we would store the current value of $^R
somewhere, and then during execution and prior to executing any (?{ })
or (??{ }) blocks we would copy the latest version of the value into
$^R (without localizing it). At the end if we were successful we would
copy the "latest" version of $^R from the regmatch struct into $^R. As
part of this we would have to make it so that when the regex engine
unwinds the regmatch stack it would free any stored $^R's as it goes.

This would eliminate a lot of local calls, and IMO make the behavior Just Work.

But until someone has the time to do this, assuming it can be done, I
think your patch will do just fine to prevent people playing silly
buggers. :-)

diff --git a/regexec.c b/regexec.c
index 0109370..8e0bdd2 100644
--- a/regexec.c
+++ b/regexec.c
@​@​ -3666,6 +3666,7 @​@​ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, r
U32 lastopen = 0; /* last open we saw */
bool has_cutgroup = RX_HAS_CUTGROUP(rex) ? 1 : 0;
SV* const oreplsv = GvSVn(PL_replgv);
+ SAVEFREESV(SvREFCNT_inc_simple_NN(oreplsv));
/* these three flags are set by various ops to signal information to
* the very next op. They have a useful lifetime of exactly one loop
* iteration, and are not preserved or restored by state pushes/pops

--

Father Chrysostomos

---
via perlbug​: queue​: perl5 status​: open
https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121070

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Jan 28, 2014

From smls75@gmail.com

On Sat Jan 25 08​:15​:29 2014, demerphq wrote​:

This is fixed in 4b22688

If you wanted to create tests for the various bugs you are finding it
would be useful.

Was that directed at me?
If so, where can I find instructions for creating these tests?

@p5pRT
Copy link
Author

p5pRT commented Jan 28, 2014

From @demerphq

On 29 January 2014 00​:34, Sam S. via RT <perlbug-followup@​perl.org> wrote​:

On Sat Jan 25 08​:15​:29 2014, demerphq wrote​:

This is fixed in 4b22688

If you wanted to create tests for the various bugs you are finding it
would be useful.

Was that directed at me?

Yes. :-)

If so, where can I find instructions for creating these tests?

Er good question. A patch to the relevent test file (in this case
t/re/pat_re_eval.t) would be just fine.

You can see the patch I did to add a test for this here​:

http​://perl5.git.perl.org/perl.git/commitdiff/4b22688e5c51dea3e913d630e965389c5c029765

Cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Jan 29, 2014

From perl5-porters@perl.org

Yves Orton wrote​:

$ ./miniperl -Ilib -e '"" =~ /(??{undef *^R;""})(?{42})/; '

Umm, what cauldron of evil did you find that in? :-)

Recently I fixed various bugs relating to the use of GvSV with no null
check. So when I saw your commit (which also uses GvSV like that),
I tried to make it crash. But the crash I got happened for a differ-
ent reason.

(I still find the sv_setsv(GvSV(PL_replgv)...) suspicious, but I have
not figured out how to get a null GvSV to reach that code path. I do
not know the regexp engine well enough. Or maybe that code is fine.)

I fixed bugs related to this in the past, if you look at the comments
in 4b22688 there is a block that I expected to fire to handle
this problem but that didn't, and im not even sure if that block is
used anymore.

I tried disabling it and a test failed in t/re/pat.t, which comes down
to this printing undef instead of 42​:

"" =~ /(?{42})/;
print $^R//"undef", "\n";

So I think there is a reasonable argument to say that we should handle
this quite differently.
...
But until someone has the time to do this, assuming it can be done, I
think your patch will do just fine

Thank you. I have applied it as 7e68f15.

@p5pRT
Copy link
Author

p5pRT commented Jan 29, 2014

From @demerphq

On 29 January 2014 09​:59, Father Chrysostomos <sprout@​cpan.org> wrote​:

Yves Orton wrote​:

$ ./miniperl -Ilib -e '"" =~ /(??{undef *^R;""})(?{42})/; '

Umm, what cauldron of evil did you find that in? :-)

Recently I fixed various bugs relating to the use of GvSV with no null
check. So when I saw your commit (which also uses GvSV like that),
I tried to make it crash. But the crash I got happened for a differ-
ent reason.

Ah. Makes sense.

(I still find the sv_setsv(GvSV(PL_replgv)...) suspicious, but I have
not figured out how to get a null GvSV to reach that code path. I do
not know the regexp engine well enough. Or maybe that code is fine.)

Understood. Do you think its worth me digging a bit? (Note this stuff
is not my area of expertise)

I fixed bugs related to this in the past, if you look at the comments
in 4b22688 there is a block that I expected to fire to handle
this problem but that didn't, and im not even sure if that block is
used anymore.

I tried disabling it and a test failed in t/re/pat.t, which comes down
to this printing undef instead of 42​:

"" =~ /(?{42})/;
print $^R//"undef", "\n";

Thanks, I had been meaning to do that myself when I got a bit more time.

So I think there is a reasonable argument to say that we should handle
this quite differently.
...
But until someone has the time to do this, assuming it can be done, I
think your patch will do just fine

Thank you. I have applied it as 7e68f15.

And thank you for catching the bug. :-)

Cheers,
yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Jan 29, 2014

From perl5-porters@perl.org

Yves Orton wrote​:

On 29 January 2014 09​:59, Father Chrysostomos <sprout@​cpan.org> wrote​:

(I still find the sv_setsv(GvSV(PL_replgv)...) suspicious, but I have
not figured out how to get a null GvSV to reach that code path. I do
not know the regexp engine well enough. Or maybe that code is fine.)

Understood. Do you think its worth me digging a bit? (Note this stuff
is not my area of expertise)

It would be nice to know that the code is safe. I think I have fig-
ured it out partially. Localisation causes scalars to pop into exist-
ence (something I did not know till just now). save_scalar (in
scope.c) vivifies the scalar slot *before* localisation (using GvSVn).
So after localisation unwinds, where will be a scalar there where
there was not one before.

Is there any way that 'case EVAL_AB​:' can be reached without $^R
having been localised (i.e., so that regcpblow will not unwind a
save_scalar)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants