Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regexp behavior in matching backreferences to optional captures #2271

Closed
p5pRT opened this issue Jul 29, 2000 · 18 comments
Closed

Regexp behavior in matching backreferences to optional captures #2271

p5pRT opened this issue Jul 29, 2000 · 18 comments

Comments

@p5pRT
Copy link

p5pRT commented Jul 29, 2000

Migrated from rt.perl.org#3589 (status was 'open')

Searchable as RT3589$

@p5pRT
Copy link
Author

p5pRT commented Jul 29, 2000

From @vanstyn

Created by @vanstyn

crypt% perl -wle '/^((.)?a\2)+$/ and print "$_​: -$1-$2-" for qw/ babadad babadab bababdab bababdad /'
bababdad​: -dad-d-
crypt%

That's with 5.6.0; with bleadperl, none of the strings match.
I believe it should match 'babadad'.

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl v5.6.0:

Configured by hv at Tue Jul 18 23:29:44 BST 2000.

Summary of my perl5 (revision 5.0 version 6 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.2.5-16, archname=i686-linux
    uname='linux crypt.compulink.co.uk 2.2.5-16 #1 sun may 30 23:00:18 bst 1999 i686 unknown '
    config_args='-des -Doptimize=-g -O6 -Dprefix=/opt/perl-blead'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=undef d_sfio=undef uselargefiles=define 
    use64bitint=undef use64bitall=undef uselongdouble=undef usesocks=undef
  Compiler:
    cc='cc', optimize='-g -O6', gccversion=egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)
    cppflags='-DDEBUGGING -fno-strict-aliasing'
    ccflags ='-DDEBUGGING -fno-strict-aliasing -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    stdchar='char', d_stdstdio=define, usevfork=false
    intsize=4, longsize=4, ptrsize=4, doublesize=8
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lndbm -lgdbm -ldb -ldl -lm -lc -lposix -lcrypt
    libc=/lib/libc-2.1.1.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    bleadperl


@INC for perl v5.6.0:
    lib
    /opt/perl-blead/lib/5.6.0/i686-linux
    /opt/perl-blead/lib/5.6.0
    /opt/perl-blead/lib/site_perl/5.6.0/i686-linux
    /opt/perl-blead/lib/site_perl/5.6.0
    /opt/perl-blead/lib/site_perl
    .


Environment for perl v5.6.0:
    HOME=/home/hv
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/hv/bin:/usr/bin:/usr/local/bin:/bin:/usr/X11R6/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash


@p5pRT
Copy link
Author

p5pRT commented May 21, 2001

From @vanstyn

More​:
crypt% ./perl -Dr -wle '/^((.)?a\2)+$/ and print "$_​: -$1-$2-" for qw/ babadad /' 2>&1 | grep restoring
  restoring \1 to 0(0)..3
  restoring \2 to 0(1882390424)..1
crypt%

That odd number turns out to occur because PL_reg_start_tmp[paren]
is \x78455220 = " REx".

Hugo

@p5pRT
Copy link
Author

p5pRT commented Sep 17, 2004

From @smpeters

crypt% perl -wle '/^((.)?a\2)+$/ and print "$_​: -$1-$2-" for qw/
babadad babadab bababdab bababdad /'
bababdad​: -dad-d-
crypt%

That's with 5.6.0; with bleadperl, none of the strings match.
I believe it should match 'babadad'.

It appears that there is a problem with your regexp.

/^((.)?a\2)+$/

The "^" is saying start at the beginning of the string, and none of the
strings start with "dad". Also, the enclosed match must be three
characters long. Finally, the "+$" means can appear one or more times
before the end. So, this regexp is looking for a three character string
that is of the form XaX (X is any character) that appears one or more
times to the complete end of the string. That said, the following
one-liner demonstrates your regexp...

steve@​kirk sandbox $ perl -wle '/^((.)?a\2)+$/ and print "$_​: -$1-$2-"
for qw/ dad dadad daddad babadad babadab bababdab bababdad /'
dad​: -dad-d-
daddad​: -dad-d-

"dad" matches exactly as does "daddad". "dadad" does not match since
the "dad" is not repeated completely throughout the string from
beginning to end.

This problem does not appear to be a bug with Perl, but with the regular
expression.

@p5pRT
Copy link
Author

p5pRT commented Sep 17, 2004

@smpeters - Status changed from 'open' to 'rejected'

@p5pRT p5pRT closed this as completed Sep 17, 2004
@p5pRT
Copy link
Author

p5pRT commented Sep 17, 2004

From @hvds

"Steve Peters via RT" <perlbug-followup@​perl.org> wrote​:
:> crypt% perl -wle '/^((.)?a\2)+$/ and print "$_​: -$1-$2-" for qw/
:> babadad babadab bababdab bababdad /'
:> bababdad​: -dad-d-
:> crypt%
:>
:> That's with 5.6.0; with bleadperl, none of the strings match.
:> I believe it should match 'babadad'.
:>
:
:It appears that there is a problem with your regexp.
:
:/^((.)?a\2)+$/
:
:The "^" is saying start at the beginning of the string, and none of the
:strings start with "dad". Also, the enclosed match must be three
:characters long.

That depends on your definitions, and I think the problem here is that
the regexp explores behaviour that has never been explicitly defined.

If the '?' is moved inside the parens​:
  /^((.?)a\2)+$/
then expected behaviour is clear - match any string comprised entirely
of one or more consecutive occurrences of 'a' and/or 'XaX'.

As written though, it isn't clear what the following '\2' should match
when the (.)? does not capture - it could match an empty string (as it
does when the '?' is inside), or it could match the value it captured
the last time round (which is what it did in 5.6.0, hence the match
on "bababdad"), or it could refuse to match anything at all.

Behaviour from 5.6.1 onwards appears to take the last option​: when
trying a path in which the (.)? does not match, a subsequent attempt
to match \2 fails. I'm not sure whether this is explicitly documented
anywhere, nor whether it is tested. It feels like reasonable behaviour
though, since the other sensible behaviour is catered for by (.?), and
I'd class the original behaviour of 5.6.0 (retain the previous value
of $2) as a clear bug.

It is however slightly unfortunate, in that the alternative definition
makes it possible to match against captures that would only be defined
later​:
  perl -wle 'print "triangular" if ("x" x shift(@​ARGV)) =~ /^(\1x)*$/' 10

Such forms can always be modified to give the desire behaviour though,
and usually with only a small change; in the above case, for example,
you can use /^((^|\1)x)*$/.

Hugo

@p5pRT
Copy link
Author

p5pRT commented Sep 17, 2004

@smpeters - Status changed from 'rejected' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Sep 19, 2004

From @ysth

On Sat, Sep 18, 2004 at 12​:16​:05AM +0100, hv@​crypt.org wrote​:

As written though, it isn't clear what the following '\2' should match
when the (.)? does not capture - it could match an empty string (as it
does when the '?' is inside), or it could match the value it captured
the last time round (which is what it did in 5.6.0, hence the match
on "bababdad"), or it could refuse to match anything at all.

Behaviour from 5.6.1 onwards appears to take the last option​: when
trying a path in which the (.)? does not match, a subsequent attempt
to match \2 fails. I'm not sure whether this is explicitly documented
anywhere, nor whether it is tested. It feels like reasonable behaviour
though, since the other sensible behaviour is catered for by (.?), and
I'd class the original behaviour of 5.6.0 (retain the previous value
of $2) as a clear bug.

I second the reasonableness of this behaviour.

@p5pRT
Copy link
Author

p5pRT commented Sep 20, 2004

From @smpeters

On Sunday 19 September 2004 02​:18 pm, Yitzchak Scott-Thoennes wrote​:

On Sat, Sep 18, 2004 at 12​:16​:05AM +0100, hv@​crypt.org wrote​:

As written though, it isn't clear what the following '\2' should match
when the (.)? does not capture - it could match an empty string (as it
does when the '?' is inside), or it could match the value it captured
the last time round (which is what it did in 5.6.0, hence the match
on "bababdad"), or it could refuse to match anything at all.

Behaviour from 5.6.1 onwards appears to take the last option​: when
trying a path in which the (.)? does not match, a subsequent attempt
to match \2 fails. I'm not sure whether this is explicitly documented
anywhere, nor whether it is tested. It feels like reasonable behaviour
though, since the other sensible behaviour is catered for by (.?), and
I'd class the original behaviour of 5.6.0 (retain the previous value
of $2) as a clear bug.

I second the reasonableness of this behaviour.

With all this being said, is it agreed that the current behavior is correct,
but that we need test conditions and documentation to insure the continuation
of this behavior going forward?

Steve Peters
steve@​fisharerojo.org

@p5pRT
Copy link
Author

p5pRT commented Sep 22, 2004

From @hvds

Steve Peters <steve@​fisharerojo.org> wrote​:
:With all this being said, is it agreed that the current behavior is correct,
:but that we need test conditions and documentation to insure the continuation
:of this behavior going forward?

I think so, but I haven't actually checked the documentation and existing
tests to verify that there is currently a gap.

(I've also only just noticed that the original bug report came from me,
but luckily my previous response is no different than it would have been
with that knowledge. :)

I note that my subsequent comment is still true with latest bleadperl​:

zen% ./perl -Ilib -Dr -wle '/^((.)?a\2)+$/ and print "$_​: -$1-$2-" for qw/ babadad /' 2>&1 | grep restoring
  restoring \1 to 0(0)..3
  restoring \2 to 0(1882132400)..1
zen%

.. and I assume my analysis from then is also still true​:

:That odd number turns out to occur because PL_reg_start_tmp[paren]
:is \x78455220 = " REx".

Hugo

@p5pRT
Copy link
Author

p5pRT commented Jul 2, 2007

From @rurban

Trying to close old RT tickets...

Attached patch adds the missing test for this new behaviour.

t/op/re_tests​: test new behaviour for bug #3589 to close it.
Todo​: documentation. See
http​://rt.perl.org/rt3/Public/Bug/Display.html?id=3589
--
Reini Urban

@p5pRT
Copy link
Author

p5pRT commented Jul 2, 2007

From @rurban

pl-bug3589.patch
difforig perl-current/t/op/re_tests

2007-07-02  Reini Urban <rurban@x-ray.at>
	* t/op/re_tests: test correct new behaviour for bug #3589 to close it. 
	Todo: documentation. See http://rt.perl.org/rt3/Public/Bug/Display.html?id=3589

diff -ub perl-current/t/op/re_tests.orig perl-current/t/op/re_tests
--- perl-current/t/op/re_tests.orig	2007-06-18 15:14:31.000000000 +0000
+++ perl-current/t/op/re_tests	2007-07-02 21:50:56.430125000 +0000
@@ -261,6 +261,8 @@
 ((\3|b)\2(a)x)+	aaxabxbaxbbx	n	-	-
 ((\3|b)\2(a)x)+	aaaxabaxbaaxbbax	y	$&-$1-$2-$3	bbax-bbax-b-a
 ((\3|b)\2(a)){2,}	bbaababbabaaaaabbaaaabba	y	$&-$1-$2-$3	bbaaaabba-bba-b-a
+#Bug #3589 - up to perl-5.6.0 matches incorrectly, from 5.6.1 not anymore
+^((.)?a\2)+$	babadad	n	-	-
 (a)|(b)	b	y	$-[0]	0
 (a)|(b)	b	y	$+[0]	1
 (a)|(b)	b	y	x$-[1]	x

@p5pRT
Copy link
Author

p5pRT commented Jul 4, 2007

From @rgs

On 02/07/07, Reini Urban via RT <perlbug-followup@​perl.org> wrote​:

Trying to close old RT tickets...

Attached patch adds the missing test for this new behaviour.

Thanks, applied as #31530.

@p5pRT
Copy link
Author

p5pRT commented May 27, 2008

From module@renee-baecker.de

On Mi. 04. Jul. 2007, 06​:39​:37, rafael wrote​:

On 02/07/07, Reini Urban via RT <perlbug-followup@​perl.org> wrote​:

Trying to close old RT tickets...

Attached patch adds the missing test for this new behaviour.

Thanks, applied as #31530.

Attached is a patch for perlre.pod. If this patch is ok, this ticket can
be closed...

@p5pRT
Copy link
Author

p5pRT commented May 27, 2008

From module@renee-baecker.de

perlre_pod.patch
--- old/perl-5.10.0/pod/perlre.pod	2007-12-18 11:47:08.000000000 +0100
+++ perl-5.10.0/pod/perlre.pod	2008-05-27 10:33:47.000000000 +0200
@@ -599,6 +599,11 @@
 which makes it easier to write code that tests for a series of more
 specific cases and remembers the best match.
 
+B<NOTE>: If an optional match (e.g. C<(.)?>) does not match, the
+subsequent attempt to match the correspondent capture buffer fails.
+
+  'bababdad' =~ /^((.)?a\2)+$/
+
 B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
 C<$'> anywhere in the program, it has to provide them for every
 pattern match.  This may substantially slow your program.  Perl

@p5pRT
Copy link
Author

p5pRT commented May 27, 2008

From p5p@perl.wizbit.be

Quoting reneeb via RT <perlbug-followup@​perl.org>​:

On Mi. 04. Jul. 2007, 06​:39​:37, rafael wrote​:

On 02/07/07, Reini Urban via RT <perlbug-followup@​perl.org> wrote​:

Trying to close old RT tickets...

Attached patch adds the missing test for this new behaviour.

Thanks, applied as #31530.

Attached is a patch for perlre.pod. If this patch is ok, this ticket can
be closed...

I feel that the example should be clearer or it needs more explenation...

From the example I can't immediatly tell what is happening and/or
what it special about it that it deserves a <NOTE>. (I'm sure that
reading the bug report will clarify this but that shouldn't be a
requirement)

Kind regards,

Bram

@p5pRT
Copy link
Author

p5pRT commented Jul 25, 2010

From @gannett-ggreer

The remaining part of this ticket is clarifying in the documentation the
behavior of back-references to optional captures. For example​:

./perl -Ilib -wle '/^((.)?a\2)+$/ and print "$_​: -$1-$2-" for qw/
babadad /'

does not match, because the "\2" always fails (rather than allowing an
empty match) if the "." didn't match. However​:

./perl -Ilib -wle '/^((.?)a\2)+$/ and print "$_​: -$1-$2-" for qw/
babadad /'

does match, because an empty result is valid in the capture.

--
George Greer

@p5pRT
Copy link
Author

p5pRT commented Dec 28, 2016

From @jkeenan

On Sun, 25 Jul 2010 19​:21​:33 GMT, greerga wrote​:

The remaining part of this ticket is clarifying in the documentation the
behavior of back-references to optional captures. For example​:

./perl -Ilib -wle '/^((.)?a\2)+$/ and print "$_​: -$1-$2-" for qw/
babadad /'

does not match, because the "\2" always fails (rather than allowing an
empty match) if the "." didn't match. However​:

./perl -Ilib -wle '/^((.?)a\2)+$/ and print "$_​: -$1-$2-" for qw/
babadad /'

does match, because an empty result is valid in the capture.

I believe this ticket is closable because the additional documentation needed was, in my reading, added by Karl Williamson in commit d8b950d in June 2010. 'pod/perlre.pod' has a section on Capture Groups which contains this​:

#####
The bracketing construct "( ... )" creates capture groups
(also referred to as capture buffers). To refer to the
current contents of a group later on, within the same
pattern, use "\g1" (or "\g{1}") for the first, "\g2" (or
"\g{2}") for the second, and so on. This is called a
*backreference*. There is no limit to the number of captured
substrings that you may use. Groups are numbered with the
leftmost open parenthesis being number 1, etc. If a group
did not match, the associated backreference won't match
either. (This can happen if the group is optional, or in a
different branch of an alternation.)
#####

I'll put this ticket out of its misery within 7 days unless someone wants to carry on the discussion.

Thank you very much.

--
James E Keenan (jkeenan@​cpan.org)

@p5pRT
Copy link
Author

p5pRT commented Dec 28, 2016

From @demerphq

On 28 December 2016 at 03​:15, James E Keenan via RT
<perlbug-followup@​perl.org> wrote​:

On Sun, 25 Jul 2010 19​:21​:33 GMT, greerga wrote​:

The remaining part of this ticket is clarifying in the documentation the
behavior of back-references to optional captures. For example​:

./perl -Ilib -wle '/^((.)?a\2)+$/ and print "$_​: -$1-$2-" for qw/
babadad /'

does not match, because the "\2" always fails (rather than allowing an
empty match) if the "." didn't match. However​:

./perl -Ilib -wle '/^((.?)a\2)+$/ and print "$_​: -$1-$2-" for qw/
babadad /'

does match, because an empty result is valid in the capture.

I believe this ticket is closable because the additional documentation needed was, in my reading, added by Karl Williamson in commit d8b950d in June 2010. 'pod/perlre.pod' has a section on Capture Groups which contains this​:

#####
The bracketing construct "( ... )" creates capture groups
(also referred to as capture buffers). To refer to the
current contents of a group later on, within the same
pattern, use "\g1" (or "\g{1}") for the first, "\g2" (or
"\g{2}") for the second, and so on. This is called a
*backreference*. There is no limit to the number of captured
substrings that you may use. Groups are numbered with the
leftmost open parenthesis being number 1, etc. If a group
did not match, the associated backreference won't match
either. (This can happen if the group is optional, or in a
different branch of an alternation.)
#####

I think that should be beefed up a bit to say something like

Note that a quantified group which matches zero things counts as a
"non match", so none of the following will match​:

"aba"=/a(x)*b\1a/
"aba"=
/a(x)?b\1a/

However when the quantifier is on the inside of the group it will​:

"aba"=/a(x*)b\1a/
"aba"=
/a(x?)b\1a/

Put another way, perl's regex engine distinguishes between "captured
the empty string" and "an optional capture did not match".

You can see this in the following code​:

$ perl -le'if ("aba"=~/a(x*)b/) { print "matched​: ", defined($1) ?
"<$1>" : "undef" } else { print "rejected" }'
matched​: <>

$ perl -le'if ("aba"=~/a(x)*b/) { print "matched​: ", defined($1) ?
"<$1>" : "undef" } else { print "rejected" }'
matched​: undef

Both patterns match, but leave the contents of the $1 capture group in
a different state. Back references only match when the capture group
they reference contained a defined value.

Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant