Regex weirdness - capture group not reset #15350

Closed
p5pRT opened this issue May 22, 2016 · 6 comments

p5pRT commented May 22, 2016

Migrated from rt.perl.org#128215 (status was 'rejected')

Searchable as RT128215$

p5pRT commented May 22, 2016

From choroba@matfyz.cz

Created by choroba@matfyz.cz

The following regex matches the string abA:

  perl -lwe 'print for shift =~ /^(([ab])|([ab]))*(\3)$/i' abA
  b
  b
  a
  A

The first "a" was matched by a previous iteration of the * (probably
not the first one), "b" is then matched by group 2, but \3 still holds
"a" from that previous iteration. I'd think it should be reset.

See also http://stackoverflow.com/a/37379672/1030675

Thanks Lukas Mai for helping me debug it in #perl at freenode.

E. Choroba

Perl Info

Flags:
     category=core
     severity=medium

This perlbug was built using Perl 5.20.1 - Fri Mar 11 09:59:22 UTC 2016
It is being executed now by  Perl 5.20.1 - Fri Mar 11 09:56:51 UTC 2016.

Site configuration information for perl 5.20.1:

Configured by abuild at Fri Mar 11 09:56:51 UTC 2016.

Summary of my perl5 (revision 5 version 20 subversion 1) configuration:

   Platform:
     osname=linux, osvers=3.16.7-35-default, archname=x86_64-linux-thread-multi
     uname='linux cloud121 3.16.7-35-default #1 smp sun feb 7 17:32:21 utc 2016 (832c776) x86_64 x86_64 x86_64 gnulinux '
     config_args='-ds -e -Dprefix=/usr -Dvendorprefix=/usr -Dinstallusrbinperl -Dusethreads -Di_db -Di_dbm -Di_ndbm -Di_gdbm -Dd_dbm_open -Duseshrplib=true -Doptimize=-fmessage-length=0 -grecord-gcc-switches -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -g -Wall -pipe -Accflags=-DPERL_USE_SAFE_PUTENV -Dotherlibdirs=/usr/lib/perl5/site_perl -Dinc_version_list=5.20.0/x86_64-linux-thread-multi 5.20.0'
     hint=recommended, useposix=true, d_sigaction=define
     useithreads=define, usemultiplicity=define
     use64bitint=define, use64bitall=define, uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DPERL_USE_SAFE_PUTENV -fwrapv -fno-strict-aliasing -pipe -fstack-protector -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
     optimize='-fmessage-length=0 -grecord-gcc-switches -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -g -Wall -pipe',
     cppflags='-D_REENTRANT -D_GNU_SOURCE -DPERL_USE_SAFE_PUTENV -fwrapv -fno-strict-aliasing -pipe -fstack-protector'
     ccversion='', gccversion='4.8.3 20140627 [gcc-4_8-branch revision 212064]', gccosandvers=''
     intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
     ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
     alignbytes=8, prototype=define
   Linker and Libraries:
     ld='cc', ldflags =' -L/usr/local/lib64 -fstack-protector'
     libpth=/usr/local/lib /usr/lib64/gcc/x86_64-suse-linux/4.8/include-fixed /usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/lib /usr/lib /lib/../lib64 /usr/lib/../lib64 /lib /lib64 /usr/lib64 /usr/local/lib64
     libs=-lm -ldl -lcrypt -lpthread
     perllibs=-lm -ldl -lcrypt -lpthread
     libc=/lib64/libc-2.19.so, so=so, useshrplib=true, libperl=libperl.so
     gnulibc_version='2.19'
   Dynamic Linking:
     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.20.1/x86_64-linux-thread-multi/CORE'
     cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib64 -fstack-protector'



@INC for perl 5.20.1:
     /home/choroba/perl5/lib/perl5/x86_64-linux-thread-multi
     /home/choroba/perl5/lib/perl5/x86_64-linux-thread-multi
     /home/choroba/perl5/lib/perl5
     /usr/lib/perl5/site_perl/5.20.1/x86_64-linux-thread-multi
     /usr/lib/perl5/site_perl/5.20.1
     /usr/lib/perl5/vendor_perl/5.20.1/x86_64-linux-thread-multi
     /usr/lib/perl5/vendor_perl/5.20.1
     /usr/lib/perl5/5.20.1/x86_64-linux-thread-multi
     /usr/lib/perl5/5.20.1
     /usr/lib/perl5/site_perl
     .


Environment for perl 5.20.1:
     HOME=/home/choroba
     LANG=en_US.UTF-8
     LANGUAGE=en_US.utf8
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     PATH=/home/choroba/perl5/bin:/home/choroba/bin:/bin:/sbin:/usr/bin:/usr/local/bin:/usr/games:/usr/X11R6/bin:/opt/gnome/bin:.
     PERL5LIB=/home/choroba/perl5/lib/perl5/x86_64-linux-thread-multi:/home/choroba/perl5/lib/perl5
     PERL_BADLANG (unset)
     PERL_LOCAL_LIB_ROOT=/home/choroba/perl5
     PERL_MB_OPT=--install_base /home/choroba/perl5
     PERL_MM_OPT=INSTALL_BASE=/home/choroba/perl5
     SHELL=/bin/bash

p5pRT commented May 23, 2016

From ambrus@math.bme.hu

On Sun May 22 16:15:17 2016, choroba@matfyz.cz wrote:

The following regex matches the string abA:

perl -lwe 'print for shift =~ /^(([ab])|([ab]))*(\3)$/i' abA
b
b
a
A

That looks correct to me. The regex can match only one way: the first character is matched by the right-hand alternative, the second character is matched by the left-hand alternative, and the third character is matched by the backreference. As a result, $3 is set from the first character because the right alternative doesn't match later; $1 and $2 are both set from the second character because that's the last time those groups matched; and $4 is set from the last character.
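
One way to check which character each group ended up holding is to print the match offsets alongside the captured text. This is only a sketch, not something posted in the thread: it reuses the regex and input from the report, and the names `$str` and `@spans` are made up for illustration. It relies on Perl's `@-` and `@+` arrays, which hold the start and end offsets of each group after a successful match.

```perl
use strict;
use warnings;

my $str = 'abA';    # input string from the original report
my @spans;          # hypothetical name: one description per capture group

if ($str =~ /^(([ab])|([ab]))*(\3)$/i) {
    # $-[$n] and $+[$n] are the start and end offsets of group $n
    # in the last successful match.
    for my $n (1 .. $#+) {
        push @spans, sprintf '$%d = "%s" at offset %d',
            $n, substr($str, $-[$n], $+[$n] - $-[$n]), $-[$n];
    }
}

print "$_\n" for @spans;
# prints:
# $1 = "b" at offset 1
# $2 = "b" at offset 1
# $3 = "a" at offset 0
# $4 = "A" at offset 2
```

The offset 0 reported for $3 shows that its value comes from an earlier iteration of the `*` than the one that set $1 and $2.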

p5pRT commented May 23, 2016

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented May 23, 2016

From choroba@matfyz.cz

The problem (as shown in the link) is that when replacing \3 with \2, the regex no longer matches the string, but it still does in egrep.

To get more details, you can use

  use feature qw{ say };
  no warnings 'uninitialized';
  say for 'abA' =~ /^( (?{warn "* [$1,$2,$3]\n"})
  ([ab]) (?{warn "\tL [$1,$2,$3]\n"})
  |
  ([ab]) (?{warn "\t\tR [$1,$2,$3]\n"})
  )*
  ( (?{ warn "\t\t\t\\3 [$1,$2,$3]\n"}) \2 )
  $/xi;

p5pRT commented Jun 21, 2016

From @iabyn

On Mon, May 23, 2016 at 04:35:58AM -0700, Zsban Ambrus via RT wrote:

On Sun May 22 16:15:17 2016, choroba@matfyz.cz wrote:

The following regex matches the string abA:

perl -lwe 'print for shift =~ /^(([ab])|([ab]))*(\3)$/i' abA
b
b
a
A

That looks correct to me. The regex can match only one way: the first
character is matched by the right-hand alternative, the second
character is matched by the left-hand alternative, and the third
character is matched by the backreference. As a result, $3 is set from
the first character because the right alternative doesn't match later; $1
and $2 are both set from the second character because that's the last
time those groups matched; and $4 is set from the last character.

Yes, successful captures are intentionally kept across subsequent
iterations of quantifiers, until overwritten.

e.g. this:

"abcde" =~ /(?: (?: ([acd]) | ([bde]) ) (?{ print "1=[$1] 2=[$2]\n" }) )+/x;

outputs:

  1=[a] 2=[]
  1=[a] 2=[b]
  1=[c] 2=[b]
  1=[d] 2=[b]
  1=[d] 2=[e]
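
For contrast, here is a sketch of what happens when each "iteration" is a separate match rather than a pass of a quantifier. Nobody in the thread proposed this, and whether it suits the original use case is an assumption. It walks the same string with `/g` in a `while` loop, so a group that takes no part in the current match is undef instead of keeping an earlier value. The names `$str` and `@rows` are illustrative.

```perl
use strict;
use warnings;

my $str = 'abcde';   # same input as the example above
my @rows;            # hypothetical name: one line per match

# Each pass of the loop is a separate successful match, so a group
# that did not participate in it is undef rather than left over.
while ($str =~ /([acd])|([bde])/g) {
    push @rows, sprintf '1=[%s] 2=[%s]', $1 // '', $2 // '';
}

print "$_\n" for @rows;
# prints:
# 1=[a] 2=[]
# 1=[] 2=[b]
# 1=[c] 2=[]
# 1=[d] 2=[]
# 1=[] 2=[e]
```

This trades the single anchored match for a loop, so it is not a drop-in replacement for a pattern that needs a backreference after the repeated group.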

Closing.

--
Wesley Crusher gets beaten up by his classmates for being a smarmy git,
and consequently has a go at making some friends of his own age for a
change.
  -- Things That Never Happen in "Star Trek" #18

p5pRT commented Jun 21, 2016

@iabyn - Status changed from 'open' to 'rejected'
