Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optional match of an anchor gets ignored #7231

Closed
p5pRT opened this issue Apr 13, 2004 · 5 comments
Closed

optional match of an anchor gets ignored #7231

p5pRT opened this issue Apr 13, 2004 · 5 comments

Comments

@p5pRT
Copy link

p5pRT commented Apr 13, 2004

Migrated from rt.perl.org#28532 (status was 'resolved')

Searchable as RT28532$

@p5pRT
Copy link
Author

p5pRT commented Apr 13, 2004

From zefram@fysh.org

Created by zefram@fysh.org

$ perl -we '"abc" =~ /\Aab(c\z)/; print defined($1) ? "yes\n" : "no\n"'
yes
$ perl -we '"abc" =~ /\Aab(c\z)?/; print defined($1) ? "yes\n" : "no\n"'
yes
$ perl -we '"abc" =~ /\Aabc(\z)/; print defined($1) ? "yes\n" : "no\n"'
yes
$ perl -we '"abc" =~ /\Aabc(\z)?/; print defined($1) ? "yes\n" : "no\n"'
no

I believe the behaviour of the last one, with /(\z)?/, is erroneous.
The /\z/ should match, the parens should capture an empty string, and
$1 should be defined.

In the code in which I discovered this behaviour I was trying to use a
submatch to test whether the /\z/ matched.

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl v5.8.3:

Configured by Debian Project at Sat Mar 27 17:07:14 EST 2004.

Summary of my perl5 (revision 5.0 version 8 subversion 3) configuration:
  Platform:
    osname=linux, osvers=2.4.25-ti1211, archname=i386-linux-thread-multi
    uname='linux kosh 2.4.25-ti1211 #1 thu feb 19 18:20:12 est 2004 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i386-linux -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.3 -Dsitearch=/usr/local/lib/perl/5.8.3 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.3 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O3',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='3.3.3 (Debian 20040314)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.3.2.so, so=so, useshrplib=true, libperl=libperl.so.5.8.3
    gnulibc_version='2.3.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    


@INC for perl v5.8.3:
    /etc/perl
    /usr/local/lib/perl/5.8.3
    /usr/local/share/perl/5.8.3
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.8
    /usr/share/perl/5.8
    /usr/local/lib/site_perl
    .


Environment for perl v5.8.3:
    HOME=/home/zefram
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/zefram/pub/i686-pc-linux-gnu/bin:/home/zefram/pub/common/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/local/bin:/usr/games:/opt/libunsn/bin
    PERL_BADLANG (unset)
    SHELL=/usr/bin/zsh

@p5pRT
Copy link
Author

p5pRT commented Apr 14, 2004

From @hvds

Zefram (via RT) <perlbug-followup@​perl.org> wrote​:
:$ perl -we '"abc" =~ /\Aabc(\z)?/; print defined($1) ? "yes\n" : "no\n"'
:no
:
:I believe the behaviour of the last one, with /(\z)?/, is erroneous.
:The /\z/ should match, the parens should capture an empty string, and
:$1 should be defined.

I can confirm this still acts the same in the latest development version
of perl, and I can't see any obvious reason to see it as a desirable
exception.

Tracing the behaviour, we end up calling regexec.c​:S_regrepeat_hard(),
which keeps trying for more while (among other things) it hasn't
reached the end of the string​:
  while (PL_reginput < loceol && ...)
.. and this means that _any_ optional zero-width capture at the end of
the string will fail in the same way.

Changing this to the obvious 'PL_reginput <= loceol' allows the above
test case to succeed, but causes a strange selection of 13 tests to fail​:
Failed Test Stat Wstat Total Fail Failed List of Failed


op/regexp.t 945 13 1.38% 402 404-406 412 414-415 426-
  427 933-934 941-942
op/regexp_noamp.t 945 13 1.38% 402 404-406 412 414-415 426-
  427 933-934 941-942

I'm not sure exactly why, but allowing the check at PL_reginput == loceol
to occur only once does allow all tests to pass, including the new ones.
However, this isn't entirely desirable​: regrepeat_hard() is a hot function,
and adding additional tests is undesirable if it can be avoided.

The alternative, as implemented in the patch below, is to avoid using
CURLYM altogether for zero-width matches​: since repeated zero-width
matches are rare, I believe this is a better trade.

A happy side-effect of this approach is that the existing convolutions
in the CURLYM handling to special-case zero-width matches could be
removed altogether, so this patch should also offer a small performance
improvement for non-zero-width repeats.

Hugo

Inline Patch
--- regcomp.c.old	Fri Oct 31 21:13:19 2003
+++ regcomp.c	Wed Apr 14 19:05:41 2004
@@ -1188,7 +1188,9 @@
 		if (  OP(oscan) == CURLYX && data
 		      && !(data->flags & SF_HAS_PAR)
 		      && !(data->flags & SF_HAS_EVAL)
-		      && !deltanext  ) {
+		      && !deltanext	/* atom is fixed width */
+		      && minnext != 0	/* CURLYM can't handle zero width */
+		) {
 		    /* XXXX How to optimize if data == 0? */
 		    /* Optimize to a simpler form.  */
 		    regnode *nxt = NEXTOPER(oscan) + EXTRA_STEP_2ARGS; /* OPEN */
--- regexec.c.old	Tue Mar 16 18:42:24 2004
+++ regexec.c	Wed Apr 14 19:19:29 2004
@@ -3367,7 +3367,7 @@
 	    CHECKPOINT lastcp;
 	
 	    /* We suppose that the next guy does not need
-	       backtracking: in particular, it is of constant length,
+	       backtracking: in particular, it is of constant non-zero length,
 	       and has no parenths to influence future backrefs. */
 	    ln = ARG1(scan);  /* min to match */
 	    n  = ARG2(scan);  /* max to match */
@@ -3386,15 +3386,6 @@
 		minmod = 0;
 		if (ln && regrepeat_hard(scan, ln, &l) < ln)
 		    sayNO;
-		/* if we matched something zero-length we don't need to
-		   backtrack - capturing parens are already defined, so
-		   the caveat in the maximal case doesn't apply
-
-		   XXXX if ln == 0, we can redo this check first time
-		   through the following loop
-		*/
-		if (ln && l == 0)
-		    n = ln;	/* don't backtrack */
 		locinput = PL_reginput;
 		if (HAS_TEXT(next) || JUMPABLE(next)) {
 		    regnode *text_node = next;
@@ -3420,8 +3411,7 @@
 		    c1 = c2 = -1000;
 	    assume_ok_MM:
 		REGCP_SET(lastcp);
-		/* This may be improved if l == 0.  */
-		while (n >= ln || (n == REG_INFTY && ln > 0 && l)) { /* ln overflow ? */
+		while (n >= ln || (n == REG_INFTY && ln > 0)) { /* ln overflow ? */
 		    /* If it could work, try it. */
 		    if (c1 == -1000 ||
 			UCHARAT(PL_reginput) == c1 ||
@@ -3452,13 +3442,6 @@
 	    }
 	    else {
 		n = regrepeat_hard(scan, n, &l);
-		/* if we matched something zero-length we don't need to
-		   backtrack, unless the minimum count is zero and we
-		   are capturing the result - in that case the capture
-		   being defined or not may affect later execution
-		*/
-		if (n != 0 && l == 0 && !(paren && ln == 0))
-		    ln = n;	/* don't backtrack */
 		locinput = PL_reginput;
 		DEBUG_r(
 		    PerlIO_printf(Perl_debug_log,
@@ -4263,7 +4246,7 @@
 /*
  - regrepeat_hard - repeatedly match something, report total lenth and length
  *
- * The repeater is supposed to have constant length.
+ * The repeater is supposed to have constant non-zero length.
  */
 
 STATIC I32
--- t/op/pat.t.old	Fri Apr  2 17:01:58 2004
+++ t/op/pat.t	Wed Apr 14 18:51:17 2004
@@ -6,7 +6,7 @@
 
 $| = 1;
 
-print "1..1063\n";
+print "1..1065\n";
 
 BEGIN {
     chdir 't' if -d 't';
@@ -3278,4 +3278,10 @@
 ok("a\cCbc" =~ /[^\cA-\cB]/, '\cC in negated character class range');
 ok("a\cAb" =~ /(??{"\cA"})/, '\cA in ??{} pattern');
 
-# last test 1063
+# perl #28532: optional zero-width match at end of string is ignored
+ok(("abc" =~ /^abc(\z)?/) && defined($1),
+    'optional zero-width match at end of string');
+ok(("abc" =~ /^abc(\z)??/) && !defined($1),
+    'optional zero-width match at end of string');
+
+# last test 1065

@p5pRT
Copy link
Author

p5pRT commented Apr 14, 2004

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Apr 19, 2004

From @rgs

hv@​crypt.org wrote​:
...

The alternative, as implemented in the patch below, is to avoid using
CURLYM altogether for zero-width matches​: since repeated zero-width
matches are rare, I believe this is a better trade.

A happy side-effect of this approach is that the existing convolutions
in the CURLYM handling to special-case zero-width matches could be
removed altogether, so this patch should also offer a small performance
improvement for non-zero-width repeats.

Thanks, applied as #22712 to bleadperl.

--- regcomp.c.old Fri Oct 31 21​:13​:19 2003
+++ regcomp.c Wed Apr 14 19​:05​:41 2004

@p5pRT
Copy link
Author

p5pRT commented Apr 19, 2004

@rgs - Status changed from 'open' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant