bug of regex lookaround assertions? #16894

Closed
p5pRT opened this issue Mar 18, 2019 · 8 comments

p5pRT commented Mar 18, 2019

Migrated from rt.perl.org#133940 (status was 'open')

Searchable as RT133940$

p5pRT commented Mar 18, 2019

From malincns@163.com

Created by malincns@163.com

This report is about capture groups.
I'm not sure whether this is a bug in Perl or intended behaviour.

string : "abab"
pattern: /(?:[^b]*(?=(b)|(a))ab)*/

print defined($1)?"\"$1\"":"undef",", ",defined($2)?"\"$2\"":"undef", if
"abab" =~ /(?:[^b]*(?=(b)|(a))ab)*/;

Output:
"b", "a"

Expected output:
undef, "a"

Output from several engines:

Perl 5.28.1       "b", "a"
PHP 7.3.2         NULL, "a"
Java 11.0.2       "b", "a" [1]
Ruby 2.6.1        nil, "a"
Go 1.12           [2]
Rust 1.32.0       [2]
Node.js 10.15.1   undef, "a"
Python 3.7.2      "b", "a" [3]

[1] This appears to be the following bug:
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7145888
[2] Does not support lookaround.
[3] This is a bug: https://bugs.python.org/issue35859

Another questionable case:

string : "ab"
pattern: /.*?(?=(a)|(b))b$/

print defined($1)?"\"$1\"":"undef",", ",defined($2)?"\"$2\"":"undef", if
"ab" =~ /.*?(?=(a)|(b))b$/;

Output:
"a", "b"

Expected output:
undef, "b"
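
The two one-liners above can be wrapped in a small helper so the cases are easier to compare side by side. This is only a sketch: the name show_captures is hypothetical and not part of the report, and what it prints for these two patterns depends on the perl version in use (the outputs quoted above are from 5.28.1).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original report): match $string against
# $pattern and format $1 and $2 the way the report prints them.
# Returns undef when the match fails.
sub show_captures {
    my ($string, $pattern) = @_;
    return undef unless $string =~ $pattern;
    return join ', ', map { defined $_ ? qq{"$_"} : 'undef' } ($1, $2);
}

# The two cases from the report; results are printed, not assumed.
for my $case (
    [ 'abab', qr/(?:[^b]*(?=(b)|(a))ab)*/ ],
    [ 'ab',   qr/.*?(?=(a)|(b))b$/ ],
) {
    my ($string, $pattern) = @$case;
    my $result = show_captures($string, $pattern);
    print "$string: ", defined $result ? $result : 'no match', "\n";
}
```

Passing the pattern as a qr// object keeps the helper reusable for any further string and pattern pair.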

Perl Info

Flags:
     category=core
     severity=low

Site configuration information for perl 5.28.1:

Configured by strawberry-perl at Sun Dec  2 14:25:09 2018.

Summary of my perl5 (revision 5 version 28 subversion 1) configuration:

   Platform:
     osname=MSWin32
     osvers=10.0.17134.407
     archname=MSWin32-x64-multi-thread
     uname='Win32 strawberry-perl 5.28.1.1 #1 Sun Dec  2 14:24:00 2018 x64'
     config_args='undef'
     hint=recommended
     useposix=true
     d_sigaction=undef
     useithreads=define
     usemultiplicity=define
     use64bitint=define
     use64bitall=undef
     uselongdouble=undef
     usemymalloc=n
     default_inc_excludes_dot=define
     bincompat5005=undef
   Compiler:
     cc='gcc'
     ccflags =' -s -O2 -DWIN32 -DWIN64 -DCONSERVATIVE -D__USE_MINGW_ANSI_STDIO -DPERL_TEXTMODE_SCRIPTS -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -fwrapv -fno-strict-aliasing -mms-bitfields'
     optimize='-s -O2'
     cppflags='-DWIN32'
     ccversion=''
     gccversion='7.1.0'
     gccosandvers=''
     intsize=4
     longsize=4
     ptrsize=8
     doublesize=8
     byteorder=12345678
     doublekind=3
     d_longlong=define
     longlongsize=8
     d_longdbl=define
     longdblsize=16
     longdblkind=3
     ivtype='long long'
     ivsize=8
     nvtype='double'
     nvsize=8
     Off_t='long long'
     lseeksize=8
     alignbytes=8
     prototype=define
   Linker and Libraries:
     ld='g++.exe'
     ldflags ='-s -L"D:\perl\perl\lib\CORE" -L"D:\perl\c\lib"'
     libpth=D:\perl\c\lib D:\perl\c\x86_64-w64-mingw32\lib D:\perl\c\lib\gcc\x86_64-w64-mingw32\7.1.0
     libs= -lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
     perllibs= -lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
     libc=
     so=dll
     useshrplib=true
     libperl=libperl528.a
     gnulibc_version=''
   Dynamic Linking:
     dlsrc=dl_win32.xs
     dlext=xs.dll
     d_dlsymun=undef
     ccdlflags=' '
     cccdlflags=' '
     lddlflags='-mdll -s -L"D:\perl\perl\lib\CORE" -L"D:\perl\c\lib"'



@INC for perl 5.28.1:
     D:/perl/perl/site/lib
     D:/perl/perl/vendor/lib
     D:/perl/perl/lib


Environment for perl 5.28.1:
     HOME (unset)
     LANG (unset)
     LANGUAGE (unset)
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     PATH=E:\Python37\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;E:\Program Files\git\bin;D:\Qt\5.9.7\mingw53_32\bin;E:\Program Files\Java\jdk-11.0.2"\bin;E:\Python37\Scripts\;E:\Program Files\TortoiseGit\bin;C:\Users\anima\.cargo\bin;C:\Users\anima\AppData\Local\Microsoft\WindowsApps;;E:\Program Files\Microsoft VS Code\bin
     PERL_BADLANG (unset)
     SHELL (unset)

p5pRT commented Mar 20, 2019

From @jkeenan

In each of these two cases, can you explain -- in your own words, and without reference to any other language's regex engines -- why you have come to the expectations you have?

Thank you very much.

--
James E Keenan (jkeenan@​cpan.org)

p5pRT commented Mar 20, 2019

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Mar 20, 2019

From malincns@163.com

Please see these two cases:

1.

print defined($1)?"\"$1\"":"undef",", ",defined($2)?"\"$2\"":"undef",
if "abc" =~ /.*?(?:(a)|(b))c/;

Output: undef, "b"

The pattern /(a)c/ cannot match, so $1 is undef.

2.

print defined($1)?"\"$1\"":"undef",", ",defined($2)?"\"$2\"":"undef",
if "ab" =~ /.*?(?=(a)|(b))b/;

Output: "a", "b"

The pattern /(?=(a))b/ cannot match either, but $1 is not undef.

This seems a bit inconsistent.

p5pRT commented Mar 20, 2019

From @jkeenan

I doubt there are bugs here, but understanding what is going on with respect to the interaction of positive-lookahead assertions and capture groups is very tricky. There are things I don't understand myself.

To be able to consider different cases, I've written a subroutine to handle each combination of string and pattern which you have mentioned in your two posts so far. Please see attached program 133940-regex.pl. When I run it, I get this output:

#####
Case: 1
string: abab
pattern: (?^u:(?:[^b]*(?=(b)|(a))ab)*)
pre-match: |match: abab|post-match:
$1: b
$2: a

Case: 2
string: ab
pattern: (?^u:.*?(?=(a)|(b))b$)
pre-match: |match: ab|post-match:
$1: a
$2: b

Case: 3
string: abc
pattern: (?^u:.*?(?:(a)|(b))c)
pre-match: |match: abc|post-match:
$1: undef
$2: b

Case: 4
string: abc
pattern: (?^u:(a)c)
string 'abc' did not match pattern '(?^u:(a)c)'

Case: 5
string: ab
pattern: (?^u:.*?(?=(a)|(b))b)
pre-match: |match: ab|post-match:
$1: a
$2: b

Case: 6
string: ab
pattern: (?^u:(?=(a))b)
string 'ab' did not match pattern '(?^u:(?=(a))b)'

#####

The things I myself don't understand are:

* In Case 3, if the string matches the pattern, and if that pattern includes captures, how can any element in this list of captures be undefined?

* In Case 6, why doesn't the string match the pattern?

Can someone on list clarify?
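
Both questions can be probed with a few one-liners. The sketch below is not taken from the thread or the attached program. It assumes only standard regex semantics: a lookahead consumes no characters, and a capture group in an alternative that did not take part in the match is left undefined. The variable names are illustrative.

```perl
use strict;
use warnings;

# Case 6: (?=(a)) succeeds at position 0 without consuming anything, so the
# literal b is then tried against that same 'a' and fails. At position 1
# the lookahead itself fails, so there is no match at all.
my $case6 = 'ab' =~ /(?=(a))b/ ? 'match' : 'no match';

# Letting the pattern consume the 'a' after the assertion makes it match.
my $consuming = 'ab' =~ /(?=(a))ab/ ? "match, \$1 = $1" : 'no match';

# Case 3 in miniature: a group in an alternative that did not take part in
# the match stays undefined even though the overall match succeeds.
my @groups = 'b' =~ /(a)|(b)/;
my $alternation = join ', ', map { defined $_ ? $_ : 'undef' } @groups;

print "$case6\n";        # no match
print "$consuming\n";    # match, $1 = a
print "$alternation\n";  # undef, b
```

So an undefined capture in a successful match is ordinary for alternations, and Case 6 fails because the assertion and the literal b compete for the same character.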

Thank you very much.

--
James E Keenan (jkeenan@​cpan.org)

p5pRT commented Mar 20, 2019

From @jkeenan

133940-regex.pl

p5pRT commented Mar 22, 2019

From @iabyn

On Wed, Mar 20, 2019 at 08:49:21AM -0700, James E Keenan via RT wrote:

This report is about capture group.
I'm not sure whether it's a bug of Perl, or an intended policy.

string : "abab"
pattern: /(?:[^b]*(?=(b)|(a))ab)*/

print defined($1)?"\"$1\"":"undef",", ",defined($2)?"\"$2\"":"undef",
if
"abab" =~ /(?:[^b]*(?=(b)|(a))ab)*/;

Output:
"b", "a"

Expected output:
undef, "a"

This looks to be a variant of

  [perl #133352] Ancient Regex Regression

--
It's not that I'm afraid to die, I just don't want to be there when it
happens.
  -- Woody Allen

toddr removed the khw label on Oct 25, 2019
khwilliamson changed the title from "bug of regex?" to "bug of regex lookaround assertions?" on Jan 2, 2020
xenu removed the affects-5.28 label on Nov 19, 2021
xenu removed the Severity Low label on Dec 29, 2021

khwilliamson commented

This has been fixed by

commit acababb
Author: Yves Orton <demerphq@gmail.com>
Date: Mon Jan 9 22:34:13 2023 +0100

regexec.c - teach BRANCH and BRANCHJ nodes to reset capture buffers

In /((a)(b)|(a))+/ we should not end up with $2 and $4 being set at
the same time. When a branch fails it should reset any capture buffers
that might be touched by its branch.
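
The pattern from the commit message can be checked directly. This is a minimal sketch: the sub name branch_captures and the input string 'aba' are assumptions chosen for illustration, not taken from the commit. On a perl built before the commit, $2 and $4 would both be expected to be set, which is the behaviour the commit message describes as wrong. The script prints what it finds and does not assume either result.

```perl
use strict;
use warnings;

# Match the pattern from the commit message and return copies of $1..$4,
# or an empty list if the match fails. The sub name is hypothetical.
sub branch_captures {
    my ($string) = @_;
    return unless $string =~ /((a)(b)|(a))+/;
    return ($1, $2, $3, $4);
}

# 'aba' is an assumed input: the first iteration takes the (a)(b) branch,
# the second iteration can only succeed through the (a) branch.
my @caps = branch_captures('aba');
for my $i (0 .. $#caps) {
    printf "\$%d: %s\n", $i + 1, defined $caps[$i] ? $caps[$i] : 'undef';
}
```

Running this under two perl builds, one from before and one from after the commit, shows whether the stale capture from the abandoned branch is still reported.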
