5.10: different behaviour of duplicated named capturing parens with/without (?|...) #9130

Open
p5pRT opened this issue Nov 24, 2007 · 7 comments

p5pRT commented Nov 24, 2007

Migrated from rt.perl.org#47762 (status was 'open')

Searchable as RT47762$

p5pRT commented Nov 24, 2007

From andy@shitov.ru

This is a bug report for perl from andy@shitov.ru,
generated with the help of perlbug 1.36 running under perl 5.8.8.


The resetting parens (?| ... ) in regexps work incorrectly when used
together with named captures where two of them have the same name
but are in different branches of "|".

Here is an example:

use feature 'say';

my $re = qr/
  (?|
  (?<year>\d{4}) (\d\d) (\d\d)
  |
  (\w+), \s* (?<year>\d{4})
  )
  /x;

'20071122' =~ $re;
#say "$1 / $2 / $3";
say $+{year}; # prints '2007'

'November, 2007' =~ $re;
#say "$1 / $2";
say $+{year}; # prints 'November'

It looks like $+{year} in the second match returns $1 when the first
occurrence of ?<year> is in the first capturing parens (), $2 when
?<year> is moved into the second, and $3 (which is empty) when testing

(\d{4}) (\d\d) (?<year>\d\d)

When (?| ... ) is removed, the behaviour changes and the programme
prints '2007' twice as expected.
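
For comparison, here is a minimal sketch of the variant without (?| ... ), which the report says prints '2007' both times. The helper name year_of is hypothetical and only wraps the match for illustration; the pattern is otherwise the one from the example above.

```perl
use strict;
use warnings;
use feature 'say';

# Same two alternatives as in the report, but without the (?| ... )
# wrapper, so each (?<year>...) group keeps its own buffer number.
my $re = qr/
    (?<year>\d{4}) (\d\d) (\d\d)
  |
    (\w+), \s* (?<year>\d{4})
/x;

# Hypothetical helper: return the named capture 'year', or undef
# when the input does not match at all.
sub year_of {
    my ($input) = @_;
    return $input =~ $re ? $+{year} : undef;
}

say year_of($_) for '20071122', 'November, 2007';
```

Since the two groups named year no longer share a buffer with an unrelated group, $+{year} picks the one that actually took part in the match.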



Flags:
  category=core
  severity=low


This perlbug was built using Perl 5.10.0 - Thu Nov 22 14:37:24 2007
It is being executed now by Perl 5.8.8 - Tue Aug 29 12:39:43 2006.

Site configuration information for perl 5.8.8:

Configured by SYSTEM at Tue Aug 29 12:39:43 2006.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
  osname=MSWin32, osvers=5.0, archname=MSWin32-x86-multi-thread
  uname=''
  config_args='undef'
  hint=recommended, useposix=true, d_sigaction=undef
  usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
  useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
  use64bitint=undef use64bitall=undef uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler:
  cc='gcc', ccflags ='-DNDEBUG -DWIN32 -D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DNO_HASH_SEED -DUSE_SITECUSTOMIZE -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -DPERL_MSVCRT_READFIX -DHASATTRIBUTE -fno-strict-aliasing',
  optimize='-O2',
  cppflags='-DWIN32'
  ccversion='12.00.8804', gccversion='3.4.5 (mingw special)', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=10
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='__int64', lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries:
  ld='gcc', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf -libpath:"C:\Perl\lib\CORE" -machine:x86'
  libpth=\lib
  libs=-lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lmsvcrt
  perllibs=-lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lmsvcrt
  libc=msvcrt.lib, so=dll, useshrplib=yes, libperl=perl58.lib
  gnulibc_version=''
  Dynamic Linking:
  dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
  cccdlflags=' ', lddlflags='-mdll -L"C:\Perl\lib\CORE"'

Locally applied patches:
  ACTIVEPERL_LOCAL_PATCHES_ENTRY
  perl-current@32448


@INC for perl 5.8.8:
  C:/Perl/site/lib
  C:/Perl/lib
  .


Environment for perl 5.8.8:
  HOME (unset)
  LANG (unset)
  LANGUAGE (unset)
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=C:\Perl510\site\bin;C:\Perl510\bin;c:\program files\imagemagick-6.3.4-q16;C:\Program Files\ActiveState Komodo Edit 4.0\;C:\Python\.;C:\Perl\bin\;C:\WINDOWS\SYSTEM32;C:\WINDOWS;C:\WINDOWS\SYSTEM32\WBEM;C:\PROGRAM FILES\ATI TECHNOLOGIES\ATI.ACE\;c:\ghc\bin;c:\ghc\ghc-6.6;C:\Program Files\Intel\IDB\9.1\IA32\Script;C:\Program Files\Intel\Compiler\C++\9.1\EM64T\Bin;c:\mingw\bin;C:\ghc\ghc-6.4.2\gcc-lib;c:\pugs;C:\Program Files\Common Files\Adobe\AGL;c:\Program Files\Microsoft SQL Server\90\Tools\binn\;C:\Program Files\Java\jdk1.5.0_07\bin;C:\Program Files\MySQL\MySQL Server 5.0\bin;c:\usr\bin;d:\usr\local\bin;c:\parser\;C:\Program Files\QuickTime\QTSystem\;C:\Program Files\Haskell\bin;C:\ghc\bin;
  PERL_BADLANG (unset)
  SHELL (unset)

p5pRT commented Nov 24, 2007

From @rgs

On 24/11/2007, via RT Andrew wrote:

> The resetting parens (?| ... ) in regexps work incorrectly when used
> together with named captures where two of them have the same name
> but are in different branches of "|".

This doesn't look very difficult to fix, but for 5.10.0, since we're
in RC mode, it's probably better to document this as a known limitation.

p5pRT commented Nov 24, 2007

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Nov 24, 2007

From @demerphq

On Nov 24, 2007 11:01 AM, Rafael Garcia-Suarez <rgarciasuarez@gmail.com> wrote:

> On 24/11/2007, via RT Andrew wrote:
>
> > The resetting parens (?| ... ) in regexps work incorrectly when used
> > together with named captures where two of them have the same name
> > but are in different branches of "|".
>
> This doesn't look very difficult to fix, but for 5.10.0, since we're
> in RC mode, it's probably better to document this as a known limitation.

I concur, and actually off the top of my head I can't see an easy solution.

Named captures are implemented as maps to one or more numbered capture
buffers. Capture buffers in general work by hard-mapping an
open/close regop to a given buffer number, normally a 1:1 mapping,
except that branch reset bends the rule so that multiple regops point
at the same buffer. Thus the two are to a certain extent mutually
exclusive.

  OPEN1 ---> buffer index
  NAME1 ---> buffer index

So what happens is that with the two combined you get something like:

  open1
...
  open2
...
  open1
...
  open2

year->1,2

Thus in the "incorrect case", if the OP dumps $-{year} he should get
['November','2007'].

Without the branch reset the program ends up looking like:

  open1
...
  open2
....
  open3
....
  open4

year->1,3

And thus $+{year} will have the correct value.
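
The mapping described above can be sketched as a toy model. This is not the regex engine's code, just an illustration that assumes $+{name} returns the leftmost defined buffer among those registered under the name, as perlvar documents for %+. The helper lookup_named and the hand-built buffer lists are hypothetical. Buffer numbers follow the example regex from the report (year -> 1,2 with branch reset, year -> 1,5 without), not the abbreviated listing.

```perl
use strict;
use warnings;
use feature 'say';
use List::Util 'first';

# Toy model: a name maps to a list of buffer numbers, and a lookup
# returns the first of those buffers that holds a defined value.
sub lookup_named {
    my ($map, $buffers, $name) = @_;
    my $idx = first { defined $buffers->[$_] } @{ $map->{$name} || [] };
    return defined $idx ? $buffers->[$idx] : undef;
}

# 'November, 2007' with branch reset: (\w+) and (?<year>...) of the
# second branch reuse buffers 1 and 2, and 'year' is mapped to both.
my %reset_map  = (year => [1, 2]);
my @reset_bufs = (undef, 'November', '2007');
say lookup_named(\%reset_map, \@reset_bufs, 'year');   # prints "November"

# Same input without branch reset: the second branch uses buffers 4
# and 5, and buffers 1 to 3 stay undefined.
my %plain_map  = (year => [1, 5]);
my @plain_bufs = (undef, undef, undef, undef, 'November', '2007');
say lookup_named(\%plain_map, \@plain_bufs, 'year');   # prints "2007"
```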

So I'd say we document this mixture as "probably best avoided, but if
you can't, consider using $-{year} as a workaround."
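
A rough sketch of that workaround, assuming only the documented behaviour of %- (each value is an array reference holding every buffer registered under the name). The helper all_named is hypothetical. What it prints depends on the perl version, so no output is shown here; on affected versions the second line would be expected to list both 'November' and '2007', per the explanation above.

```perl
use strict;
use warnings;
use feature 'say';

# The pattern from the original report, branch reset included.
my $re = qr/
  (?|
      (?<year>\d{4}) (\d\d) (\d\d)
    |
      (\w+), \s* (?<year>\d{4})
  )
/x;

# Hypothetical helper: return a copy of every buffer registered under
# $name for this match, or an empty list if there is no match.
sub all_named {
    my ($input, $name) = @_;
    return unless $input =~ $re;
    return @{ $-{$name} || [] };
}

for my $input ('20071122', 'November, 2007') {
    # Unset buffers come back as undef, so print them explicitly.
    say "$input: ", join ', ', map { $_ // 'undef' } all_named($input, 'year');
}
```

The caller still has to decide which entry is the wanted one, for instance by checking which value looks like a four-digit year.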

IMO it's most definitely not something that can be changed in time for release.

yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented Dec 27, 2012

From @bulk88

5.10 is out of support now. Does anything in this bug still apply today
or can it be closed?

This bug also lists Win32 as the OS due to the perlbug data, but it's not
Win32-specific (the reason I found it).

--
bulk88 ~ bulk88 at hotmail.com

p5pRT commented Dec 29, 2012

From @demerphq

This bug is still around in blead. It's close to, but not quite, a
"won't-fix" because fixing it requires rewriting a lot of code.

It has nothing to do with Win32.

Please feel free to assign the bug to me.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@demerphq
Collaborator

This should be fixed by #20653
