
/n regexp modifier and backreferences to previous groups #15199

Open
p5pRT opened this issue Feb 26, 2016 · 18 comments

Comments

@p5pRT

p5pRT commented Feb 26, 2016

Migrated from rt.perl.org#127617 (status was 'open')

Searchable as RT127617$

@p5pRT

p5pRT commented Feb 26, 2016

From @epa

Created by @epa

The /n regexp modifier, according to perlvar, 'will stop $1, $2,
etc... from being filled in'. However it has another behaviour which
is not documented, and in my opinion, is not helpful. It also stops
the group from being referenced by (?-1) and similar within the same
regexp.

So for example, with the current behaviour​:

% perl -E '$_ = "aa"; /(a)(?-1)/ or die; say $1 // "undef"'
a
% perl -E '$_ = "aa"; /(a)(?-1)/n or die; say $1 // "undef"'
Reference to nonexistent group in regex...

This applies too if the modifier is set within a part of the regexp​:

% perl -E '$_ = "aa"; /(?n:(a)(?-1))/ or die; say $1 // "undef"'
Reference to nonexistent group in regex...

I would prefer it to still allow referring to the group within the
regexp itself, even if the external effect of setting $1, etc does not
happen. So my preferred behaviour would be

% perl -E '$_ = "aa"; /(?n:(a)(?-1))/ or die; say $1 // "undef"'
undef

Although this would be a change to the current semantics, it is more
closely in line with what perlvar currently documents, so might be
considered more of a bug fix than an incompatible change.
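The current behaviour can be checked from a script without aborting it by compiling each pattern inside eval. This is only a minimal sketch: the helper name `compiles` is made up for illustration, and the expected result assumes the behaviour reported above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the pattern source compiles, 0 if
# compilation dies (for example with "Reference to nonexistent group").
sub compiles {
    my ($src) = @_;
    my $re = eval { qr/$src/ };
    return defined $re ? 1 : 0;
}

my $plain_ok = compiles('(a)(?-1)');        # ordinary capturing group
my $n_ok     = compiles('(?n:(a)(?-1))');   # same group under (?n:...)

print "plain: $plain_ok, under n: $n_ok\n";
```

With the behaviour described in this report, the first pattern compiles and the second does not.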

Now I will give a bit of background about why this would be useful.
Suppose I have a regular expression matching a simple regular
language. Strings in the language are sequences of one or more 'a'.

  $lang_re = qr/a+/;

I may define this regexp in a library and then use it in client code
which matches a string in the language followed by a digit​:

  /\A ($lang_re) ([0-9]) \z/x or die;
  my ($lang_str, $digit) = ($1, $2);

Now suppose I change the definition of the language so that valid
strings are now either a sequence of 'a' as before, or <X> where
X is a valid string.

  $lang_re = qr/ ( a+ | < (?-1) > ) /x;

(For this trivially simple language there may be other ways to do it
but in general a recursively defined language requires recursive
subpatterns in the regexp.)
 
The modified $lang_re works but now it has a side effect of setting a
capturing group. The existing client code that expected to include
$lang_re in a larger regexp and then get ($1, $2) will be broken by
this change.
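The breakage can be reproduced with a short script. This is a sketch only: the sub name `split_input` and the two variable names are invented here so that both definitions of the language can be compared side by side.

```perl
use strict;
use warnings;

my $old_lang_re = qr/a+/;
my $new_lang_re = qr/ ( a+ | < (?-1) > ) /x;

# Client code written against the old definition: it expects the
# language string in $1 and the digit in $2.
sub split_input {
    my ($lang_re, $input) = @_;
    return $input =~ /\A ($lang_re) ([0-9]) \z/x ? ($1, $2) : ();
}

my @with_old = split_input($old_lang_re, 'aaa7');
my @with_new = split_input($new_lang_re, 'aaa7');

print "old: @with_old\n";   # language string, then digit
print "new: @with_new\n";   # $2 is now the library's internal group
```

With the new definition the digit ends up in $3, so the client receives the language string twice.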

To avoid adding a new externally visible capturing group I would like
to use the /n modifier​:

  $lang_re = qr/ (?n: ( a+ | < (?-1) > ) ) /x;

The intention is that while $lang_re may use a recursive subpattern
internally, it does not expose a new capturing group to the outside
world. So it can be used as a building block in a larger pattern
without bumping around the $1,$2,$3 results whenever the
implementation of $lang_re changes.

Although using named captures everywhere mitigates the problem it does
not solve it, since of course there is no guarantee that the names of
capturing groups will be globally unique. And of course if $lang_re
is provided by a regexp library, the library author cannot know that
all client code is always using named captures rather than $1,$2,$3.
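To illustrate the mitigation (not a fix), here is a sketch of client code that reads its own capture by name. The sub name `trailing_number` and the group name `num` are hypothetical; as noted above, nothing stops a library from using the same name internally.

```perl
use strict;
use warnings;

my $old_library_re = qr/a+/;
my $new_library_re = qr/(a+|<(?-1)>)/;

# The client reads its capture through %+ rather than by number, so a
# new numbered group inside the library pattern does not shift it.
sub trailing_number {
    my ($library_re, $input) = @_;
    return $input =~ /\A$library_re(?<num>\d+)\z/ ? $+{num} : undef;
}

my $from_old = trailing_number($old_library_re, 'aa42');
my $from_new = trailing_number($new_library_re, '<aa>42');

print "old: $from_old, new: $from_new\n";
```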

I think that changing the semantics of /n, so that it stops
*capturing*, but still allows the group to be referenced with
recursive subpatterns, would make it much more useful and would more
closely match the documentation.

(There may also be room for a regexp modifier X which hides groups
from recursive subpattern matches *outside* the (?​:X ... ) but allows
them to be visible *inside*. This would be a further improvement to
building reusable, composable regexps. The letter X is just an
example of course. Possibly this could even be the behaviour
of (?​:n ... ). But I would not want this to distract from the more
important issue of making /n's current behaviour match the docs.)

(FWIW, the real code which prompted this is a regexp library to match a
'simple arithmetic expression', being numbers with operators like +
and - and parentheses. Such a 'simple expression' is in some sense
safe to evaluate using eval(STRING) to get a number.)

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.22.1:

Configured by Red Hat, Inc. at Mon Dec 14 11:14:02 UTC 2015.

Summary of my perl5 (revision 5 version 22 subversion 1) configuration:
   
  Platform:
    osname=linux, osvers=4.3.0-1.fc24.x86_64, archname=x86_64-linux-thread-multi
    uname='linux buildvm-04-nfs.phx2.fedoraproject.org 4.3.0-1.fc24.x86_64 #1 smp mon nov 2 16:27:20 utc 2015 x86_64 x86_64 x86_64 gnulinux '
    config_args='-des -Doptimize=none -Dccflags=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches  -m64 -mtune=generic -Dldflags=-Wl,-z,relro  -Dccdlflags=-Wl,--enable-new-dtags -Wl,-z,relro  -Dlddlflags=-shared -Wl,-z,relro  -Dshrpdir=/usr/lib64 -DDEBUGGING=-g -Dversion=5.22.1 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl5 -Dsitearch=/usr/local/lib64/perl5 -Dprivlib=/usr/share/perl5 -Dvendorlib=/usr/share/perl5/vendor_perl -Darchlib=/usr/lib64/perl5 -Dvendorarch=/usr/lib64/perl5/vendor_perl -Darchname=x86_64-linux-thread-multi -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Duseshrplib -Dusethreads -Duseithreads -Dusedtrace=/usr/bin/dtrace -Duselargefiles -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_sethostent_r_proto -Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto -Ud_setservent_r_proto -Dscriptdir=/usr/bin -Dusesitecustomize'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -fwrapv -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='  -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -fwrapv -fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='5.3.1 20151207 (Red Hat 5.3.1-2)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678, doublekind=3
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16, longdblkind=3
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags ='-Wl,-z,relro  -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib64 /lib64 /usr/lib64 /usr/local/lib /usr/lib /lib/../lib64 /usr/lib/../lib64 /lib
    libs=-lpthread -lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lpthread -lresolv -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.22.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.22'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,--enable-new-dtags -Wl,-z,relro '
    cccdlflags='-fPIC', lddlflags='-shared -Wl,-z,relro  -L/usr/local/lib -fstack-protector-strong'

Locally applied patches:
    Fedora Patch1: Removes date check, Fedora/RHEL specific
    Fedora Patch3: support for libdir64
    Fedora Patch4: use libresolv instead of libbind
    Fedora Patch5: USE_MM_LD_RUN_PATH
    Fedora Patch6: Skip hostname tests, due to builders not being network capable
    Fedora Patch7: Dont run one io test due to random builder failures
    Fedora Patch15: Define SONAME for libperl.so
    Fedora Patch16: Install libperl.so to -Dshrpdir value
    Fedora Patch22: Document Math::BigInt::CalcEmu requires Math::BigInt (CPAN RT#85015)
    Fedora Patch26: Make *DBM_File desctructors thread-safe (RT#61912)
    Fedora Patch27: Make PadlistNAMES() lvalue again (CPAN RT#101063)
    Fedora Patch28: Make magic vtable writable as a work-around for Coro (CPAN RT#101063)
    Fedora Patch200: Link XS modules to libperl.so with EU::CBuilder on Linux
    Fedora Patch201: Link XS modules to libperl.so with EU::MM on Linux


@INC for perl 5.22.1:
    /usr/local/lib64/perl5
    /usr/local/share/perl5
    /usr/lib64/perl5/vendor_perl
    /usr/share/perl5/vendor_perl
    /usr/lib64/perl5
    /usr/share/perl5
    .


Environment for perl 5.22.1:
    HOME=/home/eda
    LANG=en_GB.UTF-8
    LANGUAGE (unset)
    LC_COLLATE=C
    LC_CTYPE=en_GB.UTF-8
    LC_MESSAGES=en_GB.UTF-8
    LC_MONETARY=en_GB.UTF-8
    LC_NUMERIC=en_GB.UTF-8
    LC_TIME=en_GB.UTF-8
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/eda/bin:/home/eda/bin:/usr/local/bin:/usr/bin:/sbin:/usr/sbin:/sbin:/usr/sbin
    PERL_BADLANG (unset)
    SHELL=/bin/bash


@p5pRT

p5pRT commented Feb 26, 2016

From @hvds

"Ed Avis" (via RT) <perlbug-followup@​perl.org> wrote​:
:The /n regexp modifier, according to perlvar, 'will stop $1, $2,
:etc... from being filled in'. However it has another behaviour which
:is not documented, and in my opinion, is not helpful. It also stops
:the group from being referenced by (?-1) and similar within the same
:regexp.

In the original discussion, AFAIR, the intent was solely to provide
a shorthand to avoid needing to write /(?:xxx)/ all over the place
to avoid capturing, such that in the presence of /n we would treat
/(xxx)/ exactly as if it had been written that way.

I suspect the reinterpretation you propose would be problematic, in
that you'd end up having confusion in /(?n:(x))(y)(?-1)/ as to how to
count - I don't think you'd want $1=y to be referenced as group 2,
and there'd be risk of breakage to existing code if we changed that
(though I expect the risk is minimal).

It may be that we need a distinct new feature that delineates some
kind of scope for patterns, so that both numbered and named captures
could be appropriately local for this sort of use case.

(I can imagine value, at least for named captures, in modelling this
quite closely on lexical scopes for variables, such that you can
continue to refer to captures in outer scopes until you mask them by
reusing the name; but maybe a cleaner model would be to have them
quite distinct, to avoid even more spooky action at a distance in the
embedding case.)

Hugo

@p5pRT

p5pRT commented Feb 26, 2016

The RT System itself - Status changed from 'new' to 'open'

@p5pRT

p5pRT commented Feb 26, 2016

From @tamias

On Fri, Feb 26, 2016 at 04​:13​:42AM -0800, Ed Avis wrote​:

The /n regexp modifier, according to perlvar, 'will stop $1, $2,
etc... from being filled in'. However it has another behaviour which
is not documented, and in my opinion, is not helpful. It also stops
the group from being referenced by (?-1) and similar within the same
regexp.

  n

  Prevent the grouping metacharacters () from capturing. This modifier, new
  in 5.22, will stop $1 , $2 , etc... from being filled in.

"Prevent the grouping metacharacters () from capturing."

It seems to me this is working as intended.

Ronald

@p5pRT

p5pRT commented Feb 27, 2016

From @epa

Ronald J Kimball <rjk <at> tamias.net> writes​:

The /n regexp modifier,

"Prevent the grouping metacharacters () from capturing."

It seems to me this is working as intended.

This bug report is about pattern references, not capturing. The syntax
(?-1) means to include the regexp fragment from that group (which can be
a recursive call). It is not related to anything that may have been captured.
This is the difference between (?1) and \1. So for example,

  $_ = 'ab';
  /(.)((?1))/ or die;

This succeeds, with the (?1) call matching the string 'b', even though the
first group matched 'a'. There is no capturing required in order to make a
call to the earlier bit of regexp.
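The distinction can be seen by running both forms against the same string. A minimal sketch, with variable names chosen here purely for illustration:

```perl
use strict;
use warnings;

my $input = 'ab';

# (?1) re-runs the sub-pattern of group 1 (here a lone "."), so it may
# match different text from what group 1 itself captured.
my $recursed_text = $input =~ /(.)((?1))/ ? $2 : undef;

# \1 must match exactly the text that group 1 captured ("a"), which
# is not what follows it in "ab".
my $backref_matches = $input =~ /(.)(\1)/ ? 1 : 0;

print "recursion matched: $recursed_text\n";
print "backreference matched: $backref_matches\n";
```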

I understand that /n turns off capturing. That is what is documented.
What is not documented, and less useful, is that it also stops you being
able to use recursive subpatterns.

Actually, I see that the documentation of recursive subpatterns in perlre
does talk about 'a given capture buffer'. This could imply that if a group
does not capture then it cannot be referred to recursively either, which is
the current behaviour. But there is no real reason why these have to be
linked. It would be more useful for /n to turn off capturing but still allow
recursive subpatterns.

FWIW, there is also an interesting effect where adding /n to an existing
regexp fails to match but does not say why - whereas including it at
regexp compile time gives an error​:

  $_ = 'aa';
  /(.)(?1)/ or die; # matches
  my $re = /(.)(?1)/;
  /$re/n or die; # no match, but no error either

  /(.)(?1)/n; # fails at compile time with a useful error

Again, this breaks composability of regexps. Suppose the implementation of
$RE{whatever} in a common regexp library changed so that internally it used
capturing groups. Client code that used that regexp with the /n flag would
mysteriously find it started failing to match, with no indication why. It
would make more sense for the /n flag in the client code to do what it says​:
turn off capturing so that $1, $2 etc are not populated, but without
changing what strings the regexp matches.

--
Ed Avis <eda@​waniasset.com>

@p5pRT

p5pRT commented Feb 27, 2016

From @epa

<hv <at> crypt.org> writes​:

​:The /n regexp modifier,

In the original discussion, AFAIR, the intent was solely to provide
a shorthand to avoid needing to write /(?​:xxx)/ all over the place

Yes, I can see that is the current behaviour of /n. If you want to keep
that then it would be better to document it that way: 'treat (...) as
(?:...) everywhere in the regexp'.

As I mentioned in another message, this behaviour of /n is a bit dangerous.
You could have code like

  use Some::Regexp::Library;
  /($RE{foo}|$RE{bar})+/n;

where the programmer has used /n as a shortcut to avoid writing (?:...)
everywhere. But now Some::Regexp::Library is stuck. If the implementation
of $RE{foo} changes so that it now uses a recursive subpattern, the code
above will break silently, failing to match and giving no error message.

If you do want /n as a syntactic convenience to save on writing ?: then
it would make more sense for it to apply only to the literal text where
it is used, not affecting other regexp fragments which are included with
variables. But the way regexps are interpolated as strings and compiled
may make this difficult.

(By the way, this all applies to backreferences like \1 as well as
recursive subpatterns. So regexp library code cannot use those either.)

I think it would be more useful to guarantee that /n does not change the
strings matched by the regexp, but it does stop the capture buffers like $1
being filled in. This would be a backwards-incompatible change.

Alternatively, create a new regexp syntax (?;...) which stops external
capturing, so $1 etc are not set, but otherwise behaves like (...).
The group is still visible to \1 and (?1). Changing between (?;...) and
(...) would never affect the set of strings matched by the regexp.
As a syntactic convenience, the /N flag would treat (...) as (?;...).

It may be that we need a distinct new feature that delineates some
kind of scope for patterns, so that both numbered and named captures
could be appropriately local for this sort of use case.

I agree but I'm trying to avoid getting into that, and trying to keep
proposals to what I imagine can be straightforwardly implemented.

--
Ed Avis <eda@​waniasset.com>

@p5pRT

p5pRT commented Feb 27, 2016

From zefram@fysh.org

Ed Avis wrote​:

But now Some::Regexp::Library is stuck. If the implementation
of $RE{foo} changes so that it now uses a recursive subpattern, the code
above will break silently, failing to match and giving no error message.

Only if $RE{foo} is a string of regexp source and doesn't set the flags
it needs. In that case, a similar problem arises if the user adds /i
or any other flag that the library didn't anticipate. But there's no
problem if the %RE values are compiled regexp objects​: those preserve
their flag state when interpolated.
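A sketch of the difference between the two cases, assuming perl 5.22 or later so that the /n flag parses. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $as_object = qr/(.)(?1)/;    # compiled regexp object
my $as_source = '(.)(?1)';      # plain string of regexp source

# The qr// object carries its own flags, so the outer /n does not reach
# the group inside it.
my $object_matches = 'aa' =~ /\A$as_object\z/n ? 1 : 0;

# The plain string is compiled as part of the outer pattern, under /n,
# so group 1 no longer exists and compilation dies at run time.
my $ok = eval { 'aa' =~ /\A$as_source\z/n; 1 };
my $source_ok    = $ok ? 1 : 0;
my $source_error = $ok ? '' : $@;

print "object matches: $object_matches\n";
print "source compiled: $source_ok\n";
```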

-zefram

@p5pRT

p5pRT commented Feb 27, 2016

From @epa

Zefram <zefram <at> fysh.org> writes​:

But now Some::Regexp::Library is stuck. If the implementation
of $RE{foo} changes so that it now uses a recursive subpattern, the code
above will break silently, failing to match and giving no error message.

Only if $RE{foo} is a string of regexp source and doesn't set the flags
it needs. In that case, a similar problem arises if the user adds /i
or any other flag that the library didn't anticipate.

There is some distinction to be made between /i and more wacky modifiers like
/n or /x. It makes some sense to take an unknown regexp and apply /i to it
globally to make it case-insensitive. But your general point is right​:

But there's no
problem if the %RE values are compiled regexp objects​: those preserve
their flag state when interpolated.

Apologies, this is true. I had somehow got the idea that even qr// would be
subject to the /n flag when interpolated. So a lot of the nonsense I wrote
can be discarded.

I think there is still an unhappy semantics for /n currently and it is this.
In my regexp library I would like to add a new recursive subpattern. To do
so I currently have to add a new capturing group. That will break any larger
regexp of which mine is a part.

  my $library_re = qr/a+/;

  # Client code
  /$library_re(\d+)/ or die;
  my $num = $1;

Now in the new version

  my $library_re = qr/(a+|<(?-1)>)/;

But this breaks the client code since $1 is no longer the same. I would
like to turn off capturing, which I thought could be done with (?n:...), but
that currently disables recursive subpatterns too.

Suppose a new syntax (?;...) disables capturing, but still allows relative
subpatterns (?-1), (?-2) and so on. Then library code could use that to
add new groups without breaking callers that use $1, $2, etc. It doesn't
completely solve the problem because it would still break callers that use
(?-1).

--
Ed Avis <eda@​waniasset.com>

@p5pRT

p5pRT commented Feb 27, 2016

From zefram@fysh.org

Ed Avis wrote​:

Suppose a new syntax (?;...) disables capturing, but still allows relative
subpatterns (?-1), (?-2) and so on.

No, that's a half-assed approach, with composability problems as
you note. We've got enough half-assed group referencing mechanisms
as it is. If we're going for composability, let's do it properly,
using a mechanism that's tried and well understood in another context​:
lexical scoping based on explicit declaration. For now I'll not worry
too much about exactly which characters to assign to the new syntax,
and just start all the new stuff with "(?;". Suppose we have

  (?;~NAME,NAME,...;PATTERN)
  Match PATTERN, with each NAME (having identifier syntax) scoped
  to this pattern. Every introduced NAME must be defined to refer
  to some subgroup within this group.

  (?;=NAME;PATTERN)
  Match PATTERN, defining group identifier NAME to refer to
  this group. The name must have been declared with some scope
  that encompasses this group. It is defined for the purposes of
  the innermost such surrounding scope, not for any outer scope
  declaring the same name.

  (?;&NAME)
  Recurse to the group named NAME. The name must have been declared
  with some scope that encompasses this pattern. The group
  referenced is that defined for the innermost such surrounding
  scope, not for any outer scope declaring the same name.

  (?;*NAME)
  Match the exact text that was matched by the group named NAME.
  The name must have been declared with some scope that encompasses
  this pattern. The group referenced is that defined for the
  innermost such surrounding scope, not for any outer scope
  declaring the same name. The match of that group that determines
  the exact text to be matched is the most recent occasion that
  it matched within the current match attempt of the group to
  which the name is scoped. If there is no such match then this
  backreference cannot match.

The group names used here do not interact with the names used by
(?<NAME>PATTERN), \g{NAME}, \k<NAME>, and (?&NAME). None of the above
group types count as a capturing group for the purposes of $1, \1, \g{1},
\g{-1}, (?1), (?-1), and (?+1).

-zefram

@p5pRT

p5pRT commented Feb 27, 2016

From @epa

Now that {} are special characters in regexps, can they be used to introduce
scoping of names? Perhaps only under /x?

I agree with your proposal, just wonder if a more readable syntax is possible.

The other thing I might suggest is that perhaps 'goto' is a more flexible
primitive than calling a group. If you could set a label part way through
the regexp (lexically scoped) and then jump to it later...

--
Ed Avis <eda@​waniasset.com>

@p5pRT

p5pRT commented Feb 27, 2016

From zefram@fysh.org

Ed Avis wrote​:

Now that {} are special characters in regexps, can they be used to introduce
scoping of names?

I'd rather not overload them further. But in any case it wouldn't get
the brevity that you imply, because it's really necessary to declare
the specific names being scoped. You wouldn't be reducing (?;~) to {},
you'd only be reducing (?;~foo,bar;) to {foo,bar;}.

Perhaps only under /x?

That would be a really bad idea. /x has a well-defined behaviour that
doesn't involve changing the meaning of any metacharacters that actually
contribute to the regexp. Don't make it any more complicated.

I agree with your proposal, just wonder if a more readable syntax is possible.

The exact character sequence introducing each item is up for tweaking,
but it can't get a whole lot more readable in the (?...) system.
We have /x to aid readability, so that everything doesn't have to be
smooshed together in one continuous string, and I think that should be
good enough if we're generally happy with (?...). But if your concern is
about making the meaning of obscure constructs more transparent, the way
to go is to replace the punctuation strings with names for the operators.
We've already done that for backtracking control operators, inventing the
(*FOO) space, and we could use that sort of syntax for the scoped names​:

  (*groupscope:NAME,NAME,...:PATTERN)
  (*groupname:NAME:PATTERN)
  (*recurse:NAME)
  (*backref:NAME)

The other thing I might suggest is that perhaps 'goto' is a more flexible
primitive than calling a group.

Less flexible in general, I'd say, because you can't build recursion
out of it. Still, I could imagine a goto feature finding some use.
But it's out of scope for the present discussion. It would be a totally
new regexp feature, unlike the recursion and backreferencing for which
we are discussing new ways of resolving references.

-zefram

@p5pRT

p5pRT commented Apr 7, 2016

From @demerphq

On 27 February 2016 at 22​:30, Zefram <zefram@​fysh.org> wrote​:

Ed Avis wrote​:

Now that {} are special characters in regexps, can they be used to introduce
scoping of names?

I'd rather not overload them further. But in any case it wouldn't get
the brevity that you imply, because it's really necessary to declare
the specific names being scoped. You wouldn't be reducing (?;~) to {},
you'd only be reducing (?;~foo,bar;) to {foo,bar;}.

Perhaps only under /x?

That would be a really bad idea. /x has a well-defined behaviour that
doesn't involve changing the meaning of any metacharacters that actually
contribute to the regexp. Don't make it any more complicated.

I agree with your proposal, just wonder if a more readable syntax is possible.

The exact character sequence introducing each item is up for tweaking,
but it can't get a whole lot more readable in the (?...) system.
We have /x to aid readability, so that everything doesn't have to be
smooshed together in one continuous string, and I think that should be
good enough if we're generally happy with (?...). But if your concern is
about making the meaning of obscure constructs more transparent, the way
to go is to replace the punctuation strings with names for the operators.
We've already done that for backtracking control operators, inventing the
(*FOO) space, and we could use that sort of syntax for the scoped names​:

  (*groupscope:NAME,NAME,...:PATTERN)
  (*groupname:NAME:PATTERN)
  (*recurse:NAME)
  (*backref:NAME)

The other thing I might suggest is that perhaps 'goto' is a more flexible
primitive than calling a group.

Less flexible in general, I'd say, because you can't build recursion
out of it. Still, I could imagine a goto feature finding some use.
But it's out of scope for the present discussion. It would be a totally
new regexp feature, unlike the recursion and backreferencing for which
we are discussing new ways of resolving references.

I just want to point out that there are two problems here. The first is the
simpler part, choosing a syntax that works. Your proposal seems to me
to work around that. The second part, however, is that implementation-wise
there is very tight coupling between capture buffers and
recursion. Off the top of my head I think it might be challenging to
decouple them sufficiently.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

p5pRT commented Apr 7, 2016

From @wolfsage

On Sat, Feb 27, 2016 at 2​:31 AM, Ed Avis <eda@​waniasset.com> wrote​:

<hv <at> crypt.org> writes​:

​:The /n regexp modifier,

In the original discussion, AFAIR, the intent was solely to provide
a shorthand to avoid needing to write /(?​:xxx)/ all over the place

Yes, I can see that is the current behaviour of /n. If you want to keep
that then it would be better to document it that way​: 'treat (...) as
(?​:...) everywhere in the regexp'.

From perldoc perlre​:

  n Prevent the grouping metacharacters "()" from capturing. This
  modifier, new in 5.22, will stop $1, $2, etc... from being filled in.

  "hello" =~ /(hi|hello)/; # $1 is "hello"
  "hello" =~ /(hi|hello)/n; # $1 is undef

  This is equivalent to putting "?:" at the beginning of every capturing
  group:

  "hello" =~ /(?:hi|hello)/; # $1 is undef
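The documented cases can be exercised side by side. This is a sketch only: the sub name `first_capture` is invented here, and the last two cases rely on the further note in perlre that /n can be negated per group with (?-n:...) and that named groups still capture under /n.

```perl
use strict;
use warnings;

# Returns $1 after matching "hello" against the given regexp, or undef
# if the match left $1 unset (or failed altogether).
sub first_capture {
    my ($re) = @_;
    return 'hello' =~ $re ? $1 : undef;
}

my $plain   = first_capture(qr/(hi|hello)/);            # capturing
my $under_n = first_capture(qr/(hi|hello)/n);           # not capturing
my $negated = first_capture(qr/(?-n:(hi|hello))/n);     # /n negated for this group
my $by_name = first_capture(qr/(?<greet>hi|hello)/n);   # named group

for my $result ($plain, $under_n, $negated, $by_name) {
    print defined $result ? "$result\n" : "undef\n";
}
```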

Are there other places where this note needs to be added?

-- Matthew Horsfall (alh)

@p5pRT

p5pRT commented Jun 6, 2016

From @epa

Note that .NET regular expressions have a possible solution to the
composability problem, as described in <http://research.swtch.com/irregexp>:

.NET introduces a variant on the named capture, (?<-x>...), which, during
the match, deletes the last captured substring for x, exposing the one that
was there before.

Is this something that would be easy to implement in Perl? The compatibility
for regexps ported from .NET languages such as C# might be enough reason
in itself.

--
Ed Avis <eda@​waniasset.com>

@p5pRT

p5pRT commented Jun 7, 2016

From @demerphq

On 6 June 2016 at 14​:15, Ed Avis <eda@​waniasset.com> wrote​:

Note that .NET regular expressions have a possible solution to the
composability problem, as described in <http://research.swtch.com/irregexp>:

.NET introduces a variant on the named capture, (?<-x>...), which, during
the match, deletes the last captured substring for x, exposing the one that
was there before.

Is this something that would be easy to implement in Perl? The compatibility
for regexps ported from .NET languages such as C# might be enough reason
in itself.

We don't keep a stack of matches for things like /(a)+/.

If we were to start doing so, IMO it would have to be via a new modifier
which allows it, or everybody would have to pay the penalty for it
even though they don't use it.

Yves

@p5pRT

p5pRT commented Jun 7, 2016

From @iabyn

On Tue, Jun 07, 2016 at 12​:38​:40PM +0200, demerphq wrote​:

On 6 June 2016 at 14​:15, Ed Avis <eda@​waniasset.com> wrote​:

Note that .NET regular expressions have a possible solution to the
composability problem, as described in <http://research.swtch.com/irregexp>:

.NET introduces a variant on the named capture, (?<-x>...), which, during
the match, deletes the last captured substring for x, exposing the one that
was there before.

Is this something that would be easy to implement in Perl? The compatibility
for regexps ported from .NET languages such as C# might be enough reason
in itself.

We don't keep a stack of matches for things like /(a)+/.

Although sometimes we do!

CURLYX quantifiers push a WHILEM entry for every match, so at the end of
"abcd" =~ /(\w+?)+/, the backtrack stack contains

  CURLYX,WHILEM,WHILEM,WHILEM,WHILEM

Each whilem struct could be made to contain the last start and end
positions (in fact they may already do so, possibly indirectly).

IIRC, all quantifiers start off as CURLYX, and then are optimised down to
CURLYM etc where possible. If that's the case, then we simply skip the
optimisations where both​:
* the thing being quantified is a named (?<foo>....),
* a (?<-foo>) has been seen.

Of course it will make that part of the regex execution slower, but only
where it has been requested.

--
The Enterprise successfully ferries an alien VIP from one place to another
without serious incident.
  -- Things That Never Happen in "Star Trek" #7

@p5pRT

p5pRT commented Jun 7, 2016

From @demerphq

On 7 June 2016 at 15​:08, Dave Mitchell <davem@​iabyn.com> wrote​:

On Tue, Jun 07, 2016 at 12​:38​:40PM +0200, demerphq wrote​:

On 6 June 2016 at 14​:15, Ed Avis <eda@​waniasset.com> wrote​:

Note that .NET regular expressions have a possible solution to the
composability problem, as described in <http://research.swtch.com/irregexp>:

.NET introduces a variant on the named capture, (?<-x>...), which, during
the match, deletes the last captured substring for x, exposing the one that
was there before.

Is this something that would be easy to implement in Perl? The compatibility
for regexps ported from .NET languages such as C# might be enough reason
in itself.

We don't keep a stack of matches for things like /(a)+/.

Although sometimes we do!

Yes, true.

CURLYX quantifiers push a WHILEM entry for every match, so at the end of
"abcd" =~ /(\w+?)+/, the backtrack stack contains

  CURLYX,WHILEM,WHILEM,WHILEM,WHILEM

I guess this is a simplification?

perl -Mre=Debug,ALL -e'"abcd"=~/(\w+?)+/'

produces this backtracking stack​:

  #9 WHILEM_B_max yes
  #8 CURLY_B_min
  #7 WHILEM_A_max
  #6 CURLY_B_min
  #5 WHILEM_A_max
  #4 CURLY_B_min
  #3 WHILEM_A_max
  #2 CURLY_B_min
  #1 WHILEM_A_pre
  #0 CURLYX_end yes

Each whilem struct could be made to contain the last start and end
positions (in fact they may already do so, possibly indirectly).

IIRC, all quantifiers start off as CURLYX,

+ and * on simple objects do not go through CURLYX.

$ perl -Mre=Debug,ALL -e'"abcd"=~/a+/'
Assembling pattern from 1 elements
Compiling REx "a+"
Starting first pass (sizing)

a+< | 1| reg
  | | brnc
  | | piec
  | | atom
< | 3| inst - PLUS
Required size 4 nodes
Starting second pass (creation)
a+< | 1| reg
  | | brnc
  | | piec
  | | atom
< | 3| inst - PLUS
  | 5| lsbr~ tying lastbr PLUS (1) to ender END (4) offset 3
  | | tail~ PLUS (1) -> END

and then are optimised down to
CURLYM etc where possible. If that's the case, then we simply skip the
optimisations where both​:
* the thing being quantified is a named (?<foo>....),
* a (?<-foo>) has been seen.

Of course it will make that part of the regex execution slower, but only
where it has been requested.

I fear this sounds easier than it is.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

p5pRT commented Jun 7, 2016

From @iabyn

On Tue, Jun 07, 2016 at 03​:40​:13PM +0200, demerphq wrote​:

We don't keep a stack of matches for things like /(a)+/.

Although sometimes we do!

Yes, true.

CURLYX quantifiers push a WHILEM entry for every match, so at the end of
"abcd" =~ /(\w+?)+/, the backtrack stack contains

  CURLYX,WHILEM,WHILEM,WHILEM,WHILEM

I guess this is a simplification?

Yes sorry I was oversimplifying - I was only listing the contribution
to the backtracking stack made by the outer '+' quantifier.

+ and * on simple objects do not go through CURLYX.

But we're only referring to quantifiers for (?<foo>....), and in
these cases - where there are captures - I suspect it *might* always go
through CURLYX​:

  $ p -Mre=Debug,ALL -e'qr/(?<foo>a)+/'
  ....
  <> | 8| inst - CURLYX
  ....
  Final program​:
  1​: CURLYN[1]{1,INFTY} (11)

I fear this sounds easier than it is.

I'm sure you're right, and I'm certainly not volunteering!

--
O Unicef Clearasil!
Gibberish and Drivel!
  -- "Bored of the Rings"
