Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compile-time check for $1, $2 on RHS of s/// operator #15604

Open
p5pRT opened this issue Sep 16, 2016 · 14 comments
Open

Compile-time check for $1, $2 on RHS of s/// operator #15604

p5pRT opened this issue Sep 16, 2016 · 14 comments

Comments

@p5pRT
Copy link

p5pRT commented Sep 16, 2016

Migrated from rt.perl.org#129283 (status was 'open')

Searchable as RT129283$

@p5pRT
Copy link
Author

p5pRT commented Sep 16, 2016

From @epa

Created by @epa

Currently this gives a run-time warning, but no compile-time error​:

  use 5.022;
  $_ = 'abc';
  s/(a)/x$1$2/;
  say;

If the regexp on the LHS of s/// is known at compile time, then the
number of capturing groups it contains is known. Under 'use strict',
a mention in the RHS of a capture variable which doesn't have a
corresponding capturing group (like $2 in the above example)
should give a compile-time error.

Perl Info

Flags:
    category=core
    severity=wishlist

Site configuration information for perl 5.22.2:

Configured by Red Hat, Inc. at Wed Aug  3 14:06:20 UTC 2016.

Summary of my perl5 (revision 5 version 22 subversion 2) configuration:
   
  Platform:
    osname=linux, osvers=4.6.3-300.fc24.x86_64, archname=x86_64-linux-thread-multi
    uname='linux buildvm-08.phx2.fedoraproject.org 4.6.3-300.fc24.x86_64 #1 smp fri jun 24 20:52:41 utc 2016 x86_64 x86_64 x86_64 gnulinux '
    config_args='-des -Doptimize=none -Dccflags=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches  -m64 -mtune=generic -Dldflags=-Wl,-z,relro  -Dccdlflags=-Wl,--enable-new-dtags -Wl,-z,relro  -Dlddlflags=-shared -Wl,-z,relro  -Dshrpdir=/usr/lib64 -DDEBUGGING=-g -Dversion=5.22.2 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl5 -Dsitearch=/usr/local/lib64/perl5 -Dprivlib=/usr/share/perl5 -Dvendorlib=/usr/share/perl5/vendor_perl -Darchlib=/usr/lib64/perl5 -Dvendorarch=/usr/lib64/perl5/vendor_perl -Darchname=x86_64-linux-thread-multi -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Duseshrplib -Dusethreads -Duseithreads -Dusedtrace=/usr/bin/dtrace -Duselargefiles -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_sethostent_r_proto -Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto -Ud_setservent_r_proto -Dscriptdir=/usr/bin -Dusesitecustomize'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -fwrapv -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='  -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -fwrapv -fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='5.3.1 20160406 (Red Hat 5.3.1-6)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678, doublekind=3
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16, longdblkind=3
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags ='-Wl,-z,relro  -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib64 /lib64 /usr/lib64 /usr/local/lib /usr/lib /lib/../lib64 /usr/lib/../lib64 /lib
    libs=-lpthread -lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lpthread -lresolv -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.22.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.22'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,--enable-new-dtags -Wl,-z,relro '
    cccdlflags='-fPIC', lddlflags='-shared -Wl,-z,relro  -L/usr/local/lib -fstack-protector-strong'

Locally applied patches:
    Fedora Patch1: Removes date check, Fedora/RHEL specific
    Fedora Patch3: support for libdir64
    Fedora Patch4: use libresolv instead of libbind
    Fedora Patch5: USE_MM_LD_RUN_PATH
    Fedora Patch6: Skip hostname tests, due to builders not being network capable
    Fedora Patch7: Dont run one io test due to random builder failures
    Fedora Patch15: Define SONAME for libperl.so
    Fedora Patch16: Install libperl.so to -Dshrpdir value
    Fedora Patch22: Document Math::BigInt::CalcEmu requires Math::BigInt (CPAN RT#85015)
    Fedora Patch26: Make *DBM_File desctructors thread-safe (RT#61912)
    Fedora Patch27: Make PadlistNAMES() lvalue again (CPAN RT#101063)
    Fedora Patch28: Make magic vtable writable as a work-around for Coro (CPAN RT#101063)
    Fedora Patch29: Fix duplicating PerlIO::encoding when spawning threads (RT#31923)
    Fedora Patch30: Do not let XSLoader load relative paths (CVE-2016-6185)
    Fedora Patch31: Avoid loading optional modules from default . (CVE-2016-1238)
    Fedora Patch200: Link XS modules to libperl.so with EU::CBuilder on Linux
    Fedora Patch201: Link XS modules to libperl.so with EU::MM on Linux


@INC for perl 5.22.2:
    /home/eda/lib64/perl5/
    /usr/local/lib64/perl5
    /usr/local/share/perl5
    /usr/lib64/perl5/vendor_perl
    /usr/share/perl5/vendor_perl
    /usr/lib64/perl5
    /usr/share/perl5


Environment for perl 5.22.2:
    HOME=/home/eda
    LANG=en_GB.UTF-8
    LANGUAGE (unset)
    LC_COLLATE=C
    LC_CTYPE=en_GB.UTF-8
    LC_MESSAGES=en_GB.UTF-8
    LC_MONETARY=en_GB.UTF-8
    LC_NUMERIC=en_GB.UTF-8
    LC_TIME=en_GB.UTF-8
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/eda/bin:/home/eda/bin:/usr/local/bin:/usr/bin:/sbin:/usr/sbin:/sbin:/usr/sbin
    PERL5LIB=/home/eda/lib64/perl5/
    PERL_BADLANG (unset)
    SHELL=/bin/bash
Please ignore autogenerated disclaimer below this point.

@p5pRT
Copy link
Author

p5pRT commented Sep 16, 2016

From @Abigail

On Fri, Sep 16, 2016 at 04​:15​:13AM -0700, Ed Avis wrote​:

# New Ticket Created by "Ed Avis"
# Please include the string​: [perl #129283]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=129283 >

This is a bug report for perl from eda@​waniasset.com,
generated with the help of perlbug 1.40 running under perl 5.22.2.

-----------------------------------------------------------------
[Please describe your issue here]

Currently this gives a run-time warning, but no compile-time error​:

use 5\.022;
$\_ = 'abc';
s/\(a\)/x$1$2/;
say;

If the regexp on the LHS of s/// is known at compile time, then the
number of capturing groups it contains is known. Under 'use strict',
a mention in the RHS of a capture variable which doesn't have a
corresponding capturing group (like $2 in the above example)
should give a compile-time error.

"use strict" never promotes a 'Use of uninitialized value' to an
error. I don't think perl being consistent when it comes to s///
is a bug.

Abigail

@p5pRT
Copy link
Author

p5pRT commented Sep 16, 2016

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Sep 17, 2016

From @mauke

Am 16.09.2016 um 23​:36 schrieb Abigail​:

On Fri, Sep 16, 2016 at 04​:15​:13AM -0700, Ed Avis wrote​:

# New Ticket Created by "Ed Avis"
# Please include the string​: [perl #129283]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=129283 >

Currently this gives a run-time warning, but no compile-time error​:

use 5\.022;
$\_ = 'abc';
s/\(a\)/x$1$2/;
say;

If the regexp on the LHS of s/// is known at compile time, then the
number of capturing groups it contains is known. Under 'use strict',
a mention in the RHS of a capture variable which doesn't have a
corresponding capturing group (like $2 in the above example)
should give a compile-time error.

"use strict" never promotes a 'Use of uninitialized value' to an
error. I don't think perl being consistent when it comes to s///
is a bug.

I agree.

It's not that $2 doesn't exist there; it just contains undef. 'strict'
seems the wrong place for this check.

I would like to see it as a warning, but I don't know how hard it would
be to implement (e.g. you'd have to check for embedded regex matches
that could reset capture variables).

--
Lukas Mai <plokinom@​gmail.com>

@p5pRT
Copy link
Author

p5pRT commented Sep 17, 2016

From @cpansprout

On Fri Sep 16 20​:20​:57 2016, plokinom@​gmail.com wrote​:

I agree.

It's not that $2 doesn't exist there; it just contains undef. 'strict'
seems the wrong place for this check.

I would like to see it as a warning, but I don't know how hard it would
be to implement (e.g. you'd have to check for embedded regex matches
that could reset capture variables).

You would also have to take into account that a regexp match might occur within the replacement code; e.g., s/()/ m|()()|; $2/e. Not likely, but something similar could happen in complex code. (I’ve written s/// thingies that go on for fifty lines!)

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Sep 17, 2016

From @epa

To be clear, I was referring only to literal regexp matches where both the LHS and RHS are known at compile time, that is, without /e. There may be complex cases where even a fixed regexp doesn't leave it entirely clear how many capturing buffers will exist, and it makes sense to ignore those cases (they can continue to warn at runtime).

--
Ed Avis <eda@​waniasset.com<mailto​:eda@​waniasset.com>>

On 17 Sep 2016, at 06​:12, Father Chrysostomos via RT <perlbug-followup@​perl.org<mailto​:perlbug-followup@​perl.org>> wrote​:

On Fri Sep 16 20​:20​:57 2016, plokinom@​gmail.com<mailto​:plokinom@​gmail.com> wrote​:
I agree.

It's not that $2 doesn't exist there; it just contains undef. 'strict'
seems the wrong place for this check.

I would like to see it as a warning, but I don't know how hard it would
be to implement (e.g. you'd have to check for embedded regex matches
that could reset capture variables).

You would also have to take into account that a regexp match might occur within the replacement code; e.g., s/()/ m|()()|; $2/e. Not likely, but something similar could happen in complex code. (I've written s/// thingies that go on for fifty lines!)

--

Father Chrysostomos

______________________________________________________________________
This email has been scanned by the Symantec Email Security.cloud service.
For more information please visit http​://www.symanteccloud.com
______________________________________________________________________

@p5pRT
Copy link
Author

p5pRT commented Sep 17, 2016

From @epa

Lukas M, Thanks for your reply. It is true that the $2 variable can always be read, even if no regexp match has happened so far. I guess the point is that the capture buffer number 2 does not exist. Note that sed, for example, does give a hard error if you try to refer to a capturing group that doesn't exist​:

% sed -E 's/(a)/\1\2/'
sed​: -e expression #1, char 11​: invalid reference \2 on `s' command's RHS

(at least for GNU sed). But I don't want to get bogged down in the theological question of whether $2 in the RHS can be said to 'exist', whether such a check would be an extension of 'use strict', or indeed whether it should be a warning or an error.

Simple substitutions such as s/a(b)c/$1/ are a commonly used part of Perl, and it would be great to give the programmer more help in avoiding mistakes in them, as sed and other tools do.

--
Ed Avis <eda@​waniasset.com>

This email is intended only for the person to whom it is addressed and may contain confidential information. Any retransmission, copying, disclosure or other use of, this information by persons other than the intended recipient is prohibited. If you received this email in error, please contact the sender and delete the material. This email is for information only and is not intended as an offer or solicitation for the purchase or sale of any financial instrument. Wadhwani Asset Management LLP is a Limited Liability Partnership registered in England (OC303168) with registered office at 9th Floor Orion House, 5 Upper St Martin’s Lane, London, WC2H 9EA. It is authorised and regulated by the Financial Conduct Authority.

@p5pRT
Copy link
Author

p5pRT commented Sep 17, 2016

From @demerphq

On 17 Sep 2016 3​:54 a.m., "Ed Avis" <eda@​waniasset.com> wrote​:

Lukas M, Thanks for your reply. It is true that the $2 variable can
always be read, even if no regexp match has happened so far.

I am less convinced by this line of reasoning than others.

I guess the point is that the capture buffer number 2 does not exist.
Note that sed, for example, does give a hard error if you try to refer to a
capturing group that doesn't exist​:

% sed -E 's/(a)/\1\2/'
sed​: -e expression #1, char 11​: invalid reference \2 on `s' command's RHS

(at least for GNU sed). But I don't want to get bogged down in the
theological question of whether $2 in the RHS can be said to 'exist',
whether such a check would be an extension of 'use strict', or indeed
whether it should be a warning or an error.

But I think that is precisely the discussion we should have.

Currently we cannot distinguish $2 is undef because it did not match and $2
is undef because there is no capture buffer declared for it in the pattern.

In theory we can distinguish these cases and throw an additional warning.

But we can't do this at compile time, as we cannot tell in advance of regex
execution whether the user will actually use a missing var. Consider

s/$qr/$1 ? $3 : $2 /e

We have no way to know if we will inappropriately access a missing $3 until
we actually do so.

Simple substitutions such as s/a(b)c/$1/ are a commonly used part of
Perl, and it would be great to give the programmer more help in avoiding
mistakes in them, as sed and other tools do.

Agreed. It would actually be trivial to make us generate an additional
warning when this happens.

Yves

@p5pRT
Copy link
Author

p5pRT commented Sep 17, 2016

From @mauke

Am 17.09.2016 um 09​:00 schrieb Ed Avis​:

To be clear, I was referring only to literal regexp matches where both
the LHS and RHS are known at compile time, that is, without /e. There
may be complex cases where even a fixed regexp doesn't leave it entirely
clear how many capturing buffers will exist, and it makes sense to
ignore those cases (they can continue to warn at runtime).

Even with /e the RHS is known at compile time. But /e is a redundant
feature; you can always rewrite the code without /e​:

  s/.../${ m|()()|; \"it's $2" }/

That's why I think detecting the error case reliably may be harder than
it appears.

--
Ed Avis <eda@​waniasset.com<mailto​:eda@​waniasset.com>>

On 17 Sep 2016, at 06​:12, Father Chrysostomos via RT <perlbug-followup@​perl.org<mailto​:perlbug-followup@​perl.org>> wrote​:

On Fri Sep 16 20​:20​:57 2016, plokinom@​gmail.com<mailto​:plokinom@​gmail.com> wrote​:
I agree.

It's not that $2 doesn't exist there; it just contains undef. 'strict'
seems the wrong place for this check.

I would like to see it as a warning, but I don't know how hard it would
be to implement (e.g. you'd have to check for embedded regex matches
that could reset capture variables).

You would also have to take into account that a regexp match might occur within the replacement code; e.g., s/()/ m|()()|; $2/e. Not likely, but something similar could happen in complex code. (I've written s/// thingies that go on for fifty lines!)

--

Father Chrysostomos

--
Lukas Mai <plokinom@​gmail.com>

@p5pRT
Copy link
Author

p5pRT commented Sep 17, 2016

From @demerphq

On 17 September 2016 at 18​:04, Lukas Mai <plokinom@​gmail.com> wrote​:

Am 17.09.2016 um 09​:00 schrieb Ed Avis​:

To be clear, I was referring only to literal regexp matches where both

the LHS and RHS are known at compile time, that is, without /e. There
may be complex cases where even a fixed regexp doesn't leave it entirely
clear how many capturing buffers will exist, and it makes sense to
ignore those cases (they can continue to warn at runtime).

Even with /e the RHS is known at compile time. But /e is a redundant
feature; you can always rewrite the code without /e​:

s/\.\.\./$\{ m|\(\)\(\)|; \\"it's $2" \}/

That's why I think detecting the error case reliably may be harder than it
appears.

I agree. IMO detecting in advance cannot be done.

s/$qr/$1 ? $2 : $3/e

there is no way for us to know at Perl compile time what $qr will
contain, and there is no way for us to know if we will ever try to
access $3 even if we knew that $qr only contained 2 capture buffers.

However we do know this at fetch time, and could easily throw a
specific warning when reading from a capture buffer that was not in
effect at the time of the read (which would be distinct from it
returning undef).

So for instance for code like

defined $5

we could throw a warning like

"use of capture buffer not in scope"

and for code like

print "$5";

we might throw both that warning and the fact that the resulting value
is undefined.

Yves

@p5pRT
Copy link
Author

p5pRT commented Sep 17, 2016

From @cpansprout

On Sat Sep 17 10​:00​:12 2016, demerphq wrote​:

On 17 September 2016 at 18​:04, Lukas Mai <plokinom@​gmail.com> wrote​:

Am 17.09.2016 um 09​:00 schrieb Ed Avis​:

To be clear, I was referring only to literal regexp matches where both

the LHS and RHS are known at compile time, that is, without /e. There
may be complex cases where even a fixed regexp doesn't leave it entirely
clear how many capturing buffers will exist, and it makes sense to
ignore those cases (they can continue to warn at runtime).

Even with /e the RHS is known at compile time. But /e is a redundant
feature; you can always rewrite the code without /e​:

s/\.\.\./$\{ m|\(\)\(\)|; \\"it's $2" \}/

That's why I think detecting the error case reliably may be harder than it
appears.

I agree. IMO detecting in advance cannot be done.

s/$qr/$1 ? $2 : $3/e

there is no way for us to know at Perl compile time what $qr will
contain, and there is no way for us to know if we will ever try to
access $3 even if we knew that $qr only contained 2 capture buffers.

However we do know this at fetch time, and could easily throw a
specific warning when reading from a capture buffer that was not in
effect at the time of the read (which would be distinct from it
returning undef).

So for instance for code like

defined $5

we could throw a warning like

"use of capture buffer not in scope"

and for code like

print "$5";

we might throw both that warning and the fact that the resulting value
is undefined.

That would, unfortunately, give false positives. If the pattern is supplied by the user, or the calling code, then ‘defined $1‘ is a perfectly reasonable way to ask whether the pattern had a () and whether it participated in the match.

While I agree that a warning would be helpful, I generally find false positives to be so annoying that I slap a ‘no warnings’ on my code in the affected area, just because I’m too frustrated to look up the category that needs to be disabled. I can’t imagine I’m the only one who feels that way. (Or am I unique? :-) The result is that other problems with the code will not be flagged by perl.

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Sep 17, 2016

From @demerphq

On 17 September 2016 at 19​:11, Father Chrysostomos via RT
<perlbug-followup@​perl.org> wrote​:

On Sat Sep 17 10​:00​:12 2016, demerphq wrote​:

On 17 September 2016 at 18​:04, Lukas Mai <plokinom@​gmail.com> wrote​:

Am 17.09.2016 um 09​:00 schrieb Ed Avis​:

To be clear, I was referring only to literal regexp matches where both

the LHS and RHS are known at compile time, that is, without /e. There
may be complex cases where even a fixed regexp doesn't leave it entirely
clear how many capturing buffers will exist, and it makes sense to
ignore those cases (they can continue to warn at runtime).

Even with /e the RHS is known at compile time. But /e is a redundant
feature; you can always rewrite the code without /e​:

s/\.\.\./$\{ m|\(\)\(\)|; \\"it's $2" \}/

That's why I think detecting the error case reliably may be harder than it
appears.

I agree. IMO detecting in advance cannot be done.

s/$qr/$1 ? $2 : $3/e

there is no way for us to know at Perl compile time what $qr will
contain, and there is no way for us to know if we will ever try to
access $3 even if we knew that $qr only contained 2 capture buffers.

However we do know this at fetch time, and could easily throw a
specific warning when reading from a capture buffer that was not in
effect at the time of the read (which would be distinct from it
returning undef).

So for instance for code like

defined $5

we could throw a warning like

"use of capture buffer not in scope"

and for code like

print "$5";

we might throw both that warning and the fact that the resulting value
is undefined.

That would, unfortunately, give false positives. If the pattern is supplied by the user, or the calling code, then ‘defined $1‘ is a perfectly reasonable way to ask whether the pattern had a () and whether it participated in the match.

Yes, I agree. I said could/might more or less for these reasons. It is
technically feasible to warn in these cases, but in practice it
probably isnt a good idea.

OTOH I think it would be cool if we returned an "undef" that somehow
signalled to the warning system "this undef came from a non-existent
capture buffer", and/or more generally "this undef came from a
nonexistent key in a hash" when we did throw an uninit warning from
use of that undef. If we could do so cheaply it would be a nice
enhancement to our warnings.

While I agree that a warning would be helpful, I generally find false positives to be so annoying that I slap a ‘no warnings’ on my code in the affected area, just because I’m too frustrated to look up the category that needs to be disabled. I can’t imagine I’m the only one who feels that way. (Or am I unique? :-) The result is that other problems with the code will not be flagged by perl.

Yeah i think this is all too common. Warning/alarm fatigue.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Sep 17, 2016

From @Abigail

On Sat, Sep 17, 2016 at 10​:11​:30AM -0700, Father Chrysostomos via RT wrote​:

That would, unfortunately, give false positives. If the pattern is
supplied by the user, or the calling code, then ‘defined $1‘ is a
perfectly reasonable way to ask whether the pattern had a () and whether
it participated in the match.

While I agree that a warning would be helpful, I generally find false
positives to be so annoying that I slap a ‘no warnings’ on my code
in the affected area, just because I’m too frustrated to look up the
category that needs to be disabled. I can’t imagine I’m the only
one who feels that way. (Or am I unique? :-) The result is that other
problems with the code will not be flagged by perl.

No, you are certainly not unique in that aspect.

False positives are bad. It makes people turn off warnings.
Changing existing constructs such that they now start warning on
false positives will make people throw away the language in disgust.

Abigail

@p5pRT
Copy link
Author

p5pRT commented Sep 19, 2016

From @epa

Lukas M. pointed out that even without the /e flag, a regexp match can evaulate code on the RHS, using the ${ ... } construct.
It is quite true that you cannot definitively flag all cases where a match variable like $2 is used with a regexp that has no corresponding capture buffer.
(In a dynamic language like Perl you can never catch 100% of any class of error at compile time, since eval always provides an escape hatch.)

So could I refine or clarify the proposal a bit. In a regular expression match of the form s/lhs/rhs/, where

  - lhs is a fixed string known at compile time (so it does not contain interpolated variables),
  - lhs is a regexp sufficiently simple that you can put some upper limit on the number of capture buffers it creates (this might be all regexps, I'm not sure),
  - the /e flag is not specified,
  - rhs or at least some part of rhs is known at compile time,

the parts of rhs which are known can be checked to see if they are the capture variables $1, $2 and so on, and a use of one of these variables which is greater than the upper limit of the numbered capture buffers can raise an error at compile time (or a warning - it doesn't much matter which).

For example, if you have

  s/hello (\w+)/hello, Mr $1, my name is $me/g;

the right hand side is a doublequoted string equivalent to the concatenation of these four expressions

  'hello, Mr '
  $1
  ', my name is '
  $me

The left hand side, when parsed, has at most one numbered capture. So at the end of the regexp you can imagine that adding (?1) would be legal but (?2) would raise a 'Reference to nonexistent group' error. Now all the four components of the RHS can be checked. Of them, only $1 is a numbered capture variable, and it's allowed. By contrast using $2 in the RHS would be an error or warning, just as (?2) in the LHS would be an error.

Explicitly not checked are cases such as s/$my_regexp/foo/.

I suggest that the same rule the regexp parser currently uses to number the capture buffers, and so raise the 'Reference to nonexistent group' error, could be applied to decide whether a capture variable in the replacement text is legal. I hope that this would be conservative, and not give false positives. An unwanted diagnostic would arise only in cases where existing code used a nonexistent capture group and relied on the old behaviour of getting an empty string - but this currently warns at runtime.

--
Ed Avis <eda@​waniasset.com>

This email is intended only for the person to whom it is addressed and may contain confidential information. Any retransmission, copying, disclosure or other use of, this information by persons other than the intended recipient is prohibited. If you received this email in error, please contact the sender and delete the material. This email is for information only and is not intended as an offer or solicitation for the purchase or sale of any financial instrument. Wadhwani Asset Management LLP is a Limited Liability Partnership registered in England (OC303168) with registered office at 9th Floor Orion House, 5 Upper St Martin’s Lane, London, WC2H 9EA. It is authorised and regulated by the Financial Conduct Authority.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants