Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regexp syntax check when variables contained with \Q...\E #15644

Open
p5pRT opened this issue Oct 4, 2016 · 4 comments
Open

Regexp syntax check when variables contained with \Q...\E #15644

p5pRT opened this issue Oct 4, 2016 · 4 comments

Comments

@p5pRT
Copy link

p5pRT commented Oct 4, 2016

Migrated from rt.perl.org#129803 (status was 'open')

Searchable as RT129803$

@p5pRT
Copy link
Author

p5pRT commented Oct 4, 2016

From @epa

Created by @epa

When possible regular expressions are compiled and checked at program compile time.
This program gives an error even though the sub f() is never called​:

  sub f { /(/ }

But when the regexp includes variables, this early checking cannot be done.
The following code will only give an error if and when f() is called​:

  $x = 'a';
  sub f { /($x/ }

In general, this has to be so. Perl can't know what the possible
values of $x will be at run time. If $x contains ')' then the regexp
is well-formed.

However, when building a regexp you may choose to use the \Q...\E
mechanism as a safer alternative to raw string interpolation.
Whatever appears inside \Q...\E is effectively matched as a literal
string, with regexp metacharacters like ) not having their usual
effect. (As an implementation detail, this might work by \Q...\E
carefully escaping all such characters with backslashes before parsing
the regexp, but from the user's point of view you can consider it a
way to match literal text contained in a scalar.)

If the variable appears in the regexp protected by \Q...\E then the
syntax checks can still happen as normal.

  sub f { /(\Q$x\E/ }

No matter the value of $x at run time, this will never be a valid
regexp; it will always have an unbalanced ( at the start. Perl could
warn for this regexp at compile time just as it warns for /(/.

One possible way to do this would be to try making a munged version of
the regexp, replacing all \Q...\E fragments of the regexp with the
dummy construct (?​:). If the resulting munged regexp does not contain
any variables, then it can be used as a compile-time check; the munged
regexp will be syntactically valid if and only if the original regexp
is syntactically valid for all possible uses.

There might be a better way to do this checking depending on
implementation details of the regexp engine.

* Further discussion

Not strictly part of the bug report, but I would like to mention a
possible avenue to improve compile-time checking of composed regexps
when \Q...\E is not used.

The other major way to build up a regexp from variables is to include
a variable which itself holds a compiled regular expression (qr/.../).
Here, too, you can detect some regexp syntax errors for certain, no
matter what other regexp the variable may hold.

  my $re = qr/whatever/;
  /($re/;

This is always going to be an invalid regexp as long as $re holds a
compiled regular expression object. The compiled regexp included can
never manage to close the open ( since it must itself have balanced
parentheses. But Perl doesn't have a way to know that $re will always
hold a compiled regexp at run time.

Suppose there were a way for the programmer to specify his or her
intention that the variable will always hold a compiled regexp. Let's
say that \I...\J is used for this (this syntax is arbitrary and only
for the sake of discussion). So you would write

  my $re_1 = qr/hel+o/;
  my $re_2 = qr/there/;
  my $re = qr/ \I$re_1\J \s+ \I$re_2\J /x;

At the point when $re is compiled, and the values of $re_1 and $re_2
are included in the larger regexp string, Perl would check at run time
that they really were compiled regexp objects and not arbitrary
scalars.

In itself, this might catch a few bugs but is perhaps not worth the
extra clunkiness. However, it would also enable compile-time checking
of regexp syntax in the same way as \Q...\J would.

  my $re = qr/whatever/;
  /(\I$re\J/;

Here Perl can know that since $re is always going to be a compiled
regexp, the resulting regexp of /($re/ will always have unbalanced
parens. The checking may have more subtleties than for \Q...\E since
an included compiled regexp can change the number of capturing groups,
for example, affecting the validity of later backreferences. Perhaps
only the most basic syntax checks (like making sure parens are
balanced) could be done at program compile time, with more 'semantic'
ones (like a backreference to a nonexistent group) done when the
string interpolation has been done and the final regexp is compiled.

Finally, only for those who enjoy the bondage and discipline, an
optional 'strict mode' would forbid arbitrary string interpolation in
regexps, requiring all variable uses to be explicitly labelled as
either \Q...\E (for matching a literal string) or \I...\J (for
including a precompiled regexp fragment). Of course, there are cases
where you want to build up a regexp from fragments which are not
themselves valid regexps, so strict mode would not be appropriate all
the time and certainly not on by default. But it would be a useful
way to eliminate regexp-injection bugs. (Taint mode helps too, of
course, but is a run time check only.)

Perl Info

Flags:
    category=core
    severity=wishlist

Site configuration information for perl 5.22.2:

Configured by Red Hat, Inc. at Wed Aug  3 14:06:20 UTC 2016.

Summary of my perl5 (revision 5 version 22 subversion 2) configuration:
   
  Platform:
    osname=linux, osvers=4.6.3-300.fc24.x86_64, archname=x86_64-linux-thread-multi
    uname='linux buildvm-08.phx2.fedoraproject.org 4.6.3-300.fc24.x86_64 #1 smp fri jun 24 20:52:41 utc 2016 x86_64 x86_64 x86_64 gnulinux '
    config_args='-des -Doptimize=none -Dccflags=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches  -m64 -mtune=generic -Dldflags=-Wl,-z,relro  -Dccdlflags=-Wl,--enable-new-dtags -Wl,-z,relro  -Dlddlflags=-shared -Wl,-z,relro  -Dshrpdir=/usr/lib64 -DDEBUGGING=-g -Dversion=5.22.2 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl5 -Dsitearch=/usr/local/lib64/perl5 -Dprivlib=/usr/share/perl5 -Dvendorlib=/usr/share/perl5/vendor_perl -Darchlib=/usr/lib64/perl5 -Dvendorarch=/usr/lib64/perl5/vendor_perl -Darchname=x86_64-linux-thread-multi -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Duseshrplib -Dusethreads -Duseithreads -Dusedtrace=/usr/bin/dtrace -Duselargefiles -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_sethostent_r_proto -Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto -Ud_setservent_r_proto -Dscriptdir=/usr/bin -Dusesitecustomize'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -fwrapv -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='  -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -fwrapv -fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='5.3.1 20160406 (Red Hat 5.3.1-6)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678, doublekind=3
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16, longdblkind=3
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags ='-Wl,-z,relro  -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib64 /lib64 /usr/lib64 /usr/local/lib /usr/lib /lib/../lib64 /usr/lib/../lib64 /lib
    libs=-lpthread -lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lpthread -lresolv -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.22.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.22'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,--enable-new-dtags -Wl,-z,relro '
    cccdlflags='-fPIC', lddlflags='-shared -Wl,-z,relro  -L/usr/local/lib -fstack-protector-strong'

Locally applied patches:
    Fedora Patch1: Removes date check, Fedora/RHEL specific
    Fedora Patch3: support for libdir64
    Fedora Patch4: use libresolv instead of libbind
    Fedora Patch5: USE_MM_LD_RUN_PATH
    Fedora Patch6: Skip hostname tests, due to builders not being network capable
    Fedora Patch7: Dont run one io test due to random builder failures
    Fedora Patch15: Define SONAME for libperl.so
    Fedora Patch16: Install libperl.so to -Dshrpdir value
    Fedora Patch22: Document Math::BigInt::CalcEmu requires Math::BigInt (CPAN RT#85015)
    Fedora Patch26: Make *DBM_File desctructors thread-safe (RT#61912)
    Fedora Patch27: Make PadlistNAMES() lvalue again (CPAN RT#101063)
    Fedora Patch28: Make magic vtable writable as a work-around for Coro (CPAN RT#101063)
    Fedora Patch29: Fix duplicating PerlIO::encoding when spawning threads (RT#31923)
    Fedora Patch30: Do not let XSLoader load relative paths (CVE-2016-6185)
    Fedora Patch31: Avoid loading optional modules from default . (CVE-2016-1238)
    Fedora Patch200: Link XS modules to libperl.so with EU::CBuilder on Linux
    Fedora Patch201: Link XS modules to libperl.so with EU::MM on Linux


@INC for perl 5.22.2:
    /home/eda/lib64/perl5/
    /usr/local/lib64/perl5
    /usr/local/share/perl5
    /usr/lib64/perl5/vendor_perl
    /usr/share/perl5/vendor_perl
    /usr/lib64/perl5
    /usr/share/perl5


Environment for perl 5.22.2:
    HOME=/home/eda
    LANG=en_GB.UTF-8
    LANGUAGE (unset)
    LC_COLLATE=C
    LC_CTYPE=en_GB.UTF-8
    LC_MESSAGES=en_GB.UTF-8
    LC_MONETARY=en_GB.UTF-8
    LC_NUMERIC=en_GB.UTF-8
    LC_TIME=en_GB.UTF-8
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/eda/bin:/home/eda/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/sbin:/home/eda/.local/bin:/home/eda/bin:/sbin:/usr/sbin
    PERL5LIB=/home/eda/lib64/perl5/
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Nov 11, 2016

From @Abigail

On Tue, Oct 04, 2016 at 07​:29​:05AM -0700, Ed Avis wrote​:

# New Ticket Created by "Ed Avis"
# Please include the string​: [perl #129803]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=129803 >

This is a bug report for perl from eda@​waniasset.com,
generated with the help of perlbug 1.40 running under perl 5.22.2.

-----------------------------------------------------------------
[Please describe your issue here]

When possible regular expressions are compiled and checked at program compile time.
This program gives an error even though the sub f() is never called​:

sub f \{ /\(/ \}

But when the regexp includes variables, this early checking cannot be done.
The following code will only give an error if and when f() is called​:

$x = 'a';
sub f \{ /\($x/ \}

In general, this has to be so. Perl can't know what the possible
values of $x will be at run time. If $x contains ')' then the regexp
is well-formed.

However, when building a regexp you may choose to use the \Q...\E
mechanism as a safer alternative to raw string interpolation.
Whatever appears inside \Q...\E is effectively matched as a literal
string, with regexp metacharacters like ) not having their usual
effect. (As an implementation detail, this might work by \Q...\E
carefully escaping all such characters with backslashes before parsing
the regexp, but from the user's point of view you can consider it a
way to match literal text contained in a scalar.)

If the variable appears in the regexp protected by \Q...\E then the
syntax checks can still happen as normal.

sub f \{ /\(\\Q$x\\E/ \}

No matter the value of $x at run time, this will never be a valid
regexp; it will always have an unbalanced ( at the start. Perl could
warn for this regexp at compile time just as it warns for /(/.

One possible way to do this would be to try making a munged version of
the regexp, replacing all \Q...\E fragments of the regexp with the
dummy construct (?​:). If the resulting munged regexp does not contain
any variables, then it can be used as a compile-time check; the munged
regexp will be syntactically valid if and only if the original regexp
is syntactically valid for all possible uses.

There might be a better way to do this checking depending on
implementation details of the regexp engine.

I don't see much of a benefit of this. You'd be adding an additional
compilation of a regexp at compile time, but then you have to throw
away the compiled result, as it's a different pattern than what would
really be there. The only win is that if you write an regexp with a
syntax error, you get an error earlier. But once you have fixed the
the error, you keep paying the speed penalty each time you run the
program. And if you don't make a mistake in the first place, you pay
the speed penalty.

Abigail

@p5pRT
Copy link
Author

p5pRT commented Nov 11, 2016

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Nov 11, 2016

From @epa

Yes, the advantage is that of compile-time checking of regular expressions. Speaking personally, I find this one of the advantages of Perl over other scripting languages, where a regexp syntax error is caught only at run time. There is a small overhead associated with compiling the regular expression even though it may not in the end be used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants