Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document formal grammar for regex sets #16779

Closed
p5pRT opened this issue Dec 2, 2018 · 17 comments · Fixed by #19503
Closed

Document formal grammar for regex sets #16779

p5pRT opened this issue Dec 2, 2018 · 17 comments · Fixed by #19503

Comments

@p5pRT
Copy link

p5pRT commented Dec 2, 2018

Migrated from rt.perl.org#133707 (status was 'new')

Searchable as RT133707$

@p5pRT
Copy link
Author

p5pRT commented Dec 2, 2018

From @khwilliamson

This is a bug report for perl from khw@​khw-xps-8930.(none),
generated with the help of perlbug 1.41 running under perl 5.29.6.


A message that was part of the security ticket [perl #133649] asked that
a formal grammar be defined for the regex sets feature, and suggested
enhancements. The security ticket has now been close, so this ticket is
being opened so that portion of it does not get lost. The complete text
of the portion starts at

https://rt-archive.perl.org/perl5/Ticket/Display.html?id=131649#txn-1471088



Flags​:
  category=core
  severity=low


Site configuration information for perl 5.29.6​:

Configured by khw at Fri Nov 30 08​:02​:34 MST 2018.

Summary of my perl5 (revision 5 version 29 subversion 6) configuration​:
  Commit id​: 23665de
  Platform​:
  osname=linux
  osvers=4.15.0-39-generic
  archname=x86_64-linux-thread-multi-ld
  uname='linux khw-xps-8930 4.15.0-39-generic #42-ubuntu smp tue oct
23 15​:48​:01 utc 2018 x86_64 x86_64 x86_64 gnulinux '
  config_args='-des -Uversiononly -Dprefix=/home/khw/blead -Dusedevel
-A'optimize=-ggdb3' -A'optimize=-O0' -Accflags='-Wno-deprecated'
-Dman1dir=none -Dman3dir=none -Dcc=g++ -DDEBUGGING -Dusemorebits
-Dusecbacktrace -Dusethreads'
  hint=recommended
  useposix=true
  d_sigaction=define
  useithreads=define
  usemultiplicity=define
  use64bitint=define
  use64bitall=define
  uselongdouble=define
  usemymalloc=n
  default_inc_excludes_dot=define
  bincompat5005=undef
  Compiler​:
  cc='g++'
  ccflags ='-D_REENTRANT -D_GNU_SOURCE -Wno-deprecated -fwrapv
-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector-strong
-I/usr/local/include -DUSE_C_BACKTRACE -g -D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2'
  optimize='-O2 -ggdb3 -O0'
  cppflags='-D_REENTRANT -D_GNU_SOURCE -Wno-deprecated -fwrapv
-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector-strong
-I/usr/local/include'
  ccversion=''
  gccversion='7.3.0'
  gccosandvers=''
  intsize=4
  longsize=8
  ptrsize=8
  doublesize=8
  byteorder=12345678
  doublekind=3
  d_longlong=define
  longlongsize=8
  d_longdbl=define
  longdblsize=16
  longdblkind=3
  ivtype='long'
  ivsize=8
  nvtype='long double'
  nvsize=16
  Off_t='off_t'
  lseeksize=8
  alignbytes=16
  prototype=define
  Linker and Libraries​:
  ld='g++'
  ldflags =' -fstack-protector-strong -L/usr/local/lib'
  libpth=/usr/include/c++/7 /usr/include/x86_64-linux-gnu/c++/7
/usr/include/c++/7/backward /usr/local/lib
/usr/lib/gcc/x86_64-linux-gnu/7/include-fixed
/usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib
/usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
  libs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
  perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
  libc=libc-2.27.so
  so=so
  useshrplib=false
  libperl=libperl.a
  gnulibc_version='2.27'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs
  dlext=so
  d_dlsymun=undef
  ccdlflags='-Wl,-E'
  cccdlflags='-fPIC'
  lddlflags='-shared -O2 -ggdb3 -O0 -L/usr/local/lib
-fstack-protector-strong'


@​INC for perl 5.29.6​:
  /home/khw/blead/lib/perl5/site_perl/5.29.6/x86_64-linux-thread-multi-ld
  /home/khw/blead/lib/perl5/site_perl/5.29.6
  /home/khw/blead/lib/perl5/5.29.6/x86_64-linux-thread-multi-ld
  /home/khw/blead/lib/perl5/5.29.6
  /home/khw/blead/lib/perl5/site_perl/5.29.5
  /home/khw/blead/lib/perl5/site_perl/5.29.4
  /home/khw/blead/lib/perl5/site_perl/5.29.3
  /home/khw/blead/lib/perl5/site_perl/5.29.2
  /home/khw/blead/lib/perl5/site_perl/5.29.1
  /home/khw/blead/lib/perl5/site_perl/5.29.0
  /home/khw/blead/lib/perl5/site_perl/5.28.0
  /home/khw/blead/lib/perl5/site_perl/5.27.11
  /home/khw/blead/lib/perl5/site_perl/5.27.10
  /home/khw/blead/lib/perl5/site_perl/5.27.9
  /home/khw/blead/lib/perl5/site_perl/5.27.8
  /home/khw/blead/lib/perl5/site_perl/5.27.7
  /home/khw/blead/lib/perl5/site_perl/5.27.6
  /home/khw/blead/lib/perl5/site_perl/5.27.5
  /home/khw/blead/lib/perl5/site_perl/5.27.4
  /home/khw/blead/lib/perl5/site_perl/5.27.3
  /home/khw/blead/lib/perl5/site_perl/5.27.2
  /home/khw/blead/lib/perl5/site_perl/5.27.1
  /home/khw/blead/lib/perl5/site_perl/5.27.0
  /home/khw/blead/lib/perl5/site_perl/5.26.0
  /home/khw/blead/lib/perl5/site_perl/5.25.12
  /home/khw/blead/lib/perl5/site_perl/5.25.11
  /home/khw/blead/lib/perl5/site_perl/5.25.10
  /home/khw/blead/lib/perl5/site_perl/5.25.9
  /home/khw/blead/lib/perl5/site_perl/5.25.8
  /home/khw/blead/lib/perl5/site_perl/5.25.7
  /home/khw/blead/lib/perl5/site_perl/5.25.6
  /home/khw/blead/lib/perl5/site_perl/5.25.5
  /home/khw/blead/lib/perl5/site_perl/5.25.4
  /home/khw/blead/lib/perl5/site_perl/5.25.3
  /home/khw/blead/lib/perl5/site_perl/5.25.2
  /home/khw/blead/lib/perl5/site_perl/5.25.1
  /home/khw/blead/lib/perl5/site_perl/5.24.0
  /home/khw/blead/lib/perl5/site_perl/5.23.10
  /home/khw/blead/lib/perl5/site_perl/5.23.9
  /home/khw/blead/lib/perl5/site_perl/5.23.8
  /home/khw/blead/lib/perl5/site_perl/5.23.7
  /home/khw/blead/lib/perl5/site_perl/5.23.6
  /home/khw/blead/lib/perl5/site_perl/5.23.5
  /home/khw/blead/lib/perl5/site_perl/5.23.4
  /home/khw/blead/lib/perl5/site_perl/5.23.3
  /home/khw/blead/lib/perl5/site_perl/5.23.2
  /home/khw/blead/lib/perl5/site_perl/5.23.1
  /home/khw/blead/lib/perl5/site_perl/5.23.0
  /home/khw/blead/lib/perl5/site_perl/5.22.0
  /home/khw/blead/lib/perl5/site_perl/5.21.12
  /home/khw/blead/lib/perl5/site_perl/5.21.11
  /home/khw/blead/lib/perl5/site_perl/5.21.10
  /home/khw/blead/lib/perl5/site_perl/5.21.9
  /home/khw/blead/lib/perl5/site_perl/5.21.8
  /home/khw/blead/lib/perl5/site_perl/5.21.7
  /home/khw/blead/lib/perl5/site_perl/5.21.6
  /home/khw/blead/lib/perl5/site_perl/5.21.5
  /home/khw/blead/lib/perl5/site_perl/5.21.4
  /home/khw/blead/lib/perl5/site_perl/5.21.3
  /home/khw/blead/lib/perl5/site_perl/5.21.2
  /home/khw/blead/lib/perl5/site_perl/5.21.1
  /home/khw/blead/lib/perl5/site_perl/5.20.0
  /home/khw/blead/lib/perl5/site_perl/5.19.12
  /home/khw/blead/lib/perl5/site_perl/5.19.11
  /home/khw/blead/lib/perl5/site_perl/5.19.10
  /home/khw/blead/lib/perl5/site_perl


Environment for perl 5.29.6​:
  HOME=/home/khw
  LANG=en_US.UTF-8
  LANGUAGE=en_US
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)

PATH=/usr/lib/ccache​:/home/khw/bin​:/home/khw/perl5/perlbrew/bin​:/home/khw/print/bin​:/bin​:/usr/local/sbin​:/usr/local/bin​:/usr/sbin​:/usr/bin​:/sbin​:/usr/games​:/usr/local/games​:/snap/bin​:/home/khw/iands/www​:/home/khw/cxoffice/bin
  PERL5OPT=-w
  PERL_BADLANG (unset)
  PERL_DIFF_TOOL=wgdiff
  PERL_POD_PEDANTIC=1
  SHELL=/bin/ksh

@khwilliamson
Copy link
Contributor

Here is my try at a formal grammar. I haven't done this kind of thing in a long time, so it may be flawed.

primary-expression:
compiled set-expression, which is typically an interpolated scalar
[ perl character class ]
[:posix:], which is one of [:alpha:], ... [:digit:], ... [:word:]
\perl backslash escape sequence like \n, \p{...} \N{...} \x{...}, etc
( expression )

unary-expression:
primary-expression
! primary-expression

multiplicative-expression:
unary-expression
multiplicative-expression & unary-expression

additive-expression:
multiplicative-expression
additive-expression + multiplicative-expression
additive-expression | multiplicative-expression
additive-expression - multiplicative-expression
additive-expression ^ multiplicative-expression

expression:
additive_expression

My sense is that @demerphq was expecting there to be lots of gotchas when he requested this, around the fact that you can interpolate in an already-compiled regex set expression. And in fact it doesn't check that the code came from interpolation of such. Only that it is a syntactically valid compiled expression displayed out '(?^:...)'

I would like to make the regex sets feature non-experimental. And I believe this is the final road block to that.

@demerphq
Copy link
Collaborator

demerphq commented Dec 9, 2019 via email

@khwilliamson
Copy link
Contributor

There have been various changes over the years to make things behave properly. I am unaware of anything at this time that would likely be a surprise. It doesn't sound like you have any examples immediately at hand. My guess is that the ones you knew of have been fixed.

@demerphq
Copy link
Collaborator

demerphq commented Dec 9, 2019 via email

@khwilliamson
Copy link
Contributor

khwilliamson commented Dec 9, 2019 via email

@demerphq
Copy link
Collaborator

demerphq commented Dec 9, 2019 via email

@demerphq
Copy link
Collaborator

demerphq commented Dec 9, 2019 via email

@khwilliamson
Copy link
Contributor

khwilliamson commented Dec 9, 2019 via email

@khwilliamson
Copy link
Contributor

khwilliamson commented Dec 9, 2019 via email

@demerphq
Copy link
Collaborator

demerphq commented Dec 10, 2019 via email

@khwilliamson
Copy link
Contributor

Do we need to see multiple warnings about the experimental status
from a single pattern?

That could be fixed, but it will be fixed automatically without extra effort by making it non-experimental.

As to the rest, @demerphq found some real bugs. I have a WIP at smoke-me/khw-sets which I believe addresses his examples. I apologize for the awful error messages. I think this branch cleans them up. The stuff about flags, etc. were extremely misleading.

Before I proceed down that branch, I'm hoping @demerphq will play with it a little to see if he can find some results that are still confusing. Flags should now be irrelevant.

khwilliamson added a commit that referenced this issue Feb 18, 2020
A set operations expression can contain a previously-compiled one
interpolated in.  Prior to this commit, some heuristics were employed
to verify it actually was such a thing, and not a sort of look-alike
that wasn't necessarily valid.  The heuristics actually forbade legal
ones.  I don't know of any illegal ones that were let through, but it is
certainly possible.  Also, the error/warning messages referred to the
heuristics, and were unhelpful at best.

The technique used instead in this commit is to return a regop only used
by this feature for any nested compilations.  This guarantees that the
caller can determine if the result is valid, and what that result is
without having to do any heuristics or inspecting any flags.  The
error/warning messages are changed to reflect this, and I believe are
now helpful.

This fixes the bugs in #16779
#16779 (comment)
khwilliamson added a commit that referenced this issue Feb 20, 2020
A set operations expression can contain a previously-compiled one
interpolated in.  Prior to this commit, some heuristics were employed
to verify it actually was such a thing, and not a sort of look-alike
that wasn't necessarily valid.  The heuristics actually forbade legal
ones.  I don't know of any illegal ones that were let through, but it is
certainly possible.  Also, the error/warning messages referred to the
heuristics, and were unhelpful at best.

The technique used instead in this commit is to return a regop only used
by this feature for any nested compilations.  This guarantees that the
caller can determine if the result is valid, and what that result is
without having to do any heuristics or inspecting any flags.  The
error/warning messages are changed to reflect this, and I believe are
now helpful.

This fixes the bugs in #16779
#16779 (comment)
@xenu xenu removed the Severity Low label Dec 29, 2021
@demerphq
Copy link
Collaborator

It seems a bit weird that I cant do this:

my $cc1= qr/[a-c]/;
my $cc2= qr/(?[ $cc1 + [x-z] ])/;

but I can do this:

my $cc3= qr/(?[ [a-c] + [x-z] ])/;

and I can do this:

my $ecc1= qr/(?[ [a-c] ])/;
my $ecc2= qr/(?[ $ecc1 + [x-z] ])/;

Also shouldn't these two produce the same errors?

$ ./perl -Ilib -lE'print "x"=~/(?[ (?^x:(?[ ])) ])/ ? "yes" : "no";'
The regex_sets feature is experimental in regex; marked by <-- HERE in m/(?[ <-- HERE (?^x:(?[ ])) ])/ at -e line 1.
Incomplete expression within '(?[ ])' in regex; marked by <-- HERE in m/(?[ (?^x:(?[ <-- HERE ])) ])/ at -e line 1.

$ ./perl -Ilib -lE'print "x"=~/(?[ (?x:(?[ ])) ])/ ? "yes" : "no";'
The regex_sets feature is experimental in regex; marked by <-- HERE in m/(?[ <-- HERE (?x:(?[ ])) ])/ at -e line 1.
Unexpected character in regex; marked by <-- HERE in m/(?[ (? <-- HERE x:(?[ ])) ])/ at -e line 1.

Why does (?x:(?[ ])) produce an "Unexpected character in regex" and (?^x:(?[ ])) produce a Incomplete expression within.

Similarly why does:

(?x:(?[ [x] ]))

produce an exception but

(?^x:(?[ [x] ]))

does not?

The docs say "(?^x:foo) is equivalent to (?x-imns:foo)" so why is one accepted and one is not?

$ ./perl -Ilib -lE'print "x"=~/(?[ (?^x:(?[ [x] ])) ])/ ? "yes" : "no";'
The regex_sets feature is experimental in regex; marked by <-- HERE in m/(?[ <-- HERE (?^x:(?[ [x] ])) ])/ at -e line 1.
yes

$ ./perl -Ilib -lE'print "x"=~/(?[ (?x-imns:(?[ [x] ])) ])/ ? "yes" : "no";'
The regex_sets feature is experimental in regex; marked by <-- HERE in m/(?[ <-- HERE (?x-imns:(?[ [x] ])) ])/ at -e line 1.
Unexpected character in regex; marked by <-- HERE in m/(?[ (? <-- HERE x-imns:(?[ [x] ])) ])/ at -e line 1.

@khwilliamson
Copy link
Contributor

khwilliamson commented Feb 24, 2022 via email

@demerphq
Copy link
Collaborator

demerphq commented Feb 25, 2022 via email

khwilliamson added a commit to khwilliamson/perl5 that referenced this issue Feb 28, 2022
See Perl#16779 (comment)

This commit extends the flags accepted by nested calls to regex sets to
any legal set.  Previously it allowed only '^', as that is what an
actual compilation would return.  But as pointed out in the conversation
in that ticket, it's possible to detach flags from the rest of the
pattern, and to write equivalent ones in multiple ways.

This commit just changes one line to look for any legal flag, not
restricting it to '^'.
@khwilliamson
Copy link
Contributor

khwilliamson commented Feb 28, 2022 via email

@demerphq
Copy link
Collaborator

demerphq commented Mar 1, 2022 via email

khwilliamson added a commit that referenced this issue Mar 3, 2022
See #16779 (comment)

This commit extends the flags accepted by nested calls to regex sets to
any legal set.  Previously it allowed only '^', as that is what an
actual compilation would return.  But as pointed out in the conversation
in that ticket, it's possible to detach flags from the rest of the
pattern, and to write equivalent ones in multiple ways.

This commit just changes one line to look for any legal flag, not
restricting it to '^'.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants