regex optimiser wrongly rejects certain matches involving embedded comments #12755

p5pRT · 2013-02-05T09:26:29Z

Migrated from rt.perl.org#116639 (status was 'resolved')

Searchable as RT116639$

p5pRT · 2013-02-05T09:26:29Z

From @nwc10

Created by @nwc10

It looks like the regex optimiser is wrongly concluding "can never match"
for certain patterns involving embedded comments.

Pathological patterns that put a embedded comment between an atom and its
quantifier :-)

Not a regression - has been present since embedded comments were added in
5.001:

$ ~~/Sandpit/5001/bin/perl -le 'print "abbc" =~~ /.b(?#Comment){2}c/ ? "Y" : "n"'
Y
$ ~~/Sandpit/5001/bin/perl -le 'print "abbc" =~~ /ab(?#Comment){2}c/ ? "Y" : "n"'
n

Still present in blead. The smoking gun is the debugging output
"String shorter than min possible regex match"

Compare the pattern /.b(?#Comment){2}c/ which enters the engine and matches:

$ ./perl -Dr -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"'
Compiling REx ".b(?#Comment){2}c"
rarest char b at 0
Final program:
1: REG_ANY (2)
2: CURLY {2,2} (6)
4: EXACT (0)
6: EXACT <c> (8)
8: END (0)
anchored "bbc" at 1 (checking anchored) minlen 4
Enabling $‘$ & $' support (0x7).

EXECUTING...

Guessing start of match in sv for REx ".b(?#Comment){2}c" against "abbc"
Found anchored substr "bbc" at offset 1...
Guessed: match at offset 0
Matching REx ".b(?#Comment){2}c" against "abbc"
0 <> <abbc> | 1:REG_ANY(2)
1 <a> <bbc> | 2:CURLY {2,2}(6)
EXACT can match 2 times out of 2...
3 <abb> <c> | 6: EXACT <c>(8)
4 <abbc> <> | 8: END(0)
Match successful!
Y
Freeing REx: ".b(?#Comment){2}c"

with /ab(?#Comment){2}c/ which is (wrongly) rejected by the optimiser, and
hence never enters the engine:

$ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"'
Compiling REx "ab(?#Comment){2}c"
rarest char b at 1
Final program:
1: CURLYM[0] {2,2} (7)
3: EXACT <ab> (5)
5: SUCCEED (0)
6: NOTHING (7)
7: EXACT <c> (9)
9: END (0)
anchored "ababc" at 0 (checking anchored isall) minlen 5
Enabling $‘$ & $' support (0x7).

EXECUTING...

String shorter than min possible regex match
n
Freeing REx: "ab(?#Comment){2}c"

Note, multiline comments, using /x, are not affected by this bug:

$ ./perl -le 'print "abbc" =~ /ab#Comment' -e '{2}c/x ? "Y" : "n"'
Y

Nicholas Clark

Perl Info


Flags:
    category=core
    severity=low

Site configuration information for perl 5.16.2:

Configured by nick at Fri Dec 21 07:57:09 GMT 2012.

Summary of my perl5 (revision 5 version 16 subversion 2) configuration:
  Commit id: db7f29e888f2e6d75a806a11ebc6caa6acd84577
  Platform:
    osname=freebsd, osvers=7.0-stable, archname=i386-freebsd
    uname='freebsd plum.flirble.org 7.0-stable freebsd 7.0-stable #7: sat jul 26 20:39:26 bst 2008 root@plum.flirble.org:usrobjusrsrcsysplum i386 '
    config_args='-des -Dprefix=/home/nick/Sandpit/5162'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
    optimize='-O',
    cppflags='-DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.2.1 20070719  [FreeBSD]', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags ='-Wl,-E  -fstack-protector -L/usr/local/lib'
    libpth=/usr/lib /usr/local/lib
    libs=-lgdbm -lm -lcrypt -lutil -lc
    perllibs=-lm -lcrypt -lutil -lc
    libc=, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
    cccdlflags='-DPIC -fPIC', lddlflags='-shared  -L/usr/local/lib -fstack-protector'

Locally applied patches:
    


@INC for perl 5.16.2:
    /home/nick/Sandpit/5162/lib/perl5/site_perl/5.16.2/i386-freebsd
    /home/nick/Sandpit/5162/lib/perl5/site_perl/5.16.2
    /home/nick/Sandpit/5162/lib/perl5/5.16.2/i386-freebsd
    /home/nick/Sandpit/5162/lib/perl5/5.16.2
    .


Environment for perl 5.16.2:
    HOME=/home/nick
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/nick/bin:/opt/local/bin:/opt/local/sbin:/usr/flirble/admin/bin:/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin:/home/nick/bin:/usr/local/sbin:/sbin:/usr/sbin
    PERL_BADLANG (unset)
    SHELL=/usr/local/bin/bash

p5pRT · 2013-02-05T10:00:02Z

From @nwc10

On Tue, Feb 05, 2013 at 01:26:30AM -0800, Nicholas Clark wrote:

It looks like the regex optimiser is wrongly concluding "can never match"
for certain patterns involving embedded comments.

String shorter than min possible regex match
n
Freeing REx: "ab(?#Comment){2}c"

OK, seems that it's more ugly, and affects more things. Or maybe this is a
different bug:

$ ./perl -Dr -le 'print "abbc" =~ /a(?{"b"}){2}c/ ? "Y" : "n"'
Compiling REx "a(?{%"b%"}){2}c"
rarest char c at 1
Final program:
1: EXACT <a> (3)
3: CURLYX[0] {2,2} (8)
5: EVAL (7)
7: WHILEM (0)
8: NOTHING (9)
9: EXACT <c> (11)
11: END (0)
anchored "ac" at 0 (checking anchored) minlen 2 with eval
Enabling $‘$ & $' support (0x7).

EXECUTING...

Guessing start of match in sv for REx "a(?{%"b%"}){2}c" against "abbc"
Did not find anchored substr "ac"...
Match rejected by optimizer
n
Freeing REx: "a(?{%"b%"}){2}c"

There is no fixed string "ac" in the pattern. Bad optimiser, no cookie.

Nicholas Clark

p5pRT · 2013-02-05T13:15:54Z

From @demerphq

On 5 February 2013 10:59, Nicholas Clark <nick@ccl4.org> wrote:

On Tue, Feb 05, 2013 at 01:26:30AM -0800, Nicholas Clark wrote:

It looks like the regex optimiser is wrongly concluding "can never match"
for certain patterns involving embedded comments.

String shorter than min possible regex match
n
Freeing REx: "ab(?#Comment){2}c"

OK, seems that it's more ugly, and affects more things. Or maybe this is a
different bug:

$ ./perl -Dr -le 'print "abbc" =~ /a(?{"b"}){2}c/ ? "Y" : "n"'
Compiling REx "a(?{%"b%"}){2}c"
rarest char c at 1
Final program:
1: EXACT <a> (3)
3: CURLYX[0] {2,2} (8)
5: EVAL (7)
7: WHILEM (0)
8: NOTHING (9)
9: EXACT <c> (11)
11: END (0)
anchored "ac" at 0 (checking anchored) minlen 2 with eval
Enabling $‘$ & $' support (0x7).

EXECUTING...

Guessing start of match in sv for REx "a(?{%"b%"}){2}c" against "abbc"
Did not find anchored substr "ac"...
Match rejected by optimizer
n
Freeing REx: "a(?{%"b%"}){2}c"

There is no fixed string "ac" in the pattern. Bad optimiser, no cookie.

Bad programmer misread manual, no cookie. :-)

The (?{...}) is a zero width pattern. It always matches, and the
return is discarded.

So the pattern is identical to:

/ac/

And is correctly rejected by the optimizer.

Perhaps you meant (??{ ... }) which is quite different.

Also, you must be testing on an old perl no? I dont recall seeing that
rarest char output in a long time....

Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2013-02-05T13:15:55Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2013-02-05T13:17:21Z

From @demerphq

On 5 February 2013 10:26, Nicholas Clark <perlbug-followup@perl.org> wrote:

# New Ticket Created by Nicholas Clark
# Please include the string: [perl #116639]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=116639 >

This is a bug report for perl from nick@ccl4.org,
generated with the help of perlbug 1.39 running under perl 5.16.2.

-----------------------------------------------------------------
[Please describe your issue here]

It looks like the regex optimiser is wrongly concluding "can never match"
for certain patterns involving embedded comments.

Pathological patterns that put a embedded comment between an atom and its
quantifier :-)

Not a regression - has been present since embedded comments were added in
5.001:

$ ~~/Sandpit/5001/bin/perl -le 'print "abbc" =~~ /.b(?#Comment){2}c/ ? "Y" : "n"'
Y
$ ~~/Sandpit/5001/bin/perl -le 'print "abbc" =~~ /ab(?#Comment){2}c/ ? "Y" : "n"'
n

Still present in blead. The smoking gun is the debugging output
"String shorter than min possible regex match"

Compare the pattern /.b(?#Comment){2}c/ which enters the engine and matches:

$ ./perl -Dr -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"'
Compiling REx ".b(?#Comment){2}c"
rarest char b at 0
Final program:
1: REG_ANY (2)
2: CURLY {2,2} (6)
4: EXACT (0)
6: EXACT <c> (8)
8: END (0)
anchored "bbc" at 1 (checking anchored) minlen 4
Enabling $‘$ & $' support (0x7).

EXECUTING...

Guessing start of match in sv for REx ".b(?#Comment){2}c" against "abbc"
Found anchored substr "bbc" at offset 1...
Guessed: match at offset 0
Matching REx ".b(?#Comment){2}c" against "abbc"
0 <> <abbc> | 1:REG_ANY(2)
1 <a> <bbc> | 2:CURLY {2,2}(6)
EXACT can match 2 times out of 2...
3 <abb> <c> | 6: EXACT <c>(8)
4 <abbc> <> | 8: END(0)
Match successful!
Y
Freeing REx: ".b(?#Comment){2}c"

with /ab(?#Comment){2}c/ which is (wrongly) rejected by the optimiser, and
hence never enters the engine:

$ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"'
Compiling REx "ab(?#Comment){2}c"
rarest char b at 1
Final program:
1: CURLYM[0] {2,2} (7)
3: EXACT <ab> (5)
5: SUCCEED (0)
6: NOTHING (7)
7: EXACT <c> (9)
9: END (0)
anchored "ababc" at 0 (checking anchored isall) minlen 5
Enabling $‘$ & $' support (0x7).

EXECUTING...

String shorter than min possible regex match
n
Freeing REx: "ab(?#Comment){2}c"

Note, multiline comments, using /x, are not affected by this bug:

$ ./perl -le 'print "abbc" =~ /ab#Comment' -e '{2}c/x ? "Y" : "n"'
Y

I consider this a documentation bug and not a feature bug. You simply
cannot put comments *anywhere*, and afaik, never could have.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2013-02-05T13:26:09Z

From @nwc10

On Tue, Feb 05, 2013 at 02:16:52PM +0100, demerphq wrote:

On 5 February 2013 10:26, Nicholas Clark <perlbug-followup@perl.org> wrote:

$ ./perl -Dr -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"'

Y

$ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"'

n

I consider this a documentation bug and not a feature bug. You simply
cannot put comments *anywhere*, and afaik, never could have.

I disagree. In that it should not matter whether the first symbol of my
pattern is /./ or /a/. Either both patterns should match, or both should not.
It shouldn't be the case that one is rejected by the optimiser, and one is
accepted by the engine. That's the optimiser changing behaviour, not just
run time needed.

Also, as best I can tell from the *implementation* of comments, the intent
is that you *can* put comments anywhere. The implementation of S_nextchar()
is intended to skip over comments when getting the next character of the
pattern, and as best I can eyeball from the rest of the code in regcomp.c,
the intent is to use it always as the way to get more characters.

Nicholas Clark

p5pRT · 2013-02-05T13:57:50Z

From @arc

Nicholas Clark <perlbug-followup@perl.org> wrote:

$ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"'
Compiling REx "ab(?#Comment){2}c"
rarest char b at 1
Final program:
1: CURLYM[0] {2,2} (7)
3: EXACT <ab> (5)
5: SUCCEED (0)
6: NOTHING (7)
7: EXACT <c> (9)
9: END (0)
anchored "ababc" at 0 (checking anchored isall) minlen 5

This seems very wrong — the pattern's been compiled as if the initial
"ab" is inside (?:). The same problem can be triggered with other
quantifiers:

$ ./miniperl -Dr -le 'print "abbc" =~ /ab(?#rem)*c/ ? "Y" : "n"'
Compiling REx "ab(?#rem)*c"
rarest char c at 0
synthetic stclass "ANYOF_SYNTHETIC[ac][]".
Final program:
1: CURLYM[0] {0,32767} (7)
3: EXACT <ab> (5)
5: SUCCEED (0)
6: NOTHING (7)
7: EXACT <c> (9)
9: END (0)

--
Aaron Crane ** http://aaroncrane.co.uk/

p5pRT · 2013-02-05T13:58:56Z

From @demerphq

On 5 February 2013 14:25, Nicholas Clark <nick@ccl4.org> wrote:u

On Tue, Feb 05, 2013 at 02:16:52PM +0100, demerphq wrote:

On 5 February 2013 10:26, Nicholas Clark <perlbug-followup@perl.org> wrote:

$ ./perl -Dr -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"'

Y

$ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"'

n

I consider this a documentation bug and not a feature bug. You simply
cannot put comments *anywhere*, and afaik, never could have.

I disagree. In that it should not matter whether the first symbol of my
pattern is /./ or /a/. Either both patterns should match, or both should not.
It shouldn't be the case that one is rejected by the optimiser, and one is
accepted by the engine. That's the optimiser changing behaviour, not just
run time needed.

I have just looked at the output. The cause of this behavior is not
the optimiser, it is the comment parsing feature behaving in ways that
you dont expect. See below comments on poorly thought through quick
hacks to the regex engine.

What happens is that the (?#....) is parsed out, the {2} then applies
to whatever was parsed out before. As a side effect of how the parsing
behavior works this looks to the regex engine like you had done:
/(?:ab)(?#Comment){2}c/.

Which if you remove the (?#...) looks like:

/(?:ab){2}c/

At which point the optimizer quite correctly expects this pattern to
be the same as /ababc/.

In the case with the leading dot, the parsing step which merges 'a'
and 'b' into (?:ab) does not occur, and the {2} applies to the b
alone.

Anyway, none of this is an optimizer bug, its either a doc bug, or a
bug in how quantifiers are applied the (?#...) sequences, or a bug in
how comments are parsed generally. But not an optimizer bug. The
optimizer doesn't even see the original pattern, it operates on
opcodes, and comments shouldnt modify the opcodes.

Also notice that you happen to see the optimiser kick in, which is
natural, its the first thing that might go wrong, however even if it
didn't you would still see these patterns behave differently due to
the way they are constructed. Unfortunately there is no way to turn
off all optimisations, or you would see the problem. But if you
inspect the opcode generated you will see it.

Take a look at the debug output:

$ ./perl -Ilib -Mre=Debug,ALL -le 'print "abbc" =~ /ab(?#Comment){2}c/
? "Y" : "n"'
Compiling REx "ab(?#Comment){2}c"
Starting first pass (sizing)

ab(?#Comme... | 1| reg
| | brnc
| | piec
| | atom
c< | 4| inst - CURLYX
| 7| piec
| | atom
Required size 9 nodes
Starting second pass (creation)
ab(?#Comme... | 1| reg
| | brnc
| | piec
| | atom
c< | 4| tail~ EXACT <ab> (1) -> WHILEM
| | inst - CURLYX
| 7| tail~ CURLYX[2] {0,0} (1) -> NOTHING
| | piec
| | atom
< | 9| tail~ CURLYX[0] {2,2} (1)
| | ~ NOTHING (6) -> EXACT
| 10| lsbr~ tying lastbr CURLYX[0] {2,2} (1) to
ender END (9) offset 8
| | tail~ CURLYX[0] {2,2} (1)
| | ~ NOTHING (6)
| | ~ EXACT <c> (7) -> END
first:> 3: EXACT <ab> (5)
first at 3
Peep:Pos:0/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'' 0:0/0 *Fixed:'' @ 0
Float: '' @ 0/0
Peep> 1: CURLYX[0] {2,2} (6)
Peep:Pos:0/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'' 0:0/0 *Fixed:'' @
0 Float: '' @ 0/0
Peep> 3: EXACT <ab> (5)
join> 3: EXACT <ab> (5)
Peep:Pos:2/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'ab' 2:0/0 *Fixed:''
@ 0 Float: '' @ 0/0
Peep> 5: WHILEM (0)
pre-fin:Pos:2/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'ab' 2:0/0
*Fixed:'' @ 0 Float: '' @ 0/0
post-fin:Pos:2/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'ab' 2:0/0
*Fixed:'' @ 0 Float: '' @ 0/0
Peep:Pos:4/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'abab' 4:0/0 *Fixed:''
@ 0 Float: '' @ 0/0
Peep> 6: NOTHING (7)
Peep:Pos:4/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'abab' 4:0/0 *Fixed:''
@ 0 Float: '' @ 0/0
Peep> 7: EXACT <c> (9)
join> 7: EXACT <c> (9)
pre-fin:Pos:5/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'ababc' 5:0/0
*Fixed:'' @ 0 Float: '' @ 0/0
post-fin:Pos:5/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'ababc' 5:0/0
*Fixed:'' @ 0 Float: '' @ 0/0
commit: Pos:5/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'ababc' -1:0/0
*Fixed:'ababc' @ 0 Float: '' @ 0/0
minlen: 5 r->minlen:0
Final program:
1: CURLYM[0] {2,2} (7)
3: EXACT <ab> (5)
5: SUCCEED (0)
6: NOTHING (7)
7: EXACT <c> (9)
9: END (0)
anchored "ababc" at 0 (checking anchored isall) minlen 5
r->extflags: CHECK_ALL USE_INTUIT_NOML USE_INTUIT_ML
n
Freeing REx: "ab(?#Comment){2}c"

$ ./perl -Ilib -Mre=Debug,ALL -le 'print "abbc" =~ /.b(?#Comment){2}c/
? "Y" : "n"'
Compiling REx ".b(?#Comment){2}c"
Starting first pass (sizing)

.b(?#Comme... | 1| reg
| | brnc
| | piec
| | atom
b(?#Commen... | 2| piec
| | atom
c< | 4| inst - CURLY
| 6| piec
| | atom
Required size 8 nodes
Starting second pass (creation)
.b(?#Comme... | 1| reg
| | brnc
| | piec
| | atom
b(?#Commen... | 2| piec
| | atom
c< | 4| inst - CURLY
| 6| tail~ REG_ANY (1) -> CURLY
| | piec
| | atom
< | 8| tail~ CURLY {2,2} (2) -> EXACT
| 9| lsbr~ tying lastbr REG_ANY (1) to ender END (8) offset 7
| | tail~ REG_ANY (1)
| | ~ CURLY {2,2} (2)
| | ~ EXACT <c> (6) -> END
first:> 1: REG_ANY (2)
first at 1
Peep:Pos:0/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'' 0:0/0 *Fixed:'' @ 0
Float: '' @ 0/0
Peep> 1: REG_ANY (2)
commit: Pos:0/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'' -1:0/0 *Fixed:''
@ 0 Float: '' @ 0/0
Peep:Pos:1/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'' -1:0/0 *Fixed:'' @
0 Float: '' @ 0/0
Peep> 2: CURLY {2,2} (6)
Peep:Pos:1/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'' -1:0/0 *Fixed:''
@ 0 Float: '' @ 0/0
Peep> 4: EXACT (0)
join> 4: EXACT (0)
pre-fin:Pos:2/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'b' 2:1/1
*Fixed:'' @ 0 Float: '' @ 0/0
post-fin:Pos:2/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'b' 2:1/1
*Fixed:'' @ 0 Float: '' @ 0/0
Peep:Pos:3/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'bb' 3:1/1 *Fixed:'' @
0 Float: '' @ 0/0
Peep> 6: EXACT <c> (8)
join> 6: EXACT <c> (8)
pre-fin:Pos:4/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'bbc' 4:1/1
*Fixed:'' @ 0 Float: '' @ 0/0
post-fin:Pos:4/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'bbc' 4:1/1
*Fixed:'' @ 0 Float: '' @ 0/0
commit: Pos:4/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'bbc' -1:1/1
*Fixed:'bbc' @ 1 Float: '' @ 0/0
minlen: 4 r->minlen:0
Final program:
1: REG_ANY (2)
2: CURLY {2,2} (6)
4: EXACT (0)
6: EXACT <c> (8)
8: END (0)
anchored "bbc" at 1 (checking anchored) minlen 4
r->extflags: USE_INTUIT_NOML USE_INTUIT_ML
Guessing start of match in sv for REx ".b(?#Comment){2}c" against "abbc"
Check offset min: 1 Start shift: 1 End shift 0 Real End Shift: 0
fbm_instr len=3 str=<bbc>
Found anchored substr "bbc" at offset 1...
Check offset min:1 max:1 S:1 t:0 D:-1 end:4
Guessed: match at offset 0
Matching REx ".b(?#Comment){2}c" against "abbc"
Setting an EVAL scope, savestack=2
regmatch start
0 <> <abbc> | 1:REG_ANY(2)
1 <a> <bbc> | 2:CURLY {2,2}(6)
EXACT can match 2 times out of 2...
Setting an EVAL scope, savestack=11
#0 CURLY_B_max
3 <abb> <c> | 6: EXACT <c>(8)
4 <abbc> <> | 8: END(0)
Match successful!
Y
Freeing REx: ".b(?#Comment){2}c"

Also, as best I can tell from the *implementation* of comments, the intent
is that you *can* put comments anywhere.

You didn't look hard enough then. Sorry.

The implementation of S_nextchar()
is intended to skip over comments when getting the next character of the
pattern, and as best I can eyeball from the rest of the code in regcomp.c,
the intent is to use it always as the way to get more characters.

If so its been broken for a long time. Afaik S_nextchar is only called
at the beginning or end of a parsable sequence.

This is a farily typical piece of code IMO:

case '<': /* (?<...) */
if (*RExC_parse == '!')
paren = ',';
else if (*RExC_parse != '=')
named_capture:
{ /* (?<...>) */
char *name_start;
SV *svname;
paren= '>';

No "nextchar" there.

And IMO, what /some/ of the nextchars suggest as being legal, such as
a comment in between the '(' and '?' and in (?:foo) is IMO borderline
to completely insane, and not something we should use as a precedent.

We know that regex comments were a quick hack added by Larry when
Jeffrey Friedl commented how they would be useful. I also know that
quick hacks in the regex engine, while sometimes quite possible, leave
nasty edge cases that are not so fun to deal with afterwards, and
probably, had people been aware of them at the time, blocked the idea.
(Yay, I can now say that Larry and I have something in common: adding
a feature to the regex engine without thinking it through properly and
without accounting for the edge cases!)

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2013-02-05T14:03:26Z

From @demerphq

On 5 February 2013 14:57, Aaron Crane <arc@cpan.org> wrote:

Nicholas Clark <perlbug-followup@perl.org> wrote:

$ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"'
Compiling REx "ab(?#Comment){2}c"
rarest char b at 1
Final program:
1: CURLYM[0] {2,2} (7)
3: EXACT <ab> (5)
5: SUCCEED (0)
6: NOTHING (7)
7: EXACT <c> (9)
9: END (0)
anchored "ababc" at 0 (checking anchored isall) minlen 5

This seems very wrong — the pattern's been compiled as if the initial
"ab" is inside (?:). The same problem can be triggered with other
quantifiers:

$ ./miniperl -Dr -le 'print "abbc" =~ /ab(?#rem)*c/ ? "Y" : "n"'
Compiling REx "ab(?#rem)*c"
rarest char c at 0
synthetic stclass "ANYOF_SYNTHETIC[ac][]".
Final program:
1: CURLYM[0] {0,32767} (7)
3: EXACT <ab> (5)
5: SUCCEED (0)
6: NOTHING (7)
7: EXACT <c> (9)
9: END (0)

This is byproduct of how we parse strings into EXACT nodes. When we
are parsing a char, we look ahead and check if there is a quantifier
next. If not then we merge them into a single opcode. (?#...) seems to
not absorb the quantifier, and as such it applies to whatever _opcode_
is before it, in this case the EXACT <ab>.

About the only thing I can see here likely to change is that we should
absorb quantifiers on (?#...) the rest is IMO a doc patch waiting to
happen, along with maybe the removal of some nextchars() that probably
cause trouble.

The docpatch should probably say "You cannot put a comment in between
an atom and its quantifier."

cheers,
yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2013-02-05T14:11:16Z

From @demerphq

On 5 February 2013 15:02, demerphq <demerphq@gmail.com> wrote:

On 5 February 2013 14:57, Aaron Crane <arc@cpan.org> wrote:

Nicholas Clark <perlbug-followup@perl.org> wrote:

$ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"'
Compiling REx "ab(?#Comment){2}c"
rarest char b at 1
Final program:
1: CURLYM[0] {2,2} (7)
3: EXACT <ab> (5)
5: SUCCEED (0)
6: NOTHING (7)
7: EXACT <c> (9)
9: END (0)
anchored "ababc" at 0 (checking anchored isall) minlen 5

This seems very wrong — the pattern's been compiled as if the initial
"ab" is inside (?:). The same problem can be triggered with other
quantifiers:

$ ./miniperl -Dr -le 'print "abbc" =~ /ab(?#rem)*c/ ? "Y" : "n"'
Compiling REx "ab(?#rem)*c"
rarest char c at 0
synthetic stclass "ANYOF_SYNTHETIC[ac][]".
Final program:
1: CURLYM[0] {0,32767} (7)
3: EXACT <ab> (5)
5: SUCCEED (0)
6: NOTHING (7)
7: EXACT <c> (9)
9: END (0)

This is byproduct of how we parse strings into EXACT nodes. When we
are parsing a char, we look ahead and check if there is a quantifier
next. If not then we merge them into a single opcode. (?#...) seems to
not absorb the quantifier, and as such it applies to whatever _opcode_
is before it, in this case the EXACT <ab>.

About the only thing I can see here likely to change is that we should
absorb quantifiers on (?#...) the rest is IMO a doc patch waiting to
happen, along with maybe the removal of some nextchars() that probably
cause trouble.

The docpatch should probably say "You cannot put a comment in between
an atom and its quantifier."

Also, I'd like to point out that fixing this without modifying the
docs is very likely to result in a non trivial performance hit for
everyone else.

IMO comments in insane places in regexes is not worth the price.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2013-02-05T14:53:24Z

From @khwilliamson

On 02/05/2013 07:10 AM, demerphq wrote:

IMO comments in insane places in regexes is not worth the price.

Yves

+1

p5pRT · 2013-02-06T00:02:09Z

From @rjbs

* demerphq <demerphq@gmail.com> [2013-02-05T08:58:19]

I have just looked at the output. The cause of this behavior is not
the optimiser, it is the comment parsing feature behaving in ways that
you dont expect. See below comments on poorly thought through quick
hacks to the regex engine.

Thanks, this was a really interesting message.

--
rjbs

p5pRT · 2013-02-06T00:05:11Z

From @rjbs

* demerphq <demerphq@gmail.com> [2013-02-05T09:10:42]

On 5 February 2013 15:02, demerphq <demerphq@gmail.com> wrote:

The docpatch should probably say "You cannot put a comment in between
an atom and its quantifier."

Also, I'd like to point out that fixing this without modifying the
docs is very likely to result in a non trivial performance hit for
everyone else.

IMO comments in insane places in regexes is not worth the price.

I think it's fine to document that this won't work. I wonder whether we can
issue a warning, too? On one hand, I'd like to imagine that anybody doing
something so crazy would know to consider that it might not work and that he or
she should consult the friendly manual.

On the other hand, the way in which this fails is pretty lame to allow to
happen silently. If a warning could be issued, I think it would be warranted.
(Do we have a "too-clever" category yet? :) )

--
rjbs

p5pRT · 2013-02-06T06:01:34Z

From @demerphq

On 6 February 2013 01:04, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

* demerphq <demerphq@gmail.com> [2013-02-05T09:10:42]

On 5 February 2013 15:02, demerphq <demerphq@gmail.com> wrote:

The docpatch should probably say "You cannot put a comment in between
an atom and its quantifier."

Also, I'd like to point out that fixing this without modifying the
docs is very likely to result in a non trivial performance hit for
everyone else.

IMO comments in insane places in regexes is not worth the price.

I think it's fine to document that this won't work. I wonder whether we can
issue a warning, too?

My inclination is that we should just allow quantifiers on a (?#...) block.

Then we could warn something like "quantifier on comment, maybe you
meant something else"

On one hand, I'd like to imagine that anybody doing
something so crazy would know to consider that it might not work and that he or
she should consult the friendly manual.

Unfortunately tho the existing docs on the subject are pretty sparse.
So there isn't that much to consult.

On the other hand, the way in which this fails is pretty lame to allow to
happen silently. If a warning could be issued, I think it would be warranted.
(Do we have a "too-clever" category yet? :) )

Heh. :-)

I haven't checked the code yet to see what is involved, but I imagine
that implementing the warning mentioned above should not be too
difficult.

Cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2013-02-06T08:50:12Z

From @hvds

demerphq <demerphq@gmail.com> wrote:
:On 6 February 2013 01:04, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
:> I think it's fine to document that this won't work. I wonder whether we can
:> issue a warning, too?
:
:My inclination is that we should just allow quantifiers on a (?#...) block.
:
:Then we could warn something like "quantifier on comment, maybe you
:meant something else"

It is not a useful thing to use in any context; I'd prefer just to make
it an error.

Allowing it with a warning only makes sense to me if we want to be able
to define the effect in terms of something else, like '(?#comment) is
the same as (?=(?:|comment))'; but it isn't, and we don't. :)

Hugo

p5pRT · 2013-02-06T08:53:53Z

From @demerphq

On 6 February 2013 01:01, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

* demerphq <demerphq@gmail.com> [2013-02-05T08:58:19]

I have just looked at the output. The cause of this behavior is not
the optimiser, it is the comment parsing feature behaving in ways that
you dont expect. See below comments on poorly thought through quick
hacks to the regex engine.

Thanks, this was a really interesting message.

You are welcome. Very happy to be of service. :-)

BTW, I have started investigating a fix. The offending code is as follows:

case '#': /* (?#...) */
while (*RExC_parse && *RExC_parse != ')')
RExC_parse++;
if (*RExC_parse != ')')
FAIL("Sequence (?#... not terminated");
nextchar(pRExC_state);
*flagp = TRYAGAIN;
return NULL;

The documentation for this construct is:

"(?#text)"
A comment. The text is ignored. If the "/x"
modifier enables whitespace formatting, a simple "#" will suffice.
Note that Perl
closes the comment as soon as it sees a ")", so there
is no way to put a literal ")" in the comment.

The docs do not document that you cannot put a quantifier on such a
construct, yet we know this to be the case.

Actually the behavior of this construct is quite weird. I would argue
that the following is a bug:

$ ./perl -Ilib -Mre=debug -le'my

You can't use 'macro parameter character #' in math mode

$x=")"; /(?#foo$

{x}bar)/'
Compiling REx "(?#foo${x}bar)"
Final program:
1: NOTHING (2)
2: END (0)
minlen 0
Matching REx "(?#foo${x}bar)" against ""
0 <> <> | 1:NOTHING(2)
0 <> <> | 2:END(0)
Match successful!
Freeing REx: "(?#foo${x}bar)"

I would also argue that this is a bug:

$ ./perl -Ilib -Mre=debug -le'my $x="(?#foo\0bar)"; /$x/'
Compiling REx "(?#foo%0bar)"
Final program:
1: NOTHING (2)
2: END (0)
minlen 0
Matching REx "(?#foo%0bar)" against ""
0 <> <> | 1:NOTHING(2)
0 <> <> | 2:END(0)
Match successful!
Freeing REx: "(?#foo%0bar)"

I also think that this is weird:

$ ./perl -Ilib -Mre=debug -le'my $x="(?#foo\0bar)"; /$x/'
Compiling REx "(?#foo%0bar)"
Final program:
1: NOTHING (2)
2: END (0)
minlen 0
Matching REx "(?#foo%0bar)" against ""
0 <> <> | 1:NOTHING(2)
0 <> <> | 2:END(0)
Match successful!
Freeing REx: "(?#foo%0bar)"

Anyway, this construct has all the fingerprints of "poorly thought
through regex change". The regex engine is incredibly complex,
interacts with the rest of the code in complicated ways so it is
really easy to miss edge case like I would argue this does.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2013-02-06T08:55:38Z

From @demerphq

On 6 February 2013 08:44, <hv@crypt.org> wrote:

demerphq <demerphq@gmail.com> wrote:
:On 6 February 2013 01:04, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
:> I think it's fine to document that this won't work. I wonder whether we can
:> issue a warning, too?
:
:My inclination is that we should just allow quantifiers on a (?#...) block.
:
:Then we could warn something like "quantifier on comment, maybe you
:meant something else"

It is not a useful thing to use in any context; I'd prefer just to make
it an error.

Allowing it with a warning only makes sense to me if we want to be able
to define the effect in terms of something else, like '(?#comment) is
the same as (?=(?:|comment))'; but it isn't, and we don't. :)

Thanks for the analysis, I agree. If I can figure it out at all I will
make it die instead of warn.

cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2013-02-06T09:09:18Z

From @demerphq

On 6 February 2013 09:53, demerphq <demerphq@gmail.com> wrote:

On 6 February 2013 01:01, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

* demerphq <demerphq@gmail.com> [2013-02-05T08:58:19]

I have just looked at the output. The cause of this behavior is not
the optimiser, it is the comment parsing feature behaving in ways that
you dont expect. See below comments on poorly thought through quick
hacks to the regex engine.

Thanks, this was a really interesting message.

You are welcome. Very happy to be of service. :-)

BTW, I have started investigating a fix. The offending code is as follows:
 case '\#'&#8203;: /\* $?\#\.\.\.$ \*/
 while $\*RExC\_parse && \*RExC\_parse \!= '$'\)
 RExC\_parse\+\+;
 if $\*RExC\_parse \!= '$'\)
 FAIL$"Sequence \(?\#\.\.\. not terminated"$;
 nextchar$pRExC\_state$;
 \*flagp = TRYAGAIN;
 return NULL;
The documentation for this construct is:
 "$?\#text$"
 A comment\. The text is ignored\. If the "/x"
modifier enables whitespace formatting, a simple "#" will suffice.
Note that Perl
closes the comment as soon as it sees a ")", so there
is no way to put a literal ")" in the comment.

The tests for this feature are:

t/re/re_tests:^a(?#xxx){3}c aaac y $& aaac
t/re/re_tests:'^a (?#xxx) (?#yyy) {3}c'x aaac y $& aaac

Which means someone thought that putting a comment in between the
quantifier and the item it applies to is a good idea. I personally
think they are wrong, undertested their patch, and never notice how
horribly broken various other things would be by this patch.

Btw, according to blame these tests are from Ilya and not Larry.

c277df4 t/op/re_tests (Ilya Zakharevich 1997-11-15 19:29:39
-0500 547) ^a(?#xxx){3}c aaac y $& aaac
c277df4 t/op/re_tests (Ilya Zakharevich 1997-11-15 19:29:39
-0500 548) '^a (?#xxx) (?#yyy) {3}c'x aaac y $&
aaac

I checked, and the feature was added by Larry as part of the 5.000
"mega-patch" (a0d0e21)

The docs from that commit are:

+=item (?#text)
+
+A comment. The text is ignored.

As far as I can tell there were no tests for the feature at that time.

Anyway, I do not think it is a feasible or reasonable to fix this to
be legal. And will proceed under the assumption that the tests are
wrong.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2013-02-06T09:32:04Z

From @arc

Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

* demerphq <demerphq@gmail.com> [2013-02-05T09:10:42]

On 5 February 2013 15:02, demerphq <demerphq@gmail.com> wrote:

The docpatch should probably say "You cannot put a comment in between
an atom and its quantifier."

IMO comments in insane places in regexes is not worth the price.

I think it's fine to document that this won't work.

I'm not entirely convinced. We already allow a comment between an
atom and its quantifier in /x mode:

$ perl -Mre=debug -wE '"" =~ /ab#
+c/x'
Compiling REx "ab#%n+c"
Final program:
1: EXACT <a> (3)
3: PLUS (6)
4: EXACT (0)
6: EXACT <c> (8)
8: END (0)
anchored "ab" at 0 floating "bc" at 1..2147483647 (checking floating) minlen 3
Freeing REx: "ab#%n+c"

I think that means we have three options, in the abstract:

1. We preserve this inconsistency between /x comments and (?#) comments

2. We start forbidding (whitespace and?) comments between an atom and
a quantifier in /x mode

3. We start permitting (?#) comments between an atom and its quantifier

I believe option 2 risks breaking existing code.

I'm happy to look at doing option 3, but I probably won't have chance
till the weekend.

I wonder whether we can issue a warning, too?

If we can detect the situation well enough to issue a warning,
wouldn't that imply we could instead skip past an embedded (?#)
comment when parsing? At least, that's my impression to date from
reading regcomp.c.

On one hand, I'd like to imagine that anybody doing
something so crazy would know to consider that it might not work and that he or
she should consult the friendly manual.

My initial response is to agree that it's crazy, but on reflection, I
think at least the /x case is actually useful. I've seen code
formatted like this before:

$line =~ m{
^ # beginning of line
(?: \s* \w+)+ # keywords
$ # end of line
}x

at which point it's not much of a stretch for someone to write that like this:

$line =~ m{
^ # beginning of line
(?: \s* \w+) # keyword
+ # ... one or more times
$ # end of line
}x

which has whitespace and a comment before the "one or more times" quantifier.

--
Aaron Crane ** http://aaroncrane.co.uk/

p5pRT · 2013-02-06T09:43:51Z

From Eirik-Berg.Hanssen@allverden.no

On Wed, Feb 6, 2013 at 10:08 AM, demerphq <demerphq@gmail.com> wrote:

Anyway, I do not think it is a feasible or reasonable to fix this to
be legal. And will proceed under the assumption that the tests are
wrong.

The way you describe it, and given my ignorance of the implementation,
the odd thing is it seems to work with braceless comments (and /x):

eirik@bluebird[10:38:44]~$ perl
use 5.014;
use strict;
use warnings;

say "/x:\t", "food" =~
/ fo # then
{2} # twice
d
/x ? "match" : "miss";

say "(?#):\t", "food" =~ /fo(?#then){2}(?#twice)d/ ? "match" : "miss";

say "both:\t", "food" =~
/ fo (?# then)
{2} (?# twice)
d
/x ? "match" : "miss";
__END__
/x: match
(?#): miss
both: miss
eirik@bluebird[10:38:54]~$

Couldn't these be expected – and made? – to parse identically?

Eirik

p5pRT · 2013-02-06T10:16:47Z

From @demerphq

On 6 February 2013 10:43, Eirik Berg Hanssen
<Eirik-Berg.Hanssen@allverden.no> wrote:

On Wed, Feb 6, 2013 at 10:08 AM, demerphq <demerphq@gmail.com> wrote:

Anyway, I do not think it is a feasible or reasonable to fix this to
be legal. And will proceed under the assumption that the tests are
wrong.

The way you describe it, and given my ignorance of the implementation, the
odd thing is it seems to work with braceless comments (and /x):

eirik@bluebird[10:38:44]~$ perl
use 5.014;
use strict;
use warnings;

say "/x:\t", "food" =~
/ fo # then
{2} # twice
d
/x ? "match" : "miss";

say "(?#):\t", "food" =~ /fo(?#then){2}(?#twice)d/ ? "match" : "miss";

say "both:\t", "food" =~
/ fo (?# then)
{2} (?# twice)
d
/x ? "match" : "miss";
__END__
/x: match
(?#): miss
both: miss
eirik@bluebird[10:38:54]~$

Couldn't these be expected – and made? – to parse identically?

IMO not very easily no.

You have to understand how the parser works, but basically the code we
would strip off the # style comments is the same code that handles
EXACT so it is easy to handle there.

However the code that handles (?#...) is the same code that handles
(?:...) and any other such compound structure. Each of those
expressions is treated an independent sub expression. The code makes
many assumptions that when it enters the code that parses an
independent subexpression that anything before it is, well,
independent. :-) And this whole thing with quantifiers breaks that
assumption, with predictable results.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2013-02-06T10:35:34Z

From @demerphq

On 6 February 2013 10:31, Aaron Crane <arc@cpan.org> wrote:

Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

* demerphq <demerphq@gmail.com> [2013-02-05T09:10:42]

On 5 February 2013 15:02, demerphq <demerphq@gmail.com> wrote:

The docpatch should probably say "You cannot put a comment in between
an atom and its quantifier."

IMO comments in insane places in regexes is not worth the price.

I think it's fine to document that this won't work.

I'm not entirely convinced. We already allow a comment between an
atom and its quantifier in /x mode:

Yep. But that is handled by the same code that handles EXACT.

See the mail I just sent to Eirik for more details but also take a
look at the commentary at the bottom of this mail.

$ perl -Mre=debug -wE '"" =~ /ab#
+c/x'
Compiling REx "ab#%n+c"
Final program:
1: EXACT <a> (3)
3: PLUS (6)
4: EXACT (0)
6: EXACT <c> (8)
8: END (0)
anchored "ab" at 0 floating "bc" at 1..2147483647 (checking floating) minlen 3
Freeing REx: "ab#%n+c"

I think that means we have three options, in the abstract:

1. We preserve this inconsistency between /x comments and (?#) comments

2. We start forbidding (whitespace and?) comments between an atom and
a quantifier in /x mode

3. We start permitting (?#) comments between an atom and its quantifier

I believe option 2 risks breaking existing code.

I'm happy to look at doing option 3, but I probably won't have chance
till the weekend.

I dont believe that 3 is a good idea, 2 is unnecessary. I think that 1
is a good idea when combined with the following:

Basically the fix I want to make is: treat a quantifier following a
(?#...) as applying to that construct, and DIE when we encounter it.

IMO the only thing this might break is broken code already, and we
will break it in a way that prevents data loss or mismatches.

I wonder whether we can issue a warning, too?

If we can detect the situation well enough to issue a warning,
wouldn't that imply we could instead skip past an embedded (?#)
comment when parsing? At least, that's my impression to date from
reading regcomp.c.

Nope I dont think so. And I think the code to do so would be horrible,
or inefficient, or both. :-)

On one hand, I'd like to imagine that anybody doing
something so crazy would know to consider that it might not work and that he or
she should consult the friendly manual.

My initial response is to agree that it's crazy, but on reflection, I
think at least the /x case is actually useful. I've seen code
formatted like this before:

$line =~ m{
^ # beginning of line
(?: \s* \w+)+ # keywords
$ # end of line
}x

Yeah, this is no problem

at which point it's not much of a stretch for someone to write that like this:

$line =~ m{
^ # beginning of line
(?: \s* \w+) # keyword
+ # ... one or more times
$ # end of line
}x

which has whitespace and a comment before the "one or more times" quantifier.

This is also not a problem, and I have no intention to change it.

The problem is strictly the use of (?#...) in between an atom and its
quantifier. I have outside of the code in core never seen anybody use
(?#...) at all, ever, in code. No doubt there is someone who does so,
but I bet they are very few.

On the other hand /x with # style escapes is widely used, is not a
problem and IMO should not change.

Commentary:
To illustrate the core problem here take a look at the output of this
command (trimmed for brevity). You might want to use a fixed width
font to view it. Each of the terms "reg", "brnc", "piec", "atom"
refers to different routine in the parsing of the pattern. Notice the
routine that handles the (?#...) is called "atom". The point here is
that once you enter "atom" you have terminated any previous "atom",
and once you terminate an atom any quantifier encountered is assumed
to apply to the previous *atom*. The only way I see fixing this is NOT
merging characters into an EXACT atom, however that will definitely
have performance and memory consequences - we stopped doing this
because it was a significant performance hit, and massively
complicates matching unicode. I also see it probably breaking
(?:foo)QUANTIFIER.

$ ./perl -Ilib -Mre=Debug,ALL -le 'print "abbc" =~ /.b(?#Comment){2}c/
? "Y" : "n"'
Compiling REx ".b(?#Comment){2}c"
Starting first pass (sizing)

.b(?#Comme... | 1| reg
| | brnc
| | piec
| | atom
b(?#Commen... | 2| piec
| | atom
c< | 4| inst - CURLY
| 6| piec
| | atom
Required size 8 nodes
Starting second pass (creation)
.b(?#Comme... | 1| reg
| | brnc
| | piec
| | atom
b(?#Commen... | 2| piec
| | atom
c< | 4| inst - CURLY
| 6| tail~ REG_ANY (1) -> CURLY
| | piec
| | atom
< | 8| tail~ CURLY {2,2} (2) -> EXACT
| 9| lsbr~ tying lastbr REG_ANY (1) to ender END (8) offset 7
| | tail~ REG_ANY (1)
| | ~ CURLY {2,2} (2)
| | ~ EXACT <c> (6) -> END
[....]
Final program:
1: REG_ANY (2)
2: CURLY {2,2} (6)
4: EXACT (0)
6: EXACT <c> (8)
8: END (0)
anchored "bbc" at 1 (checking anchored) minlen 4

Anyway, if you can see an easy fix then of course go for it. But I
will continue to work on the simple fix that I know will not have
nasty side-effects.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2013-02-06T11:11:32Z

From @nwc10

On Wed, Feb 06, 2013 at 11:34:50AM +0100, demerphq wrote:

On 6 February 2013 10:31, Aaron Crane <arc@cpan.org> wrote:

Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

* demerphq <demerphq@gmail.com> [2013-02-05T09:10:42]

On 5 February 2013 15:02, demerphq <demerphq@gmail.com> wrote:

The docpatch should probably say "You cannot put a comment in between
an atom and its quantifier."

IMO comments in insane places in regexes is not worth the price.

Specifically, embedded comments.
In that comments in /x mode don't seem to have the price tag.

I think it's fine to document that this won't work.

I'm not entirely convinced. We already allow a comment between an
atom and its quantifier in /x mode:

Yep. But that is handled by the same code that handles EXACT.

I think that means we have three options, in the abstract:

1. We preserve this inconsistency between /x comments and (?#) comments

2. We start forbidding (whitespace and?) comments between an atom and
a quantifier in /x mode

3. We start permitting (?#) comments between an atom and its quantifier

I believe option 2 risks breaking existing code.

I'm happy to look at doing option 3, but I probably won't have chance
till the weekend.

I dont believe that 3 is a good idea, 2 is unnecessary. I think that 1
is a good idea when combined with the following:

Basically the fix I want to make is: treat a quantifier following a
(?#...) as applying to that construct, and DIE when we encounter it.

That might be as simple as treating embedded comments the same way as
embedded pattern match modifiers:

$ perl -le '/ab(?i)+c/'
Quantifier follows nothing in regex; marked by <-- HERE in m/ab(?i)+ <-- HERE c/ at -e line 1.

(Might be a bit more tricky to get an error that says "comment" for comments
and "nothing" for embedded pattern match modifiers, but possibly it makes just
as much sense to keep the word "nothing" for comments)

IMO the only thing this might break is broken code already, and we
will break it in a way that prevents data loss or mismatches.

Yes. Right now, I feel that the most bonkers thing is that the (mis)parsing
differs depending on whether the preceding thing is multiple literals, or
a single literal. Consistently dieing on /.b(?#comment)*/ and
/ab(?#comment)*/ (etc) would be an improvement.

I wonder whether we can issue a warning, too?

If we can detect the situation well enough to issue a warning,
wouldn't that imply we could instead skip past an embedded (?#)
comment when parsing? At least, that's my impression to date from
reading regcomp.c.

Nope I dont think so. And I think the code to do so would be horrible,
or inefficient, or both. :-)

It wouldn't surprise me that it's much easier to detect the problem (after
it's too late to parse it correctly), than to detect it early enough to
parse correctly, or to fix up the parse after the event.

The problem is strictly the use of (?#...) in between an atom and its
quantifier. I have outside of the code in core never seen anybody use
(?#...) at all, ever, in code. No doubt there is someone who does so,
but I bet they are very few.

On the other hand /x with # style escapes is widely used, is not a
problem and IMO should not change.

Agree. In that I feel that it's perfectly reasonable to document that (?#)
style has limitations on where it can be used.

Anyway, if you can see an easy fix then of course go for it. But I
will continue to work on the simple fix that I know will not have
nasty side-effects.

Yes, I'm not going to object to Aaron having fun grovelling in the code
either. :-)

Nicholas Clark

p5pRT · 2013-02-06T11:27:21Z

From @demerphq

On 6 February 2013 12:10, Nicholas Clark <nick@ccl4.org> wrote:

On Wed, Feb 06, 2013 at 11:34:50AM +0100, demerphq wrote:

On 6 February 2013 10:31, Aaron Crane <arc@cpan.org> wrote:

Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

* demerphq <demerphq@gmail.com> [2013-02-05T09:10:42]

On 5 February 2013 15:02, demerphq <demerphq@gmail.com> wrote:

The docpatch should probably say "You cannot put a comment in between
an atom and its quantifier."

IMO comments in insane places in regexes is not worth the price.

Specifically, embedded comments.
In that comments in /x mode don't seem to have the price tag.

Exactly.

I think it's fine to document that this won't work.

I'm not entirely convinced. We already allow a comment between an
atom and its quantifier in /x mode:

Yep. But that is handled by the same code that handles EXACT.

I think that means we have three options, in the abstract:

1. We preserve this inconsistency between /x comments and (?#) comments

2. We start forbidding (whitespace and?) comments between an atom and
a quantifier in /x mode

3. We start permitting (?#) comments between an atom and its quantifier

I believe option 2 risks breaking existing code.

I'm happy to look at doing option 3, but I probably won't have chance
till the weekend.

I dont believe that 3 is a good idea, 2 is unnecessary. I think that 1
is a good idea when combined with the following:

Basically the fix I want to make is: treat a quantifier following a
(?#...) as applying to that construct, and DIE when we encounter it.

That might be as simple as treating embedded comments the same way as
embedded pattern match modifiers:

$ perl -le '/ab(?i)+c/'
Quantifier follows nothing in regex; marked by <-- HERE in m/ab(?i)+ <-- HERE c/ at -e line 1.

(Might be a bit more tricky to get an error that says "comment" for comments
and "nothing" for embedded pattern match modifiers, but possibly it makes just
as much sense to keep the word "nothing" for comments)

Ah good catch. I guess that is also there for things like /(|)*/

IMO the only thing this might break is broken code already, and we
will break it in a way that prevents data loss or mismatches.

Yes. Right now, I feel that the most bonkers thing is that the (mis)parsing
differs depending on whether the preceding thing is multiple literals, or
a single literal. Consistently dieing on /.b(?#comment)*/ and
/ab(?#comment)*/ (etc) would be an improvement.

Indeed. I also have a suspicion that an exhaustive search would find
other "interesting" oddities related to this.

I suspect there is deeper fu involved here that has not been remarked
on in this thread, as the whole thing with

/(?#$foo)/

where $foo is not expanded is very strange, and cant be accounted for
as far as I know in the regex engine alone.

I wonder whether we can issue a warning, too?

If we can detect the situation well enough to issue a warning,
wouldn't that imply we could instead skip past an embedded (?#)
comment when parsing? At least, that's my impression to date from
reading regcomp.c.

Nope I dont think so. And I think the code to do so would be horrible,
or inefficient, or both. :-)

It wouldn't surprise me that it's much easier to detect the problem (after
it's too late to parse it correctly), than to detect it early enough to
parse correctly, or to fix up the parse after the event.

Thanks. That is a much more succinct description of what I was trying
to say than what I managed to type out. :-)

The problem is strictly the use of (?#...) in between an atom and its
quantifier. I have outside of the code in core never seen anybody use
(?#...) at all, ever, in code. No doubt there is someone who does so,
but I bet they are very few.

On the other hand /x with # style escapes is widely used, is not a
problem and IMO should not change.

Agree. In that I feel that it's perfectly reasonable to document that (?#)
style has limitations on where it can be used.

Anyway, if you can see an easy fix then of course go for it. But I
will continue to work on the simple fix that I know will not have
nasty side-effects.

Yes, I'm not going to object to Aaron having fun grovelling in the code
either. :-)

Especially if he proves me wrong. :-) (secretly this thread is all
part of our evil master plan to get Aaron hacking on the regex engine
more... Oh wait did I type that...)

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2013-02-06T23:59:37Z

From @rjbs

* demerphq <demerphq@gmail.com> [2013-02-06T03:54:55]

On 6 February 2013 08:44, <hv@crypt.org> wrote:

Allowing it with a warning only makes sense to me if we want to be able
to define the effect in terms of something else, like '(?#comment) is
the same as (?=(?:|comment))'; but it isn't, and we don't. :)

Thanks for the analysis, I agree. If I can figure it out at all I will
make it die instead of warn.

Agreed, too, thanks Hugo.

--
rjbs

p5pRT · 2013-02-08T06:11:16Z

From @demerphq

On 7 February 2013 00:58, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

* demerphq <demerphq@gmail.com> [2013-02-06T03:54:55]

On 6 February 2013 08:44, <hv@crypt.org> wrote:

Allowing it with a warning only makes sense to me if we want to be able
to define the effect in terms of something else, like '(?#comment) is
the same as (?=(?:|comment))'; but it isn't, and we don't. :)

Thanks for the analysis, I agree. If I can figure it out at all I will
make it die instead of warn.

Agreed, too, thanks Hugo.

A patch to implement this has been pushed as:
smoke-me/regex_comment_quantifier_fix

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2013-07-10T06:43:43Z

From @tonycoz

On Thu Feb 07 22:11:16 2013, demerphq wrote:

On 7 February 2013 00:58, Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

* demerphq <demerphq@gmail.com> [2013-02-06T03:54:55]

On 6 February 2013 08:44, <hv@crypt.org> wrote:

Allowing it with a warning only makes sense to me if we want to
be able
to define the effect in terms of something else, like '(?#comment) is
the same as (?=(?:|comment))'; but it isn't, and we don't. :)

Thanks for the analysis, I agree. If I can figure it out at all I will
make it die instead of warn.

Agreed, too, thanks Hugo.

A patch to implement this has been pushed as:
smoke-me/regex_comment_quantifier_fix

Did this change smoke cleanly for you?

I don't see anything at

http://perl.develop-help.com/?b=smoke-me%2Fregex_comment_quantifier_fix

that looks like it was caused by this patch.

This branch no longer rebases cleanly on blead.

Tony

p5pRT · 2013-11-14T23:04:42Z

From @tonycoz

On Wed Feb 06 15:59:37 2013, perl.p5p@rjbs.manxome.org wrote:

* demerphq <demerphq@gmail.com> [2013-02-06T03:54:55]

On 6 February 2013 08:44, <hv@crypt.org> wrote:

Allowing it with a warning only makes sense to me if we want to be able
to define the effect in terms of something else, like '(?#comment) is
the same as (?=(?:|comment))'; but it isn't, and we don't. :)

Thanks for the analysis, I agree. If I can figure it out at all I will
make it die instead of warn.

Agreed, too, thanks Hugo.

Karl deprecated at least some of this behaviour in 000947a.

Is the intent to ban it entirely in 5.20, or wait longer?

Tony

p5pRT · 2013-11-15T00:40:41Z

From @khwilliamson

On 11/14/2013 04:04 PM, Tony Cook via RT wrote:

On Wed Feb 06 15:59:37 2013, perl.p5p@rjbs.manxome.org wrote:

* demerphq <demerphq@gmail.com> [2013-02-06T03:54:55]

On 6 February 2013 08:44, <hv@crypt.org> wrote:

Allowing it with a warning only makes sense to me if we want to be able
to define the effect in terms of something else, like '(?#comment) is
the same as (?=(?:|comment))'; but it isn't, and we don't. :)

Thanks for the analysis, I agree. If I can figure it out at all I will
make it die instead of warn.

Agreed, too, thanks Hugo.

Karl deprecated at least some of this behaviour in 000947a.

Is the intent to ban it entirely in 5.20, or wait longer?

Some releases ago we agreed to having all deprecations be at least 2
cycles, because our release dates may not mesh with bundlers like debian.

p5pRT · 2013-11-15T09:21:14Z

From @demerphq

Should I investigate a new patch?

On 10 July 2013 08:43, Tony Cook via RT <perlbug-followup@perl.org> wrote:

On Thu Feb 07 22:11:16 2013, demerphq wrote:

On 7 February 2013 00:58, Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

* demerphq <demerphq@gmail.com> [2013-02-06T03:54:55]

On 6 February 2013 08:44, <hv@crypt.org> wrote:

Allowing it with a warning only makes sense to me if we want to
be able
to define the effect in terms of something else, like '(?#comment) is
the same as (?=(?:|comment))'; but it isn't, and we don't. :)

Thanks for the analysis, I agree. If I can figure it out at all I will
make it die instead of warn.

Agreed, too, thanks Hugo.

A patch to implement this has been pushed as:
smoke-me/regex_comment_quantifier_fix

Did this change smoke cleanly for you?

I don't see anything at

http://perl.develop-help.com/?b=smoke-me%2Fregex_comment_quantifier_fix

that looks like it was caused by this patch.

This branch no longer rebases cleanly on blead.

Tony

---
via perlbug: queue: perl5 status: open
https://rt-archive.perl.org/perl5/Ticket/Display.html?id=116639

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2013-11-24T22:36:08Z

From @tonycoz

On Fri Nov 15 01:21:14 2013, demerphq wrote:

Should I investigate a new patch?

It can wait, since it won't be in until 5.22.

https://rt-archive.perl.org/perl5/Ticket/Display.html?id=116639#txn-1268303

Tony

p5pRT · 2015-03-07T02:25:45Z

From @jkeenan

On Sun Nov 24 14:36:08 2013, tonyc wrote:

On Fri Nov 15 01:21:14 2013, demerphq wrote:

Should I investigate a new patch?

It can wait, since it won't be in until 5.22.

https://rt-archive.perl.org/perl5/Ticket/Display.html?id=116639#txn-1268303

Tony

Would it be possible to get an update on the status of this ticket (as it's marked as a blocker for 5.22)?

Thank you very much.

--
James E Keenan (jkeenan@cpan.org)

p5pRT · 2015-03-09T08:43:44Z

From @demerphq

On 7 Mar 2015 03:26, "James E Keenan via RT" <perlbug-followup@perl.org>
wrote:

On Sun Nov 24 14:36:08 2013, tonyc wrote:

On Fri Nov 15 01:21:14 2013, demerphq wrote:

Should I investigate a new patch?

It can wait, since it won't be in until 5.22.

https://rt-archive.perl.org/perl5/Ticket/Display.html?id=116639#txn-1268303

Tony

Would it be possible to get an update on the status of this ticket (as
it's marked as a blocker for 5.22)?

Thank you very much.

As I recall I wrote my patch right before 5.20 was released and we were in
a code freeze.

I've complained in the past about how code freeze isnt necessary and causes
patches to be lost and this is an example, a patch that got forgotten
because of it.

We are now in another code freeze, again done suboptimally, and this change
is user visible so I guess it will have to wait till 5.26 for the fix.

We should ditch the code freeze, and simply create a new prerelease branch
instead and let people merge to blead while the prerelease branch is
finalized. Then after release we would merge the prerelease branch back to
blead. That would prevent things like this from happening, and allow people
to continue to hack blead during the code freeze.

Anyway, if ricardo wants to make an exception here its his call. I assume
he will say wait till 5.26.

p5pRT · 2015-03-10T23:08:29Z

From perl5-porters@perl.org

Yves Orton wrote:

As I recall I wrote my patch right before 5.20 was released and we were in
a code freeze.

I've complained in the past about how code freeze isnt necessary and causes
patches to be lost and this is an example, a patch that got forgotten
because of it.

We are now in another code freeze, again done suboptimally, and this change
is user visible so I guess it will have to wait till 5.26 for the fix.

We should ditch the code freeze, and simply create a new prerelease branch
instead and let people merge to blead while the prerelease branch is
finalized. Then after release we would merge the prerelease branch back to
blead. That would prevent things like this from happening, and allow people
to continue to hack blead during the code freeze.

All it takes is for someone to create a post-5.22 branch and rebase
and merge it to blead after 5.22. I did that a few years ago, but the
last couple of years I was offline during code freeze.

p5pRT · 2015-03-11T09:37:51Z

From @demerphq

On 11 March 2015 at 00:08, Father Chrysostomos <sprout@cpan.org> wrote:

Yves Orton wrote:

As I recall I wrote my patch right before 5.20 was released and we were in
a code freeze.

I've complained in the past about how code freeze isnt necessary and causes
patches to be lost and this is an example, a patch that got forgotten
because of it.

We are now in another code freeze, again done suboptimally, and this change
is user visible so I guess it will have to wait till 5.26 for the fix.

We should ditch the code freeze, and simply create a new prerelease branch
instead and let people merge to blead while the prerelease branch is
finalized. Then after release we would merge the prerelease branch back to
blead. That would prevent things like this from happening, and allow people
to continue to hack blead during the code freeze.

All it takes is for someone to create a post-5.22 branch and rebase
and merge it to blead after 5.22. I did that a few years ago, but the
last couple of years I was offline during code freeze.

IMO That is the wrong way around.

When we create a release candidate we should create the release
candidate branch.

At that point that branch is "owned" by the release manager, and only
they can commit to it.

Anything that is required for release can be applied to either blead
or the release branch, if necessary it can be cherry picked from one
to the other.

Freezing blead just ignores the capabilities of git to merge and
cherry pick, and slows down core development for no good reason.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2015-10-11T16:58:30Z

From @khwilliamson

Fixed by 45a2623346a3b7028c030eb1dc738723b216d2dc
--
Karl Williamson

p5pRT · 2015-10-11T16:58:32Z

@khwilliamson - Status changed from 'open' to 'pending release'

p5pRT · 2015-10-13T22:42:28Z

From @khwilliamson

I may have closed this prematurely. I had not read the extensive commentary on this when I closed it, only the original report. So I had forgotten the controversy over what should happen.

To recap what has happened in blead: It turns out that no one (including me) thought about nextchr()'s behavior when the pattern is UTF-8 encoded. It did a simple ++ of the parse position, which is the wrong thing to do when the character is a multi-byte character. It would point to the 2nd byte of that, hence the tests it did after the increment for white space under /x would fail for white space that was multi-byte. When I tried to write tests after fixing that, I discovered that nothing I came up with would reliably fail. And valgrind showed that there reads outside the buffer of garbage data. That led to me fixing a bunch of nextchr calls, and that led to making all such stuff uniform. And that led to this bug being fixed.

But do we really want a (?#...) comment between a character and its quantifier? I can see both sides of the issue, so am now bringing it up to discussion again. blead is now in a state where it would be easy to add the ability to choose which places allow (?#...) and which forbid it, but allow white space and regular # comments, both only under /x. We could allow (?#...) only under /x in such cases if we choose. It's easy to change it to do any of this, and I'm willing to do the work, once a decision has been made as to what to do.

My only stance on this is that I think (but am convince-able the other way) that under /x, anywhere there is a # comment, should also allow a (?#...) comment
--
Karl Williamson

p5pRT · 2015-10-14T08:05:29Z

From @demerphq

On 14 October 2015 at 00:42, Karl Williamson via RT
<perlbug-followup@perl.org> wrote:

I may have closed this prematurely. I had not read the extensive commentary on this when I closed it, only the original report. So I had forgotten the controversy over what should happen.

To recap what has happened in blead: It turns out that no one (including me) thought about nextchr()'s behavior when the pattern is UTF-8 encoded. It did a simple ++ of the parse position, which is the wrong thing to do when the character is a multi-byte character. It would point to the 2nd byte of that, hence the tests it did after the increment for white space under /x would fail for white space that was multi-byte. When I tried to write tests after fixing that, I discovered that nothing I came up with would reliably fail. And valgrind showed that there reads outside the buffer of garbage data. That led to me fixing a bunch of nextchr calls, and that led to making all such stuff uniform. And that led to this bug being fixed.

But do we really want a (?#...) comment between a character and its quantifier?

IMO no.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2015-10-14T11:43:28Z

From @Abigail

On Tue, Oct 13, 2015 at 03:42:29PM -0700, Karl Williamson via RT wrote:

I may have closed this prematurely. I had not read the extensive
commentary on this when I closed it, only the original report. So I
had forgotten the controversy over what should happen.

To recap what has happened in blead: It turns out that no one
(including me) thought about nextchr()'s behavior when the pattern is
UTF-8 encoded. It did a simple ++ of the parse position, which is the
wrong thing to do when the character is a multi-byte character. It would
point to the 2nd byte of that, hence the tests it did after the increment
for white space under /x would fail for white space that was multi-byte.
When I tried to write tests after fixing that, I discovered that nothing
I came up with would reliably fail. And valgrind showed that there
reads outside the buffer of garbage data. That led to me fixing a
bunch of nextchr calls, and that led to making all such stuff uniform.
And that led to this bug being fixed.

But do we really want a (?#...) comment between a character and its
quantifier?

I'd vote yes. For consistency. See below.

quantifier? I can see both sides of the issue, so am now bringing it up
to discussion again. blead is now in a state where it would be easy to
add the ability to choose which places allow (?#...) and which forbid
it, but allow white space and regular # comments, both only under /x.
We could allow (?#...) only under /x in such cases if we choose.
It's easy to change it to do any of this, and I'm willing to do the work,
once a decision has been made as to what to do.

My only stance on this is that I think (but am convince-able the other
way) that under /x, anywhere there is a # comment, should also allow a
(?#...) comment

I agree. And I'd throw whitespace in it as well: anywhere where we
ignore whitespace under /x, we should allow a # comment, and hence,
should allow a (?#...) comment.

Ignorable whitespace between a character and its quantifier(s) is allowed:

$ perl -wE 'say "aa" =~ /^a + +$/x'
1
$

A comment there is also allowed:

$ perl -wE 'say "aa" =~ /^a # Foo
+ # Bar
+$/x'
1
$

If we have different rules for whitespace and (?#), it won't be easy
to document properly, and it won't be easy to learn.

Abigail

p5pRT · 2015-10-15T00:30:14Z

From @ikegami

On Wed, Oct 14, 2015 at 4:05 AM, demerphq <demerphq@gmail.com> wrote:

On 14 October 2015 at 00:42, Karl Williamson via RT
<perlbug-followup@perl.org> wrote:

I may have closed this prematurely. I had not read the extensive
commentary on this when I closed it, only the original report. So I had
forgotten the controversy over what should happen.

To recap what has happened in blead: It turns out that no one
(including me) thought about nextchr()'s behavior when the pattern is UTF-8
encoded. It did a simple ++ of the parse position, which is the wrong
thing to do when the character is a multi-byte character. It would point
to the 2nd byte of that, hence the tests it did after the increment for
white space under /x would fail for white space that was multi-byte. When
I tried to write tests after fixing that, I discovered that nothing I came
up with would reliably fail. And valgrind showed that there reads outside
the buffer of garbage data. That led to me fixing a bunch of nextchr
calls, and that led to making all such stuff uniform. And that led to this
bug being fixed.

But do we really want a (?#...) comment between a character and its
quantifier?

IMO no.

Yves

So the follow question would be: What should happen when someone does that?

p5pRT · 2015-10-15T07:15:00Z

From @demerphq

On 15 October 2015 at 02:29, Eric Brine <ikegami@adaelis.com> wrote:

On Wed, Oct 14, 2015 at 4:05 AM, demerphq <demerphq@gmail.com> wrote:

On 14 October 2015 at 00:42, Karl Williamson via RT
<perlbug-followup@perl.org> wrote:

I may have closed this prematurely. I had not read the extensive
commentary on this when I closed it, only the original report. So I had
forgotten the controversy over what should happen.

To recap what has happened in blead: It turns out that no one
(including me) thought about nextchr()'s behavior when the pattern is UTF-8
encoded. It did a simple ++ of the parse position, which is the wrong thing
to do when the character is a multi-byte character. It would point to the
2nd byte of that, hence the tests it did after the increment for white space
under /x would fail for white space that was multi-byte. When I tried to
write tests after fixing that, I discovered that nothing I came up with
would reliably fail. And valgrind showed that there reads outside the
buffer of garbage data. That led to me fixing a bunch of nextchr calls, and
that led to making all such stuff uniform. And that led to this bug being
fixed.

But do we really want a (?#...) comment between a character and its
quantifier?

IMO no.

Yves

So the follow question would be: What should happen when someone does that?

Throw an error, modifier illegal on zero width match.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2015-10-18T01:54:07Z

From @rjbs

* Abigail <abigail@abigail.be> [2015-10-14T07:43:05]

On Tue, Oct 13, 2015 at 03:42:29PM -0700, Karl Williamson via RT wrote:

My only stance on this is that I think (but am convince-able the other
way) that under /x, anywhere there is a # comment, should also allow a
(?#...) comment

I agree. And I'd throw whitespace in it as well: anywhere where we
ignore whitespace under /x, we should allow a # comment, and hence,
should allow a (?#...) comment.

I wish I'd read your testing out of this before going through and doing it
myself. :-)

Anyway, I agree. While I'd rather nobody actually write this, I think that's a
matter of style. Anyway: making it a matter of grammar would require either
weird inconsistency or breakage.

--
rjbs

p5pRT · 2015-10-18T04:17:32Z

From @khwilliamson

On 10/17/2015 07:53 PM, Ricardo Signes wrote:

* Abigail <abigail@abigail.be> [2015-10-14T07:43:05]

On Tue, Oct 13, 2015 at 03:42:29PM -0700, Karl Williamson via RT wrote:

My only stance on this is that I think (but am convince-able the other
way) that under /x, anywhere there is a # comment, should also allow a
(?#...) comment

I agree. And I'd throw whitespace in it as well: anywhere where we
ignore whitespace under /x, we should allow a # comment, and hence,
should allow a (?#...) comment.

I wish I'd read your testing out of this before going through and doing it
myself. :-)

Anyway, I agree. While I'd rather nobody actually write this, I think that's a
matter of style. Anyway: making it a matter of grammar would require either
weird inconsistency or breakage.

Just to make sure everyone understands.

Currently (?#...) comments are allowed even when there is no /x. We
probably have to support that in the places where it's been that way all
along, but we could decide to not support them in the places that I just
added, when not under /x. Thus, we could say that you can't split a
quantifier from its atom except under /x.

I don't have an opinion on this.

p5pRT · 2015-10-19T02:50:38Z

From @rjbs

* Karl Williamson <public@khwilliamson.com> [2015-10-18T00:16:44]

Just to make sure everyone understands.

Currently (?#...) comments are allowed even when there is no /x. We
probably have to support that in the places where it's been that way all
along, but we could decide to not support them in the places that I just
added, when not under /x. Thus, we could say that you can't split a
quantifier from its atom except under /x.

Thanks, I was confused.

I don't have an opinion on this.

I'm not strongly opinionated on this, but: I think that I would find it useful
to say:

If you want to put comments into a regular expression, you have two
options. You use /x and then insert space and any kind of comments between
tokens, or you can skip /x and use (?#...) between tokens.

That is: always allow (?#...) in those places where space and comments become
allowed under /x.

--
rjbs

p5pRT · 2015-10-19T09:34:18Z

From @demerphq

On 19 October 2015 at 04:50, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

* Karl Williamson <public@khwilliamson.com> [2015-10-18T00:16:44]

Just to make sure everyone understands.

Currently (?#...) comments are allowed even when there is no /x. We
probably have to support that in the places where it's been that way all
along, but we could decide to not support them in the places that I just
added, when not under /x. Thus, we could say that you can't split a
quantifier from its atom except under /x.

Thanks, I was confused.

I don't have an opinion on this.

I'm not strongly opinionated on this, but: I think that I would find it useful
to say:

If you want to put comments into a regular expression, you have two
options. You use /x and then insert space and any kind of comments between
tokens, or you can skip /x and use (?#...) between tokens.

That is: always allow (?#...) in those places where space and comments become
allowed under /x.

I really dont like this. A) it complicates the regex engine, and B) it
makes a mockery of what an expert would consider to be one token.

so for instance to *me*: a{1,10} is a single token.

a(?#whatever){1,10}

is two "tokens".

I think expecting people to grok that

/a

{1,10}/x

is the same as /a{1,10}/x

is IMO just asking for trouble and for people to misunderstand what is going on.

Related, would you expect

\(?#is this a \w match or not)w

to be the same as

\w

?

Its not, but IMO if you think that a(?#foo){1,10} should be the same
as a{1,10} then why shouldnt \(?#whatever)w?

Of course, it wouldnt work, as \( would mean that (?#whatever) is not
treated as a comment at all.

So this idea just increases inconsistency in our regexes for IMO
basically no value.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2015-10-19T14:26:46Z

From @Abigail

On Mon, Oct 19, 2015 at 11:33:55AM +0200, demerphq wrote:

On 19 October 2015 at 04:50, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

* Karl Williamson <public@khwilliamson.com> [2015-10-18T00:16:44]

Just to make sure everyone understands.

Currently (?#...) comments are allowed even when there is no /x. We
probably have to support that in the places where it's been that way all
along, but we could decide to not support them in the places that I just
added, when not under /x. Thus, we could say that you can't split a
quantifier from its atom except under /x.

Thanks, I was confused.

I don't have an opinion on this.

I'm not strongly opinionated on this, but: I think that I would find it useful
to say:

If you want to put comments into a regular expression, you have two
options. You use /x and then insert space and any kind of comments between
tokens, or you can skip /x and use (?#...) between tokens.

That is: always allow (?#...) in those places where space and comments become
allowed under /x.

I really dont like this. A) it complicates the regex engine, and B) it
makes a mockery of what an expert would consider to be one token.

so for instance to *me*: a{1,10} is a single token.

To me, it isn't.

To Perl, it isn't either, as

/a {1,10}/x

matches up to 10 a's:

$ perl -wE '"aaaaaaaaaa" =~ /a {1,10}/x; say $& // "UNDEF"'
aaaaaaaaaa
$

And it's not that /x just removes any whitespace it finds, as it
doesn't remove the whitespace inside the braces:

$ perl -wE '"aaaaaaaaaa" =~ /a { 1,10}/x; say $& // "UNDEF"'
UNDEF
$

(In this case, it tries to match the literal string "a{1,10}" -- one
can wonder whether that's the most useful thing it could do).

Compiling /a {1,10}/x shows the tokes perl think there are:

1: CURLY {1,10} (5)
3: EXACT <a> (0)
5: END (0)

a(?#whatever){1,10}

is two "tokens".

I don't see any benefit of suddenly disallowing a space (under /x)
between 'a' and '{1,10}', nor to allow space there, but not a
comment. I don't see myself easily use an inline comment there,
but I also don't think Perl should enforce a particular coding
style upon its users. That just makes the language less flexible,
and prohibits creativity.

Abigail

p5pRT · 2015-10-19T14:56:49Z

From @khwilliamson

On 10/19/2015 08:26 AM, Abigail wrote:

 $ perl \-wE '"aaaaaaaaaa" =~ /a \{ 1\,10\}/x; say $& // "UNDEF"'
 UNDEF
 $
(In this case, it tries to match the literal string "a{1,10}" -- one
can wonder whether that's the most useful thing it could do).

Note that this now give a warning, even without the '-w' command line
option:

$ blead -E '"aaaaaaaaaa" =~ /a { 1,10}/x; say $& // "UNDEF"'
Unescaped left brace in regex is deprecated, passed through in regex;
marked by <-- HERE in m/a { <-- HERE 1,10}/ at -e line 1.
UNDEF
$

p5pRT · 2015-10-19T15:07:36Z

From @Abigail

On Mon, Oct 19, 2015 at 08:33:51AM -0600, Karl Williamson wrote:

On 10/19/2015 08:26 AM, Abigail wrote:
 $ perl \-wE '"aaaaaaaaaa" =~ /a \{ 1\,10\}/x; say $& // "UNDEF"'
 UNDEF
 $
(In this case, it tries to match the literal string "a{1,10}" -- one
can wonder whether that's the most useful thing it could do).
Note that this now give a warning, even without the '-w' command line
option:
$ blead \-E '"aaaaaaaaaa" =~ /a \{ 1\,10\}/x; say $& // "UNDEF"'
Unescaped left brace in regex is deprecated\, passed through in regex; 
marked by <-- HERE in m/a { <-- HERE 1,10}/ at -e line 1.
UNDEF
$

Yes, I know.

Is that useful? I'd wager that of all the people who write

/a { 1,10}/x

more than 99.9% of want it to behave like

/a {1,10}/x

and none of them actually want to match the literal string

"a{1,10}"

Abigail (who has put a space after the , in /x{n,m}/ way too often).

p5pRT · 2015-10-19T15:19:09Z

From @ikegami

On Mon, Oct 19, 2015 at 5:33 AM, demerphq <demerphq@gmail.com> wrote:

On 19 October 2015 at 04:50, Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

* Karl Williamson <public@khwilliamson.com> [2015-10-18T00:16:44]

Just to make sure everyone understands.

Currently (?#...) comments are allowed even when there is no /x. We
probably have to support that in the places where it's been that way all
along, but we could decide to not support them in the places that I just
added, when not under /x. Thus, we could say that you can't split a
quantifier from its atom except under /x.

Thanks, I was confused.

I don't have an opinion on this.

I'm not strongly opinionated on this, but: I think that I would find it
useful
to say:

If you want to put comments into a regular expression, you have two
options. You use /x and then insert space and any kind of comments
between
tokens, or you can skip /x and use (?#...) between tokens.

That is: always allow (?#...) in those places where space and comments
become
allowed under /x.

I really dont like this. A) it complicates the regex engine, and B) it
makes a mockery of what an expert would consider to be one token.

so for instance to *me*: a{1,10} is a single token.

Whether whitespace is allowed within or not, exports would not consider
that a token. Tokens are indivisible, yet a number of parts of "a{1,10}"
are optional. That pattern has 6 tokens ("a", "{", "1", ",", "10", "}").

a(?#whatever){1,10} is two "tokens".

Still the same 6 tokens.

p5pRT · 2015-10-19T15:24:29Z

From @khwilliamson

On 10/19/2015 09:07 AM, Abigail wrote:

On Mon, Oct 19, 2015 at 08:33:51AM -0600, Karl Williamson wrote:
On 10/19/2015 08:26 AM, Abigail wrote:
 $ perl \-wE '"aaaaaaaaaa" =~ /a \{ 1\,10\}/x; say $& // "UNDEF"'
 UNDEF
 $
(In this case, it tries to match the literal string "a{1,10}" -- one
can wonder whether that's the most useful thing it could do).
Note that this now give a warning, even without the '-w' command line
option:
$ blead \-E '"aaaaaaaaaa" =~ /a \{ 1\,10\}/x; say $& // "UNDEF"'
Unescaped left brace in regex is deprecated\, passed through in regex;
marked by <-- HERE in m/a { <-- HERE 1,10}/ at -e line 1.
UNDEF
$
Yes, I know.

Is that useful? I'd wager that of all the people who write
 /a \{ 1\,10\}/x
more than 99.9% of want it to behave like
 /a \{1\,10\}/x
and none of them actually want to match the literal string
"a\{1\,10\}"
Abigail (who has put a space after the , in /x{n,m}/ way too often).

I think the warning is useful, so they aren't silently deceived. It was
never useful to behave as it has, but now they are warned.

In 5.24, this will not compile. In 5.26, it will behave as you and I
both think would be the useful way.

p5pRT · 2015-10-19T15:30:21Z

From @Abigail

On Mon, Oct 19, 2015 at 09:16:43AM -0600, Karl Williamson wrote:

On 10/19/2015 09:07 AM, Abigail wrote:
On Mon, Oct 19, 2015 at 08:33:51AM -0600, Karl Williamson wrote:
On 10/19/2015 08:26 AM, Abigail wrote:
 $ perl \-wE '"aaaaaaaaaa" =~ /a \{ 1\,10\}/x; say $& // "UNDEF"'
 UNDEF
 $
(In this case, it tries to match the literal string "a{1,10}" -- one
can wonder whether that's the most useful thing it could do).
Note that this now give a warning, even without the '-w' command line
option:
$ blead \-E '"aaaaaaaaaa" =~ /a \{ 1\,10\}/x; say $& // "UNDEF"'
Unescaped left brace in regex is deprecated\, passed through in regex;
marked by <-- HERE in m/a { <-- HERE 1,10}/ at -e line 1.
UNDEF
$
Yes, I know.

Is that useful? I'd wager that of all the people who write
 /a \{ 1\,10\}/x
more than 99.9% of want it to behave like
 /a \{1\,10\}/x
and none of them actually want to match the literal string
"a\{1\,10\}"
Abigail (who has put a space after the , in /x{n,m}/ way too often).
I think the warning is useful, so they aren't silently deceived. It was
never useful to behave as it has, but now they are warned.

In 5.24, this will not compile. In 5.26, it will behave as you and I
both think would be the useful way.

Excellent.

Abigail

p5pRT · 2016-05-13T04:02:15Z

From @khwilliamson

Thank you for submitting this report. You have helped make Perl better.

With the release of Perl 5.24.0 on May 9, 2016, this and 149 other issues have been resolved.

Perl 5.24.0 may be downloaded via https://metacpan.org/release/RJBS/perl-5.24.0

p5pRT · 2016-05-13T04:02:15Z

@khwilliamson - Status changed from 'pending release' to 'resolved'

p5pRT closed this as completed May 13, 2016

p5pRT added Severity Low distro-freebsd hasPatch type-core labels Oct 19, 2019

regex optimiser wrongly rejects certain matches involving embedded comments #12755

regex optimiser wrongly rejects certain matches involving embedded comments #12755

Comments

p5pRT commented Feb 5, 2013

p5pRT commented Feb 5, 2013

From @nwc10

Created by @nwc10

p5pRT commented Feb 5, 2013

From @nwc10

p5pRT commented Feb 5, 2013

From @demerphq

p5pRT commented Feb 5, 2013

p5pRT commented Feb 5, 2013

From @demerphq

p5pRT commented Feb 5, 2013

From @nwc10

p5pRT commented Feb 5, 2013

From @arc

p5pRT commented Feb 5, 2013

From @demerphq

p5pRT commented Feb 5, 2013

From @demerphq

p5pRT commented Feb 5, 2013

From @demerphq

p5pRT commented Feb 5, 2013

From @khwilliamson

p5pRT commented Feb 6, 2013

From @rjbs

p5pRT commented Feb 6, 2013

From @rjbs

p5pRT commented Feb 6, 2013

From @demerphq

p5pRT commented Feb 6, 2013

From @hvds

p5pRT commented Feb 6, 2013

From @demerphq

p5pRT commented Feb 6, 2013

From @demerphq

p5pRT commented Feb 6, 2013

From @demerphq

p5pRT commented Feb 6, 2013

From @arc

p5pRT commented Feb 6, 2013

From Eirik-Berg.Hanssen@allverden.no

p5pRT commented Feb 6, 2013

From @demerphq

p5pRT commented Feb 6, 2013

From @demerphq

p5pRT commented Feb 6, 2013

From @nwc10

p5pRT commented Feb 6, 2013

From @demerphq

p5pRT commented Feb 6, 2013

From @rjbs

p5pRT commented Feb 8, 2013

From @demerphq

p5pRT commented Jul 10, 2013

From @tonycoz

p5pRT commented Nov 14, 2013

From @tonycoz

p5pRT commented Nov 15, 2013

From @khwilliamson

p5pRT commented Nov 15, 2013

From @demerphq

p5pRT commented Nov 24, 2013

From @tonycoz

p5pRT commented Mar 7, 2015

From @jkeenan

p5pRT commented Mar 9, 2015

From @demerphq

p5pRT commented Mar 10, 2015

From perl5-porters@perl.org

p5pRT commented Mar 11, 2015

From @demerphq

p5pRT commented Oct 11, 2015

From @khwilliamson

p5pRT commented Oct 11, 2015

p5pRT commented Oct 13, 2015

From @khwilliamson

p5pRT commented Oct 14, 2015