Skip Menu |

Subject: regex optimiser wrongly rejects certain matches involving embedded comments
Date: Tue, 05 Feb 2013 09:26:06 +0000
To: perlbug [...] perl.org
From: nick [...] ccl4.org
Download (untitled) / with headers
text/plain 4.9k
This is a bug report for perl from nick@ccl4.org, generated with the help of perlbug 1.39 running under perl 5.16.2. ----------------------------------------------------------------- [Please describe your issue here] It looks like the regex optimiser is wrongly concluding "can never match" for certain patterns involving embedded comments. Pathological patterns that put a embedded comment between an atom and its quantifier :-) Not a regression - has been present since embedded comments were added in 5.001: $ ~/Sandpit/5001/bin/perl -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"' Y $ ~/Sandpit/5001/bin/perl -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"' n Still present in blead. The smoking gun is the debugging output "String shorter than min possible regex match" Compare the pattern /.b(?#Comment){2}c/ which enters the engine and matches: $ ./perl -Dr -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"' Compiling REx ".b(?#Comment){2}c" rarest char b at 0 Final program: 1: REG_ANY (2) 2: CURLY {2,2} (6) 4: EXACT <b> (0) 6: EXACT <c> (8) 8: END (0) anchored "bbc" at 1 (checking anchored) minlen 4 Enabling $` $& $' support (0x7). EXECUTING... Guessing start of match in sv for REx ".b(?#Comment){2}c" against "abbc" Found anchored substr "bbc" at offset 1... Guessed: match at offset 0 Matching REx ".b(?#Comment){2}c" against "abbc" 0 <> <abbc> | 1:REG_ANY(2) 1 <a> <bbc> | 2:CURLY {2,2}(6) EXACT <b> can match 2 times out of 2... 3 <abb> <c> | 6: EXACT <c>(8) 4 <abbc> <> | 8: END(0) Match successful! Y Freeing REx: ".b(?#Comment){2}c" with /ab(?#Comment){2}c/ which is (wrongly) rejected by the optimiser, and hence never enters the engine: $ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"' Compiling REx "ab(?#Comment){2}c" rarest char b at 1 Final program: 1: CURLYM[0] {2,2} (7) 3: EXACT <ab> (5) 5: SUCCEED (0) 6: NOTHING (7) 7: EXACT <c> (9) 9: END (0) anchored "ababc" at 0 (checking anchored isall) minlen 5 Enabling $` $& $' support (0x7). EXECUTING... String shorter than min possible regex match n Freeing REx: "ab(?#Comment){2}c" Note, multiline comments, using /x, are not affected by this bug: $ ./perl -le 'print "abbc" =~ /ab#Comment' -e '{2}c/x ? "Y" : "n"' Y Nicholas Clark [Please do not change anything below this line] ----------------------------------------------------------------- --- Flags: category=core severity=low --- Site configuration information for perl 5.16.2: Configured by nick at Fri Dec 21 07:57:09 GMT 2012. Summary of my perl5 (revision 5 version 16 subversion 2) configuration: Commit id: db7f29e888f2e6d75a806a11ebc6caa6acd84577 Platform: osname=freebsd, osvers=7.0-stable, archname=i386-freebsd uname='freebsd plum.flirble.org 7.0-stable freebsd 7.0-stable #7: sat jul 26 20:39:26 bst 2008 root@plum.flirble.org:usrobjusrsrcsysplum i386 ' config_args='-des -Dprefix=/home/nick/Sandpit/5162' hint=recommended, useposix=true, d_sigaction=define useithreads=undef, usemultiplicity=undef useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=undef, use64bitall=undef, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='cc', ccflags ='-DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include', optimize='-O', cppflags='-DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include' ccversion='', gccversion='4.2.1 20070719 [FreeBSD]', gccosandvers='' intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12 ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=4, prototype=define Linker and Libraries: ld='cc', ldflags ='-Wl,-E -fstack-protector -L/usr/local/lib' libpth=/usr/lib /usr/local/lib libs=-lgdbm -lm -lcrypt -lutil -lc perllibs=-lm -lcrypt -lutil -lc libc=, so=so, useshrplib=false, libperl=libperl.a gnulibc_version='' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' ' cccdlflags='-DPIC -fPIC', lddlflags='-shared -L/usr/local/lib -fstack-protector' Locally applied patches: --- @INC for perl 5.16.2: /home/nick/Sandpit/5162/lib/perl5/site_perl/5.16.2/i386-freebsd /home/nick/Sandpit/5162/lib/perl5/site_perl/5.16.2 /home/nick/Sandpit/5162/lib/perl5/5.16.2/i386-freebsd /home/nick/Sandpit/5162/lib/perl5/5.16.2 . --- Environment for perl 5.16.2: HOME=/home/nick LANG (unset) LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/nick/bin:/opt/local/bin:/opt/local/sbin:/usr/flirble/admin/bin:/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin:/home/nick/bin:/usr/local/sbin:/sbin:/usr/sbin PERL_BADLANG (unset) SHELL=/usr/local/bin/bash
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Tue, 5 Feb 2013 09:59:29 +0000
To: perl5-porters [...] perl.org
From: Nicholas Clark <nick [...] ccl4.org>
Download (untitled) / with headers
text/plain 1001b
On Tue, Feb 05, 2013 at 01:26:30AM -0800, Nicholas Clark wrote: Show quoted text
> It looks like the regex optimiser is wrongly concluding "can never match" > for certain patterns involving embedded comments.
Show quoted text
> String shorter than min possible regex match > n > Freeing REx: "ab(?#Comment){2}c"
OK, seems that it's more ugly, and affects more things. Or maybe this is a different bug: $ ./perl -Dr -le 'print "abbc" =~ /a(?{"b"}){2}c/ ? "Y" : "n"' Compiling REx "a(?{%"b%"}){2}c" rarest char c at 1 Final program: 1: EXACT <a> (3) 3: CURLYX[0] {2,2} (8) 5: EVAL (7) 7: WHILEM (0) 8: NOTHING (9) 9: EXACT <c> (11) 11: END (0) anchored "ac" at 0 (checking anchored) minlen 2 with eval Enabling $` $& $' support (0x7). EXECUTING... Guessing start of match in sv for REx "a(?{%"b%"}){2}c" against "abbc" Did not find anchored substr "ac"... Match rejected by optimizer n Freeing REx: "a(?{%"b%"}){2}c" There is no fixed string "ac" in the pattern. Bad optimiser, no cookie. Nicholas Clark
CC: perl5-porters [...] perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Tue, 5 Feb 2013 14:15:05 +0100
To: Nicholas Clark <nick [...] ccl4.org>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 1.5k
On 5 February 2013 10:59, Nicholas Clark <nick@ccl4.org> wrote: Show quoted text
> On Tue, Feb 05, 2013 at 01:26:30AM -0800, Nicholas Clark wrote: >
>> It looks like the regex optimiser is wrongly concluding "can never match" >> for certain patterns involving embedded comments.
>
>> String shorter than min possible regex match >> n >> Freeing REx: "ab(?#Comment){2}c"
> > OK, seems that it's more ugly, and affects more things. Or maybe this is a > different bug: > > $ ./perl -Dr -le 'print "abbc" =~ /a(?{"b"}){2}c/ ? "Y" : "n"' > Compiling REx "a(?{%"b%"}){2}c" > rarest char c at 1 > Final program: > 1: EXACT <a> (3) > 3: CURLYX[0] {2,2} (8) > 5: EVAL (7) > 7: WHILEM (0) > 8: NOTHING (9) > 9: EXACT <c> (11) > 11: END (0) > anchored "ac" at 0 (checking anchored) minlen 2 with eval > Enabling $` $& $' support (0x7). > > EXECUTING... > > Guessing start of match in sv for REx "a(?{%"b%"}){2}c" against "abbc" > Did not find anchored substr "ac"... > Match rejected by optimizer > n > Freeing REx: "a(?{%"b%"}){2}c" > > > There is no fixed string "ac" in the pattern. Bad optimiser, no cookie.
Bad programmer misread manual, no cookie. :-) The (?{...}) is a zero width pattern. It always matches, and the return is discarded. So the pattern is identical to: /ac/ And is correctly rejected by the optimizer. Perhaps you meant (??{ ... }) which is quite different. Also, you must be testing on an old perl no? I dont recall seeing that rarest char output in a long time.... Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Tue, 5 Feb 2013 14:16:52 +0100
To: perl5-porters [...] perl.org
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 2.9k
On 5 February 2013 10:26, Nicholas Clark <perlbug-followup@perl.org> wrote: Show quoted text
> # New Ticket Created by Nicholas Clark > # Please include the string: [perl #116639] > # in the subject line of all future correspondence about this issue. > # <URL: https://rt.perl.org:443/rt3/Ticket/Display.html?id=116639 > > > > > This is a bug report for perl from nick@ccl4.org, > generated with the help of perlbug 1.39 running under perl 5.16.2. > > > ----------------------------------------------------------------- > [Please describe your issue here] > > It looks like the regex optimiser is wrongly concluding "can never match" > for certain patterns involving embedded comments. > > Pathological patterns that put a embedded comment between an atom and its > quantifier :-) > > Not a regression - has been present since embedded comments were added in > 5.001: > > $ ~/Sandpit/5001/bin/perl -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"' > Y > $ ~/Sandpit/5001/bin/perl -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"' > n > > > Still present in blead. The smoking gun is the debugging output > "String shorter than min possible regex match" > > > > Compare the pattern /.b(?#Comment){2}c/ which enters the engine and matches: > > $ ./perl -Dr -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"' > Compiling REx ".b(?#Comment){2}c" > rarest char b at 0 > Final program: > 1: REG_ANY (2) > 2: CURLY {2,2} (6) > 4: EXACT <b> (0) > 6: EXACT <c> (8) > 8: END (0) > anchored "bbc" at 1 (checking anchored) minlen 4 > Enabling $` $& $' support (0x7). > > EXECUTING... > > Guessing start of match in sv for REx ".b(?#Comment){2}c" against "abbc" > Found anchored substr "bbc" at offset 1... > Guessed: match at offset 0 > Matching REx ".b(?#Comment){2}c" against "abbc" > 0 <> <abbc> | 1:REG_ANY(2) > 1 <a> <bbc> | 2:CURLY {2,2}(6) > EXACT <b> can match 2 times out of 2... > 3 <abb> <c> | 6: EXACT <c>(8) > 4 <abbc> <> | 8: END(0) > Match successful! > Y > Freeing REx: ".b(?#Comment){2}c" > > > with /ab(?#Comment){2}c/ which is (wrongly) rejected by the optimiser, and > hence never enters the engine: > > $ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"' > Compiling REx "ab(?#Comment){2}c" > rarest char b at 1 > Final program: > 1: CURLYM[0] {2,2} (7) > 3: EXACT <ab> (5) > 5: SUCCEED (0) > 6: NOTHING (7) > 7: EXACT <c> (9) > 9: END (0) > anchored "ababc" at 0 (checking anchored isall) minlen 5 > Enabling $` $& $' support (0x7). > > EXECUTING... > > String shorter than min possible regex match > n > Freeing REx: "ab(?#Comment){2}c" > > > Note, multiline comments, using /x, are not affected by this bug: > > $ ./perl -le 'print "abbc" =~ /ab#Comment' -e '{2}c/x ? "Y" : "n"' > Y
I consider this a documentation bug and not a feature bug. You simply cannot put comments *anywhere*, and afaik, never could have. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: perl5-porters [...] perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Tue, 5 Feb 2013 13:25:36 +0000
To: demerphq <demerphq [...] gmail.com>
From: Nicholas Clark <nick [...] ccl4.org>
Download (untitled) / with headers
text/plain 1.1k
On Tue, Feb 05, 2013 at 02:16:52PM +0100, demerphq wrote: Show quoted text
> On 5 February 2013 10:26, Nicholas Clark <perlbug-followup@perl.org> wrote:
Show quoted text
> > $ ./perl -Dr -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"'
Show quoted text
> > Y
Show quoted text
> > $ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"'
Show quoted text
> > n
Show quoted text
> I consider this a documentation bug and not a feature bug. You simply > cannot put comments *anywhere*, and afaik, never could have.
I disagree. In that it should not matter whether the first symbol of my pattern is /./ or /a/. Either both patterns should match, or both should not. It shouldn't be the case that one is rejected by the optimiser, and one is accepted by the engine. That's the optimiser changing behaviour, not just run time needed. Also, as best I can tell from the *implementation* of comments, the intent is that you *can* put comments anywhere. The implementation of S_nextchar() is intended to skip over comments when getting the next character of the pattern, and as best I can eyeball from the rest of the code in regcomp.c, the intent is to use it always as the way to get more characters. Nicholas Clark
CC: bugs-bitbucket [...] rt.perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Tue, 5 Feb 2013 13:57:00 +0000
To: perl5-porters [...] perl.org
From: Aaron Crane <arc [...] cpan.org>
Download (untitled) / with headers
text/plain 880b
Nicholas Clark <perlbug-followup@perl.org> wrote: Show quoted text
> $ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"' > Compiling REx "ab(?#Comment){2}c" > rarest char b at 1 > Final program: > 1: CURLYM[0] {2,2} (7) > 3: EXACT <ab> (5) > 5: SUCCEED (0) > 6: NOTHING (7) > 7: EXACT <c> (9) > 9: END (0) > anchored "ababc" at 0 (checking anchored isall) minlen 5
This seems very wrong — the pattern's been compiled as if the initial "ab" is inside (?:). The same problem can be triggered with other quantifiers: $ ./miniperl -Dr -le 'print "abbc" =~ /ab(?#rem)*c/ ? "Y" : "n"' Compiling REx "ab(?#rem)*c" rarest char c at 0 synthetic stclass "ANYOF_SYNTHETIC[ac][]". Final program: 1: CURLYM[0] {0,32767} (7) 3: EXACT <ab> (5) 5: SUCCEED (0) 6: NOTHING (7) 7: EXACT <c> (9) 9: END (0) -- Aaron Crane ** http://aaroncrane.co.uk/
CC: perl5-porters [...] perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Tue, 5 Feb 2013 14:58:19 +0100
To: Nicholas Clark <nick [...] ccl4.org>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 9.7k
On 5 February 2013 14:25, Nicholas Clark <nick@ccl4.org> wrote:u Show quoted text
> On Tue, Feb 05, 2013 at 02:16:52PM +0100, demerphq wrote:
>> On 5 February 2013 10:26, Nicholas Clark <perlbug-followup@perl.org> wrote:
>
>> > $ ./perl -Dr -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"'
>
>> > Y
>
>> > $ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"'
>
>> > n
>
>> I consider this a documentation bug and not a feature bug. You simply >> cannot put comments *anywhere*, and afaik, never could have.
> > I disagree. In that it should not matter whether the first symbol of my > pattern is /./ or /a/. Either both patterns should match, or both should not. > It shouldn't be the case that one is rejected by the optimiser, and one is > accepted by the engine. That's the optimiser changing behaviour, not just > run time needed.
I have just looked at the output. The cause of this behavior is not the optimiser, it is the comment parsing feature behaving in ways that you dont expect. See below comments on poorly thought through quick hacks to the regex engine. What happens is that the (?#....) is parsed out, the {2} then applies to whatever was parsed out before. As a side effect of how the parsing behavior works this looks to the regex engine like you had done: /(?:ab)(?#Comment){2}c/. Which if you remove the (?#...) looks like: /(?:ab){2}c/ At which point the optimizer quite correctly expects this pattern to be the same as /ababc/. In the case with the leading dot, the parsing step which merges 'a' and 'b' into (?:ab) does not occur, and the {2} applies to the b alone. Anyway, none of this is an optimizer bug, its either a doc bug, or a bug in how quantifiers are applied the (?#...) sequences, or a bug in how comments are parsed generally. But not an optimizer bug. The optimizer doesn't even see the original pattern, it operates on opcodes, and comments shouldnt modify the opcodes. Also notice that you happen to see the optimiser kick in, which is natural, its the first thing that might go wrong, however even if it didn't you would still see these patterns behave differently due to the way they are constructed. Unfortunately there is no way to turn off all optimisations, or you would see the problem. But if you inspect the opcode generated you will see it. Take a look at the debug output: $ ./perl -Ilib -Mre=Debug,ALL -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"' Compiling REx "ab(?#Comment){2}c" Starting first pass (sizing) Show quoted text
>ab(?#Comme... | 1| reg
| | brnc | | piec | | atom Show quoted text
>c< | 4| inst - CURLYX
| 7| piec | | atom Required size 9 nodes Starting second pass (creation) Show quoted text
>ab(?#Comme... | 1| reg
| | brnc | | piec | | atom Show quoted text
>c< | 4| tail~ EXACT <ab> (1) -> WHILEM
| | inst - CURLYX | 7| tail~ CURLYX[2] {0,0} (1) -> NOTHING | | piec | | atom Show quoted text
>< | 9| tail~ CURLYX[0] {2,2} (1)
| | ~ NOTHING (6) -> EXACT | 10| lsbr~ tying lastbr CURLYX[0] {2,2} (1) to ender END (9) offset 8 | | tail~ CURLYX[0] {2,2} (1) | | ~ NOTHING (6) | | ~ EXACT <c> (7) -> END first:> 3: EXACT <ab> (5) first at 3 Peep:Pos:0/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'' 0:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Show quoted text
Peep> 1: CURLYX[0] {2,2} (6)
Peep:Pos:0/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'' 0:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Show quoted text
Peep> 3: EXACT <ab> (5)
Show quoted text
join> 3: EXACT <ab> (5)
Peep:Pos:2/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'ab' 2:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Show quoted text
Peep> 5: WHILEM (0)
pre-fin:Pos:2/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'ab' 2:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 post-fin:Pos:2/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'ab' 2:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Peep:Pos:4/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'abab' 4:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Show quoted text
Peep> 6: NOTHING (7)
Peep:Pos:4/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'abab' 4:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Show quoted text
Peep> 7: EXACT <c> (9)
Show quoted text
join> 7: EXACT <c> (9)
pre-fin:Pos:5/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'ababc' 5:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 post-fin:Pos:5/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'ababc' 5:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 commit: Pos:5/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'ababc' -1:0/0 *Fixed:'ababc' @ 0 Float: '' @ 0/0 minlen: 5 r->minlen:0 Final program: 1: CURLYM[0] {2,2} (7) 3: EXACT <ab> (5) 5: SUCCEED (0) 6: NOTHING (7) 7: EXACT <c> (9) 9: END (0) anchored "ababc" at 0 (checking anchored isall) minlen 5 r->extflags: CHECK_ALL USE_INTUIT_NOML USE_INTUIT_ML n Freeing REx: "ab(?#Comment){2}c" $ ./perl -Ilib -Mre=Debug,ALL -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"' Compiling REx ".b(?#Comment){2}c" Starting first pass (sizing) Show quoted text
>.b(?#Comme... | 1| reg
| | brnc | | piec | | atom Show quoted text
>b(?#Commen... | 2| piec
| | atom Show quoted text
>c< | 4| inst - CURLY
| 6| piec | | atom Required size 8 nodes Starting second pass (creation) Show quoted text
>.b(?#Comme... | 1| reg
| | brnc | | piec | | atom Show quoted text
>b(?#Commen... | 2| piec
| | atom Show quoted text
>c< | 4| inst - CURLY
| 6| tail~ REG_ANY (1) -> CURLY | | piec | | atom Show quoted text
>< | 8| tail~ CURLY {2,2} (2) -> EXACT
| 9| lsbr~ tying lastbr REG_ANY (1) to ender END (8) offset 7 | | tail~ REG_ANY (1) | | ~ CURLY {2,2} (2) | | ~ EXACT <c> (6) -> END first:> 1: REG_ANY (2) first at 1 Peep:Pos:0/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'' 0:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Show quoted text
Peep> 1: REG_ANY (2)
commit: Pos:0/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'' -1:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Peep:Pos:1/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'' -1:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Show quoted text
Peep> 2: CURLY {2,2} (6)
Peep:Pos:1/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'' -1:0/0 *Fixed:'' @ 0 Float: '' @ 0/0 Show quoted text
Peep> 4: EXACT <b> (0)
Show quoted text
join> 4: EXACT <b> (0)
pre-fin:Pos:2/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'b' 2:1/1 *Fixed:'' @ 0 Float: '' @ 0/0 post-fin:Pos:2/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'b' 2:1/1 *Fixed:'' @ 0 Float: '' @ 0/0 Peep:Pos:3/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'bb' 3:1/1 *Fixed:'' @ 0 Float: '' @ 0/0 Show quoted text
Peep> 6: EXACT <c> (8)
Show quoted text
join> 6: EXACT <c> (8)
pre-fin:Pos:4/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'bbc' 4:1/1 *Fixed:'' @ 0 Float: '' @ 0/0 post-fin:Pos:4/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'bbc' 4:1/1 *Fixed:'' @ 0 Float: '' @ 0/0 commit: Pos:4/0 Flags: 0x0 Whilem_c: 0 Lcp: 0 Last:'bbc' -1:1/1 *Fixed:'bbc' @ 1 Float: '' @ 0/0 minlen: 4 r->minlen:0 Final program: 1: REG_ANY (2) 2: CURLY {2,2} (6) 4: EXACT <b> (0) 6: EXACT <c> (8) 8: END (0) anchored "bbc" at 1 (checking anchored) minlen 4 r->extflags: USE_INTUIT_NOML USE_INTUIT_ML Guessing start of match in sv for REx ".b(?#Comment){2}c" against "abbc" Check offset min: 1 Start shift: 1 End shift 0 Real End Shift: 0 fbm_instr len=3 str=<bbc> Found anchored substr "bbc" at offset 1... Check offset min:1 max:1 S:1 t:0 D:-1 end:4 Guessed: match at offset 0 Matching REx ".b(?#Comment){2}c" against "abbc" Setting an EVAL scope, savestack=2 regmatch start 0 <> <abbc> | 1:REG_ANY(2) 1 <a> <bbc> | 2:CURLY {2,2}(6) EXACT <b> can match 2 times out of 2... Setting an EVAL scope, savestack=11 #0 CURLY_B_max 3 <abb> <c> | 6: EXACT <c>(8) 4 <abbc> <> | 8: END(0) Match successful! Y Freeing REx: ".b(?#Comment){2}c" Show quoted text
> > Also, as best I can tell from the *implementation* of comments, the intent > is that you *can* put comments anywhere.
You didn't look hard enough then. Sorry. Show quoted text
> The implementation of S_nextchar() > is intended to skip over comments when getting the next character of the > pattern, and as best I can eyeball from the rest of the code in regcomp.c, > the intent is to use it always as the way to get more characters.
If so its been broken for a long time. Afaik S_nextchar is only called at the beginning or end of a parsable sequence. This is a farily typical piece of code IMO: case '<': /* (?<...) */ if (*RExC_parse == '!') paren = ','; else if (*RExC_parse != '=') named_capture: { /* (?<...>) */ char *name_start; SV *svname; paren= '>'; No "nextchar" there. And IMO, what /some/ of the nextchars suggest as being legal, such as a comment in between the '(' and '?' and in (?:foo) is IMO borderline to completely insane, and not something we should use as a precedent. We know that regex comments were a quick hack added by Larry when Jeffrey Friedl commented how they would be useful. I also know that quick hacks in the regex engine, while sometimes quite possible, leave nasty edge cases that are not so fun to deal with afterwards, and probably, had people been aware of them at the time, blocked the idea. (Yay, I can now say that Larry and I have something in common: adding a feature to the regex engine without thinking it through properly and without accounting for the edge cases!) Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: perl5-porters [...] perl.org, bugs-bitbucket [...] rt.perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Tue, 5 Feb 2013 15:02:45 +0100
To: Aaron Crane <arc [...] cpan.org>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 1.6k
On 5 February 2013 14:57, Aaron Crane <arc@cpan.org> wrote: Show quoted text
> Nicholas Clark <perlbug-followup@perl.org> wrote:
>> $ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"' >> Compiling REx "ab(?#Comment){2}c" >> rarest char b at 1 >> Final program: >> 1: CURLYM[0] {2,2} (7) >> 3: EXACT <ab> (5) >> 5: SUCCEED (0) >> 6: NOTHING (7) >> 7: EXACT <c> (9) >> 9: END (0) >> anchored "ababc" at 0 (checking anchored isall) minlen 5
> > This seems very wrong — the pattern's been compiled as if the initial > "ab" is inside (?:). The same problem can be triggered with other > quantifiers: > > $ ./miniperl -Dr -le 'print "abbc" =~ /ab(?#rem)*c/ ? "Y" : "n"' > Compiling REx "ab(?#rem)*c" > rarest char c at 0 > synthetic stclass "ANYOF_SYNTHETIC[ac][]". > Final program: > 1: CURLYM[0] {0,32767} (7) > 3: EXACT <ab> (5) > 5: SUCCEED (0) > 6: NOTHING (7) > 7: EXACT <c> (9) > 9: END (0)
This is byproduct of how we parse strings into EXACT nodes. When we are parsing a char, we look ahead and check if there is a quantifier next. If not then we merge them into a single opcode. (?#...) seems to not absorb the quantifier, and as such it applies to whatever _opcode_ is before it, in this case the EXACT <ab>. About the only thing I can see here likely to change is that we should absorb quantifiers on (?#...) the rest is IMO a doc patch waiting to happen, along with maybe the removal of some nextchars() that probably cause trouble. The docpatch should probably say "You cannot put a comment in between an atom and its quantifier." cheers, yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: perl5-porters [...] perl.org, bugs-bitbucket [...] rt.perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Tue, 5 Feb 2013 15:10:42 +0100
To: Aaron Crane <arc [...] cpan.org>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 1.9k
On 5 February 2013 15:02, demerphq <demerphq@gmail.com> wrote: Show quoted text
> On 5 February 2013 14:57, Aaron Crane <arc@cpan.org> wrote:
>> Nicholas Clark <perlbug-followup@perl.org> wrote:
>>> $ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"' >>> Compiling REx "ab(?#Comment){2}c" >>> rarest char b at 1 >>> Final program: >>> 1: CURLYM[0] {2,2} (7) >>> 3: EXACT <ab> (5) >>> 5: SUCCEED (0) >>> 6: NOTHING (7) >>> 7: EXACT <c> (9) >>> 9: END (0) >>> anchored "ababc" at 0 (checking anchored isall) minlen 5
>> >> This seems very wrong — the pattern's been compiled as if the initial >> "ab" is inside (?:). The same problem can be triggered with other >> quantifiers: >> >> $ ./miniperl -Dr -le 'print "abbc" =~ /ab(?#rem)*c/ ? "Y" : "n"' >> Compiling REx "ab(?#rem)*c" >> rarest char c at 0 >> synthetic stclass "ANYOF_SYNTHETIC[ac][]". >> Final program: >> 1: CURLYM[0] {0,32767} (7) >> 3: EXACT <ab> (5) >> 5: SUCCEED (0) >> 6: NOTHING (7) >> 7: EXACT <c> (9) >> 9: END (0)
> > This is byproduct of how we parse strings into EXACT nodes. When we > are parsing a char, we look ahead and check if there is a quantifier > next. If not then we merge them into a single opcode. (?#...) seems to > not absorb the quantifier, and as such it applies to whatever _opcode_ > is before it, in this case the EXACT <ab>. > > About the only thing I can see here likely to change is that we should > absorb quantifiers on (?#...) the rest is IMO a doc patch waiting to > happen, along with maybe the removal of some nextchars() that probably > cause trouble. > > The docpatch should probably say "You cannot put a comment in between > an atom and its quantifier."
Also, I'd like to point out that fixing this without modifying the docs is very likely to result in a non trivial performance hit for everyone else. IMO comments in insane places in regexes is not worth the price. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Aaron Crane <arc [...] cpan.org>, perl5-porters [...] perl.org, bugs-bitbucket [...] rt.perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Tue, 05 Feb 2013 07:51:30 -0700
To: demerphq <demerphq [...] gmail.com>
From: Karl Williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 121b
On 02/05/2013 07:10 AM, demerphq wrote: Show quoted text
> IMO comments in insane places in regexes is not worth the price. > > Yves
+1
CC: Nicholas Clark <nick [...] ccl4.org>, perl5-porters [...] perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Tue, 5 Feb 2013 19:01:41 -0500
To: demerphq <demerphq [...] gmail.com>
From: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
Download (untitled) / with headers
text/plain 354b
* demerphq <demerphq@gmail.com> [2013-02-05T08:58:19] Show quoted text
> I have just looked at the output. The cause of this behavior is not > the optimiser, it is the comment parsing feature behaving in ways that > you dont expect. See below comments on poorly thought through quick > hacks to the regex engine.
Thanks, this was a really interesting message. -- rjbs
Download signature.asc
application/pgp-signature 490b

Message body not shown because it is not plain text.

Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Tue, 5 Feb 2013 19:04:39 -0500
To: perl5-porters [...] perl.org
From: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
Download (untitled) / with headers
text/plain 941b
* demerphq <demerphq@gmail.com> [2013-02-05T09:10:42] Show quoted text
> On 5 February 2013 15:02, demerphq <demerphq@gmail.com> wrote:
> > > > The docpatch should probably say "You cannot put a comment in between > > an atom and its quantifier."
> > Also, I'd like to point out that fixing this without modifying the > docs is very likely to result in a non trivial performance hit for > everyone else. > > IMO comments in insane places in regexes is not worth the price.
I think it's fine to document that this won't work. I wonder whether we can issue a warning, too? On one hand, I'd like to imagine that anybody doing something so crazy would know to consider that it might not work and that he or she should consult the friendly manual. On the other hand, the way in which this fails is pretty lame to allow to happen silently. If a warning could be issued, I think it would be warranted. (Do we have a "too-clever" category yet? :) ) -- rjbs
Download signature.asc
application/pgp-signature 490b

Message body not shown because it is not plain text.

CC: perl5-porters [...] perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Wed, 6 Feb 2013 07:01:00 +0100
To: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 1.4k
On 6 February 2013 01:04, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote: Show quoted text
> * demerphq <demerphq@gmail.com> [2013-02-05T09:10:42]
>> On 5 February 2013 15:02, demerphq <demerphq@gmail.com> wrote:
>> > >> > The docpatch should probably say "You cannot put a comment in between >> > an atom and its quantifier."
>> >> Also, I'd like to point out that fixing this without modifying the >> docs is very likely to result in a non trivial performance hit for >> everyone else. >> >> IMO comments in insane places in regexes is not worth the price.
> > I think it's fine to document that this won't work. I wonder whether we can > issue a warning, too?
My inclination is that we should just allow quantifiers on a (?#...) block. Then we could warn something like "quantifier on comment, maybe you meant something else" Show quoted text
> On one hand, I'd like to imagine that anybody doing > something so crazy would know to consider that it might not work and that he or > she should consult the friendly manual.
Unfortunately tho the existing docs on the subject are pretty sparse. So there isn't that much to consult. Show quoted text
> On the other hand, the way in which this fails is pretty lame to allow to > happen silently. If a warning could be issued, I think it would be warranted. > (Do we have a "too-clever" category yet? :) )
Heh. :-) I haven't checked the code yet to see what is involved, but I imagine that implementing the warning mentioned above should not be too difficult. Cheers, Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: perl5-porters [...] perl.org, hv [...] crypt.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Wed, 06 Feb 2013 07:44:06 +0000
To: demerphq <demerphq [...] gmail.com>
From: hv [...] crypt.org
Download (untitled) / with headers
text/plain 690b
demerphq <demerphq@gmail.com> wrote: :On 6 February 2013 01:04, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote: :> I think it's fine to document that this won't work. I wonder whether we can :> issue a warning, too? : :My inclination is that we should just allow quantifiers on a (?#...) block. : :Then we could warn something like "quantifier on comment, maybe you :meant something else" It is not a useful thing to use in any context; I'd prefer just to make it an error. Allowing it with a warning only makes sense to me if we want to be able to define the effect in terms of something else, like '(?#comment) is the same as (?=(?:|comment))'; but it isn't, and we don't. :) Hugo
CC: Nicholas Clark <nick [...] ccl4.org>, perl5-porters [...] perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Wed, 6 Feb 2013 09:53:20 +0100
To: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 2.7k
On 6 February 2013 01:01, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote: Show quoted text
> * demerphq <demerphq@gmail.com> [2013-02-05T08:58:19]
>> I have just looked at the output. The cause of this behavior is not >> the optimiser, it is the comment parsing feature behaving in ways that >> you dont expect. See below comments on poorly thought through quick >> hacks to the regex engine.
> > Thanks, this was a really interesting message.
You are welcome. Very happy to be of service. :-) BTW, I have started investigating a fix. The offending code is as follows: case '#': /* (?#...) */ while (*RExC_parse && *RExC_parse != ')') RExC_parse++; if (*RExC_parse != ')') FAIL("Sequence (?#... not terminated"); nextchar(pRExC_state); *flagp = TRYAGAIN; return NULL; The documentation for this construct is: "(?#text)" A comment. The text is ignored. If the "/x" modifier enables whitespace formatting, a simple "#" will suffice. Note that Perl closes the comment as soon as it sees a ")", so there is no way to put a literal ")" in the comment. The docs do not document that you cannot put a quantifier on such a construct, yet we know this to be the case. Actually the behavior of this construct is quite weird. I would argue that the following is a bug: $ ./perl -Ilib -Mre=debug -le'my $x=")"; /(?#foo${x}bar)/' Compiling REx "(?#foo${x}bar)" Final program: 1: NOTHING (2) 2: END (0) minlen 0 Matching REx "(?#foo${x}bar)" against "" 0 <> <> | 1:NOTHING(2) 0 <> <> | 2:END(0) Match successful! Freeing REx: "(?#foo${x}bar)" I would also argue that this is a bug: $ ./perl -Ilib -Mre=debug -le'my $x="(?#foo\0bar)"; /$x/' Compiling REx "(?#foo%0bar)" Final program: 1: NOTHING (2) 2: END (0) minlen 0 Matching REx "(?#foo%0bar)" against "" 0 <> <> | 1:NOTHING(2) 0 <> <> | 2:END(0) Match successful! Freeing REx: "(?#foo%0bar)" I also think that this is weird: $ ./perl -Ilib -Mre=debug -le'my $x="(?#foo\0bar)"; /$x/' Compiling REx "(?#foo%0bar)" Final program: 1: NOTHING (2) 2: END (0) minlen 0 Matching REx "(?#foo%0bar)" against "" 0 <> <> | 1:NOTHING(2) 0 <> <> | 2:END(0) Match successful! Freeing REx: "(?#foo%0bar)" Anyway, this construct has all the fingerprints of "poorly thought through regex change". The regex engine is incredibly complex, interacts with the rest of the code in complicated ways so it is really easy to miss edge case like I would argue this does. cheers, Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: perl5-porters [...] perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Wed, 6 Feb 2013 09:54:55 +0100
To: hv [...] crypt.org
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 930b
On 6 February 2013 08:44, <hv@crypt.org> wrote: Show quoted text
> demerphq <demerphq@gmail.com> wrote: > :On 6 February 2013 01:04, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote: > :> I think it's fine to document that this won't work. I wonder whether we can > :> issue a warning, too? > : > :My inclination is that we should just allow quantifiers on a (?#...) block. > : > :Then we could warn something like "quantifier on comment, maybe you > :meant something else" > > It is not a useful thing to use in any context; I'd prefer just to make > it an error. > > Allowing it with a warning only makes sense to me if we want to be able > to define the effect in terms of something else, like '(?#comment) is > the same as (?=(?:|comment))'; but it isn't, and we don't. :)
Thanks for the analysis, I agree. If I can figure it out at all I will make it die instead of warn. cheers, Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Nicholas Clark <nick [...] ccl4.org>, perl5-porters [...] perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Wed, 6 Feb 2013 10:08:45 +0100
To: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 2.4k
On 6 February 2013 09:53, demerphq <demerphq@gmail.com> wrote: Show quoted text
> On 6 February 2013 01:01, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>> * demerphq <demerphq@gmail.com> [2013-02-05T08:58:19]
>>> I have just looked at the output. The cause of this behavior is not >>> the optimiser, it is the comment parsing feature behaving in ways that >>> you dont expect. See below comments on poorly thought through quick >>> hacks to the regex engine.
>> >> Thanks, this was a really interesting message.
> > You are welcome. Very happy to be of service. :-) > > BTW, I have started investigating a fix. The offending code is as follows: > > case '#': /* (?#...) */ > while (*RExC_parse && *RExC_parse != ')') > RExC_parse++; > if (*RExC_parse != ')') > FAIL("Sequence (?#... not terminated"); > nextchar(pRExC_state); > *flagp = TRYAGAIN; > return NULL; > > The documentation for this construct is: > > "(?#text)" > A comment. The text is ignored. If the "/x" > modifier enables whitespace formatting, a simple "#" will suffice. > Note that Perl > closes the comment as soon as it sees a ")", so there > is no way to put a literal ")" in the comment. >
The tests for this feature are: t/re/re_tests:^a(?#xxx){3}c aaac y $& aaac t/re/re_tests:'^a (?#xxx) (?#yyy) {3}c'x aaac y $& aaac Which means someone thought that putting a comment in between the quantifier and the item it applies to is a good idea. I personally think they are wrong, undertested their patch, and never notice how horribly broken various other things would be by this patch. Btw, according to blame these tests are from Ilya and not Larry. c277df42 t/op/re_tests (Ilya Zakharevich 1997-11-15 19:29:39 -0500 547) ^a(?#xxx){3}c aaac y $& aaac c277df42 t/op/re_tests (Ilya Zakharevich 1997-11-15 19:29:39 -0500 548) '^a (?#xxx) (?#yyy) {3}c'x aaac y $& aaac I checked, and the feature was added by Larry as part of the 5.000 "mega-patch" (a0d0e21e) The docs from that commit are: +=item (?#text) + +A comment. The text is ignored. As far as I can tell there were no tests for the feature at that time. Anyway, I do not think it is a feasible or reasonable to fix this to be legal. And will proceed under the assumption that the tests are wrong. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: perl5-porters [...] perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Wed, 6 Feb 2013 09:31:02 +0000
To: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
From: Aaron Crane <arc [...] cpan.org>
Download (untitled) / with headers
text/plain 2.2k
Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote: Show quoted text
> * demerphq <demerphq@gmail.com> [2013-02-05T09:10:42]
>> On 5 February 2013 15:02, demerphq <demerphq@gmail.com> wrote:
>> > The docpatch should probably say "You cannot put a comment in between >> > an atom and its quantifier."
>> >> IMO comments in insane places in regexes is not worth the price.
> > I think it's fine to document that this won't work.
I'm not entirely convinced. We already allow a comment between an atom and its quantifier in /x mode: $ perl -Mre=debug -wE '"" =~ /ab# +c/x' Compiling REx "ab#%n+c" Final program: 1: EXACT <a> (3) 3: PLUS (6) 4: EXACT <b> (0) 6: EXACT <c> (8) 8: END (0) anchored "ab" at 0 floating "bc" at 1..2147483647 (checking floating) minlen 3 Freeing REx: "ab#%n+c" I think that means we have three options, in the abstract: 1. We preserve this inconsistency between /x comments and (?#) comments 2. We start forbidding (whitespace and?) comments between an atom and a quantifier in /x mode 3. We start permitting (?#) comments between an atom and its quantifier I believe option 2 risks breaking existing code. I'm happy to look at doing option 3, but I probably won't have chance till the weekend. Show quoted text
> I wonder whether we can issue a warning, too?
If we can detect the situation well enough to issue a warning, wouldn't that imply we could instead skip past an embedded (?#) comment when parsing? At least, that's my impression to date from reading regcomp.c. Show quoted text
> On one hand, I'd like to imagine that anybody doing > something so crazy would know to consider that it might not work and that he or > she should consult the friendly manual.
My initial response is to agree that it's crazy, but on reflection, I think at least the /x case is actually useful. I've seen code formatted like this before: $line =~ m{ ^ # beginning of line (?: \s* \w+)+ # keywords $ # end of line }x at which point it's not much of a stretch for someone to write that like this: $line =~ m{ ^ # beginning of line (?: \s* \w+) # keyword + # ... one or more times $ # end of line }x which has whitespace and a comment before the "one or more times" quantifier. -- Aaron Crane ** http://aaroncrane.co.uk/
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Nicholas Clark <nick [...] ccl4.org>, perl5-porters [...] perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Wed, 6 Feb 2013 10:43:18 +0100
To: demerphq <demerphq [...] gmail.com>
From: Eirik Berg Hanssen <Eirik-Berg.Hanssen [...] allverden.no>
Download (untitled) / with headers
text/plain 842b
On Wed, Feb 6, 2013 at 10:08 AM, demerphq <demerphq@gmail.com> wrote:
Show quoted text

Anyway, I do not think it is a feasible or reasonable to fix this to
be legal.  And will proceed under the assumption that the tests are
wrong.

  The way you describe it, and given my ignorance of the implementation, the odd thing is it seems to work with braceless comments (and /x):

eirik@bluebird[10:38:44]~$ perl
use 5.014;
use strict;
use warnings;

say "/x:\t", "food" =~
  / fo  # then
    {2} # twice
    d
  /x ? "match" : "miss";

say "(?#):\t", "food" =~ /fo(?#then){2}(?#twice)d/ ? "match" : "miss";

say "both:\t", "food" =~
  / fo  (?# then)
    {2} (?# twice)
    d
  /x ? "match" : "miss";
__END__
/x:     match
(?#):   miss
both:   miss
eirik@bluebird[10:38:54]~$


  Couldn't these be expected – and made? – to parse identically?


Eirik
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Nicholas Clark <nick [...] ccl4.org>, perl5-porters [...] perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Wed, 6 Feb 2013 11:16:08 +0100
To: Eirik Berg Hanssen <Eirik-Berg.Hanssen [...] allverden.no>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 1.6k
On 6 February 2013 10:43, Eirik Berg Hanssen <Eirik-Berg.Hanssen@allverden.no> wrote: Show quoted text
> On Wed, Feb 6, 2013 at 10:08 AM, demerphq <demerphq@gmail.com> wrote:
>> >> >> Anyway, I do not think it is a feasible or reasonable to fix this to >> be legal. And will proceed under the assumption that the tests are >> wrong.
> > > The way you describe it, and given my ignorance of the implementation, the > odd thing is it seems to work with braceless comments (and /x): > > eirik@bluebird[10:38:44]~$ perl > use 5.014; > use strict; > use warnings; > > say "/x:\t", "food" =~ > / fo # then > {2} # twice > d > /x ? "match" : "miss"; > > say "(?#):\t", "food" =~ /fo(?#then){2}(?#twice)d/ ? "match" : "miss"; > > say "both:\t", "food" =~ > / fo (?# then) > {2} (?# twice) > d > /x ? "match" : "miss"; > __END__ > /x: match > (?#): miss > both: miss > eirik@bluebird[10:38:54]~$ > > > Couldn't these be expected – and made? – to parse identically?
IMO not very easily no. You have to understand how the parser works, but basically the code we would strip off the # style comments is the same code that handles EXACT so it is easy to handle there. However the code that handles (?#...) is the same code that handles (?:...) and any other such compound structure. Each of those expressions is treated an independent sub expression. The code makes many assumptions that when it enters the code that parses an independent subexpression that anything before it is, well, independent. :-) And this whole thing with quantifiers breaks that assumption, with predictable results. cheers, Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, perl5-porters [...] perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Wed, 6 Feb 2013 11:34:50 +0100
To: Aaron Crane <arc [...] cpan.org>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 5.8k
On 6 February 2013 10:31, Aaron Crane <arc@cpan.org> wrote: Show quoted text
> Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>> * demerphq <demerphq@gmail.com> [2013-02-05T09:10:42]
>>> On 5 February 2013 15:02, demerphq <demerphq@gmail.com> wrote:
>>> > The docpatch should probably say "You cannot put a comment in between >>> > an atom and its quantifier."
>>> >>> IMO comments in insane places in regexes is not worth the price.
>> >> I think it's fine to document that this won't work.
> > I'm not entirely convinced. We already allow a comment between an > atom and its quantifier in /x mode:
Yep. But that is handled by the same code that handles EXACT. See the mail I just sent to Eirik for more details but also take a look at the commentary at the bottom of this mail. Show quoted text
> $ perl -Mre=debug -wE '"" =~ /ab# > +c/x' > Compiling REx "ab#%n+c" > Final program: > 1: EXACT <a> (3) > 3: PLUS (6) > 4: EXACT <b> (0) > 6: EXACT <c> (8) > 8: END (0) > anchored "ab" at 0 floating "bc" at 1..2147483647 (checking floating) minlen 3 > Freeing REx: "ab#%n+c" > > I think that means we have three options, in the abstract: > > 1. We preserve this inconsistency between /x comments and (?#) comments > > 2. We start forbidding (whitespace and?) comments between an atom and > a quantifier in /x mode > > 3. We start permitting (?#) comments between an atom and its quantifier > > I believe option 2 risks breaking existing code. > > I'm happy to look at doing option 3, but I probably won't have chance > till the weekend.
I dont believe that 3 is a good idea, 2 is unnecessary. I think that 1 is a good idea when combined with the following: Basically the fix I want to make is: treat a quantifier following a (?#...) as applying to that construct, and DIE when we encounter it. IMO the only thing this might break is broken code already, and we will break it in a way that prevents data loss or mismatches. Show quoted text
>
>> I wonder whether we can issue a warning, too?
> > If we can detect the situation well enough to issue a warning, > wouldn't that imply we could instead skip past an embedded (?#) > comment when parsing? At least, that's my impression to date from > reading regcomp.c.
Nope I dont think so. And I think the code to do so would be horrible, or inefficient, or both. :-) Show quoted text
>> On one hand, I'd like to imagine that anybody doing >> something so crazy would know to consider that it might not work and that he or >> she should consult the friendly manual.
> > My initial response is to agree that it's crazy, but on reflection, I > think at least the /x case is actually useful. I've seen code > formatted like this before: > > $line =~ m{ > ^ # beginning of line > (?: \s* \w+)+ # keywords > $ # end of line > }x
Yeah, this is no problem Show quoted text
> at which point it's not much of a stretch for someone to write that like this: > > $line =~ m{ > ^ # beginning of line > (?: \s* \w+) # keyword > + # ... one or more times > $ # end of line > }x > > which has whitespace and a comment before the "one or more times" quantifier.
This is also not a problem, and I have no intention to change it. The problem is strictly the use of (?#...) in between an atom and its quantifier. I have outside of the code in core never seen anybody use (?#...) at all, ever, in code. No doubt there is someone who does so, but I bet they are very few. On the other hand /x with # style escapes is widely used, is not a problem and IMO should not change. Commentary: To illustrate the core problem here take a look at the output of this command (trimmed for brevity). You might want to use a fixed width font to view it. Each of the terms "reg", "brnc", "piec", "atom" refers to different routine in the parsing of the pattern. Notice the routine that handles the (?#...) is called "atom". The point here is that once you enter "atom" you have terminated any previous "atom", and once you terminate an atom any quantifier encountered is assumed to apply to the previous *atom*. The only way I see fixing this is NOT merging characters into an EXACT atom, however that will definitely have performance and memory consequences - we stopped doing this because it was a significant performance hit, and massively complicates matching unicode. I also see it probably breaking (?:foo)QUANTIFIER. $ ./perl -Ilib -Mre=Debug,ALL -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"' Compiling REx ".b(?#Comment){2}c" Starting first pass (sizing) Show quoted text
>.b(?#Comme... | 1| reg
| | brnc | | piec | | atom Show quoted text
>b(?#Commen... | 2| piec
| | atom Show quoted text
>c< | 4| inst - CURLY
| 6| piec | | atom Required size 8 nodes Starting second pass (creation) Show quoted text
>.b(?#Comme... | 1| reg
| | brnc | | piec | | atom Show quoted text
>b(?#Commen... | 2| piec
| | atom Show quoted text
>c< | 4| inst - CURLY
| 6| tail~ REG_ANY (1) -> CURLY | | piec | | atom Show quoted text
>< | 8| tail~ CURLY {2,2} (2) -> EXACT
| 9| lsbr~ tying lastbr REG_ANY (1) to ender END (8) offset 7 | | tail~ REG_ANY (1) | | ~ CURLY {2,2} (2) | | ~ EXACT <c> (6) -> END [....] Final program: 1: REG_ANY (2) 2: CURLY {2,2} (6) 4: EXACT <b> (0) 6: EXACT <c> (8) 8: END (0) anchored "bbc" at 1 (checking anchored) minlen 4 Anyway, if you can see an easy fix then of course go for it. But I will continue to work on the simple fix that I know will not have nasty side-effects. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Wed, 6 Feb 2013 11:10:53 +0000
To: perl5-porters [...] perl.org
From: Nicholas Clark <nick [...] ccl4.org>
Download (untitled) / with headers
text/plain 3.6k
On Wed, Feb 06, 2013 at 11:34:50AM +0100, demerphq wrote: Show quoted text
> On 6 February 2013 10:31, Aaron Crane <arc@cpan.org> wrote:
> > Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
> >> * demerphq <demerphq@gmail.com> [2013-02-05T09:10:42]
> >>> On 5 February 2013 15:02, demerphq <demerphq@gmail.com> wrote:
> >>> > The docpatch should probably say "You cannot put a comment in between > >>> > an atom and its quantifier."
> >>> > >>> IMO comments in insane places in regexes is not worth the price.
Specifically, embedded comments. In that comments in /x mode don't seem to have the price tag. Show quoted text
> >> I think it's fine to document that this won't work.
> > > > I'm not entirely convinced. We already allow a comment between an > > atom and its quantifier in /x mode:
> > Yep. But that is handled by the same code that handles EXACT.
Show quoted text
> > I think that means we have three options, in the abstract: > > > > 1. We preserve this inconsistency between /x comments and (?#) comments > > > > 2. We start forbidding (whitespace and?) comments between an atom and > > a quantifier in /x mode > > > > 3. We start permitting (?#) comments between an atom and its quantifier > > > > I believe option 2 risks breaking existing code. > > > > I'm happy to look at doing option 3, but I probably won't have chance > > till the weekend.
> > I dont believe that 3 is a good idea, 2 is unnecessary. I think that 1 > is a good idea when combined with the following: > > Basically the fix I want to make is: treat a quantifier following a > (?#...) as applying to that construct, and DIE when we encounter it.
That might be as simple as treating embedded comments the same way as embedded pattern match modifiers: $ perl -le '/ab(?i)+c/' Quantifier follows nothing in regex; marked by <-- HERE in m/ab(?i)+ <-- HERE c/ at -e line 1. (Might be a bit more tricky to get an error that says "comment" for comments and "nothing" for embedded pattern match modifiers, but possibly it makes just as much sense to keep the word "nothing" for comments) Show quoted text
> IMO the only thing this might break is broken code already, and we > will break it in a way that prevents data loss or mismatches.
Yes. Right now, I feel that the most bonkers thing is that the (mis)parsing differs depending on whether the preceding thing is multiple literals, or a single literal. Consistently dieing on /.b(?#comment)*/ and /ab(?#comment)*/ (etc) would be an improvement. Show quoted text
> >> I wonder whether we can issue a warning, too?
> > > > If we can detect the situation well enough to issue a warning, > > wouldn't that imply we could instead skip past an embedded (?#) > > comment when parsing? At least, that's my impression to date from > > reading regcomp.c.
> > Nope I dont think so. And I think the code to do so would be horrible, > or inefficient, or both. :-)
It wouldn't surprise me that it's much easier to detect the problem (after it's too late to parse it correctly), than to detect it early enough to parse correctly, or to fix up the parse after the event. Show quoted text
> The problem is strictly the use of (?#...) in between an atom and its > quantifier. I have outside of the code in core never seen anybody use > (?#...) at all, ever, in code. No doubt there is someone who does so, > but I bet they are very few. > > On the other hand /x with # style escapes is widely used, is not a > problem and IMO should not change.
Agree. In that I feel that it's perfectly reasonable to document that (?#) style has limitations on where it can be used. Show quoted text
> Anyway, if you can see an easy fix then of course go for it. But I > will continue to work on the simple fix that I know will not have > nasty side-effects.
Yes, I'm not going to object to Aaron having fun grovelling in the code either. :-) Nicholas Clark
CC: perl5-porters [...] perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Wed, 6 Feb 2013 12:26:42 +0100
To: Nicholas Clark <nick [...] ccl4.org>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 4.5k
On 6 February 2013 12:10, Nicholas Clark <nick@ccl4.org> wrote: Show quoted text
> On Wed, Feb 06, 2013 at 11:34:50AM +0100, demerphq wrote:
>> On 6 February 2013 10:31, Aaron Crane <arc@cpan.org> wrote:
>> > Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>> >> * demerphq <demerphq@gmail.com> [2013-02-05T09:10:42]
>> >>> On 5 February 2013 15:02, demerphq <demerphq@gmail.com> wrote:
>> >>> > The docpatch should probably say "You cannot put a comment in between >> >>> > an atom and its quantifier."
>> >>> >> >>> IMO comments in insane places in regexes is not worth the price.
> > Specifically, embedded comments. > In that comments in /x mode don't seem to have the price tag.
Exactly. Show quoted text
>> >> I think it's fine to document that this won't work.
>> > >> > I'm not entirely convinced. We already allow a comment between an >> > atom and its quantifier in /x mode:
>> >> Yep. But that is handled by the same code that handles EXACT.
>
>> > I think that means we have three options, in the abstract: >> > >> > 1. We preserve this inconsistency between /x comments and (?#) comments >> > >> > 2. We start forbidding (whitespace and?) comments between an atom and >> > a quantifier in /x mode >> > >> > 3. We start permitting (?#) comments between an atom and its quantifier >> > >> > I believe option 2 risks breaking existing code. >> > >> > I'm happy to look at doing option 3, but I probably won't have chance >> > till the weekend.
>> >> I dont believe that 3 is a good idea, 2 is unnecessary. I think that 1 >> is a good idea when combined with the following: >> >> Basically the fix I want to make is: treat a quantifier following a >> (?#...) as applying to that construct, and DIE when we encounter it.
> > That might be as simple as treating embedded comments the same way as > embedded pattern match modifiers: > > $ perl -le '/ab(?i)+c/' > Quantifier follows nothing in regex; marked by <-- HERE in m/ab(?i)+ <-- HERE c/ at -e line 1. > > (Might be a bit more tricky to get an error that says "comment" for comments > and "nothing" for embedded pattern match modifiers, but possibly it makes just > as much sense to keep the word "nothing" for comments)
Ah good catch. I guess that is also there for things like /(|)*/ Show quoted text
>> IMO the only thing this might break is broken code already, and we >> will break it in a way that prevents data loss or mismatches.
> > Yes. Right now, I feel that the most bonkers thing is that the (mis)parsing > differs depending on whether the preceding thing is multiple literals, or > a single literal. Consistently dieing on /.b(?#comment)*/ and > /ab(?#comment)*/ (etc) would be an improvement.
Indeed. I also have a suspicion that an exhaustive search would find other "interesting" oddities related to this. I suspect there is deeper fu involved here that has not been remarked on in this thread, as the whole thing with /(?#$foo)/ where $foo is not expanded is very strange, and cant be accounted for as far as I know in the regex engine alone. Show quoted text
>> >> I wonder whether we can issue a warning, too?
>> > >> > If we can detect the situation well enough to issue a warning, >> > wouldn't that imply we could instead skip past an embedded (?#) >> > comment when parsing? At least, that's my impression to date from >> > reading regcomp.c.
>> >> Nope I dont think so. And I think the code to do so would be horrible, >> or inefficient, or both. :-)
> > It wouldn't surprise me that it's much easier to detect the problem (after > it's too late to parse it correctly), than to detect it early enough to > parse correctly, or to fix up the parse after the event.
Thanks. That is a much more succinct description of what I was trying to say than what I managed to type out. :-) Show quoted text
>> The problem is strictly the use of (?#...) in between an atom and its >> quantifier. I have outside of the code in core never seen anybody use >> (?#...) at all, ever, in code. No doubt there is someone who does so, >> but I bet they are very few. >> >> On the other hand /x with # style escapes is widely used, is not a >> problem and IMO should not change.
> > Agree. In that I feel that it's perfectly reasonable to document that (?#) > style has limitations on where it can be used. >
>> Anyway, if you can see an easy fix then of course go for it. But I >> will continue to work on the simple fix that I know will not have >> nasty side-effects.
> > Yes, I'm not going to object to Aaron having fun grovelling in the code > either. :-)
Especially if he proves me wrong. :-) (secretly this thread is all part of our evil master plan to get Aaron hacking on the regex engine more... Oh wait did I type that...) Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: hv [...] crypt.org, perl5-porters [...] perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Wed, 6 Feb 2013 18:58:56 -0500
To: demerphq <demerphq [...] gmail.com>
From: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
Download (untitled) / with headers
text/plain 465b
* demerphq <demerphq@gmail.com> [2013-02-06T03:54:55] Show quoted text
> On 6 February 2013 08:44, <hv@crypt.org> wrote:
> > Allowing it with a warning only makes sense to me if we want to be able > > to define the effect in terms of something else, like '(?#comment) is > > the same as (?=(?:|comment))'; but it isn't, and we don't. :)
> > Thanks for the analysis, I agree. If I can figure it out at all I will > make it die instead of warn.
Agreed, too, thanks Hugo. -- rjbs
Download signature.asc
application/pgp-signature 490b

Message body not shown because it is not plain text.

CC: hv [...] crypt.org, perl5-porters [...] perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Fri, 8 Feb 2013 07:10:36 +0100
To: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 687b
On 7 February 2013 00:58, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote: Show quoted text
> * demerphq <demerphq@gmail.com> [2013-02-06T03:54:55]
>> On 6 February 2013 08:44, <hv@crypt.org> wrote:
>> > Allowing it with a warning only makes sense to me if we want to be able >> > to define the effect in terms of something else, like '(?#comment) is >> > the same as (?=(?:|comment))'; but it isn't, and we don't. :)
>> >> Thanks for the analysis, I agree. If I can figure it out at all I will >> make it die instead of warn.
> > Agreed, too, thanks Hugo.
A patch to implement this has been pushed as: smoke-me/regex_comment_quantifier_fix Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 938b
On Thu Feb 07 22:11:16 2013, demerphq wrote: Show quoted text
> On 7 February 2013 00:58, Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote: Show quoted text
> > * demerphq <demerphq@gmail.com> [2013-02-06T03:54:55]
> >> On 6 February 2013 08:44, <hv@crypt.org> wrote:
> >> > Allowing it with a warning only makes sense to me if we want to
be able Show quoted text
> >> > to define the effect in terms of something else, like '(?#comment) is > >> > the same as (?=(?:|comment))'; but it isn't, and we don't. :)
> >> > >> Thanks for the analysis, I agree. If I can figure it out at all I will > >> make it die instead of warn.
> > > > Agreed, too, thanks Hugo.
> > A patch to implement this has been pushed as: > smoke-me/regex_comment_quantifier_fix
Did this change smoke cleanly for you? I don't see anything at http://perl.develop-help.com/?b=smoke-me%2Fregex_comment_quantifier_fix that looks like it was caused by this patch. This branch no longer rebases cleanly on blead. Tony
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 696b
On Wed Feb 06 15:59:37 2013, perl.p5p@rjbs.manxome.org wrote: Show quoted text
> * demerphq <demerphq@gmail.com> [2013-02-06T03:54:55]
> > On 6 February 2013 08:44, <hv@crypt.org> wrote:
> > > Allowing it with a warning only makes sense to me if we want to be able > > > to define the effect in terms of something else, like '(?#comment) is > > > the same as (?=(?:|comment))'; but it isn't, and we don't. :)
> > > > Thanks for the analysis, I agree. If I can figure it out at all I will > > make it die instead of warn.
> > Agreed, too, thanks Hugo.
Karl deprecated at least some of this behaviour in 000947ada5b027f394ee63b5166df8c06b64a74e. Is the intent to ban it entirely in 5.20, or wait longer? Tony
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
To: perlbug-followup [...] perl.org, "OtherRecipients of perl Ticket #116639:;" [...] smtp.indra.com
CC: perl5-porters [...] perl.org
Date: Thu, 14 Nov 2013 17:40:04 -0700
From: Karl Williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 892b
On 11/14/2013 04:04 PM, Tony Cook via RT wrote: Show quoted text
> On Wed Feb 06 15:59:37 2013, perl.p5p@rjbs.manxome.org wrote:
>> * demerphq <demerphq@gmail.com> [2013-02-06T03:54:55]
>>> On 6 February 2013 08:44, <hv@crypt.org> wrote:
>>>> Allowing it with a warning only makes sense to me if we want to be able >>>> to define the effect in terms of something else, like '(?#comment) is >>>> the same as (?=(?:|comment))'; but it isn't, and we don't. :)
>>> >>> Thanks for the analysis, I agree. If I can figure it out at all I will >>> make it die instead of warn.
>> >> Agreed, too, thanks Hugo.
> > Karl deprecated at least some of this behaviour in 000947ada5b027f394ee63b5166df8c06b64a74e. > > Is the intent to ban it entirely in 5.20, or wait longer? >
Some releases ago we agreed to having all deprecations be at least 2 cycles, because our release dates may not mesh with bundlers like debian.
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
CC: Perl5 Porteros <perl5-porters [...] perl.org>
To: reneeb via RT <perlbug-followup [...] perl.org>
From: demerphq <demerphq [...] gmail.com>
Date: Fri, 15 Nov 2013 10:20:34 +0100
Download (untitled) / with headers
text/plain 1.2k
Should I investigate a new patch? On 10 July 2013 08:43, Tony Cook via RT <perlbug-followup@perl.org> wrote: Show quoted text
> On Thu Feb 07 22:11:16 2013, demerphq wrote:
>> On 7 February 2013 00:58, Ricardo Signes <perl.p5p@rjbs.manxome.org>
> wrote:
>> > * demerphq <demerphq@gmail.com> [2013-02-06T03:54:55]
>> >> On 6 February 2013 08:44, <hv@crypt.org> wrote:
>> >> > Allowing it with a warning only makes sense to me if we want to
> be able
>> >> > to define the effect in terms of something else, like '(?#comment) is >> >> > the same as (?=(?:|comment))'; but it isn't, and we don't. :)
>> >> >> >> Thanks for the analysis, I agree. If I can figure it out at all I will >> >> make it die instead of warn.
>> > >> > Agreed, too, thanks Hugo.
>> >> A patch to implement this has been pushed as: >> smoke-me/regex_comment_quantifier_fix
> > Did this change smoke cleanly for you? > > I don't see anything at > > http://perl.develop-help.com/?b=smoke-me%2Fregex_comment_quantifier_fix > > that looks like it was caused by this patch. > > This branch no longer rebases cleanly on blead. > > Tony > > --- > via perlbug: queue: perl5 status: open > https://rt.perl.org:443/rt3/Ticket/Display.html?id=116639
-- perl -Mre=debug -e "/just|another|perl|hacker/"
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 196b
On Fri Nov 15 01:21:14 2013, demerphq wrote: Show quoted text
> Should I investigate a new patch?
It can wait, since it won't be in until 5.22. https://rt.perl.org/Ticket/Display.html?id=116639#txn-1268303 Tony
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 423b
On Sun Nov 24 14:36:08 2013, tonyc wrote: Show quoted text
> On Fri Nov 15 01:21:14 2013, demerphq wrote:
> > Should I investigate a new patch?
> > It can wait, since it won't be in until 5.22. > > https://rt.perl.org/Ticket/Display.html?id=116639#txn-1268303 > > Tony
Would it be possible to get an update on the status of this ticket (as it's marked as a blocker for 5.22)? Thank you very much. -- James E Keenan (jkeenan@cpan.org)
Date: Mon, 9 Mar 2015 09:43:12 +0100
From: demerphq <demerphq [...] gmail.com>
CC: Perl5 Porteros <perl5-porters [...] perl.org>
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
To: Perl RT Bug Tracker <perlbug-followup [...] perl.org>
Download (untitled) / with headers
text/plain 1.3k


On 7 Mar 2015 03:26, "James E Keenan via RT" <perlbug-followup@perl.org> wrote:
Show quoted text

>
> On Sun Nov 24 14:36:08 2013, tonyc wrote:
> > On Fri Nov 15 01:21:14 2013, demerphq wrote:
> > > Should I investigate a new patch?
> >
> > It can wait, since it won't be in until 5.22.
> >
> > https://rt.perl.org/Ticket/Display.html?id=116639#txn-1268303
> >
> > Tony
>
> Would it be possible to get an update on the status of this ticket (as it's marked as a blocker for 5.22)?
>
> Thank you very much.

As I recall I wrote my patch right before 5.20 was released and we were in a code freeze.

I've complained in the past about how code freeze isnt necessary and causes patches to be lost and this is an example, a patch that got forgotten because of it.

We are now in another code freeze, again done suboptimally, and this change is user visible so I guess it will have to wait till 5.26 for the fix.

We should ditch the code freeze, and simply create a new prerelease branch instead and let people merge to blead while the prerelease branch is finalized. Then after release we would merge the prerelease branch back to blead. That would prevent things like this from happening, and allow people to continue to hack blead during the code freeze.

Anyway, if ricardo wants to make an exception here its his call. I assume he will say wait till 5.26.

Date: 10 Mar 2015 23:08:03 -0000
From: Father Chrysostomos <sprout [...] cpan.org>
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
To: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 990b
Yves Orton wrote: Show quoted text
> As I recall I wrote my patch right before 5.20 was released and we were in > a code freeze. > > I've complained in the past about how code freeze isnt necessary and causes > patches to be lost and this is an example, a patch that got forgotten > because of it. > > We are now in another code freeze, again done suboptimally, and this change > is user visible so I guess it will have to wait till 5.26 for the fix. > > We should ditch the code freeze, and simply create a new prerelease branch > instead and let people merge to blead while the prerelease branch is > finalized. Then after release we would merge the prerelease branch back to > blead. That would prevent things like this from happening, and allow people > to continue to hack blead during the code freeze.
All it takes is for someone to create a post-5.22 branch and rebase and merge it to blead after 5.22. I did that a few years ago, but the last couple of years I was offline during code freeze.
From: demerphq <demerphq [...] gmail.com>
Date: Wed, 11 Mar 2015 10:37:30 +0100
To: Perl5 Porteros <perl5-porters [...] perl.org>
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Download (untitled) / with headers
text/plain 1.6k
On 11 March 2015 at 00:08, Father Chrysostomos <sprout@cpan.org> wrote: Show quoted text
> Yves Orton wrote:
>> As I recall I wrote my patch right before 5.20 was released and we were in >> a code freeze. >> >> I've complained in the past about how code freeze isnt necessary and causes >> patches to be lost and this is an example, a patch that got forgotten >> because of it. >> >> We are now in another code freeze, again done suboptimally, and this change >> is user visible so I guess it will have to wait till 5.26 for the fix. >> >> We should ditch the code freeze, and simply create a new prerelease branch >> instead and let people merge to blead while the prerelease branch is >> finalized. Then after release we would merge the prerelease branch back to >> blead. That would prevent things like this from happening, and allow people >> to continue to hack blead during the code freeze.
> > All it takes is for someone to create a post-5.22 branch and rebase > and merge it to blead after 5.22. I did that a few years ago, but the > last couple of years I was offline during code freeze.
IMO That is the wrong way around. When we create a release candidate we should create the release candidate branch. At that point that branch is "owned" by the release manager, and only they can commit to it. Anything that is required for release can be applied to either blead or the release branch, if necessary it can be cherry picked from one to the other. Freezing blead just ignores the capabilities of git to merge and cherry pick, and slows down core development for no good reason. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
RT-Send-CC: perl5-porters [...] perl.org
Fixed by 45a2623346a3b7028c030eb1dc738723b216d2dc -- Karl Williamson
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 1.6k
I may have closed this prematurely. I had not read the extensive commentary on this when I closed it, only the original report. So I had forgotten the controversy over what should happen. To recap what has happened in blead: It turns out that no one (including me) thought about nextchr()'s behavior when the pattern is UTF-8 encoded. It did a simple ++ of the parse position, which is the wrong thing to do when the character is a multi-byte character. It would point to the 2nd byte of that, hence the tests it did after the increment for white space under /x would fail for white space that was multi-byte. When I tried to write tests after fixing that, I discovered that nothing I came up with would reliably fail. And valgrind showed that there reads outside the buffer of garbage data. That led to me fixing a bunch of nextchr calls, and that led to making all such stuff uniform. And that led to this bug being fixed. But do we really want a (?#...) comment between a character and its quantifier? I can see both sides of the issue, so am now bringing it up to discussion again. blead is now in a state where it would be easy to add the ability to choose which places allow (?#...) and which forbid it, but allow white space and regular # comments, both only under /x. We could allow (?#...) only under /x in such cases if we choose. It's easy to change it to do any of this, and I'm willing to do the work, once a decision has been made as to what to do. My only stance on this is that I think (but am convince-able the other way) that under /x, anywhere there is a # comment, should also allow a (?#...) comment -- Karl Williamson
To: Perl RT Bug Tracker <perlbug-followup [...] perl.org>
From: demerphq <demerphq [...] gmail.com>
CC: Perl5 Porteros <perl5-porters [...] perl.org>
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Wed, 14 Oct 2015 10:05:09 +0200
Download (untitled) / with headers
text/plain 1.1k
On 14 October 2015 at 00:42, Karl Williamson via RT <perlbug-followup@perl.org> wrote: Show quoted text
> I may have closed this prematurely. I had not read the extensive commentary on this when I closed it, only the original report. So I had forgotten the controversy over what should happen. > > To recap what has happened in blead: It turns out that no one (including me) thought about nextchr()'s behavior when the pattern is UTF-8 encoded. It did a simple ++ of the parse position, which is the wrong thing to do when the character is a multi-byte character. It would point to the 2nd byte of that, hence the tests it did after the increment for white space under /x would fail for white space that was multi-byte. When I tried to write tests after fixing that, I discovered that nothing I came up with would reliably fail. And valgrind showed that there reads outside the buffer of garbage data. That led to me fixing a bunch of nextchr calls, and that led to making all such stuff uniform. And that led to this bug being fixed. > > But do we really want a (?#...) comment between a character and its quantifier?
IMO no. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
From: Abigail <abigail [...] abigail.be>
To: Karl Williamson via RT <perlbug-followup [...] perl.org>
CC: "OtherRecipients of perl Ticket #116639": ;, perl5-porters [...] perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Wed, 14 Oct 2015 13:43:05 +0200
Download (untitled) / with headers
text/plain 2.3k
On Tue, Oct 13, 2015 at 03:42:29PM -0700, Karl Williamson via RT wrote: Show quoted text
> I may have closed this prematurely. I had not read the extensive > commentary on this when I closed it, only the original report. So I > had forgotten the controversy over what should happen. > > To recap what has happened in blead: It turns out that no one > (including me) thought about nextchr()'s behavior when the pattern is > UTF-8 encoded. It did a simple ++ of the parse position, which is the > wrong thing to do when the character is a multi-byte character. It would > point to the 2nd byte of that, hence the tests it did after the increment > for white space under /x would fail for white space that was multi-byte. > When I tried to write tests after fixing that, I discovered that nothing > I came up with would reliably fail. And valgrind showed that there > reads outside the buffer of garbage data. That led to me fixing a > bunch of nextchr calls, and that led to making all such stuff uniform. > And that led to this bug being fixed. > > But do we really want a (?#...) comment between a character and its > quantifier?
I'd vote yes. For consistency. See below. Show quoted text
> quantifier? I can see both sides of the issue, so am now bringing it up > to discussion again. blead is now in a state where it would be easy to > add the ability to choose which places allow (?#...) and which forbid > it, but allow white space and regular # comments, both only under /x. > We could allow (?#...) only under /x in such cases if we choose. > It's easy to change it to do any of this, and I'm willing to do the work, > once a decision has been made as to what to do. > > My only stance on this is that I think (but am convince-able the other > way) that under /x, anywhere there is a # comment, should also allow a > (?#...) comment
I agree. And I'd throw whitespace in it as well: anywhere where we ignore whitespace under /x, we should allow a # comment, and hence, should allow a (?#...) comment. Ignorable whitespace between a character and its quantifier(s) is allowed: $ perl -wE 'say "aa" =~ /^a + +$/x' 1 $ A comment there is also allowed: $ perl -wE 'say "aa" =~ /^a # Foo + # Bar +$/x' 1 $ If we have different rules for whitespace and (?#), it won't be easy to document properly, and it won't be easy to learn. Abigail
Date: Wed, 14 Oct 2015 20:29:45 -0400
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
CC: Perl RT Bug Tracker <perlbug-followup [...] perl.org>, Perl5 Porteros <perl5-porters [...] perl.org>
To: demerphq <demerphq [...] gmail.com>
From: Eric Brine <ikegami [...] adaelis.com>
Download (untitled) / with headers
text/plain 1.2k
On Wed, Oct 14, 2015 at 4:05 AM, demerphq <demerphq@gmail.com> wrote:
Show quoted text
On 14 October 2015 at 00:42, Karl Williamson via RT
<perlbug-followup@perl.org> wrote:
> I may have closed this prematurely.  I had not read the extensive commentary on this when I closed it, only the original report.  So I had forgotten the controversy over what should happen.
>
> To recap what has happened in blead:  It turns out that no one (including me) thought about nextchr()'s behavior when the pattern is UTF-8 encoded.  It did a simple ++ of the parse position, which is the wrong thing to do when the character is a multi-byte character.  It would point to the 2nd byte of that, hence the tests it did after the increment for white space under /x would fail for white space that was multi-byte.  When I tried to write tests after fixing that, I discovered that nothing I came up with would reliably fail.  And valgrind showed that there reads outside the buffer of garbage data.  That led to me fixing a bunch of nextchr calls, and that led to making all such stuff uniform.  And that led to this bug being fixed.
>
> But do we really want a (?#...) comment between a character and its quantifier?

IMO no.

Yves

So the follow question would be: What should happen when someone does that?

Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Thu, 15 Oct 2015 09:14:29 +0200
CC: Perl RT Bug Tracker <perlbug-followup [...] perl.org>, Perl5 Porteros <perl5-porters [...] perl.org>
To: Eric Brine <ikegami [...] adaelis.com>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 1.5k
On 15 October 2015 at 02:29, Eric Brine <ikegami@adaelis.com> wrote: Show quoted text
> On Wed, Oct 14, 2015 at 4:05 AM, demerphq <demerphq@gmail.com> wrote:
>> >> On 14 October 2015 at 00:42, Karl Williamson via RT >> <perlbug-followup@perl.org> wrote:
>> > I may have closed this prematurely. I had not read the extensive >> > commentary on this when I closed it, only the original report. So I had >> > forgotten the controversy over what should happen. >> > >> > To recap what has happened in blead: It turns out that no one >> > (including me) thought about nextchr()'s behavior when the pattern is UTF-8 >> > encoded. It did a simple ++ of the parse position, which is the wrong thing >> > to do when the character is a multi-byte character. It would point to the >> > 2nd byte of that, hence the tests it did after the increment for white space >> > under /x would fail for white space that was multi-byte. When I tried to >> > write tests after fixing that, I discovered that nothing I came up with >> > would reliably fail. And valgrind showed that there reads outside the >> > buffer of garbage data. That led to me fixing a bunch of nextchr calls, and >> > that led to making all such stuff uniform. And that led to this bug being >> > fixed. >> > >> > But do we really want a (?#...) comment between a character and its >> > quantifier?
>> >> IMO no. >> >> Yves
> > > So the follow question would be: What should happen when someone does that?
Throw an error, modifier illegal on zero width match. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
To: Abigail <abigail [...] abigail.be>
CC: Karl Williamson via RT <perlbug-followup [...] perl.org>;, perl5-porters [...] perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
From: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
Date: Sat, 17 Oct 2015 21:53:39 -0400
Download (untitled) / with headers
text/plain 767b
* Abigail <abigail@abigail.be> [2015-10-14T07:43:05] Show quoted text
> On Tue, Oct 13, 2015 at 03:42:29PM -0700, Karl Williamson via RT wrote:
> > > > My only stance on this is that I think (but am convince-able the other > > way) that under /x, anywhere there is a # comment, should also allow a > > (?#...) comment
> > I agree. And I'd throw whitespace in it as well: anywhere where we > ignore whitespace under /x, we should allow a # comment, and hence, > should allow a (?#...) comment.
I wish I'd read your testing out of this before going through and doing it myself. :-) Anyway, I agree. While I'd rather nobody actually write this, I think that's a matter of style. Anyway: making it a matter of grammar would require either weird inconsistency or breakage. -- rjbs
Download signature.asc
application/pgp-signature 473b

Message body not shown because it is not plain text.

To: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Abigail <abigail [...] abigail.be>
From: Karl Williamson <public [...] khwilliamson.com>
CC: Karl Williamson via RT <perlbug-followup [...] perl.org>, perl5-porters [...] perl.org
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Sat, 17 Oct 2015 22:16:44 -0600
Download (untitled) / with headers
text/plain 1.2k
On 10/17/2015 07:53 PM, Ricardo Signes wrote: Show quoted text
> * Abigail <abigail@abigail.be> [2015-10-14T07:43:05]
>> On Tue, Oct 13, 2015 at 03:42:29PM -0700, Karl Williamson via RT wrote:
>>> >>> My only stance on this is that I think (but am convince-able the other >>> way) that under /x, anywhere there is a # comment, should also allow a >>> (?#...) comment
>> >> I agree. And I'd throw whitespace in it as well: anywhere where we >> ignore whitespace under /x, we should allow a # comment, and hence, >> should allow a (?#...) comment.
> > I wish I'd read your testing out of this before going through and doing it > myself. :-) > > Anyway, I agree. While I'd rather nobody actually write this, I think that's a > matter of style. Anyway: making it a matter of grammar would require either > weird inconsistency or breakage. >
Just to make sure everyone understands. Currently (?#...) comments are allowed even when there is no /x. We probably have to support that in the places where it's been that way all along, but we could decide to not support them in the places that I just added, when not under /x. Thus, we could say that you can't split a quantifier from its atom except under /x. I don't have an opinion on this.
CC: Abigail <abigail [...] abigail.be>, Karl Williamson via RT <perlbug-followup [...] perl.org>, perl5-porters [...] perl.org
To: Karl Williamson <public [...] khwilliamson.com>
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
From: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
Date: Sun, 18 Oct 2015 22:50:16 -0400
Download (untitled) / with headers
text/plain 914b
* Karl Williamson <public@khwilliamson.com> [2015-10-18T00:16:44] Show quoted text
> Just to make sure everyone understands. > > Currently (?#...) comments are allowed even when there is no /x. We > probably have to support that in the places where it's been that way all > along, but we could decide to not support them in the places that I just > added, when not under /x. Thus, we could say that you can't split a > quantifier from its atom except under /x.
Thanks, I was confused. Show quoted text
> I don't have an opinion on this.
I'm not strongly opinionated on this, but: I think that I would find it useful to say: If you want to put comments into a regular expression, you have two options. You use /x and then insert space and any kind of comments between tokens, or you can skip /x and use (?#...) between tokens. That is: always allow (?#...) in those places where space and comments become allowed under /x. -- rjbs
Download signature.asc
application/pgp-signature 473b

Message body not shown because it is not plain text.

Date: Mon, 19 Oct 2015 11:33:55 +0200
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
CC: Karl Williamson <public [...] khwilliamson.com>, Abigail <abigail [...] abigail.be>, Karl Williamson via RT <perlbug-followup [...] perl.org>, Perl5 Porteros <perl5-porters [...] perl.org>
From: demerphq <demerphq [...] gmail.com>
To: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
Download (untitled) / with headers
text/plain 1.8k
On 19 October 2015 at 04:50, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote: Show quoted text
> * Karl Williamson <public@khwilliamson.com> [2015-10-18T00:16:44]
>> Just to make sure everyone understands. >> >> Currently (?#...) comments are allowed even when there is no /x. We >> probably have to support that in the places where it's been that way all >> along, but we could decide to not support them in the places that I just >> added, when not under /x. Thus, we could say that you can't split a >> quantifier from its atom except under /x.
> > Thanks, I was confused. >
>> I don't have an opinion on this.
> > I'm not strongly opinionated on this, but: I think that I would find it useful > to say: > > If you want to put comments into a regular expression, you have two > options. You use /x and then insert space and any kind of comments between > tokens, or you can skip /x and use (?#...) between tokens. > > That is: always allow (?#...) in those places where space and comments become > allowed under /x.
I really dont like this. A) it complicates the regex engine, and B) it makes a mockery of what an expert would consider to be one token. so for instance to *me*: a{1,10} is a single token. a(?#whatever){1,10} is two "tokens". I think expecting people to grok that /a {1,10}/x is the same as /a{1,10}/x is IMO just asking for trouble and for people to misunderstand what is going on. Related, would you expect \(?#is this a \w match or not)w to be the same as \w ? Its not, but IMO if you think that a(?#foo){1,10} should be the same as a{1,10} then why shouldnt \(?#whatever)w? Of course, it wouldnt work, as \( would mean that (?#whatever) is not treated as a comment at all. So this idea just increases inconsistency in our regexes for IMO basically no value. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Karl Williamson <public [...] khwilliamson.com>, Karl Williamson via RT <perlbug-followup [...] perl.org>, Perl5 Porteros <perl5-porters [...] perl.org>
From: Abigail <abigail [...] abigail.be>
To: demerphq <demerphq [...] gmail.com>
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Mon, 19 Oct 2015 16:26:06 +0200
Download (untitled) / with headers
text/plain 2.3k
On Mon, Oct 19, 2015 at 11:33:55AM +0200, demerphq wrote: Show quoted text
> On 19 October 2015 at 04:50, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
> > * Karl Williamson <public@khwilliamson.com> [2015-10-18T00:16:44]
> >> Just to make sure everyone understands. > >> > >> Currently (?#...) comments are allowed even when there is no /x. We > >> probably have to support that in the places where it's been that way all > >> along, but we could decide to not support them in the places that I just > >> added, when not under /x. Thus, we could say that you can't split a > >> quantifier from its atom except under /x.
> > > > Thanks, I was confused. > >
> >> I don't have an opinion on this.
> > > > I'm not strongly opinionated on this, but: I think that I would find it useful > > to say: > > > > If you want to put comments into a regular expression, you have two > > options. You use /x and then insert space and any kind of comments between > > tokens, or you can skip /x and use (?#...) between tokens. > > > > That is: always allow (?#...) in those places where space and comments become > > allowed under /x.
> > > I really dont like this. A) it complicates the regex engine, and B) it > makes a mockery of what an expert would consider to be one token. > > so for instance to *me*: a{1,10} is a single token.
To me, it isn't. To Perl, it isn't either, as /a {1,10}/x matches up to 10 a's: $ perl -wE '"aaaaaaaaaa" =~ /a {1,10}/x; say $& // "UNDEF"' aaaaaaaaaa $ And it's not that /x just removes any whitespace it finds, as it doesn't remove the whitespace inside the braces: $ perl -wE '"aaaaaaaaaa" =~ /a { 1,10}/x; say $& // "UNDEF"' UNDEF $ (In this case, it tries to match the literal string "a{1,10}" -- one can wonder whether that's the most useful thing it could do). Compiling /a {1,10}/x shows the tokes perl think there are: 1: CURLY {1,10} (5) 3: EXACT <a> (0) 5: END (0) Show quoted text
> > a(?#whatever){1,10} > > is two "tokens". >
I don't see any benefit of suddenly disallowing a space (under /x) between 'a' and '{1,10}', nor to allow space there, but not a comment. I don't see myself easily use an inline comment there, but I also don't think Perl should enforce a particular coding style upon its users. That just makes the language less flexible, and prohibits creativity. Abigail
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Mon, 19 Oct 2015 08:33:51 -0600
From: Karl Williamson <public [...] khwilliamson.com>
To: Abigail <abigail [...] abigail.be>, demerphq <demerphq [...] gmail.com>
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Karl Williamson via RT <perlbug-followup [...] perl.org>, Perl5 Porteros <perl5-porters [...] perl.org>
Download (untitled) / with headers
text/plain 552b
On 10/19/2015 08:26 AM, Abigail wrote: Show quoted text
> $ perl -wE '"aaaaaaaaaa" =~ /a { 1,10}/x; say $& // "UNDEF"' > UNDEF > $ > > (In this case, it tries to match the literal string "a{1,10}" -- one > can wonder whether that's the most useful thing it could do).
Note that this now give a warning, even without the '-w' command line option: $ blead -E '"aaaaaaaaaa" =~ /a { 1,10}/x; say $& // "UNDEF"' Unescaped left brace in regex is deprecated, passed through in regex; marked by <-- HERE in m/a { <-- HERE 1,10}/ at -e line 1. UNDEF $
From: Abigail <abigail [...] abigail.be>
To: Karl Williamson <public [...] khwilliamson.com>
CC: demerphq <demerphq [...] gmail.com>, Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Karl Williamson via RT <perlbug-followup [...] perl.org>, Perl5 Porteros <perl5-porters [...] perl.org>
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Mon, 19 Oct 2015 17:07:15 +0200
Download (untitled) / with headers
text/plain 942b
On Mon, Oct 19, 2015 at 08:33:51AM -0600, Karl Williamson wrote: Show quoted text
> On 10/19/2015 08:26 AM, Abigail wrote:
>> $ perl -wE '"aaaaaaaaaa" =~ /a { 1,10}/x; say $& // "UNDEF"' >> UNDEF >> $ >> >> (In this case, it tries to match the literal string "a{1,10}" -- one >> can wonder whether that's the most useful thing it could do).
> > Note that this now give a warning, even without the '-w' command line > option: > > $ blead -E '"aaaaaaaaaa" =~ /a { 1,10}/x; say $& // "UNDEF"' > Unescaped left brace in regex is deprecated, passed through in regex; > marked by <-- HERE in m/a { <-- HERE 1,10}/ at -e line 1. > UNDEF > $
Yes, I know. Is that useful? I'd wager that of all the people who write /a { 1,10}/x more than 99.9% of want it to behave like /a {1,10}/x and none of them actually want to match the literal string "a{1,10}" Abigail (who has put a space after the , in /x{n,m}/ way too often).
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Karl Williamson <public [...] khwilliamson.com>, Abigail <abigail [...] abigail.be>, Karl Williamson via RT <perlbug-followup [...] perl.org>, Perl5 Porteros <perl5-porters [...] perl.org>
From: Eric Brine <ikegami [...] adaelis.com>
To: demerphq <demerphq [...] gmail.com>
Date: Mon, 19 Oct 2015 11:18:51 -0400
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Download (untitled) / with headers
text/plain 1.5k
On Mon, Oct 19, 2015 at 5:33 AM, demerphq <demerphq@gmail.com> wrote:
Show quoted text
On 19 October 2015 at 04:50, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
> * Karl Williamson <public@khwilliamson.com> [2015-10-18T00:16:44]
>> Just to make sure everyone understands.
>>
>> Currently (?#...) comments are allowed even when there is no /x.  We
>> probably have to support that in the places where it's been that way all
>> along, but we could decide to not support them in the places that I just
>> added, when not under /x.  Thus, we could say that you can't split a
>> quantifier from its atom except under /x.
>
> Thanks, I was confused.
>
>> I don't have an opinion on this.
>
> I'm not strongly opinionated on this, but:  I think that I would find it useful
> to say:
>
>   If you want to put comments into a regular expression, you have two
>   options.  You use /x and then insert space and any kind of comments between
>   tokens, or you can skip /x and use (?#...) between tokens.
>
> That is: always allow (?#...) in those places where space and comments become
> allowed under /x.


I really dont like this. A) it complicates the regex engine, and B) it
makes a mockery of what an expert would consider to be one token.

so for instance to *me*: a{1,10} is a single token.

Whether whitespace is allowed within or not, exports would not consider that a token. Tokens are indivisible, yet a number of parts of "a{1,10}" are optional. That pattern has 6 tokens ("a", "{", "1", ",", "10", "}").
 
Show quoted text
a(?#whatever){1,10} is two "tokens".

Still the same 6 tokens.

From: Karl Williamson <public [...] khwilliamson.com>
To: Abigail <abigail [...] abigail.be>
CC: demerphq <demerphq [...] gmail.com>, Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Karl Williamson via RT <perlbug-followup [...] perl.org>, Perl5 Porteros <perl5-porters [...] perl.org>
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Date: Mon, 19 Oct 2015 09:16:43 -0600
Download (untitled) / with headers
text/plain 1.2k
On 10/19/2015 09:07 AM, Abigail wrote: Show quoted text
> On Mon, Oct 19, 2015 at 08:33:51AM -0600, Karl Williamson wrote:
>> On 10/19/2015 08:26 AM, Abigail wrote:
>>> $ perl -wE '"aaaaaaaaaa" =~ /a { 1,10}/x; say $& // "UNDEF"' >>> UNDEF >>> $ >>> >>> (In this case, it tries to match the literal string "a{1,10}" -- one >>> can wonder whether that's the most useful thing it could do).
>> >> Note that this now give a warning, even without the '-w' command line >> option: >> >> $ blead -E '"aaaaaaaaaa" =~ /a { 1,10}/x; say $& // "UNDEF"' >> Unescaped left brace in regex is deprecated, passed through in regex; >> marked by <-- HERE in m/a { <-- HERE 1,10}/ at -e line 1. >> UNDEF >> $
> > > > Yes, I know. > > Is that useful? I'd wager that of all the people who write > > /a { 1,10}/x > > more than 99.9% of want it to behave like > > /a {1,10}/x > > and none of them actually want to match the literal string > > "a{1,10}" > > > Abigail (who has put a space after the , in /x{n,m}/ way too often). >
I think the warning is useful, so they aren't silently deceived. It was never useful to behave as it has, but now they are warned. In 5.24, this will not compile. In 5.26, it will behave as you and I both think would be the useful way.
To: Karl Williamson <public [...] khwilliamson.com>
From: Abigail <abigail [...] abigail.be>
CC: demerphq <demerphq [...] gmail.com>, Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Karl Williamson via RT <perlbug-followup [...] perl.org>, Perl5 Porteros <perl5-porters [...] perl.org>
Date: Mon, 19 Oct 2015 17:29:56 +0200
Subject: Re: [perl #116639] regex optimiser wrongly rejects certain matches involving embedded comments
Download (untitled) / with headers
text/plain 1.3k
On Mon, Oct 19, 2015 at 09:16:43AM -0600, Karl Williamson wrote: Show quoted text
> On 10/19/2015 09:07 AM, Abigail wrote:
>> On Mon, Oct 19, 2015 at 08:33:51AM -0600, Karl Williamson wrote:
>>> On 10/19/2015 08:26 AM, Abigail wrote:
>>>> $ perl -wE '"aaaaaaaaaa" =~ /a { 1,10}/x; say $& // "UNDEF"' >>>> UNDEF >>>> $ >>>> >>>> (In this case, it tries to match the literal string "a{1,10}" -- one >>>> can wonder whether that's the most useful thing it could do).
>>> >>> Note that this now give a warning, even without the '-w' command line >>> option: >>> >>> $ blead -E '"aaaaaaaaaa" =~ /a { 1,10}/x; say $& // "UNDEF"' >>> Unescaped left brace in regex is deprecated, passed through in regex; >>> marked by <-- HERE in m/a { <-- HERE 1,10}/ at -e line 1. >>> UNDEF >>> $
>> >> >> >> Yes, I know. >> >> Is that useful? I'd wager that of all the people who write >> >> /a { 1,10}/x >> >> more than 99.9% of want it to behave like >> >> /a {1,10}/x >> >> and none of them actually want to match the literal string >> >> "a{1,10}" >> >> >> Abigail (who has put a space after the , in /x{n,m}/ way too often). >>
> > I think the warning is useful, so they aren't silently deceived. It was > never useful to behave as it has, but now they are warned. > > In 5.24, this will not compile. In 5.26, it will behave as you and I > both think would be the useful way.
Excellent. Abigail
Download (untitled) / with headers
text/plain 252b
Thank you for submitting this report. You have helped make Perl better. With the release of Perl 5.24.0 on May 9, 2016, this and 149 other issues have been resolved. Perl 5.24.0 may be downloaded via https://metacpan.org/release/RJBS/perl-5.24.0


This service is sponsored and maintained by Best Practical Solutions and runs on Perl.org infrastructure.

For issues related to this RT instance (aka "perlbug"), please contact perlbug-admin at perl.org