End of string + 0-Width assertion oddity #2406

p5pRT · 2000-08-19T20:18:25Z

Migrated from rt.perl.org#3762 (status was 'stalled')

Searchable as RT3762$

p5pRT · 2000-08-19T20:18:25Z

From @btilly

Created by ben_tilly@hotmail.com

I actually know the design decisions that led to this. I still think
that this is a bug though:

perl -e '$str = "Hello World\n"; $str =~ s/\r?\n?$/\n/g; print $str;'

Try it. $str winds up with two returns. Once from matching the
"\n" at the end, and once from matching a 0-width assertion. In
the finest DWIM tradition I think that after matching $ you should
not be able to match a zero-width assertion at that point again.
Doing otherwise is likely to be unexpected. (Except when done by
smartasses such as myself. :-)

Another idea that Tye McQueen tossed at me is that instead of just
disallowing 0-width assertions matching twice at the same spot,
disallow having two REs match ending at the same position in a /g.
That rule may be simpler to implement and likewise removes other
surprises, like /x*/ matching both at and after an 'x'.
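
For readers who want to reproduce the report without a shell one-liner, here is a minimal sketch. The variable names `$doubled` and `$single` are illustrative only, and the non-/g form anchored with \z is one possible workaround, not something proposed in the report itself.

```perl
use strict;
use warnings;

my $str = "Hello World\n";

# With /g the pattern matches twice: once consuming the "\n" at offset 11,
# and once more as a zero-length match at offset 12 (end of string).
(my $doubled = $str) =~ s/\r?\n?$/\n/g;

# Without /g, and anchored with \z, only the first match is replaced.
(my $single = $str) =~ s/\r?\n?\z/\n/;

printf "with /g:    %d newline(s)\n", ($doubled =~ tr/\n//);
printf "without /g: %d newline(s)\n", ($single  =~ tr/\n//);
```

Dropping /g sidesteps the second, zero-length match entirely, which is usually what someone normalising a line ending wants.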

Perl Info



Site configuration information for perl 5.00503:

Configured by tilly at Fri May 28 18:22:31 EDT 1999.

Summary of my perl5 (5.0 patchlevel 5 subversion 3) configuration:
  Platform:
    osname=linux, osvers=2.0.34, archname=i386-linux
    uname='linux mcrubs1305 2.0.34 #1 tue aug 25 19:28:36 edt 1998 i586 unknown '
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef useperlio=undef d_sfio=undef
  Compiler:
    cc='gcc', optimize='-O2', gccversion=2.7.2.3
    cppflags='-Dbool=char -DHAS_BOOL -I/usr/local/include'
    ccflags ='-Dbool=char -DHAS_BOOL -I/usr/local/include'
    stdchar='char', d_stdstdio=define, usevfork=false
    intsize=4, longsize=4, ptrsize=4, doublesize=8
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lndbm -ldb -ldl -lm -lc -lposix -lcrypt
    libc=, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:



@INC for perl 5.00503:
    /usr/local/lib/perl5/5.00503/i386-linux
    /usr/local/lib/perl5/5.00503
    /usr/local/lib/perl5/site_perl/5.005/i386-linux
    /usr/local/lib/perl5/site_perl/5.005
    .


Environment for perl 5.00503:
    HOME=/home/tilly
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games:.:.
    PERL_BADLANG (unset)
    SHELL=/bin/bash

________________________________________________________________________
Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com

p5pRT · 2000-08-20T13:58:34Z

From @ysth

In article <LAW2-F130FRaqbC8wSA000014bd@hotmail.com>,
"Ben Tilly" <ben_tilly@hotmail.com> wrote:

perl -e '$str = "Hello World\n"; $str =~ s/\r?\n?$/\n/g; print $str;'

Try it. $str winds up with two returns. Once from matching the
"\n" at the end, and once from matching a 0-width assertion. In
the finest DWIM tradition I think that after matching $ you should
not be able to match a zero-width assertion at that point again.
Doing otherwise is likely to be unexpected. (Except when done by
smartasses such as myself. :-)

Another idea that Tye McQueen tossed at me is that instead of just
disallowing 0-width assertions matching twice at the same spot,
disallow having two REs match ending at the same position in a /g.
That rule may be simpler to implement and likewise removes other
surprises, like /x*/ matching both at and after an 'x'.

I don't understand the problem. It's matching once at pos 11 (with
0 \r's and 1 \n followed by EOL) and once at pos 12 (0 \r's, 0 \n's
and EOL).

The following, on the other hand, does seem a little odd:

perl -Dr -e '$str="Hello World\n"; print $str=~s/\n??$/\n/g," matches!\n"'

p5pRT · 2000-08-20T19:45:59Z

From @btilly

Yitzchak Scott-Thoennes wrote:

In article <LAW2-F130FRaqbC8wSA000014bd@hotmail.com>,
"Ben Tilly" <ben_tilly@hotmail.com> wrote:

perl -e '$str = "Hello World\n"; $str =~ s/\r?\n?$/\n/g; print $str;'

Try it. $str winds up with two returns. Once from matching the
"\n" at the end, and once from matching a 0-width assertion. In
the finest DWIM tradition I think that after matching $ you should
not be able to match a zero-width assertion at that point again.
Doing otherwise is likely to be unexpected. (Except when done by
smartasses such as myself. :-)

Another idea that Tye McQueen tossed at me is that instead of just
disallowing 0-width assertions matching twice at the same spot,
disallow having two REs match ending at the same position in a /g.
That rule may be simpler to implement and likewise removes other
surprises, like /x*/ matching both at and after an 'x'.

I don't understand the problem. It's matching once at pos 11 (with
0 \r's and 1 \n followed by EOL) and once at pos 12 (0 \r's, 0 \n's
and EOL).

Define problem? It is documented in perlre, in a section which is
thoughtfully described as difficult and needing a rewrite. (True
both in 5.005_03 and 5.6.0.)

Now why doesn't it find the second match a few dozen more times?
No real reason except that an exception has been made for it. I
happen to think that the exception as it stands is a little
more confusing than it needs to be.

The following, on the other hand, does seem a little odd:

perl -Dr -e '$str="Hello World\n"; print $str=~s/\n??$/\n/g," matches!\n"'

It works as designed. Match zero times. Try again, can't because
of the exception mentioned above. Match one char. Try again, can
match a zero-width assertion. Try again, finally fail because of
the exception.
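
The sequence described above can be traced without -Dr by recording the start and end offset of each match. This is only a sketch; the `@spans` array is an illustrative name, and `@-` / `@+` are the standard match-offset arrays.

```perl
use strict;
use warnings;

my $str = "Hello World\n";
my @spans;

# Walk the matches one at a time in scalar context and record
# "start-end" for each, using the built-in @- and @+ offset arrays.
while ($str =~ /\n??$/g) {
    push @spans, "$-[0]-$+[0]";
}

print scalar(@spans), " matches: ", join(', ', @spans), "\n";
```

The three spans are an empty match before the final newline, then the newline itself, then an empty match at the very end of the string.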

*shrug*

Ben

p5pRT · 2000-08-20T21:06:48Z

From @ysth

In article <LAW2-F54lLRa8174jDJ0000488b@hotmail.com>,
"Ben Tilly" <ben_tilly@hotmail.com> wrote:

Yitzchak Scott-Thoennes wrote:

In article <LAW2-F130FRaqbC8wSA000014bd@hotmail.com>,
"Ben Tilly" <ben_tilly@hotmail.com> wrote:

perl -e '$str = "Hello World\n"; $str =~ s/\r?\n?$/\n/g; print $str;'

Try it. $str winds up with two returns. Once from matching the
"\n" at the end, and once from matching a 0-width assertion. In
the finest DWIM tradition I think that after matching $ you should
not be able to match a zero-width assertion at that point again.
Doing otherwise is likely to be unexpected. (Except when done by
smartasses such as myself. :-)

Another idea that Tye McQueen tossed at me is that instead of just
disallowing 0-width assertions matching twice at the same spot,
disallow having two REs match ending at the same position in a /g.
That rule may be simpler to implement and likewise removes other
surprises, like /x*/ matching both at and after an 'x'.

I don't understand the problem. It's matching once at pos 11 (with
0 \r's and 1 \n followed by EOL) and once at pos 12 (0 \r's, 0 \n's
and EOL).

Define problem? It is documented in perlre, in a section which is
thoughtfully described as difficult and needing a rewrite. (True
both in 5.005_03 and 5.6.0.)

I guess I should have said I don't understand your proposed solution.

Now why doesn't it find the second match a few dozen more times?
No real reason except that an exception has been made for it. I
happen to think that the exception as it stands is a little
more confusing than it needs to be.

How would you simplify it? From your phrase "after matching $ you
should not be able to match a zero-width assertion at that point
again" it's not clear to me what you are proposing.

The following, on the other hand, does seem a little odd:

perl -Dr -e '$str="Hello World\n"; print $str=~s/\n??$/\n/g," matches!\n"'

It works as designed. Match zero times. Try again, can't because
of the exception mentioned above. Match one char. Try again, can
match a zero-width assertion. Try again, finally fail because of
the exception.

Yes, of course. Now I don't see what was confusing me. :)
I guess I was just getting too caught up in the -Dr output to
really think.

p5pRT · 2000-08-20T21:28:23Z

From @btilly

Yitzchak Scott-Thoennes wrote:

In article <LAW2-F54lLRa8174jDJ0000488b@hotmail.com>,
"Ben Tilly" <ben_tilly@hotmail.com> wrote:

Yitzchak Scott-Thoennes wrote:
[..]
Now why doesn't it find the second match a few dozen more times?
No real reason except that an exception has been made for it. I
happen to think that the exception as it stands is a little
more confusing than it needs to be.

How would you simplify it? From your phrase "after matching $ you
should not be able to match a zero-width assertion at that point
again" it's not clear to me what you are proposing.

Well my initial thought is that someone who asked to match $ is
unlikely to want to match again. This came up because someone was
playing around and got confused that /x*$/g matched 'x' twice.

However I believe that a simpler rule is, "two matches cannot end
at the same place". That covers the current rule about two zero
width assertions, and makes, eg, s/ */ /g more likely to do what
most people would expect.
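
To make the s/ */ /g surprise concrete, here is a small sketch on a made-up input string. The `s/ +/ /g` variant is shown only as the usual way to avoid zero-length matches; it is not part of the proposal.

```perl
use strict;
use warnings;

my $text = "a  b";    # two spaces between the letters

# / */ also matches the empty string, so besides the run of spaces it
# matches at offset 0, again right after the run, and at end of string.
(my $star = $text) =~ s/ */ /g;

# / +/ never matches the empty string, so only the run is replaced.
(my $plus = $text) =~ s/ +/ /g;

print "[$star]\n";
print "[$plus]\n";
```

Under the current rule the run of spaces is replaced by one space and the empty match directly after it adds another, so the two spaces survive.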

The following, on the other hand, does seem a little odd:

perl -Dr -e '$str="Hello World\n"; print $str=~s/\n??$/\n/g," matches!\n"'

It works as designed. Match zero times. Try again, can't because
of the exception mentioned above. Match one char. Try again, can
match a zero-width assertion. Try again, finally fail because of
the exception.

Yes, of course. Now I don't see what was confusing me. :)
I guess I was just getting too caught up in the -Dr output to
really think.

I ran it on a Perl without debugging. Made it easier. :-)

Cheers,
Ben

PS Sorry for the resend, forgot to cc p5p the first time. :-(

p5pRT · 2000-08-21T00:58:49Z

From @ysth

In article <LAW2-F146PaatKGS7hD00003113@hotmail.com>,
"Ben Tilly" <ben_tilly@hotmail.com> wrote:

However I believe that a simpler rule is, "two matches cannot end
at the same place". That covers the current rule about two zero
width assertions, and makes, eg, s/ */ /g more likely to do what
most people would expect.

Let me make sure I'm understanding you. So you would want this:

[D:\home\sthoenna]perl -wle "print map qq:<$_>:, 'abc'=~/.??/g"
<><a><><b><><c><>

to instead output

<><a><b><c>

?? I'm not sure that's less unexpected.
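
A sketch of the comparison being discussed: the first half collects what perl does today, and the second half only emulates the proposed "two matches cannot end at the same place" rule by filtering matches after the fact. A real implementation inside the engine might behave differently for other patterns; `$last_end` and the array names are illustrative.

```perl
use strict;
use warnings;

my $str = 'abc';

# Current behaviour: every match of /.??/g in list context.
my $current = join '', map { "<$_>" } $str =~ /.??/g;

# Emulation of the proposed rule: drop any match that ends at the
# same offset as the match that was kept before it.
my @kept;
my $last_end = -1;
while ($str =~ /.??/g) {
    next if $+[0] == $last_end;
    push @kept, substr($str, $-[0], $+[0] - $-[0]);
    $last_end = $+[0];
}
my $proposed = join '', map { "<$_>" } @kept;

print "current:  $current\n";
print "proposed: $proposed\n";
```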

p5pRT · 2000-08-21T04:10:39Z

From @btilly

Yitzchak Scott-Thoennes wrote:

In article <LAW2-F146PaatKGS7hD00003113@hotmail.com>,
"Ben Tilly" <ben_tilly@hotmail.com> wrote:

However I believe that a simpler rule is, "two matches cannot end
at the same place". That covers the current rule about two zero
width assertions, and makes, eg, s/ */ /g more likely to do what
most people would expect.

Let me make sure I'm understanding you. So you would want this:

[D:\home\sthoenna]perl -wle "print map qq:<$_>:, 'abc'=~/.??/g"
<><a><><b><><c><>

to instead output

<><a><b><c>

?? I'm not sure that's less unexpected.

That would be correct, but I am dubious that /.??/g has any
particularly natural meaning.

Cheers,
Ben

p5pRT · 2000-08-21T05:49:50Z

From [Unknown Contact. See original ticket]

Ben Tilly wrote:

Yitzchak Scott-Thoennes wrote:

In article <LAW2-F146PaatKGS7hD00003113@hotmail.com>,
"Ben Tilly" <ben_tilly@hotmail.com> wrote:

However I believe that a simpler rule is, "two matches cannot end
at the same place". That covers the current rule about two zero
width assertions, and makes, eg, s/ */ /g more likely to do what
most people would expect.

Let me make sure I'm understanding you. So you would want this:

[D:\home\sthoenna]perl -wle "print map qq:<$_>:, 'abc'=~/.??/g"
<><a><><b><><c><>

to instead output

<><a><b><c>

?? I'm not sure that's less unexpected.

That would be correct, but I am dubious that /.??/g has any
particularly natural meaning.

What about

while ($text =~ /$token/g) {
print length($1) if /\G($optional_token)/g;
}

?

If $optional_token matches "" then this would fail? That doesn't seem
as useful as the current rule.

Note that you can't fix this by just resetting the
ends-at-the-same-place flag between ops because then this:

print length($1) while /\G($optional_token)/g;

would loop forever.
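
Both loops can be tried with concrete patterns. In this sketch `$token`, `$optional_token`, and the input strings are made-up stand-ins, and the inner match is bound to the same variable explicitly (the snippet as posted matches against `$_`).

```perl
use strict;
use warnings;

my $token          = qr/foo/;    # hypothetical required token
my $optional_token = qr/,?/;     # hypothetical token that may match ""

# Case 1: an optional token tried right after a required one.
# The empty match after the last "foo" is allowed under the current rule.
my $text = "foo,foo";
my @after_token;
while ($text =~ /$token/g) {
    push @after_token, length($1) if $text =~ /\G($optional_token)/g;
}

# Case 2: repeated \G matching. The loop stops because a second
# zero-length match at the same offset is refused.
my $commas = ",,x";
my @run;
push @run, length($1) while $commas =~ /\G($optional_token)/g;

print "after tokens: @after_token\n";
print "run lengths:  @run\n";
```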

p5pRT · 2000-08-21T06:40:40Z

From @btilly

Rick Delaney wrote:

Ben Tilly wrote:

[...]

That would be correct, but I am dubious that /.??/g has any
particularly natural meaning.

What about
while ($text =~ /$token/g) {
    print length($1) if /\G($optional_token)/g;
}
?

If $optional_token matches "" then this would fail? That doesn't seem
as useful as the current rule.

What about testing several optional tokens in a row at the same
place? The current rule already breaks that!

Is it better to break assumptions early, or late?

Note that you can't fix this by just resetting the
ends-at-the-same-place flag between ops because then this:
print length($1) while /\G($optional_token)/g;
would loop forever.

Yup. Perhaps I should just patch the current explanation to move
it up and clarify? Given that the current behaviour is already
documented, I am probably in the wrong to have suggested anything
else. :-(

p5pRT · 2000-08-21T20:37:52Z

From @ysth

In article <LAW2-F49HXG2U9iTb8k00004ca5@hotmail.com>,
"Ben Tilly" <ben_tilly@hotmail.com> wrote:

Yitzchak Scott-Thoennes wrote:

Let me make sure I'm understanding you. So you would want this:

[D:\home\sthoenna]perl -wle "print map qq:<$_>:, 'abc'=~/.??/g"
<><a><><b><><c><>

to instead output

<><a><b><c>

?? I'm not sure that's less unexpected.

That would be correct, but I am dubious that /.??/g has any
particularly natural meaning.

Agreed. It is good for showing people how the exception works, though.

p5pRT · 2000-08-22T06:31:55Z

From [Unknown Contact. See original ticket]

Ben Tilly wrote:

while ($text =~ /$token/g) {
    print length($1) if /\G($optional_token)/g;
}
If $optional_token matches "" then this would fail? That doesn't seem
as useful as the current rule.
What about testing several optional tokens in a row at the same
place? The current rule already breaks that!

Good point.

Is it better to break assumptions early, or late?
Note that you can't fix this by just resetting the
ends-at-the-same-place flag between ops because then this:
print length($1) while /\G($optional_token)/g;
would loop forever.
Yup. Perhaps I should just patch the current explanation to move
it up and clarify? Given that the current behaviour is already
documented, I am probably in the wrong to have suggested anything
else. :-(

You could always suggest a pragma. I can see value in each of the three
behaviours mentioned.

p5pRT · 2000-08-22T07:41:10Z

From @btilly

Rick Delaney wrote:

Ben Tilly wrote:
while ($text =~ /$token/g) {
    print length($1) if /\G($optional_token)/g;
}
If $optional_token matches "" then this would fail? That doesn't seem
as useful as the current rule.
What about testing several optional tokens in a row at the same
place? The current rule already breaks that!
Good point.

[...]

Yup. Perhaps I should just patch the current explanation to move
it up and clarify? Given that the current behaviour is already
documented, I am probably in the wrong to have suggested anything
else. :-(

You could always suggest a pragma. I can see value in each of the three
behaviours mentioned.

Anyone with enough tuits to use the pragma IMO should be assumed to
have enough tuits to assign to pos() or write an RE that doesn't
match 0-width where you don't want to.
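
As a rough illustration of the pos() route: perlre says the "last match was zero-length" state is attached to the matched string and is reset by any assignment to pos(). The sketch below relies on that documented behaviour; the string and the variable names are made up.

```perl
use strict;
use warnings;

my $str = "ab";

$str =~ /x*/g;              # empty match at offset 0
my $first = pos($str);

$str =~ /x*/g;              # offset 0 is refused now; empty match at offset 1
my $second = pos($str);

pos($str) = 1;              # assigning to pos() clears the zero-length state
$str =~ /x*/g;              # so the empty match at offset 1 is accepted again
my $third = pos($str);

print "$first $second $third\n";
```

Without the assignment, the third match would have to move on to offset 2.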

OTOH the pragma that I *really* want to see is one to force the
RE engine to find how many matches it could have found total, and
warn if that number seems excessive. This would be very useful
for testing scripts for poorly written REs...

No idea how hard it would be though.

Cheers,
Ben

p5pRT · 2000-08-22T08:03:55Z

From @vanstyn

In <LAW2-F64rkgeSvKHJz6000063e1@hotmail.com>, "Ben Tilly" writes:
:OTOH the pragma that I *really* want to see is one to force the
:RE engine to find how many matches it could have found total, and
:warn if that number seems excessive. This would be very useful
:for testing scripts for poorly written REs...
:
:No idea how hard it would be though.

I'm not sure I understand what you mean by 'how many matches it
could have found', but guessing:

The trouble is that it is quite reasonable for the main engine to
report 10^10 theoretically possible matches, while the optimiser
reports that depending on the data it will quickly throw out
anywhere from 0 to 10^10 of them. I'm not sure that it is possible
to go from there to reporting a useful number.

Hugo

p5pRT · 2000-08-22T09:24:16Z

From @btilly

Hugo wrote:

In <LAW2-F164gxdvzhTLSm00006166@hotmail.com>, "Ben Tilly" writes:
:>I'm not sure I understand what you mean by 'how many matches it
:>could have found', but guessing:

I guess I guessed wrong.

:OK, here is an idea of how to do it. Have a pragma that forces
:any run of the RE engine to do a full trial run then a real run.
:For the trial run put at the end of the RE a custom escape (see
:the custom engine stuff in perlre) which always fails but keeps a
:counter of how many times it was reached, bombing out if it passes
:a fixed limit.

Something like (?{ ++$cnt > $limit ? die : '' }), then? Though it
is unfortunate, particularly in the context, that the code is hit
twice for every zero-length match. :(

Yeah, except when it is done go back and try again.
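
A minimal sketch of the counting idea, not the pragma itself: a code block at the end of the pattern bumps a counter, and a forced failure makes the engine backtrack through every way of getting there. It uses (*FAIL), which needs perl 5.10 or later ((?!) is the older spelling); `$count` and `$limit` are illustrative names, and the pattern and input mirror the backtracking example in perlre.

```perl
use strict;
use warnings;

our $count = 0;     # package variable, visible inside the code block
my  $limit = 1000;  # hypothetical threshold for "excessive"

# The code block runs each time the engine reaches the end of the
# pattern; (*FAIL) then rejects that attempt and forces backtracking.
my $matched = 'aaab' =~ /a+b?(?{ $count++ })(*FAIL)/;

print "end of pattern reached $count times\n";
warn "pattern looks expensive on this input\n" if $count > $limit;
```

Because of the forced failure the match as a whole never succeeds, so a real tool would have to run the pattern a second time without the instrumentation, as suggested in the thread.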

:Slow, inefficient, etc. But useful for smoking out poorly written
:REs in test suites. :-)

I had misunderstood: I thought you were talking about the number
of comparisons performed, to catch things like exponential failure
cases.

That is *exactly* what I am talking about. If you run it with
expected data in the mode that I am talking about then you will
get a pretty good handle on how slow the failure case would be
without having to track down each RE and code up a case that would
fail on that RE.

I would imagine this is not a feature most people would consider
using until they already know they have a problem, at which point
there is an array of other debugging mechanisms that are probably
as, if not more, useful. I'm probably still missing the point.

Unless it was mentioned as a wise test to proactively put into
your standard benchmark suite...

The idea is to have an easy way to make Perl search for
potentially inefficient REs, rather than encountering them by
trial and error.

Cheers,
Ben

PS Hugo, sorry for the resend. Forgot to cc p5p.

p5pRT · 2010-04-15T09:12:00Z

@chorny - Status changed from 'open' to 'stalled'

p5pRT added Not a Bug Severity Low distro-Linux type-regex labels Oct 18, 2019

xenu removed the affects-5.5 label Nov 19, 2021

xenu removed the Severity Low label Dec 29, 2021
