
A substitution that never terminates #1975

Closed
p5pRT opened this issue May 15, 2000 · 50 comments

Comments


p5pRT commented May 15, 2000

Migrated from rt.perl.org#3248 (status was 'resolved')

Searchable as RT3248$


p5pRT commented May 15, 2000

From @shields

The following test script causes Mail::Header to enter an infinite
loop in Mail::Header::_fold_line. The header is malformed, but the
failure mode is severe.

---- cut here
#!/usr/bin/perl

use Mail::Internet;

my $mail = <<'EOF';
To: xxxxxxx@xxxxxxx.xxxxx.xxx'" <xxxxxxx@xxxxxxx.xxxxx.xxx>, "'xxxxxxx.x@xxxxxxxxx.xx'" <xxxxxxx.x@xxxxxxxxx.xx>, "'xxxxxxx@xxxxxxx.xxx'" <xxxxxxx@xxxxxxx.xxx>, "'xxxxxxxxxxx@xxxxxxx.xxx'" <xxxxxxxxxxx@xxxxxxx.xxx>, "'xxxx@xxxxxxx.xxx'" <xxxx@xxxxxxx.xxx>

body
EOF

my $message = new Mail::Internet [split /\n/, $mail];

exit 0;
---- cut here

The expression that fails to terminate is:

    $x .= $2 . "\n "
      while($_[0] =~ s/^(\s*(
                          [^"]{$min,$max}?[\,\;]
                         |[^"]{1,$max}[\s\n]
                         |[^\s"]+[\s\n]
                         |([^\s"]+("[^"]*")?)+[\s\n]
                        ))
                      //x);

I pruned this down to the test case:

---- cut here
#!/usr/bin/perl

$foo = <<'EOF';
xxxxxxx@xxxxxxx.xxxxx.xxx'" <xxxxxxx@xxxxxxx.xxxxx.xxx>,
EOF

print "Starting\n";

$foo =~ s/^(\s*(([^\s"]+("[^"]*")?)+[\s\n]))//;

print "Finished\n";

exit 0;
---- cut here

This single substitution never terminates on perl 5.005_03, from
Debian Linux package version 5.005.03-06 and also from FreeBSD
3.4-RELEASE.
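For readers following along today, the blowup can be reproduced safely with any backtracking engine. Here is a minimal sketch in Python, whose `re` module backtracks the same way; the input shapes and the `time_failure` helper are illustrative additions, not part of the original report:

```python
# Reproduce the Mail::Header blowup with a backtracking engine.  The x's
# stand in for the run of non-space characters in the malformed To:
# header; the stray trailing quote guarantees overall failure, so the
# engine tries every way of splitting the run across iterations of the
# outer `+` -- roughly doubling the work per extra character.
import re
import time

PAT = re.compile(r'^(\s*(([^\s"]+("[^"]*")?)+[\s\n]))')

def time_failure(n):
    """Seconds spent failing to match n non-space chars plus a stray quote."""
    s = 'x' * n + '"'
    t0 = time.perf_counter()
    assert PAT.search(s) is None   # no whitespace terminator, so no match
    return time.perf_counter() - t0

if __name__ == '__main__':
    for n in (8, 12, 16):
        print(n, round(time_failure(n), 4))
```

Each step of four characters costs roughly a factor of sixteen, which is why the full To: line above looks like an infinite loop.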

@p5pRT
Copy link
Author

p5pRT commented May 15, 2000

From @tamias

On Tue, May 16, 2000 at 04:40:02AM +0000, Michael Shields wrote:

The following test script causes Mail::Header to enter an infinite
loop in Mail::Header::_fold_line. The header is malformed, but the
failure mode is severe.

---- cut here
#!/usr/bin/perl

$foo = <<'EOF';
xxxxxxx@xxxxxxx.xxxxx.xxx'" <xxxxxxx@xxxxxxx.xxxxx.xxx>,
EOF
EOF

print "Starting\n";

$foo =~ s/^(\s*(([^\s"]+("[^"]*")?)+[\s\n]))//;

print "Finished\n";

exit 0;
---- cut here

This single substitution never terminates on perl 5.005_03, from
Debian Linux package version 5.005.03-06 and also from FreeBSD
3.4-RELEASE.

How long did you wait?

The nested quantifiers make this regex take time exponential in the
length of the target string when it fails to find a match. The regex
should be fixed.

For example:

s/^(\s*[^\s"]+(?:"[^"]*"[^\s"]+)*[\s])//;

Ronald
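The rewrite removes the ambiguity: each quoted chunk is anchored by bare characters, so there is only one way to carve up a run of non-space characters (a later message in the thread loosens it slightly). A quick check using Python's backtracking `re` engine, with an illustrative input:

```python
import re

# Ronald's rewrite: the quoted-string chunks are pinned by "[^\s"]+" on
# both sides, so a run of non-space characters can be consumed in only
# one way -- failure becomes linear instead of exponential.
FIXED = re.compile(r'^(\s*[^\s"]+(?:"[^"]*"[^\s"]+)*[\s])')

pathological = 'x' * 5000 + '"'     # the shape that hangs the old pattern
assert FIXED.search(pathological) is None   # returns instantly

assert FIXED.search('foo bar').group(0) == 'foo '
```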

@p5pRT
Copy link
Author

p5pRT commented May 15, 2000

From @shields

How long did you wait?

Three minutes on a PII-450.

@p5pRT
Copy link
Author

p5pRT commented May 15, 2000

From @shields

I will therefore suggest the following patch to Mail​::Header (v 1.17),
based on your rewritten regex.

--- Header.pm.orig      Tue May 16 05:17:25 2000
+++ Header.pm   Tue May 16 05:26:15 2000
@@ -110,7 +110,7 @@
                            [^"]{$min,$max}?[\,\;]
                           |[^"]{1,$max}[\s\n]
                           |[^\s"]+[\s\n]
-                          |([^\s"]+("[^"]*")?)+[\s\n]
+                          |[^\s"]+(?:"[^"]*"[^\s"]+)*[\s]
                          ))
                        //x);
    $x .= $_[0];

Thanks.

@p5pRT
Copy link
Author

p5pRT commented May 16, 2000

From [Unknown Contact. See original ticket]

Michael Shields <shields@msrl.com> wrote:

This single substitution never terminates on perl 5.005_03, from
Debian Linux package version 5.005.03-06 and also from FreeBSD
3.4-RELEASE.

Ronald has pointed out that it's taking a long time rather than looping
(it takes a few minutes on my platform). But the optimisation
seems to have been improved in perl 5.6.0, so it takes no time at all.

Mike Guy

@p5pRT
Copy link
Author

p5pRT commented May 16, 2000

From @tamias

On Tue, May 16, 2000 at 05:29:05AM +0000, Michael Shields wrote:

I will therefore suggest the following patch to Mail::Header (v 1.17),
based on your rewritten regex.

--- Header.pm.orig Tue May 16 05:17:25 2000
+++ Header.pm Tue May 16 05:26:15 2000
@@ -110,7 +110,7 @@
[^"]{$min,$max}?[\,\;]
|[^"]{1,$max}[\s\n]
|[^\s"]+[\s\n]
- |([^\s"]+("[^"]*")?)+[\s\n]
+ |[^\s"]+(?:"[^"]*"[^\s"]+)*[\s]
  ^^^^

Oops, I should have written that as \s rather than [\s], don't need the
brackets.

\s includes \n, so [\s\n] is redundant; the same change can be made to the
other alternatives.

Ronald


p5pRT commented May 16, 2000

From @gbarr

On Tue, May 16, 2000 at 04:40:02AM +0000, Michael Shields wrote:

The following test script causes Mail::Header to enter an infinite
loop in Mail::Header::_fold_line. The header is malformed, but the
failure mode is severe.

Yes, this regex does go on forever on malformed headers, but I have not
had the time to redo the regex in question.

Code has been added to only perform this regex on "structured" headers,
but there is an assumption that these headers are correctly formed.

If you have any suggestions let me know.

Thanks,
Graham.


p5pRT commented May 16, 2000

From [Unknown Contact. See original ticket]

Ronald J Kimball wrote:

How long did you wait?

The nested quantifiers make this regex take time exponential in the
length of the target string when it fails to find a match. The regex
should be fixed.

For example:

s/^(\s*[^\s"]+(?:"[^"]*"[^\s"]+)*[\s])//;

I think that the regex engine could and should be less sensitive to
this problem. Yes, it can be fooled. But something not dissimilar
to the improvements that Ilya was just talking about in

http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-05/msg00509.html

could address this case. Take the original regex:

s/^(\s*(([^\s"]+("[^"]*")?)+[\s\n]))//;

The key problem is here:

...([^\s"]+("[^"]*")?)+...

because it searches over ways to partition a string on its way to
failing, when there is no way it can matter. Now stare at that
until you go a little cross-eyed, and let me label with numbers
various key positions where you have just matched something and
need to try to match a character:

1([^\s"]2+("3[^"]4*"5)?)

Now at those points here is what you have to try, in order:

1: Match [^\s"] and -> 2
   Fail back

2: Match [^\s"] and -> 2
   Match " and -> 3
   Match [^\s"] and -> 2 # Problem
   Go on to the rest of the RE (not shown)
   Fail back

3: Match [^"] and -> 4
   Match " and -> 5
   Fail back

4: Match [^"] and -> 4
   Match " and -> 5
   Fail back

5: Match [^\s"] and -> 2
   Go on to the rest of the RE (not shown)
   Fail back

And, of course, it tries all of this recursively. So if it
tries something and that eventually fails back, then it will
go on to the next item on the list.

Now the problem, of course, is in 2 which can match [^\s"], fail,
then try doing the same work. If it has to make this decision on
every single character, it gets to double its work each time,
which makes it exponentially bad and turns into our performance
disaster.
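The "ways to partition" driving the doubling can be counted directly: a run of n characters splits into ordered non-empty chunks, one per iteration of the outer `+`, in 2^(n-1) ways. A small counting sketch (mine, for illustration):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def compositions(n):
    r"""Ways to split a run of n chars into ordered non-empty chunks --
    i.e. the distinct iteration patterns ([^\s"]+...)+ must exhaust
    before the overall match can fail."""
    if n == 0:
        return 1                       # nothing left: one (empty) way
    # First chunk takes k chars; the rest is a smaller instance.
    return sum(compositions(n - k) for k in range(1, n + 1))

# Doubles with every character: 1, 2, 4, 8, 16, ...
assert [compositions(n) for n in range(1, 6)] == [1, 2, 4, 8, 16]
```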

Now how could Ilya's suggestion, properly modified, help? Well
what he was talking about was looking for opportunities to do next
character discrimination. Then if someone went to write a regular
expression for all of the states, they would only have to test "N"
once, at that point either eliminating NC ND NE NH NJ NM NV NY, or
else eliminating everything else. However once you have set up next
character discrimination, you have the opportunity to also identify
(and fix) the most common cases of this kind of disaster - just check
when your next character discrimination is going to result in
reaching the same state in 2 ways, and when you notice that eliminate
the second way. (Note, be careful not to do this if the exact thing
matched by a subpattern you are finishing is later used as a
backreference!)

Here is a simple form of another common case:

(\s*\w\s*)+

Analyzing as above:

1(\s2*\w3\s*)+

1: Match \s and -> 2
   Match \w and -> 3
   Fail back

2: Match \s and -> 2
   Match \w and -> 3
   Fail back

3: Match \s and -> 3
   Match \s and -> 2
   Match \w and -> 3
   Go on to the rest of the RE (not shown)
   Fail back

Again, when you think about doing next character discrimination,
you have the chance to spot the most common poorly written regular
expressions and fix them on the fly.
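The fix described here can also be applied by hand: give each whitespace run exactly one home. A demonstration (mine, in Python) that `(?:\s*\w)+\s*` accepts the same strings as `(\s*\w\s*)+` while failing fast:

```python
import re

ambiguous   = re.compile(r'^(\s*\w\s*)+$')     # a space fits either \s*
unambiguous = re.compile(r'^(?:\s*\w)+\s*$')   # each space has one home

# Same language on well-formed input...
for s in ('a', ' a b ', 'ab  cd'):
    assert bool(ambiguous.match(s)) == bool(unambiguous.match(s)) == True

# ...but only the rewrite fails fast on a long non-matching string.
assert unambiguous.match('a ' * 5000 + '!') is None   # returns instantly
```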

Were my C as developed as my math skills, I might try to do this.

But it is not. :-(

Cheers,
Ben


p5pRT commented May 16, 2000

From [Unknown Contact. See original ticket]

M.J.T. Guy writes:

Michael Shields <shields@msrl.com> wrote:

This single substitution never terminates on perl 5.005_03, from
Debian Linux package version 5.005.03-06 and also from FreeBSD
3.4-RELEASE.

Ronald has pointed out that it's taking a long time rather than looping
(it takes a few minutes on my platform.). But the optimisation
seems to have been improved in perl5.6.0, so it takes no time at all.

From a theoretical point of view this optimization is extremely buggy: I
deduced the conditions under which the optimization can be applied, but in
the heat of coding forgot to implement a check for these conditions.

Unfortunately, up to now I could not deduce a test case where the
optimization fails. ;-) :-( :-( :-( This means that I do not know
how to fix this code so that

  a) it always produces correct results;
  b) in typical cases the optimization is enabled.

The conditions I had in mind initially are very restrictive; say,
they are not applicable to the documentation example

  ((\w{1,5}){1,5}){1,5}

Here is the scoop if you want to help me: if a long match is detected,
we create a rectangular table with cells

("position in the REx", "position in the input string")

and keep the pairs of positions which were "visited" during the match. If
we touch the same cell a second time, this means that the first
time we failed, so we fail again (without trying the rest of the REx!).

Here is the problem: there is more state than this. Say, for (\w{1,5}){1,5}$

  Suppose we are at \w again with the "external" count being 3. This
  number 3 is a part of the state of the match; say, it dictates that
  we should match \w{1,5} 2 more times before $.

But we do not keep this number in the table. Now if we already
visited this position in the string before with a different value of
the external count and failed, this does not mean that we will fail
now.

Thus the optimization is not viable. Now if somebody could provide an
example based on the above hints...

Ilya


p5pRT commented May 16, 2000

From [Unknown Contact. See original ticket]

Ben_Tilly@trepp.com writes:

Again, when you think about doing next character discrimination,
you have the chance to spot the most common poorly written regular
expressions and fix them on the fly.

Yes, this is why I stole some time and put next character
discrimination code into 5.005_6*. Currently it is enabled for the
first char only, to enable the STCLASS optimization (knowing which
chars can start a REx).

In short: in some cases we can go as quick as a DFA, and it is easy to
detect these cases, and these cases are typical.

But what to do meanwhile? Answer: use (?>) instead.
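Perl's `(?>...)` makes a subpattern keep only its first, greedy match, cutting off the re-partitioning. Engines without it can emulate the effect with a capturing lookahead, since a lookahead is atomic once satisfied: `(?=(A))\1`. A sketch of that trick applied to the problem group (mine; Python's `re` gained native `(?>...)` only in 3.11):

```python
import re

# (?=(X))\1 emulates (?>X): the lookahead commits to X's first match,
# and later failure cannot reopen it -- \1 just re-consumes the fixed
# capture.  With the repetition made atomic, failure is linear.
atomic = re.compile(r'^\s*(?=((?:[^\s"]+(?:"[^"]*")?)+))\1[\s\n]')

assert atomic.search('x' * 5000 + '"') is None   # no exponential blowup
assert atomic.search('foo bar').group(0) == 'foo '
```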

Ilya


p5pRT commented May 16, 2000

From [Unknown Contact. See original ticket]

Ilya wrote:

Ben_Tilly@trepp.com writes:

Again, when you think about doing next character discrimination,
you have the chance to spot the most common poorly written regular
expressions and fix them on the fly.

Yes, this is why I stole some time and put next character
discrimination code into 5.005_6*. Currently it is enabled for the
first char only, to enable the STCLASS optimization (knowing which
chars can start a REx).

Why not enable it for arbitrary numbers of characters?

In short​: in some cases we can go as quick as a DFA, and it is easy to
detect these cases, and these cases are typical.

About a year ago I figured out how to do a hybrid algorithm which can
do virtually any construct an NFA can, will not have performance
problems on any RE that can be done by a DFA, and removes potential
problems with NFAs when you recurse too far. Unfortunately its
average performance suffers significantly, some of the features that
Perl has (lookaheads and lookbehinds in particular) are problematic,
and it is sufficiently different that I am doubtful it is worth
trying to fit into Perl.

But then again I remember a co-worker who was complaining about the
abysmal performance of Perl reading comma separated data, it turned
out that he was parsing each line with constructs like
/(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/

:-(

But what to do meanwhile? Answer​: use (?>) instead.

I suspect that virtually anyone with sufficient tuits to understand
what (?>) means is likely to have enough tuits to write clear enough
regular expressions that they don't need (?>) for the performance.
(Though the fine control is useful.) The problem is how to get
people who lack those tuits to stay out of trouble.

Personally I think that careful use of /g to write miniature parsing
engines that grab tokens is in some ways an ideal approach. Among
other things, it allows for the expression to get far more
complicated without becoming unmanageable, and it allows for exception
checking to be put in place quite smoothly. It also provides for a
direct solution to problems that regular expressions by themselves do
not (e.g. balanced tags). But I (for one) do not have a good
suggestion for how to get people to use this consistently.
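The "miniature parsing engine" style might look like this today; a token-at-a-time scanner over a header-like input (token names and classes are my own illustration, not Mail::Header's):

```python
import re

# One alternation, one named group per token kind.  finditer advances a
# token at a time; a gap between matches is a parse error, which is
# exactly the exception-checking hook described above.
TOKEN = re.compile(r'''
    (?P<QSTR> "[^"]*"   )   # quoted string
  | (?P<SEP>  [,;]      )   # list separator
  | (?P<ATOM> [^\s",;]+ )   # bare atom (addr-spec, word, ...)
  | (?P<WS>   \s+       )   # whitespace, skipped below
''', re.VERBOSE)

def tokens(s):
    out, pos = [], 0
    for m in TOKEN.finditer(s):
        if m.start() != pos:
            raise ValueError(f'unexpected character at {pos}: {s[pos]!r}')
        pos = m.end()
        if m.lastgroup != 'WS':
            out.append((m.lastgroup, m.group()))
    if pos != len(s):
        raise ValueError(f'unexpected character at {pos}: {s[pos]!r}')
    return out
```

On a malformed header like the one in the original report (an unterminated quote), this raises ValueError in linear time instead of spinning.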

Cheers,
Ben


p5pRT commented May 16, 2000

From [Unknown Contact. See original ticket]

Ben_Tilly@​trepp.com writes​:
|| > But what to do meanwhile? Answer​: use (?>) instead.
||
|| I suspect that virtually anyone with sufficient tuits to understand
|| what (?>) means is likely to have enough tuits to write clear enough
|| regular expressions that they don't need (?>) for the performance.
|| (Though the fine control is useful.) The problem is how to get
|| people who lack those tuits to stay out of trouble.
||
|| Personally I think that careful use of /g to write miniature parsing
|| engines that grab tokens is in some ways an ideal approach. Among
|| other things, it allows for the expression to get far more
|| complicated without becoming unmanageable, and it allows for exception
|| checking to be put in place quite smoothly. It also provides for a
|| direct solution to problems that regular expressions by themselves do
|| not. (eg Balanced tags.) But I (for one) do not have a good
|| suggestion for how to get people to use this consistently.

A module that defined a number of variables containing qr patterns
for a number of purposes might help. It could also contain functions
that construct a pattern based on arguments.

  use re::pat;

  ...

  /$re::pat::balpar/

  ...

  $x = re::pat::balpar( '<([', '>)]' );

Such a module would let non-experts:

  1. avoid writing bad RE's for common tasks that are provided
  in the module
 
  2. provide examples of how to write good RE's, as a hint for
  when they have to deal with a less common task
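A sketch of what such a pattern library might look like, in Python for concreteness. The names `PAT` and `balpar` mirror the hypothetical `re::pat` above; no such module actually ships with Perl or Python:

```python
import re

# A tiny library of precompiled, known-good patterns plus constructor
# functions, so non-experts never hand-roll the dangerous ones.
PAT = {
    'quoted': re.compile(r'"[^"]*"'),                # simple quoted string
    'number': re.compile(r'[+-]?\d+(?:\.\d+)?'),     # int or decimal
}

def balpar(opens, closes):
    """Build a pattern matching one (non-nested) balanced pair per
    open/close character, e.g. balpar('<([', '>)]')."""
    alts = []
    for o, c in zip(opens, closes):
        o_, c_ = re.escape(o), re.escape(c)
        alts.append(f'{o_}[^{o_}{c_}]*{c_}')
    return re.compile('|'.join(alts))

assert balpar('<([', '>)]').search('a <b@c> d').group(0) == '<b@c>'
```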


p5pRT commented May 16, 2000

From [Unknown Contact. See original ticket]

On Tue, May 16, 2000 at 03:40:57PM -0400, Ben_Tilly@trepp.com wrote:

Yes, this is why I stole some time and put next character
discrimination code into 5.005_6*. Currently it is enable for the
first char only, to enable STCLASS optimization (know which chars can
start a REx).

Why not enable it for arbitrary numbers of characters?

How would you use it? There was already an infrastructure to use the
STCLASS info. One needs to create the infrastructure for at least:

  a) optimizing alternation based on the next-char info;

  b) (?>)-izing repetition

  A*B => (?>A*)B

  if A and B cannot match at the same place;

  c) run-time code similar to "b" if there are *some* next-chars which
  allow A to match but not B, or vice versa.

About a year ago I figured out how to do a hybrid algorithm which can
do virtually any construct an NFA can, will not have performance
problems on any RE that can be done by a DFA, and removes potential
problems with NFAs when you recurse too far.

I think with a/b/c above (and better visited-position code) we can go
very close to this with minimal changes to the current REx engine.
(At least when one can optimize "a" to deeper trie-based dispatch
trees).

But then again I remember a co-worker who was complaining about the
abysmal performance of Perl reading comma separated data, it turned
out that he was parsing each line with constructs like
/(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/

:-(

If one changes the code for fixed-substrings to support an arbitrary
number of fixed substrings, we can get enough info to optimize *this*
too (optimizer will find all the 8 commas in the string). How to make
REx runtime use this info from optimizer is a separate question...
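For the comma-separated case there is no need to wait for the optimizer; `split` already does the scan in one linear pass, and agrees with the nine-group regex whenever the line is well-formed (sample data mine):

```python
import re

NINE = re.compile(r'(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)')
line = 'a,b,c,d,e,f,g,h,i'

# With exactly eight commas the greedy groups are forced to line up with
# the fields, so both approaches agree -- but split is one linear pass,
# while the regex can probe a huge number of comma placements, notably
# when the match fails.
assert list(NINE.match(line).groups()) == line.split(',')
```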

But what to do meanwhile? Answer​: use (?>) instead.

I suspect that virtually anyone with sufficient tuits to understand
what (?>) means is likely to have enough tuits to write clear enough
regular expressions that they don't need (?>) for the performance.

What for? Just use (?>).

complicated without becoming unmanageable,

How so? Why do you think a well-commented Perl code with 17 simple
RExen scattered around is better than one equivalent well-commented REx?

Definitely /(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/ expresses
one's intentions well enough.

and it allows for exception checking to be put in place quite
smoothly.

This is an interesting observation. Can we do the same with (a
modification to) (?{}) ?

It also provides for a
direct solution to problems that regular expressions by themselves do
not. (eg Balanced tags.)

Which are handled by REx engine without any problem.

Ilya


p5pRT commented May 16, 2000

From @gbarr

On Tue, May 16, 2000 at 09:40:40AM -0400, Ronald J Kimball wrote:

On Tue, May 16, 2000 at 05:29:05AM +0000, Michael Shields wrote:

I will therefore suggest the following patch to Mail::Header (v 1.17),
based on your rewritten regex.

--- Header.pm.orig Tue May 16 05:17:25 2000
+++ Header.pm Tue May 16 05:26:15 2000
@@ -110,7 +110,7 @@
[^"]{$min,$max}?[\,\;]
|[^"]{1,$max}[\s\n]
|[^\s"]+[\s\n]
- |([^\s"]+("[^"]*")?)+[\s\n]
+ |[^\s"]+(?:"[^"]*"[^\s"]+)*[\s]

While this may fix the forever loop, it causes test cases to fail.

Oops, I should have written that as \s rather than [\s], don't need the
brackets.

\s includes \n, so [\s\n] is redundant; the same change can be made to the
other alternatives.

Yes, for some reason I always do that.

Graham.


p5pRT commented May 16, 2000

From @tamias

On Tue, May 16, 2000 at 09:55:51PM +0100, Graham Barr wrote:

On Tue, May 16, 2000 at 09:40:40AM -0400, Ronald J Kimball wrote:

On Tue, May 16, 2000 at 05:29:05AM +0000, Michael Shields wrote:

I will therefore suggest the following patch to Mail::Header (v 1.17),
based on your rewritten regex.

--- Header.pm.orig Tue May 16 05:17:25 2000
+++ Header.pm Tue May 16 05:26:15 2000
@@ -110,7 +110,7 @@
[^"]{$min,$max}?[\,\;]
|[^"]{1,$max}[\s\n]
|[^\s"]+[\s\n]
- |([^\s"]+("[^"]*")?)+[\s\n]
+ |[^\s"]+(?:"[^"]*"[^\s"]+)*[\s]

While this may fix the forever loop, it causes test cases to fail.

I think it fails because the new regex requires at least one [^\s"] after
the last quoted string, where the old one didn't.

Try this instead:

[^\s"]+(?:"[^"]*"[^\s"]*)*\s

That should still avoid the exponential failure time.
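The difference between the three candidates is easy to pin down on a chunk that *ends* with a quoted string (comparison done with Python's `re` engine; the sample string is mine):

```python
import re

old  = re.compile(r'^(\s*(([^\s"]+("[^"]*")?)+[\s\n]))')    # exponential
fix1 = re.compile(r'^(\s*[^\s"]+(?:"[^"]*"[^\s"]+)*[\s])')  # too strict
fix2 = re.compile(r'^(\s*[^\s"]+(?:"[^"]*"[^\s"]*)*\s)')    # Ronald's final

s = 'ab"cd" rest'
assert old.search(s).group(0)  == 'ab"cd" '
assert fix1.search(s) is None   # demands [^\s"]+ after the quotes
assert fix2.search(s).group(0) == 'ab"cd" '

# The final form still fails fast on the pathological shape.
assert fix2.search('x' * 5000 + '"') is None
```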

Oops, I should have written that as \s rather than [\s], don't need the
brackets.

\s includes \n, so [\s\n] is redundant; the same change can be made to the
other alternatives.

Yes, for some reason I always do that.

It's redundant, but it's also explicit. Works either way. :)

Ronald


p5pRT commented May 16, 2000

From [Unknown Contact. See original ticket]

On Tue, May 16, 2000 at 07:27:51PM -0400, Ben_Tilly@trepp.com wrote:

One needs to create the infrastructure for at least:

a) optimizing alternation based on the next-char info;

b) (?>)-izing repetition

A*B => (?>A*)B

 if A and B cannot match at the same place;

c) run-time code similar to "b" if there are *some* next-chars which
allow A to match but not B, or vice versa.

You only need a) to be useful.

Do you mean "a) alone would be useful too"?

Just make it possible to put off the
decision about whether you are following path A or B as long as
feasible.

Exactly the opposite. Make this decision as early as possible, so
that you do not need to check A,B,...,C in (A|B|...|C|M|...) if they
cannot match.

As for b) and c), I presume you mean that B will match there only if
A does, thereby making a run-time test for B matching superfluous...

Exactly the opposite. If A and B cannot match at the same position,
then A*B is equivalent to (?>A*)B.

I think with a/b/c above (and better visited-position code) we can go
very close to this with minimal changes to the current REx engine.
(At least when one can optimize "a" to deeper trie-based dispatch
trees).

DANGER. Better visited position code needs to know for each position
in the RE the full relevant state that can matter. This includes for
fixed repetitions the repetition count, and for backreferences that
actually get used, the backreference. This information can get
prohibitive very fast if you have a long match which is long because
of back references. (Read exponential memory usage!)

This is not how I see it. You determine (in compile time) whether the
repetition counts are going to be needed. Say, I would guess they are
not needed for {n,infty} limits and for {n,n+1} limits (thus for all
the simple cases *+? !). With backreferences it is even easier​: in
the current implementation when \1 is accessed, all the
visited-position data is marked as invalid.

I would suggest a fix that consists of unravelling all fixed
repetitions before compiling so that the only state that matters
is the backreferences that get used in the RE

Easier: just keep a flag that "we are inside a fixed run of {n,m}".
For {n,infty} we can collect the matched-position info even when this
flag is present, but we only *use* this position info when the flag is
not set.

But then again I remember a co-worker who was complaining about the
abysmal performance of Perl reading comma separated data, it turned
out that he was parsing each line with constructs like
/(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/

:-(

If one changes the code for fixed-substrings to support an arbitrary
number of fixed substrings, we can get enough info to optimize *this*
too (optimizer will find all the 8 commas in the string). How to make
REx runtime use this info from optimizer is a separate question...

With sufficient analysis you can optimize anything. Do you want to
put that effort out for a special case like this?

Multiple-fixed-substring is not a special case. We already do 2 fixed
substrings (one anchored, another floating), and the code for the
general case may be even simpler (as happens from time to time). The
problem is how to communicate this info to the REx engine, so it can
optimize some A*B if it knows that B cannot match, say, for the next
200 positions.

Besides which your visited positions optimization will optimize the
above case as well.

No. We do not use this for SIMPLE* - the code for SIMPLE* is very
quick now, and this could make it significantly slower.

What for? Just use (?>).

What tool do I reach for first? I mostly write simple REs so I would
rarely benefit.

?? Whenever you want * to be maximally-greedy, use (?>). As easy as that.

How so? Why do you think a well-commented Perl code with 17 simple
RExen scattered around is better than one equivalent well-commented REx?

Different mode of thought. Additionally the best goal is frequently
to populate an arbitrarily complex data structure and then work off
of that. RE engines are not generally designed to do that, and even
if they were, that would effectively mean learning a second language
to figure out how to get that capability.

If you use RExen, it is not a second language. You call RExen from
Perl anyway, so calling Perl from RExen would come as a natural
extension. Well, some people even think that s/a/ EXPR /e makes the REx
engine call EXPR, so they would be easy fodder... ;-)

The basic mode of thought is that regular expressions are a great
way to identify the next token in a data stream, but to manipulate
said tokens in complex ways, you really want to have a full
programming language.

From time to time: yes. So use (?{}). ;-)

And if you need to backtrack, then having one REx would be much easier
than calling many RExen from a parser...

Ilya


p5pRT commented May 16, 2000

From [Unknown Contact. See original ticket]

On Tue, May 16, 2000 at 03:40:57PM -0400, Ben_Tilly@trepp.com wrote:

About a year ago I figured out how to do a hybrid algorithm which can
do virtually any construct an NFA can, will not have performance
problems on any RE that can be done by a DFA, and removes potential
problems with NFAs when you recurse too far.

Maybe *you* can clear this up for me: how do DFAs without memory implement
the POSIX semantics? All I can invent without using memory is: color the
chars of the string which are inside possible matches red, and make the
DFA output the end of the first red region.
Ilya


p5pRT commented May 16, 2000

From [Unknown Contact. See original ticket]

Ronald J Kimball writes:

\s includes \n, so [\s\n] is redundant; the same change can be made to the
other alternatives.

Yes, for some reason I always do that.

It's redundant, but it's also explicit. Works either way. :)

And without `use locale' they produce identical RExen...

Ilya


p5pRT commented May 16, 2000

From [Unknown Contact. See original ticket]

Ilya wrote:

On Tue, May 16, 2000 at 03:40:57PM -0400, Ben_Tilly@trepp.com wrote:

Yes, this is why I stole some time and put next character
discrimination code into 5.005_6*. Currently it is enabled for the
first char only, to enable the STCLASS optimization (knowing which
chars can start a REx).

Why not enable it for arbitrary numbers of characters?

How would you use it? There was already an infrastructure to use the
STCLASS info. One needs to create the infrastructure for at least:

a) optimizing alternation based on the next-char info;

b) (?>)-izing repetition

A*B => (?>A*)B

 if A and B cannot match at the same place;

c) run-time code similar to "b" if there are *some* next-chars which
allow A to match but not B, or vice versa.

You only need a) to be useful. Just make it possible to put off the
decision about whether you are following path A or B as long as
feasible.

As for b) and c), I presume you mean that B will match there only if
A does, thereby making a run-time test for B matching superfluous...

About a year ago I figured out how to do a hybrid algorithm which can
do virtually any construct an NFA can, will not have performance
problems on any RE that can be done by a DFA, and removes potential
problems with NFAs when you recurse too far.

I think with a/b/c above (and better visited-position code) we can go
very close to this with minimal changes to the current REx engine.
(At least when one can optimize "a" to deeper trie-based dispatch
trees).

DANGER. Better visited position code needs to know for each position
in the RE the full relevant state that can matter. This includes for
fixed repetitions the repetition count, and for backreferences that
actually get used, the backreference. This information can get
prohibitive very fast if you have a long match which is long because
of back references. (Read exponential memory usage!)

I would suggest a fix that consists of unravelling all fixed
repetitions before compiling so that the only state that matters
is the backreferences that get used in the RE, and then throwing
away visited position information whenever a backreference that
appears anywhere in the RE gets modified. That is both safe and
it covers almost all of the common cases that could arise.

But then again I remember a co-worker who was complaining about the
abysmal performance of Perl reading comma separated data, it turned
out that he was parsing each line with constructs like
/(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/

:-(

If one changes the code for fixed-substrings to support an arbitrary
number of fixed substrings, we can get enough info to optimize *this*
too (optimizer will find all the 8 commas in the string). How to make
REx runtime use this info from optimizer is a separate question...

With sufficient analysis you can optimize anything. Do you want to
put that effort out for a special case like this? I think that it
was best in the long run that I had the opportunity to explain the
joys of split, which made his code not only faster but a lot clearer
as well.

Besides which your visited positions optimization will optimize the
above case as well.

But what to do meanwhile? Answer​: use (?>) instead.

I suspect that virtually anyone with sufficient tuits to understand
what (?>) means is likely to have enough tuits to write clear enough
regular expressions that they don't need (?>) for the performance.

What for? Just use (?>).

What tool do I reach for first? I mostly write simple REs so I would
rarely benefit. Therefore when I do need to think about the potential
for a problem, I find it easier to just be unambiguous than to go
looking in the documentation for an option that I don't remember,
which other people who need to read my code are unlikely to have ever
heard about either.

Besides which the simple expedient of writing clear REs is a skill
that transfers to all other tools where you might need it. Somehow
I find that even "Perl compatible regular expressions" are not truly
Ilya compatible. :-)

complicated without becoming unmanageable,

How so? Why do you think a well-commented Perl code with 17 simple
RExen scattered around is better than one equivalent well-commented REx?

Different mode of thought. Additionally the best goal is frequently
to populate an arbitrarily complex data structure and then work off
of that. RE engines are not generally designed to do that, and even
if they were, that would effectively mean learning a second language
to figure out how to get that capability.

The basic mode of thought is that regular expressions are a great
way to identify the next token in a data stream, but to manipulate
said tokens in complex ways, you really want to have a full
programming language.

Definitely /(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/ expresses
one's intentions well enough.

Split does it better.

and it allows for exception checking to be put in place quite
smoothly.

This is an interesting observation. Can we do the same with (a
modification to) (?{}) ?

Not that I can see. A trivial example of why not would be an
exception that means you need to read more data in. This arises
naturally if, for instance, there turned out to be an embedded return
in a quoted field in a CSV file.

It also provides for a
direct solution to problems that regular expressions by themselves do
not. (eg Balanced tags.)

Which are handled by REx engine without any problem.

Right, another of those extensions that I don't know to reach for.

I don't believe that it existed in 5.005 either, which makes it an
extension that does not exist in any version of Perl I would feel
comfortable putting in production at the moment.

The idea of a parse engine is another idea that I can take with me and
use in other languages...

Cheers,
Ben


p5pRT commented May 17, 2000

From [Unknown Contact. See original ticket]

Ilya Zakharevich wrote​:

On Tue, May 16, 2000 at 07​:27​:51PM -0400, Ben_Tilly@​trepp.com wrote​:

One needs to create the infrastructure for at least​:

a) to optimize alternation basing on the next-char info;

b) to (?>)-ize repetition

A*B => (?>A*)B

 if A and B cannot match at the same place

c) Run-time code similar to "b" if there are *some* next-chars which
allow A to match but not B, or vice versa.

You only need a) to be useful.

Do you mean "a) alone would be useful too"?

Yup.

Just make it possible to put off the
decision about whether you are following path A or B as long as
feasible.

Exactly the opposite. Make this decision as early as possible, so
that you do not need to check A,B,...,C in (A|B|...|C|M|...) if they
cannot match.

We are talking past each other. If someone writes
/Alaska|Alabama|Indiana/, it would be a good thing to effectively make
this the same as /(?:Ala(?:ska|bama))|Indiana/. Discriminate between
A.* and Indiana ASAP, but also don't bother needlessly backtracking on
"Allah".

As for b) and c), I presume you mean that B will match there only if
A does, thereby making a run-time test for B matching superfluous...

Exactly the opposite. If A and B cannot match at the same position,
then A*B is equivalent to (?>A*)B.

I think you missed my point here. Consider the simple expression​:
/^(\s*yada\s*)+$/. This is poorly written. When you analyze it
carefully, after matching yada, if you encounter a space, that space
can match either after yada, or before the next yada. If it is
possible for that space to match before yada and successfully complete
that match, then it is possible for that space to match after yada and
complete that match. Which match will the engine actually find?

Knowing this fact, we can know that after matching yada, we should
never bother backtracking from trying to match a space after yada to
trying to match one before - the second is wasted work. Guaranteed.
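
A sketch of the ambiguity and one way a programmer can remove it by hand (the rewrite is illustrative, not the only possible one):

```perl
use strict;

# Ambiguous: the \s* on each side of "yada" can trade the same
# whitespace back and forth between iterations, so backtracking
# retries many equivalent carvings of the string.
my $slow = qr/^(\s*yada\s*)+$/;

# Rewritten so each run of whitespace belongs to exactly one place;
# there is nothing to trade on backtracking.
my $fast = qr/^\s*yada(?:\s+yada)*\s*$/;
```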

I think with a/b/c above (and better visited-position code) we can go
very close to this with minimal changes to the current REx engine.
(At least when one can optimize "a" to deeper trie-based dispatch
trees).

DANGER. Better visited position code needs to know for each position
in the RE the full relevant state that can matter. This includes for
fixed repetitions the repetition count, and for backreferences that
actually get used, the backreference. This information can get
prohibitive very fast if you have a long match which is long because
of back references. (Read exponential memory usage!)

This is not how I see it. You determine (in compile time) whether the
repetition counts are going to be needed. Say, I would guess they are
not needed for {n,infty} limits and for {n,n+1} limits (thus for all
the simple cases *+? !). With backreferences it is even easier​: in
the current implementation when \1 is accessed, all the
visited-position data is marked as invalid.

Ack! No!

Throwing away the optimization when you notice that \1 is used is
fine, though throwing it away at runtime when you notice a change
instead of at compile time will make the engine more robust. I
agree that you do not need it for +, ?, and *. But neither
generalization holds, and here are simple test cases that show it​:

  "yyyx" =~ /y{1,2}x/
  "yyyx" =~ /y*y{2,}x/

I would suggest a fix that consists of unravelling all fixed
repetitions before compiling so that the only state that matters
is the backreferences that get used in the RE

Easier​: just keep the flag that "we are inside a fixed run of {n,m}".
For {n,infty} we can collect the matched position info even when this
flag is present, but we do not *use* this position unless this flag is
not set.

Good point. :-)

[ ...
  Discussion on /(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/
  truncated
  ...]

Multiple-fixed-substring is not a special case. We already do 2 fixed
substrings (one anchored, another floating), and the code for a
general case may be even simpler (as it happens time to time). The
problem is how to communicate this info to the REx engine, so it can
optimize some A*B if it knows that B cannot match, say, for the next
200 positions.

I see your point but I will have to think about whether I agree.

My understanding is that searching for fixed strings is done first
because you can search for long fixed sub-strings faster than you can
scan through a string..?

Besides which your visited positions optimization will optimize the
above case as well.

No. We do not use this for SIMPLE* - the code for SIMPLE* is very
quick now, and this could make it significantly slower.

Where is the change-over? The code above is pretty slow, and if you
use it for scanning a file it becomes abysmally so. No matter how
much you optimize it, I am glad I got to lecture about the joys of
split earlier rather than later.

What for? Just use (?>).

What tool do I reach for first? I mostly write simple REs so I would
rarely benefit.

?? Whenever you want * to be maximally-greedy, use (?>). As easy as that.

Whenever you want * to be maximally-greedy, make it unambiguously not
match anything that what follows can match. As easy as that, and it
works in every tool that provides REs.
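
For instance, the classic quoted-string case (the sample text is made up). The character-class version can never consume the delimiter, so there is nothing left for the engine to backtrack over, with no `(?>)` required:

```perl
use strict;

my $text = '"hello" world';

# Ambiguous: .* happily swallows the closing quote and must give it back.
my ($inner_dot) = $text =~ /"(.*)"/;

# Unambiguous: [^"]* cannot consume the delimiter, so the first match
# attempt is the only one -- and this works in any RE tool.
my ($inner_cls) = $text =~ /"([^"]*)"/;
```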

[...]

Different mode of thought. Additionally the best goal is frequently
to populate an arbitrarily complex data structure and then work off
of that. RE engines are not generally designed to do that, and even
if they were, that would effectively mean learning a second language
to figure out how to get that capability.

If you use RExen, it is not a second language. You call RExen from
Perl anyway, so calling Perl from RExen would come as a natural
extension. Well, some people even think that s/a/ EXPR /e makes REx
engine call EXPR, so they would be an easy fodder... ;-)

I agree with something Larry once said, the point of REs is to provide
nicely encapsulated line-noise for pattern matching. I am not a fan
of embedding complex logic in line-noise.

The basic mode of thought is that regular expressions are a great
way to identify the next token in a data stream, but to manipulate
said tokens in complex ways, you really want to have a full
programming language.

Time-to-time​: yes. So use (?{}). ;-)

And if you need to backtrack, then having one REx would be much easier
than calling many RExen from a parser...

And there I see what made me uncomfortable with the idea.

If you wish to extend (?{}) to be useful, here are the things that
I would need to see​:

1) (?{forward_code}{backtrack_code}). If you have to backtrack, give
  a way for people to undo what they just did with the forward code.

2) (?\/) No backtracking past here. Many parsing problems come down
  to identifying tokens, and then doing things with them. One of the
  strengths of a parsing engine like I described is that it lets you
  execute code safely in the conviction that you can view a unit of
  work as being complete. This greatly simplifies the thinking
  process - indeed not understanding backtracking is the cause of
  most poorly designed regular expressions.

Without the ability for embedded code to either execute knowing that
it does not have to worry about backtracking, or to execute knowing
that it can undo what it did if backtracking happens, it is very hard
to meaningfully use intermediate information in the execution of
RExen.

Even then I am not sure how often I would use it since I don't face
many complex parsing problems.

Cheers,
Ben


p5pRT commented May 17, 2000

From [Unknown Contact. See original ticket]

On Wed, May 17, 2000 at 05​:21​:39PM -0400, Ben_Tilly@​trepp.com wrote​:

Just make it possible to put off the
decision about whether you are following path A or B as long as
feasible.

Exactly the opposite. Make this decision as early as possible, so
that you do not need to check A,B,...,C in (A|B|...|C|M|...) if they
cannot match.

We are talking past each other. If someone writes
/Alaska|Alabama|Indiana/, it would be a good thing to effectively make
this the same as /(?:Ala(?:ska|bama))|Indiana/. Discriminate between
A.* and Indiana ASAP, but also don't bother needlessly backtracking on
"Allah".

This is the target, but I do not see any relationship to your sentence
"put off decision as long as possible".

As for b) and c), I presume you mean that B will match there only if
A does, thereby making a run-time test for B matching superfluous...

Exactly the opposite. If A and B cannot match at the same position,
then A*B is equivalent to (?>A*)B.

I think you missed my point here. Consider the simple expression​:
/^(\s*yada\s*)+$/. This is poorly written. When you analyze it
carefully, after matching yada, if you encounter a space, that space
can match either after yada, or before the next yada. If it is
possible for that space to match before yada and successfully complete
that match, then it is possible for that space to match after yada and
complete that match. Which match will the engine actually find?

Finally, this is documented in 5.6.0 ;-) (if you do not consider fairy
tales about "Backtracking" which worked as documentation substitutes
before).

Knowing this fact, we can know that after matching yada, we should
never bother backtracking from trying to match a space after yada to
trying to match one before - the second is wasted work. Guaranteed.

I think this is a fairly complicated optimization. The sniffer would
need to know *a lot* to deduce this.

DANGER. Better visited position code needs to know for each position
in the RE the full relevant state that can matter. This includes for
fixed repetitions the repetition count, and for backreferences that
actually get used, the backreference. This information can get
prohibitive very fast if you have a long match which is long because
of back references. (Read exponential memory usage!)

This is not how I see it. You determine (in compile time) whether the
repetition counts are going to be needed. Say, I would guess they are
not needed for {n,infty} limits and for {n,n+1} limits (thus for all
the simple cases *+? !). With backreferences it is even easier​: in
the current implementation when \1 is accessed, all the
visited-position data is marked as invalid.

Ack! No!

Throwing away the optimization when you notice that \1 is used is
fine, though throwing it away at runtime when you notice a change
instead of at compile time will make the engine more robust.

"Will make"? As I said, it is doing it now (note "accessed").

I
agree that you do not need it for +, ?, and *. But neither
generalization holds, and here are simple test cases that show it​:

"yyyx" =~ /y{1,2}x/
"yyyx" =~ /y*y{2,}x/

Shows what?

My understanding is that searching for fixed strings is done first
because you can search for long fixed sub-strings faster than you can
scan through a string..?

I think this is more than an order of magnitude faster. Probably two.

Besides which your visited positions optimization will optimize the
above case as well.

No. We do not use this for SIMPLE* - the code for SIMPLE* is very
quick now, and this could make it significantly slower.

Where is the change-over? The code above is pretty slow,

But polynomial. Well, I agree that for all practical purposes n^9 is
not a polynomial. ;-)

The changeover is governed by what is repeated by {n,m}. If it
constant-length, no-side-effects-except-setting-a-group, then we can
match it without any real-backtracking. Such cases are quick, and
they are not involved in the above optimization.

Of course, if you have 9 such groups in turn, this gives you n^9 time...

?? Whenever you want * to be maximally-greedy, use (?>). As easy as that.

Whenever you want * to be maximally-greedy, make it unambiguously not
match anything that what follows can match.

Just compare which recipe is easier-to-use and expresses your
intentions better.

If you wish to extend (?{}) to be useful, here are the things that
I would need to see​:

1) (?{forward_code}{backtrack_code}). If you have to backtrack, give
a way for people to undo what they just did with the forward code.

There is a way​: use localization and DESTROY, or DESTROY on $^R (which
is autolocalized). It is not pretty, but it is there. If a
consistent usage mode appears, one can make a syntactic sugar for
this. Without proof that it is needed, I do not want to make things
harder.

Some functionality like this is badly needed, but I'm not sure how it
should actually look from the user point of view.

2) (?\/) No backtracking past here.

This is (?>), but (?>) is better​: nobody could explain what the above
sentence *actually means*. (?>) provides a scope to limit
backtracking-prohibition to, thus makes things defined.
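
The scoping point can be shown in two lines (a minimal sketch). Inside `(?>...)` the quantifier keeps what it grabbed; outside the group, backtracking still works normally:

```perl
use strict;

# a+ alone gives one 'a' back so the literal "ab" can match ...
my $plain  = "aaab" =~ /a+ab/     ? 1 : 0;

# ... but inside (?>) it keeps all three a's, and the prohibition
# ends at the closing paren: only this group is atomic.
my $atomic = "aaab" =~ /(?>a+)ab/ ? 1 : 0;
```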

Ilya


p5pRT commented May 17, 2000

From [Unknown Contact. See original ticket]

Ilya wrote​:

On Wed, May 17, 2000 at 05​:21​:39PM -0400, Ben_Tilly@​trepp.com wrote​:
[...]

We are talking past each other. If someone writes
/Alaska|Alabama|Indiana/, it would be a good thing to effectively make
this the same as /(?:Ala(?:ska|bama))|Indiana/. Discriminate between
A.* and Indiana ASAP, but also don't bother needlessly backtracking on
"Allah".

This is the target, but I do not see any relationship to your sentence
"put off decision as long as possible".

While matching /Ala/ you are putting off the decision about whether
you are doing the work to wind up matching Alaska or Alabama for 3
characters.

I interpreted your claim about only optimizing one character of
next-character discrimination as, in this situation, making the match
look like /(?:A(?:laska|labama))|Indiana/, which is an improvement but
not a big one.

[...]

I think you missed my point here. Consider the simple expression​:
/^(\s*yada\s*)+$/. This is poorly written. When you analyze it
carefully, after matching yada, if you encounter a space, that space
can match either after yada, or before the next yada. If it is
possible for that space to match before yada and successfully complete
that match, then it is possible for that space to match after yada and
complete that match. Which match will the engine actually find?

Finally, this is documented in 5.6.0 ;-) (if you do not consider fairy
tales about "Backtracking" which worked as documentation substitutes
before).

It was always documented... in the source code.

Some forms of documentation are easier to read, that is all.

Knowing this fact, we can know that after matching yada, we should
never bother backtracking from trying to match a space after yada to
trying to match one before - the second is wasted work. Guaranteed.

I think this is a fairly complicated optimization. The sniffer would
need to know *a lot* to deduce this.

Read back a few emails.

If it tried to deduce this from scratch, yes. But if you aggressively
pursue trieing first, the internal representation of this RE should
reduce down to a representation in which it is a trivial deduction.
(One which does not, unfortunately, directly get represented by any
RE.)

I laid out detailed examples before. The point is that in trying to
optimize the match, you will wind up merging the two choices of
how to match \s* into one state where you are matching \s* and then
trying to figure out where you put the parentheses - then it is
trivial to notice the optimization.

Try it by hand and you should see what I mean.

[...]

This is not how I see it. You determine (in compile time) whether the
repetition counts are going to be needed. Say, I would guess they are
not needed for {n,infty} limits and for {n,n+1} limits (thus for all
the simple cases *+? !). With backreferences it is even easier​: in
the current implementation when \1 is accessed, all the
visited-position data is marked as invalid.

Ack! No!

Throwing away the optimization when you notice that \1 is used is
fine, though throwing it away at runtime when you notice a change
instead of at compile time will make the engine more robust.

"Will make"? As I said, it is doing it now (note "accessed").

I meant that were you to still attempt the optimization on RExen where
\1 gets accessed, but throw away the collected information every time
\1 changes value, then some performance disasters would be fixed.

This is worthwhile.

I
agree that you do not need it for +, ?, and *. But neither
generalization holds, and here are simple test cases that show it​:

"yyyx" =~ /y{1,2}x/

Whoops, I had a braino. My example above should have been

  "yyyx" =~/y*y{2,3}x/

"yyyx" =~ /y*y{2,}x/

Shows what?

They are regular expressions that take the forms you thought you could
use a visited-position optimization on without repetition counts which
match without the optimization and do not match with it.

In both of them the issue is that when you backtrack on the y* you
will fail y{2,} at first because y* matched all but none, and all
but 1, and after that you should succeed but the optimization will
stop you dead in your tracks.
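
For the record, the match does succeed under a stock perl, which is exactly why a visited-position rule that ignored the pending repetition count would be wrong:

```perl
use strict;

# y* first grabs all three y's; the match succeeds only after the
# engine gives two of them back so y{2,3} can be satisfied.  Marking
# the position as "cannot match y" on the first failure would cut
# that retry off and wrongly report no match.
my $matched = "yyyx" =~ /y*y{2,3}x/ ? 1 : 0;
```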

[...]

The changeover is governed by what is repeated by {n,m}. If it
constant-length, no-side-effects-except-setting-a-group, then we can
match it without any real-backtracking. Such cases are quick, and
they are not involved in the above optimization.

Of course, if you have 9 such groups in turn, this gives you n^9 time...

From your description I understood that if the RE took above some
runtime threshold, then you retried with the optimization on. The
example /(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/ will be
sped up by the optimization from n^9 behaviour to n^2. While this
is not as good as the linear split, in many cases it is acceptable.

So, does the optimization hit in this case?

?? Whenever you want * to be maximally-greedy, use (?>). As easy as that.

Whenever you want * to be maximally-greedy, make it unambiguously not
match anything that what follows can match.

Just compare which recipe is easier-to-use and expresses your
intentions better.

To the computer, to me, or to whoever is unlucky enough to read my
code? The computer probably likes the cleaner (?>), I forget what it
means so the clean RE is probably easier for me, and whoever is
unlucky enough to read my code is unlikely to know about (?>) unless
they do little but hack Perl.

Plus the skill of writing unambiguous RExen is of general use.

If you wish to extend (?{}) to be useful, here are the things that
I would need to see​:

1) (?{forward_code}{backtrack_code}). If you have to backtrack, give
a way for people to undo what they just did with the forward code.

There is a way​: use localization and DESTROY, or DESTROY on $^R (which
is autolocalized). It is not pretty, but it is there. If a
consistent usage mode appears, one can make a syntactic sugar for
this. Without proof that it is needed, I do not want to make things
harder.

AFAICS localization of variables is insufficient if your goal is to
populate a parse tree. Plus it is inefficient to produce and destroy
stacks of variables when you could just increment and decrement a
counter.

Some functionality like this is badly needed, but I'm not sure how it
should actually look from the user point of view.

2) (?\/) No backtracking past here.

This is (?>), but (?>) is better​: nobody could explain what the above
sentence *actually means*. (?>) provides a scope to limit
backtracking-prohibition to, thus makes things defined.

This is an independent idea from (?>). Consider this​:

  /
  \G
  (
  (> # Hmmm...the shoe fits :-)
  "(?\/)(?​:[^"]|"")*"
  |
  [^,\n]*
  )
  )
  (?=,|\n|$)
  /gx

This will match the next field in a .csv format. The point is that if
the field starts with a ", then it is a quoted field, otherwise it is
an unquoted field. How can I say that now?

  /
  \G
  (
  (> # Hmmm...the shoe fits :-)
  "(?​:[^"]|"")*"
  |
  (?!")[^,\n]*
  )
  )
  (?=,|\n|$)
  /gx

Oops, I have to embed the rule about what started a quoted field
everywhere except where I actually wanted to see it! This is an
inefficient way to express myself, and for a complex format this
could quickly get prohibitively complex to keep putting all of the
negative lookaheads everywhere that they are needed.

Providing an easy way to guarantee spots in an RE where you will not
backtrack can simplify your life, and the spots I would want to put
code would be points where you have just recognized some meaningful
token, which are also points where you would want to not backtrack
from...

Programming is not so much the art of being able to do complex
things as it is the art of being able to make complexity manageable by
humans. Beyond a narrow horizon, backtracking is not very manageable
for most humans.

Cheers,
Ben


p5pRT commented May 17, 2000

From [Unknown Contact. See original ticket]

On Wed, May 17, 2000 at 07​:24​:12PM -0400, Ben_Tilly@​trepp.com wrote​:

While matching /Ala/ you are putting off the decision about whether
you are doing the work to wind up matching Alaska or Alabama for 3
characters.

Thanks, I think I understand your language now.

I interpreted your claim about only optimizing one character of
next-character discrimination as, in this situation, making the match
look like /(?:A(?:laska|labama))|Indiana/, which is an improvement but
not a big one.

Why? Now do the same with (?​:laska|labama) etc.

Finally, this is documented in 5.6.0 ;-) (if you do not consider fairy
tales about "Backtracking" which worked as documenation substitutes
before).

It was always documented... in the source code.
Some forms of documentation are easier to read, that is all.

Source code is good as documentation only until the next patch in the queue.

If it tried to deduce this from scratch, yes. But if you aggressively
pursue trieing first, the internal representation of this RE should
reduce down to a representation in which it is a trivial deduction.

Do not see how this would work.

I meant that were you to still attempt the optimization on RExen where
\1 gets accessed, but throw away the collected information every time
\1 changes value, then some performance disasters would be fixed.

An interesting idea indeed.

  "yyyx" =~/y\*y\{2\,3\}x/

"yyyx" =~ /y*y{2,}x/

Shows what?

In both of them the issue is that when you backtrack on the y* you
will fail y{2,} at first because y* matched all but none, and all
but 1, and after that you should succeed but the optimization will
stop you dead in your tracks.

You mean the second y​: probably you suppose that after

  yyyx
  112 # First two chars matched by y*, the third one by y in y{2,3}

the position 3 is marked as "Cannot match y in y{2,3}", right? It
would not, since the "required" count 2 of y's is not fulfilled yet.

From your description I understood that if the RE took above some
runtime threshold, then you retried with the optimization on. The
example /(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/ will be
sped up by the optimization from n^9 behaviour to n^2. While this
is not as good as the linear split, in many cases it is acceptable.

So, does the optimization hit in this case?

There is no instrumenting of MEDIUM{n,m} or SIMPLE{n,m} for such an
optimization. Just collecting statistics for the switchover might make
things significantly slower.

There is a way​: use localization and DESTROY, or DESTROY on $^R (which
is autolocalized). It is not pretty, but it is there. If a
consistent usage mode appears, one can make a syntactic sugar for
this. Without proof that it is needed, I do not want to make things
harder.

AFAICS localization of variables is insufficient if your goal is to
populate a parse tree.

Given DESTROY, you have enough rope for anything you want.

Plus it is inefficient to produce and destroy
stacks of variables when you could just increment and decrement a
counter.

I do not think questions of efficiency should enter the discussion
when we want to get a feeling for what is *desired* to be there.

This is an independent idea from (?>). Consider this​:

/
\G
(
(> # Hmmm...the shoe fits :-)
"(?\/)(?​:[^"]|"")*"
|
[^,\n]*
)
)
(?=,|\n|$)
/gx

This will match the next field in a .csv format. The point is that if
the field starts with a ", then it is a quoted field, otherwise it is
an unquoted field. How can I say that now?

Since I do not know what (?\/) is supposed to do, it is very hard to guess.
But why won't you put parens around the groups separately, and check
which one matched? And put (?>) around (?:[^"]|"")* and [^,\n]* ...

Ilya


p5pRT commented May 18, 2000

From [Unknown Contact. See original ticket]

Ilya Zakharevich wrote​:

On Wed, May 17, 2000 at 07​:24​:12PM -0400, Ben_Tilly@​trepp.com wrote​:

While matching /Ala/ you are putting off the decision about whether
you are doing the work to wind up matching Alaska or Alabama for 3
characters.

Thanks, I think I understand your language now.

:-)

I interpreted you claim about only optimizing one character of next
character discrimination as in this situation making the match look
like /(?​:A(?​:laska|labama))|Indiana/ which is an improvement but not
a big one.

Why? Now do the same with (?​:laska|labama) etc.

Now that you understand what I was thinking, you can see why I was
wondering exactly that.

OTOH now I am convinced it is not so simple, when you open yourself up
for this form of optimization, there are a lot more that come to mind
along the same lines...

Also you cannot just trie it arbitrarily far without some extra rules,
else trying to trie \s*\s* would be very dangerous. (I feel in my gut
that while trieing the expression this should be recognizable as just
\s*, but I am not sure how to say the rule that implies that.)

Finally, this is documented in 5.6.0 ;-) (if you do not consider fairy
tales about "Backtracking" which worked as documenation substitutes
before).

It was always documented..in the source-code.
Some forms of documentation are easier to read, that is all.

Source code is good as documentation only until the next patch in the queue.

True. But the documentation is always less accurate about what is
currently true. :-)

If it tried to deduce this from scratch, yes. But if you aggressively
pursue trieing first, the internal representation of this RE should
reduce down to a representation in which it is a trivial deduction.

Do not see how this would work.

Now that I examine it in detail, I was implicitly first applying a
rather complex optimization. :-( If I can figure out how to verbalize
it, I will probably mention it, but at the moment I cannot. :-( :-(

I meant that were you to still attempt the optimization on RExen where
\1 gets accessed, but throw away the collected information every time
\1 changes value, then some performance disasters would be fixed.

An interesting idea indeed.

Now that I think about it though, any performance problems which had
the cached RExen in place would slow down...

Does the "visited position" optimization only apply to positions where
you have some sort of choice that could have been just made? If not
then it should since this reduces the tracking work.

[...]

Shows what?

In both of them the issue is that when you backtrack on the y* you
will fail y{2,} at first because y* matched all but none, and all
but 1, and after that you should succeed but the optimization will
stop you dead in your tracks.

You mean the second y​: probably you suppose that after

yyyx
112 # First two chars matched by y*, the third one by y in y{2,3}

the position 3 is marked as "Cannot match y in y{2,3}", right? It
would not, since the "required" count 2 of y's is not fulfilled yet.

Ah, I misunderstood your visited position optimization rule. That is
quite slick. :-)

From your description I understood that if the RE took above some
runtime threshold, then you retried with the optimization on. The
example /(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/ will be
sped up by the optimization from n^9 behaviour to n^2. While this
is not as good as the linear split, in many cases it is acceptable.

So, does the optimization hit in this case?

There is no instrumenting of MEDIUM{n,m} or SIMPLE{n,m} for such an
optimization. Just collecting statistics for the switchover might make
things significantly slower.

Thinking more about this example it gets very frustrating. I feel in
my gut that this expression can be extremely nicely optimized by
something like the trieing. You take the expression and rewrite it so
that you have states where you are trying the first greedy match,
possibly are on the second or the first, possibly any of the first 3,
etc. Essentially note the passing commas as you encounter them, and
put off the question of which .* you are trying to match when you
encounter them as late as possible (ie the end of the line).

My thinking feels a little muddy on how you would want to do this, but
I am morally convinced that having a careful way of trieing would let
you optimize many regular expressions to the most efficient way that
people could do them. (eg In the above turn it into a scan for ",",
keeping track of where they are and how many they are until you reach
"enough" (ie 8) and then just keep track of where they are so you can
reconstruct the answer when you are done.)
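
A hand-rolled sketch of the scan described above (the sample line is made up): one linear pass with index() locates the separators, after which the nine fields fall out directly.

```perl
use strict;

my $line = "a,b,c,d,e,f,g,h,i";

# One linear pass to locate every comma.
my @commas;
my $at = -1;
push @commas, $at while ($at = index($line, ',', $at + 1)) >= 0;

# With the 8 separators known, slice out the nine fields.
my @fields;
my $start = 0;
for my $c (@commas) {
    push @fields, substr($line, $start, $c - $start);
    $start = $c + 1;
}
push @fields, substr($line, $start);
```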

[...]

Given DESTROY, you have enough rope for anything you want.

You mean I can theoretically write it that way. I would not want to
though...

[...]

This is an independent idea from (?>). Consider this​:

/
\G
(
(> # Hmmm...the shoe fits :-)
"(?\/)(?​:[^"]|"")*"
|
[^,\n]*
)
)
(?=,|\n|$)
/gx

This will match the next field in a .csv format. The point is that if
the field starts with a ", then it is a quoted field, otherwise it is
an unquoted field. How can I say that now?

Since I do not know what (?\/) is supposed to do, it is very hard to guess.
But why won't you put parens around the groups separately, and check
which one matched? And put (?>) around (?​:[^"]|"")* and [^,\n]* ...

Note that with good trieing the above (>..) construct should be
superfluous, the optimizer should notice the fact that it applies
without the hint. :-)

As for what it is supposed to do, that is simple. The engine should
never backtrack past (?\/). This is a point of no return. If the
pattern reaches this point it either matches from this point forward
or else it fails.

A regular expression engine notes points where it *might* make a
decision. It has no facility to directly state that you *have
finalized* that decision. Here, for your reference, is what a
*.csv file should look like. It consists of rows of data. Each
row consists of fields separated by commas, and the row ends either
with the end of the file or "\n". A field comes in two forms,
quoted and unquoted. If a field begins with a '"' it is a quoted
field. A quoted field may contain embedded commas and newlines,
and may also contain '""'s to represent '"'s. It ends at the first
unpaired '"'. An unquoted field may not start with '"', and extends
to the next ',', "\n", or the end of the file.

If you have a parsing engine of the form I described, your work is
divided into concrete components and stages. Generally you are
searching for some sort of token, and once you identify a token you
will never turn back. In parsing a *.csv file properly you should
never turn back after deciding that a field is quoted or unquoted,
after locating where a field finishes, or after locating where a
row finishes.
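
A forward-only tokenizer in that spirit might look like the following sketch. The first character decides quoted vs. unquoted and that decision is never revisited; it handles one logical row and does no error recovery for malformed input:

```perl
use strict;

sub csv_fields {
    my ($line) = @_;
    my @fields;
    while (1) {
        if ($line =~ /\G"((?:[^"]|"")*)"/gc) {   # quoted field
            (my $f = $1) =~ s/""/"/g;            # unescape doubled quotes
            push @fields, $f;
        } else {
            $line =~ /\G([^,]*)/gc;              # unquoted field
            push @fields, $1;
        }
        last unless $line =~ /\G,/gc;            # consume separator or stop
    }
    return @fields;
}
```

The /gc modifiers keep pos() in place on a failed alternative, which is what makes the once-forward scan work.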

Cheers,
Ben


p5pRT commented May 18, 2000

From [Unknown Contact. See original ticket]

On Thu, May 18, 2000 at 03​:44​:40PM -0400, Ben_Tilly@​trepp.com wrote​:

Also you cannot just trie it arbitrarily far without some extra rules,
else trying to trie \s*\s* would be very dangerous.

Why? (Unless you suppose a wrong implementation. ;-) \s*\s*\w+
should be trie()ed as having a leading [\s\w]. As I said, this is
already implemented.

Source code is good as documentation only until the next patch in the queue.

True. But the documentation is always less accurate about what is
currently true. :-)

Personally, I do not care much about what is currently true (unless in
a one-liner). I prefer to know when "If this breaks, I would report
it as a bug".

Does the "visited position" optimization only apply to positions where
you have some sort of choice that could have been just made? If not
then it should since this reduces the tracking work.

Only before 'H' in (?​:HERE){n,m}.

Thinking more about this example it gets very frustrating. I feel in
my gut that this expression can be extremely nicely optimized by
something like the trieing.

It is called DFA. ;-)

Given DESTROY, you have enough rope for anything you want.

You mean I can theoretically write it that way. I would not want to
though...

Just write a custom REx engine (which see) which converts

  (?{ CODE1 }{ CODE2 })

to

  (?{ local $re​::unwind = new re​::unwind sub {CODE2}; CODE1 })

with properly defined re​::unwind​::new and re​::unwind​::DESTROY.

(
  (?> # Hmmm...the shoe fits :-)
    "(?\/)(?:[^"]|"")*"
    |
    [^,\n]*
  )
)
(?=,|\n|$)

Note that with good trieing the above (?>..) construct should be
superfluous, the optimizer should notice the fact that it applies
without the hint. :-)

No it would not. Note "" and ".

As for what it is supposed to do, that is simple. The engine should
never backtrack past (?\/). This is a point of no return.

I consider things which work in REX1 but would not work in

  WHATEVER (REX1) WHATEVER_ELSE

not that much useful. If you can design a syntax where you can
specify the scope up to which backtracking is prohibited, then YES I
WISH THIS TOO.

Ilya

@p5pRT

p5pRT commented May 18, 2000

From [Unknown Contact. See original ticket]

On Thu, May 18, 2000 at 03​:44​:40PM -0400, Ben_Tilly@​trepp.com wrote​:

Also you cannot just trie it arbitrarily far without some extra rules,
else trying to trie \s*\s* would be very dangerous.

Why? (Unless you suppose a wrong implementation. ;-) \s*\s*\w+
should be trie()ed as having a leading [\s\w]. As I said, this is
already implemented.

I was pointing out that in this case naive can easily be wrong. :-)

[...]

Personally, I do not care much about what is currently true (unless in
a one-liner). I prefer to know when "If this breaks, I would report
it as a bug".

Might I suggest changing that standard to, "If this breaks, Tom would
report it as a bug"?

(Just teasing!)

Does the "visited position" optimization only apply to positions where
you have some sort of choice that could have been just made? If not
then it should since this reduces the tracking work.

Only before 'H' in (?​:HERE){n,m}.

:-)

Thinking more about this example it gets very frustrating. I feel in
my gut that this expression can be extremely nicely optimized by
something like the trieing.

It is called DFA. ;-)

That is an interesting insight. If you take an RE and aggressively
trie it, you get something for that section that really is extremely
similar to a DFA with the minor detail that it matches what an NFA
does, and it becomes natural to be able to keep a history. (As we
discussed privately, that is something that DFAs cannot easily do.)

I wonder if this idea can be somehow extended to make NFAs in
practice nearly as reliable as DFAs? Somehow I suspect that, modulo
arbitrary limits on how big you will let an RE get, you probably
could get some amazing improvements.

Given DESTROY, you have enough rope for anything you want.

You mean I can theoretically write it that way. I would not want to
though...

Just write a custom REx engine (which see) which converts

(?{ CODE1 }{ CODE2 })

to

(?{ local $re​::unwind = new re​::unwind sub {CODE2}; CODE1 })

with properly defined re​::unwind​::new and re​::unwind​::DESTROY.

(
  (?> # Hmmm...the shoe fits :-)
    "(?\/)(?:[^"]|"")*"
    |
    [^,\n]*
  )
)
(?=,|\n|$)

Note that with good trieing the above (?>..) construct should be
superfluous, the optimizer should notice the fact that it applies
without the hint. :-)

No it would not. Note "" and ".

Check again. The anchored subpattern terminates either at a '"' or at
something else that is not a "," or "\n". The next thing afterwards
is

As for what it is supposed to do, that is simple. The engine should
never backtrack past (?\/). This is a point of no return.

I consider things which work in REX1 but would not work in

WHATEVER (REX1) WHATEVER_ELSE

not that much useful. If you can design a syntax where you can
specify the scope up to which backtracking is prohibited, then YES I
WISH THIS TOO.

This might not meet even a camel's beauty standard, but...

Imitate the same block labelling mechanism as in loop labels. Define
a block some natural way. For instance independent subpatterns,
compiled REs, or anything delimited by the (?​:pattern) construct.
Add optional labelled blocks that look like (?%LABEL​:pattern). Then
use (?\/) to lock down the current block, and (?\/LABEL) to lock down
the current iteration of block LABEL.

Voila.

(Yes Tom, lacking aesthetic taste is an occupational hazard among
mathematicians...)

Ben

@p5pRT

p5pRT commented May 18, 2000

From [Unknown Contact. See original ticket]

On Thu, May 18, 2000 at 05​:55​:24PM -0400, Ben_Tilly@​trepp.com wrote​:

(
  (?> # Hmmm...the shoe fits :-)
    "(?\/)(?:[^"]|"")*"
    |
    [^,\n]*
  )
)
(?=,|\n|$)

Note that with good trieing the above (?>..) construct should be
superfluous, the optimizer should notice the fact that it applies
without the hint. :-)

No it would not. Note "" and ".

Check again. The anchored subpattern terminates either at a '"' or at
something else that is not a "," or "\n". The next thing afterwards
is

You cannot omit (?>) in

  (?> (?:[^"]|"")* )"

Since it changes the semantic. Deeper lookahead on the larger REx
would see that the change would not affect the larger thing, but I do
not think we want to have a deep lookahead so early in the Perl life. ;-)
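The semantic difference is easy to demonstrate in any backtracking engine. A small Python sketch (my own, using the well-known `(?=(X))\1` lookahead-plus-backreference idiom to emulate `(?>X)`, since a lookahead's internal choices are never revisited on backtracking):

```python
import re

s = 'ab""'   # ends with a '""' pair, no closing quote afterwards

# Without (?>): the quantifier gives back the '""' pair so the trailing
# '"' can match half of it -- the match "succeeds", wrongly.
loose = re.match(r'(?:[^"]|"")*"', s)

# With (?>) emulated: the group keeps its greedy match of 'ab""', the
# trailing '"' then fails at end of string, and the whole match fails.
atomic = re.match(r'(?=((?:[^"]|"")*))\1"', s)

print(loose.group(0))   # ab"
print(atomic)           # None
```

So omitting the (?>) really does change which strings match, exactly as noted above.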

This might not meet even a camel's beauty standard, but...
Imitate the same block labelling mechanism as in loop labels.

E2BIZZARE.

Ilya

@p5pRT

p5pRT commented May 18, 2000

From [Unknown Contact. See original ticket]

(Yes Tom, lacking aesthetic taste is an occupational hazard among
mathematicians...)

To the contrary​: elegance and creativity are essential. See
"The Art of Mathematics", by Jerry King.

--tom

@p5pRT

p5pRT commented May 18, 2000

From [Unknown Contact. See original ticket]

(
  (?> # Hmmm...the shoe fits :-)
    "(?\/)(?:[^"]|"")*"
    |
    [^,\n]*
  )
)
(?=,|\n|$)

Note that with good trieing the above (?>..) construct should be
superfluous, the optimizer should notice the fact that it applies
without the hint. :-)

No it would not. Note "" and ".

Check again. The anchored subpattern terminates either at a '"' or at
something else that is not a "," or "\n". The next thing afterwards
is

You cannot omit (?>) in

(?> (?:[^"]|"")* )"

Since it changes the semantic. Deeper lookahead on the larger REx
would see that the change would not affect the larger thing, but I do
not think we want to have a deep lookahead so early in the Perl life. ;-)

One of us is miscounting parens?

It should not take a particularly deep trieing to trie together the
initial '"' of a '""' pair with the attempted match of '"(?=,|\n|$)'.
After that point you have implicitly found the optimization, since
there is no longer any point in the regular expression where the
next character gives you a choice about what to do if you match.
Ergo you will now reduce down to a single scan of the string.

This might not meet even a camel's beauty standard, but...
Imitate the same block labelling mechanism as in loop labels.

E2BIZZARE.

From you I will take that as a high compliment. :-)

Cheers,
Ben

@p5pRT

p5pRT commented May 18, 2000

From [Unknown Contact. See original ticket]

Tom Christiansen wrote​:

(Yes Tom, lacking aesthetic taste is an occupational hazard among
mathematicians...)

To the contrary​: elegance and creativity are essential. See
"The Art of Mathematics", by Jerry King.

But mathematicians often find that the most elegant way to express
themselves is to create obscure and precise languages in which they
can directly express very obfuscated concepts...

Cheers,
Ben

@p5pRT

p5pRT commented May 25, 2000

From [Unknown Contact. See original ticket]

On Wed, May 17, 2000 at 07​:24​:12PM -0400, Ben_Tilly@​trepp.com wrote​:

2) (?\/) No backtracking past here.

This is (?>), but (?>) is better​: nobody could explain what the above
sentence *actually means*. (?>) provides a scope to limit
backtracking-prohibition to, thus makes things defined.

This is an independent idea from (?>). Consider this​:

Apparently, you want something like (?x: | (?{ die })) and putting the
match inside eval {}?

Ilya

@p5pRT

p5pRT commented May 26, 2000

From [Unknown Contact. See original ticket]

Ilya Zakharevich wrote​:

On Wed, May 17, 2000 at 07​:24​:12PM -0400, Ben_Tilly@​trepp.com wrote​:

2) (?\/) No backtracking past here.

This is (?>), but (?>) is better​: nobody could explain what the above
sentence *actually means*. (?>) provides a scope to limit
backtracking-prohibition to, thus makes things defined.

This is an independent idea from (?>). Consider this​:

Apparently, you want something like (?x: | (?{ die })) and putting the
match inside eval {}?

Ick.

After some puzzling I realize that that achieves the effect.

But, ick.

Ilya, I have to hand it to you. I would never have come up with that
on my own. Be careful what you ask for and all that...

Cheers,
Ben

@p5pRT

p5pRT commented May 26, 2000

From [Unknown Contact. See original ticket]

On Fri, May 26, 2000 at 08​:57​:37AM -0400, Ben_Tilly@​trepp.com wrote​:

On Wed, May 17, 2000 at 07​:24​:12PM -0400, Ben_Tilly@​trepp.com wrote​:

2) (?\/) No backtracking past here.

Apparently, you want something like (?x: | (?{ die })) and putting the
match inside eval {}?

Ick.

After some puzzling I realize that that achieves the effect.

But, ick.

Ilya, I have to hand it to you. I would never have come up with that
on my own. Be careful what you ask for and all that...

Well, I'm not sure that it would work as wanted from inside

@​matches = /$REx/g;

But if *this* is the semantic you want, we can at least think whether
we want to make it work as expected, etc.

Btw, as I said, the main problem with a scalable approach is how to
specify up to which level to stop backtracking. But we already have
*some* structure to name groups, witness $2, $33 etc. So one can have

  (... ( ... \g{!2} ... ) ...)

which would make the group 2 fail.

Additionally, I'm thinking about whether we want an escape to disallow
forward-tracking up until the given point. Say,

  / (
  " [\"]* " \g! \g{!1} (?!) # Do not retry until this point
  |
  /\* .*? \*/
  )
  /

would match comments in the C file without ''-literals, #-directives,
and backwacked " inside strings. Here \g! is the hypothetical "do not
retry the whole match until this point".

Ilya

@p5pRT

p5pRT commented May 26, 2000

From [Unknown Contact. See original ticket]

Ilya Zakharevich wrote:

On Fri, May 26, 2000 at 08​:57​:37AM -0400, Ben_Tilly@​trepp.com wrote​:

On Wed, May 17, 2000 at 07​:24​:12PM -0400, Ben_Tilly@​trepp.com wrote​:
[...]
But if *this* is the semantic you want, we can at least think whether
we want to make it work as expected, etc.

After seeing that die construct, I am getting more dubious by the
minute...

Btw, as I said, the main problem with a scalable approach is how to
specify up to which level to stop backtracking. But we already have
*some* structure to name groups, witness $2, $33 etc. So one can have

(... ( ... \g{!2} ... ) ...)

which would make the group 2 fail.

Hmmm...

That brings up one absolutely huge request.

$yyyymmdd =~ /(?$year​: \d{4} )
  (?$month​: \d\d )
  (?$day​: \d\d )/x;

That both gives you better naming on groups, as well as giving a nice
way to handle referring to named references. (After all the first
thing that people usually do with $1, etc is to drop it into a named
variable for later use...)
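For what it's worth, later engines did grow named groups, though they bind into a match structure rather than straight into lexicals. A sketch of the conventional spelling (Python's `(?P<name>...)` shown for illustration; Perl 5.10's `(?<name>...)` with `%+` is analogous):

```python
import re

date_re = re.compile(r'^(?P<year>\d{4})(?P<month>\d\d)(?P<day>\d\d)$')

m = date_re.match('20000526')
if m is None:
    raise ValueError('date could not be parsed')   # always check the match
year, month, day = m.group('year', 'month', 'day')
print(year, month, day)   # 2000 05 26
```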

Additionally, I'm thinking about whether we want an escape to disallow
forward-tracking up until the given point. Say,

/ (
" [\"]* " \g! \g{!1} (?!) # Do not retry until this point
|
/\* .*? \*/
)
/

would match comments in the C file without ''-literals, #-directives,
and backwacked " inside strings. Here \g! is the hypothetical "do not
retry the whole match until this point".

OK, now you have *me* confused...

Cheers,
Ben

@p5pRT

p5pRT commented May 26, 2000

From [Unknown Contact. See original ticket]

On Fri, May 26, 2000 at 12​:51​:29PM -0400, Ben_Tilly@​trepp.com wrote​:

That brings up one absolutely huge request.

$yyyymmdd =~ /(?$year​: \d{4} )
(?$month​: \d\d )
(?$day​: \d\d )/x;

I do not think we want this. You are using the symbol table as a
hash. I discussed the approach I prefer​:

  $yyyymmdd =~ /(?{year}​: \d{4} )
  (?{month}​: \d\d )
  (?{day}​: \d\d )/x;

which would set $^R{year} etc. Or $^R{date}{year} if it is inside
another (?{date}​:) group. Plus additionally there should be a
way to replace %^R by a different hash​:

  (?%RESULT​: )

Plus a way to put *all* the matches of GROUP* into an anonymous array​:

  (?[]*​: GROUP)*

Plus a way to put groups inside this one into an anonymous array

  (?[]​: PATTERN )

Ilya

@p5pRT

p5pRT commented May 26, 2000

From [Unknown Contact. See original ticket]

[I'm branching different questions on this thread.]

On Fri, May 26, 2000 at 12​:51​:29PM -0400, Ben_Tilly@​trepp.com wrote​:

/ (
" [\"]* " \g! \g{!1} (?!) # Do not retry until this point
|
/\* .*? \*/
)
/

would match comments in the C file without ''-literals, #-directives,
and backwacked " inside strings. Here \g! is the hypothetical "do not
retry the whole match until this point".

OK, now you have *me* confused...

How to look for C comments? Of course, the FAQ answer is completely
wrong, since you cannot look for comments separately from all the
other constructs. Your REx should recognize char literals, string
literals, and (possibly) preprocessor constructs.

But when you found a literal, you need

  a) to fail;

  b) before failing to tell the REx engine that what is currently in
  $& cannot contain the start of a valid match, so the engine
  should not retry looking for a match until "this" point.

a) is done by (?!), b) is done by \g!.

\g{!1} is just an optimization telling that you cannot match / if you
matched ", so one should not retry anything in the group 1. This
optimization can be made automatic if I suddenly decide I want to
continue doing anything with Perl -- or if somebody else is ready to
do it.

Ilya

@p5pRT

p5pRT commented May 26, 2000

From [Unknown Contact. See original ticket]

On May 26, Ilya Zakharevich said​:

   " [\"]* " \g! \g{!1} (?!)  # Do not retry until this point

b) before failing to tell the REx engine that what is currently in
$& cannot contain the start of a valid match, so the engine
should not retry looking for a match until "this" point.

a) is done by (?!), b) is done by \g!.

So -- correct me if I'm wrong (and I'm sure that would be done) -- \g! is
kind of like placing where \G would match, if you were doing /\G.../g?

@p5pRT

p5pRT commented May 26, 2000

From [Unknown Contact. See original ticket]

On Fri, May 26, 2000 at 02​:03​:41PM -0400, Jeff Pinyan wrote​:

On May 26, Ilya Zakharevich said​:

   " [\"]* " \g! \g{!1} (?!)  # Do not retry until this point

b) before failing to tell the REx engine that what is currently in
$& cannot contain the start of a valid match, so the engine
should not retry looking for a match until "this" point.

a) is done by (?!), b) is done by \g!.

So -- correct me if I'm wrong (and I'm sure that would be done) -- \g! is
kind of like placing where \G would match, if you were doing /\G.../g?

A goold (gold/good) point. Set pos() here. Yes, indeed.

Hmm, we need more​: do not unset it in the case of backtracking. So in
fact it *is* different than what you propose.

What you propose is closer to two other escapes \g< and \g> we badly
need. They would start/end what is considered to be $& of the match.
As I wrote it before​:

  / ab \g< cd \g> ef /x

is equivalent to the current

  / (?<= ab ) cd (?= ef ) /x

but can be used in many other situations as well.

So what you propose is basically \g>. But we *want* \g< and \g> to be
reset by backtracking, and it is clear that \g! should behave differently...

Ilya

@p5pRT

p5pRT commented May 26, 2000

From [Unknown Contact. See original ticket]

Ilya Zakharevich wrote​:

On Fri, May 26, 2000 at 12​:51​:29PM -0400, Ben_Tilly@​trepp.com wrote​:

That brings up one absolutely huge request.

$yyyymmdd =~ /(?$year​: \d{4} )
(?$month​: \d\d )
(?$day​: \d\d )/x;

I do not think we want this. You are using the symbol table as a
hash. I discussed the approach I prefer​:

No, I would want this to work with whatever variables I want. Lexical
or global.

$yyyymmdd =~ /(?{year}​: \d{4} )
(?{month}​: \d\d )
(?{day}​: \d\d )/x;

which would set $^R{year} etc. Or $^R{date}{year} if it is inside
another (?{date}​:) group. Plus additionally there should be a
way to replace %^R by a different hash​:

While slightly nicer than the current, I would find myself using the
good old​:

  $yyyymmdd =~ /(\d{4})(\d\d)(\d\d)/;
  my ($year, $month, $day) = ($1, $2, $3);

and cursing and swearing if I ever want to get this or that block,
depending upon how it matched.

If I am using $year, $month, and $day in my code, those are going to
be the variables that I want the RE to populate. Populating a more
complex structure is good, but you should be able to choose what
structure that is...

(?%RESULT​: )

Plus a way to put *all* the matches of GROUP* into an anonymous array​:

(?[]*​: GROUP)*

Plus a way to put groups inside this one into an anonymous array

(?[]​: PATTERN )

Hmmm...

What if I wanted an array of arrays, for each match of the larger group,
everything in the smaller?

What if I wanted an array of hashes, for each match of the larger group,
a hash of the smaller?

etc.

There is clearly a need for some sort of generic "use this RE to
extract structured data into a native Perl data structure" but the
need and syntax need careful thought.

Here is a proposal. Run through the RE. Once you are through the
RE and understand the match, then add an optional phase where you
run through and apply structured actions. Say use a (?~..) type
construct to indicate that this action is tied to the position of
said string, once the match is completed. Then with a bit of syntax
you could decide to accept the data as a post-match bit of cleanup​:

(?~$year​: \d{4})

(?~{push @​year}​: \d{4})

The first is good enough for most basic needs. The second is flexible
enough to handle extracting any kind of data structure that you want
in virtually any form.

Cheers,
Ben

@p5pRT

p5pRT commented May 26, 2000

From [Unknown Contact. See original ticket]

On Fri, May 26, 2000 at 02​:49​:26PM -0400, Ben_Tilly@​trepp.com wrote​:

Plus a way to put *all* the matches of GROUP* into an anonymous array​:

(?[]*​: GROUP)*

Plus a way to put groups inside this one into an anonymous array

(?[]​: PATTERN )

What if I wanted an array of arrays, for each match of the larger group,
everything in the smaller?

Put one such group inside another.

What if I wanted an array of hashes, for each match of the larger group,
a hash of the smaller?

Likewise.

Ilya

@p5pRT

p5pRT commented May 26, 2000

From [Unknown Contact. See original ticket]

On Fri, May 26, 2000 at 02​:49​:26PM -0400, Ben_Tilly@​trepp.com wrote​:

Here is a proposal. Run through the RE. Once you are through the
RE and understand the match, then add an optional phase where you
run through and apply structured actions. Say use a (?~..) type
construct to indicate that this action is tied to the position of
said string, once the match is completed. Then with a bit of syntax
you could decide to accept the data as a post-match bit of cleanup​:

(?~$year​: \d{4})

(?~{push @​year}​: \d{4})

The first is good enough for most basic needs. The second is flexible
enough to handle extracting any kind of data structure that you want
in virtually any form.

You can do it now with (?{}). If you want syntactic sugar, use a
custom REx engine.

Ilya

@p5pRT

p5pRT commented May 26, 2000

From [Unknown Contact. See original ticket]

Ilya Zakharevich wrote​:

[I'm branching different questions on this thread.]

On Fri, May 26, 2000 at 12​:51​:29PM -0400, Ben_Tilly@​trepp.com wrote​:

/ (
" [\"]* " \g! \g{!1} (?!) # Do not retry until this point
|
/\* .*? \*/
)
/

would match comments in the C file without ''-literals, #-directives,
and backwacked " inside strings. Here \g! is the hypothetical "do not
retry the whole match until this point".

OK, now you have *me* confused...

How to look for C comments? Of course, the FAQ answer is completely
wrong, since you cannot look for comments separately from all the
other constructs. Your REx should recognize char literals, string
literals, and (possibly) preprocessor constructs.

But when you found a literal, you need

a) to fail;

b) before failing to tell the REx engine that what is currently in
$& cannot contain the start of a valid match, so the engine
should not retry looking for a match until "this" point.

a) is done by (?!), b) is done by \g!.

*lightbulbs go off*

Gotcha.

(?!) fails because it is the negative of a trivially true assertion.

\g! is a note to the RE that this is where it should restart trying
the match.

\g{!1} is just an optimization telling that you cannot match / if you
matched ", so one should not retry anything in the group 1. This
optimization can be made automatic if I suddenly decide I want to
continue doing anything with Perl -- or if somebody else is ready to
do it.

Please?

BTW I suspect that a careful job of next-character discrimination
along the lines that I came up with earlier would make both of those
optimizations automatic. At the least the syntactical hint should not
be added until you know that said optimization cannot be made to happen
automatically.

Cheers,
Ben

@p5pRT

p5pRT commented May 26, 2000

From [Unknown Contact. See original ticket]

Ilya Zakharevich wrote​:

On Fri, May 26, 2000 at 02​:49​:26PM -0400, Ben_Tilly@​trepp.com wrote​:

Here is a proposal. Run through the RE. Once you are through the
RE and understand the match, then add an optional phase where you
run through and apply structured actions. Say use a (?~..) type
construct to indicate that this action is tied to the position of
said string, once the match is completed. Then with a bit of syntax
you could decide to accept the data as a post-match bit of cleanup​:

(?~$year​: \d{4})

(?~{push @​year}​: \d{4})

The first is good enough for most basic needs. The second is flexible
enough to handle extracting any kind of data structure that you want
in virtually any form.

You can do it now with (?{}). If you want syntactic sugar, use a
custom REx engine.

Yes, but doing so during the match is considerably more complex, and
considerably slower, than it would be to do so when the match is
actually done.

Nevermind. I can live with $1, $2, $3.

I will try some playing with custom REx engines though.

Cheers,
Ben

@p5pRT

p5pRT commented May 26, 2000

From @RandalSchwartz

"Ben" == Ben Tilly <Ben_Tilly@​trepp.com> writes​:

Ben> $yyyymmdd =~ /(\d{4})(\d\d)(\d\d)/;
Ben> my ($year, $month, $day) = ($1, $2, $3);

Just for the record, this is dangerous code. Never use $1 unless you
have checked the success of the match you think has set it. Because
if it hasn't, you are getting the *previous* match.

*You* may know that, but your code surely didn't, so I judge you
on the code you wrote. :)

@p5pRT

p5pRT commented May 26, 2000

From [Unknown Contact. See original ticket]

On Fri, May 26, 2000 at 12​:49​:05PM -0700, "Randal L. Schwartz" wrote​:

"Ben" == Ben Tilly <Ben_Tilly@​trepp.com> writes​:
Ben> $yyyymmdd =~ /(\d{4})(\d\d)(\d\d)/;
Ben> my ($year, $month, $day) = ($1, $2, $3);
Just for the record, this is dangerous code. Never use $1 unless you
have checked the success of the match you think has set it. Because
if it hasn't, you are getting the *previous* match.

That's why we have​:

  my($year, $month, $day) = ($yyyymmdd =~ /^(\d{4})(\d\d)(\d\d)$/)
  or die "Date could not be parsed​: $yyyymmdd\n";

:-)

mark

--
markm@​nortelnetworks.com/mark@​mielke.cc/markm@​ncf.ca __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | SIR Tools (7H12)
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | Nortel Networks
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
  and in the darkness bind them...

  http​://mark.mielke.cc/

@p5pRT

p5pRT commented May 26, 2000

From [Unknown Contact. See original ticket]

Randal Schwartz wrote​:

"Ben" == Ben Tilly <Ben_Tilly@​trepp.com> writes​:

Ben> $yyyymmdd =~ /(\d{4})(\d\d)(\d\d)/;
Ben> my ($year, $month, $day) = ($1, $2, $3);

Just for the record, this is dangerous code. Never use $1 unless you
have checked the success of the match you think has set it. Because
if it hasn't, you are getting the *previous* match.

Only sometimes. :-)

$1, etc are localized so in many cases you may be sure that you are
not getting a previous match. Likewise if you precede that by a
trivially successful match, then you know they are not populated.
And conversely you may know from the code that you will never fail
to match.

*You* may know that, but your code surely didn't, so I judge you
on the code you wrote. :)

You are right that I was being lazy in the worst possible way in my
example. Normally I do some sort of error check, but I was being
lazy. (The wrong kind of lazy.)

Cheers,
Ben

@p5pRT

p5pRT commented May 26, 2000

From [Unknown Contact. See original ticket]

At 04​:26 PM 5/26/00 -0400, Ben_Tilly@​trepp.com wrote​:

$1, etc are localized

No they aren't. They're mutant pseudo-scalars attached to the underlying
optree, and you may well get a $1 from the last time you matched that
regex. (I've got a trivial program that shows the problem with a recursive
call)

Don't Do That. :)

  Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
dan@​sidhe.org have teddy bears and even
  teddy bears get drunk

@p5pRT

p5pRT commented May 26, 2000

From [Unknown Contact. See original ticket]

On Fri, May 26, 2000 at 03​:09​:48PM -0400, Ben_Tilly@​trepp.com wrote​:

You can do it now with (?{}). If you want syntactic sugar, use a
custom REx engine.

Yes, but doing so during the match is considerably more complex, and
considerably slower, than it would be to do so when the match is
actually done.

Just make your assignments the last thing in the pattern.

Ilya

@p5pRT

p5pRT commented Sep 11, 2004

From @smpeters

Using the code with the substitution that never terminates​:

steve@​kirk sandbox $ perl subs.pl
Starting
Finished

It appears that this has been fixed by 5.8.5 or earlier.

@p5pRT

p5pRT commented Sep 11, 2004

@smpeters - Status changed from 'open' to 'resolved'
