New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A substitution that never terminates #1975
Comments
From @shieldsThe following test script causes Mail::Header to enter an infinite ---- cut here use Mail::Internet; my $mail = <<'EOF'; body my $message = new Mail::Internet [split /\n/, $mail]; exit 0; The expression that fails to terminate is: $x .= $2 . "\n " I pruned this down to the test case: ---- cut here $foo = <<'EOF'; print "Starting\n"; $foo =~ s/^(\s*(([^\s"]+("[^"]*")?)+[\s\n]))//; print "Finished\n"; exit 0; This single substitution never terminates on perl 5.005_03, from |
From @tamiasOn Tue, May 16, 2000 at 04:40:02AM +0000, Michael Shields wrote:
How long did you wait? The nested quantifiers make this regex take time exponentially relative to For example: s/^(\s*[^\s"]+(?:"[^"]*"[^\s"]+)*[\s])//; Ronald |
From @shields
Three minutes on a PII-450. |
From @shieldsI will therefore suggest the following patch to Mail::Header (v 1.17), Inline Patch--- Header.pm.orig Tue May 16 05:17:25 2000
+++ Header.pm Tue May 16 05:26:15 2000
@@ -110,7 +110,7 @@
[^"]{$min,$max}?[\,\;]
|[^"]{1,$max}[\s\n]
|[^\s"]+[\s\n]
- |([^\s"]+("[^"]*")?)+[\s\n]
+ |[^\s"]+(?:"[^"]*"[^\s"]+)*[\s]
))
//x);
$x .= $_[0];
Thanks. |
From [Unknown Contact. See original ticket]Michael Shields <shields@msrl.com> wrote
Ronald has pointed out that it's taking a long time rather than looping Mike Guy |
From @tamiasOn Tue, May 16, 2000 at 05:29:05AM +0000, Michael Shields wrote:
Oops, I should have written that as \s rather than [\s], don't need the \s includes \n, so [\s\n] is redundant; the same change can be made to the Ronald |
From @gbarrOn Tue, May 16, 2000 at 04:40:02AM +0000, Michael Shields wrote:
Yes this regex does go on forever on malformed headers. But I have not had the Code has been added to only perform this regex on "structured" headers, but there is an If you have any suggestions let me know. Thanks, |
From [Unknown Contact. See original ticket]Ronald J Kimball wrote:
I think that the regex engine could and should be less sensitive to http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-05/msg00509.html could address this case. Take the original regex: s/^(\s*(([^\s"]+("[^"]*")?)+[\s\n]))//; The key problem is here: ...([^\s"]+("[^"]*")?)+... because it searches over ways to partition a string on its way to 1([^\s"]2+("3[^"]4*"5)?) Now at those points here is what you have to try in order: 1: Match [^\s"] and -> 2 2: Match [^\s"] and -> 2 3: Match [^"] and -> 4 4: Match [^"] and -> 4 5: Match [^\s"] and -> 2 And, of course, it tries all of this recursively. So if it Now the problem, of course, is in 2 which can match [^\s"], fail, Now how could Ilya's suggestion, properly modified, help? Well Here is a simple form of another common case: (\s*\w\s*)+ Analyzing as above: 1(\s2*\w3\s*)+ 1: Match \s and -> 2 2: Match \s and -> 2 3: Match \s and -> 3 Again, when you think about doing next character discrimination, Were my C as developed as my math skills, I might try to do this. But it is not. :-( Cheers, |
From [Unknown Contact. See original ticket]M.J.T. Guy writes:
From theoretical point of view this optimization is extremely buggy: I Unfortunately, up to now I could not deduce a test case when the a) it always produces correct results; The conditions I had in mind initially are very restrictive, say, ((\w{1,5}){1,5}){1,5} Here is the scoop if you want to help me: if a long match is detected, ("position in the REx", "position in the input string") and keep pairs of positions which were "visited" during the match. If Here is the problem: there is more state than this. Say, for (\w{1,5}){1,5}$ If we are at \w again with the "external" count being 4. This But we do not keep this number in the table. Now if we already This the optimization is not viable. Now if somebody could provide an Ilya |
From [Unknown Contact. See original ticket]Ben_Tilly@trepp.com writes:
Yes, this is why I stole some time and put next character In short: in some cases we can go as quick as a DFA, and it is easy to But what to do meanwhile? Answer: use (?>) instead. Ilya |
From [Unknown Contact. See original ticket]Ilya wrote:
Why not enable it for arbitrary numbers of characters?
About a year ago I figured out how to do a hybrid algorithm which can But then again I remember a co-worker who was complaining about the :-(
I suspect that virtually anyone with sufficient tuits to understand Personally I think that careful use of /g to write miniature parsing Cheers, |
From [Unknown Contact. See original ticket]Ben_Tilly@trepp.com writes: A module that defined a number of variables containing qr patterns use re::pat; Such a module would let non-experts: 1. avoid writing bad RE's for common tasks that are provided |
From [Unknown Contact. See original ticket]On Tue, May 16, 2000 at 03:40:57PM -0400, Ben_Tilly@trepp.com wrote:
How would you use it? There was already an infrastructure to use a) to optimize alternation basing on the next-char info; b) to (?>)-ize repetition A*B => (?>A*)B if A and B cannot match at the same place c) Run-time code similar to "b" if there are *some* next-chars which
I think with a/b/c above (and better visited-position code) we can go
If one changes the code for fixed-substrings to support an arbitrary
What for? Just use (?>).
How so? Why do you think a well-commented Perl code with 17 simple Definitely /(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/ expresses
This is an interesting observation. Can we do the same with (a
Which are handled by REx engine without any problem. Ilya |
From @gbarrOn Tue, May 16, 2000 at 09:40:40AM -0400, Ronald J Kimball wrote:
While this my fix the forever loop, it causes testcases to fail
Yes, for some reason I always do that. Graham. |
From @tamiasOn Tue, May 16, 2000 at 09:55:51PM +0100, Graham Barr wrote:
I think it fails because the new regex requires at least one [^\s"] after Try this instead: [^\s"]+(?:"[^"]*"[^\s"]*)*\s That should still avoid the exponential failure time.
It's redundant, but it's also explicit. Works either way. :) Ronald |
From [Unknown Contact. See original ticket]On Tue, May 16, 2000 at 07:27:51PM -0400, Ben_Tilly@trepp.com wrote:
Do you mean "a) alone would be useful too"?
Exactly the opposite. Make this decision as early as possible, so
Exactly the opposite. If A and B cannot match at the same position,
This is not how I see it. You determine (in compile time) whether the
Easier: just keep the flag that "we are inside a fixed run of {n,m}".
Multiple-fixed-substring is not a special case. We already do 2 fixed
No. We do not use this for SIMPLE* - the code for SIMPLE* is very
?? Whenever you want * to be maximally-greedy, use (?>). As easy as that.
If you use RExen, it is not a second language. You call RExen from
Time-to-time: yes. So use (?{}). ;-) And if you need to backtrack, then having one REx would be much easier Ilya |
From [Unknown Contact. See original ticket]On Tue, May 16, 2000 at 03:40:57PM -0400, Ben_Tilly@trepp.com wrote:
May be *you* can clear it for me: who do DFAs without memory implement Ilya |
From [Unknown Contact. See original ticket]Ronald J Kimball writes:
And without `use locale' the produce identical RExen... Ilya |
From [Unknown Contact. See original ticket]Ilya wrote:
You only need a) to be useful. Just make it possible to put off the As for b) and c), I presume you mean that B will match there only if
DANGER. Better visited position code needs to know for each position I would suggest a fix that consists of unravelling all fixed
With sufficient analysis you can optimize anything. Do you want to Besides which your visited positions optimization will optimize the
What tool do I reach for first? I mostly write simple REs so I would Besides which the simple expedient of writing clear REs is a skill
Different mode of thought. Additionally the best goal is frequently The basic mode of thought is that regular expressions are a great
Split does it better.
Not that I can see. A trivial example of why not would be an
Right, another of those extensions that I don't know to reach for. I don't believe that it existed in 5.005 either, which makes it an The idea of a parse engine is another idea that I can take with me and Cheers, |
From [Unknown Contact. See original ticket]Ilya Zakharevich wrote:
Yup.
We are talking past each other. If someone writes
I think you missed my point here. Consider the simple expression: Knowing this fact, we can know that after matching yada, we should
Ack! No! Throwing away the optimization when you notice that \1 is used is "yyyx" =~ /y{1,2}x/
Good point. :-) [ ...
I see your point but I will have to think about whether I agree. My understanding is that searching for fixed strings is done first
Where is the change-over? The code above is pretty slow, and if you
Whenever you want * to be maximally-greedy, make it unambiguously not [...]
I agree with something Larry once said, the point of REs is to provide
And there I see what made me uncomfortable with the idea. If you wish to extend (?{}) to be useful, here are the things that 1) (?{forward_code}{backtrack_code}). If you have to backtrack, give 2) (?\/) No backtracking past here. Many parsing problems come down Without the ability for embedded code to either execute knowing that Even then I am not sure how often I would use it since I don't face Cheers, |
From [Unknown Contact. See original ticket]On Wed, May 17, 2000 at 05:21:39PM -0400, Ben_Tilly@trepp.com wrote:
This is the target, but I do not see any relationship to your sentence
Finally, this is documented in 5.6.0 ;-) (if you do not consider fairy
I think this is a fairly complicated optimization. The sniffer would
"Will make"? As I said, it is doing it now (note "accessed").
Shows what?
I think this is more than an order of magnitude faster. Probably two.
But polynomial. Well, I agree that for all practical purposes n^9 is The changeover is governed by what is repeated by {n,m}. If it Of course, if you have 9 such groups in turn, this gives you n^9 time...
Just compare which recipe is easier-to-use and expresses your
There is a way: use localization and DESTROY, or DESTROY on $^R (which Some functionality like this is badly needed, but I'm not sure how it
This is (?>), but (?>) is better: nobody could explain what the above Ilya |
From [Unknown Contact. See original ticket]Ilya wrote:
While matching /Ala/ you are putting off the decision about whether I interpreted you claim about only optimizing one character of next [...]
It was always documented..in the source-code. Some forms of documentation are easier to read, that is all.
Read back a few emails. If it tried to deduce this from scratch, yes. But if you aggressively I laid out detailed examples before. The point is that in trying to Try it by hand and you should see what I mean. [...]
I meant that were you to still attempt the optimization on RExen where This is worthwhile.
Whoops, I had a braino. My example above should have been "yyyx" =~/y*y{2,3}x/
They are regular expressions that take the forms you thought you could In both of them the issue is that when you backtrack on the y* you [...]
From your description I understood that if the RE took above some So, does the optimization hit in this case?
To the computer, to me, or to whoever is unlucky enough to read my Plus the skill of writing unambiguous RExen is of general use.
AFAICS localization of variables is insufficient if your goal is to
This is an independent idea from (?>). Consider this: / This will match the next field in a .csv format. The point is that if / Oops, I have to embed the rule about what started a quoted field Providing an easy way to guarantee spots in an RE where you will not Programming is not so much the art of being able to do complex Cheers, |
From [Unknown Contact. See original ticket]On Wed, May 17, 2000 at 07:24:12PM -0400, Ben_Tilly@trepp.com wrote:
Thanks, I think I understand your language now.
Why? Now do the same with (?:laska|labama) etc.
Source code is good as documentation only until the next patch in the queue.
Do not see how this would work.
An interesting idea indeed.
You mean the second y: probably you suppose that after yyyx the position 3 is marked as "Cannot match y in y{2,3}", right? It
There is no instrumentingn of MEDIUM{n,m} or SIMPLE{n,m} for such an
Given DESTROY, you have enough rope for anything you want.
I do not think questions of efficiency should enter the discussion
Since I do not know what (?\/) is supposed to do, it is very hard to guess. Ilya |
From [Unknown Contact. See original ticket]Ilya Zakharevich wrote:
:-)
Now that you understand what I was thinking, you can see why I was OTOH now I am convinced it is not so simple, when you open yourself up Also you cannot just trie it arbitrarily far without some extra rules,
True. But the documentation is always less accurate about what is
Now that I examine it in detail, I was implicitly first applying a
Now that I think about it though, any performance problems which had Does the "visited position" optimization only apply to positions where [...]
Ah, I misunderstood your visited position optimization rule. That is
Thinking more about this example it gets very frustrating. I feel in My thinking feels a little muddy on how you would want to do this, but [...]
You mean I can theoretically write it that way. I would not want to [...]
Note that with good trieing the above (>..) construct should be As for what it is supposed to do, that is simple. The engine should A regular expression engine notes points where it *might* make a If you have a parsing engine of the form I described, your work is Cheers, |
From [Unknown Contact. See original ticket]On Thu, May 18, 2000 at 03:44:40PM -0400, Ben_Tilly@trepp.com wrote:
Why? (Unless you suppose a wrong implementation. ;-) \s*\s*\w+
Personally, I do not care much about what is currently true (unless in
Only before 'H' in (?:HERE){n,m}.
It is called DFA. ;-)
Just write a custom REx engine (which see) which converts (?{ CODE1 }{ CODE2 }) to (?{ local $re::unwind = new re::unwind sub {CODE2}; CODE1 }) with properly defined re::unwind::new and re::unwind::DESTROY.
No it would not. Note "" and ".
I consider things which work in REX1 but would not work in WHATEVER (REX1) WHATEVER_ELSE not that much useful. If you can design a syntax where you can Ilya |
From [Unknown Contact. See original ticket]
I was pointing out that in this case naive can easily be wrong. :-) [...]
Might I suggest changing that standard to, "If this breaks, Tom would (Just teasing!)
:-)
That is an interesting insight. If you take an RE and aggressively I wonder if this idea can be somehow extended to make NFAs in
Check again. The anchored subpattern terminates either at a '"' or at
This might not meet even a camel's beauty standard, but... Imitate the same block labelling mechanism as in loop labels. Define Voila. (Yes Tom, lacking aesthetic taste is an occupational hazard among Ben |
From [Unknown Contact. See original ticket]On Thu, May 18, 2000 at 05:55:24PM -0400, Ben_Tilly@trepp.com wrote:
You cannot omit (?>) in (?> (?:[^"]|"")* )" Since it changes the semantic. Deeper lookahead on the larger REx
E2BIZZARE. Ilya |
From [Unknown Contact. See original ticket]
To the contrary: elegance and creativity are essential. See --tom |
From [Unknown Contact. See original ticket]
One of us is miscounting parens? It should not take a particularly deep trieing to trie together the
From you I will take that as a high compliment. :-) Cheers, |
From [Unknown Contact. See original ticket]Tom Christiansen wrote:
But mathematicians often find that the most elegant way to express Cheers, |
From [Unknown Contact. See original ticket]On Wed, May 17, 2000 at 07:24:12PM -0400, Ben_Tilly@trepp.com wrote:
Apparently, you something like (?x: | (?{ die })) and putting the Ilya |
From [Unknown Contact. See original ticket]Ilya Zakharevich wrote:
Ick. After some puzzling I realize that that achieves the effect. But, ick. Ilya, I have to hand it to you. I would never have come up with that Cheers, |
From [Unknown Contact. See original ticket]On Fri, May 26, 2000 at 08:57:37AM -0400, Ben_Tilly@trepp.com wrote:
Well, I'm not sure that it would work as wanted from inside @matches = /$REx/g; But if *this* is the semantic you want, we can at least think whether Btw, as I said, the main problem with a scalable approach is how to (... ( ... \g{!2} ... ) ...) which would make the group 2 fail. Additionally, I'm thinking about whether we want an escape to disallow / ( would match comments in the C file without ''-literals, #-directives, Ilya |
From [Unknown Contact. See original ticket]Ilya Zaharevich wrote:
After seeing that die construct, I am getting more dubious by the
Hmmm... That brings up one absolutely huge request. $yyyymmdd =~ /(?$year: \d{4} ) That both gives you better naming on groups, as well as giving a nice
OK, now you have *me* confused... Cheers, |
From [Unknown Contact. See original ticket]On Fri, May 26, 2000 at 12:51:29PM -0400, Ben_Tilly@trepp.com wrote:
I do not think we want this. You are using the symbol table as a $yyyymmdd =~ /(?{year}: \d{4} ) which would set (?%RESULT: ) Plus a way to put *all* the matches of GROUP* into an anonymous array: (?[]*: GROUP)* Plus a way to put groups inside this one into an anonymous array (?[]: PATTERN ) Ilya |
From [Unknown Contact. See original ticket][I'm branching different questions on this thread.] On Fri, May 26, 2000 at 12:51:29PM -0400, Ben_Tilly@trepp.com wrote:
How to look for C comments? Of course, the FAQ answer is completely But when you found a literal, you need a) to fail; b) before failing to tell the REx engine that what is currently in a) is done by (?!), b) is done by \g!. \g{!1} is just an optimization telling that you cannot match / if you Ilya |
From [Unknown Contact. See original ticket]On May 26, Ilya Zakharevich said:
So -- correct me if I'm wrong (and I'm sure that would be done) -- \g! is |
From [Unknown Contact. See original ticket]On Fri, May 26, 2000 at 02:03:41PM -0400, Jeff Pinyan wrote:
A goold (gold/good) point. Set pos() here. Yes, indeed. Hmm, we need more: do not unset it in the case of backtracking. So in What you propose is closer to two other escapes \g< and \g> we badly / ab \g< cd \g> ef /x is equivalent to the current / (?< ab ) cd (?= ef ) /x but can be used in many other situations as well. So what you propose is basically \g>. But we *want* \g< and \g> to be Ilya |
From [Unknown Contact. See original ticket]Ilya Zakharevich wrote:
No, I would want this to work with whatever variables I want. Lexical
While slightly nicer than the current, I would find myself using the $yyyymmdd =~ /(\d{4})(\d\d)(\d\d)/; and cursing and swearing if I ever want to get this or that block, If I am using $year, $month, and $day in my code, those are going to
Hmmm... What if I wanted an array of arrays, for each match of the larger group, What if I wanted an array of hashes, for each match of the larger group, etc. There is clearly a need for some sort of generic "use this RE to Here is a proposal. Run through the RE. Once you are through the (?~$year: \d{4}) (?~{push @year}: \d{4}) The first is good enough for most basic needs. The second is flexible Cheers, |
From [Unknown Contact. See original ticket]On Fri, May 26, 2000 at 02:49:26PM -0400, Ben_Tilly@trepp.com wrote:
Put one such group inside another.
Likewise. Ilya |
From [Unknown Contact. See original ticket]On Fri, May 26, 2000 at 02:49:26PM -0400, Ben_Tilly@trepp.com wrote:
You can do it now with (?{}). If you want syntactic sugar, use a Ilya |
From [Unknown Contact. See original ticket]Ilya Zakharevich wrote:
*lightbulbs go off* Gotcha. (?!) fails because it is the negative of an trivially true assertion. \g! is a note to the RE that this is where it should restart trying
Please? BTW I suspect that a careful job of next-character discrimination Cheers, |
From [Unknown Contact. See original ticket]Ilya Zakharevich wrote:
Yes, but doing so during the match is considerably more complex, and Nevermind. I can live with $1, $2, $3. I will try some playing with custom REx engines though. Cheers, |
From @RandalSchwartz
Ben> $yyyymmdd =~ /(\d{4})(\d\d)(\d\d)/; Just for the record, this is dangerous code. Never use $1 unles you *You* may know that, but your code surely didn't, so I judge you |
From [Unknown Contact. See original ticket]On Fri, May 26, 2000 at 12:49:05PM -0700, "Randal L. Schwartz" wrote:
That's why we have: my($year, $month, $day) = ( :-) mark -- One ring to rule them all, one ring to find them, one ring to bring them all http://mark.mielke.cc/ |
From [Unknown Contact. See original ticket]Randal Schwartz wrote:
Only sometimes. :-) $1, etc are localized so in many cases you may be sure that you are
You are right that I was being lazy in the worst possible way in my Cheers, |
From [Unknown Contact. See original ticket]At 04:26 PM 5/26/00 -0400, Ben_Tilly@trepp.com wrote:
No they aren't. They're mutant pseudo-scalars attached to the underlying Don't Do That. :) Dan --------------------------------------"it's like this"------------------- |
From [Unknown Contact. See original ticket]On Fri, May 26, 2000 at 03:09:48PM -0400, Ben_Tilly@trepp.com wrote:
Just make your assignments the last thing in the pattern. Ilya |
From @smpetersUsing the code with the substitution that never terminates: steve@kirk sandbox $ perl subs.pl It appears that this has been fixed by 5.8.5 or earlier. |
@smpeters - Status changed from 'open' to 'resolved' |
Migrated from rt.perl.org#3248 (status was 'resolved')
Searchable as RT3248$
The text was updated successfully, but these errors were encountered: