Regexes which interpolate list elements incorrectly parsed #16478
From david@cantrell.org.uk

Created by david@cantrell.org.uk

If you interpolate a list element - ie $x[...] into a regex, bar

    $x[1] = "foo"; $_ = "foo"; s/$x[1]/bar/; print "$_\n";

NB I didn't discover this myself. User 'ernix' on stackoverflow
From @Abigail

On Mon, Mar 26, 2018 at 09:14:26AM -0700, David Cantrell wrote:

If one uses Deparse, it becomes clear where the difference is:

    $x[1] = 'foo'; $_ = 'foo'; s/$x[1]/bar/; print "$_\n";

No idea why this happens though.

Abigail
The RT System itself - Status changed from 'new' to 'open'
From @ilmari

Abigail <abigail@abigail.be> writes:

Additionally, if all the three digits are the same, it doesn't happen:

    $ perl -MO=Deparse -e '$x[666] = "foo"; $_ = "foo"; s/$x[666]/bar/; print "$_\n";'

from #london.pm:

    <@mst> oh, there's a thingy in the parser where it tries to work

- ilmari
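To compare several subscripts in one run rather than one `-MO=Deparse` invocation at a time, something like the following sketch may help. It assumes only the core `B::Deparse` module; `deparsed_subst` is a hypothetical helper name, and the list of subscripts is just the ones mentioned in this thread. What the output looks like depends on the perl being run, so none is shown here.

```perl
use strict;
use warnings;
use B::Deparse;

our (@x, $x);    # package variables so the eval'd code compiles under strict

my $deparse = B::Deparse->new;

# Hypothetical helper: compile s/$x[N]/bar/ for a given N and return the
# source text that B::Deparse reconstructs from the compiled code.
sub deparsed_subst {
    my ($n) = @_;
    my $code = eval "sub { s/\$x[$n]/bar/ }" or die $@;
    return $deparse->coderef2text($code);
}

# Print the deparsed form for each subscript so they can be compared by eye.
print "[$_] => ", deparsed_subst($_), "\n" for qw(1 100 666 1000);
```

Any difference between the printed forms shows how that perl parsed `$x[N]` inside the pattern.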
From @iabyn

On Mon, Mar 26, 2018 at 06:08:26PM +0100, Dagfinn Ilmari Mannsåker wrote:

The code is in S_intuit_more(). From the code comments:

    * Returns TRUE if there's more to the expression (e.g., a subscript),
    ....

That last comment is followed by 100 lines of code which calculates a
I now understand what's going on here. The question is how to decide whether [13579] is a subscript or a character class of the odd digits. I haven't checked, but it could well be that [97531] would be considered a subscript and [13579] a character class. See #19614 for more details.

The code changed in that PR is a heuristic for deciding. Now that I understand it, I can think of more criteria that could be thrown in, but it would still remain a heuristic. It's hard for me to imagine that it works as poorly as it does with perl working as well as it does. This ticket indicates that if it guesses wrong in favor of a character class, things don't work. I think it must be that if it guesses wrong in the other direction, it all gets sorted out somehow later. Otherwise, we'd be having many more bugs like this one.

The code special-cases a construct consisting of precisely 1 or 2 digits to be a subscript. It doesn't special-case other all-digit operands. The bias is towards it being a character class, so at three digits it remains so. At 4 digits, with the final three being the same digit ('000' in this case), the bias is overcome and it is considered a subscript. The bias is increased if the digits are consecutive, so [123] would be more likely to be considered a character class; it is decreased if the same character occurs multiple times, with each extra occurrence counting more towards a subscript than the previous one.

If guessing wrong towards a subscript is in fact recovered from, then we could just assume that all digits means a subscript. But if a wrong subscript guess is always recovered from, why have this routine at all?
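The scoring described above can be sketched roughly as follows, for all-digit operands only. This is not the code in `S_intuit_more()`: the function name `looks_like_subscript` is hypothetical, and the numeric weights (a starting bias of 2, +5 for a consecutive pair, and a penalty equal to the number of earlier sightings of the same digit) are assumptions chosen to match the description, not values quoted in this ticket.

```perl
use strict;
use warnings;

# Returns true if this sketch would read "[$inside]" after a variable in a
# pattern as an array subscript, false if as a character class.
# Only all-digit contents are modelled; every weight here is an assumption.
sub looks_like_subscript {
    my ($inside) = @_;
    die "digits only in this sketch\n" unless $inside =~ /\A[0-9]+\z/;

    # Special case from the discussion: exactly 1 or 2 digits is a subscript.
    return 1 if length($inside) <= 2;

    my $weight = 2;    # assumed starting bias towards a character class
    my %seen;          # how many times each digit has appeared so far
    my $prev;          # previous character, undef at the start

    for my $ch (split //, $inside) {
        # A character one above its predecessor (e.g. "12") looks like a class.
        $weight += 5 if defined $prev && ord($ch) == ord($prev) + 1;

        # Repeats push towards a subscript, each one more than the last:
        # 0 for the first sighting, then 1, then 2, and so on.
        $weight -= $seen{$ch} // 0;

        $seen{$ch}++;
        $prev = $ch;
    }

    # A negative total means the class bias was overcome.
    return $weight < 0 ? 1 : 0;
}

for my $n (qw(1 12 100 123 666 1000 13579)) {
    printf "%-8s %s\n", "[$n]",
        looks_like_subscript($n) ? "subscript" : "character class";
}
```

Under these assumed weights, [100] ends at 1 (character class) while [666] and [1000] both end at -1 (subscript), which is consistent with the behaviour reported earlier in this ticket.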
I doubt I could find it again, but many (20?) years ago I reported this with a script that showed the very random-looking patterns the heuristic throws up for a long list of integers. It's crazy stuff, but nobody seemed interested in touching it.
It would be useful to find evidence of that; I don't recall any mention of such a correction mechanism before.
The current code is buggy, it turns out. I don't see a correction mechanism, and the heuristics do fail.
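Since the guess can be wrong, code that cannot wait for a fix can sidestep the heuristic. As far as I know, perldata documents that curly braces force one reading or the other inside a pattern. A minimal sketch, where the scalar `$x` is a hypothetical extra variable added only so that the second form has something to interpolate:

```perl
use strict;
use warnings;

my @x;
$x[100] = "foo";
my $x = "";    # hypothetical unrelated scalar, only so that ${x} is defined

# ${x[100]} can only be read as the array element, so the pattern is "foo".
(my $elem = "foo") =~ s/${x[100]}/bar/;

# ${x}[100] can only be read as scalar $x followed by the class [100],
# which matches a single "1" or "0" and so does not match "foo".
(my $class = "foo") =~ s/${x}[100]/bar/;

print "element form: $elem\n";
print "class form:   $class\n";
```

The element form should turn "foo" into "bar", while the class form leaves it alone, whatever the heuristic would have guessed for the bare `$x[100]`.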
I would assume that if it biases towards assuming it is a character class then it probably does the right thing most of the time, especially if the subscript is long. I find it hard to believe that many people are writing patterns with high subscript values in array lookups; e.g., it's hard to believe that many people would actually have an array with 13k+ elements in it that they want to interpolate in a pattern. So if it assumes that is a character class, I doubt many, if any, people would stumble on the bug.
There is possibly still more to do. It might not be possible to cover all cases without reading `S_intuit_more()`. See <Perl/perl5#16478>.
Migrated from rt.perl.org#133027 (status was 'open')
Searchable as RT133027$