Regexes which interpolate list elements incorrectly parsed #16478

p5pRT · 2018-03-26T16:14:26Z

Migrated from rt.perl.org#133027 (status was 'open')

Searchable as RT133027$

p5pRT · 2018-03-26T16:14:26Z

From david@cantrell.org.uk

Created by david@cantrell.org.uk

If you interpolate a list element - ie $x[...] into a regex,
It appears to not always be parsed correctly. I would expect
this code to print "bar" four times. Instead it prints:

bar
bar
foo
bar

$x[1] = "foo"; $_ = "foo"; s/$x[1]/bar/; print "$_\n";
$x[10] = "foo"; $_ = "foo"; s/$x[10]/bar/; print "$_\n";
$x[100] = "foo"; $_ = "foo"; s/$x[100]/bar/; print "$_\n";
$x[1000] = "foo"; $_ = "foo"; s/$x[1000]/bar/; print "$_\n";

NB I didn't discover this myself. User 'ernix' on stackoverflow
did, but I couldn't see anything for it in rt.perl.org. Here's his
original message:

https://stackoverflow.com/questions/49492181/regex-substitution-using-a-list-element-with-just-3-digits-index-doesnt-work-as

Perl Info


Flags:
    category=core
    severity=medium

Site configuration information for perl 5.27.11:

Configured by dc at Mon Mar 26 16:20:20 BST 2018.

Summary of my perl5 (revision 5 version 27 subversion 11) configuration:
  Commit id: 811612a11efb1ebc131370e8238d3512779354f8
  Platform:
    osname=linux
    osvers=4.9.0-6-amd64
    archname=x86_64-linux-thread-multi
    uname='linux chimera-dev-vm 4.9.0-6-amd64 #1 smp debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 gnulinux '
    config_args='-de -Duse64bitall -Dusethreads -Dprefix=/home/dc/perl-blead -Dusedevel'
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=define
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
    bincompat5005=undef
  Compiler:
    cc='cc'
    ccflags ='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2'
    optimize='-O2'
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='6.3.0 20170516'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='cc'
    ldflags =' -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/6/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
    libs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.24.so
    so=so
    useshrplib=false
    libperl=libperl.a
    gnulibc_version='2.24'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E'
    cccdlflags='-fPIC'
    lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector-strong'



@INC for perl 5.27.11:
    /home/dc/perl-blead/lib/site_perl/5.27.11/x86_64-linux-thread-multi
    /home/dc/perl-blead/lib/site_perl/5.27.11
    /home/dc/perl-blead/lib/5.27.11/x86_64-linux-thread-multi
    /home/dc/perl-blead/lib/5.27.11


Environment for perl 5.27.11:
    HOME=/home/dc
    LANG=en_GB.UTF-8
    LANGUAGE=en_GB:en
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/dc/bin:/opt/perlbrew/bin:/opt/perlbrew/perls/perl-5.22.3/bin:/home/dc/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
    PERLBREW_BASHRC_VERSION=0.39
    PERLBREW_HOME=/home/dc/.perlbrew
    PERLBREW_MANPATH=/opt/perlbrew/perls/perl-5.22.3/man
    PERLBREW_PATH=/opt/perlbrew/bin:/opt/perlbrew/perls/perl-5.22.3/bin
    PERLBREW_PERL=perl-5.22.3
    PERLBREW_ROOT=/opt/perlbrew
    PERLBREW_VERSION=0.39
    PERL_BADLANG (unset)
    PERL_UNICODE=AS
    SHELL=/bin/bash

p5pRT · 2018-03-26T16:39:25Z

From @Abigail

On Mon, Mar 26, 2018 at 09:14:26AM -0700, David Cantrell wrote:

# New Ticket Created by David Cantrell
# Please include the string: [perl #133027]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=133027 >

This is a bug report for perl from david@cantrell.org.uk,
generated with the help of perlbug 1.41 running under perl 5.27.11.

-----------------------------------------------------------------
[Please describe your issue here]

If you interpolate a list element - ie $x[...] into a regex,
It appears to not always be parsed correctly. I would expect
this code to print "bar" four times. Instead it prints:

bar
bar
foo
bar

$x[1] = "foo"; $_ = "foo"; s/$x[1]/bar/; print "$_\n";
$x[10] = "foo"; $_ = "foo"; s/$x[10]/bar/; print "$_\n";
$x[100] = "foo"; $_ = "foo"; s/$x[100]/bar/; print "$_\n";
$x[1000] = "foo"; $_ = "foo"; s/$x[1000]/bar/; print "$_\n";

NB I didn't discover this myself. User 'ernix' on stackoverflow
did, but I couldn't see anything for it in rt.perl.org. Here's his
original message:

https://stackoverflow.com/questions/49492181/regex-substitution-using-a-list-element-with-just-3-digits-index-doesnt-work-as

If one uses Deparse, it becomes clear where the difference is:

$x[1] = 'foo'; $_ = 'foo'; s/$x[1]/bar/; print "$_\n";
$x[10] = 'foo'; $_ = 'foo'; s/$x[10]/bar/; print "$_\n";
$x[100] = 'foo'; $_ = 'foo'; s/${x}[100]/bar/; print "$_\n";
$x[1000] = 'foo'; $_ = 'foo'; s/$x[1000]/bar/; print "$_\n";

No idea why this happens though.

Abigail

p5pRT · 2018-03-26T16:39:25Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2018-03-26T17:08:49Z

From @ilmari

Abigail <abigail@abigail.be> writes:

If one uses Deparse, it becomes clear where the difference is:

$x[1] = 'foo'; $_ = 'foo'; s/$x[1]/bar/; print "$_\n";
$x[10] = 'foo'; $_ = 'foo'; s/$x[10]/bar/; print "$_\n";
$x[100] = 'foo'; $_ = 'foo'; s/${x}[100]/bar/; print "$_\n";
$x[1000] = 'foo'; $_ = 'foo'; s/$x[1000]/bar/; print "$_\n";

No idea why this happens though.

Additionally, if all the three digits are the same, it doesn't happen:

$ perl -MO=Deparse -e '$x[666] = "foo"; $_ = "foo"; s/$x[666]/bar/; print "$_\n";'
$x[666] = 'foo';
$_ = 'foo';
s/$x[666]/bar/;
print "$_\n";

from #london.pm:

<@mst> oh, there's a thingy in the parser where it tries to work
out what [] means that adds up a bunch of things to get a
score to decide which way to break
<@ilmari> huh, $x[111] works
<@ilmari> and 222, 333 etc
<@ilmari> it deparses at s/${x}[...]/bar/ when it fails
<@mst> my guess would be that 'all the same character inside the
[]' changes the scorte
<@ilmari> ô_Ô
<@ilmari> oh, it's trying to guess if it's an array subscript or a
character class
<@ilmari> but only if it's three characters...
<@mst> that heuristic scares me more than S_intuit_method
<@DrHyde> i'm kinda surprised that the parser doesn't first try as hard
as it can to find variables to interpolate, and only treat
$x[...] as normal regex line-noise if it can't
<@mst> the treatment of [] is *special*

- ilmari
--
"The surreality of the universe tends towards a maximum" -- Skud's Law
"Never formulate a law or axiom that you're not prepared to live with
the consequences of." -- Skud's Meta-Law

p5pRT · 2018-03-27T11:04:12Z

From @iabyn

On Mon, Mar 26, 2018 at 06:08:26PM +0100, Dagfinn Ilmari Mannsåker wrote:

<@mst> oh, there's a thingy in the parser where it tries to work
out what [] means that adds up a bunch of things to get a
score to decide which way to break

The code is in S_intuit_more(). From the code comments:

* Returns TRUE if there's more to the expression (e.g., a subscript),
* FALSE otherwise.
...
* if we're in a pattern and the first char is a [
* [] returns FALSE
* [SOMETHING] has a funky algorithm to decide whether it's a
* character class or not. It has to deal with things like
* /$foo[-3]/ and /$foo[$bar]/ as well as /$foo[$\d]+/
....
/* This is the one truly awful dwimmer necessary to conflate C and sed. */

....
/* this is terrifying, and it works */

That last comment is followed by 100 lines of code which calculates a
weight heuristic, and which I haven't even attempted to understand.

--
Spock (or Data) is fired from his high-ranking position for not being able
to understand the most basic nuances of about one in three sentences that
anyone says to him.
-- Things That Never Happen in "Star Trek" #19

khwilliamson · 2022-04-10T19:18:16Z

I now understand what's going on here.

The issue is how does one decide between [13579] meaning a subscript vs meaning a class of the odd digits. I haven't checked but it could well be that [97531] would be considered a subscript, and [13579] a character class. See #19614 for more details

The code changed in that PR is a heuristic for deciding. Now that I understand it, I can think of more criteria that could be thrown in, but it would still remain a heuristic. It's hard for me to imagine that it works as poorly as it does with perl working as well as it does. This ticket indicates that if it guess wrong in favor of a character class that things don't work. I think it must be that if it guesses things wrong the other direction, that it all gets sorted out somehow later. Otherwise, we'd be having many more bugs like this one.

The code special cases a construct consisting of precisely 1 or 2 digits to be a subscript. It doesn't special case other all-digit operands. The bias is towards it being a character class. So at three digits, it remains so. At 4 digits, with the final three being the same number '000' in this case, the bias is overcome and it is considered a subscript.

The bias is increased if the digits are consecutive [123] would be more likely to be considered a character class; and it is decreased if the same character occurs multiple times, with each extra occurrence value more towards making it a subscript than the previous such occurrence.

If in fact guessing wrong towards it being a subscript is recovered from, then we could just assume all digits means subscript.

But if guessing a subscript always is recovered from when wrong, why have this routine at all?

hvds · 2022-04-10T23:10:20Z

I doubt I could find it back, but many (20?) years ago I reported this with a script that showed the very random looking patterns the heuristic throws up for a long list of integers. It's crazy stuff, but nobody seemed interested in touching it.

I think it must be that if it guesses things wrong [in favour of a subscript], that it all gets sorted out somehow later. Otherwise, we'd be having many more bugs like this one.

It would be useful to find evidence of that, I don't recall any mention of such a correction mechanism before.

khwilliamson · 2022-04-17T20:18:12Z

The current code is buggy, it turns out. I don't see a correction mechanism, and the heuristics do fail

demerphq · 2022-04-18T05:01:30Z

This ticket indicates that if it guess wrong in favor of a character class that things don't work. I think it must be that if it guesses things wrong the other direction, that it all gets sorted out somehow later. Otherwise, we'd be having many more bugs like this one.

I would assume that if it biases towards assuming it is a character class then it probably does the right thing most of the time, especially if the subscript is long. I find it hard to believe that many people are writing patterns with high subscript values in array lookups.

eg,

/$thing[13579]/

its hard to believe that many people actually would have an array with 13k+ elements in it that they want to interpolate in a pattern. So if it assumes that is a character-class I doubt many, if any, people would stumble on the bug.

There are possibly still more to do. It might not be possible to cover all cases without reading `S_intuit_more()`. See <Perl/perl5#16478>.

p5pRT added Severity Medium type-core labels Oct 19, 2019

toddr removed the khw label Oct 25, 2019

xenu removed the affects-5.27 label Nov 19, 2021

xenu removed the Severity Medium label Dec 15, 2021

khwilliamson mentioned this issue Apr 10, 2022

toke.c: S_intuit_more: Cleanup #19614

Merged

khwilliamson added the type-regex label Apr 17, 2022

zmughal added a commit to zmughal/p5-PPR that referenced this issue Aug 7, 2022

Fix for some of the interpolation cases

22ea2c0

There are possibly still more to do. It might not be possible to cover all cases without reading `S_intuit_more()`. See <Perl/perl5#16478>.

zmughal added a commit to zmughal/p5-PPR that referenced this issue Aug 7, 2022

Fix for some of the interpolation cases

ca2d86b

There are possibly still more to do. It might not be possible to cover all cases without reading `S_intuit_more()`. See <Perl/perl5#16478>.

zmughal added a commit to zmughal/p5-PPR that referenced this issue Aug 8, 2022

Fix for some of the interpolation cases

0fb014c

There are possibly still more to do. It might not be possible to cover all cases without reading `S_intuit_more()`. See <Perl/perl5#16478>.

zmughal added a commit to zmughal/p5-PPR that referenced this issue Aug 8, 2022

Fix for some of the interpolation cases

461993c

There are possibly still more to do. It might not be possible to cover all cases without reading `S_intuit_more()`. See <Perl/perl5#16478>.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regexes which interpolate list elements incorrectly parsed #16478

Regexes which interpolate list elements incorrectly parsed #16478

p5pRT commented Mar 26, 2018

p5pRT commented Mar 26, 2018

p5pRT commented Mar 26, 2018

p5pRT commented Mar 26, 2018

p5pRT commented Mar 26, 2018

p5pRT commented Mar 27, 2018

khwilliamson commented Apr 10, 2022

hvds commented Apr 10, 2022

khwilliamson commented Apr 17, 2022

demerphq commented Apr 18, 2022

Regexes which interpolate list elements incorrectly parsed #16478

Regexes which interpolate list elements incorrectly parsed #16478

Comments

p5pRT commented Mar 26, 2018

p5pRT commented Mar 26, 2018

From david@cantrell.org.uk

Created by david@cantrell.org.uk

p5pRT commented Mar 26, 2018

From @Abigail

p5pRT commented Mar 26, 2018

p5pRT commented Mar 26, 2018

From @ilmari

p5pRT commented Mar 27, 2018

From @iabyn

khwilliamson commented Apr 10, 2022

hvds commented Apr 10, 2022

khwilliamson commented Apr 17, 2022

demerphq commented Apr 18, 2022