Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

$1 suddenly tainted after regexp on utf-8 string #9102

Closed
p5pRT opened this issue Nov 6, 2007 · 10 comments
Closed

$1 suddenly tainted after regexp on utf-8 string #9102

p5pRT opened this issue Nov 6, 2007 · 10 comments

Comments

@p5pRT
Copy link

p5pRT commented Nov 6, 2007

Migrated from rt.perl.org#47195 (status was 'resolved')

Searchable as RT47195$

@p5pRT
Copy link
Author

p5pRT commented Nov 6, 2007

From dst@heise.de

This is a bug report for perl from dst@​heise.de,
generated with the help of perlbug 1.35 running under perl v5.8.8.


After applying a regex using \S on a string which is flagged as utf-8
the submatch is unexpectedly tainted. You can reproduce the behaviour
with this code​:

#!/usr/bin/perl -T

use warnings;
use strict;
use Scalar​::Util qw(tainted);

my $line = 'A';
utf8​::upgrade($line);
$line =~ /(\S*)/;
my $first = $1;
print tainted($first) . "\n";

__END__

$first is unexpectedly tainted. If you comment out the line which
flags $line as utf-8, $first is not tainted any longer.

The behaviour is identical on a self-built perl5.9.5.



Flags​:
  category=core
  severity=high


Site configuration information for perl v5.8.8​:

Configured by Debian Project at Wed Dec 6 23​:17​:41 UTC 2006.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration​:
  Platform​:
  osname=linux, osvers=2.6.18.3, archname=i486-linux-gnu-thread-multi
  uname='linux saens 2.6.18.3 #1 smp sat nov 25 13​:39​:52 est 2006 i686 gnulinux '
  config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.8 -Dsitearch=/usr/local/lib/perl/5.8.8 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.8 -Dd_dosuid -des'
  hint=recommended, useposix=true, d_sigaction=define
  usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
  useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
  use64bitint=undef use64bitall=undef uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O2',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
  ccversion='', gccversion='4.1.2 20061115 (prerelease) (Debian 4.1.1-20)', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags =' -L/usr/local/lib'
  libpth=/usr/local/lib /lib /usr/lib
  libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
  perllibs=-ldl -lm -lpthread -lc -lcrypt
  libc=/lib/libc-2.3.6.so, so=so, useshrplib=true, libperl=libperl.so.5.8.8
  gnulibc_version='2.3.6'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches​:
 


@​INC for perl v5.8.8​:
  /etc/perl
  /usr/local/lib/perl/5.8.8
  /usr/local/share/perl/5.8.8
  /usr/lib/perl5
  /usr/share/perl5
  /usr/lib/perl/5.8
  /usr/share/perl/5.8
  /usr/local/lib/site_perl
  .


Environment for perl v5.8.8​:
  HOME=/net/juan01.heise.de/opt/home/dst
  LANG=en_GB
  LANGUAGE=en_DE​:en_US​:en_GB​:en
  LC_COLLATE=POSIX
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/net/juan01.heise.de/opt/home/dst/lib/git/bin​:/net/juan01.heise.de/opt/home/dst/bin​:/net/juan01.heise.de/opt/home/dst/lib/git/bin​:/net/juan01.heise.de/opt/home/dst/bin​:/usr/local/bin​:/usr/bin​:/bin​:/usr/bin/X11​:/usr/games
  PERL_BADLANG (unset)
  SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Nov 6, 2007

From @nwc10

On Tue, Nov 06, 2007 at 07​:38​:36AM -0800, dst @​ heise. de wrote​:

After applying a regex using \S on a string which is flagged as utf-8
the submatch is unexpectedly tainted. You can reproduce the behaviour
with this code​:

#!/usr/bin/perl -T

use warnings;
use strict;
use Scalar​::Util qw(tainted);

my $line = 'A';
utf8​::upgrade($line);
$line =~ /(\S*)/;
my $first = $1;
print tainted($first) . "\n";

__END__

$first is unexpectedly tainted. If you comment out the line which
flags $line as utf-8, $first is not tainted any longer.

It's in still in blead. I thought that it was something to do with the swash
loading code​:

$ cat 47195.pl
my $line = 'A1' . substr ($^X, 0, 0) . chr shift;
chop $line;
$line =~ /(A\S*)/;
print eval { join("",$1), kill 0; 1 } ? "untainted\n" : "tainted\n";

my $line = 'A1' . substr ($^X, 0, 0) . chr shift;
chop $line;
$line =~ /(A\S*)/;
print eval { join("",$1), kill 0; 1 } ? "untainted\n" : "tainted\n";

__END__
$ ./perl -Ilib -T 47195.pl 256
tainted
untainted

But it doesn't seem to be, because if I change the program to be a loop​:

$ cat 47195.pl
for (0..1) {
  my $line = 'A1' . substr ($^X, 0, 0) . chr $ARGV[0];
  chop $line;
  $line =~ /(A\S*)/;
  print eval { join("",$1), kill 0; 1 } ? "untainted\n" : "tainted\n";
}

__END__
$ ./perl -Ilib -T 47195.pl 256
tainted
tainted

the results change. (Correct output would be "untainted" both times)

Certainly, the solution isn't this simple​:

$ p4 diff
==== //depot/perl/regcomp.c#618 - /home/nick/p4perl/perl/regcomp.c ====

Inline Patch
--- /tmp/tmp.59321.0    Tue Nov  6 15:55:47 2007
+++ /home/nick/p4perl/perl/regcomp.c    Tue Nov  6 15:54:11 2007
@@ -9627,7 +9627,10 @@ Perl_save_re_context(pTHX)
     struct re_save_state *state;
 
     SAVEVPTR(PL_curcop);
-    SSGROW(SAVESTACK_ALLOC_FOR_RE_SAVE_STATE + 1);
+    /* Add 3 for the save bool, and do the check in one to avoid a possible
+       second reallocation.  */
+    SSGROW(SAVESTACK_ALLOC_FOR_RE_SAVE_STATE + 1 + 3);
+    SAVEBOOL(PL_tainted);
 
     state = (struct re_save_state *)(PL_savestack + PL_savestack_ix);
     PL_savestack_ix += SAVESTACK_ALLOC_FOR_RE_SAVE_STATE;


Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Nov 6, 2007

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Nov 6, 2007

From @nwc10

use more 'coffee';

On Tue, Nov 06, 2007 at 03​:59​:01PM +0000, Nicholas Clark wrote​:

It's in still in blead. I thought that it was something to do with the swash
loading code​:

$ cat 47195.pl
my $line = 'A1' . substr ($^X, 0, 0) . chr shift;
chop $line;
$line =~ /(A\S*)/;
print eval { join("",$1), kill 0; 1 } ? "untainted\n" : "tainted\n";

my $line = 'A1' . substr ($^X, 0, 0) . chr shift;
chop $line;
$line =~ /(A\S*)/;
print eval { join("",$1), kill 0; 1 } ? "untainted\n" : "tainted\n";

__END__
$ ./perl -Ilib -T 47195.pl 256
tainted
untainted

But it doesn't seem to be, because if I change the program to be a loop​:

$ cat 47195.pl
for (0..1) {
my $line = 'A1' . substr ($^X, 0, 0) . chr $ARGV[0];
chop $line;
$line =~ /(A\S*)/;
print eval { join("",$1), kill 0; 1 } ? "untainted\n" : "tainted\n";
}

Then I had to change shift, which I didn't realise I'd not done in the first
example.

Therefore, it's nothing to do with the swatch *loading* code.

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Nov 6, 2007

From @hvds

Nicholas Clark <nick@​ccl4.org> wrote​:
:On Tue, Nov 06, 2007 at 07​:38​:36AM -0800, dst @​ heise. de wrote​:
:
:> After applying a regex using \S on a string which is flagged as utf-8
:> the submatch is unexpectedly tainted. You can reproduce the behaviour
:> with this code​:
[...]
:> $first is unexpectedly tainted. If you comment out the line which
:> flags $line as utf-8, $first is not tainted any longer.
:
:It's in still in blead. I thought that it was something to do with the swash
:loading code​:
[...]
:But it doesn't seem to be, because if I change the program to be a loop​:
[...]
:the results change. (Correct output would be "untainted" both times)

Is it because of the way multiple magic is handled, by hiding the first
magic in the chain when handing on to the next? I seem to remember that
causing a problem similar to this in the past.

Hugo

@p5pRT
Copy link
Author

p5pRT commented Nov 7, 2007

From rick@bort.ca

On Nov 06 2007, dst@​heise.de wrote​:

After applying a regex using \S on a string which is flagged as utf-8
the submatch is unexpectedly tainted. You can reproduce the behaviour
with this code​:

#!/usr/bin/perl -T

use warnings;
use strict;
use Scalar​::Util qw(tainted);

my $line = 'A';
utf8​::upgrade($line);
$line =~ /(\S*)/;
my $first = $1;
print tainted($first) . "\n";

__END__

Attached patch fixes this. I had a look through the rest of the switch
statement to see if there were any other mis(sing|placed) breaks and I'm
pretty sure there should be one at the end of case NDIGIT​:

  case NDIGIT​:
  if (do_utf8) {
  loceol = PL_regeol;
  LOAD_UTF8_CHARCLASS_DIGIT();
  while (hardcount < max && scan < loceol &&
  !swash_fetch(PL_utf8_digit, (U8*)scan, do_utf8)) {
  scan += UTF8SKIP(scan);
  hardcount++;
  }
  } else {
  while (scan < loceol && !isDIGIT(*scan))
  scan++;
  }
  case LNBREAK​:

Case LNBREAK was added with change 31026 which looks like it may have
missed adding break to a few switch statements. I haven't looked at it
closely, though, just browsed the change.

  http​://public.activestate.com/cgi-bin/perlbrowse/p/31026

--
Rick Delaney
rick@​bort.ca

@p5pRT
Copy link
Author

p5pRT commented Nov 7, 2007

From rick@bort.ca

47195.patch
diff -pruN perl-current/regexec.c perl-current-dev/regexec.c
--- perl-current/regexec.c	2007-08-30 09:49:59.000000000 -0400
+++ perl-current-dev/regexec.c	2007-11-06 17:49:04.000000000 -0500
@@ -5405,8 +5405,8 @@ S_regrepeat(pTHX_ const regexp *prog, co
 	} else {
 	    while (scan < loceol && !isSPACE(*scan))
 		scan++;
-	    break;
 	}
+	break;
     case NSPACEL:
 	PL_reg_flags |= RF_tainted;
 	if (do_utf8) {

@p5pRT
Copy link
Author

p5pRT commented Nov 7, 2007

From @nwc10

On Tue, Nov 06, 2007 at 07​:18​:45PM -0500, Rick Delaney wrote​:

On Nov 06 2007, dst@​heise.de wrote​:

After applying a regex using \S on a string which is flagged as utf-8
the submatch is unexpectedly tainted. You can reproduce the behaviour
with this code​:

#!/usr/bin/perl -T

use warnings;
use strict;
use Scalar​::Util qw(tainted);

my $line = 'A';
utf8​::upgrade($line);
$line =~ /(\S*)/;
my $first = $1;
print tainted($first) . "\n";

__END__

Attached patch fixes this. I had a look through the rest of the switch

Thanks, applied (change 32236)

statement to see if there were any other mis(sing|placed) breaks and I'm
pretty sure there should be one at the end of case NDIGIT​:

case NDIGIT&#8203;:
    if \(do\_utf8\) \{
        loceol = PL\_regeol;
        LOAD\_UTF8\_CHARCLASS\_DIGIT\(\);
        while \(hardcount \< max && scan \< loceol &&
               \!swash\_fetch\(PL\_utf8\_digit\, \(U8\*\)scan\, do\_utf8\)\) \{
            scan \+= UTF8SKIP\(scan\);
            hardcount\+\+;
        \}
    \} else \{
        while \(scan \< loceol && \!isDIGIT\(\*scan\)\)
            scan\+\+;
    \}
case LNBREAK&#8203;:

Case LNBREAK was added with change 31026 which looks like it may have
missed adding break to a few switch statements. I haven't looked at it
closely, though, just browsed the change.

http&#8203;://public\.activestate\.com/cgi\-bin/perlbrowse/p/31026

At first I assume that I can't make a regression test that shows a problem on
that one because the characters that LNBREAK matches are a subset of those
that NDIGIT matches, so it's not going to match anything extra.

But something is niggling me that it ought to be possible to expose this one
somehow - by crafting a string that gets a spurious overlong match.

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Nov 16, 2008

From @demerphq

2007/11/7 Rick Delaney <rick@​bort.ca>​:

On Nov 06 2007, dst@​heise.de wrote​:

After applying a regex using \S on a string which is flagged as utf-8
the submatch is unexpectedly tainted. You can reproduce the behaviour
with this code​:

#!/usr/bin/perl -T

use warnings;
use strict;
use Scalar​::Util qw(tainted);

my $line = 'A';
utf8​::upgrade($line);
$line =~ /(\S*)/;
my $first = $1;
print tainted($first) . "\n";

__END__

Attached patch fixes this. I had a look through the rest of the switch
statement to see if there were any other mis(sing|placed) breaks and I'm
pretty sure there should be one at the end of case NDIGIT​:

case NDIGIT​:
if (do_utf8) {
loceol = PL_regeol;
LOAD_UTF8_CHARCLASS_DIGIT();
while (hardcount < max && scan < loceol &&
!swash_fetch(PL_utf8_digit, (U8*)scan, do_utf8)) {
scan += UTF8SKIP(scan);
hardcount++;
}
} else {
while (scan < loceol && !isDIGIT(*scan))
scan++;
}
case LNBREAK​:

Case LNBREAK was added with change 31026 which looks like it may have
missed adding break to a few switch statements. I haven't looked at it
closely, though, just browsed the change.

I think there was only the one missing break, which your patch fixed.

Sorry about that, and thanks,
yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Nov 25, 2008

@smpeters - Status changed from 'open' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant