Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maximum string length with substr #9634

Closed
p5pRT opened this issue Jan 22, 2009 · 36 comments
Closed

Maximum string length with substr #9634

p5pRT opened this issue Jan 22, 2009 · 36 comments

Comments

@p5pRT
Copy link

p5pRT commented Jan 22, 2009

Migrated from rt.perl.org#62646 (status was 'resolved')

Searchable as RT62646$

@p5pRT
Copy link
Author

p5pRT commented Jan 22, 2009

From skylar2@u.washington.edu

Created by skylar2@u.washington.edu

e've run into a problem where substr appears to have a maximum string
length of 2^31 bytes even on 64-bit hosts (or at least AMD64 hosts). We've
run into this problem with the Red Hat-supplied Perl 5.8.8 on RHEL5, and
also a self-compiled install of 5.10.0. Here's a script to exercise the
bug​:

===

#!/usr/bin/perl

use strict;
use warnings;

my($file,$chars,$text,$length);
$file = "/dev/zero";

open(FILE, "< $file") or
  die "Can't open $file for reading​: $!\n";

$text = q{ }; # Make sure there's at least one character to run substr on
do {
  read(FILE,$chars,1);
  $text .= $chars;
} while(substr($text,1,1)); # This appears to die when $text is 2GB

close(FILE);

$length = length($text);
print "substr died when text was $length bytes long.\n";

===

This will create this output​:

===

substr outside of string at /net/gs/vol1/home/skylar2/cfm/scripts/nick/big_substr.pl line 13.
substr died when text was 2147483649 bytes long.

===

Perl Info

Flags:
    category=core
    severity=low

This perlbug was built using Perl v5.8.8 in the Red Hat build system.
It is being executed now by Perl v5.8.8 - Tue Oct 23 12:21:01 EDT 2007.

Site configuration information for perl v5.8.8:

Configured by Red Hat, Inc. at Tue Oct 23 12:21:01 EDT 2007.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.9-55.0.9.elsmp, archname=x86_64-linux-thread-multi
    uname='linux hs20-bc1-7.build.redhat.com 2.6.9-55.0.9.elsmp #1 smp tue sep 25 02:16:15 edt 2007 x86_64 x86_64 x86_64 gnulinux '
    config_args='-des -Doptimize=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -Dversion=5.8.8 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Dprivlib=/usr/lib/perl5/5.8.8 -Dsitelib=/usr/lib/perl5/site_perl/5.8.8 -Dvendorlib=/usr/lib/perl5/vendor_perl/5.8.8 -Darchlib=/usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi -Dsitearch=/usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi -Dvendorarch=/usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi -Darchname=x86_64-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_pr!
 oto -Ud_sethostent_r_proto -Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto -Ud_setservent_r_proto -Dinc_version_list=5.8.7 5.8.6 5.8.5 -Dscriptdir=/usr/bin'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=define use64bitall=define uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='4.1.1 20070105 (Red Hat 4.1.1-52)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =''
    libpth=/usr/local/lib64 /lib64 /usr/lib64
    libs=-lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
    perllibs=-lresolv -lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic'

Locally applied patches:
    


@INC for perl v5.8.8:
    /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi
    /usr/lib64/perl5/site_perl/5.8.7/x86_64-linux-thread-multi
    /usr/lib64/perl5/site_perl/5.8.6/x86_64-linux-thread-multi
    /usr/lib64/perl5/site_perl/5.8.5/x86_64-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.8
    /usr/lib/perl5/site_perl/5.8.7
    /usr/lib/perl5/site_perl/5.8.6
    /usr/lib/perl5/site_perl/5.8.5
    /usr/lib/perl5/site_perl
    /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi
    /usr/lib64/perl5/vendor_perl/5.8.7/x86_64-linux-thread-multi
    /usr/lib64/perl5/vendor_perl/5.8.6/x86_64-linux-thread-multi
    /usr/lib64/perl5/vendor_perl/5.8.5/x86_64-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.8
    /usr/lib/perl5/vendor_perl/5.8.7
    /usr/lib/perl5/vendor_perl/5.8.6
    /usr/lib/perl5/vendor_perl/5.8.5
    /usr/lib/perl5/vendor_perl
    /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi
    /usr/lib/perl5/5.8.8
    .


Environment for perl v5.8.8:
    HOME=/net/gs/vol1/home/skylar2
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LC_ALL=en_US
    LC_CTYPE=en_US
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/nfs/sge/bin/lx24-amd64:/nfs/sge/bin/lx24-amd64:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/net/gs/vol1/home/skylar2/bin:/usr/bin:/usr/pkg/bin:/usr/pkg/sbin:/usr/local/bin:/usr/local/sbin:/sbin:/usr/sbin:/opt/csw/bin:/opt/csw/sbin:/usr/local/maui/bin:/usr/local/maui/sbin:/net/gs/vol1/home/skylar2/bin:/net/gs/vol1/home/skylar2/software/gerris/bin:/net/gs/vol1/home/skylar2/software/gts/bin:/usr/local/pgsql/bin:/net/maccoss/vol2/software/python2.4/bin/
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Jan 23, 2009

From @nwc10

On Thu, Jan 22, 2009 at 11​:12​:20AM -0800, skylar2@​u.washington.edu (via RT) wrote​:

e've run into a problem where substr appears to have a maximum string
length of 2^31 bytes even on 64-bit hosts (or at least AMD64 hosts). We've
run into this problem with the Red Hat-supplied Perl 5.8.8 on RHEL5, and
also a self-compiled install of 5.10.0. Here's a script to exercise the
bug​:

Thanks for the bug report. I can replicate it somewhat more tersely​:

$ ./perl -lwe 'print "$_ gives " . substr ("x" x $_, 1, 1) for 2147483648, 2147483649'
2147483648 gives x
substr outside of string at -e line 1.
Use of uninitialized value in concatenation (.) or string at -e line 1.
2147483649 gives

The problem seems to be this part of the implementation of substr​:

int
Perl_magic_getsubstr(pTHX_ SV *sv, MAGIC *mg)
{
  STRLEN len;
  SV * const lsv = LvTARG(sv);
  const char * const tmps = SvPV_const(lsv,len);
  I32 offs = LvTARGOFF(sv);
  I32 rem = LvTARGLEN(sv);

  PERL_ARGS_ASSERT_MAGIC_GETSUBSTR;
  PERL_UNUSED_ARG(mg);

  if (SvUTF8(lsv))
  sv_pos_u2b(lsv, &offs, &rem);
  if (offs > (I32)len)
  offs = len;
  if (rem + offs > (I32)len)
  rem = len - offs;
  sv_setpvn(sv, tmps + offs, (STRLEN)rem);
  if (SvUTF8(lsv))
  SvUTF8_on(sv);
  return 0;
}

which is using variables of type I32, which will be a signed 32 bit integer
on all platforms (except whichever Cray it is that only has 64 bit types)

It's using I32 because the interface of sv_pos_u2b() is using I32 pointers,
clearly a mistake in hindsight. sv_pos_u2b() was added in 1998​:

http​://perl5.git.perl.org/perl.git/blame/fdf134946da249a71c49962435817212b8fa195a​:/sv.c#l3236

I suspect we need to write a replacement that uses STRLEN pointers, and
deprecate the old interface. (It doesn't work changing the type of a pointer
on an existing function, as it will cause memory corruption when the revised
function is writing 8 bytes, but old code passes in a pointer to a variable
that is only 4 bytes long)

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Jan 23, 2009

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Sep 12, 2009

From perlbug@plan9.de

Created by perlbug@plan9.de

I found that substr acts weirdly when confronted with larger than ~2gb
strings​:

  warn length $data;
  substr $data, 0, 256;

  15701683970 at ...
  substr outside of string at ...

My perl uses a uint64_t for STRLEN (standard amd64), so I would expect
substr to handle this, given that I pay the memory overhead to store an
8-byte length everywhere :)

It seems that pp_substr uses I32 for everything, which is of course not
enough.

A cursory glance over the rets of pp.c indicates that perl simply can't
handle >2gb scalars, even on a 64 bit system, as it uses I32 almost
everywhere (pp_index, pp_reverse, about anything that deals with string
offsets uses I32).

15GB seems like a lot, but I don't think thr wish to, say, load a DVD
image into memory is so far off in the future.

Maybe the safe way for now would be to disallow >31 bit scalar lengths?

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl 5.10.0:

Configured by Marc Lehmann at Sat Feb 21 02:30:27 CET 2009.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.24-etchnhalf.1-amd64, archname=amd64-linux
    uname='linux cerebro 2.6.24-etchnhalf.1-amd64 #1 smp mon jul 21 10:36:02 utc 2008 x86_64 gnulinux '
    config_args='-Duselargefiles -Dxxxxuse64bitint -Uuse64bitall -Dusemymalloc=n -Dcc=gcc -Dccflags=-ggdb -gdwarf-2 -g3 -Dcppflags=-DPERL_ARENA_SIZE=16368 -D_GNU_SOURCE -I/opt/include -Doptimize=-O6 -msse2 -funroll-loops -fno-strict-aliasing -Dcccdlflags=-fPIC -Dldflags=-L/opt/perl/lib -L/opt/lib -Dlibs=-ldl -lm -lcrypt -Darchname=amd64-linux -Dprefix=/opt/perl -Dprivlib=/opt/perl/lib/perl5 -Darchlib=/opt/perl/lib/perl5 -Dvendorprefix=/opt/perl -Dvendorlib=/opt/perl/lib/perl5 -Dvendorarch=/opt/perl/lib/perl5 -Dsiteprefix=/opt/perl -Dsitelib=/opt/perl/lib/perl5 -Dsitearch=/opt/perl/lib/perl5 -Dsitebin=/opt/perl/bin -Dman1dir=/opt/perl/man/man1 -Dman3dir=/opt/perl/man/man3 -Dsiteman1dir=/opt/perl/man/man1 -Dsiteman3dir=/opt/perl/man/man3 -Dman1ext=1 -Dman3ext=3 -Dpager=/usr/bin/less -Uafs -Uusesfio -Uusenm -Uuseshrplib -Dd_dosuid -Dusethreads=undef -Duse5005threads=undef -Duseithreads=undef -Dusemultiplicity=undef -Demail=perl-binary@plan9.de -Dcf_email=perl-binary@plan9.de -Dcf_by=Marc Lehmann -Dlocincpth=/opt/perl/include /opt/include -Dmyhostname=localhost -Dmultiarch=undef -Dbin=/opt/perl/bin -Dxxxusedevel -DxxxDEBUGGING -Dxxxuse_debugging_perl -Dxxxuse_debugmalloc -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-ggdb -gdwarf-2 -g3 -fno-strict-aliasing -pipe -I/opt/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O6 -msse2 -funroll-loops -fno-strict-aliasing',
    cppflags='-DPERL_ARENA_SIZE=16368 -D_GNU_SOURCE -I/opt/include -ggdb -gdwarf-2 -g3 -fno-strict-aliasing -pipe -I/opt/include'
    ccversion='', gccversion='4.3.2', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags ='-L/opt/perl/lib -L/opt/lib -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
    libs=-ldl -lm -lcrypt
    perllibs=-ldl -lm -lcrypt
    libc=/lib/libc-2.7.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.7'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O6 -msse2 -funroll-loops -fno-strict-aliasing -L/opt/perl/lib -L/opt/lib -L/usr/local/lib'

Locally applied patches:
    http://public.activestate.com/cgi-bin/perlbrowse/p/34209
    http://public.activestate.com/cgi-bin/perlbrowse/p/34507
    http://www.gossamer-threads.com/lists/perl/porters/232549
    embed.fnc:Perl_vcroak NULLOK


@INC for perl 5.10.0:
    /root/src/sex
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    .


Environment for perl 5.10.0:
    HOME=/root
    LANG (unset)
    LANGUAGE (unset)
    LC_CTYPE=en_US.UTF-8
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/root/s2:/root/s:/opt/bin:/opt/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/X11/bin:/usr/games:/usr/local/bin:/usr/local/sbin:/root/pserv:.
    PERL5LIB=/root/src/sex
    PERL5_CPANPLUS_CONFIG=/root/.cpanplus/config
    PERLDB_OPTS=ornaments=0
    PERL_ANYEVENT_DBI_TESTS=1
    PERL_ANYEVENT_EDNS0=1
    PERL_ANYEVENT_NET_TESTS=1
    PERL_ANYEVENT_PROTOCOLS=ipv4,ipv6
    PERL_ANYEVENT_STRICT=1
    PERL_BADLANG (unset)
    PERL_UNICODE=E
    SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Oct 10, 2009

From @fbriere

This is a bug report for perl from fbriere@​fbriere.net,
generated with the help of perlbug 1.36 running under perl 5.10.0.


substr() is behaving strangely on values larger than an int​:

$ perl -le 'print substr "abcd", 0, $_ for 2**31-1, 2**31, 2**32'
abcd

abc

(Obviously, this was run on a 32-bit machine.)



Flags​:
  category=core
  severity=medium


Site configuration information for perl 5.10.0​:

Configured by Debian Project at Sun Aug 16 22​:37​:28 UTC 2009.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration​:
  Platform​:
  osname=linux, osvers=2.6.26-2-amd64, archname=i486-linux-gnu-thread-multi
  uname='linux puccini 2.6.26-2-amd64 #1 smp fri aug 14 07​:12​:04 utc 2009 i686 gnulinux '
  config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.0 -Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=define, usemultiplicity=define
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=undef, use64bitall=undef, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O2 -g',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
  ccversion='', gccversion='4.3.4', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags =' -L/usr/local/lib'
  libpth=/usr/local/lib /lib /usr/lib /usr/lib64
  libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
  perllibs=-ldl -lm -lpthread -lc -lcrypt
  libc=/lib/libc-2.9.so, so=so, useshrplib=true, libperl=libperl.so.5.10.0
  gnulibc_version='2.9'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'

Locally applied patches​:
 


@​INC for perl 5.10.0​:
  /etc/perl
  /usr/local/lib/perl/5.10.0
  /usr/local/share/perl/5.10.0
  /usr/lib/perl5
  /usr/share/perl5
  /usr/lib/perl/5.10
  /usr/share/perl/5.10
  /usr/local/lib/site_perl
  .


Environment for perl 5.10.0​:
  HOME=/home/fbriere
  LANG=en_CA.UTF-8
  LANGUAGE (unset)
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/usr/local/bin​:/usr/bin​:/bin​:/usr/games
  PERL_BADLANG (unset)
  SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Jan 5, 2010

From zefram@fysh.org

Attached patch should fix bug #62646.

-zefram

@p5pRT
Copy link
Author

p5pRT commented Jan 5, 2010

From zefram@fysh.org

Inline Patch
diff --git a/pp.c b/pp.c
index c659b13..4735c94 100644
--- a/pp.c
+++ b/pp.c
@@ -3079,12 +3079,12 @@ PP(pp_substr)
 {
     dVAR; dSP; dTARGET;
     SV *sv;
-    I32 len = 0;
+    IV len = 0;
     STRLEN curlen;
     STRLEN utf8_curlen;
-    I32 pos;
-    I32 rem;
-    I32 fail;
+    IV pos;
+    IV rem;
+    IV fail;
     const I32 lvalue = PL_op->op_flags & OPf_MOD || LVRET;
     const char *tmps;
     const I32 arybase = CopARYBASE_get(PL_curcop);
@@ -3147,7 +3147,7 @@ PP(pp_substr)
 	    rem = curlen;
 	else if (len >= 0) {
 	    rem = pos+len;
-	    if (rem > (I32)curlen)
+	    if (rem > (IV)curlen)
 		rem = curlen;
 	}
 	else {
@@ -3167,8 +3167,8 @@ PP(pp_substr)
 	RETPUSHUNDEF;
     }
     else {
-	const I32 upos = pos;
-	const I32 urem = rem;
+	const IV upos = pos;
+	const IV urem = rem;
 	if (utf8_curlen)
 	    sv_pos_u2b(sv, &pos, &rem);
 	tmps += pos;
diff --git a/t/re/substr.t b/t/re/substr.t
index c3fa6e1..900c8f7 100644
--- a/t/re/substr.t
+++ b/t/re/substr.t
@@ -24,7 +24,7 @@ $SIG{__WARN__} = sub {
 
 require './test.pl';
 
-plan(334);
+plan(338);
 
 run_tests() unless caller;
 
@@ -682,4 +682,19 @@ is($x, "\x{100}\x{200}\xFFb");
     is(substr($a,1,1), 'b');
 }
 
+# [perl #62646] offsets exceeding 32 bits on 64-bit system
+SKIP: {
+    skip("32-bit system", 4) unless ~0 > 0xffffffff;
+    my $a = "abc";
+    my $r;
+    $w = 0;
+    $r = substr($a, 0xffffffff, 1);
+    is($r, undef);
+    is($w, 1);
+    $w = 0;
+    $r = substr($a, 0xffffffff+1, 1);
+    is($r, undef);
+    is($w--, 1);
+}
+
 }

@p5pRT
Copy link
Author

p5pRT commented Jan 5, 2010

From @nwc10

On Tue, Jan 05, 2010 at 10​:18​:42PM +0000, Zefram wrote​:

Attached patch should fix bug #62646.

+++ b/t/re/substr.t

@​@​ -682,4 +682,19 @​@​ is($x, "\x{100}\x{200}\xFFb");
is(substr($a,1,1), 'b');
}

+# [perl #62646] offsets exceeding 32 bits on 64-bit system
+SKIP​: {
+ skip("32-bit system", 4) unless ~0 > 0xffffffff;
+ my $a = "abc";
+ my $r;
+ $w = 0;
+ $r = substr($a, 0xffffffff, 1);
+ is($r, undef);
+ is($w, 1);
+ $w = 0;
+ $r = substr($a, 0xffffffff+1, 1);
+ is($r, undef);
+ is($w--, 1);
+}
+
}

Any reason for $w-- right at the end, instead of just $w?

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Jan 5, 2010

From zefram@fysh.org

Nicholas Clark wrote​:

Any reason for $w-- right at the end, instead of just $w?

Bah. No, that's an editing mistake.

-zefram

@p5pRT
Copy link
Author

p5pRT commented Jan 5, 2010

From @ikegami

On Tue, Jan 5, 2010 at 5​:18 PM, Zefram <zefram@​fysh.org> wrote​:

Attached patch should fix bug #62646.

The patch doesn't change mg.c. Does that mean it doesn't fix lvalue subtr?

@p5pRT
Copy link
Author

p5pRT commented Jan 5, 2010

From zefram@fysh.org

Eric Brine wrote​:

The patch doesn't change mg.c. Does that mean it doesn't fix lvalue subtr?

Oh, missed that. And upon looking further, it seems that #62646 has
widened in scope to consider all other string ops, which I didn't look at.
Jesse, as a 5.12 blocker, are you concerned just with substr, or with
all string ops?

-zefram

@p5pRT
Copy link
Author

p5pRT commented Jan 8, 2010

From @obra

On Tue 5.Jan'10 at 23​:16​:53 +0000, Zefram wrote​:

Eric Brine wrote​:

The patch doesn't change mg.c. Does that mean it doesn't fix lvalue subtr?

Oh, missed that. And upon looking further, it seems that #62646 has
widened in scope to consider all other string ops, which I didn't look at.
Jesse, as a 5.12 blocker, are you concerned just with substr, or with
all string ops?

Actually, it looks like this was nicholas' blocker.

I'd certainly rather a partial solution than no solution, so long as
we're not making a full fix harder.

-zefram

@p5pRT
Copy link
Author

p5pRT commented Jan 15, 2010

From @rgarcia

2010/1/5 Zefram <zefram@​fysh.org>​:

Attached patch should fix bug #62646.

Thanks, applied to bleadperl as
b6d1426. I won't mark this bug fixed,
since it doesn't apply to lvalue substr, and since probably other
string functions have the limitation; however we may consider taking
it off the 5.12-blockers list.

@p5pRT
Copy link
Author

p5pRT commented Jan 15, 2010

From @nwc10

On Fri, Jan 15, 2010 at 05​:17​:17PM +0100, Rafael Garcia-Suarez wrote​:

2010/1/5 Zefram <zefram@​fysh.org>​:

Attached patch should fix bug #62646.

Thanks, applied to bleadperl as
b6d1426. I won't mark this bug fixed,
since it doesn't apply to lvalue substr, and since probably other
string functions have the limitation; however we may consider taking
it off the 5.12-blockers list.

I feel that the lvalue substr bug should remain on the blockers list, for now.

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Feb 11, 2010

From @obra

Can be summarized as​:

$ ./perl -le 'print substr "abcd", 0, $_ for 2**31-1, 2**31, 2**32'

abcd

abc

It's one of only three known release blockers left.

@p5pRT
Copy link
Author

p5pRT commented Feb 11, 2010

From @ikegami

On Thu, Feb 11, 2010 at 3​:04 PM, Jesse Vincent <jesse@​fsck.com> wrote​:

Can be summarized as​:

$ ./perl -le 'print substr "abcd", 0, $_ for 2**31-1, 2**31, 2**32'

abcd

abc

It's one of only three known release blockers left.

Zephram did the non-lvalue portion. I offered to do the lvalue portion for
him, but got no reply. The fix is straightfoward and consists of changing
some I32 to IV in pp*.c files (done?), in SV_PVLV, and in mg.c

@p5pRT
Copy link
Author

p5pRT commented Feb 11, 2010

From @obra

On Thu, Feb 11, 2010 at 06​:10​:52PM -0500, Eric Brine wrote​:

Zephram did the non-lvalue portion. I offered to do the lvalue portion for
him, but got no reply. The fix is straightfoward and consists of changing
some I32 to IV in pp*.c files (done?), in SV_PVLV, and in mg.c

Ooh. I'm sorry I missed your offer. If you're still game, that would be
great.

Best,
Jesse

@p5pRT
Copy link
Author

p5pRT commented Feb 12, 2010

From @ikegami

On Thu, Feb 11, 2010 at 6​:58 PM, jesse <jesse@​fsck.com> wrote​:

Ooh. I'm sorry I missed your offer.

I had sent the offer directly to him

If you're still game, that would be great.

Already started. One problem (Craig noticed) is that sv_pos_u2b only works
with 32bit string lengths. I'm not sure how to fix that.

@p5pRT
Copy link
Author

p5pRT commented Feb 12, 2010

From @ikegami

On Thu, Feb 11, 2010 at 7​:18 PM, Eric Brine <ikegami@​adaelis.com> wrote​:

If you're still game, that would be great.

Already started. One problem (Craig noticed) is that sv_pos_u2b only works
with 32bit string lengths. I'm not sure how to fix that.

Turns out sv_pos_u2b uses STRLEN internally even though it presents I32 for
its arguments, so the changes needed were minor. All I did was to provide
another interface. I called it sv_pos_u2b_proper, but you'll probably want
to change that.

I think some more improvements relating to types can be made to pp_substr,
but it's in code outside the mandate of the bug.

I wish I could do some testing with very large strings (2**31), but I don't
have access to a system that can handle that.

Patch attached.

- ELB

@p5pRT
Copy link
Author

p5pRT commented Feb 12, 2010

From @ikegami

0001-Remove-32-bit-limit-on-substr-arguments.patch
From 946c161373a17c942c55dbe086606cbee7893a76 Mon Sep 17 00:00:00 2001
From: Eric Brine <ikegami@adaelis.com>
Date: Thu, 11 Feb 2010 20:28:29 -0500
Subject: [PATCH] Remove 32-bit limit on substr arguments

---
 embed.fnc     |    1 +
 embed.h       |    2 ++
 global.sym    |    1 +
 mg.c          |   19 +++++++++----------
 pp.c          |   24 +++++++++++++-----------
 proto.h       |    5 +++++
 sv.c          |   47 +++++++++++++++++++++++++++++++++++++++++------
 t/re/substr.t |   37 ++++++++++++++++++++++++++++++++++++-
 8 files changed, 108 insertions(+), 28 deletions(-)

diff --git a/embed.fnc b/embed.fnc
index 7463274..7e450aa 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -1165,6 +1165,7 @@ ApdR	|SV*	|sv_newmortal
 Apd	|SV*	|sv_newref	|NULLOK SV *const sv
 Ap	|char*	|sv_peek	|NULLOK SV* sv
 Apd	|void	|sv_pos_u2b	|NULLOK SV *const sv|NN I32 *const offsetp|NULLOK I32 *const lenp
+Apd	|void	|sv_pos_u2b_proper|NULLOK SV *const sv|NN STRLEN *const offsetp|NULLOK STRLEN *const lenp
 Apd	|void	|sv_pos_b2u	|NULLOK SV *const sv|NN I32 *const offsetp
 Amdb	|char*	|sv_pvn_force	|NN SV* sv|NULLOK STRLEN* lp
 Apd	|char*	|sv_pvutf8n_force|NN SV *const sv|NULLOK STRLEN *const lp
diff --git a/embed.h b/embed.h
index 246106b..1281fcc 100644
--- a/embed.h
+++ b/embed.h
@@ -967,6 +967,7 @@
 #define sv_newref		Perl_sv_newref
 #define sv_peek			Perl_sv_peek
 #define sv_pos_u2b		Perl_sv_pos_u2b
+#define sv_pos_u2b_proper	Perl_sv_pos_u2b_proper
 #define sv_pos_b2u		Perl_sv_pos_b2u
 #define sv_pvutf8n_force	Perl_sv_pvutf8n_force
 #define sv_pvbyten_force	Perl_sv_pvbyten_force
@@ -3371,6 +3372,7 @@
 #define sv_newref(a)		Perl_sv_newref(aTHX_ a)
 #define sv_peek(a)		Perl_sv_peek(aTHX_ a)
 #define sv_pos_u2b(a,b,c)	Perl_sv_pos_u2b(aTHX_ a,b,c)
+#define sv_pos_u2b_proper(a,b,c)	Perl_sv_pos_u2b_proper(aTHX_ a,b,c)
 #define sv_pos_b2u(a,b)		Perl_sv_pos_b2u(aTHX_ a,b)
 #define sv_pvutf8n_force(a,b)	Perl_sv_pvutf8n_force(aTHX_ a,b)
 #define sv_pvbyten_force(a,b)	Perl_sv_pvbyten_force(aTHX_ a,b)
diff --git a/global.sym b/global.sym
index f0361df..f0e462e 100644
--- a/global.sym
+++ b/global.sym
@@ -567,6 +567,7 @@ Perl_sv_newmortal
 Perl_sv_newref
 Perl_sv_peek
 Perl_sv_pos_u2b
+Perl_sv_pos_u2b_proper
 Perl_sv_pos_b2u
 Perl_sv_pvn_force
 Perl_sv_pvutf8n_force
diff --git a/mg.c b/mg.c
index b9a1464..4f8207c 100644
--- a/mg.c
+++ b/mg.c
@@ -2008,17 +2008,17 @@ Perl_magic_getsubstr(pTHX_ SV *sv, MAGIC *mg)
     STRLEN len;
     SV * const lsv = LvTARG(sv);
     const char * const tmps = SvPV_const(lsv,len);
-    I32 offs = LvTARGOFF(sv);
-    I32 rem = LvTARGLEN(sv);
+    STRLEN offs = LvTARGOFF(sv);
+    STRLEN rem = LvTARGLEN(sv);
 
     PERL_ARGS_ASSERT_MAGIC_GETSUBSTR;
     PERL_UNUSED_ARG(mg);
 
     if (SvUTF8(lsv))
-	sv_pos_u2b(lsv, &offs, &rem);
-    if (offs > (I32)len)
+	sv_pos_u2b_proper(lsv, &offs, &rem);
+    if (offs > len)
 	offs = len;
-    if (rem + offs > (I32)len)
+    if (rem > len - offs)
 	rem = len - offs;
     sv_setpvn(sv, tmps + offs, (STRLEN)rem);
     if (SvUTF8(lsv))
@@ -2033,22 +2033,22 @@ Perl_magic_setsubstr(pTHX_ SV *sv, MAGIC *mg)
     STRLEN len;
     const char * const tmps = SvPV_const(sv, len);
     SV * const lsv = LvTARG(sv);
-    I32 lvoff = LvTARGOFF(sv);
-    I32 lvlen = LvTARGLEN(sv);
+    STRLEN lvoff = LvTARGOFF(sv);
+    STRLEN lvlen = LvTARGLEN(sv);
 
     PERL_ARGS_ASSERT_MAGIC_SETSUBSTR;
     PERL_UNUSED_ARG(mg);
 
     if (DO_UTF8(sv)) {
 	sv_utf8_upgrade(lsv);
- 	sv_pos_u2b(lsv, &lvoff, &lvlen);
+	sv_pos_u2b_proper(lsv, &lvoff, &lvlen);
 	sv_insert(lsv, lvoff, lvlen, tmps, len);
 	LvTARGLEN(sv) = sv_len_utf8(sv);
 	SvUTF8_on(lsv);
     }
     else if (lsv && SvUTF8(lsv)) {
 	const char *utf8;
-	sv_pos_u2b(lsv, &lvoff, &lvlen);
+	sv_pos_u2b_proper(lsv, &lvoff, &lvlen);
 	LvTARGLEN(sv) = len;
 	utf8 = (char*)bytes_to_utf8((U8*)tmps, &len);
 	sv_insert(lsv, lvoff, lvlen, utf8, len);
@@ -2059,7 +2059,6 @@ Perl_magic_setsubstr(pTHX_ SV *sv, MAGIC *mg)
 	LvTARGLEN(sv) = len;
     }
 
-
     return 0;
 }
 
diff --git a/pp.c b/pp.c
index 2f4703b..d78d712 100644
--- a/pp.c
+++ b/pp.c
@@ -3079,12 +3079,12 @@ PP(pp_substr)
 {
     dVAR; dSP; dTARGET;
     SV *sv;
-    I32 len = 0;
+    IV len = 0;
     STRLEN curlen;
     STRLEN utf8_curlen;
-    I32 pos;
-    I32 rem;
-    I32 fail;
+    IV pos;
+    IV rem;
+    IV fail;
     const I32 lvalue = PL_op->op_flags & OPf_MOD || LVRET;
     const char *tmps;
     const I32 arybase = CopARYBASE_get(PL_curcop);
@@ -3147,7 +3147,7 @@ PP(pp_substr)
 	    rem = curlen;
 	else if (len >= 0) {
 	    rem = pos+len;
-	    if (rem > (I32)curlen)
+	    if (rem > (IV)curlen)
 		rem = curlen;
 	}
 	else {
@@ -3167,11 +3167,13 @@ PP(pp_substr)
 	RETPUSHUNDEF;
     }
     else {
-	const I32 upos = pos;
-	const I32 urem = rem;
+	const STRLEN upos = pos;
+	const STRLEN urem = rem;
+	STRLEN bpos = pos;
+	STRLEN brem = rem;
 	if (utf8_curlen)
-	    sv_pos_u2b(sv, (I32 *)&pos, (I32 *)&rem);
-	tmps += pos;
+	    sv_pos_u2b_proper(sv, &bpos, &brem);
+	tmps += bpos;
 	/* we either return a PV or an LV. If the TARG hasn't been used
 	 * before, or is of that type, reuse it; otherwise use a mortal
 	 * instead. Note that LVs can have an extended lifetime, so also
@@ -3185,7 +3187,7 @@ PP(pp_substr)
 	    }
 	}
 
-	sv_setpvn(TARG, tmps, rem);
+	sv_setpvn(TARG, tmps, brem);
 #ifdef USE_LOCALE_COLLATE
 	sv_unmagic(TARG, PERL_MAGIC_collxfrm);
 #endif
@@ -3202,7 +3204,7 @@ PP(pp_substr)
 	    }
 	    if (!SvOK(sv))
 		sv_setpvs(sv, "");
-	    sv_insert_flags(sv, pos, rem, repl, repl_len, 0);
+	    sv_insert_flags(sv, bpos, brem, repl, repl_len, 0);
 	    if (repl_is_utf8)
 		SvUTF8_on(sv);
 	    SvREFCNT_dec(repl_sv_copy);
diff --git a/proto.h b/proto.h
index 4a343be..ae48597 100644
--- a/proto.h
+++ b/proto.h
@@ -3374,6 +3374,11 @@ PERL_CALLCONV void	Perl_sv_pos_u2b(pTHX_ SV *const sv, I32 *const offsetp, I32 *
 #define PERL_ARGS_ASSERT_SV_POS_U2B	\
 	assert(offsetp)
 
+PERL_CALLCONV void	Perl_sv_pos_u2b_proper(pTHX_ SV *const sv, STRLEN *const offsetp, STRLEN *const lenp)
+			__attribute__nonnull__(pTHX_2);
+#define PERL_ARGS_ASSERT_SV_POS_U2B_PROPER	\
+	assert(offsetp)
+
 PERL_CALLCONV void	Perl_sv_pos_b2u(pTHX_ SV *const sv, I32 *const offsetp)
 			__attribute__nonnull__(pTHX_2);
 #define PERL_ARGS_ASSERT_SV_POS_B2U	\
diff --git a/sv.c b/sv.c
index 4ab41f6..02be580 100644
--- a/sv.c
+++ b/sv.c
@@ -6240,7 +6240,7 @@ S_sv_pos_u2b_cached(pTHX_ SV *const sv, MAGIC **const mgp, const U8 *const start
 
 
 /*
-=for apidoc sv_pos_u2b
+=for apidoc sv_pos_u2b_proper
 
 Converts the value pointed to by offsetp from a count of UTF-8 chars from
 the start of the string, to a count of the equivalent number of bytes; if
@@ -6252,14 +6252,14 @@ type coercion.
 */
 
 /*
- * sv_pos_u2b() uses, like sv_pos_b2u(), the mg_ptr of the potential
+ * sv_pos_u2b_proper() uses, like sv_pos_b2u(), the mg_ptr of the potential
  * PERL_MAGIC_utf8 of the sv to store the mapping between UTF-8 and
  * byte offsets.  See also the comments of S_utf8_mg_pos_cache_update().
  *
  */
 
 void
-Perl_sv_pos_u2b(pTHX_ register SV *const sv, I32 *const offsetp, I32 *const lenp)
+Perl_sv_pos_u2b_proper(pTHX_ register SV *const sv, STRLEN *const offsetp, STRLEN *const lenp)
 {
     const U8 *start;
     STRLEN len;
@@ -6271,17 +6271,17 @@ Perl_sv_pos_u2b(pTHX_ register SV *const sv, I32 *const offsetp, I32 *const lenp
 
     start = (U8*)SvPV_const(sv, len);
     if (len) {
-	STRLEN uoffset = (STRLEN) *offsetp;
+	STRLEN uoffset = *offsetp;
 	const U8 * const send = start + len;
 	MAGIC *mg = NULL;
 	const STRLEN boffset = sv_pos_u2b_cached(sv, &mg, start, send,
 					     uoffset, 0, 0);
 
-	*offsetp = (I32) boffset;
+	*offsetp = boffset;
 
 	if (lenp) {
 	    /* Convert the relative offset to absolute.  */
-	    const STRLEN uoffset2 = uoffset + (STRLEN) *lenp;
+	    const STRLEN uoffset2 = uoffset + *lenp;
 	    const STRLEN boffset2
 		= sv_pos_u2b_cached(sv, &mg, start, send, uoffset2,
 				      uoffset, boffset) - boffset;
@@ -6298,6 +6298,41 @@ Perl_sv_pos_u2b(pTHX_ register SV *const sv, I32 *const offsetp, I32 *const lenp
     return;
 }
 
+/*
+=for apidoc sv_pos_u2b
+
+Converts the value pointed to by offsetp from a count of UTF-8 chars from
+the start of the string, to a count of the equivalent number of bytes; if
+lenp is non-zero, it does the same to lenp, but this time starting from
+the offset, rather than from the start of the string. Handles magic and
+type coercion.
+
+=cut
+*/
+
+/*
+ * sv_pos_u2b() uses, like sv_pos_b2u(), the mg_ptr of the potential
+ * PERL_MAGIC_utf8 of the sv to store the mapping between UTF-8 and
+ * byte offsets.  See also the comments of S_utf8_mg_pos_cache_update().
+ *
+ */
+
+/* This function is subject to size and sign problems */
+
+void
+Perl_sv_pos_u2b(pTHX_ register SV *const sv, I32 *const offsetp, I32 *const lenp)
+{
+    STRLEN uoffset = (STRLEN)*offsetp;
+    if (lenp) {
+	STRLEN ulen = (STRLEN)*lenp;
+	sv_pos_u2b_proper(sv, &uoffset, &ulen);
+	*lenp = (I32)ulen;
+    } else {
+	sv_pos_u2b_proper(sv, &uoffset, NULL);
+    }
+    *offsetp = (I32)uoffset;
+}
+
 /* Create and update the UTF8 magic offset cache, with the proffered utf8/
    byte length pairing. The (byte) length of the total SV is passed in too,
    as blen, because for some (more esoteric) SVs, the call to SvPV_const()
diff --git a/t/re/substr.t b/t/re/substr.t
index c3fa6e1..c94e376 100644
--- a/t/re/substr.t
+++ b/t/re/substr.t
@@ -24,7 +24,7 @@ $SIG{__WARN__} = sub {
 
 require './test.pl';
 
-plan(334);
+plan(358);
 
 run_tests() unless caller;
 
@@ -682,4 +682,39 @@ is($x, "\x{100}\x{200}\xFFb");
     is(substr($a,1,1), 'b');
 }
 
+# [perl #62646] offsets exceeding 32 bits on 64-bit system
+SKIP: {
+    skip("32-bit system", 24) unless ~0 > 0xffffffff;
+    my $a = "abc";
+    my $s;
+    my $r;
+
+    utf8::downgrade($a);
+    for (1..2) {
+	$w = 0;
+	$r = substr($a, 0xffffffff, 1);
+	is($r, undef);
+	is($w, 1);
+
+	$w = 0;
+	$r = substr($a, 0xffffffff+1, 1);
+	is($r, undef);
+	is($w, 1);
+
+	$w = 0;
+	ok( !eval { $r = substr($s=$a, 0xffffffff, 1, "_"); 1 } );
+	is($r, undef);
+	is($s, $a);
+	is($w, 0);
+
+	$w = 0;
+	ok( !eval { $r = substr($s=$a, 0xffffffff+1, 1, "_"); 1 } );
+	is($r, undef);
+	is($s, $a);
+	is($w, 0);
+
+	utf8::upgrade($a);
+    }
+}
+
 }
-- 
1.6.5.2

@p5pRT
Copy link
Author

p5pRT commented Feb 12, 2010

From @nwc10

On Thu, Feb 11, 2010 at 07​:18​:11PM -0500, Eric Brine wrote​:

On Thu, Feb 11, 2010 at 6​:58 PM, jesse <jesse@​fsck.com> wrote​:

Ooh. I'm sorry I missed your offer.

I had sent the offer directly to him

If you're still game, that would be great.

Already started. One problem (Craig noticed) is that sv_pos_u2b only works
with 32bit string lengths. I'm not sure how to fix that.

I think​:

1​: give it (well, all the public functions, as sv_pod_b2u will need it too)
  a "new" name, and fix it to take STRLEN sized arguments.
2​: Make sv_pos_u2b a wrapper around the fixed version

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Feb 12, 2010

From @ikegami

On Fri, Feb 12, 2010 at 2​:14 AM, Nicholas Clark <nick@​ccl4.org> wrote​:

On Thu, Feb 11, 2010 at 07​:18​:11PM -0500, Eric Brine wrote​:

On Thu, Feb 11, 2010 at 6​:58 PM, jesse <jesse@​fsck.com> wrote​:

Ooh. I'm sorry I missed your offer.

I had sent the offer directly to him

If you're still game, that would be great.

Already started. One problem (Craig noticed) is that sv_pos_u2b only
works
with 32bit string lengths. I'm not sure how to fix that.

I think​:

1​: give it (well, all the public functions, as sv_pod_b2u will need it too)
a "new" name, and fix it to take STRLEN sized arguments.
2​: Make sv_pos_u2b a wrapper around the fixed version

That's exactly what I did. (See the patch I posted.) I couldn't come up with
decent name for the function. It's called sv_pos_u2b_proper in the patch,
but that's easily changeable.

@p5pRT
Copy link
Author

p5pRT commented Feb 12, 2010

From zefram@fysh.org

Eric Brine wrote​:

Zephram did the non-lvalue portion. I offered to do the lvalue portion for
him, but got no reply.

Yeah, sorry. I was intending to work on this, but for the past month or
so I've been occupied with fixing Devel​::Declare (which was yet another
victim of my sub lookup patch) and a motherboard failure.

My patch for non-lvalue substr turned out to be faulty. The willingness
to handle 64-bit offsets and lengths needs to go a bit deeper than
I implemented. For the case of 64-bit offsets when the address space
(and therefore size_t) is 32-bit, shims will be required before string
operations can be handed off to standard library functions. I fear that
the ramifications of trying to do this correctly are quite extensive.

-zefram

@p5pRT
Copy link
Author

p5pRT commented Feb 12, 2010

From @nwc10

On Fri, Feb 12, 2010 at 10​:53​:27AM +0000, Zefram wrote​:

Eric Brine wrote​:

Zephram did the non-lvalue portion. I offered to do the lvalue portion for
him, but got no reply.

Yeah, sorry. I was intending to work on this, but for the past month or
so I've been occupied with fixing Devel​::Declare (which was yet another
victim of my sub lookup patch) and a motherboard failure.

My patch for non-lvalue substr turned out to be faulty. The willingness
to handle 64-bit offsets and lengths needs to go a bit deeper than
I implemented. For the case of 64-bit offsets when the address space
(and therefore size_t) is 32-bit, shims will be required before string
operations can be handed off to standard library functions. I fear that
the ramifications of trying to do this correctly are quite extensive.

In which case I think that the patch as Eric proposed is technically wrong.
It uses IVs, which *can* be 64 bit when addresses are 32 bit.
Whereas what we need is a (signed) STRLEN type which is the same size as
pointers, so that it's not possible for it to hold an offset larger than
memory.

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Feb 12, 2010

From zefram@fysh.org

Nicholas Clark wrote​:

Whereas what we need is a (signed) STRLEN type which is the same size as
pointers,

That's ssize_t, in standard C, just as the unsigned STRLEN is size_t.

When IV is larger than ssize_t, you need some logic to handle offsets
and lengths that don't fit into ssize_t. This is part of the job of
the notional shim layer.

-zefram

@p5pRT
Copy link
Author

p5pRT commented Feb 12, 2010

From @nwc10

On Fri, Feb 12, 2010 at 11​:12​:43AM +0000, Zefram wrote​:

Nicholas Clark wrote​:

Whereas what we need is a (signed) STRLEN type which is the same size as
pointers,

That's ssize_t, in standard C, just as the unsigned STRLEN is size_t.

config.h has the former available as SSize_t

When IV is larger than ssize_t, you need some logic to handle offsets
and lengths that don't fit into ssize_t. This is part of the job of
the notional shim layer.

Are you envisaging a shim layer that merely copes with passing pointers to
different length integers, or one that also range checks the values going in
and out, for the case where the value overflows the smaller integer?

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Feb 12, 2010

From zefram@fysh.org

Nicholas Clark wrote​:

Are you envisaging a shim layer that merely copes with passing pointers to
different length integers, or one that also range checks the values going in
and out, for the case where the value overflows the smaller integer?

Correctly handling the overflows is essential to the bug we're dealing
with. Exactly where they need to be handled depends on what internal
interfaces turn out to be useful.

-zefram

@p5pRT
Copy link
Author

p5pRT commented Feb 12, 2010

From @ikegami

On Fri, Feb 12, 2010 at 6​:00 AM, Nicholas Clark <nick@​ccl4.org> wrote​:

In which case I think that the patch as Eric proposed is technically wrong.

I did make a comment about that, but it didn't go in nearly enough details
and my conclusion was wrong. The conversion from IV to STRLEN occurs in each
of these lines​:

  const STRLEN upos = pos;
  const STRLEN urem = rem;
  STRLEN bpos = pos;
  STRLEN brem = rem;

The code can be adjusted so rem is never anything but a STRLEN, and a check
can be added to make sure pos is in range.

In the same code lurks another bug​: On a 32-bit system, not all positions of
the string can be accessed. Large positive positions are treated as
negatives.

I'll address these today.

@p5pRT
Copy link
Author

p5pRT commented Feb 12, 2010

From @nwc10

On Fri, Feb 12, 2010 at 07​:14​:44AM +0000, Nicholas Clark wrote​:

On Thu, Feb 11, 2010 at 07​:18​:11PM -0500, Eric Brine wrote​:

On Thu, Feb 11, 2010 at 6​:58 PM, jesse <jesse@​fsck.com> wrote​:

Ooh. I'm sorry I missed your offer.

I had sent the offer directly to him

If you're still game, that would be great.

Already started. One problem (Craig noticed) is that sv_pos_u2b only works
with 32bit string lengths. I'm not sure how to fix that.

I think​:

1​: give it (well, all the public functions, as sv_pod_b2u will need it too)
a "new" name, and fix it to take STRLEN sized arguments.
2​: Make sv_pos_u2b a wrapper around the fixed version

For reference, in my unpacked CPAN​:

ack --sort -l '\b(?​:Perl_)?sv_pos_(?​:u2b|b2u)\b(?!\|)'

gave​:

Convert-Binary-C/tests/include/perlinc/embed.h
Convert-Binary-C/tests/include/perlinc/proto.h
Devel-PPPort/PPPort.pm
Padre/share/languages/perl5/perlapi_current.yml
Perl-APIReference/lib/Perl/APIReference/V5_008_000.pm
Perl-APIReference/lib/Perl/APIReference/V5_008_001.pm
Perl-APIReference/lib/Perl/APIReference/V5_008_002.pm
Perl-APIReference/lib/Perl/APIReference/V5_008_003.pm
Perl-APIReference/lib/Perl/APIReference/V5_008_004.pm
Perl-APIReference/lib/Perl/APIReference/V5_008_005.pm
Perl-APIReference/lib/Perl/APIReference/V5_008_006.pm
Perl-APIReference/lib/Perl/APIReference/V5_008_007.pm
Perl-APIReference/lib/Perl/APIReference/V5_008_008.pm
Perl-APIReference/lib/Perl/APIReference/V5_008_009.pm
Perl-APIReference/lib/Perl/APIReference/V5_010_000.pm
Perl-APIReference/lib/Perl/APIReference/V5_010_001.pm
Perl-APIReference/lib/Perl/APIReference/V5_011_000.pm
Perl-APIReference/lib/Perl/APIReference/V5_011_001.pm
Perl-APIReference/lib/Perl/APIReference/V5_011_002.pm
perl/embed.h
perl/mg.c
perl/pp.c
perl/pp_ctl.c
perl/pp_sys.c
perl/proto.h
perl/sv.c

[I think that I need to investigate something a bit more sophisticated than
a brute force ack, but I would like full perl regexps for my search]

Google codesearch doesn't show any users either of either, outside the core.

Looks like we can convert the I32 versions to wrappers, deprecated them,
and subsequently remove them, without actually affecting anyone.

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Feb 13, 2010

From @ikegami

Take 2.

substr() must accept both UVs and IVs for its $pos and $len arguments in
order to ensure any part of the string can be extracted, and to allow
negative indexing. (Even then, they are limits on the range of negative
indexing.) This patches provides that range.

It also fixes​:

$ perl -wle'$[=2; print substr("abcdefghij", 1);'
panic​: sv_setpvn called with negative strlen at -e line 1.

@p5pRT
Copy link
Author

p5pRT commented Feb 13, 2010

From @ikegami

0001-Removes-32-bit-limit-on-substr-arguments.patch
From 11857816569bd87acaa6fb52a97c45445113e82c Mon Sep 17 00:00:00 2001
From: Eric Brine <ikegami@adaelis.com>
Date: Thu, 11 Feb 2010 20:28:29 -0500
Subject: [PATCH] Removes 32-bit limit on substr arguments. The full range of IV and UV is available for the pos and len arguments, with safe conversion to STRLEN where it's smaller than an IV.

---
 embed.fnc     |    1 +
 embed.h       |    2 +
 global.sym    |    1 +
 mg.c          |   19 ++++----
 pp.c          |  143 +++++++++++++++++++++++++++++++++++++--------------------
 proto.h       |    5 ++
 sv.c          |   47 ++++++++++++++++--
 t/re/substr.t |   42 ++++++++++++++++-
 8 files changed, 193 insertions(+), 67 deletions(-)

diff --git a/embed.fnc b/embed.fnc
index 7463274..7e450aa 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -1165,6 +1165,7 @@ ApdR	|SV*	|sv_newmortal
 Apd	|SV*	|sv_newref	|NULLOK SV *const sv
 Ap	|char*	|sv_peek	|NULLOK SV* sv
 Apd	|void	|sv_pos_u2b	|NULLOK SV *const sv|NN I32 *const offsetp|NULLOK I32 *const lenp
+Apd	|void	|sv_pos_u2b_proper|NULLOK SV *const sv|NN STRLEN *const offsetp|NULLOK STRLEN *const lenp
 Apd	|void	|sv_pos_b2u	|NULLOK SV *const sv|NN I32 *const offsetp
 Amdb	|char*	|sv_pvn_force	|NN SV* sv|NULLOK STRLEN* lp
 Apd	|char*	|sv_pvutf8n_force|NN SV *const sv|NULLOK STRLEN *const lp
diff --git a/embed.h b/embed.h
index 246106b..1281fcc 100644
--- a/embed.h
+++ b/embed.h
@@ -967,6 +967,7 @@
 #define sv_newref		Perl_sv_newref
 #define sv_peek			Perl_sv_peek
 #define sv_pos_u2b		Perl_sv_pos_u2b
+#define sv_pos_u2b_proper	Perl_sv_pos_u2b_proper
 #define sv_pos_b2u		Perl_sv_pos_b2u
 #define sv_pvutf8n_force	Perl_sv_pvutf8n_force
 #define sv_pvbyten_force	Perl_sv_pvbyten_force
@@ -3371,6 +3372,7 @@
 #define sv_newref(a)		Perl_sv_newref(aTHX_ a)
 #define sv_peek(a)		Perl_sv_peek(aTHX_ a)
 #define sv_pos_u2b(a,b,c)	Perl_sv_pos_u2b(aTHX_ a,b,c)
+#define sv_pos_u2b_proper(a,b,c)	Perl_sv_pos_u2b_proper(aTHX_ a,b,c)
 #define sv_pos_b2u(a,b)		Perl_sv_pos_b2u(aTHX_ a,b)
 #define sv_pvutf8n_force(a,b)	Perl_sv_pvutf8n_force(aTHX_ a,b)
 #define sv_pvbyten_force(a,b)	Perl_sv_pvbyten_force(aTHX_ a,b)
diff --git a/global.sym b/global.sym
index f0361df..f0e462e 100644
--- a/global.sym
+++ b/global.sym
@@ -567,6 +567,7 @@ Perl_sv_newmortal
 Perl_sv_newref
 Perl_sv_peek
 Perl_sv_pos_u2b
+Perl_sv_pos_u2b_proper
 Perl_sv_pos_b2u
 Perl_sv_pvn_force
 Perl_sv_pvutf8n_force
diff --git a/mg.c b/mg.c
index b9a1464..4f8207c 100644
--- a/mg.c
+++ b/mg.c
@@ -2008,17 +2008,17 @@ Perl_magic_getsubstr(pTHX_ SV *sv, MAGIC *mg)
     STRLEN len;
     SV * const lsv = LvTARG(sv);
     const char * const tmps = SvPV_const(lsv,len);
-    I32 offs = LvTARGOFF(sv);
-    I32 rem = LvTARGLEN(sv);
+    STRLEN offs = LvTARGOFF(sv);
+    STRLEN rem = LvTARGLEN(sv);
 
     PERL_ARGS_ASSERT_MAGIC_GETSUBSTR;
     PERL_UNUSED_ARG(mg);
 
     if (SvUTF8(lsv))
-	sv_pos_u2b(lsv, &offs, &rem);
-    if (offs > (I32)len)
+	sv_pos_u2b_proper(lsv, &offs, &rem);
+    if (offs > len)
 	offs = len;
-    if (rem + offs > (I32)len)
+    if (rem > len - offs)
 	rem = len - offs;
     sv_setpvn(sv, tmps + offs, (STRLEN)rem);
     if (SvUTF8(lsv))
@@ -2033,22 +2033,22 @@ Perl_magic_setsubstr(pTHX_ SV *sv, MAGIC *mg)
     STRLEN len;
     const char * const tmps = SvPV_const(sv, len);
     SV * const lsv = LvTARG(sv);
-    I32 lvoff = LvTARGOFF(sv);
-    I32 lvlen = LvTARGLEN(sv);
+    STRLEN lvoff = LvTARGOFF(sv);
+    STRLEN lvlen = LvTARGLEN(sv);
 
     PERL_ARGS_ASSERT_MAGIC_SETSUBSTR;
     PERL_UNUSED_ARG(mg);
 
     if (DO_UTF8(sv)) {
 	sv_utf8_upgrade(lsv);
- 	sv_pos_u2b(lsv, &lvoff, &lvlen);
+	sv_pos_u2b_proper(lsv, &lvoff, &lvlen);
 	sv_insert(lsv, lvoff, lvlen, tmps, len);
 	LvTARGLEN(sv) = sv_len_utf8(sv);
 	SvUTF8_on(lsv);
     }
     else if (lsv && SvUTF8(lsv)) {
 	const char *utf8;
-	sv_pos_u2b(lsv, &lvoff, &lvlen);
+	sv_pos_u2b_proper(lsv, &lvoff, &lvlen);
 	LvTARGLEN(sv) = len;
 	utf8 = (char*)bytes_to_utf8((U8*)tmps, &len);
 	sv_insert(lsv, lvoff, lvlen, utf8, len);
@@ -2059,7 +2059,6 @@ Perl_magic_setsubstr(pTHX_ SV *sv, MAGIC *mg)
 	LvTARGLEN(sv) = len;
     }
 
-
     return 0;
 }
 
diff --git a/pp.c b/pp.c
index 2f4703b..95dc5fd 100644
--- a/pp.c
+++ b/pp.c
@@ -3079,15 +3079,19 @@ PP(pp_substr)
 {
     dVAR; dSP; dTARGET;
     SV *sv;
-    I32 len = 0;
     STRLEN curlen;
     STRLEN utf8_curlen;
-    I32 pos;
-    I32 rem;
-    I32 fail;
+    SV *   pos_sv;
+    IV     pos1_iv;
+    int    pos1_is_uv;
+    IV     pos2_iv;
+    int    pos2_is_uv;
+    SV *   len_sv;
+    IV     len_iv = 0;
+    int    len_is_uv = 1;
     const I32 lvalue = PL_op->op_flags & OPf_MOD || LVRET;
     const char *tmps;
-    const I32 arybase = CopARYBASE_get(PL_curcop);
+    const IV arybase = CopARYBASE_get(PL_curcop);
     SV *repl_sv = NULL;
     const char *repl = NULL;
     STRLEN repl_len;
@@ -3103,9 +3107,13 @@ PP(pp_substr)
 	    repl = SvPV_const(repl_sv, repl_len);
 	    repl_is_utf8 = DO_UTF8(repl_sv) && SvCUR(repl_sv);
 	}
-	len = POPi;
+	len_sv    = POPs;
+	len_iv    = SvIV(len_sv);
+	len_is_uv = SvIOK_UV(len_sv);
     }
-    pos = POPi;
+    pos_sv     = POPs;
+    pos1_iv    = SvIV(pos_sv);
+    pos1_is_uv = SvIOK_UV(pos_sv);
     sv = POPs;
     PUTBACK;
     if (repl_sv) {
@@ -3127,51 +3135,80 @@ PP(pp_substr)
     else
 	utf8_curlen = 0;
 
-    if (pos >= arybase) {
-	pos -= arybase;
-	rem = curlen-pos;
-	fail = rem;
-	if (num_args > 2) {
-	    if (len < 0) {
-		rem += len;
-		if (rem < 0)
-		    rem = 0;
-	    }
-	    else if (rem > len)
-		     rem = len;
+    if ( (pos1_is_uv && arybase < 0) || (pos1_iv >= arybase) ) { /* pos >= $[ */
+	UV pos1_uv = pos1_iv-arybase;
+	/* Overflow can occur when $[ < 0 */
+	if (arybase < 0 && pos1_uv < (UV)pos1_iv)
+	    goto BOUND_FAIL;
+	pos1_iv = pos1_uv;
+	pos1_is_uv = 1;
+    }
+    else if (pos1_is_uv ? (UV)pos1_iv > 0 : pos1_iv > 0) {
+	goto BOUND_FAIL;  /* $[=3; substr($_,2,...) */
+    }
+    else { /* pos < $[ */
+	if (pos1_iv == 0) { /* $[=1; substr($_,0,...) */
+	    pos1_iv = curlen;
+	    pos1_is_uv = 1;
+	} else {
+	    if (curlen) {
+		pos1_is_uv = curlen-1 > ~(UV)pos1_iv;
+		pos1_iv += curlen;
+	   }
 	}
     }
-    else {
-	pos += curlen;
-	if (num_args < 3)
-	    rem = curlen;
-	else if (len >= 0) {
-	    rem = pos+len;
-	    if (rem > (I32)curlen)
-		rem = curlen;
+    if (pos1_is_uv || pos1_iv > 0) {
+	if ((UV)pos1_iv > curlen)
+	    goto BOUND_FAIL;
+    }
+
+    if (num_args > 2) {
+	if (!len_is_uv && len_iv < 0) {
+	    pos2_iv = curlen + len_iv;
+	    if (curlen)
+		pos2_is_uv = curlen-1 > ~(UV)len_iv;
+	    else
+		pos2_is_uv = 0;
+	} else {  /* len_iv >= 0 */
+	    if (!pos1_is_uv && pos1_iv < 0) {
+		pos2_iv = pos1_iv + len_iv;
+		pos2_is_uv = (UV)len_iv > (UV)IV_MAX;
+	    } else {
+		if ((UV)len_iv > curlen-(UV)pos1_iv)
+		    pos2_iv = curlen;
+		else
+		    pos2_iv = pos1_iv+len_iv;
+		pos2_is_uv = 1;
+	    }
 	}
-	else {
-	    rem = curlen+len;
-	    if (rem < pos)
-		rem = pos;
-	}
-	if (pos < 0)
-	    pos = 0;
-	fail = rem;
-	rem -= pos;
-    }
-    if (fail < 0) {
-	if (lvalue || repl)
-	    Perl_croak(aTHX_ "substr outside of string");
-	Perl_ck_warner(aTHX_ packWARN(WARN_SUBSTR), "substr outside of string");
-	RETPUSHUNDEF;
     }
     else {
-	const I32 upos = pos;
-	const I32 urem = rem;
+	pos2_iv = curlen;
+	pos2_is_uv = 1;
+    }
+
+    if (!pos2_is_uv && pos2_iv < 0) {
+	if (!pos1_is_uv && pos1_iv < 0)
+	    goto BOUND_FAIL;
+	pos2_iv = 0;
+    }
+    else if (!pos1_is_uv && pos1_iv < 0)
+	pos1_iv = 0;
+
+    if ((UV)pos2_iv < (UV)pos1_iv)
+	pos2_iv = pos1_iv;
+    if ((UV)pos2_iv > curlen)
+	pos2_iv = curlen;
+
+    {
+	/* pos1_iv and pos2_iv both in 0..curlen, so the cast is safe */
+	const STRLEN pos = (STRLEN)( (UV)pos1_iv );
+	const STRLEN len = (STRLEN)( (UV)pos2_iv - (UV)pos1_iv );
+	STRLEN byte_pos = pos;
+	STRLEN byte_len = len;
 	if (utf8_curlen)
-	    sv_pos_u2b(sv, (I32 *)&pos, (I32 *)&rem);
-	tmps += pos;
+	    sv_pos_u2b_proper(sv, &byte_pos, &byte_len);
+	tmps += byte_pos;
 	/* we either return a PV or an LV. If the TARG hasn't been used
 	 * before, or is of that type, reuse it; otherwise use a mortal
 	 * instead. Note that LVs can have an extended lifetime, so also
@@ -3185,7 +3222,7 @@ PP(pp_substr)
 	    }
 	}
 
-	sv_setpvn(TARG, tmps, rem);
+	sv_setpvn(TARG, tmps, byte_len);
 #ifdef USE_LOCALE_COLLATE
 	sv_unmagic(TARG, PERL_MAGIC_collxfrm);
 #endif
@@ -3202,7 +3239,7 @@ PP(pp_substr)
 	    }
 	    if (!SvOK(sv))
 		sv_setpvs(sv, "");
-	    sv_insert_flags(sv, pos, rem, repl, repl_len, 0);
+	    sv_insert_flags(sv, byte_pos, byte_len, repl, repl_len, 0);
 	    if (repl_is_utf8)
 		SvUTF8_on(sv);
 	    SvREFCNT_dec(repl_sv_copy);
@@ -3232,13 +3269,19 @@ PP(pp_substr)
 		SvREFCNT_dec(LvTARG(TARG));
 		LvTARG(TARG) = SvREFCNT_inc_simple(sv);
 	    }
-	    LvTARGOFF(TARG) = upos;
-	    LvTARGLEN(TARG) = urem;
+	    LvTARGOFF(TARG) = pos;
+	    LvTARGLEN(TARG) = len;
 	}
     }
     SPAGAIN;
     PUSHs(TARG);		/* avoid SvSETMAGIC here */
     RETURN;
+
+BOUND_FAIL:
+    if (lvalue || repl)
+	Perl_croak(aTHX_ "substr outside of string");
+    Perl_ck_warner(aTHX_ packWARN(WARN_SUBSTR), "substr outside of string");
+    RETPUSHUNDEF;
 }
 
 PP(pp_vec)
diff --git a/proto.h b/proto.h
index 4a343be..ae48597 100644
--- a/proto.h
+++ b/proto.h
@@ -3374,6 +3374,11 @@ PERL_CALLCONV void	Perl_sv_pos_u2b(pTHX_ SV *const sv, I32 *const offsetp, I32 *
 #define PERL_ARGS_ASSERT_SV_POS_U2B	\
 	assert(offsetp)
 
+PERL_CALLCONV void	Perl_sv_pos_u2b_proper(pTHX_ SV *const sv, STRLEN *const offsetp, STRLEN *const lenp)
+			__attribute__nonnull__(pTHX_2);
+#define PERL_ARGS_ASSERT_SV_POS_U2B_PROPER	\
+	assert(offsetp)
+
 PERL_CALLCONV void	Perl_sv_pos_b2u(pTHX_ SV *const sv, I32 *const offsetp)
 			__attribute__nonnull__(pTHX_2);
 #define PERL_ARGS_ASSERT_SV_POS_B2U	\
diff --git a/sv.c b/sv.c
index 4ab41f6..02be580 100644
--- a/sv.c
+++ b/sv.c
@@ -6240,7 +6240,7 @@ S_sv_pos_u2b_cached(pTHX_ SV *const sv, MAGIC **const mgp, const U8 *const start
 
 
 /*
-=for apidoc sv_pos_u2b
+=for apidoc sv_pos_u2b_proper
 
 Converts the value pointed to by offsetp from a count of UTF-8 chars from
 the start of the string, to a count of the equivalent number of bytes; if
@@ -6252,14 +6252,14 @@ type coercion.
 */
 
 /*
- * sv_pos_u2b() uses, like sv_pos_b2u(), the mg_ptr of the potential
+ * sv_pos_u2b_proper() uses, like sv_pos_b2u(), the mg_ptr of the potential
  * PERL_MAGIC_utf8 of the sv to store the mapping between UTF-8 and
  * byte offsets.  See also the comments of S_utf8_mg_pos_cache_update().
  *
  */
 
 void
-Perl_sv_pos_u2b(pTHX_ register SV *const sv, I32 *const offsetp, I32 *const lenp)
+Perl_sv_pos_u2b_proper(pTHX_ register SV *const sv, STRLEN *const offsetp, STRLEN *const lenp)
 {
     const U8 *start;
     STRLEN len;
@@ -6271,17 +6271,17 @@ Perl_sv_pos_u2b(pTHX_ register SV *const sv, I32 *const offsetp, I32 *const lenp
 
     start = (U8*)SvPV_const(sv, len);
     if (len) {
-	STRLEN uoffset = (STRLEN) *offsetp;
+	STRLEN uoffset = *offsetp;
 	const U8 * const send = start + len;
 	MAGIC *mg = NULL;
 	const STRLEN boffset = sv_pos_u2b_cached(sv, &mg, start, send,
 					     uoffset, 0, 0);
 
-	*offsetp = (I32) boffset;
+	*offsetp = boffset;
 
 	if (lenp) {
 	    /* Convert the relative offset to absolute.  */
-	    const STRLEN uoffset2 = uoffset + (STRLEN) *lenp;
+	    const STRLEN uoffset2 = uoffset + *lenp;
 	    const STRLEN boffset2
 		= sv_pos_u2b_cached(sv, &mg, start, send, uoffset2,
 				      uoffset, boffset) - boffset;
@@ -6298,6 +6298,41 @@ Perl_sv_pos_u2b(pTHX_ register SV *const sv, I32 *const offsetp, I32 *const lenp
     return;
 }
 
+/*
+=for apidoc sv_pos_u2b
+
+Converts the value pointed to by offsetp from a count of UTF-8 chars from
+the start of the string, to a count of the equivalent number of bytes; if
+lenp is non-zero, it does the same to lenp, but this time starting from
+the offset, rather than from the start of the string. Handles magic and
+type coercion.
+
+=cut
+*/
+
+/*
+ * sv_pos_u2b() uses, like sv_pos_b2u(), the mg_ptr of the potential
+ * PERL_MAGIC_utf8 of the sv to store the mapping between UTF-8 and
+ * byte offsets.  See also the comments of S_utf8_mg_pos_cache_update().
+ *
+ */
+
+/* This function is subject to size and sign problems */
+
+void
+Perl_sv_pos_u2b(pTHX_ register SV *const sv, I32 *const offsetp, I32 *const lenp)
+{
+    STRLEN uoffset = (STRLEN)*offsetp;
+    if (lenp) {
+	STRLEN ulen = (STRLEN)*lenp;
+	sv_pos_u2b_proper(sv, &uoffset, &ulen);
+	*lenp = (I32)ulen;
+    } else {
+	sv_pos_u2b_proper(sv, &uoffset, NULL);
+    }
+    *offsetp = (I32)uoffset;
+}
+
 /* Create and update the UTF8 magic offset cache, with the proffered utf8/
    byte length pairing. The (byte) length of the total SV is passed in too,
    as blen, because for some (more esoteric) SVs, the call to SvPV_const()
diff --git a/t/re/substr.t b/t/re/substr.t
index c3fa6e1..d0717ba 100644
--- a/t/re/substr.t
+++ b/t/re/substr.t
@@ -24,7 +24,7 @@ $SIG{__WARN__} = sub {
 
 require './test.pl';
 
-plan(334);
+plan(360);
 
 run_tests() unless caller;
 
@@ -201,6 +201,11 @@ is($w--, 1);
 eval{substr($a,1) = "" ; };     # P=R=S Q
 like($@, $FATAL_MSG);
 
+$b = substr($a,-7,-6) ; # warn  # Q R P S
+is($w--, 1);
+eval{substr($a,-7,-6) = "" ; }; # Q R P S
+like($@, $FATAL_MSG);
+
 my $a = 'zxcvbnm';
 substr($a,2,0) = '';
 is($a, 'zxcvbnm');
@@ -682,4 +687,39 @@ is($x, "\x{100}\x{200}\xFFb");
     is(substr($a,1,1), 'b');
 }
 
+# [perl #62646] offsets exceeding 32 bits on 64-bit system
+SKIP: {
+    skip("32-bit system", 24) unless ~0 > 0xffffffff;
+    my $a = "abc";
+    my $s;
+    my $r;
+
+    utf8::downgrade($a);
+    for (1..2) {
+	$w = 0;
+	$r = substr($a, 0xffffffff, 1);
+	is($r, undef);
+	is($w, 1);
+
+	$w = 0;
+	$r = substr($a, 0xffffffff+1, 1);
+	is($r, undef);
+	is($w, 1);
+
+	$w = 0;
+	ok( !eval { $r = substr($s=$a, 0xffffffff, 1, "_"); 1 } );
+	is($r, undef);
+	is($s, $a);
+	is($w, 0);
+
+	$w = 0;
+	ok( !eval { $r = substr($s=$a, 0xffffffff+1, 1, "_"); 1 } );
+	is($r, undef);
+	is($s, $a);
+	is($w, 0);
+
+	utf8::upgrade($a);
+    }
+}
+
 }
-- 
1.6.5.2

@p5pRT
Copy link
Author

p5pRT commented Feb 13, 2010

From @nwc10

On Sat, Feb 13, 2010 at 12​:15​:56AM -0500, Eric Brine wrote​:

Take 2.

substr() must accept both UVs and IVs for its $pos and $len arguments in
order to ensure any part of the string can be extracted, and to allow
negative indexing. (Even then, they are limits on the range of negative
indexing.) This patches provides that range.

It also fixes​:

$ perl -wle'$[=2; print substr("abcdefghij", 1);'
panic​: sv_setpvn called with negative strlen at -e line 1.

Whilst I've looked at the patch, and don't see anything wrong with it, for
some reason I'm uneasy about applying it today. (Tomorrow should be fine).

Whilst this *isn't* technically the same bug, it is related to I32 abuse and
the UTF-8 caching code. It's rather embarrassing that something as simple as
length can return the wrong result, or panic​:

$ time ./perl -lwe 'open A, shift or die $!; read A, $a, (1<<32) + 4; chop $a; warn length $a; substr $a, 0x100000000, 1, chr 256; print ord substr $a, 0x100000000; print ord substr $a, 0xFFFFFFF0; warn length $a; warn length $a' /dev/zero
Hexadecimal number > 0xffffffff non-portable at -e line 1.
Hexadecimal number > 0xffffffff non-portable at -e line 1.
4294967299 at -e line 1.
256
panic​: sv_len_utf8 cache 3 real 4294967299 for Ä at -e line 1.

real 37m17.458s
user 2m56.791s
sys 1m11.734s

I propose the following work around, which disables storing a bad value in
the cache for the length in Unicode characters, if that value has wrapped​:

Inline Patch
diff --git a/sv.c b/sv.c
index 02be580..87fc348 100644
--- a/sv.c
+++ b/sv.c
@@ -6072,6 +6072,10 @@ Perl_sv_len_utf8(pTHX_ register SV *const sv)
                    }
                    assert(mg);
                    mg->mg_len = ulen;
+                   /* For now, treat "overflowed" as "still unknown".
+                      See RT #72784.  */
+                   if (ulen != (STRLEN) mg->mg_len)
+                       mg->mg_len = -1;
                }
            }
            return ulen;


It avoids the panic, and produces the right results (albeit slowly, but 10 times faster than the panic takes to end\)​:

warn length $a; substr $a, 0x100000000, 1, chr 256; print ord substr $a, 0x100000000; print ord substr $a, 0xFFFFFFF0; warn length $a; warn length $a' /dev/zero
Hexadecimal number > 0xffffffff non-portable at -e line 1.
Hexadecimal number > 0xffffffff non-portable at -e line 1.
4294967299 at -e line 1.
256
0
4294967299 at -e line 1.
4294967299 at -e line 1.

real 3m41.042s
user 3m36.332s
sys 0m4.668s

[I've compiled with -DDEBUGGING so I get debug mode enabled. If I turn it off​:

$ time ./perl -lwe '${^UTF8CACHE}=1; open A, shift or die $!; read A, $a, (1<<32) + 4; chop $a; warn length $a; substr $a, 0x100000000, 1, chr 256; print ord substr $a, 0x100000000; print ord substr $a, 0xFFFFFFF0; warn length $a; warn length $a' /dev/zero
Hexadecimal number > 0xffffffff non-portable at -e line 1.
Hexadecimal number > 0xffffffff non-portable at -e line 1.
4294967299 at -e line 1.
256
0
4294967299 at -e line 1.
4294967299 at -e line 1.

real 2m21.951s
user 2m17.326s
sys 0m4.579s

I don't know how many fewer linear scans of the string that time drop equates
to. One dromedary has the extra RAM installed it will become a lot less
painful to start to investigate these things. 6GB is not enough :-) ]

I've created a meta-ticket to group all the tickets related to I32 abuse,
and collated the remaining uses of Perl_sv_pos_b2u() and Perl_sv_pos_u2b()
as tickets under it​:

http​://rt.perl.org/rt3/Ticket/Display.html?id=72784

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Feb 16, 2010

From @nwc10

On Sun, Feb 14, 2010 at 02​:23​:37PM -0500, Eric Brine wrote​:

On Sun, Feb 14, 2010 at 11​:50 AM, Nicholas Clark <nick@​ccl4.org> wrote​:

So I renamed "_proper" to "_flags", and changed it to return the byte
conversion of uoffset​:

-void
-Perl_sv_pos_u2b_proper(pTHX_ register SV *const sv, STRLEN *const offsetp,
STRLEN *const lenp)

+STRLEN
+Perl_sv_pos_u2b_flags(pTHX_ SV *const sv, STRLEN uoffset, STRLEN *const
lenp,
+ U32 flags)

It looks like you haven't committed this yet. Before you do, can you please
move PERL_ARGS_ASSERT_SV_POS_U2B to Perl_sv_pos_u2b() and add
PERL_ARGS_ASSERT_SV_POS_U2B_FLAGS to Perl_sv_pos_u2b_flags()? Thanks

I had committed it, and fixed that part, and found a bug in the test that's
supposed to catch those​: http​://rt.perl.org/rt3/Ticket/Display.html?id=72800

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Feb 17, 2010

From @nwc10

Thanks for the report. Sorry for the delay in replying.

On Sat Sep 12 08​:52​:05 2009, perlbug@​plan9.de wrote​:

This is a bug report for perl from perlbug@​plan9.de,
generated with the help of perlbug 1.36 running under perl 5.10.0.

-----------------------------------------------------------------
[Please enter your report here]

I found that substr acts weirdly when confronted with larger than ~2gb
strings​:

warn length $data;
substr $data, 0, 256;

15701683970 at ...
substr outside of string at ...

My perl uses a uint64_t for STRLEN (standard amd64), so I would expect
substr to handle this, given that I pay the memory overhead to store
an
8-byte length everywhere :)

It seems that pp_substr uses I32 for everything, which is of course
not
enough.

A cursory glance over the rets of pp.c indicates that perl simply
can't
handle >2gb scalars, even on a 64 bit system, as it uses I32 almost
everywhere (pp_index, pp_reverse, about anything that deals with
string
offsets uses I32).

Yes. It's defective.

15GB seems like a lot, but I don't think thr wish to, say, load a DVD
image into memory is so far off in the future.

Maybe the safe way for now would be to disallow >31 bit scalar
lengths?

I don't think that there's actually an easy way to do that.

There's a patch in blead now specifically for substr. Jesse isn't
planning to fix anything else for 5.12.0, but I believe that many of the
others can be fixed without breaking binary compatibility, so hopefully
they will make it into 5.12.1 etc, not just 5.14.0

I made a new ticket to track tickets relating to the misuse of I32​:

http​://rt.perl.org/rt3/Ticket/Display.html?id=72784

(the intent being that we close this ticket as it's just tracking substr)

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Feb 18, 2010

From @nwc10

The substr bug is resolved with 777f7c5.
The other bugs are being tracked by ticket 72784

http​://rt.perl.org/rt3/Ticket/Display.html?id=72784

@p5pRT
Copy link
Author

p5pRT commented Feb 18, 2010

@nwc10 - Status changed from 'open' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant