New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Maximum string length with substr #9634
Comments
From skylar2@u.washington.eduCreated by skylar2@u.washington.edue've run into a problem where substr appears to have a maximum string === #!/usr/bin/perl use strict; my($file,$chars,$text,$length); open(FILE, "< $file") or $text = q{ }; # Make sure there's at least one character to run substr on close(FILE); $length = length($text); === This will create this output: === substr outside of string at /net/gs/vol1/home/skylar2/cfm/scripts/nick/big_substr.pl line 13. === Perl Info
|
From @nwc10On Thu, Jan 22, 2009 at 11:12:20AM -0800, skylar2@u.washington.edu (via RT) wrote:
Thanks for the bug report. I can replicate it somewhat more tersely: $ ./perl -lwe 'print "$_ gives " . substr ("x" x $_, 1, 1) for 2147483648, 2147483649' The problem seems to be this part of the implementation of substr: int PERL_ARGS_ASSERT_MAGIC_GETSUBSTR; if (SvUTF8(lsv)) which is using variables of type I32, which will be a signed 32 bit integer It's using I32 because the interface of sv_pos_u2b() is using I32 pointers, http://perl5.git.perl.org/perl.git/blame/fdf134946da249a71c49962435817212b8fa195a:/sv.c#l3236 I suspect we need to write a replacement that uses STRLEN pointers, and Nicholas Clark |
The RT System itself - Status changed from 'new' to 'open' |
From perlbug@plan9.deCreated by perlbug@plan9.deI found that substr acts weirdly when confronted with larger than ~2gb warn length $data; 15701683970 at ... My perl uses a uint64_t for STRLEN (standard amd64), so I would expect It seems that pp_substr uses I32 for everything, which is of course not A cursory glance over the rets of pp.c indicates that perl simply can't 15GB seems like a lot, but I don't think thr wish to, say, load a DVD Maybe the safe way for now would be to disallow >31 bit scalar lengths? Perl Info
|
From @fbriereThis is a bug report for perl from fbriere@fbriere.net, substr() is behaving strangely on values larger than an int: $ perl -le 'print substr "abcd", 0, $_ for 2**31-1, 2**31, 2**32' abc (Obviously, this was run on a 32-bit machine.) Flags: Site configuration information for perl 5.10.0: Configured by Debian Project at Sun Aug 16 22:37:28 UTC 2009. Summary of my perl5 (revision 5 version 10 subversion 0) configuration: Locally applied patches: @INC for perl 5.10.0: Environment for perl 5.10.0: |
From zefram@fysh.orgAttached patch should fix bug #62646. -zefram |
From zefram@fysh.orgInline Patchdiff --git a/pp.c b/pp.c
index c659b13..4735c94 100644
--- a/pp.c
+++ b/pp.c
@@ -3079,12 +3079,12 @@ PP(pp_substr)
{
dVAR; dSP; dTARGET;
SV *sv;
- I32 len = 0;
+ IV len = 0;
STRLEN curlen;
STRLEN utf8_curlen;
- I32 pos;
- I32 rem;
- I32 fail;
+ IV pos;
+ IV rem;
+ IV fail;
const I32 lvalue = PL_op->op_flags & OPf_MOD || LVRET;
const char *tmps;
const I32 arybase = CopARYBASE_get(PL_curcop);
@@ -3147,7 +3147,7 @@ PP(pp_substr)
rem = curlen;
else if (len >= 0) {
rem = pos+len;
- if (rem > (I32)curlen)
+ if (rem > (IV)curlen)
rem = curlen;
}
else {
@@ -3167,8 +3167,8 @@ PP(pp_substr)
RETPUSHUNDEF;
}
else {
- const I32 upos = pos;
- const I32 urem = rem;
+ const IV upos = pos;
+ const IV urem = rem;
if (utf8_curlen)
sv_pos_u2b(sv, &pos, &rem);
tmps += pos;
diff --git a/t/re/substr.t b/t/re/substr.t
index c3fa6e1..900c8f7 100644
--- a/t/re/substr.t
+++ b/t/re/substr.t
@@ -24,7 +24,7 @@ $SIG{__WARN__} = sub {
require './test.pl';
-plan(334);
+plan(338);
run_tests() unless caller;
@@ -682,4 +682,19 @@ is($x, "\x{100}\x{200}\xFFb");
is(substr($a,1,1), 'b');
}
+# [perl #62646] offsets exceeding 32 bits on 64-bit system
+SKIP: {
+ skip("32-bit system", 4) unless ~0 > 0xffffffff;
+ my $a = "abc";
+ my $r;
+ $w = 0;
+ $r = substr($a, 0xffffffff, 1);
+ is($r, undef);
+ is($w, 1);
+ $w = 0;
+ $r = substr($a, 0xffffffff+1, 1);
+ is($r, undef);
+ is($w--, 1);
+}
+
} |
From @nwc10On Tue, Jan 05, 2010 at 10:18:42PM +0000, Zefram wrote:
Any reason for $w-- right at the end, instead of just $w? Nicholas Clark |
From zefram@fysh.orgNicholas Clark wrote:
Bah. No, that's an editing mistake. -zefram |
From @ikegamiOn Tue, Jan 5, 2010 at 5:18 PM, Zefram <zefram@fysh.org> wrote:
The patch doesn't change mg.c. Does that mean it doesn't fix lvalue subtr? |
From zefram@fysh.orgEric Brine wrote:
Oh, missed that. And upon looking further, it seems that #62646 has -zefram |
From @obraOn Tue 5.Jan'10 at 23:16:53 +0000, Zefram wrote:
Actually, it looks like this was nicholas' blocker. I'd certainly rather a partial solution than no solution, so long as
|
From @rgarcia2010/1/5 Zefram <zefram@fysh.org>:
Thanks, applied to bleadperl as |
From @nwc10On Fri, Jan 15, 2010 at 05:17:17PM +0100, Rafael Garcia-Suarez wrote:
I feel that the lvalue substr bug should remain on the blockers list, for now. Nicholas Clark |
From @obraCan be summarized as: $ ./perl -le 'print substr "abcd", 0, $_ for 2**31-1, 2**31, 2**32' abcd abc It's one of only three known release blockers left. |
From @ikegamiOn Thu, Feb 11, 2010 at 3:04 PM, Jesse Vincent <jesse@fsck.com> wrote:
Zephram did the non-lvalue portion. I offered to do the lvalue portion for |
From @obraOn Thu, Feb 11, 2010 at 06:10:52PM -0500, Eric Brine wrote:
Ooh. I'm sorry I missed your offer. If you're still game, that would be Best, |
From @ikegamiOn Thu, Feb 11, 2010 at 6:58 PM, jesse <jesse@fsck.com> wrote:
I had sent the offer directly to him If you're still game, that would be great. Already started. One problem (Craig noticed) is that sv_pos_u2b only works |
From @ikegamiOn Thu, Feb 11, 2010 at 7:18 PM, Eric Brine <ikegami@adaelis.com> wrote:
Turns out sv_pos_u2b uses STRLEN internally even though it presents I32 for I think some more improvements relating to types can be made to pp_substr, I wish I could do some testing with very large strings (2**31), but I don't Patch attached. - ELB |
From @ikegami0001-Remove-32-bit-limit-on-substr-arguments.patchFrom 946c161373a17c942c55dbe086606cbee7893a76 Mon Sep 17 00:00:00 2001
From: Eric Brine <ikegami@adaelis.com>
Date: Thu, 11 Feb 2010 20:28:29 -0500
Subject: [PATCH] Remove 32-bit limit on substr arguments
---
embed.fnc | 1 +
embed.h | 2 ++
global.sym | 1 +
mg.c | 19 +++++++++----------
pp.c | 24 +++++++++++++-----------
proto.h | 5 +++++
sv.c | 47 +++++++++++++++++++++++++++++++++++++++++------
t/re/substr.t | 37 ++++++++++++++++++++++++++++++++++++-
8 files changed, 108 insertions(+), 28 deletions(-)
diff --git a/embed.fnc b/embed.fnc
index 7463274..7e450aa 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -1165,6 +1165,7 @@ ApdR |SV* |sv_newmortal
Apd |SV* |sv_newref |NULLOK SV *const sv
Ap |char* |sv_peek |NULLOK SV* sv
Apd |void |sv_pos_u2b |NULLOK SV *const sv|NN I32 *const offsetp|NULLOK I32 *const lenp
+Apd |void |sv_pos_u2b_proper|NULLOK SV *const sv|NN STRLEN *const offsetp|NULLOK STRLEN *const lenp
Apd |void |sv_pos_b2u |NULLOK SV *const sv|NN I32 *const offsetp
Amdb |char* |sv_pvn_force |NN SV* sv|NULLOK STRLEN* lp
Apd |char* |sv_pvutf8n_force|NN SV *const sv|NULLOK STRLEN *const lp
diff --git a/embed.h b/embed.h
index 246106b..1281fcc 100644
--- a/embed.h
+++ b/embed.h
@@ -967,6 +967,7 @@
#define sv_newref Perl_sv_newref
#define sv_peek Perl_sv_peek
#define sv_pos_u2b Perl_sv_pos_u2b
+#define sv_pos_u2b_proper Perl_sv_pos_u2b_proper
#define sv_pos_b2u Perl_sv_pos_b2u
#define sv_pvutf8n_force Perl_sv_pvutf8n_force
#define sv_pvbyten_force Perl_sv_pvbyten_force
@@ -3371,6 +3372,7 @@
#define sv_newref(a) Perl_sv_newref(aTHX_ a)
#define sv_peek(a) Perl_sv_peek(aTHX_ a)
#define sv_pos_u2b(a,b,c) Perl_sv_pos_u2b(aTHX_ a,b,c)
+#define sv_pos_u2b_proper(a,b,c) Perl_sv_pos_u2b_proper(aTHX_ a,b,c)
#define sv_pos_b2u(a,b) Perl_sv_pos_b2u(aTHX_ a,b)
#define sv_pvutf8n_force(a,b) Perl_sv_pvutf8n_force(aTHX_ a,b)
#define sv_pvbyten_force(a,b) Perl_sv_pvbyten_force(aTHX_ a,b)
diff --git a/global.sym b/global.sym
index f0361df..f0e462e 100644
--- a/global.sym
+++ b/global.sym
@@ -567,6 +567,7 @@ Perl_sv_newmortal
Perl_sv_newref
Perl_sv_peek
Perl_sv_pos_u2b
+Perl_sv_pos_u2b_proper
Perl_sv_pos_b2u
Perl_sv_pvn_force
Perl_sv_pvutf8n_force
diff --git a/mg.c b/mg.c
index b9a1464..4f8207c 100644
--- a/mg.c
+++ b/mg.c
@@ -2008,17 +2008,17 @@ Perl_magic_getsubstr(pTHX_ SV *sv, MAGIC *mg)
STRLEN len;
SV * const lsv = LvTARG(sv);
const char * const tmps = SvPV_const(lsv,len);
- I32 offs = LvTARGOFF(sv);
- I32 rem = LvTARGLEN(sv);
+ STRLEN offs = LvTARGOFF(sv);
+ STRLEN rem = LvTARGLEN(sv);
PERL_ARGS_ASSERT_MAGIC_GETSUBSTR;
PERL_UNUSED_ARG(mg);
if (SvUTF8(lsv))
- sv_pos_u2b(lsv, &offs, &rem);
- if (offs > (I32)len)
+ sv_pos_u2b_proper(lsv, &offs, &rem);
+ if (offs > len)
offs = len;
- if (rem + offs > (I32)len)
+ if (rem > len - offs)
rem = len - offs;
sv_setpvn(sv, tmps + offs, (STRLEN)rem);
if (SvUTF8(lsv))
@@ -2033,22 +2033,22 @@ Perl_magic_setsubstr(pTHX_ SV *sv, MAGIC *mg)
STRLEN len;
const char * const tmps = SvPV_const(sv, len);
SV * const lsv = LvTARG(sv);
- I32 lvoff = LvTARGOFF(sv);
- I32 lvlen = LvTARGLEN(sv);
+ STRLEN lvoff = LvTARGOFF(sv);
+ STRLEN lvlen = LvTARGLEN(sv);
PERL_ARGS_ASSERT_MAGIC_SETSUBSTR;
PERL_UNUSED_ARG(mg);
if (DO_UTF8(sv)) {
sv_utf8_upgrade(lsv);
- sv_pos_u2b(lsv, &lvoff, &lvlen);
+ sv_pos_u2b_proper(lsv, &lvoff, &lvlen);
sv_insert(lsv, lvoff, lvlen, tmps, len);
LvTARGLEN(sv) = sv_len_utf8(sv);
SvUTF8_on(lsv);
}
else if (lsv && SvUTF8(lsv)) {
const char *utf8;
- sv_pos_u2b(lsv, &lvoff, &lvlen);
+ sv_pos_u2b_proper(lsv, &lvoff, &lvlen);
LvTARGLEN(sv) = len;
utf8 = (char*)bytes_to_utf8((U8*)tmps, &len);
sv_insert(lsv, lvoff, lvlen, utf8, len);
@@ -2059,7 +2059,6 @@ Perl_magic_setsubstr(pTHX_ SV *sv, MAGIC *mg)
LvTARGLEN(sv) = len;
}
-
return 0;
}
diff --git a/pp.c b/pp.c
index 2f4703b..d78d712 100644
--- a/pp.c
+++ b/pp.c
@@ -3079,12 +3079,12 @@ PP(pp_substr)
{
dVAR; dSP; dTARGET;
SV *sv;
- I32 len = 0;
+ IV len = 0;
STRLEN curlen;
STRLEN utf8_curlen;
- I32 pos;
- I32 rem;
- I32 fail;
+ IV pos;
+ IV rem;
+ IV fail;
const I32 lvalue = PL_op->op_flags & OPf_MOD || LVRET;
const char *tmps;
const I32 arybase = CopARYBASE_get(PL_curcop);
@@ -3147,7 +3147,7 @@ PP(pp_substr)
rem = curlen;
else if (len >= 0) {
rem = pos+len;
- if (rem > (I32)curlen)
+ if (rem > (IV)curlen)
rem = curlen;
}
else {
@@ -3167,11 +3167,13 @@ PP(pp_substr)
RETPUSHUNDEF;
}
else {
- const I32 upos = pos;
- const I32 urem = rem;
+ const STRLEN upos = pos;
+ const STRLEN urem = rem;
+ STRLEN bpos = pos;
+ STRLEN brem = rem;
if (utf8_curlen)
- sv_pos_u2b(sv, (I32 *)&pos, (I32 *)&rem);
- tmps += pos;
+ sv_pos_u2b_proper(sv, &bpos, &brem);
+ tmps += bpos;
/* we either return a PV or an LV. If the TARG hasn't been used
* before, or is of that type, reuse it; otherwise use a mortal
* instead. Note that LVs can have an extended lifetime, so also
@@ -3185,7 +3187,7 @@ PP(pp_substr)
}
}
- sv_setpvn(TARG, tmps, rem);
+ sv_setpvn(TARG, tmps, brem);
#ifdef USE_LOCALE_COLLATE
sv_unmagic(TARG, PERL_MAGIC_collxfrm);
#endif
@@ -3202,7 +3204,7 @@ PP(pp_substr)
}
if (!SvOK(sv))
sv_setpvs(sv, "");
- sv_insert_flags(sv, pos, rem, repl, repl_len, 0);
+ sv_insert_flags(sv, bpos, brem, repl, repl_len, 0);
if (repl_is_utf8)
SvUTF8_on(sv);
SvREFCNT_dec(repl_sv_copy);
diff --git a/proto.h b/proto.h
index 4a343be..ae48597 100644
--- a/proto.h
+++ b/proto.h
@@ -3374,6 +3374,11 @@ PERL_CALLCONV void Perl_sv_pos_u2b(pTHX_ SV *const sv, I32 *const offsetp, I32 *
#define PERL_ARGS_ASSERT_SV_POS_U2B \
assert(offsetp)
+PERL_CALLCONV void Perl_sv_pos_u2b_proper(pTHX_ SV *const sv, STRLEN *const offsetp, STRLEN *const lenp)
+ __attribute__nonnull__(pTHX_2);
+#define PERL_ARGS_ASSERT_SV_POS_U2B_PROPER \
+ assert(offsetp)
+
PERL_CALLCONV void Perl_sv_pos_b2u(pTHX_ SV *const sv, I32 *const offsetp)
__attribute__nonnull__(pTHX_2);
#define PERL_ARGS_ASSERT_SV_POS_B2U \
diff --git a/sv.c b/sv.c
index 4ab41f6..02be580 100644
--- a/sv.c
+++ b/sv.c
@@ -6240,7 +6240,7 @@ S_sv_pos_u2b_cached(pTHX_ SV *const sv, MAGIC **const mgp, const U8 *const start
/*
-=for apidoc sv_pos_u2b
+=for apidoc sv_pos_u2b_proper
Converts the value pointed to by offsetp from a count of UTF-8 chars from
the start of the string, to a count of the equivalent number of bytes; if
@@ -6252,14 +6252,14 @@ type coercion.
*/
/*
- * sv_pos_u2b() uses, like sv_pos_b2u(), the mg_ptr of the potential
+ * sv_pos_u2b_proper() uses, like sv_pos_b2u(), the mg_ptr of the potential
* PERL_MAGIC_utf8 of the sv to store the mapping between UTF-8 and
* byte offsets. See also the comments of S_utf8_mg_pos_cache_update().
*
*/
void
-Perl_sv_pos_u2b(pTHX_ register SV *const sv, I32 *const offsetp, I32 *const lenp)
+Perl_sv_pos_u2b_proper(pTHX_ register SV *const sv, STRLEN *const offsetp, STRLEN *const lenp)
{
const U8 *start;
STRLEN len;
@@ -6271,17 +6271,17 @@ Perl_sv_pos_u2b(pTHX_ register SV *const sv, I32 *const offsetp, I32 *const lenp
start = (U8*)SvPV_const(sv, len);
if (len) {
- STRLEN uoffset = (STRLEN) *offsetp;
+ STRLEN uoffset = *offsetp;
const U8 * const send = start + len;
MAGIC *mg = NULL;
const STRLEN boffset = sv_pos_u2b_cached(sv, &mg, start, send,
uoffset, 0, 0);
- *offsetp = (I32) boffset;
+ *offsetp = boffset;
if (lenp) {
/* Convert the relative offset to absolute. */
- const STRLEN uoffset2 = uoffset + (STRLEN) *lenp;
+ const STRLEN uoffset2 = uoffset + *lenp;
const STRLEN boffset2
= sv_pos_u2b_cached(sv, &mg, start, send, uoffset2,
uoffset, boffset) - boffset;
@@ -6298,6 +6298,41 @@ Perl_sv_pos_u2b(pTHX_ register SV *const sv, I32 *const offsetp, I32 *const lenp
return;
}
+/*
+=for apidoc sv_pos_u2b
+
+Converts the value pointed to by offsetp from a count of UTF-8 chars from
+the start of the string, to a count of the equivalent number of bytes; if
+lenp is non-zero, it does the same to lenp, but this time starting from
+the offset, rather than from the start of the string. Handles magic and
+type coercion.
+
+=cut
+*/
+
+/*
+ * sv_pos_u2b() uses, like sv_pos_b2u(), the mg_ptr of the potential
+ * PERL_MAGIC_utf8 of the sv to store the mapping between UTF-8 and
+ * byte offsets. See also the comments of S_utf8_mg_pos_cache_update().
+ *
+ */
+
+/* This function is subject to size and sign problems */
+
+void
+Perl_sv_pos_u2b(pTHX_ register SV *const sv, I32 *const offsetp, I32 *const lenp)
+{
+ STRLEN uoffset = (STRLEN)*offsetp;
+ if (lenp) {
+ STRLEN ulen = (STRLEN)*lenp;
+ sv_pos_u2b_proper(sv, &uoffset, &ulen);
+ *lenp = (I32)ulen;
+ } else {
+ sv_pos_u2b_proper(sv, &uoffset, NULL);
+ }
+ *offsetp = (I32)uoffset;
+}
+
/* Create and update the UTF8 magic offset cache, with the proffered utf8/
byte length pairing. The (byte) length of the total SV is passed in too,
as blen, because for some (more esoteric) SVs, the call to SvPV_const()
diff --git a/t/re/substr.t b/t/re/substr.t
index c3fa6e1..c94e376 100644
--- a/t/re/substr.t
+++ b/t/re/substr.t
@@ -24,7 +24,7 @@ $SIG{__WARN__} = sub {
require './test.pl';
-plan(334);
+plan(358);
run_tests() unless caller;
@@ -682,4 +682,39 @@ is($x, "\x{100}\x{200}\xFFb");
is(substr($a,1,1), 'b');
}
+# [perl #62646] offsets exceeding 32 bits on 64-bit system
+SKIP: {
+ skip("32-bit system", 24) unless ~0 > 0xffffffff;
+ my $a = "abc";
+ my $s;
+ my $r;
+
+ utf8::downgrade($a);
+ for (1..2) {
+ $w = 0;
+ $r = substr($a, 0xffffffff, 1);
+ is($r, undef);
+ is($w, 1);
+
+ $w = 0;
+ $r = substr($a, 0xffffffff+1, 1);
+ is($r, undef);
+ is($w, 1);
+
+ $w = 0;
+ ok( !eval { $r = substr($s=$a, 0xffffffff, 1, "_"); 1 } );
+ is($r, undef);
+ is($s, $a);
+ is($w, 0);
+
+ $w = 0;
+ ok( !eval { $r = substr($s=$a, 0xffffffff+1, 1, "_"); 1 } );
+ is($r, undef);
+ is($s, $a);
+ is($w, 0);
+
+ utf8::upgrade($a);
+ }
+}
+
}
--
1.6.5.2
|
From @nwc10On Thu, Feb 11, 2010 at 07:18:11PM -0500, Eric Brine wrote:
I think: 1: give it (well, all the public functions, as sv_pod_b2u will need it too) Nicholas Clark |
From @ikegamiOn Fri, Feb 12, 2010 at 2:14 AM, Nicholas Clark <nick@ccl4.org> wrote:
That's exactly what I did. (See the patch I posted.) I couldn't come up with |
From zefram@fysh.orgEric Brine wrote:
Yeah, sorry. I was intending to work on this, but for the past month or My patch for non-lvalue substr turned out to be faulty. The willingness -zefram |
From @nwc10On Fri, Feb 12, 2010 at 10:53:27AM +0000, Zefram wrote:
In which case I think that the patch as Eric proposed is technically wrong. Nicholas Clark |
From zefram@fysh.orgNicholas Clark wrote:
That's ssize_t, in standard C, just as the unsigned STRLEN is size_t. When IV is larger than ssize_t, you need some logic to handle offsets -zefram |
From @nwc10On Fri, Feb 12, 2010 at 11:12:43AM +0000, Zefram wrote:
config.h has the former available as SSize_t
Are you envisaging a shim layer that merely copes with passing pointers to Nicholas Clark |
From zefram@fysh.orgNicholas Clark wrote:
Correctly handling the overflows is essential to the bug we're dealing -zefram |
From @ikegamiOn Fri, Feb 12, 2010 at 6:00 AM, Nicholas Clark <nick@ccl4.org> wrote:
I did make a comment about that, but it didn't go in nearly enough details const STRLEN upos = pos; The code can be adjusted so rem is never anything but a STRLEN, and a check In the same code lurks another bug: On a 32-bit system, not all positions of I'll address these today. |
From @nwc10On Fri, Feb 12, 2010 at 07:14:44AM +0000, Nicholas Clark wrote:
For reference, in my unpacked CPAN: ack --sort -l '\b(?:Perl_)?sv_pos_(?:u2b|b2u)\b(?!\|)' gave: Convert-Binary-C/tests/include/perlinc/embed.h [I think that I need to investigate something a bit more sophisticated than Google codesearch doesn't show any users either of either, outside the core. Looks like we can convert the I32 versions to wrappers, deprecated them, Nicholas Clark |
From @ikegamiTake 2. substr() must accept both UVs and IVs for its $pos and $len arguments in It also fixes: $ perl -wle'$[=2; print substr("abcdefghij", 1);' |
From @ikegami0001-Removes-32-bit-limit-on-substr-arguments.patchFrom 11857816569bd87acaa6fb52a97c45445113e82c Mon Sep 17 00:00:00 2001
From: Eric Brine <ikegami@adaelis.com>
Date: Thu, 11 Feb 2010 20:28:29 -0500
Subject: [PATCH] Removes 32-bit limit on substr arguments. The full range of IV and UV is available for the pos and len arguments, with safe conversion to STRLEN where it's smaller than an IV.
---
embed.fnc | 1 +
embed.h | 2 +
global.sym | 1 +
mg.c | 19 ++++----
pp.c | 143 +++++++++++++++++++++++++++++++++++++--------------------
proto.h | 5 ++
sv.c | 47 ++++++++++++++++--
t/re/substr.t | 42 ++++++++++++++++-
8 files changed, 193 insertions(+), 67 deletions(-)
diff --git a/embed.fnc b/embed.fnc
index 7463274..7e450aa 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -1165,6 +1165,7 @@ ApdR |SV* |sv_newmortal
Apd |SV* |sv_newref |NULLOK SV *const sv
Ap |char* |sv_peek |NULLOK SV* sv
Apd |void |sv_pos_u2b |NULLOK SV *const sv|NN I32 *const offsetp|NULLOK I32 *const lenp
+Apd |void |sv_pos_u2b_proper|NULLOK SV *const sv|NN STRLEN *const offsetp|NULLOK STRLEN *const lenp
Apd |void |sv_pos_b2u |NULLOK SV *const sv|NN I32 *const offsetp
Amdb |char* |sv_pvn_force |NN SV* sv|NULLOK STRLEN* lp
Apd |char* |sv_pvutf8n_force|NN SV *const sv|NULLOK STRLEN *const lp
diff --git a/embed.h b/embed.h
index 246106b..1281fcc 100644
--- a/embed.h
+++ b/embed.h
@@ -967,6 +967,7 @@
#define sv_newref Perl_sv_newref
#define sv_peek Perl_sv_peek
#define sv_pos_u2b Perl_sv_pos_u2b
+#define sv_pos_u2b_proper Perl_sv_pos_u2b_proper
#define sv_pos_b2u Perl_sv_pos_b2u
#define sv_pvutf8n_force Perl_sv_pvutf8n_force
#define sv_pvbyten_force Perl_sv_pvbyten_force
@@ -3371,6 +3372,7 @@
#define sv_newref(a) Perl_sv_newref(aTHX_ a)
#define sv_peek(a) Perl_sv_peek(aTHX_ a)
#define sv_pos_u2b(a,b,c) Perl_sv_pos_u2b(aTHX_ a,b,c)
+#define sv_pos_u2b_proper(a,b,c) Perl_sv_pos_u2b_proper(aTHX_ a,b,c)
#define sv_pos_b2u(a,b) Perl_sv_pos_b2u(aTHX_ a,b)
#define sv_pvutf8n_force(a,b) Perl_sv_pvutf8n_force(aTHX_ a,b)
#define sv_pvbyten_force(a,b) Perl_sv_pvbyten_force(aTHX_ a,b)
diff --git a/global.sym b/global.sym
index f0361df..f0e462e 100644
--- a/global.sym
+++ b/global.sym
@@ -567,6 +567,7 @@ Perl_sv_newmortal
Perl_sv_newref
Perl_sv_peek
Perl_sv_pos_u2b
+Perl_sv_pos_u2b_proper
Perl_sv_pos_b2u
Perl_sv_pvn_force
Perl_sv_pvutf8n_force
diff --git a/mg.c b/mg.c
index b9a1464..4f8207c 100644
--- a/mg.c
+++ b/mg.c
@@ -2008,17 +2008,17 @@ Perl_magic_getsubstr(pTHX_ SV *sv, MAGIC *mg)
STRLEN len;
SV * const lsv = LvTARG(sv);
const char * const tmps = SvPV_const(lsv,len);
- I32 offs = LvTARGOFF(sv);
- I32 rem = LvTARGLEN(sv);
+ STRLEN offs = LvTARGOFF(sv);
+ STRLEN rem = LvTARGLEN(sv);
PERL_ARGS_ASSERT_MAGIC_GETSUBSTR;
PERL_UNUSED_ARG(mg);
if (SvUTF8(lsv))
- sv_pos_u2b(lsv, &offs, &rem);
- if (offs > (I32)len)
+ sv_pos_u2b_proper(lsv, &offs, &rem);
+ if (offs > len)
offs = len;
- if (rem + offs > (I32)len)
+ if (rem > len - offs)
rem = len - offs;
sv_setpvn(sv, tmps + offs, (STRLEN)rem);
if (SvUTF8(lsv))
@@ -2033,22 +2033,22 @@ Perl_magic_setsubstr(pTHX_ SV *sv, MAGIC *mg)
STRLEN len;
const char * const tmps = SvPV_const(sv, len);
SV * const lsv = LvTARG(sv);
- I32 lvoff = LvTARGOFF(sv);
- I32 lvlen = LvTARGLEN(sv);
+ STRLEN lvoff = LvTARGOFF(sv);
+ STRLEN lvlen = LvTARGLEN(sv);
PERL_ARGS_ASSERT_MAGIC_SETSUBSTR;
PERL_UNUSED_ARG(mg);
if (DO_UTF8(sv)) {
sv_utf8_upgrade(lsv);
- sv_pos_u2b(lsv, &lvoff, &lvlen);
+ sv_pos_u2b_proper(lsv, &lvoff, &lvlen);
sv_insert(lsv, lvoff, lvlen, tmps, len);
LvTARGLEN(sv) = sv_len_utf8(sv);
SvUTF8_on(lsv);
}
else if (lsv && SvUTF8(lsv)) {
const char *utf8;
- sv_pos_u2b(lsv, &lvoff, &lvlen);
+ sv_pos_u2b_proper(lsv, &lvoff, &lvlen);
LvTARGLEN(sv) = len;
utf8 = (char*)bytes_to_utf8((U8*)tmps, &len);
sv_insert(lsv, lvoff, lvlen, utf8, len);
@@ -2059,7 +2059,6 @@ Perl_magic_setsubstr(pTHX_ SV *sv, MAGIC *mg)
LvTARGLEN(sv) = len;
}
-
return 0;
}
diff --git a/pp.c b/pp.c
index 2f4703b..95dc5fd 100644
--- a/pp.c
+++ b/pp.c
@@ -3079,15 +3079,19 @@ PP(pp_substr)
{
dVAR; dSP; dTARGET;
SV *sv;
- I32 len = 0;
STRLEN curlen;
STRLEN utf8_curlen;
- I32 pos;
- I32 rem;
- I32 fail;
+ SV * pos_sv;
+ IV pos1_iv;
+ int pos1_is_uv;
+ IV pos2_iv;
+ int pos2_is_uv;
+ SV * len_sv;
+ IV len_iv = 0;
+ int len_is_uv = 1;
const I32 lvalue = PL_op->op_flags & OPf_MOD || LVRET;
const char *tmps;
- const I32 arybase = CopARYBASE_get(PL_curcop);
+ const IV arybase = CopARYBASE_get(PL_curcop);
SV *repl_sv = NULL;
const char *repl = NULL;
STRLEN repl_len;
@@ -3103,9 +3107,13 @@ PP(pp_substr)
repl = SvPV_const(repl_sv, repl_len);
repl_is_utf8 = DO_UTF8(repl_sv) && SvCUR(repl_sv);
}
- len = POPi;
+ len_sv = POPs;
+ len_iv = SvIV(len_sv);
+ len_is_uv = SvIOK_UV(len_sv);
}
- pos = POPi;
+ pos_sv = POPs;
+ pos1_iv = SvIV(pos_sv);
+ pos1_is_uv = SvIOK_UV(pos_sv);
sv = POPs;
PUTBACK;
if (repl_sv) {
@@ -3127,51 +3135,80 @@ PP(pp_substr)
else
utf8_curlen = 0;
- if (pos >= arybase) {
- pos -= arybase;
- rem = curlen-pos;
- fail = rem;
- if (num_args > 2) {
- if (len < 0) {
- rem += len;
- if (rem < 0)
- rem = 0;
- }
- else if (rem > len)
- rem = len;
+ if ( (pos1_is_uv && arybase < 0) || (pos1_iv >= arybase) ) { /* pos >= $[ */
+ UV pos1_uv = pos1_iv-arybase;
+ /* Overflow can occur when $[ < 0 */
+ if (arybase < 0 && pos1_uv < (UV)pos1_iv)
+ goto BOUND_FAIL;
+ pos1_iv = pos1_uv;
+ pos1_is_uv = 1;
+ }
+ else if (pos1_is_uv ? (UV)pos1_iv > 0 : pos1_iv > 0) {
+ goto BOUND_FAIL; /* $[=3; substr($_,2,...) */
+ }
+ else { /* pos < $[ */
+ if (pos1_iv == 0) { /* $[=1; substr($_,0,...) */
+ pos1_iv = curlen;
+ pos1_is_uv = 1;
+ } else {
+ if (curlen) {
+ pos1_is_uv = curlen-1 > ~(UV)pos1_iv;
+ pos1_iv += curlen;
+ }
}
}
- else {
- pos += curlen;
- if (num_args < 3)
- rem = curlen;
- else if (len >= 0) {
- rem = pos+len;
- if (rem > (I32)curlen)
- rem = curlen;
+ if (pos1_is_uv || pos1_iv > 0) {
+ if ((UV)pos1_iv > curlen)
+ goto BOUND_FAIL;
+ }
+
+ if (num_args > 2) {
+ if (!len_is_uv && len_iv < 0) {
+ pos2_iv = curlen + len_iv;
+ if (curlen)
+ pos2_is_uv = curlen-1 > ~(UV)len_iv;
+ else
+ pos2_is_uv = 0;
+ } else { /* len_iv >= 0 */
+ if (!pos1_is_uv && pos1_iv < 0) {
+ pos2_iv = pos1_iv + len_iv;
+ pos2_is_uv = (UV)len_iv > (UV)IV_MAX;
+ } else {
+ if ((UV)len_iv > curlen-(UV)pos1_iv)
+ pos2_iv = curlen;
+ else
+ pos2_iv = pos1_iv+len_iv;
+ pos2_is_uv = 1;
+ }
}
- else {
- rem = curlen+len;
- if (rem < pos)
- rem = pos;
- }
- if (pos < 0)
- pos = 0;
- fail = rem;
- rem -= pos;
- }
- if (fail < 0) {
- if (lvalue || repl)
- Perl_croak(aTHX_ "substr outside of string");
- Perl_ck_warner(aTHX_ packWARN(WARN_SUBSTR), "substr outside of string");
- RETPUSHUNDEF;
}
else {
- const I32 upos = pos;
- const I32 urem = rem;
+ pos2_iv = curlen;
+ pos2_is_uv = 1;
+ }
+
+ if (!pos2_is_uv && pos2_iv < 0) {
+ if (!pos1_is_uv && pos1_iv < 0)
+ goto BOUND_FAIL;
+ pos2_iv = 0;
+ }
+ else if (!pos1_is_uv && pos1_iv < 0)
+ pos1_iv = 0;
+
+ if ((UV)pos2_iv < (UV)pos1_iv)
+ pos2_iv = pos1_iv;
+ if ((UV)pos2_iv > curlen)
+ pos2_iv = curlen;
+
+ {
+ /* pos1_iv and pos2_iv both in 0..curlen, so the cast is safe */
+ const STRLEN pos = (STRLEN)( (UV)pos1_iv );
+ const STRLEN len = (STRLEN)( (UV)pos2_iv - (UV)pos1_iv );
+ STRLEN byte_pos = pos;
+ STRLEN byte_len = len;
if (utf8_curlen)
- sv_pos_u2b(sv, (I32 *)&pos, (I32 *)&rem);
- tmps += pos;
+ sv_pos_u2b_proper(sv, &byte_pos, &byte_len);
+ tmps += byte_pos;
/* we either return a PV or an LV. If the TARG hasn't been used
* before, or is of that type, reuse it; otherwise use a mortal
* instead. Note that LVs can have an extended lifetime, so also
@@ -3185,7 +3222,7 @@ PP(pp_substr)
}
}
- sv_setpvn(TARG, tmps, rem);
+ sv_setpvn(TARG, tmps, byte_len);
#ifdef USE_LOCALE_COLLATE
sv_unmagic(TARG, PERL_MAGIC_collxfrm);
#endif
@@ -3202,7 +3239,7 @@ PP(pp_substr)
}
if (!SvOK(sv))
sv_setpvs(sv, "");
- sv_insert_flags(sv, pos, rem, repl, repl_len, 0);
+ sv_insert_flags(sv, byte_pos, byte_len, repl, repl_len, 0);
if (repl_is_utf8)
SvUTF8_on(sv);
SvREFCNT_dec(repl_sv_copy);
@@ -3232,13 +3269,19 @@ PP(pp_substr)
SvREFCNT_dec(LvTARG(TARG));
LvTARG(TARG) = SvREFCNT_inc_simple(sv);
}
- LvTARGOFF(TARG) = upos;
- LvTARGLEN(TARG) = urem;
+ LvTARGOFF(TARG) = pos;
+ LvTARGLEN(TARG) = len;
}
}
SPAGAIN;
PUSHs(TARG); /* avoid SvSETMAGIC here */
RETURN;
+
+BOUND_FAIL:
+ if (lvalue || repl)
+ Perl_croak(aTHX_ "substr outside of string");
+ Perl_ck_warner(aTHX_ packWARN(WARN_SUBSTR), "substr outside of string");
+ RETPUSHUNDEF;
}
PP(pp_vec)
diff --git a/proto.h b/proto.h
index 4a343be..ae48597 100644
--- a/proto.h
+++ b/proto.h
@@ -3374,6 +3374,11 @@ PERL_CALLCONV void Perl_sv_pos_u2b(pTHX_ SV *const sv, I32 *const offsetp, I32 *
#define PERL_ARGS_ASSERT_SV_POS_U2B \
assert(offsetp)
+PERL_CALLCONV void Perl_sv_pos_u2b_proper(pTHX_ SV *const sv, STRLEN *const offsetp, STRLEN *const lenp)
+ __attribute__nonnull__(pTHX_2);
+#define PERL_ARGS_ASSERT_SV_POS_U2B_PROPER \
+ assert(offsetp)
+
PERL_CALLCONV void Perl_sv_pos_b2u(pTHX_ SV *const sv, I32 *const offsetp)
__attribute__nonnull__(pTHX_2);
#define PERL_ARGS_ASSERT_SV_POS_B2U \
diff --git a/sv.c b/sv.c
index 4ab41f6..02be580 100644
--- a/sv.c
+++ b/sv.c
@@ -6240,7 +6240,7 @@ S_sv_pos_u2b_cached(pTHX_ SV *const sv, MAGIC **const mgp, const U8 *const start
/*
-=for apidoc sv_pos_u2b
+=for apidoc sv_pos_u2b_proper
Converts the value pointed to by offsetp from a count of UTF-8 chars from
the start of the string, to a count of the equivalent number of bytes; if
@@ -6252,14 +6252,14 @@ type coercion.
*/
/*
- * sv_pos_u2b() uses, like sv_pos_b2u(), the mg_ptr of the potential
+ * sv_pos_u2b_proper() uses, like sv_pos_b2u(), the mg_ptr of the potential
* PERL_MAGIC_utf8 of the sv to store the mapping between UTF-8 and
* byte offsets. See also the comments of S_utf8_mg_pos_cache_update().
*
*/
void
-Perl_sv_pos_u2b(pTHX_ register SV *const sv, I32 *const offsetp, I32 *const lenp)
+Perl_sv_pos_u2b_proper(pTHX_ register SV *const sv, STRLEN *const offsetp, STRLEN *const lenp)
{
const U8 *start;
STRLEN len;
@@ -6271,17 +6271,17 @@ Perl_sv_pos_u2b(pTHX_ register SV *const sv, I32 *const offsetp, I32 *const lenp
start = (U8*)SvPV_const(sv, len);
if (len) {
- STRLEN uoffset = (STRLEN) *offsetp;
+ STRLEN uoffset = *offsetp;
const U8 * const send = start + len;
MAGIC *mg = NULL;
const STRLEN boffset = sv_pos_u2b_cached(sv, &mg, start, send,
uoffset, 0, 0);
- *offsetp = (I32) boffset;
+ *offsetp = boffset;
if (lenp) {
/* Convert the relative offset to absolute. */
- const STRLEN uoffset2 = uoffset + (STRLEN) *lenp;
+ const STRLEN uoffset2 = uoffset + *lenp;
const STRLEN boffset2
= sv_pos_u2b_cached(sv, &mg, start, send, uoffset2,
uoffset, boffset) - boffset;
@@ -6298,6 +6298,41 @@ Perl_sv_pos_u2b(pTHX_ register SV *const sv, I32 *const offsetp, I32 *const lenp
return;
}
+/*
+=for apidoc sv_pos_u2b
+
+Converts the value pointed to by offsetp from a count of UTF-8 chars from
+the start of the string, to a count of the equivalent number of bytes; if
+lenp is non-zero, it does the same to lenp, but this time starting from
+the offset, rather than from the start of the string. Handles magic and
+type coercion.
+
+=cut
+*/
+
+/*
+ * sv_pos_u2b() uses, like sv_pos_b2u(), the mg_ptr of the potential
+ * PERL_MAGIC_utf8 of the sv to store the mapping between UTF-8 and
+ * byte offsets. See also the comments of S_utf8_mg_pos_cache_update().
+ *
+ */
+
+/* This function is subject to size and sign problems */
+
+void
+Perl_sv_pos_u2b(pTHX_ register SV *const sv, I32 *const offsetp, I32 *const lenp)
+{
+ STRLEN uoffset = (STRLEN)*offsetp;
+ if (lenp) {
+ STRLEN ulen = (STRLEN)*lenp;
+ sv_pos_u2b_proper(sv, &uoffset, &ulen);
+ *lenp = (I32)ulen;
+ } else {
+ sv_pos_u2b_proper(sv, &uoffset, NULL);
+ }
+ *offsetp = (I32)uoffset;
+}
+
/* Create and update the UTF8 magic offset cache, with the proffered utf8/
byte length pairing. The (byte) length of the total SV is passed in too,
as blen, because for some (more esoteric) SVs, the call to SvPV_const()
diff --git a/t/re/substr.t b/t/re/substr.t
index c3fa6e1..d0717ba 100644
--- a/t/re/substr.t
+++ b/t/re/substr.t
@@ -24,7 +24,7 @@ $SIG{__WARN__} = sub {
require './test.pl';
-plan(334);
+plan(360);
run_tests() unless caller;
@@ -201,6 +201,11 @@ is($w--, 1);
eval{substr($a,1) = "" ; }; # P=R=S Q
like($@, $FATAL_MSG);
+$b = substr($a,-7,-6) ; # warn # Q R P S
+is($w--, 1);
+eval{substr($a,-7,-6) = "" ; }; # Q R P S
+like($@, $FATAL_MSG);
+
my $a = 'zxcvbnm';
substr($a,2,0) = '';
is($a, 'zxcvbnm');
@@ -682,4 +687,39 @@ is($x, "\x{100}\x{200}\xFFb");
is(substr($a,1,1), 'b');
}
+# [perl #62646] offsets exceeding 32 bits on 64-bit system
+SKIP: {
+ skip("32-bit system", 24) unless ~0 > 0xffffffff;
+ my $a = "abc";
+ my $s;
+ my $r;
+
+ utf8::downgrade($a);
+ for (1..2) {
+ $w = 0;
+ $r = substr($a, 0xffffffff, 1);
+ is($r, undef);
+ is($w, 1);
+
+ $w = 0;
+ $r = substr($a, 0xffffffff+1, 1);
+ is($r, undef);
+ is($w, 1);
+
+ $w = 0;
+ ok( !eval { $r = substr($s=$a, 0xffffffff, 1, "_"); 1 } );
+ is($r, undef);
+ is($s, $a);
+ is($w, 0);
+
+ $w = 0;
+ ok( !eval { $r = substr($s=$a, 0xffffffff+1, 1, "_"); 1 } );
+ is($r, undef);
+ is($s, $a);
+ is($w, 0);
+
+ utf8::upgrade($a);
+ }
+}
+
}
--
1.6.5.2
|
From @nwc10On Sat, Feb 13, 2010 at 12:15:56AM -0500, Eric Brine wrote:
Whilst I've looked at the patch, and don't see anything wrong with it, for Whilst this *isn't* technically the same bug, it is related to I32 abuse and $ time ./perl -lwe 'open A, shift or die $!; read A, $a, (1<<32) + 4; chop $a; warn length $a; substr $a, 0x100000000, 1, chr 256; print ord substr $a, 0x100000000; print ord substr $a, 0xFFFFFFF0; warn length $a; warn length $a' /dev/zero real 37m17.458s I propose the following work around, which disables storing a bad value in Inline Patchdiff --git a/sv.c b/sv.c
index 02be580..87fc348 100644
--- a/sv.c
+++ b/sv.c
@@ -6072,6 +6072,10 @@ Perl_sv_len_utf8(pTHX_ register SV *const sv)
}
assert(mg);
mg->mg_len = ulen;
+ /* For now, treat "overflowed" as "still unknown".
+ See RT #72784. */
+ if (ulen != (STRLEN) mg->mg_len)
+ mg->mg_len = -1;
}
}
return ulen;
warn length $a; substr $a, 0x100000000, 1, chr 256; print ord substr $a, 0x100000000; print ord substr $a, 0xFFFFFFF0; warn length $a; warn length $a' /dev/zero real 3m41.042s [I've compiled with -DDEBUGGING so I get debug mode enabled. If I turn it off: $ time ./perl -lwe '${^UTF8CACHE}=1; open A, shift or die $!; read A, $a, (1<<32) + 4; chop $a; warn length $a; substr $a, 0x100000000, 1, chr 256; print ord substr $a, 0x100000000; print ord substr $a, 0xFFFFFFF0; warn length $a; warn length $a' /dev/zero real 2m21.951s I don't know how many fewer linear scans of the string that time drop equates I've created a meta-ticket to group all the tickets related to I32 abuse, http://rt.perl.org/rt3/Ticket/Display.html?id=72784 Nicholas Clark |
From @nwc10On Sun, Feb 14, 2010 at 02:23:37PM -0500, Eric Brine wrote:
I had committed it, and fixed that part, and found a bug in the test that's Nicholas Clark |
From @nwc10Thanks for the report. Sorry for the delay in replying. On Sat Sep 12 08:52:05 2009, perlbug@plan9.de wrote:
Yes. It's defective.
I don't think that there's actually an easy way to do that. There's a patch in blead now specifically for substr. Jesse isn't I made a new ticket to track tickets relating to the misuse of I32: http://rt.perl.org/rt3/Ticket/Display.html?id=72784 (the intent being that we close this ticket as it's just tracking substr) Nicholas Clark |
@nwc10 - Status changed from 'open' to 'resolved' |
Migrated from rt.perl.org#62646 (status was 'resolved')
Searchable as RT62646$
The text was updated successfully, but these errors were encountered: