Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perl doesn't handle m{}g for >= 2GB files even with largefiles and 64bitall #12807

Closed
p5pRT opened this issue Feb 23, 2013 · 37 comments
Closed

Comments

@p5pRT
Copy link

p5pRT commented Feb 23, 2013

Migrated from rt.perl.org#116907 (status was 'resolved')

Searchable as RT116907$

@p5pRT
Copy link
Author

p5pRT commented Feb 23, 2013

From @jhi

To replicate one way is to first generate a big XML file​:

perl -e 'for (0..2**27){print "<id>$_</id>\n"}' > big.xml

This generates a (rather boring) XML of 2439025741 bytes, safely north
of 2**31 bytes.

This works, printing "--0--\n" etc.​:

dd if=big.xml bs=1048576 count=2047 | perl -wlne 'BEGIN{undef $/};print
length;while(m{<id>(.+?)</id>}g){print "--$1--"}'|head

This doesn't, the only thing output from perl is the length()​:

dd if=big.xml bs=1048576 count=2048 | perl -wlne 'BEGIN{undef $/};print
length;while(m{<id>(.+?)</id>}g){print "--$1--"}'|head

I originally had m{}gs since I was looking for multiline things but
managed to shave it down to this.

Verified to happen in 5.16.2 (in Ubuntu Precise Pangolin) x86_64, and in
bleadperl in whatever is the latest OS X. In both cases​:

useperlio=define,uselargefiles=define,use64bitint=define, use64bitall=define

@p5pRT
Copy link
Author

p5pRT commented Mar 3, 2013

From @arc

Jarkko Hietaniemi <perlbug-followup@​perl.org> wrote​:

This doesn't, the only thing output from perl is the length()​:

dd if=big.xml bs=1048576 count=2048 | perl -wlne 'BEGIN{undef $/};print
length;while(m{<id>(.+?)</id>}g){print "--$1--"}'|head

useperlio=define,uselargefiles=define,use64bitint=define, use64bitall=define

As a note for anyone looking into this further​: I believe this can't
be fixed without either (a) changing the type of mg_len from I32 to IV
or Ssize_t (not UV or Size_t, because I think there are parts of the
code which assume it can be negative), or (b) changing
PERL_MAGIC_regex_global to store the value somewhere other than
mg->mg_len.

There may be back-compat issues for the former approach, since
mg_length() currently returns a plain int and is in the public API.
There may also be performance implications for increasing the size of
struct magic.

The obvious "somewhere else" for the latter approach is presumably
mg_obj, but I fear that implies changing storage of pos() from a
machine integer to an SVIV, with an implied increase in memory
footprint and pointer-chasing overhead.

--
Aaron Crane ** http​://aaroncrane.co.uk/

@p5pRT
Copy link
Author

p5pRT commented Mar 3, 2013

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Mar 3, 2013

From @Leont

On Sun, Mar 3, 2013 at 2​:55 PM, Aaron Crane <arc@​cpan.org> wrote​:

As a note for anyone looking into this further​: I believe this can't
be fixed without either (a) changing the type of mg_len from I32 to IV
or Ssize_t (not UV or Size_t, because I think there are parts of the
code which assume it can be negative), or (b) changing
PERL_MAGIC_regex_global to store the value somewhere other than
mg->mg_len.

There may be back-compat issues for the former approach, since
mg_length() currently returns a plain int and is in the public API.
There may also be performance implications for increasing the size of
struct magic.

AFAICT changing mg_len shouldn't break API unless someone is taking a
pointer to it, which doesn't seem very likely.

The obvious "somewhere else" for the latter approach is presumably
mg_obj, but I fear that implies changing storage of pos() from a
machine integer to an SVIV, with an implied increase in memory
footprint and pointer-chasing overhead.

A new buffer in mg_ptr seems more logical (didn't look at the
regex_global code but if mg_len is available mg_ptr should be too).
I'd go for enlarging mg_len first though.

Leon

@p5pRT
Copy link
Author

p5pRT commented Jun 25, 2013

From @cpansprout

On Sun Mar 03 06​:48​:09 2013, LeonT wrote​:

On Sun, Mar 3, 2013 at 2​:55 PM, Aaron Crane <arc@​cpan.org> wrote​:

As a note for anyone looking into this further​: I believe this can't
be fixed without either (a) changing the type of mg_len from I32 to IV
or Ssize_t (not UV or Size_t, because I think there are parts of the
code which assume it can be negative), or (b) changing
PERL_MAGIC_regex_global to store the value somewhere other than
mg->mg_len.

There may be back-compat issues for the former approach, since
mg_length() currently returns a plain int and is in the public API.
There may also be performance implications for increasing the size of
struct magic.

AFAICT changing mg_len shouldn't break API unless someone is taking a
pointer to it, which doesn't seem very likely.

Also, mg_length is deprecated in 5.18.0. It may or may not invoke
get-magic, and leaves the caller no indication as to whether it did.
The length returned is in bytes, but the UTF8 flag is not set.

So it’s pretty useless, except as an exercise in masochism.

The obvious "somewhere else" for the latter approach is presumably
mg_obj, but I fear that implies changing storage of pos() from a
machine integer to an SVIV, with an implied increase in memory
footprint and pointer-chasing overhead.

A new buffer in mg_ptr seems more logical (didn't look at the
regex_global code but if mg_len is available mg_ptr should be too).
I'd go for enlarging mg_len first though.

meetoo

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Jun 25, 2013

From @cpansprout

On Sun Mar 03 06​:48​:09 2013, LeonT wrote​:

AFAICT changing mg_len shouldn't break API unless someone is taking a
pointer to it, which doesn't seem very likely.

I have just searched CPAN for /&.*mg_len/. The only uses of it are in
the perl core.

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Jul 20, 2013

From @cpansprout

On Sat Feb 23 15​:25​:15 2013, jhi wrote​:

To replicate one way is to first generate a big XML file​:

perl -e 'for (0..2**27){print "<id>$_</id>\n"}' > big.xml

This generates a (rather boring) XML of 2439025741 bytes, safely north
of 2**31 bytes.

This works, printing "--0--\n" etc.​:

dd if=big.xml bs=1048576 count=2047 | perl -wlne 'BEGIN{undef $/};print
length;while(m{<id>(.+?)</id>}g){print "--$1--"}'|head

This doesn't, the only thing output from perl is the length()​:

dd if=big.xml bs=1048576 count=2048 | perl -wlne 'BEGIN{undef $/};print
length;while(m{<id>(.+?)</id>}g){print "--$1--"}'|head

I originally had m{}gs since I was looking for multiline things but
managed to shave it down to this.

Verified to happen in 5.16.2 (in Ubuntu Precise Pangolin) x86_64, and in
bleadperl in whatever is the latest OS X. In both cases​:

useperlio=define,uselargefiles=define,use64bitint=define,
use64bitall=define

I think this demonstrates the problem more clearly​:

$ ./perl -lIlib -e '$x=" "x(2**31+20); pos $x = 2**31-5; for(1..10){
print pos $x; ++pos $x}'
2147483643
2147483644
2147483645
2147483646
2147483647

1
2
3
4

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Jul 21, 2013

From @cpansprout

On Sat Jul 20 11​:05​:51 2013, sprout wrote​:

I think this demonstrates the problem more clearly​:

$ ./perl -lIlib -e '$x=" "x(2**31+20); pos $x = 2**31-5; for(1..10){
print pos $x; ++pos $x}'
2147483643
2147483644
2147483645
2147483646
2147483647

1
2
3
4

That’s only part of the problem. Another data point​:

$ ./perl -lIlib -e '$x=" "x(2**31-1); $x=~/./; print "[$&amp;]"'
[ ]

$ ./perl -lIlib -e '$x=" "x(2**31); $x=~/./; print "[$&amp;]"'
[]

So it’s not just pos() that is affected.

I think this is the same as, or similar to, #103260.

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Jul 21, 2013

From @jhi

On Saturday-201307-20 22​:25, Father Chrysostomos via RT wrote​:

On Sat Jul 20 11​:05​:51 2013, sprout wrote​:

I think this demonstrates the problem more clearly​:

$ ./perl -lIlib -e '$x=" "x(2**31+20); pos $x = 2**31-5; for(1..10){
print pos $x; ++pos $x}'
2147483643
2147483644
2147483645
2147483646
2147483647

1
2
3
4

That’s only part of the problem. Another data point​:

$ ./perl -lIlib -e '$x=" "x(2**31-1); $x=~/./; print "[$&amp;]"'
[ ]

$ ./perl -lIlib -e '$x=" "x(2**31); $x=~/./; print "[$&amp;]"'
[]

So it’s not just pos() that is affected.

Indeed.

I haven't looked at the code (that is, in the last ten years or so) but
I bet the problem is that some offsets/sizes are still explicitly I32
(or U32, though not in this particular issue) instead of IV/UV.

I think this is the same as, or similar to, #103260.

@p5pRT
Copy link
Author

p5pRT commented Jul 21, 2013

From @cpansprout

On Sun Jul 21 06​:57​:21 2013, jhi wrote​:

On Saturday-201307-20 22​:25, Father Chrysostomos via RT wrote​:

On Sat Jul 20 11​:05​:51 2013, sprout wrote​:

I think this demonstrates the problem more clearly​:

$ ./perl -lIlib -e '$x=" "x(2**31+20); pos $x = 2**31-5; for(1..10){
print pos $x; ++pos $x}'
2147483643
2147483644
2147483645
2147483646
2147483647

1
2
3
4

That’s only part of the problem. Another data point​:

$ ./perl -lIlib -e '$x=" "x(2**31-1); $x=~/./; print "[$&amp;]"'
[ ]

$ ./perl -lIlib -e '$x=" "x(2**31); $x=~/./; print "[$&amp;]"'
[]

So it’s not just pos() that is affected.

Indeed.

I haven't looked at the code (that is, in the last ten years or so)

And yet you are still ahead in terms of the sheer number of commits.
How did you manage that?

but
I bet the problem is that some offsets/sizes are still explicitly I32
(or U32, though not in this particular issue) instead of IV/UV.

That is exactly the problem. I have a local patch to fix it at least
for pos(), but the test I added is very slow because the strings gets
copied as a result of constant folding. And I thought constant folding
was supposed to be on optimisation....

So it is another of these multifaceted puzzles.

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Jul 21, 2013

From @jhi

I haven't looked at the code (that is, in the last ten years or so)

And yet you are still ahead in terms of the sheer number of commits.
How did you manage that?

Obsessive-compulsive disorder?

but
I bet the problem is that some offsets/sizes are still explicitly I32
(or U32, though not in this particular issue) instead of IV/UV.

That is exactly the problem. I have a local patch to fix it at least
for pos(), but the test I added is very slow because the strings gets
copied as a result of constant folding. And I thought constant folding
was supposed to be on optimisation....

An interesting experiment would be to just mechanically to do a
s/([IU])32/${1}V/g in all the core files (*), fix up all the resulting
compilation noise, and iterate the core test suite until it runs
again.

(*) Yes, even including all the external APIs, at least initially.
This would be experiment, not necessarily a finished product. And isn't
one of the goals of the current faster release schedule to be able to
jettison old incompatible releases faster?

So it is another of these multifaceted puzzles.

@p5pRT
Copy link
Author

p5pRT commented Jul 22, 2013

From PeterCMartini@GMail.com

On Sun, Jul 21, 2013 at 3​:00 PM, Jarkko Hietaniemi <jhi@​iki.fi> wrote​:

I haven't looked at the code (that is, in the last ten years or so)

And yet you are still ahead in terms of the sheer number of commits.
How did you manage that?

Obsessive-compulsive disorder?

but
I bet the problem is that some offsets/sizes are still explicitly I32
(or U32, though not in this particular issue) instead of IV/UV.

That is exactly the problem. I have a local patch to fix it at least
for pos(), but the test I added is very slow because the strings gets
copied as a result of constant folding. And I thought constant folding
was supposed to be on optimisation....

An interesting experiment would be to just mechanically to do a
s/([IU])32/${1}V/g in all the core files (*), fix up all the resulting
compilation noise, and iterate the core test suite until it runs
again.

(*) Yes, even including all the external APIs, at least initially.
This would be experiment, not necessarily a finished product. And isn't one
of the goals of the current faster release schedule to be able to
jettison old incompatible releases faster?

So it is another of these multifaceted puzzles.

FWIW, there's already a meta ticket for the I32/IV issue​:
https://rt-archive.perl.org/perl5/Public/Bug/Display.html?id=72784

@p5pRT
Copy link
Author

p5pRT commented Jul 23, 2013

From @cpansprout

On Mon Jun 24 18​:22​:12 2013, sprout wrote​:

On Sun Mar 03 06​:48​:09 2013, LeonT wrote​:

AFAICT changing mg_len shouldn't break API unless someone is taking a
pointer to it, which doesn't seem very likely.

I have just searched CPAN for /&.*mg_len/. The only uses of it are in
the perl core.

One of those uses is Perl_hv_placeholders_p. Added in ca73285,
this function is marked as being in the public API (A in embed.fnc), but
has no documentation and no short form (hv_placeholders_p). There are
no uses of it on CPAN.

Is this another case where a function was mistakenly added to the API
(like stashpv_hvname_match, added in ed221c5 and removed in
c947b31), whereas it should simply have been exported (X in embed.fnc)?

The only use of this function *anywhere* in publicly-viewable code (OK,
just CPAN and the perl core) is the HvPLACEHOLDERS macro in hv.h.

Only one CPAN module, Data​::Alias, uses HvPLACEHOLDERS, and it is like
this​: HvPLACEHOLDERS(a2)++.

So it looks as though it is safe to change

ApoR |I32* |hv_placeholders_p |NN HV *hv

to

XpoR |SSize_t*|hv_placeholders_p |NN HV *hv

Comments?

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Jul 23, 2013

From @rurban

On Tue, Jul 23, 2013 at 12​:15 AM, Father Chrysostomos via RT
<perlbug-followup@​perl.org> wrote​:
...

Is this another case where a function was mistakenly added to the API
(like stashpv_hvname_match, added in ed221c5 and removed in
c947b31), whereas it should simply have been exported (X in embed.fnc)?

The only use of this function *anywhere* in publicly-viewable code (OK,
just CPAN and the perl core) is the HvPLACEHOLDERS macro in hv.h.

Only one CPAN module, Data​::Alias, uses HvPLACEHOLDERS, and it is like
this​: HvPLACEHOLDERS(a2)++.

So it looks as though it is safe to change

ApoR |I32* |hv_placeholders_p |NN HV *hv

to

XpoR |SSize_t*|hv_placeholders_p |NN HV *hv

Comments?

No objection
--
Reini Urban
http​://cpanel.net/ http​://www.perl-compiler.org/

@p5pRT
Copy link
Author

p5pRT commented Jul 23, 2013

From @jkeenan

On 7/23/13 1​:15 AM, Father Chrysostomos via RT wrote​:

So it looks as though it is safe to change

ApoR |I32* |hv_placeholders_p |NN HV *hv

to

XpoR |SSize_t*|hv_placeholders_p |NN HV *hv

Comments?

A good candidate for a smoke-me.

@p5pRT
Copy link
Author

p5pRT commented Jul 24, 2013

From @cpansprout

On Mon Jul 22 22​:15​:05 2013, sprout wrote​:

On Mon Jun 24 18​:22​:12 2013, sprout wrote​:

On Sun Mar 03 06​:48​:09 2013, LeonT wrote​:

AFAICT changing mg_len shouldn't break API unless someone is taking a
pointer to it, which doesn't seem very likely.

I have just searched CPAN for /&.*mg_len/. The only uses of it are in
the perl core.

One of those uses is Perl_hv_placeholders_p. Added in ca73285,
this function is marked as being in the public API (A in embed.fnc), but
has no documentation and no short form (hv_placeholders_p). There are
no uses of it on CPAN.

I have now fixed setting and reading of pos in commit 6174b39, which
also changed the return type of Perl_hv_placeholders_p.

The rest of this ticket is as yet unresolved.

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Jul 24, 2013

From @cpansprout

On Tue Jul 23 16​:11​:37 2013, jkeen@​verizon.net wrote​:

On 7/23/13 1​:15 AM, Father Chrysostomos via RT wrote​:

So it looks as though it is safe to change

ApoR |I32* |hv_placeholders_p |NN HV *hv

to

XpoR |SSize_t*|hv_placeholders_p |NN HV *hv

Comments?

A good candidate for a smoke-me.

I only saw your message after pushing to blead. A smoke-me branch won’t
be necessary now, as it is already being smoked.

If you meant a CPAN smoke, that wouldn’t help, because only one module,
Data​::Alias, is using the function in question, and it hasn’t been
updated for 5.18.

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Jul 25, 2013

From @cpansprout

On Sat Jul 20 19​:25​:52 2013, sprout wrote​:

On Sat Jul 20 11​:05​:51 2013, sprout wrote​:

I think this demonstrates the problem more clearly​:

$ ./perl -lIlib -e '$x=" "x(2**31+20); pos $x = 2**31-5; for(1..10){
print pos $x; ++pos $x}'
2147483643
2147483644
2147483645
2147483646
2147483647

1
2
3
4

That’s only part of the problem. Another data point​:

$ ./perl -lIlib -e '$x=" "x(2**31-1); $x=~/./; print "[$&amp;]"'
[ ]

$ ./perl -lIlib -e '$x=" "x(2**31); $x=~/./; print "[$&amp;]"'
[]

I’ve managed to fix this one locally, but now I’m running into another
problem​:

void
Perl_reg_numbered_buff_fetch(pTHX_ REGEXP * const r, const I32 paren,
  SV * const sv)

How on earth do we change that ‘const I32 paren’ without breaking the
regexp API?

Is the regexp API considered experimental (please, please)? Or do we
need compatibility layers?

perlreapi says​:

  In order to install a new regexp handler, $^H{regcomp} is set to
  an integer which (when casted appropriately) resolves to one of
  these structures. When compiling, the "comp" method is executed,
  and the resulting regexp structure's engine field is expected to
  point back at the same structure.

Does that mean we need $^H{regcomp_wide} = 1? (_wide is not a good
name, but what do we call it?)

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Jul 25, 2013

From @Leont

On Thu, Jul 25, 2013 at 3​:19 AM, Father Chrysostomos via RT
<perlbug-followup@​perl.org> wrote​:

I’ve managed to fix this one locally, but now I’m running into another
problem​:

void
Perl_reg_numbered_buff_fetch(pTHX_ REGEXP * const r, const I32 paren,
SV * const sv)

How on earth do we change that ‘const I32 paren’ without breaking the
regexp API?

Is the regexp API considered experimental (please, please)? Or do we
need compatibility layers?

perlreapi says​:

   In order to install a new regexp handler\, $^H\{regcomp\} is set to
   an integer which \(when casted appropriately\) resolves to one of
   these structures\. When compiling\, the "comp" method is executed\,
   and the resulting regexp structure's engine field is expected to
   point back at the same structure\.

Does that mean we need $^H{regcomp_wide} = 1? (_wide is not a good
name, but what do we call it?)

Ouch!

Yeah, that is tricky :-/

Leon

@p5pRT
Copy link
Author

p5pRT commented Jul 25, 2013

From @cpansprout

On Wed Jul 24 18​:24​:17 2013, LeonT wrote​:

On Thu, Jul 25, 2013 at 3​:19 AM, Father Chrysostomos via RT
<perlbug-followup@​perl.org> wrote​:

I’ve managed to fix this one locally, but now I’m running into another
problem​:

void
Perl_reg_numbered_buff_fetch(pTHX_ REGEXP * const r, const I32 paren,
SV * const sv)

How on earth do we change that ‘const I32 paren’ without breaking the
regexp API?

Is the regexp API considered experimental (please, please)? Or do we
need compatibility layers?

perlreapi says​:

   In order to install a new regexp handler\, $^H\{regcomp\} is set to
   an integer which \(when casted appropriately\) resolves to one of
   these structures\. When compiling\, the "comp" method is executed\,
   and the resulting regexp structure's engine field is expected to
   point back at the same structure\.

Wait, I wasn’t thinking. I got an assertion failure from that function,
so I jumped to the conclusion that the I32 parameter had something to do
with it.

paren can be I32, since no system has enough memory to compile a regular
expression with more than 2 billion captures.

This is where the problem lies​:

  typedef struct regexp_paren_pair {
  I32 start;
  I32 end;
  } regexp_paren_pair;

start and end obviously need to be bigger.

What makes this tricky is that it is stored inside the regexp struct
itself, so it having a engine declare itself as 64-bit compatible
doesn’t solve anything. (Please prove me wrong.)

If regexp plugins never actually take pointers to start or end, then it
make be OK to change the size. Any engine that assigns an I32 to one of
those or one of those to an I32 will be just as buggy as before.

Regexp engines that use STRLEN will just work, except in older perls
where the value will be truncated to I32 and be just as buggy as before.

So it does not seem so bad after all.

The same thing applies to sublen in the regexp struct. It might apply
to prelen and wraplen, too, but capping the length of regular
expressions themselves at 2**31 seems reasonable to me, so we don’t have
to change those.

minlen and minlenret are only for optimisations, so they can stay as I32.

The other place where we have an I32 is the return value of
numbered_buff_LENGTH, but the core never calls that any more and the XS
front-end to it is deprecated, as length magic cannot work as long as we
support utf8.

Does that mean we need $^H{regcomp_wide} = 1? (_wide is not a good
name, but what do we call it?)

Ouch!

Yeah, that is tricky :-/

$^H{regcomp_wide} obviously wouldn’t work, because an older regexp
engine enabling itself in the same scope won’t know to unset it.

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Aug 18, 2013

From @cpansprout

I have fixed this bug and others on the sprout/regexp branch. I would
appreciate it if someone more familiar with the regexp engine could
review at least the last two commits.

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Aug 18, 2013

From @cpansprout

From a03f5d860f145bd038f8d1e186c1c018dbfb949c Mon Sep 17 00​:00​:00 2001
From​: Father Chrysostomos <sprout@​cpan.org>
Date​: Tue, 30 Jul 2013 23​:49​:58 -0700
Subject​: [PATCH 1/2] Stop substr re optimisation from rejecting long strs

Using I32 for the fields that record information about the location of
a fixed string that must be found for a regular expression to match
can result in match failures, because I32 is not large enough to store
offsets >= 2**31.

SSize_t is appropriate, since it is 64 bits on 64-bit platforms and 32
bits on 32-bit platforms.

This commit changes enough instances of I32 to SSize_t to get the
added test passing and suppress compiler warnings. A later commit
will change many more.


embed.fnc | 4 ++--
perl.h | 2 ++
proto.h | 2 +-
regcomp.c | 39 ++++++++++++++++++++-------------------
regexec.c | 4 ++--
regexp.h | 4 ++--
t/bigmem/regexp.t | 6 +++++-
7 files changed, 34 insertions(+), 27 deletions(-)

Inline Patch
diff --git a/embed.fnc b/embed.fnc
index 3400222..edf0a2f 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -2042,9 +2042,9 @@ Esn	|void	|cl_and		|NN struct regnode_charclass_class *cl \
 Esn	|void	|cl_or		|NN const struct RExC_state_t *pRExC_state \
 				|NN struct regnode_charclass_class *cl \
 				|NN const struct regnode_charclass_class *or_with
-Es	|I32	|study_chunk	|NN struct RExC_state_t *pRExC_state \
+Es	|SSize_t|study_chunk	|NN struct RExC_state_t *pRExC_state \
 				|NN regnode **scanp|NN I32 *minlenp \
-				|NN I32 *deltap|NN regnode *last \
+				|NN SSize_t *deltap|NN regnode *last \
 				|NULLOK struct scan_data_t *data \
 				|I32 stopparen|NULLOK U8* recursed \
 				|NULLOK struct regnode_charclass_class *and_withp \
diff --git a/perl.h b/perl.h
index bdb64b3..d2c5568 100644
--- a/perl.h
+++ b/perl.h
@@ -1601,6 +1601,8 @@ typedef UVTYPE UV;
 #  endif
 #endif
 
+#define SSize_t_MAX (SSize_t)(~(size_t)0 >> 1)
+
 #ifndef HAS_QUAD
 # undef PERL_NEED_MY_HTOLE64
 # undef PERL_NEED_MY_LETOH64
diff --git a/proto.h b/proto.h
index 5969e7b..c50d60f 100644
--- a/proto.h
+++ b/proto.h
@@ -6770,7 +6770,7 @@ STATIC void	S_scan_commit(pTHX_ const struct RExC_state_t *pRExC_state, struct s
 #define PERL_ARGS_ASSERT_SCAN_COMMIT	\
 	assert(pRExC_state); assert(data); assert(minlenp)
 
-STATIC I32	S_study_chunk(pTHX_ struct RExC_state_t *pRExC_state, regnode **scanp, I32 *minlenp, I32 *deltap, regnode *last, struct scan_data_t *data, I32 stopparen, U8* recursed, struct regnode_charclass_class *and_withp, U32 flags, U32 depth)
+STATIC SSize_t	S_study_chunk(pTHX_ struct RExC_state_t *pRExC_state, regnode **scanp, I32 *minlenp, SSize_t *deltap, regnode *last, struct scan_data_t *data, I32 stopparen, U8* recursed, struct regnode_charclass_class *and_withp, U32 flags, U32 depth)
 			__attribute__nonnull__(pTHX_1)
 			__attribute__nonnull__(pTHX_2)
 			__attribute__nonnull__(pTHX_3)
diff --git a/regcomp.c b/regcomp.c
index 432e5eb..cd7e8ae 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -345,25 +345,25 @@ typedef struct RExC_state_t {
 typedef struct scan_data_t {
     /*I32 len_min;      unused */
     /*I32 len_delta;    unused */
-    I32 pos_min;
+    SSize_t pos_min;
     I32 pos_delta;
     SV *last_found;
     I32 last_end;	    /* min value, <0 unless valid. */
-    I32 last_start_min;
+    SSize_t last_start_min;
     I32 last_start_max;
     SV **longest;	    /* Either &l_fixed, or &l_float. */
     SV *longest_fixed;      /* longest fixed string found in pattern */
-    I32 offset_fixed;       /* offset where it starts */
+    SSize_t offset_fixed;   /* offset where it starts */
     I32 *minlen_fixed;      /* pointer to the minlen relevant to the string */
     I32 lookbehind_fixed;   /* is the position of the string modfied by LB */
     SV *longest_float;      /* longest floating string found in pattern */
-    I32 offset_float_min;   /* earliest point in string it can appear */
+    SSize_t offset_float_min; /* earliest point in string it can appear */
     I32 offset_float_max;   /* latest point in string it can appear */
     I32 *minlen_float;      /* pointer to the minlen relevant to the string */
-    I32 lookbehind_float;   /* is the position of the string modified by LB */
+    SSize_t lookbehind_float; /* is the pos of the string modified by LB */
     I32 flags;
     I32 whilem_c;
-    I32 *last_closep;
+    SSize_t *last_closep;
     struct regnode_charclass_class *start_class;
 } scan_data_t;
 
@@ -3028,9 +3028,9 @@ typedef struct scan_frame {
 
 #define SCAN_COMMIT(s, data, m) scan_commit(s, data, m, is_inf)
 
-STATIC I32
+STATIC SSize_t
 S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
-                        I32 *minlenp, I32 *deltap,
+                        I32 *minlenp, SSize_t *deltap,
 			regnode *last,
 			scan_data_t *data,
 			I32 stopparen,
@@ -3124,7 +3124,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 	        /* NOTE - There is similar code to this block below for handling
 	           TRIE nodes on a re-study.  If you change stuff here check there
 	           too. */
-		I32 max1 = 0, min1 = I32_MAX, num = 0;
+		SSize_t max1 = 0, min1 = SSize_t_MAX, num = 0;
 		struct regnode_charclass_class accum;
 		regnode * const startbranch=scan;
 
@@ -3134,7 +3134,8 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 		    cl_init_zero(pRExC_state, &accum);
 
 		while (OP(scan) == code) {
-		    I32 deltanext, minnext, f = 0, fake;
+		    SSize_t deltanext, minnext, fake;
+		    I32 f = 0;
 		    struct regnode_charclass_class this_class;
 
 		    num++;
@@ -3678,7 +3679,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 	    flags &= ~SCF_DO_STCLASS;
 	}
 	else if (PL_regkind[OP(scan)] == EXACT) { /* But OP != EXACT! */
-	    I32 l = STR_LEN(scan);
+	    SSize_t l = STR_LEN(scan);
 	    UV uc = *((U8*)STRING(scan));
 
 	    /* Search for fixed substrings supports EXACT only. */
@@ -3795,8 +3796,8 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 	    flags &= ~SCF_DO_STCLASS;
 	}
 	else if (REGNODE_VARIES(OP(scan))) {
-	    I32 mincount, maxcount, minnext, deltanext, fl = 0;
-	    I32 f = flags, pos_before = 0;
+	    SSize_t mincount, maxcount, minnext, deltanext;
+	    I32 fl = 0, f = flags, pos_before = 0;
 	    regnode * const oscan = scan;
 	    struct regnode_charclass_class this_class;
 	    struct regnode_charclass_class *oclass = NULL;
@@ -4373,7 +4374,7 @@ PerlIO_printf(Perl_debug_log, "LHS=%d RHS=%d\n", -counted * deltanext + (minnext
                    In this case we can't do fixed string optimisation.
                 */
 
-                I32 deltanext, minnext, fake = 0;
+                SSize_t deltanext, minnext, fake = 0;
                 regnode *nscan;
                 struct regnode_charclass_class intrnl;
                 int f = 0;
@@ -4615,7 +4616,7 @@ PerlIO_printf(Perl_debug_log, "LHS=%d RHS=%d\n", -counted * deltanext + (minnext
                 
                 for ( word=1 ; word <= trie->wordcount ; word++) 
                 {
-                    I32 deltanext=0, minnext=0, f = 0, fake;
+                    SSize_t deltanext=0, minnext=0, f = 0, fake;
                     struct regnode_charclass_class this_class;
                     
                     data_fake.flags = 0;
@@ -6075,11 +6076,11 @@ reStudy:
     /* testing for BRANCH here tells us whether there is "must appear"
        data in the pattern. If there is then we can use it for optimisations */
     if (!(RExC_seen & REG_TOP_LEVEL_BRANCHES)) { /*  Only one top-level choice. */
-	I32 fake;
+	SSize_t fake;
 	STRLEN longest_float_length, longest_fixed_length;
 	struct regnode_charclass_class ch_class; /* pointed to by data */
 	int stclass_flag;
-	I32 last_close = 0; /* pointed to by data */
+	SSize_t last_close = 0; /* pointed to by data */
         regnode *first= scan;
         regnode *first_next= regnext(first);
 	/*
@@ -6361,9 +6362,9 @@ reStudy:
     }
     else {
 	/* Several toplevels. Best we can is to set minlen. */
-	I32 fake;
+	SSize_t fake;
 	struct regnode_charclass_class ch_class;
-	I32 last_close = 0;
+	SSize_t last_close = 0;
 
 	DEBUG_PARSE_r(PerlIO_printf(Perl_debug_log, "\nMulti Top Level\n"));
 
diff --git a/regexec.c b/regexec.c
index fb8e97a..4a25f5c 100644
--- a/regexec.c
+++ b/regexec.c
@@ -615,7 +615,7 @@ Perl_re_intuit_start(pTHX_
 {
     dVAR;
     struct regexp *const prog = ReANY(rx);
-    I32 start_shift = 0;
+    SSize_t start_shift = 0;
     /* Should be nonnegative! */
     I32 end_shift   = 0;
     char *s;
@@ -751,7 +751,7 @@ Perl_re_intuit_start(pTHX_
        the "check" substring in the region corrected by start/end_shift. */
     
     {
-        I32 srch_start_shift = start_shift;
+        SSize_t srch_start_shift = start_shift;
         I32 srch_end_shift = end_shift;
         U8* start_point;
         U8* end_point;
diff --git a/regexp.h b/regexp.h
index ebbbde1..8542cb1 100644
--- a/regexp.h
+++ b/regexp.h
@@ -36,8 +36,8 @@ struct regexp_engine;
 struct regexp;
 
 struct reg_substr_datum {
-    I32 min_offset;
-    I32 max_offset;
+    SSize_t min_offset;
+    SSize_t max_offset;
     SV *substr;		/* non-utf8 variant */
     SV *utf8_substr;	/* utf8 variant */
     I32 end_shift;
diff --git a/t/bigmem/regexp.t b/t/bigmem/regexp.t
index 9404f2c..d5496f1 100644
--- a/t/bigmem/regexp.t
+++ b/t/bigmem/regexp.t
@@ -12,7 +12,7 @@ $ENV{PERL_TEST_MEMORY} >= 2
 $Config{ptrsize} >= 8
     or skip_all("Need 64-bit pointers for this test");
 
-plan(5);
+plan(6);
 
 # [perl #116907]
 # ${\2} to defeat constant folding, which in this case actually slows
@@ -33,3 +33,7 @@ $x =~ /./g;
 is "$'", 'efg', q "$' after match against long string";
 is "$-[0],$+[0]", '2147483651,2147483652',
    '@- and @+ after matches past 2**31';
+
+# Substring optimisations
+is $x =~ /(?:(?:.{32766}){32766}){2}(?:.{32766}){8}.{8}ef/, 1,
+  'anchored substr past 2**31';
-- 
1.7.12.4 (Apple Git-37)

From ae3358ed545fdb5d12870c5307142bba363ae590 Mon Sep 17 00​:00​:00 2001
From​: Father Chrysostomos <sprout@​cpan.org>
Date​: Sun, 18 Aug 2013 14​:03​:06 -0700
Subject​: [PATCH 2/2] Use SSize_t/STRLEN in more places in regexp code
MIME-Version​: 1.0
Content-Type​: text/plain; charset=UTF-8
Content-Transfer-Encoding​: 8bit

As part of getting the regexp engine to handle long strings, this com-
mit changes any variables, parameters and struct members that hold
lengths of the string being matched against (or parts thereof) to use
SSize_t or STRLEN instead of [IU]32.

To avoid having to change any logic, I kept the signedness the same.

I did not change anything that affects the length of the regular
expression itself, so regexps are still practically limited to
I32_MAX. Changing that would involve changing the size of regnodes,
which would be a lot more involved.

These changes should fix bugs, but are very hard to test. In most
cases, I don’t know the regexp engine well enough to come up with test
cases that test the paths in question with long strings. In other
cases I don’t have a box with enough memory to test the fix.


embed.fnc | 21 ++++----
pod/perlreapi.pod | 4 +-
proto.h | 14 +++---
regcomp.c | 140 +++++++++++++++++++++++++++++-------------------------
regexec.c | 49 +++++++++----------
regexp.h | 20 ++++----
6 files changed, 130 insertions(+), 118 deletions(-)

Inline Patch
diff --git a/embed.fnc b/embed.fnc
index edf0a2f..acd0b1c 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -1095,8 +1095,8 @@ EMsR	|SV*	|_new_invlist_C_array|NN const UV* const list
 : Not used currently: EXMs	|bool	|_invlistEQ	|NN SV* const a|NN SV* const b|const bool complement_b
 #endif
 Ap	|I32	|pregexec	|NN REGEXP * const prog|NN char* stringarg \
-				|NN char* strend|NN char* strbeg|I32 minend \
-				|NN SV* screamer|U32 nosave
+				|NN char* strend|NN char* strbeg \
+				|SSize_t minend |NN SV* screamer|U32 nosave
 Ap	|void	|pregfree	|NULLOK REGEXP* r
 Ap	|void	|pregfree2	|NN REGEXP *rx
 : FIXME - is anything in re using this now?
@@ -1127,8 +1127,9 @@ EiPR	|I32	|regcurly	|NN const char *s                   \
 				|const bool rbrace_must_be_escaped
 #endif
 Ap	|I32	|regexec_flags	|NN REGEXP *const rx|NN char *stringarg \
-				|NN char *strend|NN char *strbeg|I32 minend \
-				|NN SV *sv|NULLOK void *data|U32 flags
+				|NN char *strend|NN char *strbeg \
+				|SSize_t minend|NN SV *sv \
+				|NULLOK void *data|U32 flags
 ApR	|regnode*|regnext	|NULLOK regnode* p
 EXp |SV*|reg_named_buff          |NN REGEXP * const rx|NULLOK SV * const key \
                                  |NULLOK SV * const value|const U32 flags
@@ -2030,8 +2031,8 @@ Ei	|U8   |compute_EXACTish|NN struct RExC_state_t *pRExC_state
 Es	|char *	|nextchar	|NN struct RExC_state_t *pRExC_state
 Es	|bool	|reg_skipcomment|NN struct RExC_state_t *pRExC_state
 Es	|void	|scan_commit	|NN const struct RExC_state_t *pRExC_state \
-				|NN struct scan_data_t *data|NN I32 *minlenp \
-				|int is_inf
+				|NN struct scan_data_t *data \
+				|NN SSize_t *minlenp|int is_inf
 Esn	|void	|cl_anything	|NN const struct RExC_state_t *pRExC_state \
 				|NN struct regnode_charclass_class *cl
 EsRn	|int	|cl_is_anything	|NN const struct regnode_charclass_class *cl
@@ -2043,7 +2044,7 @@ Esn	|void	|cl_or		|NN const struct RExC_state_t *pRExC_state \
 				|NN struct regnode_charclass_class *cl \
 				|NN const struct regnode_charclass_class *or_with
 Es	|SSize_t|study_chunk	|NN struct RExC_state_t *pRExC_state \
-				|NN regnode **scanp|NN I32 *minlenp \
+				|NN regnode **scanp|NN SSize_t *minlenp \
 				|NN SSize_t *deltap|NN regnode *last \
 				|NULLOK struct scan_data_t *data \
 				|I32 stopparen|NULLOK U8* recursed \
@@ -2101,15 +2102,15 @@ Es	|CHECKPOINT|regcppush	|NN const regexp *rex|I32 parenfloor\
 				|U32 maxopenparen
 Es	|void	|regcppop	|NN regexp *rex\
 				|NN U32 *maxopenparen_p
-ERsn	|U8*	|reghop3	|NN U8 *s|I32 off|NN const U8 *lim
+ERsn	|U8*	|reghop3	|NN U8 *s|SSize_t off|NN const U8 *lim
 ERsM	|SV*	|core_regclass_swash|NULLOK const regexp *prog \
 				|NN const struct regnode *node|bool doinit \
 				|NULLOK SV **listsvp
 #ifdef XXX_dmq
-ERsn	|U8*	|reghop4	|NN U8 *s|I32 off|NN const U8 *llim \
+ERsn	|U8*	|reghop4	|NN U8 *s|SSize_t off|NN const U8 *llim \
 				|NN const U8 *rlim
 #endif
-ERsn	|U8*	|reghopmaybe3	|NN U8 *s|I32 off|NN const U8 *lim
+ERsn	|U8*	|reghopmaybe3	|NN U8 *s|SSize_t off|NN const U8 *lim
 ERs	|char*	|find_byclass	|NN regexp * prog|NN const regnode *c \
 				|NN char *s|NN const char *strend \
 				|NULLOK regmatch_info *reginfo
diff --git a/pod/perlreapi.pod b/pod/perlreapi.pod
index 659088e..cfc41d7 100644
--- a/pod/perlreapi.pod
+++ b/pod/perlreapi.pod
@@ -17,7 +17,7 @@ following format:
                          REGEXP * const rx,
                          char* stringarg,
                          char* strend, char* strbeg,
-                         I32 minend, SV* sv,
+                         SSize_t minend, SV* sv,
                          void* data, U32 flags);
         char*   (*intuit) (pTHX_
                            REGEXP * const rx, SV *sv,
@@ -238,7 +238,7 @@ certain optimisations when this is set.
 
     I32 exec(pTHX_ REGEXP * const rx,
              char *stringarg, char* strend, char* strbeg,
-             I32 minend, SV* sv,
+             SSize_t minend, SV* sv,
              void* data, U32 flags);
 
 Execute a regexp. The arguments are
diff --git a/proto.h b/proto.h
index c50d60f..fc007c9 100644
--- a/proto.h
+++ b/proto.h
@@ -3187,7 +3187,7 @@ PERL_CALLCONV REGEXP*	Perl_pregcomp(pTHX_ SV * const pattern, const U32 flags)
 #define PERL_ARGS_ASSERT_PREGCOMP	\
 	assert(pattern)
 
-PERL_CALLCONV I32	Perl_pregexec(pTHX_ REGEXP * const prog, char* stringarg, char* strend, char* strbeg, I32 minend, SV* screamer, U32 nosave)
+PERL_CALLCONV I32	Perl_pregexec(pTHX_ REGEXP * const prog, char* stringarg, char* strend, char* strbeg, SSize_t minend, SV* screamer, U32 nosave)
 			__attribute__nonnull__(pTHX_1)
 			__attribute__nonnull__(pTHX_2)
 			__attribute__nonnull__(pTHX_3)
@@ -3409,7 +3409,7 @@ PERL_CALLCONV void	Perl_regdump(pTHX_ const regexp* r)
 #define PERL_ARGS_ASSERT_REGDUMP	\
 	assert(r)
 
-PERL_CALLCONV I32	Perl_regexec_flags(pTHX_ REGEXP *const rx, char *stringarg, char *strend, char *strbeg, I32 minend, SV *sv, void *data, U32 flags)
+PERL_CALLCONV I32	Perl_regexec_flags(pTHX_ REGEXP *const rx, char *stringarg, char *strend, char *strbeg, SSize_t minend, SV *sv, void *data, U32 flags)
 			__attribute__nonnull__(pTHX_1)
 			__attribute__nonnull__(pTHX_2)
 			__attribute__nonnull__(pTHX_3)
@@ -6763,14 +6763,14 @@ STATIC char *	S_regwhite(struct RExC_state_t *pRExC_state, char *p)
 #define PERL_ARGS_ASSERT_REGWHITE	\
 	assert(pRExC_state); assert(p)
 
-STATIC void	S_scan_commit(pTHX_ const struct RExC_state_t *pRExC_state, struct scan_data_t *data, I32 *minlenp, int is_inf)
+STATIC void	S_scan_commit(pTHX_ const struct RExC_state_t *pRExC_state, struct scan_data_t *data, SSize_t *minlenp, int is_inf)
 			__attribute__nonnull__(pTHX_1)
 			__attribute__nonnull__(pTHX_2)
 			__attribute__nonnull__(pTHX_3);
 #define PERL_ARGS_ASSERT_SCAN_COMMIT	\
 	assert(pRExC_state); assert(data); assert(minlenp)
 
-STATIC SSize_t	S_study_chunk(pTHX_ struct RExC_state_t *pRExC_state, regnode **scanp, I32 *minlenp, SSize_t *deltap, regnode *last, struct scan_data_t *data, I32 stopparen, U8* recursed, struct regnode_charclass_class *and_withp, U32 flags, U32 depth)
+STATIC SSize_t	S_study_chunk(pTHX_ struct RExC_state_t *pRExC_state, regnode **scanp, SSize_t *minlenp, SSize_t *deltap, regnode *last, struct scan_data_t *data, I32 stopparen, U8* recursed, struct regnode_charclass_class *and_withp, U32 flags, U32 depth)
 			__attribute__nonnull__(pTHX_1)
 			__attribute__nonnull__(pTHX_2)
 			__attribute__nonnull__(pTHX_3)
@@ -6976,14 +6976,14 @@ STATIC CHECKPOINT	S_regcppush(pTHX_ const regexp *rex, I32 parenfloor, U32 maxop
 #define PERL_ARGS_ASSERT_REGCPPUSH	\
 	assert(rex)
 
-STATIC U8*	S_reghop3(U8 *s, I32 off, const U8 *lim)
+STATIC U8*	S_reghop3(U8 *s, SSize_t off, const U8 *lim)
 			__attribute__warn_unused_result__
 			__attribute__nonnull__(1)
 			__attribute__nonnull__(3);
 #define PERL_ARGS_ASSERT_REGHOP3	\
 	assert(s); assert(lim)
 
-STATIC U8*	S_reghopmaybe3(U8 *s, I32 off, const U8 *lim)
+STATIC U8*	S_reghopmaybe3(U8 *s, SSize_t off, const U8 *lim)
 			__attribute__warn_unused_result__
 			__attribute__nonnull__(1)
 			__attribute__nonnull__(3);
@@ -7032,7 +7032,7 @@ STATIC void	S_to_utf8_substr(pTHX_ regexp * prog)
 	assert(prog)
 
 #  if defined(XXX_dmq)
-STATIC U8*	S_reghop4(U8 *s, I32 off, const U8 *llim, const U8 *rlim)
+STATIC U8*	S_reghop4(U8 *s, SSize_t off, const U8 *llim, const U8 *rlim)
 			__attribute__warn_unused_result__
 			__attribute__nonnull__(1)
 			__attribute__nonnull__(3)
diff --git a/regcomp.c b/regcomp.c
index cd7e8ae..b2d6555 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -123,7 +123,7 @@ typedef struct RExC_state_t {
     char	*start;			/* Start of input for compile */
     char	*end;			/* End of input for compile */
     char	*parse;			/* Input-scan pointer. */
-    I32		whilem_seen;		/* number of WHILEM in this expr */
+    SSize_t	whilem_seen;		/* number of WHILEM in this expr */
     regnode	*emit_start;		/* Start of emitted-code area */
     regnode	*emit_bound;		/* First regnode outside of the allocated space */
     regnode	*emit;			/* Code-emit pointer; if = &emit_dummy,
@@ -132,10 +132,10 @@ typedef struct RExC_state_t {
     I32		naughty;		/* How bad is this pattern? */
     I32		sawback;		/* Did we see \1, ...? */
     U32		seen;
-    I32		size;			/* Code size. */
-    I32		npar;			/* Capture buffer count, (OPEN). */
-    I32		cpar;			/* Capture buffer count, (CLOSE). */
-    I32		nestroot;		/* root parens we are in - used by accept */
+    SSize_t	size;			/* Code size. */
+    I32	npar;				/* Capture buffer count, (OPEN). */
+    I32	cpar;				/* Capture buffer count, (CLOSE). */
+    I32	nestroot;			/* root parens we are in - used by accept */
     I32		extralen;
     I32		seen_zerolen;
     regnode	**open_parens;		/* pointers to open parens */
@@ -158,8 +158,8 @@ typedef struct RExC_state_t {
     I32		in_multi_char_class;
     struct reg_code_block *code_blocks;	/* positions of literal (?{})
 					    within pattern */
-    int		num_code_blocks;	/* size of code_blocks[] */
-    int		code_index;		/* next code_blocks[] slot */
+    I32		num_code_blocks;	/* size of code_blocks[] */
+    I32		code_index;		/* next code_blocks[] slot */
 #if ADD_TO_REGEXEC
     char 	*starttry;		/* -Dr: where regtry was called. */
 #define RExC_starttry	(pRExC_state->starttry)
@@ -301,7 +301,7 @@ typedef struct RExC_state_t {
   
   - max_offset
     Only used for floating strings. This is the rightmost point that
-    the string can appear at. If set to I32 max it indicates that the
+    the string can appear at. If set to SSize_t_MAX it indicates that the
     string can occur infinitely far to the right.
   
   - minlenp
@@ -346,20 +346,20 @@ typedef struct scan_data_t {
     /*I32 len_min;      unused */
     /*I32 len_delta;    unused */
     SSize_t pos_min;
-    I32 pos_delta;
+    SSize_t pos_delta;
     SV *last_found;
-    I32 last_end;	    /* min value, <0 unless valid. */
+    SSize_t last_end;	    /* min value, <0 unless valid. */
     SSize_t last_start_min;
-    I32 last_start_max;
+    SSize_t last_start_max;
     SV **longest;	    /* Either &l_fixed, or &l_float. */
     SV *longest_fixed;      /* longest fixed string found in pattern */
     SSize_t offset_fixed;   /* offset where it starts */
-    I32 *minlen_fixed;      /* pointer to the minlen relevant to the string */
+    SSize_t *minlen_fixed;  /* pointer to the minlen relevant to the string */
     I32 lookbehind_fixed;   /* is the position of the string modfied by LB */
     SV *longest_float;      /* longest floating string found in pattern */
     SSize_t offset_float_min; /* earliest point in string it can appear */
-    I32 offset_float_max;   /* latest point in string it can appear */
-    I32 *minlen_float;      /* pointer to the minlen relevant to the string */
+    SSize_t offset_float_max; /* latest point in string it can appear */
+    SSize_t *minlen_float;  /* pointer to the minlen relevant to the string */
     SSize_t lookbehind_float; /* is the pos of the string modified by LB */
     I32 flags;
     I32 whilem_c;
@@ -748,7 +748,8 @@ DEBUG_OPTIMISE_MORE_r(if(data){                                      \
    floating substrings if needed. */
 
 STATIC void
-S_scan_commit(pTHX_ const RExC_state_t *pRExC_state, scan_data_t *data, I32 *minlenp, int is_inf)
+S_scan_commit(pTHX_ const RExC_state_t *pRExC_state, scan_data_t *data,
+                    SSize_t *minlenp, int is_inf)
 {
     const STRLEN l = CHR_SVLEN(data->last_found);
     const STRLEN old_l = CHR_SVLEN(*data->longest);
@@ -772,9 +773,12 @@ S_scan_commit(pTHX_ const RExC_state_t *pRExC_state, scan_data_t *data, I32 *min
 	    data->offset_float_min = l ? data->last_start_min : data->pos_min;
 	    data->offset_float_max = (l
 				      ? data->last_start_max
-				      : (data->pos_delta == I32_MAX ? I32_MAX : data->pos_min + data->pos_delta));
-	    if (is_inf || (U32)data->offset_float_max > (U32)I32_MAX)
-		data->offset_float_max = I32_MAX;
+				      : (data->pos_delta == SSize_t_MAX
+					 ? SSize_t_MAX
+					 : data->pos_min + data->pos_delta));
+	    if (is_inf
+		 || (size_t)data->offset_float_max > (size_t)SSize_t_MAX)
+		data->offset_float_max = SSize_t_MAX;
 	    if (data->flags & SF_BEFORE_EOL)
 		data->flags
 		    |= ((data->flags & SF_BEFORE_EOL) << SF_FL_SHIFT_EOL);
@@ -3030,7 +3034,7 @@ typedef struct scan_frame {
 
 STATIC SSize_t
 S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
-                        I32 *minlenp, SSize_t *deltap,
+                        SSize_t *minlenp, SSize_t *deltap,
 			regnode *last,
 			scan_data_t *data,
 			I32 stopparen,
@@ -3046,17 +3050,18 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 			/* and_withp: Valid if flags & SCF_DO_STCLASS_OR */
 {
     dVAR;
-    I32 min = 0;    /* There must be at least this number of characters to match */
+    /* There must be at least this number of characters to match */
+    SSize_t min = 0;
     I32 pars = 0, code;
     regnode *scan = *scanp, *next;
-    I32 delta = 0;
+    SSize_t delta = 0;
     int is_inf = (flags & SCF_DO_SUBSTR) && (data->flags & SF_IS_INF);
     int is_inf_internal = 0;		/* The studied chunk is infinite */
     I32 is_par = OP(scan) == OPEN ? ARG(scan) : 0;
     scan_data_t data_fake;
     SV *re_trie_maxbuff = NULL;
     regnode *first_non_open = scan;
-    I32 stopmin = I32_MAX;
+    SSize_t stopmin = SSize_t_MAX;
     scan_frame *frame = NULL;
     GET_RE_DEBUG_FLAGS_DECL;
 
@@ -3166,9 +3171,9 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 					  stopparen, recursed, NULL, f,depth+1);
 		    if (min1 > minnext)
 			min1 = minnext;
-		    if (deltanext == I32_MAX) {
+		    if (deltanext == SSize_t_MAX) {
 			is_inf = is_inf_internal = 1;
-			max1 = I32_MAX;
+			max1 = SSize_t_MAX;
 		    } else if (max1 < minnext + deltanext)
 			max1 = minnext + deltanext;
 		    scan = next;
@@ -3193,16 +3198,17 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 		    min1 = 0;
 		if (flags & SCF_DO_SUBSTR) {
 		    data->pos_min += min1;
-		    if (data->pos_delta >= I32_MAX - (max1 - min1))
-		        data->pos_delta = I32_MAX;
+		    if (data->pos_delta >= SSize_t_MAX - (max1 - min1))
+		        data->pos_delta = SSize_t_MAX;
 		    else
 		        data->pos_delta += max1 - min1;
 		    if (max1 != min1 || is_inf)
 			data->longest = &(data->longest_float);
 		}
 		min += min1;
-		if (delta == I32_MAX || I32_MAX - delta - (max1 - min1) < 0)
-		    delta = I32_MAX;
+		if (delta == SSize_t_MAX
+		 || SSize_t_MAX - delta - (max1 - min1) < 0)
+		    delta = SSize_t_MAX;
 		else
 		    delta += max1 - min1;
 		if (flags & SCF_DO_STCLASS_OR) {
@@ -3590,7 +3596,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 	    }
 	}
 	else if (OP(scan) == EXACT) {
-	    I32 l = STR_LEN(scan);
+	    SSize_t l = STR_LEN(scan);
 	    UV uc;
 	    if (UTF) {
 		const U8 * const s = (U8*)STRING(scan);
@@ -3606,7 +3612,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 		if (data->last_end == -1) { /* Update the start info. */
 		    data->last_start_min = data->pos_min;
  		    data->last_start_max = is_inf
- 			? I32_MAX : data->pos_min + data->pos_delta;
+ 			? SSize_t_MAX : data->pos_min + data->pos_delta;
 		}
 		sv_catpvn(data->last_found, STRING(scan), STR_LEN(scan));
 		if (UTF)
@@ -3796,8 +3802,8 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 	    flags &= ~SCF_DO_STCLASS;
 	}
 	else if (REGNODE_VARIES(OP(scan))) {
-	    SSize_t mincount, maxcount, minnext, deltanext;
-	    I32 fl = 0, f = flags, pos_before = 0;
+	    SSize_t mincount, maxcount, minnext, deltanext, pos_before = 0;
+	    I32 fl = 0, f = flags;
 	    regnode * const oscan = scan;
 	    struct regnode_charclass_class this_class;
 	    struct regnode_charclass_class *oclass = NULL;
@@ -3934,11 +3940,11 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 		}
 
 		min += minnext * mincount;
-		is_inf_internal |= deltanext == I32_MAX
+		is_inf_internal |= deltanext == SSize_t_MAX
 				     || (maxcount == REG_INFTY && minnext + deltanext > 0);
 		is_inf |= is_inf_internal;
 		if (is_inf)
-		    delta = I32_MAX;
+		    delta = SSize_t_MAX;
 		else
 		    delta += (minnext + deltanext) * maxcount - minnext * mincount;
 
@@ -4068,10 +4074,10 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 
 		    if (data->last_end > 0 && mincount != 0) { /* Ends with a string. */
 #if defined(SPARC64_GCC_WORKAROUND)
-			I32 b = 0;
+			SSize_t b = 0;
 			STRLEN l = 0;
 			const char *s = NULL;
-			I32 old = 0;
+			SSize_t old = 0;
 
 			if (pos_before >= data->last_start_min)
 			    b = pos_before;
@@ -4083,11 +4089,11 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 			old = b - data->last_start_min;
 
 #else
-			I32 b = pos_before >= data->last_start_min
+			SSize_t b = pos_before >= data->last_start_min
 			    ? pos_before : data->last_start_min;
 			STRLEN l;
 			const char * const s = SvPV_const(data->last_found, l);
-			I32 old = b - data->last_start_min;
+			SSize_t old = b - data->last_start_min;
 #endif
 
 			if (UTF)
@@ -4119,20 +4125,21 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 			} else {
 			    /* start offset must point into the last copy */
 			    data->last_start_min += minnext * (mincount - 1);
-			    data->last_start_max += is_inf ? I32_MAX
+			    data->last_start_max += is_inf ? SSize_t_MAX
 				: (maxcount - 1) * (minnext + data->pos_delta);
 			}
 		    }
 		    /* It is counted once already... */
 		    data->pos_min += minnext * (mincount - counted);
 #if 0
-PerlIO_printf(Perl_debug_log, "counted=%d deltanext=%d I32_MAX=%d minnext=%d maxcount=%d mincount=%d\n",
-    counted, deltanext, I32_MAX, minnext, maxcount, mincount);
+PerlIO_printf(Perl_debug_log, "counted=%zd deltanext=%zd SSize_t_MAX=%zd minnext=%zd maxcount=%zd mincount=%zd\n",
+    counted, deltanext, SSize_t_MAX, minnext, maxcount, mincount);
 if (deltanext != I32_MAX)
-PerlIO_printf(Perl_debug_log, "LHS=%d RHS=%d\n", -counted * deltanext + (minnext + deltanext) * maxcount - minnext * mincount, I32_MAX - data->pos_delta);
+PerlIO_printf(Perl_debug_log, "LHS=%zd RHS=%zd\n", -counted * deltanext + (minnext + deltanext) * maxcount - minnext * mincount, SSize_t_MAX - data->pos_delta);
 #endif
-		    if (deltanext == I32_MAX || -counted * deltanext + (minnext + deltanext) * maxcount - minnext * mincount >= I32_MAX - data->pos_delta)
-		        data->pos_delta = I32_MAX;
+		    if (deltanext == SSize_t_MAX ||
+			-counted * deltanext + (minnext + deltanext) * maxcount - minnext * mincount >= SSize_t_MAX - data->pos_delta)
+		        data->pos_delta = SSize_t_MAX;
 		    else
 		        data->pos_delta += - counted * deltanext +
 			(minnext + deltanext) * maxcount - minnext * mincount;
@@ -4152,7 +4159,7 @@ PerlIO_printf(Perl_debug_log, "LHS=%d RHS=%d\n", -counted * deltanext + (minnext
 			    data->last_start_min =
 				data->pos_min - CHR_SVLEN(last_str);
 			    data->last_start_max = is_inf
-				? I32_MAX
+				? SSize_t_MAX
 				: data->pos_min + data->pos_delta
 				- CHR_SVLEN(last_str);
 			}
@@ -4443,7 +4450,8 @@ PerlIO_printf(Perl_debug_log, "LHS=%d RHS=%d\n", -counted * deltanext + (minnext
                    length of the pattern, something we won't know about
                    until after the recurse.
                 */
-                I32 deltanext, fake = 0;
+                SSize_t deltanext;
+                I32 fake = 0;
                 regnode *nscan;
                 struct regnode_charclass_class intrnl;
                 int f = 0;
@@ -4453,8 +4461,8 @@ PerlIO_printf(Perl_debug_log, "LHS=%d RHS=%d\n", -counted * deltanext + (minnext
                     have to worry about freeing them when we know
                     they wont be used, which would be a pain.
                  */
-                I32 *minnextp;
-                Newx( minnextp, 1, I32 );
+                SSize_t *minnextp;
+                Newx( minnextp, 1, SSize_t );
                 SAVEFREEPV(minnextp);
 
                 if (data) {
@@ -4583,7 +4591,7 @@ PerlIO_printf(Perl_debug_log, "LHS=%d RHS=%d\n", -counted * deltanext + (minnext
 	    {
 	        if (!(RExC_rx->extflags & RXf_ANCH) && (flags & SCF_DO_SUBSTR))
 		    RExC_rx->extflags |= RXf_ANCH_GPOS;
-	        if (RExC_rx->gofs < (U32)min)
+	        if (RExC_rx->gofs < (STRLEN)min)
 		    RExC_rx->gofs = min;
             } else {
                 RExC_rx->extflags |= RXf_GPOS_FLOAT;
@@ -4599,7 +4607,7 @@ PerlIO_printf(Perl_debug_log, "LHS=%d RHS=%d\n", -counted * deltanext + (minnext
             regnode *trie_node= scan;
             regnode *tail= regnext(scan);
             reg_trie_data *trie = (reg_trie_data*)RExC_rxi->data->data[ ARG(scan) ];
-            I32 max1 = 0, min1 = I32_MAX;
+            SSize_t max1 = 0, min1 = SSize_t_MAX;
             struct regnode_charclass_class accum;
 
             if (flags & SCF_DO_SUBSTR) /* XXXX Add !SUSPEND? */
@@ -4650,12 +4658,12 @@ PerlIO_printf(Perl_debug_log, "LHS=%d RHS=%d\n", -counted * deltanext + (minnext
                     if (nextbranch && PL_regkind[OP(nextbranch)]==BRANCH)
                         nextbranch= regnext((regnode*)nextbranch);
                     
-                    if (min1 > (I32)(minnext + trie->minlen))
+                    if (min1 > (SSize_t)(minnext + trie->minlen))
                         min1 = minnext + trie->minlen;
-                    if (deltanext == I32_MAX) {
+                    if (deltanext == SSize_t_MAX) {
                         is_inf = is_inf_internal = 1;
-                        max1 = I32_MAX;
-                    } else if (max1 < (I32)(minnext + deltanext + trie->maxlen))
+                        max1 = SSize_t_MAX;
+                    } else if (max1 < (SSize_t)(minnext + deltanext + trie->maxlen))
                         max1 = minnext + deltanext + trie->maxlen;
                     
                     if (data_fake.flags & (SF_HAS_PAR|SF_IN_PAR))
@@ -4749,9 +4757,9 @@ PerlIO_printf(Perl_debug_log, "LHS=%d RHS=%d\n", -counted * deltanext + (minnext
     DEBUG_STUDYDATA("pre-fin:",data,depth);
 
     *scanp = scan;
-    *deltap = is_inf_internal ? I32_MAX : delta;
+    *deltap = is_inf_internal ? SSize_t_MAX : delta;
     if (flags & SCF_DO_SUBSTR && is_inf)
-	data->pos_delta = I32_MAX - data->pos_min;
+	data->pos_delta = SSize_t_MAX - data->pos_min;
     if (is_par > (I32)U8_MAX)
 	is_par = 0;
     if (is_par && pars==1 && data) {
@@ -5419,13 +5427,15 @@ S_compile_runtime_code(pTHX_ RExC_state_t * const pRExC_state,
 
 
 STATIC bool
-S_setup_longest(pTHX_ RExC_state_t *pRExC_state, SV* sv_longest, SV** rx_utf8, SV** rx_substr, I32* rx_end_shift, I32 lookbehind, I32 offset, I32 *minlen, STRLEN longest_length, bool eol, bool meol)
+S_setup_longest(pTHX_ RExC_state_t *pRExC_state, SV* sv_longest, SV** rx_utf8, SV** rx_substr, SSize_t* rx_end_shift,
+		      SSize_t lookbehind, SSize_t offset, SSize_t *minlen, STRLEN longest_length, bool eol, bool meol)
 {
     /* This is the common code for setting up the floating and fixed length
      * string data extracted from Perl_re_op_compile() below.  Returns a boolean
      * as to whether succeeded or not */
 
-    I32 t,ml;
+    I32 t;
+    SSize_t ml;
 
     if (! (longest_length
            || (eol /* Can't have SEOL and MULTI */
@@ -5450,7 +5460,7 @@ S_setup_longest(pTHX_ RExC_state_t *pRExC_state, SV* sv_longest, SV** rx_utf8, S
         follow this item. We calculate it ahead of time as once the
         lookbehind offset is added in we lose the ability to correctly
         calculate it.*/
-    ml = minlen ? *(minlen) : (I32)longest_length;
+    ml = minlen ? *(minlen) : (SSize_t)longest_length;
     *rx_end_shift = ml - offset
         - longest_length + (SvTAIL(sv_longest) != 0)
         + lookbehind;
@@ -5519,7 +5529,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
     char *exp;
     regnode *scan;
     I32 flags;
-    I32 minlen = 0;
+    SSize_t minlen = 0;
     U32 rx_flags;
     SV *pat;
     SV *code_blocksv = NULL;
@@ -6274,7 +6284,7 @@ reStudy:
         {
 	    r->float_min_offset = data.offset_float_min - data.lookbehind_float;
 	    r->float_max_offset = data.offset_float_max;
-	    if (data.offset_float_max < I32_MAX) /* Don't offset infinity */
+	    if (data.offset_float_max < SSize_t_MAX) /* Don't offset infinity */
 	        r->float_max_offset -= data.lookbehind_float;
 	    SvREFCNT_inc_simple_void_NN(data.longest_float);
 	}
@@ -6477,8 +6487,8 @@ reStudy:
     });
 #ifdef RE_TRACK_PATTERN_OFFSETS
     DEBUG_OFFSETS_r(if (ri->u.offsets) {
-        const U32 len = ri->u.offsets[0];
-        U32 i;
+        const STRLEN len = ri->u.offsets[0];
+        STRLEN i;
         GET_RE_DEBUG_FLAGS_DECL;
         PerlIO_printf(Perl_debug_log, "Offsets: [%"UVuf"]\n\t", (UV)ri->u.offsets[0]);
         for (i = 1; i <= len; i++) {
@@ -6666,7 +6676,7 @@ Perl_reg_named_buff_scalar(pTHX_ REGEXP * const r, const U32 flags)
 {
     SV *ret;
     AV *av;
-    I32 length;
+    SSize_t length;
     struct regexp *const rx = ReANY(r);
 
     PERL_ARGS_ASSERT_REG_NAMED_BUFF_SCALAR;
@@ -12266,7 +12276,7 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth,
     regnode * const orig_emit = RExC_emit; /* Save the original RExC_emit in
         case we need to change the emitted regop to an EXACT. */
     const char * orig_parse = RExC_parse;
-    const I32 orig_size = RExC_size;
+    const SSize_t orig_size = RExC_size;
     GET_RE_DEBUG_FLAGS_DECL;
 
     PERL_ARGS_ASSERT_REGCLASS;
diff --git a/regexec.c b/regexec.c
index 4a25f5c..63bee9a 100644
--- a/regexec.c
+++ b/regexec.c
@@ -369,7 +369,7 @@ S_regcppop(pTHX_ regexp *rex, U32 *maxopenparen_p)
     );
     paren = *maxopenparen_p;
     for ( ; i > 0; i -= REGCP_PAREN_ELEMS) {
-	I32 tmps;
+	SSize_t tmps;
 	rex->offs[paren].start_tmp = SSPOPINT;
 	rex->offs[paren].start = SSPOPIV;
 	tmps = SSPOPIV;
@@ -526,7 +526,7 @@ S_isFOO_utf8_lc(pTHX_ const U8 classnum, const U8* character)
  */
 I32
 Perl_pregexec(pTHX_ REGEXP * const prog, char* stringarg, char *strend,
-	 char *strbeg, I32 minend, SV *screamer, U32 nosave)
+	 char *strbeg, SSize_t minend, SV *screamer, U32 nosave)
 /* stringarg: the point in the string at which to begin matching */
 /* strend:    pointer to null at end of string */
 /* strbeg:    real beginning of string */
@@ -617,7 +617,7 @@ Perl_re_intuit_start(pTHX_
     struct regexp *const prog = ReANY(rx);
     SSize_t start_shift = 0;
     /* Should be nonnegative! */
-    I32 end_shift   = 0;
+    SSize_t end_shift   = 0;
     char *s;
     SV *check;
     char *t;
@@ -687,7 +687,7 @@ Perl_re_intuit_start(pTHX_
                                    See [perl #115242] */
           {
 	    /* Substring at constant offset from beg-of-str... */
-	    I32 slen;
+	    SSize_t slen;
 
 	    s = HOP3c(strpos, prog->check_offset_min, strend);
 	    
@@ -723,9 +723,9 @@ Perl_re_intuit_start(pTHX_
 	end_shift = prog->check_end_shift;
 	
 	if (!ml_anch) {
-	    const I32 end = prog->check_offset_max + CHR_SVLEN(check)
+	    const SSize_t end = prog->check_offset_max + CHR_SVLEN(check)
 					 - (SvTAIL(check) != 0);
-	    const I32 eshift = CHR_DIST((U8*)strend, (U8*)s) - end;
+	    const SSize_t eshift = CHR_DIST((U8*)strend, (U8*)s) - end;
 
 	    if (end_shift < eshift)
 		end_shift = eshift;
@@ -752,7 +752,7 @@ Perl_re_intuit_start(pTHX_
     
     {
         SSize_t srch_start_shift = start_shift;
-        I32 srch_end_shift = end_shift;
+        SSize_t srch_end_shift = end_shift;
         U8* start_point;
         U8* end_point;
         if (srch_start_shift < 0 && strbeg - s > srch_start_shift) {
@@ -1555,7 +1555,7 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s,
          * characters, and there are only 2 availabe, we know without
          * trying that it will fail; so don't start a match past the
          * required minimum number from the far end */
-        e = HOP3c(strend, -((I32)ln), s);
+        e = HOP3c(strend, -((SSize_t)ln), s);
 
         if (reginfo->intuit && e < s) {
             e = s;			/* Due to minlen logic of intuit() */
@@ -1601,7 +1601,7 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s,
          * only 2 are left, it's guaranteed to fail, so don't start a
          * match that would require us to go beyond the end of the string
          */
-        e = HOP3c(strend, -((I32)lnc), s);
+        e = HOP3c(strend, -((SSize_t)lnc), s);
 
         if (reginfo->intuit && e < s) {
             e = s;			/* Due to minlen logic of intuit() */
@@ -2099,7 +2099,7 @@ S_reg_set_capture_string(pTHX_ REGEXP * const rx,
         {
             SSize_t min = 0;
             SSize_t max = strend - strbeg;
-            I32 sublen;
+            SSize_t sublen;
 
             if (    (flags & REXEC_COPY_SKIP_POST)
                 && !(prog->extflags & RXf_PMf_KEEPCOPY) /* //p */
@@ -2179,7 +2179,8 @@ S_reg_set_capture_string(pTHX_ REGEXP * const rx,
              *   $x = "\x{100}" x 1E6; 1 while $x =~ /(.)/g;
              * from going quadratic */
             if (SvPOKp(sv) && SvPVX(sv) == strbeg)
-                sv_pos_b2u(sv, &(prog->subcoffset));
+                prog->subcoffset = sv_pos_b2u_flags(sv, prog->subcoffset,
+                                                SV_GMAGIC|SV_CONST_RETURN);
             else
                 prog->subcoffset = utf8_length((U8*)strbeg,
                                     (U8*)(strbeg+prog->suboffset));
@@ -2202,7 +2203,7 @@ S_reg_set_capture_string(pTHX_ REGEXP * const rx,
  */
 I32
 Perl_regexec_flags(pTHX_ REGEXP * const rx, char *stringarg, char *strend,
-	      char *strbeg, I32 minend, SV *sv, void *data, U32 flags)
+	      char *strbeg, SSize_t minend, SV *sv, void *data, U32 flags)
 /* stringarg: the point in the string at which to begin matching */
 /* strend:    pointer to null at end of string */
 /* strbeg:    real beginning of string */
@@ -2219,8 +2220,8 @@ Perl_regexec_flags(pTHX_ REGEXP * const rx, char *stringarg, char *strend,
     char *s;
     regnode *c;
     char *startpos;
-    I32 minlen;		/* must match at least this many chars */
-    I32 dontbother = 0;	/* how many characters not to try at end */
+    SSize_t minlen;		/* must match at least this many chars */
+    SSize_t dontbother = 0;	/* how many characters not to try at end */
     const bool utf8_target = cBOOL(DO_UTF8(sv));
     I32 multiline;
     RXi_GET_DECL(prog,progi);
@@ -2603,8 +2604,8 @@ Perl_regexec_flags(pTHX_ REGEXP * const rx, char *stringarg, char *strend,
 	      || ((prog->float_substr != NULL || prog->float_utf8 != NULL)
 		  && prog->float_max_offset < strend - s)) {
 	SV *must;
-	I32 back_max;
-	I32 back_min;
+	SSize_t back_max;
+	SSize_t back_min;
 	char *last;
 	char *last1;		/* Last position checked before */
 #ifdef DEBUGGING
@@ -2649,7 +2650,7 @@ Perl_regexec_flags(pTHX_ REGEXP * const rx, char *stringarg, char *strend,
 	    last = strend;
 	} else {
             last = HOP3c(strend,	/* Cannot start after this */
-        	  -(I32)(CHR_SVLEN(must)
+        	  -(SSize_t)(CHR_SVLEN(must)
         		 - (SvTAIL(must) != 0) + back_min), strbeg);
         }
 	if (s > reginfo->strbeg)
@@ -3601,7 +3602,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
     regnode *scan;
     regnode *next;
     U32 n = 0;	/* general value; init to avoid compiler warning */
-    I32 ln = 0; /* len or last;  init to avoid compiler warning */
+    SSize_t ln = 0; /* len or last;  init to avoid compiler warning */
     char *locinput = startpos;
     char *pushinput; /* where to continue after a PUSH */
     I32 nextchr;   /* is always set to UCHARAT(locinput) */
@@ -5558,10 +5559,10 @@ NULL
 
 		if (reginfo->poscache_iter-- == 0) {
 		    /* initialise cache */
-		    const I32 size = (reginfo->poscache_maxiter + 7)/8;
+		    const SSize_t size = (reginfo->poscache_maxiter + 7)/8;
                     regmatch_info_aux *const aux = reginfo->info_aux;
 		    if (aux->poscache) {
-			if ((I32)reginfo->poscache_size < size) {
+			if ((SSize_t)reginfo->poscache_size < size) {
 			    Renew(aux->poscache, size, char);
 			    reginfo->poscache_size = size;
 			}
@@ -5579,7 +5580,7 @@ NULL
 
 		if (reginfo->poscache_iter < 0) {
 		    /* have we already failed at this position? */
-		    I32 offset, mask;
+		    SSize_t offset, mask;
 
                     reginfo->poscache_iter = -1; /* stop eventual underflow */
 		    offset  = (scan->flags & 0xf) - 1
@@ -7519,7 +7520,7 @@ S_reginclass(pTHX_ regexp * const prog, const regnode * const n, const U8* const
 }
 
 STATIC U8 *
-S_reghop3(U8 *s, I32 off, const U8* lim)
+S_reghop3(U8 *s, SSize_t off, const U8* lim)
 {
     /* return the position 'off' UTF-8 characters away from 's', forward if
      * 'off' >= 0, backwards if negative.  But don't go outside of position
@@ -7554,7 +7555,7 @@ S_reghop3(U8 *s, I32 off, const U8* lim)
    we ifdef it out - dmq
 */
 STATIC U8 *
-S_reghop4(U8 *s, I32 off, const U8* llim, const U8* rlim)
+S_reghop4(U8 *s, SSize_t off, const U8* llim, const U8* rlim)
 {
     dVAR;
 
@@ -7581,7 +7582,7 @@ S_reghop4(U8 *s, I32 off, const U8* llim, const U8* rlim)
 #endif
 
 STATIC U8 *
-S_reghopmaybe3(U8* s, I32 off, const U8* lim)
+S_reghopmaybe3(U8* s, SSize_t off, const U8* lim)
 {
     dVAR;
 
diff --git a/regexp.h b/regexp.h
index 8542cb1..928a374 100644
--- a/regexp.h
+++ b/regexp.h
@@ -40,7 +40,7 @@ struct reg_substr_datum {
     SSize_t max_offset;
     SV *substr;		/* non-utf8 variant */
     SV *utf8_substr;	/* utf8 variant */
-    I32 end_shift;
+    SSize_t end_shift;
 };
 struct reg_substr_data {
     struct reg_substr_datum data[3];	/* Actual array */
@@ -63,7 +63,7 @@ typedef struct regexp_paren_pair {
      *	  "abc" =~ /(.(?{print "[$1]"}))+/
      *outputs [][a][b]
      * This field is not part of the API.  */
-    I32 start_tmp;
+    SSize_t start_tmp;
 } regexp_paren_pair;
 
 #if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_UTF8_C)
@@ -104,9 +104,9 @@ struct reg_code_block {
 	/* Information about the match that the perl core uses to */	\
 	/* manage things */						\
 	U32 extflags;	/* Flags used both externally and internally */	\
-	I32 minlen;	/* mininum possible number of chars in string to match */\
-	I32 minlenret;	/* mininum possible number of chars in $& */		\
-	U32 gofs;	/* chars left of pos that we search from */	\
+	SSize_t minlen;	/* mininum possible number of chars in string to match */\
+	SSize_t minlenret; /* mininum possible number of chars in $& */		\
+	STRLEN gofs;	/* chars left of pos that we search from */	\
 	/* substring data about strings that must appear in the */	\
 	/* final match, used for optimisations */			\
 	struct reg_substr_data *substrs;				\
@@ -125,8 +125,8 @@ struct reg_code_block {
 	char *subbeg;							\
 	SV_SAVED_COPY	/* If non-NULL, SV which is COW from original */\
 	SSize_t sublen;	/* Length of string pointed by subbeg */	\
-	I32 suboffset;	/* byte offset of subbeg from logical start of str */ \
-	I32 subcoffset;	/* suboffset equiv, but in chars (for @-/@+) */ \
+	SSize_t suboffset; /* byte offset of subbeg from logical start of str */ \
+	SSize_t subcoffset; /* suboffset equiv, but in chars (for @-/@+) */ \
 	/* Information about the match that isn't often used */		\
 	/* offset from wrapped to the start of precomp */		\
 	PERL_BITFIELD32 pre_prefix:4;					\
@@ -146,7 +146,7 @@ typedef struct regexp {
 typedef struct re_scream_pos_data_s
 {
     char **scream_olds;		/* match pos */
-    I32 *scream_pos;		/* Internal iterator of scream. */
+    SSize_t *scream_pos;	/* Internal iterator of scream. */
 } re_scream_pos_data;
 
 /* regexp_engine structure. This is the dispatch table for regexes.
@@ -155,7 +155,7 @@ typedef struct re_scream_pos_data_s
 typedef struct regexp_engine {
     REGEXP* (*comp) (pTHX_ SV * const pattern, U32 flags);
     I32     (*exec) (pTHX_ REGEXP * const rx, char* stringarg, char* strend,
-                     char* strbeg, I32 minend, SV* sv,
+                     char* strbeg, SSize_t minend, SV* sv,
                      void* data, U32 flags);
     char*   (*intuit) (pTHX_
                         REGEXP * const rx,
@@ -606,7 +606,7 @@ typedef struct {
     STRLEN  suboffset;  /* saved suboffset  field from rex */
     STRLEN  subcoffset; /* saved subcoffset field from rex */
     MAGIC   *pos_magic; /* pos() magic attached to $_ */
-    I32     pos;        /* the original value of pos() in pos_magic */
+    SSize_t pos;        /* the original value of pos() in pos_magic */
     U8      pos_flags;  /* flags to be restored; currently only MGf_BYTES*/
 } regmatch_info_aux_eval;
 
-- 
1.7.12.4 (Apple Git-37)

@p5pRT
Copy link
Author

p5pRT commented Aug 18, 2013

From @cpansprout

On Sun Aug 18 16​:00​:47 2013, sprout wrote​:

I have fixed this bug and others on the sprout/regexp branch. I would
appreciate it if someone more familiar with the regexp engine could
review at least the last two commits.

which I forgot to mention are also attached to that message.

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Aug 19, 2013

From @iabyn

On Sun, Aug 18, 2013 at 04​:00​:48PM -0700, Father Chrysostomos via RT wrote​:

I have fixed this bug and others on the sprout/regexp branch. I would
appreciate it if someone more familiar with the regexp engine could
review at least the last two commits.

I can't really comment on the regcomp.c parts since I only really (for
some very loose definition of "really") understand the runtime part of the
regex engine. But those look plausible.

--
In the 70's we wore flares because we didn't know any better.
What possible excuse does the current generation have?

@p5pRT
Copy link
Author

p5pRT commented Aug 19, 2013

From @cpansprout

On Mon Aug 19 05​:16​:19 2013, davem wrote​:

On Sun, Aug 18, 2013 at 04​:00​:48PM -0700, Father Chrysostomos via RT
wrote​:

I have fixed this bug and others on the sprout/regexp branch. I would
appreciate it if someone more familiar with the regexp engine could
review at least the last two commits.

I can't really comment on the regcomp.c parts

That’s unfortunate, as that is the part I am most worried about.

since I only really (for
some very loose definition of "really") understand the runtime part of the
regex engine. But those look plausible.

Thank you. (The last time you applied that adjective to one of my
patches, it had a bug in it. :-)

At the risk of changing the subject, you’ve just reminded me of something.

Is there any reason why this (in regexec.c)​:

  /* extract RE object from returned value; compiling if
  * necessary */

has to come after this?

  S_regcp_restore(aTHX_ rex, runops_cp, &maxopenparen);

Couldn’t it be merged with

  /* before restoring everything, evaluate the returned
  * value, so that 'uninit' warnings don't use the wrong
  * PL_op or pad. Also need to process any magic vars
  * (e.g. $1) *before* parentheses are restored */

further up?

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Aug 19, 2013

From @demerphq

On 19 August 2013 16​:15, Father Chrysostomos via RT
<perlbug-followup@​perl.org> wrote​:

On Mon Aug 19 05​:16​:19 2013, davem wrote​:

On Sun, Aug 18, 2013 at 04​:00​:48PM -0700, Father Chrysostomos via RT
wrote​:

I have fixed this bug and others on the sprout/regexp branch. I would
appreciate it if someone more familiar with the regexp engine could
review at least the last two commits.

I can't really comment on the regcomp.c parts

That’s unfortunate, as that is the part I am most worried about.

I will do my best to review the regcomp.c bits.

Ill try to get it done this week sometime.

Yves

@p5pRT
Copy link
Author

p5pRT commented Aug 19, 2013

From @iabyn

On Mon, Aug 19, 2013 at 07​:15​:14AM -0700, Father Chrysostomos via RT wrote​:

On Mon Aug 19 05​:16​:19 2013, davem wrote​:

On Sun, Aug 18, 2013 at 04​:00​:48PM -0700, Father Chrysostomos via RT
wrote​:

I have fixed this bug and others on the sprout/regexp branch. I would
appreciate it if someone more familiar with the regexp engine could
review at least the last two commits.

I can't really comment on the regcomp.c parts

That’s unfortunate, as that is the part I am most worried about.

since I only really (for
some very loose definition of "really") understand the runtime part of the
regex engine. But those look plausible.

Thank you. (The last time you applied that adjective to one of my
patches, it had a bug in it. :-)

At the risk of changing the subject, you’ve just reminded me of something.

Is there any reason why this (in regexec.c)​:

        /\* extract RE object from returned value; compiling if
         \* necessary \*/

has to come after this?

    S\_regcp\_restore\(aTHX\_ rex\, runops\_cp\, &maxopenparen\);

Couldn’t it be merged with

    /\* before restoring everything\, evaluate the returned
     \* value\, so that 'uninit' warnings don't use the wrong
     \* PL\_op or pad\. Also need to process any magic vars
     \* \(e\.g\. $1\) \*before\* parentheses are restored \*/

further up?

Well originally all the code was after the restore; I moved the extracting
part ahead of the restore so that code like this would work​:

  "abcdef" =~ m{^(a)(??{ "b" =~ /(.)/; $1 })}

Whether the rest of the compiling code could be moved back as well (thus
unifying it all before the restore), I'm not sure. Instinctively it feels
wrong​: the sub-regex should be compiled using the state and context
of the outer regex, since we have effectively 'returned' from the code and
PL_curcop etc shouldn't still be pointing there. So for example warnings
triggered during the regex compilation shouldn't appear to come from
inside the code block (which may not even be in the same regex, e.g.

  $r = qr/(??{ '\\b+' })/;
  # ...
  "a" =~ m/a$r/ or die;

here the warning generated when \b+ is is compiled, should appear to come
from line 3, not line 1.

--
Modern art​:
  "That's easy, I could have done that!"
  "Ah, but you didn't!"

@p5pRT
Copy link
Author

p5pRT commented Aug 23, 2013

From @khwilliamson

On 08/18/2013 05​:00 PM, Father Chrysostomos via RT wrote​:

I have fixed this bug and others on the sprout/regexp branch. I would
appreciate it if someone more familiar with the regexp engine could
review at least the last two commits.

I am really familiar with a few areas of both regcomp.c and regexec.c.
I don't think anyone is really familiar with all of them.

Given what I know, I've looked at your patch, and it looks plausible to
me. I thought Jarkko's idea was a good one, to just change I32s and
U32s and then back off until it compiles, then the tests. Given that,
then it seems like we should apply your patch (possibly modified) and
see what happens.

Everything I have to say is a nit or bikeshedding​:
regcomp.c​: lines 161-162. You changed 'int' to 'I32' I think that I32
is big enough, but I thought we had a policy against introducing new I32s.

I would prefer using something other than SSize_t throughout, We mostly
use STRLEN instead of size_t. (Perhaps there is some distinction I'm
unaware of; if so it should be commented at the definition of STRLEN by
someone who can explain it.) If I'm right that there is no distinction,
than something like SSTRLEN or S_STRLEN would be more consistent with
existing practice and better than SSize_t. And, the casts to size_t at
the lines ending at #480 would then be to STRLEN instead.

Around line 4135, %z is not a C89 standard format, something else should
be used.

@p5pRT
Copy link
Author

p5pRT commented Aug 23, 2013

From @bulk88

On Fri Aug 23 14​:14​:52 2013, public@​khwilliamson.com wrote​:

Around line 4135, %z is not a C89 standard format, something else should
be used.

PerlIO_printf winds up using sv_vcatpvfn_flags in some/all (???) perl
builds, which implements its own format string handling independent of
what the OS c lib provides. So %z may (depending on if sv_vcatpvfn_flags
is used) be fine.
--
bulk88 ~ bulk88 at hotmail.com

@p5pRT
Copy link
Author

p5pRT commented Aug 23, 2013

From @cpansprout

On Fri Aug 23 14​:14​:52 2013, public@​khwilliamson.com wrote​:

Everything I have to say is a nit or bikeshedding​:
regcomp.c​: lines 161-162. You changed 'int' to 'I32'

That was a mistake. I’ll change it back.

I think that I32
is big enough, but I thought we had a policy against introducing new I32s.

I didn’t know of any such policy. I32 is sometimes appropriate.

I would prefer using something other than SSize_t throughout, We mostly
use STRLEN instead of size_t. (Perhaps there is some distinction I'm
unaware of; if so it should be commented at the definition of STRLEN by
someone who can explain it.)

I think most of these distinctions are historical artefacts, and nobody
really knows why we have so many aliases for the same thing. :-) I
believe it is mostly to make code self-documenting.

If I'm right that there is no distinction,
than something like SSTRLEN or S_STRLEN would be more consistent with
existing practice and better than SSize_t.

I see nothing particularly wrong with introducing SSTRLEN, apart from
the fact that we have too many types already. :-) And we are already
using SSize_t throughout perlio.c and pp_sys.c when dealing with string
lengths.

And, the casts to size_t at
the lines ending at #480 would then be to STRLEN instead.

Yes, they should be.

Around line 4135, %z is not a C89 standard format, something else should
be used.

Not so critical, as it is in an #if 0 section. But you have a point.

Does PerlIO_printf use the underlying C library’s (s)printf, or perl’s
own? If the former, we should use UVdf and cast to UV.

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Aug 25, 2013

From @cpansprout

On Sun Aug 18 16​:00​:47 2013, sprout wrote​:

I have fixed this bug and others on the sprout/regexp branch. I would
appreciate it if someone more familiar with the regexp engine could
review at least the last two commits.

The branch has been merged as ed56dbc.

In particular, commits 389ecb5 and 99a90e5 fix the original bug report.

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Aug 25, 2013

@cpansprout - Status changed from 'open' to 'resolved'

@p5pRT p5pRT closed this as completed Aug 25, 2013
@p5pRT
Copy link
Author

p5pRT commented Aug 30, 2013

From @nwc10

On Fri, Aug 23, 2013 at 02​:54​:05PM -0700, Father Chrysostomos via RT wrote​:

On Fri Aug 23 14​:14​:52 2013, public@​khwilliamson.com wrote​:

Everything I have to say is a nit or bikeshedding​:
regcomp.c​: lines 161-162. You changed 'int' to 'I32'

That was a mistake. I'll change it back.

I think that I32
is big enough, but I thought we had a policy against introducing new I32s.

I didn't know of any such policy. I32 is sometimes appropriate.

I don't think that it's been properly written down in the coding standards
(that we don't have) but I think it was discussed somewhere and poorly
archived.

I agree totally that I32 is sometimes appropriate. But I suspect that about
90% of the uses of I32 in the core are wrong. Which means that there are a
lot of I32s in the core that need auditing for

1) should this be unsigned?
  (in which case, U32, UV, or STRLEN, depending on purpose)
2) should this actually be 32 bits? Or is it the size of something?
  (so IV or SSize_t)
3) And if 16 bits is definitely enough, use unsigned int or int*

I32 seemed to be used as the "default type" for the first 10 years.
Which didn't matter when machines were (at most) 32 bit, and couldn't
even think about things larger than 2Gb

So unless something really should be both signed, and 32 bits, I think it's
better to use the type that more accurately reflects what the requirement
are, to avoid adding more I32s that need to be checked over.

Nicholas Clark

* I doubt that we actually build on any platforms with 16 bit ints.
  Hopefully someone will now prove me wrong with test results.

@p5pRT
Copy link
Author

p5pRT commented Aug 30, 2013

From @nwc10

On Mon, Jul 22, 2013 at 10​:15​:05PM -0700, Father Chrysostomos via RT wrote​:

On Mon Jun 24 18​:22​:12 2013, sprout wrote​:

On Sun Mar 03 06​:48​:09 2013, LeonT wrote​:

AFAICT changing mg_len shouldn't break API unless someone is taking a
pointer to it, which doesn't seem very likely.

I have just searched CPAN for /&.*mg_len/. The only uses of it are in
the perl core.

One of those uses is Perl_hv_placeholders_p. Added in ca73285,
this function is marked as being in the public API (A in embed.fnc), but
has no documentation and no short form (hv_placeholders_p). There are
no uses of it on CPAN.

Is this another case where a function was mistakenly added to the API
(like stashpv_hvname_match, added in ed221c5 and removed in
c947b31), whereas it should simply have been exported (X in embed.fnc)?

Not intentionally. See below​:

The only use of this function *anywhere* in publicly-viewable code (OK,
just CPAN and the perl core) is the HvPLACEHOLDERS macro in hv.h.

Only one CPAN module, Data​::Alias, uses HvPLACEHOLDERS, and it is like
this​: HvPLACEHOLDERS(a2)++.

So it looks as though it is safe to change

ApoR |I32* |hv_placeholders_p |NN HV *hv

to

XpoR |SSize_t*|hv_placeholders_p |NN HV *hv

Comments?

Yes, I think it's safe to do this (given the lack of users, and the compiler
warning that will ensue from anyone else using it and now having the wrong
size).

But this has been bugging me for quite a bit.
I know that we've got a lot of them, but I'm not sure that I like LVALUE
macros. In particular, I'm not sure that I like them to keep increasing,
and be the official documented route in. In that

1) LVALUE macros make it very hard to refactor the internals.
  Having a GET() and SET() pair makes the intent of the caller easy to spot.
  Which means that the storage doesn't need to stay as the same type as it
  is.

  You can also trigger code to run on writing the value, but avoid running
  it on reading. (For example, default the value to 0, and only do work if
  it's set to non-0)
  You can't do this with an LVALUE macro, because you can't tell reads from
  writes.

  I don't think that anyone is going to try to do any refactoring as
  aggressive as PONIE in the future, but by adding LVALUE macros we don't
  make their lives easier.

  *_p functions look ugly, which was sort of part of the plan.

2) Unlike functions, macros don't have any way to formally document their
  type. Which is directly relevant here. hv_placeholders_p has a clearly
  defined size of integer, which matters for pointers.
  Whereas the expression &HvPLACEHOLDERS(a2) doesn't, because you can't
  know what size HvPLACEHOLDERS(a2) is from its macro declaration. You have
  to dig further.

I don't have an answer here, and I'm not sure if I'm mentally prioritising
the wrong things to clean up first.

Ultimately I think that the problem is that the use of lots of direct data
structure access (even if the structure's shape is abstracted by macros) is
not malleable enough to survive major and continued refactoring. And that
we're way too far down that road to retrace our steps to a better place.

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Aug 30, 2013

From zefram@fysh.org

Nicholas Clark wrote​:

I know that we've got a lot of them, but I'm not sure that I like LVALUE
macros.

Indeed. We've been forced to change many into get/set pairs, when a value
stopped being stored as a straightforward mutable lvalue. We should learn
from that, and implement future macros in get/set pairs from the start.

-zefram

@p5pRT
Copy link
Author

p5pRT commented Aug 30, 2013

From @cpansprout

On Fri Aug 30 08​:38​:46 2013, zefram@​fysh.org wrote​:

Nicholas Clark wrote​:

I know that we've got a lot of them, but I'm not sure that I like LVALUE
macros.

Indeed. We've been forced to change many into get/set pairs, when a value
stopped being stored as a straightforward mutable lvalue. We should learn
from that, and implement future macros in get/set pairs from the start.

Lvalue macros can be really convenient. If we protect all new macros of
that sort with #ifdef PERL_CORE, it shouldn’t be too bad.

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Aug 31, 2013

From @jhi

I suggest moving this extended API discussion into its own ticket
and/or p5p thread since the original bug has been fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant