On Wed, Jul 31, 2013 at 01:10:33PM +0100, Dave Mitchell wrote: > On Tue, Jul 30, 2013 at 10:58:36PM -0700, Alexey Borzenkov wrote: > > When using threads::shared on Perl 5.14 using many kinds of > > expressions in sub's return becomes very dangerous, as return > > values just become undef undef certain circumstances. For example > > this program would die in Perl 5.14, but would work normally in > > Perl 5.12 or Perl 5.16: > > > > use strict; > > use threads (); > > use threads::shared; > > > > sub test_clone { > > my $ref = shared_clone([{a => 1, b => 2}]); > > return $ref->[0]; > > } > > > > die "not defined" unless defined(test_clone); This can be simplified to: #!./perl use strict; use threads (); use threads::shared; sub test_clone { my @a :shared; $a[0] = 6; $a[0]; } die "not defined" unless defined test_clone; __END__ > > This causes threaded perl programs on Debian 7.0 or Ubuntu 12.04 > > to produce really weird failures. > > This was fixed in perl 5.15.7 with commit > 6f48390ab209d16ee8f795f0a83677c8bd9ac69c, > which wont be back-ported to 5.14.x as its out of its maintenance period. > > You can work around the issue on older perls by assigning the value to > a temporary var before returning it; i.e. > > my $ret = $ref->[0]; > return $ret; Totally true. If assertions are enabled the test case triggers an assertion failure on 5.14.0. Given this, I was suspicious as to what introduced the assertion failure, it turns out that there's a heck of a lot more to this than first meets the eye. So a git bisect (with assertions enabled) shows that commit 6f48390ab209d16e (as mentioned above) makes the test case work. It *also* makes the test case work if assertions are disabled. commit 6f48390ab209d16ee8f795f0a83677c8bd9ac69c Author: Father Chrysostomos Date: Wed Jan 4 23:28:54 2012 -0800 [perl #95548] Returned magical temps are not copied return and leavesub, for speed, were not copying temp variables with a refcount of 1, which is fine as long as the fact that it was not cop- ied is not observable. With magical variables, that *can* be observed, so we have to forego the optimisation and copy the variable if's magical. This obviously applies only to rvalue subs. So, what caused the assertions to start failing between v5.12 and v5.14? A simple bisect suggests that it's this commit: commit f2338a2e8347fc967ab6b9af21d948258b88e341 Author: David Mitchell Date: Thu Apr 15 10:20:50 2010 +0100 use cBOOL for bool casts bool b = (bool)some_int doesn't necessarily do what you think. In some builds, bool is defined as char, and that cast's behaviour is thus undefined. So this line in mg.c: const bool was_temp = (bool)SvTEMP(sv); was actually setting was_temp to false even when the SVs_TEMP flag was set. Fix this by replacing all the (bool) casts with a new cBOOL() cast macro that (hopefully) does the right thing. which has nothing directly to do with threads shared, or (not?) copying values on subroutine return. It turns out after a bit of digging that it's this section of that change that matters: diff --git a/mg.c b/mg.c index 3fb8ec4..4a8d767 100644 --- a/mg.c +++ b/mg.c @@ -193,7 +193,7 @@ Perl_mg_get(pTHX_ SV *sv) { dVAR; const I32 mgs_ix = SSNEW(sizeof(MGS)); - const bool was_temp = (bool)SvTEMP(sv); + const bool was_temp = cBOOL(SvTEMP(sv)); bool have_new = 0; MAGIC *newmg, *head, *cur, *mg; /* guard against sv having being freed midway by holding a private It's not obvious from the code above, but SvTEMP(sv) returns either 0 or 0x00080000, and in 2010, C was typedef'd as C, so the above code always set was_temp to 0, due to truncation. The cast was added to avoid a warning about the truncation. The (buggy) truncation itself was added when the type of was_temp was changed from a suitably large integer to a bool as part of this commit: commit 35a4481cfdbca4941ab3a4206dc266f3e71c2385 Author: Andy Lester Date: Sun Mar 13 08:20:05 2005 -0600 Adding const qualifiers Message-ID: <20050313202005.GA23535@petdance.com> p4raw-id: //depot/perl@24037 (note that the commit message doesn't even note that some variables' types were changed). The variable in question is used to enable Perl_mg_get() to retain the SVs_TEMP bit set on values it is processing. This is needed because Perl_sv_2mortal() conflates "temporary" with "mortal" and automatically sets the SVs_TEMP flag bit on everything. Hence this code seeks to remove SVs_TEMP if Perl_sv_2mortal() set it. The upshot of the bug was that SVs_TEMP was always being removed, so code in subroutine entry that does *not* copy TEMPs is skipped because the value is becoming unmarked. Hence the values are always copied. So, all that conceals what actually going on. When that truncation bug was first introduced, it didn't matter that TEMPs weren't copied. When that bug was fixed, it mattered a lot that TEMPs weren't copied. What changed in the middle? I set out to bisect between the two commits, patching *that* the bug to fix it. The code doesn't change much, and test-building on the change points reveals that at the revision with the last change to that part of the code (changing a line adjacent to C), fixing *that* bug doesn't break the test case. So I made a patch to fix the bug, and bisected with that: Porting/bisect.pl -Dusethreads --start f9c6fee519b868a2e8ef8c5b701c0d3f95565423 --end db63319f533e643ef6aac622fcae9a2f7ceabb0d --late-fixup ../119089-W -Dnoextensions=Encode ../perl/119089 That reveals that the true trigger of the behaviour regression was this change: commit fd69380d5d5b95ef16e2521cf4251b34ee0ce151 Author: David Mitchell Date: Tue Mar 23 12:11:43 2010 +0000 Fix assorted bugs related to magic (such as pos) not "sticking" to magical array and hash elements; e.g. the following looped infinitely: $h{tainted_element} =~ /..../g There are two side-effects of this fix. First, MGf_GSKIP has been extended to work on tied array elements as well as hash elements. This is the mechanism that skips all but the first tied element magic gets until after the next set. Second, rvalue hash/array element access where the element has get magic, now directly returns the element rather than a mortal copy. The root cause of the bug was code similar to the following in pp_alem, pp_aelemfast, pp_helem and pp_rv2av: if (!lval && SvGMAGICAL(sv)) /* see note in pp_helem() */ sv = sv_mortalcopy(sv); According to the note, this was added in 1998 to make this work: local $tied{foo} = $tied{foo} Since it returns a copy rather than the element, this make //g fail. My first attempt, a few years ago, to fix this, took the approach that the LHS of the bind should be made an lvalue in the presence of //g, since it now modifies its LHS; i.e. expr =~ // expr is rvalue expr =~ s/// expr is lvalue expr =~ //g expr was rvalue, I proposed to change it to lvalue Unfortunately this fix broke too much stuff (stuff that was arguably already broken, but it upset people). For example, f() ~= s//// correctly gives the error Can't modify non-lvalue subroutine call My fix extended f() =~ //g to give the same error. Which is reasonable, because the g isn't doing what you want. But plenty of people had code that only needed to match once and the g had just been cargo-culted. So it broke their working code. So lets not do this. My new approach has been to remove the sv_mortalcopy(). It turns out that this is no longer needed to fix the local $tied{foo} issue. Presumably that went away as a side-effect of my container/value magic localisation rationalisation of a few years ago, although I haven't analysed it - just noted that the tests still pass (!). However, an issue with removing it is that mg_get() no longer gets called. So a plain $tied_hash{elem}; in void context no longer calls FETCH(). Which broke some tests and might break some code. Also, there's an issue with the delayed calling of magic in @+[n] and %+{foo}; by the time the get magic is called, the original pattern may have gone out of scope. The solution is to simply replace the original sv = sv_mortalcopy(sv); with mg_get(sv); This then caused problems with tied array FETCH() getting called too much. I fixed this by extending the MGf_GSKIP mechanism to tied arrays as well as hashes. I don't understand why tied arrays have always been treated differently than tied hashes, but unifying them didn't seem to break anything (except for a Storable test, whose comment indicated that the test's author thought FETCH() was being called to often anyway). The key part is this: diff --git a/pp_hot.c b/pp_hot.c index 3371e88..8f8af53 100644 --- a/pp_hot.c +++ b/pp_hot.c @@ -658,7 +658,7 @@ PP(pp_aelemfast) SV *sv = (svp ? *svp : &PL_sv_undef); EXTEND(SP, 1); if (!lval && SvGMAGICAL(sv)) /* see note in pp_helem() */ - sv = sv_mortalcopy(sv); + mg_get(sv); PUSHs(sv); RETURN; } ie it *gets rid of a copy*. So this removes one place that copied the return value, and the cast fix removes the other place, hence why both are needed to trigger the problem. And why the identified patch fixes it, by re-instating a copy. The story is not over. So, I assumed that taking blead, and reverting the patch that fixes the test case, would cause the test case to fail again. It doesn't. More bisecting revealed that subsequently this change also has an influence: commit 4bac9ae47b5ad7845a24e26b0e95609805de688a Author: Chip Salzenberg Date: Fri Jun 22 15:18:18 2012 -0700 Magic flags harmonization. In restore_magic(), which is called after any magic processing, all of the public OK flags have been shifted into the private OK flags. Thus the lack of an appropriate public OK flags was used to trigger both get magic and required conversions. This scheme did not cover ROK, however, so all properly written code had to make sure mg_get was called the right number of times anyway. Meanwhile the private OK flags gained a second purpose of marking converted but non-authoritative values (e.g. the IV conversion of an NV), and the inadequate flag shift mechanic broke this in some cases. This patch removes the shift mechanic for magic flags, thus exposing (and fixing) some improper usage of magic SVs in which mg_get() was not called correctly. It also has the side effect of making magic get functions specifically set their SVs to undef if that is desired, as the new behavior of empty get functions is to leave the value unchanged. This is a feature, as now get magic that does not modify its value, e.g. tainting, does not have to be special cased. The changes to cpan/ here are only temporary, for development only, to keep blead working until upstream applies them (or something like them). Thanks to Rik and Father C for review input. because part of it removes the entire was_temp (micro?) optimisation logic: @@ -3302,12 +3285,8 @@ S_restore_magic(pTHX_ const void *p) So artificially keep it alive a bit longer. We avoid turning on the TEMP flag, which can cause the SV's buffer to get stolen (and maybe other stuff). */ - int was_temp = SvTEMP(sv); sv_2mortal(sv); - if (!was_temp) { - SvTEMP_off(sv); - } - SvOK_off(sv); + SvTEMP_off(sv); } else SvREFCNT_dec(sv); /* undo the inc in S_save_magic() */ meaning that values will always be copied. (Given the nature of the first commit identified as a bug fix, this seems reasonable. magic values shouldn't be being short circuited.) So that explains the given test case. But the story is still not over. Whilst temporary values should be copied (by default), LVALUEs returned from :lvalue subs should not. This will fail assertions, SEGV, or worse even on blead: #!./perl use strict; use threads (); use threads::shared; use Devel::Peek; sub test_clone :lvalue { my @a :shared; $a[0]; } #Dump(test_clone); test_clone() = 0; __END__ I believe that the problem is because threads::shared wrongly assumes that the proxy for a shared aggregate will always outlive any LVALUE proxies for its elements. This is generally true, but not for element proxies returned from :lvalue subroutines for aggregate proxies which go out of scope with the subroutine. I believe that the fix is to change threads::shared to deal with cleanup differently, so that the last magic struct which frees up the relevant mg_obj is responsible for cleaning up shared space. This means adding "free" magic to the element proxies: diff --git a/dist/threads-shared/shared.xs b/dist/threads-shared/shared.xs index d3e859d..07379cc 100644 --- a/dist/threads-shared/shared.xs +++ b/dist/threads-shared/shared.xs @@ -1027,6 +1027,27 @@ sharedsv_elem_mg_DELETE(pTHX_ SV *sv, MAGIC *mg) return (0); } +int +sharedsv_elem_mg_free(pTHX_ SV *sv, MAGIC *mg) +{ + dTHXc; + PERL_UNUSED_ARG(sv); + ENTER_LOCK; + if (mg->mg_obj) { + if (!PL_dirty) { + assert(SvROK(mg->mg_obj)); + } + if (SvREFCNT(mg->mg_obj) == 1) { + /* If the element has the last pointer to the shared aggregate, then + it has to free the shared aggregate. mg->mg_obj itself is freed + by Perl_mg_free() */ + S_sharedsv_dec(aTHX_ SHAREDSV_FROM_OBJ(mg->mg_obj)); + } + } + LEAVE_LOCK; + return (0); +} + /* Called during cloning of PERL_MAGIC_tiedelem(p) magic in new * thread */ @@ -1044,7 +1065,7 @@ MGVTBL sharedsv_elem_vtbl = { sharedsv_elem_mg_STORE, /* set */ 0, /* len */ sharedsv_elem_mg_DELETE, /* clear */ - 0, /* free */ + sharedsv_elem_mg_free, /* free */ 0, /* copy */ sharedsv_elem_mg_dup, /* dup */ #ifdef MGf_LOCAL @@ -1114,7 +1135,21 @@ int sharedsv_array_mg_free(pTHX_ SV *sv, MAGIC *mg) { PERL_UNUSED_ARG(sv); - S_sharedsv_dec(aTHX_ (SV*)mg->mg_ptr); + if (!PL_dirty) { + assert(mg->mg_obj); + assert(SvROK(mg->mg_obj)); + assert(SvUV(SvRV(mg->mg_obj)) == PTR2UV(mg->mg_ptr)); + } + if (mg->mg_obj) { + if (SvREFCNT(mg->mg_obj) == 1) { + S_sharedsv_dec(aTHX_ (SV*)mg->mg_ptr); + } else { + /* An element of this aggregate still has PERL_MAGIC_tied(p) + pointing to this shared aggregate. It will take responsibility + for freeing the shared aggregate. Perl_mg_free() drops the + reference count on mg->mg_obj. */ + } + } return (0); } This change has been stewing on the branch smoke-me/nicholas/RT119089-variant and doesn't seem to have any problems. (The commit message is now not correct - 9b9aaeae0f62 fixed the problem on 5.12.0. Global destruction*) I think that it doesn't introduce any leaks, but I'd appreciate a second pair of eyes. This fix to threads::shared has the side effect of also fixing the original reported bug when running on v5.14 Nicholas Clark * Global destruction. threads::shared does not clean up "shared space", even when PL_destruct_level is 2. There is no C corresponding to the C in Perl_sharedsv_init(). I'm about to report that as a separate bug.