Unexpected tainting via regex using locale #13452
Comments
From Martin.vGagern@gmx.net
Created by Martin.vGagern@gmx.net
When "use locale" is in effect, then the regular expression match
Here is a small self-contained complete example:
#!/usr/bin/perl -T
Perl Info
|
From @jkeenanOn Tue Dec 03 11:54:14 2013, Martin.vGagern@gmx.net wrote:
I have confirmed that this is present in blead (5ea8618). |
The RT System itself - Status changed from 'new' to 'open' |
From @iabynOn Tue, Dec 03, 2013 at 11:54:15AM -0800, Martin.vGagern@gmx.net wrote:
It's the docs that were unclear, which I've amended with commit
Locale info taints the whole regex, not just individual captures, since it
/(\w+)(.)/
$2 needs to be tainted, since the tainted \w+ may eat fewer or more
PS - your example appears to succeed in 5.16.0 and earlier, but this was
-- |
@iabyn - Status changed from 'open' to 'resolved' |
From @khwilliamsonOn 12/04/2013 10:14 AM, Dave Mitchell wrote:
There is a bug in regexec.c that I'm fixing that fixes this sample |
From @khwilliamsonOn 12/04/2013 10:23 AM, Karl Williamson wrote:
This is now fixed by commit b99851e
What was happening before is that in any bracketed character class, if a
"foo.bar_baz" =~ /^(.*)[._](.*?)$/
then tainting shouldn't be turned on. The point is that this matches
Having now looked at the taint code in regexec.c, I see some other bugs
First, the documentation says that tainting is only done for things that
Second, this is tainted:
"abc" =~ /(abc)/i
simply because of the /i. I don't think it should be. Similarly |
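The patterns quoted so far in this thread can be probed with a short script. This is only a sketch: the helper name `capture_taint` is made up for illustration, it relies on `Scalar::Util::tainted`, and it has to be run with `perl -T` for taint mode to be active. What it reports depends on the perl version and on the fixes mentioned above, so no output is claimed here.

```perl
#!/usr/bin/perl
# Run as: perl -T probe.pl  (without -T nothing is ever reported as tainted)
use strict;
use warnings;
use locale;                      # patterns below are compiled with locale rules
use Scalar::Util qw(tainted);

# Hypothetical helper: match $str against $re and report, for each capture
# group, whether the corresponding $1, $2, ... is tainted (1) or not (0).
sub capture_taint {
    my ($str, $re) = @_;
    return unless $str =~ $re;
    no strict 'refs';            # ${"1"}, ${"2"}, ... name the capture variables
    return map { tainted(${$_}) ? 1 : 0 } 1 .. $#+;
}

# Patterns quoted in this thread; the input strings are untainted literals.
my @cases = (
    [ 'foo.bar',     qr/(\w+)(.)/        ],
    [ 'foo.bar_baz', qr/^(.*)[._](.*?)$/ ],
    [ 'abc',         qr/(abc)/i          ],
);

for my $case (@cases) {
    my ($str, $re) = @$case;
    my @flags = capture_taint($str, $re);
    print "$str =~ $re: captures tainted = (@flags)\n";
}
```

The patterns are compiled inside the scope of `use locale` on purpose: a `qr//` compiled outside that scope would keep its non-locale semantics even if it were later matched inside it.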
From Martin.vGagern@gmx.netOn 05.12.2013 04:12, karl williamson via RT wrote:
I guess character ranges depend on collation and therefore on locale.
I agree. The doc should reliably describe actual behaviour. I even
I'd say keep it as is. If I were to write such a /(abc)/i line, I'd |
From @rjbs* Martin von Gagern <Martin.vGagern@gmx.net> [2013-12-05T01:19:19]
I'm not sure I understand this line of argument. -- |
From @dmcbrideOn Friday December 6 2013 9:06:15 PM Ricardo Signes wrote:
If "abc" =~ /(abc)/i leaves $1 not tainted, but "ABC" =~ /(abc)/i does leave If, however, the /i makes $1 always tainted under locale, even if the input is |
From @ap* Darin McBride <dmcbride@cpan.org> [2013-12-07 03:45]:
Err. Except the only way I can read Karl’s proposal is that he thinks
… this argument is all quite moot, no? -- |
From @khwilliamsonOn 12/07/2013 01:07 AM, Aristotle Pagaltzis wrote:
And I wasn't proposing that. What I was suggesting is that: only if
So the argument is not moot, and I think has much to recommend it. To But that's not how it works now. Taken to its logical conclusion, It has worked for some time (perhaps forever; I haven't done the Because the behavior is currently inconsistent, something should be |
From @ap* Karl Williamson <public@khwilliamson.com> [2013-12-07 17:30]:
Ah. Then I am surprised you would propose that, and I have to agree with
Yes. Tainting of captures would be a property of the pattern determined
Indeed. It’s a property of the regexp – nothing about the input matters.
(I was thinking /i didn’t necessarily have to cause tainting because for
If a program depends on the current behaviour, can it be correct? If not, how much of the time is it likely to exhibit its brokenness? How widespread is the breakage likely to be? Regards, |
From @khwilliamsonOn 12/07/2013 10:17 AM, Aristotle Pagaltzis wrote:
http://en.wikipedia.org/wiki/ISO/IEC_646 lists several locales which exist where casing rules do change Here is "There are also some 7-bit character sets that are not officially part 7-bit Greek, ELOT 927. The Greek alphabet is mapped to positions |
From @khwilliamsonOn 12/07/2013 10:17 AM, Aristotle Pagaltzis wrote:
I am now uncertain about how tainting should behave. What Aristotle So I'd like to get more of a consensus before changing how things work. Here's my current quandary. It used to be that if you had a string A UTF-8 encoded string was not considered locale, and upper or I mostly solved this problem by partitioning the code point space into The reason I bring this up is the way I made tainting work on these So now, I'm working on extending Perl to work with UTF-8 locales, and as Opinions? |
From zefram@fysh.orgKarl Williamson wrote:
This sounds broken: in your example Latin-7 locale you now have two
What determined the choice of locale? An environment variable? Taint! Fleshing out a bit: logically the choice of locale may be tainted. This would lead to a lot of tainting. You might think it looks like too
This is effectively the behaviour that falls out of my analysis. -zefram |
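To make the "what determined the choice of locale?" question concrete, the sketch below lists the environment variables that conventionally select the LC_CTYPE locale on POSIX systems (`LC_ALL`, then `LC_CTYPE`, then `LANG`) and reports whether perl considers each value tainted. The helper name `locale_env_report` is hypothetical, and the script only shows something interesting when run with `perl -T`.

```perl
#!/usr/bin/perl
# Run as: perl -T locale_env.pl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use Scalar::Util qw(tainted);

# Hypothetical helper: for each variable that conventionally selects the
# LC_CTYPE locale, return [ name, value-or-"(unset)", tainted-flag ].
sub locale_env_report {
    my ($env) = @_;
    return map {
        my $v = $env->{$_};
        [ $_, defined $v ? $v : '(unset)', tainted($v) ? 1 : 0 ]
    } qw(LC_ALL LC_CTYPE LANG);
}

for my $row (locale_env_report(\%ENV)) {
    printf "%-8s = %-20s tainted=%d\n", @$row;
}

# With a single argument, setlocale() queries the current setting.
my $current = setlocale(LC_CTYPE);
print "current LC_CTYPE locale: ", (defined $current ? $current : '(unknown)'), "\n";
```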
From perl5-porters@perl.orgZefram wrote:
But that is the only sane model that sufficiently preserves backward
Agreed. But we cannot expect people to prevent strings from being
That would be something completely new, which would likely break |
From @tseeOn 01/16/2014 03:27 PM, Father Chrysostomos wrote:
Agreed as far as I can tell. There's a sane model that doesn't preserve --Steffen |
From @AbigailOn Wed, Dec 04, 2013 at 05:14:18PM +0000, Dave Mitchell wrote:
In fact, whether the expression matches or fails may depend on the locale. And hence, there's something to be said that despite it not having Abigail |
From @khwilliamsonOn 01/16/2014 11:27 AM, Abigail wrote:
In looking at this in more detail, I see more issues. I think the best "If there exists input to an operation which would have locale-dependent But as I said before, I haven't coded things that way, because the way Also, we don't taint TRUE/FALSE results. I don't know the logic behind |
From perl5-porters@perl.orgKarl Williamson wrote:
AFAIK, Perl has never had tainted booleans. (I have some code that |
From @rjbs* Karl Williamson <public@khwilliamson.com> [2014-01-22T15:05:57]
I think I agree with you, but want to clarify.
Also, I should state that I'm not a heavy user of locales *or* taint mode, so
This means, in part, that this only happens in the scope of the locale pragma.
In other words, if you use locale, you're likely to start seeing a lot more
Perhaps the best paradigm for this is that the locale itself can be tainted: if
This is my intuition on the whole thing.
-- |
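The point about the scope of the locale pragma can be illustrated with two otherwise identical matches, one compiled outside `use locale` and one inside it. This is a sketch under assumptions: the sub names are invented for the example, `Scalar::Util::tainted` is used to inspect `$1`, and the script needs `perl -T`. The result of the second match depends on the perl version, so only the first is described.

```perl
#!/usr/bin/perl
# Run as: perl -T scope.pl
use strict;
use warnings;
use Scalar::Util qw(tainted);

# Hypothetical helper: the pattern is compiled without locale rules.
sub match_without_locale {
    my ($s) = @_;
    return $s =~ /(\w+)/ ? tainted($1) : undef;
}

# Hypothetical helper: same pattern, compiled inside the pragma's scope.
sub match_with_locale {
    my ($s) = @_;
    use locale;      # lexically scoped: affects only the rest of this block
    return $s =~ /(\w+)/ ? tainted($1) : undef;
}

my $input = 'abc';   # an untainted literal

printf "without locale: \$1 is %s\n",
    match_without_locale($input) ? 'tainted' : 'not tainted';
printf "with locale:    \$1 is %s\n",
    match_with_locale($input) ? 'tainted' : 'not tainted';
```

With an untainted literal as input, the match outside the pragma's scope leaves `$1` untainted; whether the second one taints `$1` is exactly the behaviour this thread is debating.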
From @khwilliamsonOn 01/25/2014 12:46 PM, Ricardo Signes wrote:
Yes.
rjbs and I clarified things on IRC.
This quote added in 1998 from
"Locales--particularly on systems that allow unprivileged users to build
The bottom line is we are moving to the policy that tainting is based on |
From @ap* Father Chrysostomos <sprout@cpan.org> [2014-01-16 15:30]:
So far, I agree.
No, we cannot expect people to prevent strings from being upgraded. But
I agree this isn’t feasible. But maybe we can follow the precedent of Regards, |
From @khwilliamsonOn 01/25/2014 04:06 PM, Aristotle Pagaltzis wrote:
That seems reasonable to me. |
From @demerphqOn 16 January 2014 22:27, Father Chrysostomos <sprout@cpan.org> wrote:
FWIW Figuring out a way to mark a string as not being upgradable, and Yves -- |
From zefram@fysh.orgdemerphq wrote:
Sounds like a bad idea. What exactly do you mean by "a string" here? -zefram |
From @khwilliamsonOn 01/27/2014 11:30 AM, Karl Williamson wrote:
Now done in 613abc6 |
From @iabynOn Mon, Dec 29, 2014 at 01:53:52PM -0700, Karl Williamson wrote:
Karl, that commit's failing very badly on my system on debugging builds.
    use POSIX 'locale_h';
    my $l = 'zh_HK.big5hkscs';
    setlocale(&POSIX::LC_CTYPE(), $l) or die;
which dies like this:
    perl: util.c:1900: Perl_ck_warner: Assertion `pat' failed.
It's dying while doing the fc(). pp_fc() does this:
    _CHECK_AND_WARN_PROBLEMATIC_LOCALE;
which is calling Perl_ck_warner() with a null pat. This macro is defined as:
    # define _CHECK_AND_WARN_PROBLEMATIC_LOCALE \
and at this point, PL_warn_locale looks like:
    SV = PVNV(0xaa2ee8) at 0xac42e0
so SvPVX(PL_warn_locale) is null.
I don't understand the locale system well enough for the fix to be obvious
-- |
From @khwilliamsonOn 12/29/2014 03:15 PM, Dave Mitchell wrote:
I'll look into it. I can reproduce it on dromedary. I note that |
From @khwilliamsonOn 12/29/2014 03:15 PM, Dave Mitchell wrote:
This should now be fixed by 215c513. |
Migrated from rt.perl.org#120675 (status was 'resolved')
Searchable as RT120675$