Skip Menu |
Report information
Id: 120675
Status: resolved
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors: Martin.vGagern [at] gmx.net
Cc:
AdminCc:

Operating System: (no value)
PatchStatus: (no value)
Severity: (no value)
Type: docs
Perl Version: (no value)
Fixed In: (no value)



Subject: Unexpected tainting via regex using locale
CC: gagern [...] ma.tum.de
To: perlbug [...] perl.org
From: <Martin.vGagern [...] gmx.net>
Date: Tue, 3 Dec 2013 20:54:02 +0100 (CET)
Download (untitled) / with headers
text/plain 5.5k
This is a bug report for perl from <Martin.vGagern@gmx.net>, generated with the help of perlbug 1.39 running under perl 5.18.1. ----------------------------------------------------------------- [Please describe your issue here] When "use locale" is in effect, then the regular expression match m/^(.*)[._](.*?)$/ taints its first capture group even if the input is not tainted. The perllocale manual mentions locale-dependent regular expressions which taint their capture groups, but the way I read that documentation, only regular expressions using one of \w, \W, \s or \S should be affected. So either the above regular expression should not taint the group, or the documentation is incomplete and should be more precise. Here is a small self-contained complete example: #!/usr/bin/perl -T use strict; use warnings; use locale; use Scalar::Util qw(tainted); my $var = "foo.bar_baz"; $var =~ m/^(.*)[._](.*?)$/; print(tainted($1) ? "tainted\n" : "untainted\n"); [Please do not change anything below this line] ----------------------------------------------------------------- --- Flags: category=core severity=low --- Site configuration information for perl 5.18.1: Configured by Gentoo at Fri Sep 6 12:56:43 CEST 2013. Summary of my perl5 (revision 5 version 18 subversion 1) configuration: Platform: osname=linux, osvers=3.10.10-gentoo, archname=x86_64-linux-thread-multi uname='linux server 3.10.10-gentoo #1 smp preempt sat aug 31 14:25:33 cest 2013 x86_64 amd phenom(tm) ii x4 945 processor authenticamd gnulinux ' config_args='-des -Duseshrplib -Darchname=x86_64-linux-thread -Dcc=x86_64-pc-linux-gnu-gcc -Doptimize=-march=amdfam10 -O2 -ggdb -pipe -Dldflags=-Wl,--as-needed -Dprefix=/usr -Dinstallprefix=/usr -Dsiteprefix=/usr/local -Dvendorprefix=/usr -Dscriptdir=/usr/bin -Dprivlib=/usr/lib64/perl5/5.18.1 -Darchlib=/usr/lib64/perl5/5.18.1/x86_64-linux-thread-multi -Dsitelib=/usr/local/lib64/perl5/5.18.1 -Dsitearch=/usr/local/lib64/perl5/5.18.1/x86_64-linux-thread-multi -Dvendorlib=/usr/lib64/perl5/vendor_perl/5.18.1 -Dvendorarch=/usr/lib64/perl5/vendor_perl/5.18.1/x86_64-linux-thread-multi -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dvendorman1dir=/usr/share/man/man1 -Dvendorman3dir=/usr/share/man/man3 -Dman1ext=1 -Dman3ext=3pm -Dlibperl=libperl.so.5.18.1 -Dlocincpth=/usr/include -Dglibpth=/lib64 /usr/lib64 -Duselargefiles -Dd_semctl_semun -Dcf_by=Gentoo -Dmyhostname=localhost -Dperladmin=root@localhost -Di nstallusrbinperl=n -Ud_csh -Uusenm -Di_ndbm -Di_gdbm -Di_db -Dusethreads -DDEBUGGING=-g -Dinc_version_list=5.18.0/x86_64-linux-thread-multi 5.18.0 -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Dnoextensions=ODBM_File' hint=recommended, useposix=true, d_sigaction=define useithreads=define, usemultiplicity=define useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=define, use64bitall=define, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='x86_64-pc-linux-gnu-gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-march=amdfam10 -O2 -ggdb -pipe', cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector' ccversion='', gccversion='4.7.3', gccosandvers='' intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16 ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=8, prototype=define Linker and Libraries: ld='x86_64-pc-linux-gnu-gcc', ldflags ='-Wl,--as-needed -fstack-protector' libpth=/usr/local/lib64 /lib64 /usr/lib64 libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc libc=/lib/libc-2.17.so, so=so, useshrplib=true, libperl=libperl.so.5.18.1 gnulibc_version='2.17' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib64/perl5/5.18.1/x86_64-linux-thread-multi/CORE' cccdlflags='-fPIC', lddlflags='-shared -march=amdfam10 -O2 -ggdb -pipe -fstack-protector' Locally applied patches: --- @INC for perl 5.18.1: /home/mvg/perl/lib/perl5/site_perl /home/mvg/perl/lib/perl5 /home/mvg/perl/lib/perl5/site_perl /home/mvg/perl/lib/perl5 /usr/local/lib64/perl5/5.18.1/x86_64-linux-thread-multi /usr/local/lib64/perl5/5.18.1 /usr/lib64/perl5/vendor_perl/5.18.1/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.18.1 /usr/lib64/perl5/5.18.1/x86_64-linux-thread-multi /usr/lib64/perl5/5.18.1 /usr/local/lib64/perl5 /usr/lib64/perl5/vendor_perl . --- Environment for perl 5.18.1: HOME=/home/mvg LANG=de_DE.utf8 LANGUAGE= LC_CTYPE=de_DE.utf8 LC_MESSAGES=C LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/mvg/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/x86_64-pc-linux-gnu/gcc-bin/4.7.3:/usr/x86_64-pc-linux-gnu/mingw32/gcc-bin/4.7.3:/opt/stuffit/bin:/usr/x86_64-pc-linux-gnu/gnat-gcc-bin/4.5:/usr/libexec/gnat-gcc/x86_64-pc-linux-gnu/4.5:/opt/android-sdk-update-manager/tools:/opt/android-sdk-update-manager/platform-tools:/opt/nvidia-cg-toolkit/bin:/usr/games/bin:/opt/vmware/bin PERL5LIB=/home/mvg/perl/lib/perl5/site_perl:/home/mvg/perl/lib/perl5:/home/mvg/perl/lib/perl5/site_perl:/home/mvg/perl/lib/perl5 PERLDOC_PAGER=less -R PERL_BADLANG (unset) SHELL=/bin/bash
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 1.1k
On Tue Dec 03 11:54:14 2013, Martin.vGagern@gmx.net wrote: Show quoted text
> > This is a bug report for perl from <Martin.vGagern@gmx.net>, > generated with the help of perlbug 1.39 running under perl 5.18.1. > > > ----------------------------------------------------------------- > [Please describe your issue here] > > When "use locale" is in effect, then the regular expression match > m/^(.*)[._](.*?)$/ taints its first capture group even if the input is > not > tainted. The perllocale manual mentions locale-dependent regular > expressions which taint their capture groups, but the way I read that > documentation, only regular expressions using one of \w, \W, \s or \S > should > be affected. So either the above regular expression should not taint > the > group, or the documentation is incomplete and should be more precise. > > Here is a small self-contained complete example: > > #!/usr/bin/perl -T > use strict; > use warnings; > use locale; > use Scalar::Util qw(tainted); > my $var = "foo.bar_baz"; > $var =~ m/^(.*)[._](.*?)$/; > print(tainted($1) ? "tainted\n" : "untainted\n"); > >
I have confirmed that this is present in blead (5ea8618b).
Subject: Re: [perl #120675] Unexpected tainting via regex using locale
From: Dave Mitchell <davem [...] iabyn.com>
To: perl5-porters [...] perl.org
Date: Wed, 4 Dec 2013 17:14:18 +0000
Download (untitled) / with headers
text/plain 1.4k
On Tue, Dec 03, 2013 at 11:54:15AM -0800, Martin.vGagern@gmx.net wrote: Show quoted text
> When "use locale" is in effect, then the regular expression match > m/^(.*)[._](.*?)$/ taints its first capture group even if the input is not > tainted. The perllocale manual mentions locale-dependent regular > expressions which taint their capture groups, but the way I read that > documentation, only regular expressions using one of \w, \W, \s or \S should > be affected. So either the above regular expression should not taint the > group, or the documentation is incomplete and should be more precise. > > Here is a small self-contained complete example: > > #!/usr/bin/perl -T > use strict; > use warnings; > use locale; > use Scalar::Util qw(tainted); > my $var = "foo.bar_baz"; > $var =~ m/^(.*)[._](.*?)$/; > print(tainted($1) ? "tainted\n" : "untainted\n");
It's the docs that were unclear, which I've amened with commit 9fc477bf4a1f21e94d5dfe8e99d8a93308d5388c Locale info taints the whole regex, not just individual captures, since it may affect other captures. As a trivial example, in /(\w+)(.)/ $2 needs to be tainted, since the tainted \w+ may eat fewer or more characters than you expect, altering what $2 captures. PS - your example appears to succeed in 5.16.0 and earlier, but this was due to a bug in Scalar::Util::tainted(). Using quotes: tainted("$1") makes it fail "properly". -- "Foul and greedy Dwarf - you have eaten the last candle." -- "Hordes of the Things", BBC Radio.
Subject: Re: [perl #120675] Unexpected tainting via regex using locale
From: Karl Williamson <public [...] khwilliamson.com>
To: Dave Mitchell <davem [...] iabyn.com>, perl5-porters [...] perl.org
Date: Wed, 04 Dec 2013 10:23:52 -0700
Download (untitled) / with headers
text/plain 1.5k
On 12/04/2013 10:14 AM, Dave Mitchell wrote: Show quoted text
> On Tue, Dec 03, 2013 at 11:54:15AM -0800, Martin.vGagern@gmx.net wrote:
>> When "use locale" is in effect, then the regular expression match >> m/^(.*)[._](.*?)$/ taints its first capture group even if the input is not >> tainted. The perllocale manual mentions locale-dependent regular >> expressions which taint their capture groups, but the way I read that >> documentation, only regular expressions using one of \w, \W, \s or \S should >> be affected. So either the above regular expression should not taint the >> group, or the documentation is incomplete and should be more precise. >> >> Here is a small self-contained complete example: >> >> #!/usr/bin/perl -T >> use strict; >> use warnings; >> use locale; >> use Scalar::Util qw(tainted); >> my $var = "foo.bar_baz"; >> $var =~ m/^(.*)[._](.*?)$/; >> print(tainted($1) ? "tainted\n" : "untainted\n");
> > It's the docs that were unclear, which I've amened with commit > > 9fc477bf4a1f21e94d5dfe8e99d8a93308d5388c > > Locale info taints the whole regex, not just individual captures, since it > may affect other captures. As a trivial example, in > > /(\w+)(.)/ > > $2 needs to be tainted, since the tainted \w+ may eat fewer or more > characters than you expect, altering what $2 captures. > > PS - your example appears to succeed in 5.16.0 and earlier, but this was > due to a bug in Scalar::Util::tainted(). Using quotes: tainted("$1") > makes it fail "properly". >
There is a bug in regexec.c that I'm fixing that fixes this sample program. Details to follow
From: Karl Williamson <public [...] khwilliamson.com>
To: Dave Mitchell <davem [...] iabyn.com>, perl5-porters [...] perl.org, "perlbug-followup [...] perl.org" <perlbug-followup [...] perl.org>
Date: Wed, 04 Dec 2013 20:12:23 -0700
Subject: Re: [perl #120675] Unexpected tainting via regex using locale
Download (untitled) / with headers
text/plain 3.4k
On 12/04/2013 10:23 AM, Karl Williamson wrote: Show quoted text
> On 12/04/2013 10:14 AM, Dave Mitchell wrote:
>> On Tue, Dec 03, 2013 at 11:54:15AM -0800, Martin.vGagern@gmx.net wrote:
>>> When "use locale" is in effect, then the regular expression match >>> m/^(.*)[._](.*?)$/ taints its first capture group even if the input >>> is not >>> tainted. The perllocale manual mentions locale-dependent regular >>> expressions which taint their capture groups, but the way I read that >>> documentation, only regular expressions using one of \w, \W, \s or \S >>> should >>> be affected. So either the above regular expression should not taint >>> the >>> group, or the documentation is incomplete and should be more precise. >>> >>> Here is a small self-contained complete example: >>> >>> #!/usr/bin/perl -T >>> use strict; >>> use warnings; >>> use locale; >>> use Scalar::Util qw(tainted); >>> my $var = "foo.bar_baz"; >>> $var =~ m/^(.*)[._](.*?)$/; >>> print(tainted($1) ? "tainted\n" : "untainted\n");
>> >> It's the docs that were unclear, which I've amened with commit >> >> 9fc477bf4a1f21e94d5dfe8e99d8a93308d5388c >> >> Locale info taints the whole regex, not just individual captures, >> since it >> may affect other captures. As a trivial example, in >> >> /(\w+)(.)/ >> >> $2 needs to be tainted, since the tainted \w+ may eat fewer or more >> characters than you expect, altering what $2 captures. >> >> PS - your example appears to succeed in 5.16.0 and earlier, but this was >> due to a bug in Scalar::Util::tainted(). Using quotes: tainted("$1") >> makes it fail "properly". >>
> > There is a bug in regexec.c that I'm fixing that fixes this sample > program. Details to follow >
This is now fixed by commit b99851e1941e002dd4816ee6c76fd49bbee1d7f3 What was happening before is that in any bracketed character class, if a target character did not match exactly one of those characters explicitly in the class, the result was tainted. This tainting should not have been turned on unless there was an attempt to match case-insensitively or against a POSIX class, both of which are affected by locale. But if the character class doesn't include those possibilities, as in the sample program from the ticket: "foo.bar_baz" =~ /^(.*)[._](.*?)$/ then tainting shouldn't be turned on. The point is that this matches regardless of whatever locale is in effect. Thus a defective or malicious locale plays no part in determining if this matches or not, and so the result should not be tainted. (And it now isn't in blead.) Having now looked at the taint code in regexec.c, I see some other bugs and inconsistencies with the documentation. First, the documentation says that tainting is only done for things that use \w, \s, and their complements. In fact, tainting is done for any POSIX class. That seems correct to me, and I think the documentation should change. Why taint on \w and not on [:word:], for example; they mean the exact same thing. And why not taint on \d or [:digit:]? It's true that most locales don't define them other than [0-9], but it's legal to do so, some locales do define them otherwise, and Perl relies on those definitions. So I think tainting should happen for all POSIX classes; this is what is currently implemented. Any disagreement? Second, this is tainted: "abc" =~ /(abc)/i simply because of the /i. I don't think it should be. Similarly backreferences like \1 will always be tainted under /i even if they match regardless of the effective locale. Do people think these are worth fixing?
To: perlbug-followup [...] perl.org
Date: Thu, 05 Dec 2013 07:19:19 +0100
Subject: Re: [perl #120675] Unexpected tainting via regex using locale
From: Martin von Gagern <Martin.vGagern [...] gmx.net>
Download (untitled) / with headers
text/plain 1.3k
On 05.12.2013 04:12, karl williamson via RT wrote: Show quoted text
> This tainting should > not have been turned on unless there was an attempt to match > case-insensitively or against a POSIX class, both of which are affected > by locale.
I guess character ranges depend on collation and therefore on locale. Show quoted text
> First, the documentation says that tainting is only done for things that > use \w, \s, and their complements. In fact, tainting is done for any > POSIX class. That seems correct to me, and I think the documentation > should change.
I agree. The doc should reliably describe actual behaviour. I even wonder whether you should add a note that up to the current version, any bracketed character list would taint, to help portability across versions. Not sure whether I've ever seen anything like this i perl docs, but I know some projects do this kind of thing. Show quoted text
> Second, this is tainted: > > "abc" =~ /(abc)/i > > simply because of the /i. I don't think it should be. Similarly > backreferences like \1 will always be tainted under /i even if they > match regardless of the effective locale. Do people think these are > worth fixing?
I'd say keep it as is. If I were to write such a /(abc)/i line, I'd probably think that users may well enter "ABC", but during the development I'd only test with "abc". So I wouldn't notice the taint during development, and would be surprised if users reported problems.
From: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
Subject: Re: [perl #120675] Unexpected tainting via regex using locale
CC: perlbug-followup [...] perl.org
To: Martin von Gagern <Martin.vGagern [...] gmx.net>
Date: Fri, 6 Dec 2013 21:06:15 -0500
Download (untitled) / with headers
text/plain 709b
* Martin von Gagern <Martin.vGagern@gmx.net> [2013-12-05T01:19:19] Show quoted text
> > Second, this is tainted: > > > > "abc" =~ /(abc)/i > > > > simply because of the /i. I don't think it should be. Similarly > > backreferences like \1 will always be tainted under /i even if they > > match regardless of the effective locale. Do people think these are > > worth fixing?
> > I'd say keep it as is. If I were to write such a /(abc)/i line, I'd > probably think that users may well enter "ABC", but during the > development I'd only test with "abc". So I wouldn't notice the taint > during development, and would be surprised if users reported problems.
I'm not sure I understand this line of argument. -- rjbs
Download signature.asc
application/pgp-signature 490b

Message body not shown because it is not plain text.

From: Darin McBride <dmcbride [...] cpan.org>
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
To: perl5-porters [...] perl.org, perlbug-followup [...] perl.org
Subject: Re: [perl #120675] Unexpected tainting via regex using locale
Date: Fri, 06 Dec 2013 19:39:49 -0700
Download (untitled) / with headers
text/plain 1.4k
On Friday December 6 2013 9:06:15 PM Ricardo Signes wrote: Show quoted text
> * Martin von Gagern <Martin.vGagern@gmx.net> [2013-12-05T01:19:19] >
> > > Second, this is tainted: > > > "abc" =~ /(abc)/i > > > > > > simply because of the /i. I don't think it should be. Similarly > > > backreferences like \1 will always be tainted under /i even if they > > > match regardless of the effective locale. Do people think these are > > > worth fixing?
> > > > I'd say keep it as is. If I were to write such a /(abc)/i line, I'd > > probably think that users may well enter "ABC", but during the > > development I'd only test with "abc". So I wouldn't notice the taint > > during development, and would be surprised if users reported problems.
> > I'm not sure I understand this line of argument.
If "abc" =~ /(abc)/i leaves $1 not tainted, but "ABC" =~ /(abc)/i does leave $1 tainted, this may be a surprise. Since most of us will test with lower case abc, but fewer of us will remember to test some/all upper case, this surprise can result in tainting at runtime that we're not prepared for. If, however, the /i makes $1 always tainted under locale, even if the input is a raw match without requiring i, then all tests will get the tainting and the developer will be forced to deal with tainting all the time, including their own tests. This makes it harder to miss the tainting of /i if it always taints under locale rather than just sometimes.
Download signature.asc
application/pgp-signature 198b

Message body not shown because it is not plain text.

To: perl5-porters [...] perl.org
Subject: Re: [perl #120675] Unexpected tainting via regex using locale
Date: Sat, 7 Dec 2013 09:07:34 +0100
From: Aristotle Pagaltzis <pagaltzis [...] gmx.de>
* Darin McBride <dmcbride@cpan.org> [2013-12-07 03:45]: Show quoted text
> * Ricardo Signes <perl.p5p@rjbs.manxome.org> [2013-12-07 03:10]:
> > * Martin von Gagern <Martin.vGagern@gmx.net> [2013-12-05 19:00]:
> > > * Karl Williamson <public@khwilliamson.com> [2013-12-05 04:15]:
> > > > Second, this is tainted: > > > > "abc" =~ /(abc)/i > > > > > > > > simply because of the /i. I don't think it should be. Similarly > > > > backreferences like \1 will always be tainted under /i even if > > > > they match regardless of the effective locale. Do people think > > > > these are worth fixing?
> > > > > > I'd say keep it as is. If I were to write such a /(abc)/i line, > > > I'd probably think that users may well enter "ABC", but during the > > > development I'd only test with "abc". So I wouldn't notice the > > > taint during development, and would be surprised if users reported > > > problems.
> > > > I'm not sure I understand this line of argument.
> > If "abc" =~ /(abc)/i leaves $1 not tainted, but "ABC" =~ /(abc)/i does > leave $1 tainted, this may be a surprise.
Err. Except the only way I can read Karl’s proposal is that he thinks that using /i itself should never taint captures, only locale-dependent classes should do that. Nor did I see him mention input string casing at all, let alone propose that tainting should depend on it. I cannot even imagine how it would seem plausible to anyone that he proposed that… but maybe I am therefore misreading him? Because if not, Show quoted text
> Since most of us will test with lower case abc, but fewer of us will > remember to test some/all upper case, this surprise can result in > tainting at runtime that we're not prepared for. > > If, however, the /i makes $1 always tainted under locale, even if the > input is a raw match without requiring i, then all tests will get the > tainting and the developer will be forced to deal with tainting all > the time, including their own tests. This makes it harder to miss the > tainting of /i if it always taints under locale rather than just > sometimes.
… this argument is all quite moot, no? -- Aristotle Pagaltzis // <http://plasmasturm.org/>
From: Karl Williamson <public [...] khwilliamson.com>
Date: Sat, 07 Dec 2013 09:28:45 -0700
Subject: Re: [perl #120675] Unexpected tainting via regex using locale
To: Aristotle Pagaltzis <pagaltzis [...] gmx.de>, perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 3.9k
On 12/07/2013 01:07 AM, Aristotle Pagaltzis wrote: Show quoted text
> * Darin McBride <dmcbride@cpan.org> [2013-12-07 03:45]:
>> * Ricardo Signes <perl.p5p@rjbs.manxome.org> [2013-12-07 03:10]:
>>> * Martin von Gagern <Martin.vGagern@gmx.net> [2013-12-05 19:00]:
>>>> * Karl Williamson <public@khwilliamson.com> [2013-12-05 04:15]:
>>>>> Second, this is tainted: >>>>> "abc" =~ /(abc)/i >>>>> >>>>> simply because of the /i. I don't think it should be. Similarly >>>>> backreferences like \1 will always be tainted under /i even if >>>>> they match regardless of the effective locale. Do people think >>>>> these are worth fixing?
>>>> >>>> I'd say keep it as is. If I were to write such a /(abc)/i line, >>>> I'd probably think that users may well enter "ABC", but during the >>>> development I'd only test with "abc". So I wouldn't notice the >>>> taint during development, and would be surprised if users reported >>>> problems.
>>> >>> I'm not sure I understand this line of argument.
>> >> If "abc" =~ /(abc)/i leaves $1 not tainted, but "ABC" =~ /(abc)/i does >> leave $1 tainted, this may be a surprise.
> > Err. Except the only way I can read Karl’s proposal is that he thinks > that using /i itself should never taint captures, only locale-dependent > classes should do that. Nor did I see him mention input string casing at > all, let alone propose that tainting should depend on it. I cannot even > imagine how it would seem plausible to anyone that he proposed that…
And I wasn't proposing that. What I was suggesting is that: only if folding were actually needed to do the match should there be tainting. How it would work would be to try an exact match, and only if that failed would a fold match be tried and tainting turned on (regardless of the outcome of that fold match). I hope this is clearer. but Show quoted text
> maybe I am therefore misreading him? Because if not, >
>> Since most of us will test with lower case abc, but fewer of us will >> remember to test some/all upper case, this surprise can result in >> tainting at runtime that we're not prepared for. >> >> If, however, the /i makes $1 always tainted under locale, even if the >> input is a raw match without requiring i, then all tests will get the >> tainting and the developer will be forced to deal with tainting all >> the time, including their own tests. This makes it harder to miss the >> tainting of /i if it always taints under locale rather than just >> sometimes.
> > … this argument is all quite moot, no? >
So the argument is not moot, and I think has much to recommend it. To put it more succinctly (in my understanding): If there exists a target string for which a given pattern match requires locale information in order to determine if that string matches or not, then the results are tainted for every string regardless of whether or not this particular string actually invoked the portions of the pattern where locales make a difference. But that's not how it works now. Taken to its logical conclusion, tainting pattern matches becomes much simpler. It's turned on at the beginning of matching even if we don't actually attempt the match because the optimizer says the match cannot succeed because of, say, we know that no string shorter than 15 bytes can match and the current string is 14. It has worked for some time (perhaps forever; I haven't done the digging) that if a string matches a character in a bracketed character class, tainting is not turned on. My proposal was to have the same behavior for matches outside character classes, so that things were consistent. However it is simpler and faster to just say that if the /l pattern contains any POSIX classes (including things like \w) or /i matching, that tainting is turned on for any attempted match of it. Because the behavior is currently inconsistent, something should be changed to make it consistent. I think Martin's approach is the one that leads to the fewest surprises; but will change the behavior that existing programs might have come to rely on.
To: perl5-porters [...] perl.org
Date: Sat, 7 Dec 2013 18:17:41 +0100
Subject: Re: [perl #120675] Unexpected tainting via regex using locale
From: Aristotle Pagaltzis <pagaltzis [...] gmx.de>
Download (untitled) / with headers
text/plain 2.8k
* Karl Williamson <public@khwilliamson.com> [2013-12-07 17:30]: Show quoted text
> What I was suggesting is that: only if folding were actually needed to > do the match should there be tainting. How it would work would be to > try an exact match, and only if that failed would a fold match be > tried and tainting turned on (regardless of the outcome of that fold > match). I hope this is clearer.
Ah. Then I am surprised you would propose that, and I have to agree with the others. Show quoted text
> So the argument is not moot, and I think has much to recommend it. To > put it more succinctly (in my understanding): If there exists a target > string for which a given pattern match requires locale information in > order to determine if that string matches or not, then the results are > tainted for every string regardless of whether or not this particular > string actually invoked the portions of the pattern where locales make > a difference.
Yes. Tainting of captures would be a property of the pattern determined at (pattern) compile time, and would then apply to each and every match the regexp is used in. Show quoted text
> But that's not how it works now. Taken to its logical conclusion, > tainting pattern matches becomes much simpler. It's turned on at the > beginning of matching even if we don't actually attempt the match > because the optimizer says the match cannot succeed because of, say, > we know that no string shorter than 15 bytes can match and the current > string is 14.
Indeed. It’s a property of the regexp – nothing about the input matters. Show quoted text
> It has worked for some time (perhaps forever; I haven't done the > digging) that if a string matches a character in a bracketed character > class, tainting is not turned on. My proposal was to have the same > behavior for matches outside character classes, so that things were > consistent. However it is simpler and faster to just say that if the > /l pattern contains any POSIX classes (including things like \w) or /i > matching, that tainting is turned on for any attempted match of it.
(I was thinking /i didn’t necessarily have to cause tainting because for some reason I was thinking there are characters whose casing rules does not change between locales, e.g. “a”. Then I realised, this may be true in practice (among sane locales) but it is not *true by definition*, so patterns must always taint under /li.) Show quoted text
> Because the behavior is currently inconsistent, something should be > changed to make it consistent. I think Martin's approach is the one > that leads to the fewest surprises; but will change the behavior that > existing programs might have come to rely on.
If a program depends on the current behaviour, can it be correct? If not, how much of the time is it likely to exhibit its brokenness? (Or conversely: how much of the time is it likely to accidentally do the right thing regardless?) How widespread is the breakage likely to be? Regards, -- Aristotle Pagaltzis // <http://plasmasturm.org/>
From: Karl Williamson <public [...] khwilliamson.com>
To: Aristotle Pagaltzis <pagaltzis [...] gmx.de>, perl5-porters [...] perl.org
Subject: Re: [perl #120675] Unexpected tainting via regex using locale
Date: Wed, 11 Dec 2013 16:19:00 -0700
Download (untitled) / with headers
text/plain 1.4k
On 12/07/2013 10:17 AM, Aristotle Pagaltzis wrote: Show quoted text
> (I was thinking /i didn’t necessarily have to cause tainting because for > some reason I was thinking there are characters whose casing rules does > not change between locales, e.g. “a”. Then I realised, this may be true > in practice (among sane locales) but it is not*true by definition*, so > patterns must always taint under /li.)
http://en.wikipedia.org/wiki/ISO/IEC_646 lists several locales which exist where casing rules do change Here is their text: "There are also some 7-bit character sets that are not officially part of the ISO 646 standard. Examples include: 7-bit Greek, ELOT 927. The Greek alphabet is mapped to positions 0x61–0x71 and 0x73–0x79, on top of the Latin lowercase letters. 7-bit Cyrillic, KOI-7 or Short KOI. The Cyrillic characters are mapped to positions 0x60–0x7E, on top of the Latin lowercase letters. Superseded by the KOI-8 variants. 7-bit Hebrew, SI 960. The Hebrew alphabet is mapped to positions 0x60–0x7A, on top of the lowercase Latin letters (and grave accent for aleph). 7-bit Hebrew was always stored in visual order. This mapping with the high bit set, i.e. with the Hebrew letters in 0xE0–0xFA, is ISO 8859-8. 7-bit Arabic, ASMO 449. The Arabic alphabet is mapped to positions 0x41–0x5A and 0x60–0x6A, on top of both uppercase and lowercase Latin letters. This mapping with the high bit set is ISO 8859-6."
Subject: What should tainting behavior be? Was [perl #120675] Unexpected tainting via regex using locale
To: Aristotle Pagaltzis <pagaltzis [...] gmx.de>, perl5-porters [...] perl.org
From: Karl Williamson <public [...] khwilliamson.com>
Date: Wed, 15 Jan 2014 13:18:31 -0700
Download (untitled) / with headers
text/plain 3.5k
On 12/07/2013 10:17 AM, Aristotle Pagaltzis wrote: Show quoted text
> Yes. Tainting of captures would be a property of the pattern determined > at (pattern) compile time, and would then apply to each and every match > the regexp is used in. >
>> >But that's not how it works now. Taken to its logical conclusion, >> >tainting pattern matches becomes much simpler. It's turned on at the >> >beginning of matching even if we don't actually attempt the match >> >because the optimizer says the match cannot succeed because of, say, >> >we know that no string shorter than 15 bytes can match and the current >> >string is 14.
> Indeed. It’s a property of the regexp – nothing about the input matters. >
I am now uncertain about how tainting should behave. What Aristotle (and others) say makes sense to me, but it isn't the way I've coded a bunch of stuff, which was based on how it looked to me how things behaved. I have now reexamined some of the source code for 5.8.9, and tainting in regexes in the places I looked at is turned on based on the target string contents, rather than being a property of the regex. Also, a regex can have portions where locale matters, and portions where it doesn't. So I'd like to get more of a consensus before changing how things work. Here's my current quandary. It used to be that if you had a string encoded in some locale, say Latin7 -- Greek, and somehow, say through calling a function in some module, the UTF-8 flag gets turned on. Suddenly everything changes, your SMALL CHI becomes a DIVISION SIGN, for example, without notice, even though the underlying code point hasn't budged. A UTF-8 encoded string was not considered locale, and upper or lowercasing such a string never tainted, even if the string was originally derived from locale rules, hence taintable. I mostly solved this problem by partitioning the code point space into two ranges: 0-255 and 256-infinity. Hence now, under locale rules, the code points 0-255 are considered to be what the current locale says they should be, regardless of the string's UTF8ness. The code points 256 and up represent what Unicode says they should represent. (The POSIX locale system does not have names for code points, so Perl doesn't know that your SMALL CHI is that character, but it does know from the libc functions that it is a lowercase letter, and what the code point of its uppercase version is, etc.) This solution isn't perfect, but if your application is well-behaved and isn't mixing Unicode and non-Unicode characters, it does completely solve the issue of your strings utf8 flag unwittingly getting set. The reason I bring this up is the way I made tainting work on these strings. When operating on a string, the result is tainted if and only if it contains a code point in the 0-255 range. And that is content-related. Previously no UTF-8 encoded string was checked, and that is also effectively content related. So now, I'm working on extending Perl to work with UTF-8 locales, and as long as it thinks the locale is UTF-8, it won't actually look at the locale's rules, which could therefore be malicious or just plain screwed up. Instead, it uses the known Unicode rules built into Perl, so one could argue that it shouldn't taint. On the other hand, it should taint in any other locale type, so tainting or not becomes dependent on the underlying locale. (This is really no different than the situation I inherited where tainting or not was dependent on the utf8 flag). Or one could say that tainting should happen for every locale, for consistency and for developers to have tests that are not locale-dependent. Opinions?
Subject: Re: What should tainting behavior be? Was [perl #120675] Unexpected tainting via regex using locale
To: perl5-porters [...] perl.org
From: Zefram <zefram [...] fysh.org>
Date: Thu, 16 Jan 2014 11:46:54 +0000
Download (untitled) / with headers
text/plain 2.6k
Karl Williamson wrote: Show quoted text
> Hence now, under locale >rules, the code points 0-255 are considered to be what the current >locale says they should be, regardless of the string's UTF8ness. The >code points 256 and up represent what Unicode says they should >represent.
This sounds broken: in your example Latin-7 locale you now have two representations of U+3c7 Greek small letter chi (at 0xf7 and 0x3c7) and no representations of U+f7 division sign. I think we should have a big notice in the documentation to the effect that non-UTF-8 locales and Unicode don't mix. Possibly more sensible behaviour for this situation would be that 0x0 to 0xff get locale behaviour and 0x100 upwards get null behaviour (don't match any properties, case convert to self, as if unassigned by Unicode). Show quoted text
> On the other hand, >it should taint in any other locale type, so tainting or not becomes >dependent on the underlying locale.
What determined the choice of locale? An environment variable? Taint! Fleshing out a bit: logically the choice of locale may be tainted. Normally the choice will come from an environment variable, which is itself tainted, so the locale selection is tainted. Then any decision that depends on the choice of locale, including whether to use Unicode character semantics based on whether it's a UTF-8 locale, is itself tainted. So in this system, with a tainted locale every string operation that depends on character properties introduces taint, even if the properties themselves are clean. But in theory a program could set the locale environment variables itself from a clean source, get an untainted choice of locale, and then use character properties without tainting otherwise-clean data. This would lead to a lot of tainting. You might think it looks like too much tainting. But any reduction, such as treating Unicode character properties used due to a UTF-8 locale as untainted, amounts to adding an explicit exception to the taint rules. Perl has a small number of well-defined exceptions to tainting, and increasing the set sounds like a recipe for security problems. Programmers can always explicitly untaint the locale choice if they want that behaviour, so the downside of being surprised by broader tainting is limited. Show quoted text
> Or one could say that tainting should happen for >every locale, for consistency and for developers to have tests that >are not locale-dependent.
This is effectively the behaviour that falls out of my analysis. Locale-dependent character properties are tainted (if the choice of locale is tainted) regardless of which properties those actually are. -zefram
Date: 16 Jan 2014 14:27:34 -0000
To: perl5-porters [...] perl.org
Subject: Re: What should tainting behavior be? Was [perl #120675] Unexpected tainting via regex using locale
From: Father Chrysostomos <sprout [...] cpan.org>
Download (untitled) / with headers
text/plain 994b
Zefram wrote: Show quoted text
> This sounds broken: in your example Latin-7 locale you now have two > representations of U+3c7 Greek small letter chi (at 0xf7 and 0x3c7) > and no representations of U+f7 division sign.
But that is the only sane model that sufficiently preserves backward compatibility. Show quoted text
> I think we should have a > big notice in the documentation to the effect that non-UTF-8 locales and > Unicode don't mix.
Agreed. But we cannot expect people to prevent strings from being upgraded, since perl does that transparently. (E.g., put your locale and Unicode string in the same data store, and then extract them. Flags could flip either way depending on how the data are stored.) Show quoted text
> Possibly more sensible behaviour for this situation > would be that 0x0 to 0xff get locale behaviour and 0x100 upwards get > null behaviour (don't match any properties, case convert to self, as if > unassigned by Unicode).
That would be something completely new, which would likely break existing programs.
From: Steffen Mueller <smueller [...] cpan.org>
Subject: Re: What should tainting behavior be? Was [perl #120675] Unexpected tainting via regex using locale
To: perl5-porters [...] perl.org
Date: Thu, 16 Jan 2014 18:40:58 +0100
Download (untitled) / with headers
text/plain 486b
On 01/16/2014 03:27 PM, Father Chrysostomos wrote: Show quoted text
> Zefram wrote:
>> This sounds broken: in your example Latin-7 locale you now have two >> representations of U+3c7 Greek small letter chi (at 0xf7 and 0x3c7) >> and no representations of U+f7 division sign.
> > But that is the only sane model that sufficiently preserves backward > compatibility.
Agreed as far as I can tell. There's a sane model that doesn't preserve backward compatibility, though. Lose taint support. --Steffen
Subject: Re: [perl #120675] Unexpected tainting via regex using locale
To: Dave Mitchell <davem [...] iabyn.com>
From: Abigail <abigail [...] abigail.be>
CC: perl5-porters [...] perl.org
Date: Thu, 16 Jan 2014 19:27:12 +0100
Download (untitled) / with headers
text/plain 1.6k
On Wed, Dec 04, 2013 at 05:14:18PM +0000, Dave Mitchell wrote: Show quoted text
> On Tue, Dec 03, 2013 at 11:54:15AM -0800, Martin.vGagern@gmx.net wrote:
> > When "use locale" is in effect, then the regular expression match > > m/^(.*)[._](.*?)$/ taints its first capture group even if the input is not > > tainted. The perllocale manual mentions locale-dependent regular > > expressions which taint their capture groups, but the way I read that > > documentation, only regular expressions using one of \w, \W, \s or \S should > > be affected. So either the above regular expression should not taint the > > group, or the documentation is incomplete and should be more precise. > > > > Here is a small self-contained complete example: > > > > #!/usr/bin/perl -T > > use strict; > > use warnings; > > use locale; > > use Scalar::Util qw(tainted); > > my $var = "foo.bar_baz"; > > $var =~ m/^(.*)[._](.*?)$/; > > print(tainted($1) ? "tainted\n" : "untainted\n");
> > It's the docs that were unclear, which I've amened with commit > > 9fc477bf4a1f21e94d5dfe8e99d8a93308d5388c > > Locale info taints the whole regex, not just individual captures, since it > may affect other captures. As a trivial example, in > > /(\w+)(.)/ > > $2 needs to be tainted, since the tainted \w+ may eat fewer or more > characters than you expect, altering what $2 captures.
In fact, whether the expression matches or fails may depend on the locale. And hence, there's something to be said that despite it not having capturing parenthesis, /\w/ should taint $1 and friends. After all, an unexpected locale may cause the pattern to fail, leaving $1 to whatever it was, instead of setting it to undefined. Abigail
CC: perl5-porters [...] perl.org
From: Karl Williamson <public [...] khwilliamson.com>
To: Abigail <abigail [...] abigail.be>, Dave Mitchell <davem [...] iabyn.com>
Subject: Re: [perl #120675] Unexpected tainting via regex using locale
Date: Wed, 22 Jan 2014 13:05:57 -0700
Download (untitled) / with headers
text/plain 2.5k
On 01/16/2014 11:27 AM, Abigail wrote: Show quoted text
> On Wed, Dec 04, 2013 at 05:14:18PM +0000, Dave Mitchell wrote:
>> On Tue, Dec 03, 2013 at 11:54:15AM -0800, Martin.vGagern@gmx.net wrote:
>>> When "use locale" is in effect, then the regular expression match >>> m/^(.*)[._](.*?)$/ taints its first capture group even if the input is not >>> tainted. The perllocale manual mentions locale-dependent regular >>> expressions which taint their capture groups, but the way I read that >>> documentation, only regular expressions using one of \w, \W, \s or \S should >>> be affected. So either the above regular expression should not taint the >>> group, or the documentation is incomplete and should be more precise. >>> >>> Here is a small self-contained complete example: >>> >>> #!/usr/bin/perl -T >>> use strict; >>> use warnings; >>> use locale; >>> use Scalar::Util qw(tainted); >>> my $var = "foo.bar_baz"; >>> $var =~ m/^(.*)[._](.*?)$/; >>> print(tainted($1) ? "tainted\n" : "untainted\n");
>> >> It's the docs that were unclear, which I've amened with commit >> >> 9fc477bf4a1f21e94d5dfe8e99d8a93308d5388c >> >> Locale info taints the whole regex, not just individual captures, since it >> may affect other captures. As a trivial example, in >> >> /(\w+)(.)/ >> >> $2 needs to be tainted, since the tainted \w+ may eat fewer or more >> characters than you expect, altering what $2 captures.
> > > In fact, whether the expression matches or fails may depend on the locale. > > And hence, there's something to be said that despite it not having > capturing parenthesis, /\w/ should taint $1 and friends. After all, an > unexpected locale may cause the pattern to fail, leaving $1 to whatever > it was, instead of setting it to undefined. > > > > Abigail >
In looking at this in more detail, I see more issues. I think the best way to do things, is to follow something like: "If there exists input to an operation which would have locale-dependent results, that operation taints regardless of the current input" But as I said before, I haven't coded things that way, because the way it worked often was based on the particular input. In looking at what it would take to get the casing operations to change to this rule, I ran across another example. We don't change if we try to change the case of an empty string. But the rule I just stated would indicate we should. Also, we don't taint TRUE/FALSE results. I don't know the logic behind that decision, except probably implementation issues, both in the core and the perl program.
Date: 22 Jan 2014 21:12:17 -0000
From: Father Chrysostomos <sprout [...] cpan.org>
To: perl5-porters [...] perl.org
Subject: Re: [perl #120675] Unexpected tainting via regex using locale
Download (untitled) / with headers
text/plain 279b
Karl Williamson wrote: Show quoted text
> Also, we don't taint TRUE/FALSE results. I don't know the logic behind > that decision, except probably implementation issues, both in the core > and the perl program.
AFAIK, Perl has never had tainted booleans. (I have some code that assumes that.)
Subject: Re: [perl #120675] Unexpected tainting via regex using locale
To: Karl Williamson <public [...] khwilliamson.com>
From: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
CC: Abigail <abigail [...] abigail.be>, Dave Mitchell <davem [...] iabyn.com>, perl5-porters [...] perl.org
Date: Sat, 25 Jan 2014 14:46:16 -0500
Download (untitled) / with headers
text/plain 1.4k
* Karl Williamson <public@khwilliamson.com> [2014-01-22T15:05:57] Show quoted text
> In looking at this in more detail, I see more issues. I think the > best way to do things, is to follow something like: > > "If there exists input to an operation which would have > locale-dependent results, that operation taints regardless of the > current input"
I think I agree with you, but want to clarify. Also, I should state that I'm a not a heavy user of locales *or* taint mode, so I'm open to having my underlying assumptions corrected! This means, in part, that this only happens in the scope of the locale pragma. That is, because the operation can't rely on locale without that pragma in effect, it can't taint. (I *think* that the /l modifier would also make this condition true, even outside `use locale`, right?) In other words, if you use locale, you're likely to start seeing a lot more taint. Then again, taint mode looks for conditions resulting from user input, not any ol' thing. If I enable locales and set the local by hand with a constant string, then the locale should not introduce taint, should it? Perhaps the best paradigm for this is that the locale itself can be tainted: if our locale comes from ENV, it is tainted. If it's set, with setlocale, *with a tainted parameter*, it is tainted. Then, when the locale setting can affect the expression, the result is tainted. This is my intuition on the whole thing. -- rjbs
Download signature.asc
application/pgp-signature 490b

Message body not shown because it is not plain text.

To: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
Subject: Re: [perl #120675] Unexpected tainting via regex using locale
CC: Abigail <abigail [...] abigail.be>, Dave Mitchell <davem [...] iabyn.com>, perl5-porters [...] perl.org
From: Karl Williamson <public [...] khwilliamson.com>
Date: Sat, 25 Jan 2014 14:08:02 -0700
Download (untitled) / with headers
text/plain 2.1k
On 01/25/2014 12:46 PM, Ricardo Signes wrote: Show quoted text
> * Karl Williamson <public@khwilliamson.com> [2014-01-22T15:05:57]
>> In looking at this in more detail, I see more issues. I think the >> best way to do things, is to follow something like: >> >> "If there exists input to an operation which would have >> locale-dependent results, that operation taints regardless of the >> current input"
> > I think I agree with you, but want to clarify. > > Also, I should state that I'm a not a heavy user of locales *or* taint mode, so > I'm open to having my underlying assumptions corrected! > > This means, in part, that this only happens in the scope of the locale pragma. > That is, because the operation can't rely on locale without that pragma in > effect, it can't taint. (I *think* that the /l modifier would also make this > condition true, even outside `use locale`, right?)
Yes. Show quoted text
> > In other words, if you use locale, you're likely to start seeing a lot more > taint. Then again, taint mode looks for conditions resulting from user input, > not any ol' thing. If I enable locales and set the local by hand with a > constant string, then the locale should not introduce taint, should it? > > Perhaps the best paradigm for this is that the locale itself can be tainted: if > our locale comes from ENV, it is tainted. If it's set, with setlocale, *with a > tainted parameter*, it is tainted. Then, when the locale setting can affect > the expression, the result is tainted. > > This is my intuition on the whole thing. >
rjbs, and I clarified things on irc. This quote added in 1998 from perllocale gives the rationale "Locales--particularly on systems that allow unprivileged users to build their own locales--are untrustworthy. A malicious (or just plain broken) locale can make a locale-aware application give unexpected results." The bottom line is we are moving to the policy that tainting is based on the operation and being in locale, without regard to the particular operand's contents passed this time to the operation. This means simpler core code and more consistent tainting results. And it lessens the likelihood that there are paths in the core that should taint but don't
From: Aristotle Pagaltzis <pagaltzis [...] gmx.de>
To: perl5-porters [...] perl.org
Subject: Re: What should tainting behavior be? Was [perl #120675] Unexpected tainting via regex using locale
Date: Sun, 26 Jan 2014 00:06:48 +0100
Download (untitled) / with headers
text/plain 1.7k
* Father Chrysostomos <sprout@cpan.org> [2014-01-16 15:30]: Show quoted text
> Zefram wrote:
> > This sounds broken: in your example Latin-7 locale you now have two > > representations of U+3c7 Greek small letter chi (at 0xf7 and 0x3c7) > > and no representations of U+f7 division sign.
> > But that is the only sane model that sufficiently preserves backward > compatibility.
So far, I agree. Show quoted text
> > I think we should have a big notice in the documentation to the > > effect that non-UTF-8 locales and Unicode don't mix.
> > Agreed. But we cannot expect people to prevent strings from being > upgraded, since perl does that transparently. (E.g., put your locale > and Unicode string in the same data store, and then extract them. > Flags could flip either way depending on how the data are stored.)
No, we cannot expect people to prevent strings from being upgraded. But *can* expect them not to perform locale-based processing on character strings (rather than octet strings). Now we cannot know what strings are octet strings, we do have at least one class of strings that cannot possibly be octet strings: those that contain elements > 0xFF. So when a locale-based regexp sees one of those, something is unambiguously broken. Show quoted text
> > Possibly more sensible behaviour for this situation would be that > > 0x0 to 0xff get locale behaviour and 0x100 upwards get null > > behaviour (don't match any properties, case convert to self, as if > > unassigned by Unicode).
> > That would be something completely new, which would likely break > existing programs.
I agree this isn’t feasible. But maybe we can follow the precedent of other octet-based functions: have locale-based regexps throw a “wide character” warning any time they encounter one. Regards, -- Aristotle Pagaltzis // <http://plasmasturm.org/>
Date: Mon, 27 Jan 2014 11:30:03 -0700
From: Karl Williamson <public [...] khwilliamson.com>
Subject: Re: What should tainting behavior be? Was [perl #120675] Unexpected tainting via regex using locale
To: Aristotle Pagaltzis <pagaltzis [...] gmx.de>, perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 1.8k
On 01/25/2014 04:06 PM, Aristotle Pagaltzis wrote: Show quoted text
> * Father Chrysostomos <sprout@cpan.org> [2014-01-16 15:30]:
>> Zefram wrote:
>>> This sounds broken: in your example Latin-7 locale you now have two >>> representations of U+3c7 Greek small letter chi (at 0xf7 and 0x3c7) >>> and no representations of U+f7 division sign.
>> >> But that is the only sane model that sufficiently preserves backward >> compatibility.
> > So far, I agree. >
>>> I think we should have a big notice in the documentation to the >>> effect that non-UTF-8 locales and Unicode don't mix.
>> >> Agreed. But we cannot expect people to prevent strings from being >> upgraded, since perl does that transparently. (E.g., put your locale >> and Unicode string in the same data store, and then extract them. >> Flags could flip either way depending on how the data are stored.)
> > No, we cannot expect people to prevent strings from being upgraded. But > *can* expect them not to perform locale-based processing on character > strings (rather than octet strings). Now we cannot know what strings are > octet strings, we do have at least one class of strings that cannot > possibly be octet strings: those that contain elements > 0xFF. So when > a locale-based regexp sees one of those, something is unambiguously > broken. >
>>> Possibly more sensible behaviour for this situation would be that >>> 0x0 to 0xff get locale behaviour and 0x100 upwards get null >>> behaviour (don't match any properties, case convert to self, as if >>> unassigned by Unicode).
>> >> That would be something completely new, which would likely break >> existing programs.
> > I agree this isn’t feasible. But maybe we can follow the precedent of > other octet-based functions: have locale-based regexps throw a “wide > character” warning any time they encounter one.
That seems reasonable to me.
From: demerphq <demerphq [...] gmail.com>
To: Perl5 Porteros <perl5-porters [...] perl.org>
Subject: Re: What should tainting behavior be? Was [perl #120675] Unexpected tainting via regex using locale
Date: Tue, 28 Jan 2014 02:47:43 +0800
Download (untitled) / with headers
text/plain 383b
On 16 January 2014 22:27, Father Chrysostomos <sprout@cpan.org> wrote: Show quoted text
> Agreed. But we cannot expect people to prevent strings from being > upgraded, since perl does that transparently.
FWIW Figuring out a way to mark a string as not being upgradable, and causing Perl to die if one tries is on my todo list right now. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
To: Perl5 Porteros <perl5-porters [...] perl.org>
Subject: Re: What should tainting behavior be? Was [perl #120675] Unexpected tainting via regex using locale
From: Zefram <zefram [...] fysh.org>
Date: Mon, 27 Jan 2014 19:55:21 +0000
Download (untitled) / with headers
text/plain 409b
demerphq wrote: Show quoted text
>FWIW Figuring out a way to mark a string as not being upgradable,
Sounds like a bad idea. What exactly do you mean by "a string" here? Is this a status that would be preserved across copying such as scalar assignment? If so, adding any new hidden flag is a bad idea, but one that prevents using what is otherwise always a legitimate representation of a string is especially bad. -zefram
From: Karl Williamson <public [...] khwilliamson.com>
To: Aristotle Pagaltzis <pagaltzis [...] gmx.de>, perl5-porters [...] perl.org
Subject: Re: What should tainting behavior be? Was [perl #120675] Unexpected tainting via regex using locale
Date: Mon, 29 Dec 2014 13:53:52 -0700
Download (untitled) / with headers
text/plain 1.9k
On 01/27/2014 11:30 AM, Karl Williamson wrote: Show quoted text
> On 01/25/2014 04:06 PM, Aristotle Pagaltzis wrote:
>> * Father Chrysostomos <sprout@cpan.org> [2014-01-16 15:30]:
>>> Zefram wrote:
>>>> This sounds broken: in your example Latin-7 locale you now have two >>>> representations of U+3c7 Greek small letter chi (at 0xf7 and 0x3c7) >>>> and no representations of U+f7 division sign.
>>> >>> But that is the only sane model that sufficiently preserves backward >>> compatibility.
>> >> So far, I agree. >>
>>>> I think we should have a big notice in the documentation to the >>>> effect that non-UTF-8 locales and Unicode don't mix.
>>> >>> Agreed. But we cannot expect people to prevent strings from being >>> upgraded, since perl does that transparently. (E.g., put your locale >>> and Unicode string in the same data store, and then extract them. >>> Flags could flip either way depending on how the data are stored.)
>> >> No, we cannot expect people to prevent strings from being upgraded. But >> *can* expect them not to perform locale-based processing on character >> strings (rather than octet strings). Now we cannot know what strings are >> octet strings, we do have at least one class of strings that cannot >> possibly be octet strings: those that contain elements > 0xFF. So when >> a locale-based regexp sees one of those, something is unambiguously >> broken. >>
>>>> Possibly more sensible behaviour for this situation would be that >>>> 0x0 to 0xff get locale behaviour and 0x100 upwards get null >>>> behaviour (don't match any properties, case convert to self, as if >>>> unassigned by Unicode).
>>> >>> That would be something completely new, which would likely break >>> existing programs.
>> >> I agree this isn’t feasible. But maybe we can follow the precedent of >> other octet-based functions: have locale-based regexps throw a “wide >> character” warning any time they encounter one.
> > That seems reasonable to me. >
Now done in 613abc6d16e99bd9834fe6afd79beb61a3a4734d
Date: Mon, 29 Dec 2014 22:15:15 +0000
From: Dave Mitchell <davem [...] iabyn.com>
To: Karl Williamson <public [...] khwilliamson.com>
CC: Aristotle Pagaltzis <pagaltzis [...] gmx.de>, perl5-porters [...] perl.org
Subject: Re: What should tainting behavior be? Was [perl #120675] Unexpected tainting via regex using locale
On Mon, Dec 29, 2014 at 01:53:52PM -0700, Karl Williamson wrote: Show quoted text
> Now done in 613abc6d16e99bd9834fe6afd79beb61a3a4734d
Karl, that commit's failing very badly on my system on debugging builds. Several test are dying with assertion failures. I can reduce charset.t to this code: use POSIX 'locale_h'; my $l = 'zh_HK.big5hkscs'; setlocale(&POSIX::LC_CTYPE(), $l) or die; use locale; no warnings 'locale'; # We may be trying out a weird locale setlocale(&POSIX::LC_CTYPE(), $l) or die; my $c = chr utf8::unicode_to_native(0xdf); CORE::fc($c); which dies like this: perl: util.c:1900: Perl_ck_warner: Assertion `pat' failed. Its dying while doing the fc(). pp_fc() does this: _CHECK_AND_WARN_PROBLEMATIC_LOCALE; which is calling Perl_ck_warner() with a null pat. This macro is defined as: # define _CHECK_AND_WARN_PROBLEMATIC_LOCALE \ STMT_START { \ if (PL_warn_locale) { \ /*GCC_DIAG_IGNORE(-Wformat-security); Didn't work */ \ Perl_ck_warner(aTHX_ packWARN(WARN_LOCALE), \ SvPVX(PL_warn_locale), \ 0 /* dummy to avoid comp warning */ ); \ /* GCC_DIAG_RESTORE; */ \ SvREFCNT_dec_NN(PL_warn_locale); \ PL_warn_locale = NULL; \ } \ } STMT_END and at this point, PL_warn_locale looks like: SV = PVNV(0xaa2ee8) at 0xac42e0 REFCNT = 1 FLAGS = (TEMP,IOK,NOK,pIOK,pNOK) IV = 223 NV = 223 PV = 0 so SvPVX(PL_warn_locale) is null. I don't understand the locale system well enough for the fix to be obvious to me. -- Any [programming] language that doesn't occasionally surprise the novice will pay for it by continually surprising the expert. -- Larry Wall
Subject: Re: What should tainting behavior be? Was [perl #120675] Unexpected tainting via regex using locale
To: Dave Mitchell <davem [...] iabyn.com>
From: Karl Williamson <public [...] khwilliamson.com>
CC: Aristotle Pagaltzis <pagaltzis [...] gmx.de>, perl5-porters [...] perl.org
Date: Mon, 29 Dec 2014 15:37:33 -0700
Download (untitled) / with headers
text/plain 2.2k
On 12/29/2014 03:15 PM, Dave Mitchell wrote: Show quoted text
> On Mon, Dec 29, 2014 at 01:53:52PM -0700, Karl Williamson wrote:
>> Now done in 613abc6d16e99bd9834fe6afd79beb61a3a4734d
> > Karl, that commit's failing very badly on my system on debugging builds. > Several test are dying with assertion failures. I can reduce charset.t > to this code: > > use POSIX 'locale_h'; > > my $l = 'zh_HK.big5hkscs'; > setlocale(&POSIX::LC_CTYPE(), $l) or die; > use locale; > no warnings 'locale'; # We may be trying out a weird locale > > setlocale(&POSIX::LC_CTYPE(), $l) or die; > my $c = chr utf8::unicode_to_native(0xdf); > CORE::fc($c); > > which dies like this: > > perl: util.c:1900: Perl_ck_warner: Assertion `pat' failed. > > Its dying while doing the fc(). > > pp_fc() does this: > > _CHECK_AND_WARN_PROBLEMATIC_LOCALE; > > which is calling Perl_ck_warner() with a null pat. This macro is defined as: > > # define _CHECK_AND_WARN_PROBLEMATIC_LOCALE \ > STMT_START { \ > if (PL_warn_locale) { \ > /*GCC_DIAG_IGNORE(-Wformat-security); Didn't work */ \ > Perl_ck_warner(aTHX_ packWARN(WARN_LOCALE), \ > SvPVX(PL_warn_locale), \ > 0 /* dummy to avoid comp warning */ ); \ > /* GCC_DIAG_RESTORE; */ \ > SvREFCNT_dec_NN(PL_warn_locale); \ > PL_warn_locale = NULL; \ > } \ > } STMT_END > > > and at this point, PL_warn_locale looks like: > > SV = PVNV(0xaa2ee8) at 0xac42e0 > REFCNT = 1 > FLAGS = (TEMP,IOK,NOK,pIOK,pNOK) > IV = 223 > NV = 223 > PV = 0 > > so SvPVX(PL_warn_locale) is null. > > I don't understand the locale system well enough for the fix to be obvious > to me. > > >
I'll look into it. I can reproduce it on dromedary. I note that smoke-me did not find any problem.
From: Karl Williamson <public [...] khwilliamson.com>
To: Dave Mitchell <davem [...] iabyn.com>
Subject: Re: What should tainting behavior be? Was [perl #120675] Unexpected tainting via regex using locale
CC: Aristotle Pagaltzis <pagaltzis [...] gmx.de>, perl5-porters [...] perl.org
Date: Mon, 29 Dec 2014 18:45:30 -0700
Download (untitled) / with headers
text/plain 2.1k
On 12/29/2014 03:15 PM, Dave Mitchell wrote: Show quoted text
> On Mon, Dec 29, 2014 at 01:53:52PM -0700, Karl Williamson wrote:
>> Now done in 613abc6d16e99bd9834fe6afd79beb61a3a4734d
> > Karl, that commit's failing very badly on my system on debugging builds. > Several test are dying with assertion failures. I can reduce charset.t > to this code: > > use POSIX 'locale_h'; > > my $l = 'zh_HK.big5hkscs'; > setlocale(&POSIX::LC_CTYPE(), $l) or die; > use locale; > no warnings 'locale'; # We may be trying out a weird locale > > setlocale(&POSIX::LC_CTYPE(), $l) or die; > my $c = chr utf8::unicode_to_native(0xdf); > CORE::fc($c); > > which dies like this: > > perl: util.c:1900: Perl_ck_warner: Assertion `pat' failed. > > Its dying while doing the fc(). > > pp_fc() does this: > > _CHECK_AND_WARN_PROBLEMATIC_LOCALE; > > which is calling Perl_ck_warner() with a null pat. This macro is defined as: > > # define _CHECK_AND_WARN_PROBLEMATIC_LOCALE \ > STMT_START { \ > if (PL_warn_locale) { \ > /*GCC_DIAG_IGNORE(-Wformat-security); Didn't work */ \ > Perl_ck_warner(aTHX_ packWARN(WARN_LOCALE), \ > SvPVX(PL_warn_locale), \ > 0 /* dummy to avoid comp warning */ ); \ > /* GCC_DIAG_RESTORE; */ \ > SvREFCNT_dec_NN(PL_warn_locale); \ > PL_warn_locale = NULL; \ > } \ > } STMT_END > > > and at this point, PL_warn_locale looks like: > > SV = PVNV(0xaa2ee8) at 0xac42e0 > REFCNT = 1 > FLAGS = (TEMP,IOK,NOK,pIOK,pNOK) > IV = 223 > NV = 223 > PV = 0 > > so SvPVX(PL_warn_locale) is null. > > I don't understand the locale system well enough for the fix to be obvious > to me. > > >
This should now be fixed by 215c5139cb98a8536a622f8aaace5a0b808475a7.


This service is sponsored and maintained by Best Practical Solutions and runs on Perl.org infrastructure.

For issues related to this RT instance (aka "perlbug"), please contact perlbug-admin at perl.org