New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lc(), uc() and ucfirst() broken inside utf8 regex #6038
Comments
From @audreytCreated by autrijus@autrijus.orgThe following snippet demonstrated corrupted utf8 strings for utf8::upgrade($_="t.est"); s/([a-z]+)/lc($1)/ge; print "lc: $_\n"; It seem to only happen when the following combination is met: Since it occurred in Mail::Header it effectively corrupted Thanks, Perl Info
|
From ams@wiw.orgAt 2002-10-28 00:11:02 -0000, perlbug@perl.org wrote:
This is a horrible morass of interacting bugs. Let's look at one of the utf8::upgrade($_="t.est.t.est"); Perl_pp_lc() is called twice each for "t" and "est" respectively. The SVf_UTF8 is set the next time, and we enter the while loop at pp.c:3452. The code in the bug report avoids breaking lcfirst only by a fortunate Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first -- ams, who foolishly saw "a nice opportunity to explore Unicode." |
From @nwc10On Tue, Nov 05, 2002 at 02:36:46PM +0530, Abhijit Menon-Sen wrote:
OK. I can do that one in latin1, which may make it easier to view for many perl -wle '$a = chr (201) . chr 256; chop $a; print $a; $b = $a; $c = $a; I think it is one I mailed to p5p a year or two ago, but never got the
"Foolish mortal" I think that there ought to be inspiration for a good spoof of Tolkien's Nicholas Clark |
From @hvdsAbhijit Menon-Sen <ams@wiw.org> wrote: It is because we're working on $1, which is magic, but we're checking Could you try some tests with the patch below? This at least starts to Hugo Inline Patch--- pp.c.old Sat Oct 19 16:21:46 2002
+++ pp.c Tue Nov 5 13:24:06 2002
@@ -3323,12 +3323,13 @@
register U8 *s;
STRLEN slen;
+ SvGETMAGIC(sv);
if (DO_UTF8(sv)) {
U8 tmpbuf[UTF8_MAXLEN_UCLC+1];
STRLEN ulen;
STRLEN tculen;
- s = (U8*)SvPV(sv, slen);
+ s = (U8*)SvPV_nomg(sv, slen);
utf8_to_uvchr(s, &ulen);
toTITLE_utf8(s, tmpbuf, &tculen);
@@ -3342,7 +3343,7 @@
SETs(TARG);
}
else {
- s = (U8*)SvPV_force(sv, slen);
+ s = (U8*)SvPV_force_nomg(sv, slen);
Copy(tmpbuf, s, tculen, U8);
}
}
@@ -3350,11 +3351,11 @@
if (!SvPADTMP(sv) || SvREADONLY(sv)) {
dTARGET;
SvUTF8_off(TARG); /* decontaminate */
- sv_setsv(TARG, sv);
+ sv_setsv_nomg(TARG, sv);
sv = TARG;
SETs(sv);
}
- s = (U8*)SvPV_force(sv, slen);
+ s = (U8*)SvPV_force_nomg(sv, slen);
if (*s) {
if (IN_LOCALE_RUNTIME) {
TAINT;
@@ -3365,8 +3366,7 @@
*s = toUPPER(*s);
}
}
- if (SvSMAGICAL(sv))
- mg_set(sv);
+ SvSETMAGIC(sv);
RETURN;
}
@@ -3377,7 +3377,8 @@
register U8 *s;
STRLEN slen;
- if (DO_UTF8(sv) && (s = (U8*)SvPV(sv, slen)) && slen && UTF8_IS_START(*s)) {
+ SvGETMAGIC(sv);
+ if (DO_UTF8(sv) && (s = (U8*)SvPV_nomg(sv, slen)) && slen && UTF8_IS_START(*s)) {
STRLEN ulen;
U8 tmpbuf[UTF8_MAXLEN_UCLC+1];
U8 *tend;
@@ -3396,7 +3397,7 @@
SETs(TARG);
}
else {
- s = (U8*)SvPV_force(sv, slen);
+ s = (U8*)SvPV_force_nomg(sv, slen);
Copy(tmpbuf, s, ulen, U8);
}
}
@@ -3404,11 +3405,11 @@
if (!SvPADTMP(sv) || SvREADONLY(sv)) {
dTARGET;
SvUTF8_off(TARG); /* decontaminate */
- sv_setsv(TARG, sv);
+ sv_setsv_nomg(TARG, sv);
sv = TARG;
SETs(sv);
}
- s = (U8*)SvPV_force(sv, slen);
+ s = (U8*)SvPV_force_nomg(sv, slen);
if (*s) {
if (IN_LOCALE_RUNTIME) {
TAINT;
@@ -3419,8 +3420,7 @@
*s = toLOWER(*s);
}
}
- if (SvSMAGICAL(sv))
- mg_set(sv);
+ SvSETMAGIC(sv);
RETURN;
}
@@ -3431,6 +3431,7 @@
register U8 *s;
STRLEN len;
+ SvGETMAGIC(sv);
if (DO_UTF8(sv)) {
dTARGET;
STRLEN ulen;
@@ -3438,7 +3439,7 @@
U8 *send;
U8 tmpbuf[UTF8_MAXLEN_UCLC+1];
- s = (U8*)SvPV(sv,len);
+ s = (U8*)SvPV_nomg(sv,len);
if (!len) {
SvUTF8_off(TARG); /* decontaminate */
sv_setpvn(TARG, "", 0);
@@ -3468,11 +3469,11 @@
if (!SvPADTMP(sv) || SvREADONLY(sv)) {
dTARGET;
SvUTF8_off(TARG); /* decontaminate */
- sv_setsv(TARG, sv);
+ sv_setsv_nomg(TARG, sv);
sv = TARG;
SETs(sv);
}
- s = (U8*)SvPV_force(sv, len);
+ s = (U8*)SvPV_force_nomg(sv, len);
if (len) {
register U8 *send = s + len;
@@ -3488,8 +3489,7 @@
}
}
}
- if (SvSMAGICAL(sv))
- mg_set(sv);
+ SvSETMAGIC(sv);
RETURN;
}
@@ -3500,6 +3500,7 @@
register U8 *s;
STRLEN len;
+ SvGETMAGIC(sv);
if (DO_UTF8(sv)) {
dTARGET;
STRLEN ulen;
@@ -3507,7 +3508,7 @@
U8 *send;
U8 tmpbuf[UTF8_MAXLEN_UCLC+1];
- s = (U8*)SvPV(sv,len);
+ s = (U8*)SvPV_nomg(sv,len);
if (!len) {
SvUTF8_off(TARG); /* decontaminate */
sv_setpvn(TARG, "", 0);
@@ -3554,12 +3555,12 @@
if (!SvPADTMP(sv) || SvREADONLY(sv)) {
dTARGET;
SvUTF8_off(TARG); /* decontaminate */
- sv_setsv(TARG, sv);
+ sv_setsv_nomg(TARG, sv);
sv = TARG;
SETs(sv);
}
- s = (U8*)SvPV_force(sv, len);
+ s = (U8*)SvPV_force_nomg(sv, len);
if (len) {
register U8 *send = s + len;
@@ -3575,8 +3576,7 @@
}
}
}
- if (SvSMAGICAL(sv))
- mg_set(sv);
+ SvSETMAGIC(sv);
RETURN;
}
|
From @TuxOn Tue 05 Nov 2002 14:30, hv@crypt.org wrote:
What kind of tests? A full smoke for HP-UX 11.00?
-- |
From @nwc10On Tue, Nov 05, 2002 at 02:38:49PM +0100, H.Merijn Brand wrote:
This hunk failed:
..../perl -Ilib -wle '$a = chr (201) . chr 256; chop $a; print $a; $b = $a; $c = $a; Well, that one works now, as do all variants except ucfirst. I can't see how Nicholas Clark |
From nick.ing-simmons@elixent.com<hv@crypt.org> writes:
...... And presumably doing one SvGETMAGIC at the top should be faster -- |
From ams@wiw.orgAt 2002-11-05 13:30:00 +0000, hv@crypt.org wrote:
Now I feel stupid. #18107 (my earlier patch) and #18109 (your patch with one hunk fixed) $_ = "\N{greek:sigma}foo.bar"; -- ams |
From @nwc10On Tue, Nov 05, 2002 at 07:55:23PM +0530, Abhijit Menon-Sen wrote:
Bah. You need full unicode for your example. That's tantamount to cheating perl5.00503 -lwe '$_ = "foo.bar" . chr 256; chop $_; s/(\w+)/uc($1)/ge; print $_' ..../perl -Ilib -lwe '$_ = "foo.bar" . chr 256; chop $_; s/(\w+)/uc($1)/ge; print $_' Who is this B0B character? :-( Nicholas Clark PS I'm not really using chr 256. It's just a quantum effect until it meets |
From @hvdsAbhijit Menon-Sen <ams@wiw.org> wrote: Happens to the best of us. :) :#18107 (my earlier patch) and #18109 (your patch with one hunk fixed) That one hunk needed fixing only because #18107 changed the context, :fix all four cases in the original bug report, but the swash_init() I'll try to look at that one later. Hugo |
From @hvds"H.Merijn Brand" <h.m.brand@hccnet.nl> wrote: No, I just meant 'try some lc/uc/s{}{}e combinations'. I'm sure the Hugo |
From nick.ing-simmons@elixent.com<hv@crypt.org> writes:
With the swash code it happens to _all_ of us. For the uninitiated SWASHes are hashes that represent property data -- |
From @andk
> <hv@crypt.org> writes:
> With the swash code it happens to _all_ of us. > For the uninitiated SWASHes are hashes that represent property data While you are at it, may I ask you what you think about this: when code is running in the debugger and triggers that swash code, it Thanks for any insight, |
From ams@wiw.orgAt 2002-11-05 14:44:28 +0000, nick@ccl4.org wrote:
Nicholas, say goodbye to B0B. The appended patch teaches save_re_context() to save the existing $1..$n (By the way, I liked p4d2p. Can p4genpatch be made to read from stdin?) -- ams ==== //depot/perl/regcomp.c#311 - /home/ams/build/perl/current/regcomp.c ==== ==== //depot/perl/t/op/lc.t#10 - /home/ams/build/perl/current/t/op/lc.t ==== -print "1..51\n"; my $test = 1; @@ -136,3 +136,18 @@ +# #18107: A host of bugs involving [ul]c{,first}. AMS 20021106 |
From @andk
> (By the way, I liked p4d2p. Can p4genpatch be made to read from stdin?) That would do it without re-indentation (to keep the patch short). Not * accepts more than one patchnumber on commandline -- Inline Patch--- p4genpatch@18113 Wed Nov 6 16:01:13 2002
+++ p4genpatch Wed Nov 6 15:58:26 2002
@@ -20,7 +20,7 @@
sub Usage ();
$0 =~ s|^.*[\\/]||;
-my $VERSION = '0.05';
+my $VERSION = '0.06';
my $TOPDIR = cwd();
my @P4opt;
our %OPT = ( "d" => "u", b => "//depot/perl", "D" => "diff" );
@@ -28,13 +28,17 @@
GetOptions(\%OPT, "b=s", "p=s", "d=s", "D=s", "h", "v", "V") or die Usage;
print Usage and exit if $OPT{h};
print "$VERSION\n" and exit if $OPT{V};
-die Usage unless @ARGV == 1 && $ARGV[0] =~ /^\d+$/;
-my $CHANGE = shift;
+unless (@ARGV) {
+ @ARGV = <>;
+ chomp @ARGV;
+}
for my $p4opt (qw(p)) {
push @P4opt, "-$p4opt $OPT{$p4opt}" if $OPT{$p4opt};
}
+while (my $CHANGE = shift @ARGV) {
+die "Illegal CHANGE[$CHANGE]\n", Usage unless $CHANGE =~ /^\d+$/;
my $system = "p4 @P4opt describe -s $CHANGE |";
open my $p4, $system or die "Could not run $system";
my @action;
@@ -136,6 +140,7 @@
}
}
print "End of Patch.\n";
+}
my($tz_offset);
sub correctmtime ($$$) {
@@ -155,7 +160,7 @@
}
sub Usage () {
- qq{Usage: $0 [OPTIONS] patchnumber
+ qq{Usage: $0 [OPTIONS] [patchnumber]...
-p host:port p4 port (e.g. myhost:1666)
-d diffopt option to pass to diff(1)
@@ -171,6 +176,8 @@
temporary directory with sensible names and sensible modification
times and composes a patch to STDOUT using external diff command.
Requires repository access.
+
+If no files are specified, reads patchnumbers from stdin.
Examples:
perl $0 12345 | gzip -c > 12345.gz |
From @hvdsAbhijit Menon-Sen <ams@wiw.org> wrote: If I correctly understand what this is doing, this feels like an approach Can you give a detailed trace of what's going wrong in a sample testcase? Hugo |
From @ysthOn Tue, 05 Nov 2002 13:30:00 +0000, hv@crypt.org wrote:
Nope :( I think SvPV has to come before DO_UTF8 for AMAGIC to work: ~/bleadperl/perl $./perl -Ilib -MTest::More=no_plan -wle' Also, it appears that the utf8 branches for [ul]c(first)? are doing And [ul]cfirst are assuming (in the SvPADTMP true branch) that I'm going to try to come up with first a test patch and then a |
From @hvdssthoenna@efn.org (Yitzchak Scott-Thoennes) wrote: Ah, yes. :Also, it appears that the utf8 branches for [ul]c(first)? are doing Yup. Though my patch didn't change that. :And [ul]cfirst are assuming (in the SvPADTMP true branch) that I don't think so. Hmm, I wonder if there is also any possibility :I'm going to try to come up with first a test patch and then a Thank you. In case it is helpful, I attach below the patch that I think Hugo Inline Patch--- pp.c.old Tue Nov 5 13:24:06 2002
+++ pp.c Fri Nov 8 12:55:33 2002
@@ -3320,40 +3320,36 @@
{
dSP;
SV *sv = TOPs;
- register U8 *s;
STRLEN slen;
+ register U8 *s = (U8*)SvPV(sv, slen); /* handles GMAGIC, AMAGIC */
- SvGETMAGIC(sv);
if (DO_UTF8(sv)) {
U8 tmpbuf[UTF8_MAXLEN_UCLC+1];
STRLEN ulen;
STRLEN tculen;
- s = (U8*)SvPV_nomg(sv, slen);
utf8_to_uvchr(s, &ulen);
-
toTITLE_utf8(s, tmpbuf, &tculen);
- utf8_to_uvchr(tmpbuf, 0);
- if (!SvPADTMP(sv) || SvREADONLY(sv)) {
+ if (!SvPADTMP(sv) || SvREADONLY(sv) || ulen != tculen) {
dTARGET;
sv_setpvn(TARG, (char*)tmpbuf, tculen);
sv_catpvn(TARG, (char*)(s + ulen), slen - ulen);
SvUTF8_on(TARG);
SETs(TARG);
+ sv = TARG;
}
else {
s = (U8*)SvPV_force_nomg(sv, slen);
- Copy(tmpbuf, s, tculen, U8);
+ Copy(tmpbuf, s, ulen, U8);
}
}
else {
if (!SvPADTMP(sv) || SvREADONLY(sv)) {
dTARGET;
- SvUTF8_off(TARG); /* decontaminate */
sv_setsv_nomg(TARG, sv);
+ SETs(TARG);
sv = TARG;
- SETs(sv);
}
s = (U8*)SvPV_force_nomg(sv, slen);
if (*s) { |
From nick.ing-simmons@elixent.comYitzchak Scott-Thoennes <sthoenna@efn.org> writes:
No. -- |
From nick.ing-simmons@elixent.com<hv@crypt.org> writes:
There are problem cases. iso-8859-1 0xff (y diaresis) does not have -- |
From @ysthOn Fri, 08 Nov 2002 12:59:49 +0000, hv@crypt.org wrote:
Sorry, I didn't mean to imply it did. Been that way since at least 5.6.1.
Sounded like NI-S thought so. What needs to be done?
Thanks. I'm still working on a test patch but have been busy with I'm not sure we should even be doing plain sv_setsv. That's passing ~/bleadperl/perl $./perl -Ilib -MDevel::Peek -MScalar::Util=/./ -we'
or in pure perl: ~/bleadperl/perl $./miniperl -wle'print length(~ucfirst !1)' (which I think should print 0) |
From @ysthOn Tue, 12 Nov 2002, Yitzchak Scott-Thoennes wrote:
I need to work on my cut-and-pasting skills. That should have been:
|
From ams@wiw.orgAt 2002-11-06 16:32:25 +0000, hv@crypt.org wrote:
I don't see any cheaper way (and as Jarkko has shown, this way doesn't
Perl_to_utf8_case (which is called by toLOWER_utf8 and siblings) calls #0 Perl_call_sv (sv=0x8180454, flags=64) at perl.c:1760 (From perl -le '$_="foo.bar".chr(256);chop;s/([^.]+)/uc($1)/ge;print') -- ams |
From @jhi
I tried the speed in two ways: (1) run op/regexp.t and op/pat.100 both 10 times on an almost zero-load (2) Since neither of the above tests really stress test UTF-8 use Benchmark; The results of 2+2 runs cleaned up: a) b) c) d) with1 with2 out1 out2 a) number of "abc":s in one (\w+) run Yes, quite consistently *with* the patch things get *faster*. -- |
From @jhiSince no better ideas have come along, I'm applying Abhijit's patch |
@jhi - Status changed from 'new' to 'resolved' |
From @hvdsAbhijit Menon-Sen <ams@wiw.org> wrote: Thanks, belatedly applied as #18266. Hugo |
@jhi - Status changed from 'open' to 'resolved' |
Migrated from rt.perl.org#18107 (status was 'resolved')
Searchable as RT18107$
The text was updated successfully, but these errors were encountered: