lc(), uc() and ucfirst() broken inside utf8 regex #6038

p5pRT · 2002-10-28T00:11:00Z

Migrated from rt.perl.org#18107 (status was 'resolved')

Searchable as RT18107$

p5pRT · 2002-10-28T00:11:00Z

From @audreyt

Created by autrijus@autrijus.org

The following snippet demonstrated corrupted utf8 strings for
lc(), uc() and ucfirst(), but not lcfirst():

utf8::upgrade($_="t.est"); s/([a-z]+)/lc($1)/ge; print "lc: $_\n";
utf8::upgrade($_="t.est"); s/([a-z]+)/uc($1)/ge; print "uc: $_\n";
utf8::upgrade($_="t.est"); s/([a-z]+)/lcfirst($1)/ge; print "lcfirst: $_\n";
utf8::upgrade($_="t.est"); s/([a-z]+)/ucfirst($1)/ge; print "ucfirst: $_\n";

It seem to only happen when the following combination is met:
- String with unicode flag on
- Regular expression with captured variables
- Variables manipulated by lc, uc, ucfirst or the equivalent \L escapes

Since it occurred in Mail::Header it effectively corrupted
utf8-enabled mail applications writing in perl.

Thanks,
/Autrijus/

Perl Info


Flags:
    category=core
    severity=high

This perlbug was built using Perl v5.8.0 - Mon Jun 10 19:48:03 CST 2002
It is being executed now by  Perl v5.8.0 - Sat Oct  5 11:17:02 GMT 2002.

Site configuration information for perl v5.8.0:

Configured by root at Sat Oct  5 11:17:02 GMT 2002.

Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
  Platform:
    osname=freebsd, osvers=4.7-rc, archname=i386-freebsd
    uname='freebsd gohan17.freebsd.org 4.7-rc freebsd 4.7-rc #0: sun apr 1 02:34:56 pst 2002 asami@bento.freebsd.org:usrsrcsyscompilebento i386 '
    config_args='-sde -Dprefix=/usr/local -Darchlib=/usr/local/lib/perl5/5.8.0/mach -Dprivlib=/usr/local/lib/perl5/5.8.0 -Dman3dir=/usr/local/lib/perl5/5.8.0/man/man3 -Dsitearch=/usr/local/lib/perl5/site_perl/5.8.0/mach -Dsitelib=/usr/local/lib/perl5/site_perl/5.8.0 -Ui_malloc -Ui_iconv -Uinstallusrbinperl -Dccflags=-DAPPLLIB_EXP="/usr/local/lib/perl5/5.8.0/BSDPAN" -Ui_gdbm -Dusemymalloc=n'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-DAPPLLIB_EXP="/usr/local/lib/perl5/5.8.0/BSDPAN" -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing -I/usr/local/include',
    optimize='-O -pipe ',
    cppflags='-DAPPLLIB_EXP="/usr/local/lib/perl5/5.8.0/BSDPAN" -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='2.95.4 20020320 [FreeBSD]', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags ='-Wl,-E  -L/usr/local/lib'
    libpth=/usr/lib /usr/local/lib
    libs=-lm -lc -lcrypt -lutil
    perllibs=-lm -lc -lcrypt -lutil
    libc=, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
    cccdlflags='-DPIC -fPIC', lddlflags='-shared  -L/usr/local/lib'

Locally applied patches:
    DEVEL17060


@INC for perl v5.8.0:
    /usr/local/lib/perl5/site_perl/5.8.0/mach
    /usr/local/lib/perl5/site_perl/5.8.0
    /usr/local/lib/perl5/site_perl
    /usr/local/lib/perl5/5.8.0/BSDPAN
    /usr/local/lib/perl5/5.8.0/mach
    /usr/local/lib/perl5/5.8.0
    .


Environment for perl v5.8.0:
    HOME=/home/autrijus
    LANG (unset)
    LANGUAGE (unset)
    LC_CTYPE=en_US.ISO_8859-1
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/autrijus/bin:/usr/X11R6/bin:/usr/games:/bin:/usr/bin:/usr/local/bin:/usr/local/sbin:/sbin:/usr/sbin:/home/autrijus/Komodo-2.0
    PERL_BADLANG (unset)
    SHELL=/bin/tcsh

p5pRT · 2002-11-05T09:13:58Z

From ams@wiw.org

At 2002-10-28 00:11:02 -0000, perlbug@perl.org wrote:

utf8::upgrade($_="t.est"); s/([a-z]+)/lc($1)/ge; print "lc: $_\n";
utf8::upgrade($_="t.est"); s/([a-z]+)/uc($1)/ge; print "uc: $_\n";
utf8::upgrade($_="t.est"); s/([a-z]+)/lcfirst($1)/ge; print "lcfirst: $_\n";
utf8::upgrade($_="t.est"); s/([a-z]+)/ucfirst($1)/ge; print "ucfirst: $_\n";

This is a horrible morass of interacting bugs. Let's look at one of the
cases above in more detail:

utf8::upgrade($_="t.est.t.est");
s/(\w+)/uc($1)/eg; # $_ eq "T.E0B.T.EST"

Perl_pp_lc() is called twice each for "t" and "est" respectively. The
first bug is that $1 does not have SVf_UTF8 set on the first call. This
makes no difference with "t", but using "\N{greek:sigma}" instead causes
more brokenness.

SVf_UTF8 is set the next time, and we enter the while loop at pp.c:3452.
The first time toUPPER_utf8() is called, Perl_to_utf8() case decides to
call swash_init(), which wipes out the PV we're stepping over somewhere
down the line (and causes the brokenness seen in the bug report). Once
the swash is initialised, however, subsequent calls to pp_lc() work as
expected. (The other three cases are essentially similar.)

The code in the bug report avoids breaking lcfirst only by a fortunate
coincidence. I extended this to ucfirst in #18107, but it doesn't fix
any of the real problems.

Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first
call to pp_lc? Fixing that would solve one class of bugs, and help me to
fix the other (and, I suspect, discover a third).

-- ams, who foolishly saw "a nice opportunity to explore Unicode."

p5pRT · 2002-11-05T10:59:00Z

From @nwc10

On Tue, Nov 05, 2002 at 02:36:46PM +0530, Abhijit Menon-Sen wrote:

At 2002-10-28 00:11:02 -0000, perlbug@perl.org wrote:

utf8::upgrade($_="t.est"); s/([a-z]+)/lc($1)/ge; print "lc: $_\n";
utf8::upgrade($_="t.est"); s/([a-z]+)/uc($1)/ge; print "uc: $_\n";
utf8::upgrade($_="t.est"); s/([a-z]+)/lcfirst($1)/ge; print "lcfirst: $_\n";
utf8::upgrade($_="t.est"); s/([a-z]+)/ucfirst($1)/ge; print "ucfirst: $_\n";

This is a horrible morass of interacting bugs. Let's look at one of the
cases above in more detail:
utf8&#8203;::upgrade$$\_="t\.est\.t\.est"$;
s/$\\w\+$/uc$$1$/eg;                  \# $\_ eq "T\.E0B\.T\.EST"
Perl_pp_lc() is called twice each for "t" and "est" respectively. The
first bug is that $1 does not have SVf_UTF8 set on the first call. This
makes no difference with "t", but using "\N{greek:sigma}" instead causes
more brokenness.

Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first
call to pp_lc? Fixing that would solve one class of bugs, and help me to
fix the other (and, I suspect, discover a third).

OK. I can do that one in latin1, which may make it easier to view for many
readers (like me) living in 8 bit land:

perl -wle '$a = chr (201) . chr 256; chop $a; print $a; $b = $a; $c = $a; $b =~ s/./lc $&/e; print $b; $c =~ s/./lc $&/e; print $c; '
É
É
é

I think it is one I mailed to p5p a year or two ago, but never got the
tuity things organised to make a TODO test.

-- ams, who foolishly saw "a nice opportunity to explore Unicode."

"Foolish mortal"

I think that there ought to be inspiration for a good spoof of Tolkien's
verse about the ring here, but I can't create one on the spur of the moment.

Nicholas Clark

p5pRT · 2002-11-05T13:30:11Z

From @hvds

Abhijit Menon-Sen <ams@wiw.org> wrote:
:Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first
:call to pp_lc? Fixing that would solve one class of bugs, and help me to
:fix the other (and, I suspect, discover a third).

It is because we're working on $1, which is magic, but we're checking
utfness before we check for magic.

Could you try some tests with the patch below? This at least starts to
fix things, and may be the whole story.

Hugo

Inline Patch

--- pp.c.old	Sat Oct 19 16:21:46 2002
+++ pp.c	Tue Nov  5 13:24:06 2002
@@ -3323,12 +3323,13 @@
     register U8 *s;
     STRLEN slen;
 
+    SvGETMAGIC(sv);
     if (DO_UTF8(sv)) {
 	U8 tmpbuf[UTF8_MAXLEN_UCLC+1];
 	STRLEN ulen;
 	STRLEN tculen;
 
-	s = (U8*)SvPV(sv, slen);
+	s = (U8*)SvPV_nomg(sv, slen);
 	utf8_to_uvchr(s, &ulen);
 
 	toTITLE_utf8(s, tmpbuf, &tculen);
@@ -3342,7 +3343,7 @@
 	    SETs(TARG);
 	}
 	else {
-	    s = (U8*)SvPV_force(sv, slen);
+	    s = (U8*)SvPV_force_nomg(sv, slen);
 	    Copy(tmpbuf, s, tculen, U8);
 	}
     }
@@ -3350,11 +3351,11 @@
 	if (!SvPADTMP(sv) || SvREADONLY(sv)) {
 	    dTARGET;
 	    SvUTF8_off(TARG);				/* decontaminate */
-	    sv_setsv(TARG, sv);
+	    sv_setsv_nomg(TARG, sv);
 	    sv = TARG;
 	    SETs(sv);
 	}
-	s = (U8*)SvPV_force(sv, slen);
+	s = (U8*)SvPV_force_nomg(sv, slen);
 	if (*s) {
 	    if (IN_LOCALE_RUNTIME) {
 		TAINT;
@@ -3365,8 +3366,7 @@
 		*s = toUPPER(*s);
 	}
     }
-    if (SvSMAGICAL(sv))
-	mg_set(sv);
+    SvSETMAGIC(sv);
     RETURN;
 }
 
@@ -3377,7 +3377,8 @@
     register U8 *s;
     STRLEN slen;
 
-    if (DO_UTF8(sv) && (s = (U8*)SvPV(sv, slen)) && slen && UTF8_IS_START(*s)) {
+    SvGETMAGIC(sv);
+    if (DO_UTF8(sv) && (s = (U8*)SvPV_nomg(sv, slen)) && slen && UTF8_IS_START(*s)) {
 	STRLEN ulen;
 	U8 tmpbuf[UTF8_MAXLEN_UCLC+1];
 	U8 *tend;
@@ -3396,7 +3397,7 @@
 	    SETs(TARG);
 	}
 	else {
-	    s = (U8*)SvPV_force(sv, slen);
+	    s = (U8*)SvPV_force_nomg(sv, slen);
 	    Copy(tmpbuf, s, ulen, U8);
 	}
     }
@@ -3404,11 +3405,11 @@
 	if (!SvPADTMP(sv) || SvREADONLY(sv)) {
 	    dTARGET;
 	    SvUTF8_off(TARG);				/* decontaminate */
-	    sv_setsv(TARG, sv);
+	    sv_setsv_nomg(TARG, sv);
 	    sv = TARG;
 	    SETs(sv);
 	}
-	s = (U8*)SvPV_force(sv, slen);
+	s = (U8*)SvPV_force_nomg(sv, slen);
 	if (*s) {
 	    if (IN_LOCALE_RUNTIME) {
 		TAINT;
@@ -3419,8 +3420,7 @@
 		*s = toLOWER(*s);
 	}
     }
-    if (SvSMAGICAL(sv))
-	mg_set(sv);
+    SvSETMAGIC(sv);
     RETURN;
 }
 
@@ -3431,6 +3431,7 @@
     register U8 *s;
     STRLEN len;
 
+    SvGETMAGIC(sv);
     if (DO_UTF8(sv)) {
 	dTARGET;
 	STRLEN ulen;
@@ -3438,7 +3439,7 @@
 	U8 *send;
 	U8 tmpbuf[UTF8_MAXLEN_UCLC+1];
 
-	s = (U8*)SvPV(sv,len);
+	s = (U8*)SvPV_nomg(sv,len);
 	if (!len) {
 	    SvUTF8_off(TARG);				/* decontaminate */
 	    sv_setpvn(TARG, "", 0);
@@ -3468,11 +3469,11 @@
 	if (!SvPADTMP(sv) || SvREADONLY(sv)) {
 	    dTARGET;
 	    SvUTF8_off(TARG);				/* decontaminate */
-	    sv_setsv(TARG, sv);
+	    sv_setsv_nomg(TARG, sv);
 	    sv = TARG;
 	    SETs(sv);
 	}
-	s = (U8*)SvPV_force(sv, len);
+	s = (U8*)SvPV_force_nomg(sv, len);
 	if (len) {
 	    register U8 *send = s + len;
 
@@ -3488,8 +3489,7 @@
 	    }
 	}
     }
-    if (SvSMAGICAL(sv))
-	mg_set(sv);
+    SvSETMAGIC(sv);
     RETURN;
 }
 
@@ -3500,6 +3500,7 @@
     register U8 *s;
     STRLEN len;
 
+    SvGETMAGIC(sv);
     if (DO_UTF8(sv)) {
 	dTARGET;
 	STRLEN ulen;
@@ -3507,7 +3508,7 @@
 	U8 *send;
 	U8 tmpbuf[UTF8_MAXLEN_UCLC+1];
 
-	s = (U8*)SvPV(sv,len);
+	s = (U8*)SvPV_nomg(sv,len);
 	if (!len) {
 	    SvUTF8_off(TARG);				/* decontaminate */
 	    sv_setpvn(TARG, "", 0);
@@ -3554,12 +3555,12 @@
 	if (!SvPADTMP(sv) || SvREADONLY(sv)) {
 	    dTARGET;
 	    SvUTF8_off(TARG);				/* decontaminate */
-	    sv_setsv(TARG, sv);
+	    sv_setsv_nomg(TARG, sv);
 	    sv = TARG;
 	    SETs(sv);
 	}
 
-	s = (U8*)SvPV_force(sv, len);
+	s = (U8*)SvPV_force_nomg(sv, len);
 	if (len) {
 	    register U8 *send = s + len;
 
@@ -3575,8 +3576,7 @@
 	    }
 	}
     }
-    if (SvSMAGICAL(sv))
-	mg_set(sv);
+    SvSETMAGIC(sv);
     RETURN;
 }

p5pRT · 2002-11-05T13:39:57Z

From @Tux

On Tue 05 Nov 2002 14:30, hv@crypt.org wrote:

Abhijit Menon-Sen <ams@wiw.org> wrote:
:Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first
:call to pp_lc? Fixing that would solve one class of bugs, and help me to
:fix the other (and, I suspect, discover a third).

It is because we're working on $1, which is magic, but we're checking
utfness before we check for magic.

Could you try some tests with the patch below? This at least starts to
fix things, and may be the whole story.

What kind of tests? A full smoke for HP-UX 11.00?

Hugo
--- pp.c.old Sat Oct 19 16:21:46 2002
+++ pp.c Tue Nov 5 13:24:06 2002
@@ -3323,12 +3323,13 @@
register U8 *s;
STRLEN slen;

+ SvGETMAGIC(sv);
if (DO_UTF8(sv)) {
U8 tmpbuf[UTF8_MAXLEN_UCLC+1];
STRLEN ulen;
STRLEN tculen;

- s = (U8*)SvPV(sv, slen);
+ s = (U8*)SvPV_nomg(sv, slen);
utf8_to_uvchr(s, &ulen);
 toTITLE\_utf8$s\, tmpbuf\, &tculen$;
@@ -3342,7 +3343,7 @@
SETs(TARG);
}
else {
- s = (U8*)SvPV_force(sv, slen);
+ s = (U8*)SvPV_force_nomg(sv, slen);
Copy(tmpbuf, s, tculen, U8);
}
}
@@ -3350,11 +3351,11 @@
if (!SvPADTMP(sv) || SvREADONLY(sv)) {
dTARGET;
SvUTF8_off(TARG); /* decontaminate */
- sv_setsv(TARG, sv);
+ sv_setsv_nomg(TARG, sv);
sv = TARG;
SETs(sv);
}
- s = (U8*)SvPV_force(sv, slen);
+ s = (U8*)SvPV_force_nomg(sv, slen);
if (*s) {
if (IN_LOCALE_RUNTIME) {
TAINT;
@@ -3365,8 +3366,7 @@
*s = toUPPER(*s);
}
}
- if (SvSMAGICAL(sv))
- mg_set(sv);
+ SvSETMAGIC(sv);
RETURN;
}

@@ -3377,7 +3377,8 @@
register U8 *s;
STRLEN slen;

- if (DO_UTF8(sv) && (s = (U8*)SvPV(sv, slen)) && slen && UTF8_IS_START(*s)) {
+ SvGETMAGIC(sv);
+ if (DO_UTF8(sv) && (s = (U8*)SvPV_nomg(sv, slen)) && slen && UTF8_IS_START(*s)) {
STRLEN ulen;
U8 tmpbuf[UTF8_MAXLEN_UCLC+1];
U8 *tend;
@@ -3396,7 +3397,7 @@
SETs(TARG);
}
else {
- s = (U8*)SvPV_force(sv, slen);
+ s = (U8*)SvPV_force_nomg(sv, slen);
Copy(tmpbuf, s, ulen, U8);
}
}
@@ -3404,11 +3405,11 @@
if (!SvPADTMP(sv) || SvREADONLY(sv)) {
dTARGET;
SvUTF8_off(TARG); /* decontaminate */
- sv_setsv(TARG, sv);
+ sv_setsv_nomg(TARG, sv);
sv = TARG;
SETs(sv);
}
- s = (U8*)SvPV_force(sv, slen);
+ s = (U8*)SvPV_force_nomg(sv, slen);
if (*s) {
if (IN_LOCALE_RUNTIME) {
TAINT;
@@ -3419,8 +3420,7 @@
*s = toLOWER(*s);
}
}
- if (SvSMAGICAL(sv))
- mg_set(sv);
+ SvSETMAGIC(sv);
RETURN;
}

@@ -3431,6 +3431,7 @@
register U8 *s;
STRLEN len;

+ SvGETMAGIC(sv);
if (DO_UTF8(sv)) {
dTARGET;
STRLEN ulen;
@@ -3438,7 +3439,7 @@
U8 *send;
U8 tmpbuf[UTF8_MAXLEN_UCLC+1];

- s = (U8*)SvPV(sv,len);
+ s = (U8*)SvPV_nomg(sv,len);
if (!len) {
SvUTF8_off(TARG); /* decontaminate */
sv_setpvn(TARG, "", 0);
@@ -3468,11 +3469,11 @@
if (!SvPADTMP(sv) || SvREADONLY(sv)) {
dTARGET;
SvUTF8_off(TARG); /* decontaminate */
- sv_setsv(TARG, sv);
+ sv_setsv_nomg(TARG, sv);
sv = TARG;
SETs(sv);
}
- s = (U8*)SvPV_force(sv, len);
+ s = (U8*)SvPV_force_nomg(sv, len);
if (len) {
register U8 *send = s + len;

@@ -3488,8 +3489,7 @@
}
}
}
- if (SvSMAGICAL(sv))
- mg_set(sv);
+ SvSETMAGIC(sv);
RETURN;
}

@@ -3500,6 +3500,7 @@
register U8 *s;
STRLEN len;

+ SvGETMAGIC(sv);
if (DO_UTF8(sv)) {
dTARGET;
STRLEN ulen;
@@ -3507,7 +3508,7 @@
U8 *send;
U8 tmpbuf[UTF8_MAXLEN_UCLC+1];

- s = (U8*)SvPV(sv,len);
+ s = (U8*)SvPV_nomg(sv,len);
if (!len) {
SvUTF8_off(TARG); /* decontaminate */
sv_setpvn(TARG, "", 0);
@@ -3554,12 +3555,12 @@
if (!SvPADTMP(sv) || SvREADONLY(sv)) {
dTARGET;
SvUTF8_off(TARG); /* decontaminate */
- sv_setsv(TARG, sv);
+ sv_setsv_nomg(TARG, sv);
sv = TARG;
SETs(sv);
}

- s = (U8*)SvPV_force(sv, len);
+ s = (U8*)SvPV_force_nomg(sv, len);
if (len) {
register U8 *send = s + len;

@@ -3575,8 +3576,7 @@
}
}
}
- if (SvSMAGICAL(sv))
- mg_set(sv);
+ SvSETMAGIC(sv);
RETURN;
}

--
H.Merijn Brand Amsterdam Perl Mongers (http://amsterdam.pm.org/)
using perl-5.6.1, 5.8.0 & 633 on HP-UX 10.20 & 11.00, AIX 4.2, AIX 4.3,
WinNT 4, Win2K pro & WinCE 2.11. Smoking perl CORE: smokers@perl.org
http://archives.develooper.com/daily-build@perl.org/ perl-qa@perl.org
send smoke reports to: smokers-reports@perl.org, QA: http://qa.perl.org

p5pRT · 2002-11-05T13:45:40Z

From @nwc10

On Tue, Nov 05, 2002 at 02:38:49PM +0100, H.Merijn Brand wrote:

On Tue 05 Nov 2002 14:30, hv@crypt.org wrote:

Abhijit Menon-Sen <ams@wiw.org> wrote:
:Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first
:call to pp_lc? Fixing that would solve one class of bugs, and help me to
:fix the other (and, I suspect, discover a third).

It is because we're working on $1, which is magic, but we're checking
utfness before we check for magic.

Could you try some tests with the patch below? This at least starts to
fix things, and may be the whole story.

What kind of tests? A full smoke for HP-UX 11.00?

This hunk failed:

--- pp.c.old Sat Oct 19 16:21:46 2002
+++ pp.c Tue Nov 5 13:24:06 2002
@@ -3323,12 +3323,13 @@
register U8 *s;
STRLEN slen;

+ SvGETMAGIC(sv);
if (DO_UTF8(sv)) {
U8 tmpbuf[UTF8_MAXLEN_UCLC+1];
STRLEN ulen;
STRLEN tculen;

- s = (U8*)SvPV(sv, slen);
+ s = (U8*)SvPV_nomg(sv, slen);
utf8_to_uvchr(s, &ulen);
 toTITLE\_utf8$s\, tmpbuf\, &tculen$;

..../perl -Ilib -wle '$a = chr (201) . chr 256; chop $a; print $a; $b = $a; $c = $a; $b =~ +s/./lc $&/e; print $b; $c =~ s/./lc $&/e; print $c; '
É
é
é

Well, that one works now, as do all variants except ucfirst. I can't see how
to apply the rejected hunk.

Nicholas Clark

p5pRT · 2002-11-05T14:23:48Z

From nick.ing-simmons@elixent.com

<hv@crypt.org> writes:

Abhijit Menon-Sen <ams@wiw.org> wrote:
:Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first
:call to pp_lc? Fixing that would solve one class of bugs, and help me to
:fix the other (and, I suspect, discover a third).

It is because we're working on $1, which is magic, but we're checking
utfness before we check for magic.

Could you try some tests with the patch below? This at least starts to
fix things, and may be the whole story.

Hugo
--- pp.c.old Sat Oct 19 16:21:46 2002
+++ pp.c Tue Nov 5 13:24:06 2002
@@ -3323,12 +3323,13 @@
register U8 *s;
STRLEN slen;

+ SvGETMAGIC(sv);
if (DO_UTF8(sv)) {
U8 tmpbuf[UTF8_MAXLEN_UCLC+1];
STRLEN ulen;
STRLEN tculen;

- s = (U8*)SvPV(sv, slen);
+ s = (U8*)SvPV_nomg(sv, slen);

......

And presumably doing one SvGETMAGIC at the top should be faster
than re-doing it in various spots below :-)

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/

p5pRT · 2002-11-05T14:25:57Z

From ams@wiw.org

At 2002-11-05 13:30:00 +0000, hv@crypt.org wrote:

Could you try some tests with the patch below? This at least starts to
fix things, and may be the whole story.

Now I feel stupid.

#18107 (my earlier patch) and #18109 (your patch with one hunk fixed)
fix all four cases in the original bug report, but the swash_init()
problem remains:

$_ = "\N{greek:sigma}foo.bar";
s/(\w+)/uc($1)/ge;
print "broken\n" unless $_ eq "\N{greek:Sigma}FOO.BAR";

-- ams

p5pRT · 2002-11-05T14:44:48Z

From @nwc10

On Tue, Nov 05, 2002 at 07:55:23PM +0530, Abhijit Menon-Sen wrote:

At 2002-11-05 13:30:00 +0000, hv@crypt.org wrote:

Could you try some tests with the patch below? This at least starts to
fix things, and may be the whole story.

Now I feel stupid.

#18107 (my earlier patch) and #18109 (your patch with one hunk fixed)
fix all four cases in the original bug report, but the swash_init()
problem remains:
$\_ = "\\N\{greek&#8203;:sigma\}foo\.bar";
s/$\\w\+$/uc$$1$/ge;
print "broken\\n" unless $\_ eq "\\N\{greek&#8203;:Sigma\}FOO\.BAR";

Bah. You need full unicode for your example. That's tantamount to cheating
Can I do it in mere 8 bit? No. I can do better than that. Here we go
(wrong) in glorious 7 bit ASCII:

perl5.00503 -lwe '$_ = "foo.bar" . chr 256; chop $_; s/(\w+)/uc($1)/ge; print $_'
FOO.BAR

..../perl -Ilib -lwe '$_ = "foo.bar" . chr 256; chop $_; s/(\w+)/uc($1)/ge; print $_'
FOO.B0B

Who is this B0B character? :-(

Nicholas Clark

PS I'm not really using chr 256. It's just a quantum effect until it meets
its anticharacter, chr -256, and annihilates. :-) [AKA I need a way to
set the utf8 flag on an ASCII scalar]

p5pRT · 2002-11-05T15:31:38Z

From @hvds

Abhijit Menon-Sen <ams@wiw.org> wrote:
:At 2002-11-05 13:30:00 +0000, hv@crypt.org wrote:
:> Could you try some tests with the patch below? This at least starts to
:> fix things, and may be the whole story.
:
:Now I feel stupid.

Happens to the best of us. :)

:#18107 (my earlier patch) and #18109 (your patch with one hunk fixed)

That one hunk needed fixing only because #18107 changed the context,
right?

:fix all four cases in the original bug report, but the swash_init()
:problem remains:
:
: $_ = "\N{greek:sigma}foo.bar";
: s/(\w+)/uc($1)/ge;
: print "broken\n" unless $_ eq "\N{greek:Sigma}FOO.BAR";

I'll try to look at that one later.

Hugo

p5pRT · 2002-11-05T15:32:02Z

From @hvds

"H.Merijn Brand" <h.m.brand@hccnet.nl> wrote:
:> Could you try some tests with the patch below? This at least starts to
:> fix things, and may be the whole story.
:
:What kind of tests? A full smoke for HP-UX 11.00?

No, I just meant 'try some lc/uc/s{}{}e combinations'. I'm sure the
smoke will out in any case.

Hugo

p5pRT · 2002-11-05T16:41:51Z

From nick.ing-simmons@elixent.com

<hv@crypt.org> writes:

Abhijit Menon-Sen <ams@wiw.org> wrote:
:At 2002-11-05 13:30:00 +0000, hv@crypt.org wrote:
:> Could you try some tests with the patch below? This at least starts to
:> fix things, and may be the whole story.
:
:Now I feel stupid.

Happens to the best of us. :)

With the swash code it happens to _all_ of us.

For the uninitiated SWASHes are hashes that represent property data
from the Unicode tables. The hashes are consulted by the C code to
to upper/lower case and similar things. BUT the hashes are (lazily) populated
by running perl code which uses regular expressions to poke about
in the Unicode data files. So we re-enter lots of things (like the RE engine)
which were not really expecting it. So in the swash code we attempt to save
state - call the perl code - and then restore state.

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/

p5pRT · 2002-11-05T16:53:55Z

From @andk

On Tue, 05 Nov 2002 16:41:21 +0000, Nick Ing-Simmons <nick.ing-simmons@elixent.com> said:

> <hv@crypt.org> writes:

Abhijit Menon-Sen <ams@wiw.org> wrote:
:At 2002-11-05 13:30:00 +0000, hv@crypt.org wrote:
:> Could you try some tests with the patch below? This at least starts to
:> fix things, and may be the whole story.
:
:Now I feel stupid.

Happens to the best of us. :)

> With the swash code it happens to _all_ of us.

> For the uninitiated SWASHes are hashes that represent property data
> from the Unicode tables. The hashes are consulted by the C code to
> to upper/lower case and similar things. BUT the hashes are (lazily) populated
> by running perl code which uses regular expressions to poke about
> in the Unicode data files. So we re-enter lots of things (like the RE engine)
> which were not really expecting it. So in the swash code we attempt to save
> state - call the perl code - and then restore state.

While you are at it, may I ask you what you think about this:

when code is running in the debugger and triggers that swash code, it
may happen that things become really sloooow. I've seen this but
couldn't write a simple test case, so I delayed talking about it. Do
you see any possibility to disable the debugger for *this* perl code?

Thanks for any insight,
--
andreas

p5pRT · 2002-11-06T14:09:06Z

From ams@wiw.org

At 2002-11-05 14:44:28 +0000, nick@ccl4.org wrote:

../perl -Ilib -lwe '$_ = "foo.bar" . chr 256; chop $_; s/(\w+)/uc($1)/ge; print $_'
FOO.B0B

Who is this B0B character? :-(

Nicholas, say goodbye to B0B.

The appended patch teaches save_re_context() to save the existing $1..$n
SV's, and fixes all the problems I've seen so far from the original bug
report. I'll apply it shortly, unless somebody has a better suggestion
for how to fetch the SV's to be saved.

(By the way, I liked p4d2p. Can p4genpatch be made to read from stdin?)

-- ams

==== //depot/perl/regcomp.c#311 - /home/ams/build/perl/current/regcomp.c ====
@@ -5070,6 +5070,23 @@
SAVEVPTR(PL_reg_curpm); /* from regexec.c */
SAVEI32(PL_regnpar); /* () count. */
SAVEI32(PL_regsize); /* from regexec.c */
+
+ {
+ /* Save $1..$n (#18107: UTF-8 s/(\w+)/uc($1)/e); AMS 20021106. */
+ int i;
+ GV *mgv;
+ REGEXP *rx;
+ char digits[16];
+
+ if (PL_curpm && (rx = PM_GETRE(PL_curpm))) {
+ for (i = 1; i <= rx->nparens; i++) {
+ sprintf(digits, "%lu", i);
+ if ((mgv = gv_fetchpv(digits, FALSE, SVt_PV)))
+ save_scalar(mgv);
+ }
+ }
+ }
+
#ifdef DEBUGGING
SAVEPPTR(PL_reg_starttry); /* from regexec.c */
#endif

==== //depot/perl/t/op/lc.t#10 - /home/ams/build/perl/current/t/op/lc.t ====
@@ -1,6 +1,6 @@
#!./perl

-print "1..51\n";
+print "1..55\n";

my $test = 1;

@@ -136,3 +136,18 @@
ok(uc("\x{1C5}") eq "\x{1C4}", "U+01C5 uc is U+01C4");
ok(uc("\x{1C6}") eq "\x{1C4}", "U+01C6 uc is U+01C4, too");

+# #18107: A host of bugs involving [ul]c{,first}. AMS 20021106
+$a = "\x{3c3}foo.bar"; # \x{3c3} == GREEK SMALL LETTER SIGMA.
+$b = "\x{3a3}FOO.BAR"; # \x{3a3} == GREEK CAPITAL LETTER SIGMA.
+
+($c = $b) =~ s/(\w+)/lc($1)/ge;
+ok($c eq $a, "Using s///e to change case.");
+
+($c = $a) =~ s/(\w+)/uc($1)/ge;
+ok($c eq $b, "Using s///e to change case.");
+
+($c = $b) =~ s/(\w+)/lcfirst($1)/ge;
+ok($c eq "\x{3c3}FOO.bAR", "Using s///e to change case.");
+
+($c = $a) =~ s/(\w+)/ucfirst($1)/ge;
+ok($c eq "\x{3a3}foo.Bar", "Using s///e to change case.");

p5pRT · 2002-11-06T15:07:37Z

From @andk

On Wed, 6 Nov 2002 19:38:11 +0530, Abhijit Menon-Sen <ams@wiw.org> said:

> (By the way, I liked p4d2p. Can p4genpatch be made to read from stdin?)

That would do it without re-indentation (to keep the patch short). Not
well tested.

* accepts more than one patchnumber on commandline
* reads patchnumbers from STDIN, one per line

--
andreas

Inline Patch

--- p4genpatch@18113	Wed Nov  6 16:01:13 2002
+++ p4genpatch	Wed Nov  6 15:58:26 2002
@@ -20,7 +20,7 @@
 sub Usage ();
 
 $0 =~ s|^.*[\\/]||;
-my $VERSION = '0.05';
+my $VERSION = '0.06';
 my $TOPDIR = cwd();
 my @P4opt;
 our %OPT = ( "d" => "u", b => "//depot/perl", "D" => "diff" );
@@ -28,13 +28,17 @@
 GetOptions(\%OPT, "b=s", "p=s", "d=s", "D=s", "h", "v", "V") or die Usage;
 print Usage and exit if $OPT{h};
 print "$VERSION\n" and exit if $OPT{V};
-die Usage unless @ARGV == 1 && $ARGV[0] =~ /^\d+$/;
-my $CHANGE = shift;
+unless (@ARGV) {
+  @ARGV = <>;
+  chomp @ARGV;
+}
 
 for my $p4opt (qw(p)) {
   push @P4opt, "-$p4opt $OPT{$p4opt}" if $OPT{$p4opt};
 }
 
+while (my $CHANGE = shift @ARGV) {
+die "Illegal CHANGE[$CHANGE]\n", Usage unless $CHANGE =~ /^\d+$/;
 my $system = "p4 @P4opt describe -s $CHANGE |";
 open my $p4, $system or die "Could not run $system";
 my @action;
@@ -136,6 +140,7 @@
   }
 }
 print "End of Patch.\n";
+}
 
 my($tz_offset);
 sub correctmtime ($$$) {
@@ -155,7 +160,7 @@
 }
 
 sub Usage () {
-    qq{Usage: $0 [OPTIONS] patchnumber
+    qq{Usage: $0 [OPTIONS] [patchnumber]...
 
       -p host:port    p4 port (e.g. myhost:1666)
       -d diffopt      option to pass to diff(1)
@@ -171,6 +176,8 @@
 temporary directory with sensible names and sensible modification
 times and composes a patch to STDOUT using external diff command.
 Requires repository access.
+
+If no files are specified, reads patchnumbers from stdin.
 
 Examples:
           perl $0 12345 | gzip -c > 12345.gz

p5pRT · 2002-11-06T16:34:36Z

From @hvds

Abhijit Menon-Sen <ams@wiw.org> wrote:
:The appended patch teaches save_re_context() to save the existing $1..$n
:SV's, and fixes all the problems I've seen so far from the original bug
:report. I'll apply it shortly, unless somebody has a better suggestion
:for how to fetch the SV's to be saved.

If I correctly understand what this is doing, this feels like an approach
that is in danger of slowing things down substantially. Is there really
no cheaper way to fix this?

Can you give a detailed trace of what's going wrong in a sample testcase?
Is the problem specific to swash creation, or does this also affect other
intruded perl code, such as re_eval?

Hugo

p5pRT · 2002-11-08T05:28:05Z

From @ysth

On Tue, 05 Nov 2002 13:30:00 +0000, hv@crypt.org wrote:

Abhijit Menon-Sen <ams@wiw.org> wrote:
:Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first
:call to pp_lc? Fixing that would solve one class of bugs, and help me to
:fix the other (and, I suspect, discover a third).

It is because we're working on $1, which is magic, but we're checking
utfness before we check for magic.

Could you try some tests with the patch below? This at least starts to
fix things, and may be the whole story.

Nope :( I think SvPV has to come before DO_UTF8 for AMAGIC to work:

~/bleadperl/perl $./perl -Ilib -MTest::More=no_plan -wle'
use overload q{""}=>sub{v256};
for $try ( lc(v256), lc(bless{}) ) {
is(("v".join".",map ord, split //, $try), "v257")
}'
ok 1
not ok 2
# Failed test (-e at line 2)
# got: 'v256'
# expected: 'v257'
1..2
# Looks like you failed 1 tests of 2.

Also, it appears that the utf8 branches for [ul]c(first)? are doing
setmagic on the source, not the target (the non-utf8 branch does
sv = TARG except when suborning the source).

And [ul]cfirst are assuming (in the SvPADTMP true branch) that
the upper/lower-cased first character will have exactly the same
byte length as the original. Is this a valid assumtion?

I'm going to try to come up with first a test patch and then a
code patch.

p5pRT · 2002-11-08T13:00:00Z

From @hvds

sthoenna@efn.org (Yitzchak Scott-Thoennes) wrote:
:On Tue, 05 Nov 2002 13:30:00 +0000, hv@crypt.org wrote:
:>Abhijit Menon-Sen <ams@wiw.org> wrote:
:>:Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first
:>:call to pp_lc? Fixing that would solve one class of bugs, and help me to
:>:fix the other (and, I suspect, discover a third).
:>
:>It is because we're working on $1, which is magic, but we're checking
:>utfness before we check for magic.
:>
:>Could you try some tests with the patch below? This at least starts to
:>fix things, and may be the whole story.
:
:Nope :( I think SvPV has to come before DO_UTF8 for AMAGIC to work:

Ah, yes.

:Also, it appears that the utf8 branches for [ul]c(first)? are doing
:setmagic on the source, not the target (the non-utf8 branch does
:sv = TARG except when suborning the source).

Yup. Though my patch didn't change that.

:And [ul]cfirst are assuming (in the SvPADTMP true branch) that
:the upper/lower-cased first character will have exactly the same
:byte length as the original. Is this a valid assumtion?

I don't think so. Hmm, I wonder if there is also any possibility
that the locale+nonutf8 branch would end up needing a utf8 result.

:I'm going to try to come up with first a test patch and then a
:code patch.

Thank you. In case it is helpful, I attach below the patch that I think
would fix ucfirst; as far as I can tell the C< utf8_to_uvchr(tmpbuf, 0); >
does nothing, and the C< SvUTF8_off(TARG); /* decontaminate */ > should
not be needed before an sv_setsv(). This does at least pass existing
tests and the moral equivalent of your example overloading test.

Hugo

Inline Patch

--- pp.c.old	Tue Nov  5 13:24:06 2002
+++ pp.c	Fri Nov  8 12:55:33 2002
@@ -3320,40 +3320,36 @@
 {
     dSP;
     SV *sv = TOPs;
-    register U8 *s;
     STRLEN slen;
+    register U8 *s = (U8*)SvPV(sv, slen);	/* handles GMAGIC, AMAGIC */
 
-    SvGETMAGIC(sv);
     if (DO_UTF8(sv)) {
 	U8 tmpbuf[UTF8_MAXLEN_UCLC+1];
 	STRLEN ulen;
 	STRLEN tculen;
 
-	s = (U8*)SvPV_nomg(sv, slen);
 	utf8_to_uvchr(s, &ulen);
-
 	toTITLE_utf8(s, tmpbuf, &tculen);
-	utf8_to_uvchr(tmpbuf, 0);
 
-	if (!SvPADTMP(sv) || SvREADONLY(sv)) {
+	if (!SvPADTMP(sv) || SvREADONLY(sv) || ulen != tculen) {
 	    dTARGET;
 	    sv_setpvn(TARG, (char*)tmpbuf, tculen);
 	    sv_catpvn(TARG, (char*)(s + ulen), slen - ulen);
 	    SvUTF8_on(TARG);
 	    SETs(TARG);
+	    sv = TARG;
 	}
 	else {
 	    s = (U8*)SvPV_force_nomg(sv, slen);
-	    Copy(tmpbuf, s, tculen, U8);
+	    Copy(tmpbuf, s, ulen, U8);
 	}
     }
     else {
 	if (!SvPADTMP(sv) || SvREADONLY(sv)) {
 	    dTARGET;
-	    SvUTF8_off(TARG);				/* decontaminate */
 	    sv_setsv_nomg(TARG, sv);
+	    SETs(TARG);
 	    sv = TARG;
-	    SETs(sv);
 	}
 	s = (U8*)SvPV_force_nomg(sv, slen);
 	if (*s) {

p5pRT · 2002-11-11T08:37:26Z

From nick.ing-simmons@elixent.com

Yitzchak Scott-Thoennes <sthoenna@efn.org> writes:

And [ul]cfirst are assuming (in the SvPADTMP true branch) that
the upper/lower-cased first character will have exactly the same
byte length as the original. Is this a valid assumtion?

No.

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/

p5pRT · 2002-11-11T08:41:42Z

From nick.ing-simmons@elixent.com

<hv@crypt.org> writes:

I don't think so. Hmm, I wonder if there is also any possibility
that the locale+nonutf8 branch would end up needing a utf8 result.

There are problem cases. iso-8859-1 0xff (y diaresis) does not have
upper case version anywhere "near" it. Upper-case of accented characters
in some locales (French?) drops the accent and makes it shorter in UTF-8.

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/

p5pRT · 2002-11-14T23:09:25Z

From @ysth

On Fri, 08 Nov 2002 12:59:49 +0000, hv@crypt.org wrote:

sthoenna@efn.org (Yitzchak Scott-Thoennes) wrote:
:On Tue, 05 Nov 2002 13:30:00 +0000, hv@crypt.org wrote:
:>Abhijit Menon-Sen <ams@wiw.org> wrote:
:>:Can anyone (Hugo?) help to explain why SVf_UTF8 isn't set for the first
:>:call to pp_lc? Fixing that would solve one class of bugs, and help me to
:>:fix the other (and, I suspect, discover a third).
:>
:>It is because we're working on $1, which is magic, but we're checking
:>utfness before we check for magic.
:>
:>Could you try some tests with the patch below? This at least starts to
:>fix things, and may be the whole story.
:
:Nope :( I think SvPV has to come before DO_UTF8 for AMAGIC to work:

Ah, yes.

:Also, it appears that the utf8 branches for [ul]c(first)? are doing
:setmagic on the source, not the target (the non-utf8 branch does
:sv = TARG except when suborning the source).

Yup. Though my patch didn't change that.

Sorry, I didn't mean to imply it did. Been that way since at least 5.6.1.

:And [ul]cfirst are assuming (in the SvPADTMP true branch) that
:the upper/lower-cased first character will have exactly the same
:byte length as the original. Is this a valid assumtion?

I don't think so. Hmm, I wonder if there is also any possibility
that the locale+nonutf8 branch would end up needing a utf8 result.

Sounded like NI-S thought so. What needs to be done?
Will toupper(255) ever really return >255?

:I'm going to try to come up with first a test patch and then a
:code patch.

Thank you. In case it is helpful, I attach below the patch that I think
would fix ucfirst; as far as I can tell the C< utf8_to_uvchr(tmpbuf, 0); >
does nothing, and the C< SvUTF8_off(TARG); /* decontaminate */ > should
not be needed before an sv_setsv(). This does at least pass existing
tests and the moral equivalent of your example overloading test.

Thanks. I'm still working on a test patch but have been busy with
other things. Some further notes:

I'm not sure we should even be doing plain sv_setsv. That's passing
through any IV or NV values:

~/bleadperl/perl $./perl -Ilib -MDevel::Peek -MScalar::Util=/./ -we'

Dump dualvar(123,"abc")'
SV = PVNV(0x1034e4a8) at 0x10261164
REFCNT = 1
FLAGS = (TEMP,IOK,POK,pIOK,pPOK)
IV = 123
NV = 0
PV = 0x102989f8 "abc"\0
CUR = 3
LEN = 4

or in pure perl:

~/bleadperl/perl $./miniperl -wle'print length(~ucfirst !1)'
20

(which I think should print 0)

p5pRT · 2002-11-14T23:16:14Z

From @ysth

On Tue, 12 Nov 2002, Yitzchak Scott-Thoennes wrote:

~/bleadperl/perl $./perl -Ilib -MDevel::Peek -MScalar::Util=/./ -we'

Dump dualvar(123,"abc")'
SV = PVNV(0x1034e4a8) at 0x10261164
REFCNT = 1
FLAGS = (TEMP,IOK,POK,pIOK,pPOK)
IV = 123
NV = 0
PV = 0x102989f8 "abc"\0
CUR = 3
LEN = 4

I need to work on my cut-and-pasting skills. That should have been:
~/bleadperl/perl $./perl -Ilib -MDevel::Peek -MScalar::Util=/./ -we'

Dump ucfirst dualvar(123,"abc")'
SV = PVNV(0x1034e4c8) at 0x1027391c
REFCNT = 1
FLAGS = (PADTMP,IOK,POK,pIOK,pPOK)
IV = 123
NV = 0
PV = 0x102989f8 "Abc"\0
CUR = 3
LEN = 4

p5pRT · 2002-11-26T06:11:58Z

From ams@wiw.org

At 2002-11-06 16:32:25 +0000, hv@crypt.org wrote:

The appended patch teaches save_re_context() to save the existing
$1..$n SV's [...]

Is there really no cheaper way to fix this?

I don't see any cheaper way (and as Jarkko has shown, this way doesn't
seem unacceptably expensive). One could skimp and save a few selected
bits instead of the entire scalar, but I don't think that's worth it.

Can you give a detailed trace of what's going wrong in a sample
testcase?

Perl_to_utf8_case (which is called by toLOWER_utf8 and siblings) calls
swatch_init() if the swatch it needs isn't loaded. swatch_init() calls
the SWASHNEW method, which wipes out $1 somewhere in the Perl code.

#0 Perl_call_sv (sv=0x8180454, flags=64) at perl.c:1760
#1 Perl_call_method (methname=0x8120ad2 "SWASHNEW", flags=0)
at perl.c:1743
#2 Perl_swash_init (pkg=0x81209e4 "utf8", name=0x8120a7e "ToUpper",
listsv=0x81302d8, minbits=4, none=0) at utf8.c:1549
#3 Perl_to_utf8_case (p=0x817af90 "foo", ustrp=0xbffff5a4 "\030",
lenp=0xbffff59c, swashp=0x8130be4, normal=0x8120a7e "ToUpper",
special=0x8120a6c "utf8::ToSpecUpper") at utf8.c:1349
#4 Perl_to_utf8_upper (p=0x817af90 "foo", ustrp=0xbffff5a4 "\030",
lenp=0xbffff59c) at utf8.c:1441
#5 Perl_pp_uc () at pp.c:3454

(From perl -le '$_="foo.bar".chr(256);chop;s/([^.]+)/uc($1)/ge;print')

-- ams

p5pRT · 2002-11-26T16:34:48Z

From @jhi

I don't see any cheaper way (and as Jarkko has shown, this way doesn't
seem unacceptably expensive). One could skimp and save a few selected
bits instead of the entire scalar, but I don't think that's worth it.

I tried the speed in two ways:

(1) run op/regexp.t and op/pat.100 both 10 times on an almost zero-load
quad-Alpha box: the user+system time difference was within plus minus
1.4% (the pat.t was slower, but regexp.t was faster with the patch,
but I think the tests were so fast that all that was measuring noise).

(2) Since neither of the above tests really stress test UTF-8
substitutions, I ran also this (on an empty single-cpu x86 box)
(the repetition counts were hand-crafted to give something in the 3..30
seconds range)

use Benchmark;
sub inita { $a = (("abc" x $_[0]) . " ") x $_[1]; utf8::upgrade($a) }
inita( 1, 1000); timethis( 50, '$a =~ s/(\w+)/\U$1/g');
inita( 1, 10000); timethis( 1, '$a =~ s/(\w+)/\U$1/g');
inita( 10, 100); timethis( 200, '$a =~ s/(\w+)/\U$1/g');
inita( 10, 1000); timethis( 10, '$a =~ s/(\w+)/\U$1/g');
inita( 10, 10000); timethis( 1, '$a =~ s/(\w+)/\U$1/g');
inita( 100, 10); timethis( 200, '$a =~ s/(\w+)/\U$1/g');
inita( 100, 100); timethis( 10, '$a =~ s/(\w+)/\U$1/g');
inita( 100, 1000); timethis( 5, '$a =~ s/(\w+)/\U$1/g');
inita( 1000, 1); timethis( 200, '$a =~ s/(\w+)/\U$1/g');
inita( 10000, 1); timethis( 20, '$a =~ s/(\w+)/\U$1/g');

The results of 2+2 runs cleaned up:

a) b) c) d) with1 with2 out1 out2
1 1000 4000 50 3.90 3.89 3.97 3.95
1 10000 40000 1 3.44 3.44 3.45 3.45
10 100 3100 200 5.72 5.71 6.11 6.11
10 1000 31000 10 5.04 5.03 5.25 5.25
10 10000 310000 1 27.01 27.19 27.11 27.46
100 10 3010 200 4.98 4.97 5.38 5.37
100 100 30100 10 2.73 2.73 2.96 2.96
100 1000 301000 5 26.75 26.70 26.36 27.95
1000 1 3001 200 4.86 4.85 5.30 5.31
10000 1 30001 20 4.96 4.96 5.45 5.46

a) number of "abc":s in one (\w+) run
b) number of "abc... ":in $a
c) total size of $a
d) number of repetitions

Yes, quite consistently *with* the patch things get *faster*.
No, don't ask me *why*.

--
Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen

p5pRT · 2002-12-06T23:35:28Z

From @jhi

Since no better ideas have come along, I'm applying Abhijit's patch
the on maint-5.8/perl branch and marking the issue as resolved.

p5pRT · 2002-12-06T23:35:30Z

@jhi - Status changed from 'new' to 'resolved'

p5pRT · 2002-12-09T01:13:34Z

From @hvds

Abhijit Menon-Sen <ams@wiw.org> wrote:
:At 2002-11-05 14:44:28 +0000, nick@ccl4.org wrote:
:>
:> ../perl -Ilib -lwe '$_ = "foo.bar" . chr 256; chop $_; s/(\w+)/uc($1)/ge; print $_'
:> FOO.B0B
:>
:> Who is this B0B character? :-(
:
:Nicholas, say goodbye to B0B.
:
:The appended patch teaches save_re_context() to save the existing $1..$n
:SV's, and fixes all the problems I've seen so far from the original bug
:report. I'll apply it shortly, unless somebody has a better suggestion
:for how to fetch the SV's to be saved.

Thanks, belatedly applied as #18266.

Hugo

p5pRT · 2002-12-09T04:07:11Z

@jhi - Status changed from 'open' to 'resolved'

p5pRT closed this as completed Dec 9, 2002

p5pRT added the Severity High label Oct 18, 2019

p5pRT added affects-5.8 distro-freebsd type-core labels Oct 18, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lc(), uc() and ucfirst() broken inside utf8 regex #6038

lc(), uc() and ucfirst() broken inside utf8 regex #6038

p5pRT commented Oct 28, 2002

p5pRT commented Oct 28, 2002

p5pRT commented Nov 5, 2002

p5pRT commented Nov 5, 2002

p5pRT commented Nov 5, 2002

p5pRT commented Nov 5, 2002

p5pRT commented Nov 5, 2002

p5pRT commented Nov 5, 2002

p5pRT commented Nov 5, 2002

p5pRT commented Nov 5, 2002

p5pRT commented Nov 5, 2002

p5pRT commented Nov 5, 2002

p5pRT commented Nov 5, 2002

p5pRT commented Nov 5, 2002

p5pRT commented Nov 6, 2002

p5pRT commented Nov 6, 2002

p5pRT commented Nov 6, 2002

p5pRT commented Nov 8, 2002

p5pRT commented Nov 8, 2002

p5pRT commented Nov 11, 2002

p5pRT commented Nov 11, 2002

p5pRT commented Nov 14, 2002

p5pRT commented Nov 14, 2002

p5pRT commented Nov 26, 2002

p5pRT commented Nov 26, 2002

p5pRT commented Dec 6, 2002

p5pRT commented Dec 6, 2002

p5pRT commented Dec 9, 2002

p5pRT commented Dec 9, 2002

lc(), uc() and ucfirst() broken inside utf8 regex #6038

lc(), uc() and ucfirst() broken inside utf8 regex #6038

Comments

p5pRT commented Oct 28, 2002

p5pRT commented Oct 28, 2002

From @audreyt

Created by autrijus@autrijus.org

p5pRT commented Nov 5, 2002

From ams@wiw.org

p5pRT commented Nov 5, 2002

From @nwc10

p5pRT commented Nov 5, 2002

From @hvds

p5pRT commented Nov 5, 2002

From @Tux

p5pRT commented Nov 5, 2002

From @nwc10

p5pRT commented Nov 5, 2002

From nick.ing-simmons@elixent.com

p5pRT commented Nov 5, 2002

From ams@wiw.org

p5pRT commented Nov 5, 2002

From @nwc10

p5pRT commented Nov 5, 2002

From @hvds

p5pRT commented Nov 5, 2002

From @hvds

p5pRT commented Nov 5, 2002

From nick.ing-simmons@elixent.com

p5pRT commented Nov 5, 2002

From @andk

p5pRT commented Nov 6, 2002

From ams@wiw.org

p5pRT commented Nov 6, 2002

From @andk

p5pRT commented Nov 6, 2002

From @hvds

p5pRT commented Nov 8, 2002

From @ysth

p5pRT commented Nov 8, 2002

From @hvds

p5pRT commented Nov 11, 2002

From nick.ing-simmons@elixent.com

p5pRT commented Nov 11, 2002

From nick.ing-simmons@elixent.com

p5pRT commented Nov 14, 2002

From @ysth

p5pRT commented Nov 14, 2002

From @ysth

p5pRT commented Nov 26, 2002

From ams@wiw.org

p5pRT commented Nov 26, 2002

From @jhi

p5pRT commented Dec 6, 2002

From @jhi

p5pRT commented Dec 6, 2002

p5pRT commented Dec 9, 2002

From @hvds

p5pRT commented Dec 9, 2002