Substr giving wrong results on $1 with utf8 #12321

p5pRT · 2012-08-07T06:49:16Z

Migrated from rt.perl.org#114410 (status was 'resolved')

Searchable as RT114410$

p5pRT · 2012-08-07T06:49:16Z

From choroba@matfyz.cz

Created by choroba@matfyz.cz

Running substr($1, 0, 1) gives strange results when matching a utf8
string read from STDIN: sometimes, the string is longer, sometimes, it
contains the EF-BF-BD replacement character.

The following code should demonstrate the problem:

use strict;
use warnings;

open my $PL, '>', 'utf2.pl' or die $!;
print {$PL} << '__PL__';
##########################################################
use strict;
use warnings;

binmode STDOUT, ':utf8';
binmode STDIN, ':utf8';

while (my $line = <>) {
if (my ($word) = $line =~ /^(.+)$/) {
my $one = substr($1, 0, 1); # doesn't work
my $w_one = substr($word, 0, 1); # works
print "'$one' != '$w_one'\tat $line" unless $one eq $w_one;
}
}
##########################################################
__PL__
close $PL;

open my $OUT1, '>', 'utf1' or die $!;
print {$OUT1} map chr hex, qw/61 61 c5 99 0a c4 8d 0a 61 61 c5 99 0a/;
close $OUT1;
open my $OUT2, '>', 'utf2' or die $!;
print {$OUT2} map chr hex, qw/c4 8d 0a 61 61 c5 99 0a c4 8d 0a/;
close $OUT2;

system "$^X utf2.pl < utf1";
print "\n";
system "$^X utf2.pl < utf2";

__END__

Perl Info


Flags:
     category=core
     severity=medium

Site configuration information for perl 5.17.3:

Configured by choroba at Mon Aug  6 22:34:00 CEST 2012.

Summary of my perl5 (revision 5 version 17 subversion 3) configuration:
   Commit id: fbfa7c02afa6e3e6975eb25b333402cf754833e3
   Platform:
     osname=linux, osvers=3.1.10-1.16-desktop, archname=x86_64-linux-thread-multi
     uname='linux still 3.1.10-1.16-desktop #1 smp preempt wed jun 27 05:21:40 utc 2012 (d016078) x86_64 x86_64 x86_64 gnulinux '
     config_args='-Dprefix=/home/choroba/localperl'
     hint=recommended, useposix=true, d_sigaction=define
     useithreads=define, usemultiplicity=define
     useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
     use64bitint=define, use64bitall=define, uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
     optimize='-O2',
     cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
     ccversion='', gccversion='4.6.2', gccosandvers=''
     intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
     ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
     alignbytes=8, prototype=define
   Linker and Libraries:
     ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
     libpth=/usr/local/lib /lib/../lib64 /usr/lib/../lib64 /lib /usr/lib /lib64 /usr/lib64 /usr/local/lib64
     libs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
     perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
     libc=/lib/libc-2.14.1.so, so=so, useshrplib=false, libperl=libperl.a
     gnulibc_version='2.14.1'
   Dynamic Linking:
     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
     cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'

Locally applied patches:



@INC for perl 5.17.3:
     /home/choroba/perl5/lib/perl5/x86_64-linux-thread-multi
     /home/choroba/perl5/lib/perl5/x86_64-linux-thread-multi
     /home/choroba/perl5/lib/perl5
     /home/choroba/perl5/lib/perl5/x86_64-linux-thread-multi
     /home/choroba/perl5/lib/perl5/x86_64-linux-thread-multi
     /home/choroba/perl5/lib/perl5
     /home/choroba/localperl/lib/site_perl/5.17.3/x86_64-linux-thread-multi
     /home/choroba/localperl/lib/site_perl/5.17.3
     /home/choroba/localperl/lib/5.17.3/x86_64-linux-thread-multi
     /home/choroba/localperl/lib/5.17.3
     .


Environment for perl 5.17.3:
     HOME=/home/choroba
     LANG=cs_CZ.UTF-8
     LANGUAGE=
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     PATH=/home/choroba/perl5/bin:/home/choroba/bin:/bin:/sbin:/usr/bin:/usr/local/bin:/usr/games:/usr/X11R6/bin:/opt/gnome/bin:/home/choroba/perl/bin:.
     PERL5LIB=/home/choroba/perl5/lib/perl5/x86_64-linux-thread-multi:/home/choroba/perl5/lib/perl5:/home/choroba/perl5/lib/perl5/x86_64-linux-thread-multi:/home/choroba/perl5/lib/perl5:
     PERL_BADLANG (unset)
     PERL_LOCAL_LIB_ROOT=/home/choroba/perl5
     PERL_MB_OPT=--install_base /home/choroba/perl5
     PERL_MM_OPT=INSTALL_BASE=/home/choroba/perl5
     SHELL=/bin/bash

p5pRT · 2012-08-30T20:24:38Z

From @cpansprout

On Mon Aug 06 23:49:16 2012, choroba@matfyz.cz wrote:

Running substr($1, 0, 1) gives strange results when matching a utf8
string read from STDIN: sometimes, the string is longer, sometimes, it
contains the EF-BF-BD replacement character.

The following code should demonstrate the problem:
...

It doesn’t have to come from input. Here is a simpler example:

"\x{100}" =~ /(.+)/;
substr $1, 0, 1;
"a\x{100}" =~ /(.+)/;
warn ord substr $1, 1, 1;

And here is the output in various perl versions (5.17.3 is actually
v5.17.2-146-g7e2a0d4 and 5.17.4 is v5.17.3-139-g61dad97):

$ pbpaste|perl5.8.1
256 at - line 4.
$ pbpaste|perl5.8.9
256 at - line 4.
$ pbpaste|perl5.10.0
Malformed UTF-8 character (unexpected continuation byte 0x80, with no
preceding start byte) in ord at - line 4.
0 at - line 4.
$ pbpaste|perl5.10.1
Malformed UTF-8 character (unexpected continuation byte 0x80, with no
preceding start byte) in ord at - line 4.
0 at - line 4.
$ pbpaste|perl5.14.1
0 at - line 4.
$ pbpaste|perl5.16.0
0 at - line 4.
$ pbpaste|perl5.17.3
0 at - line 4.
$ pbpaste|perl5.17.4
Wide character in substr at - line 4.
panic: sv_pos_u2b_cache cache 3 real 1 for aĀ at - line 4.

What’s happening is that the utf-8 length/pos cache is becoming stale
without being reset.

The first substr results in pos information being cached. The second
pattern match changes the content of $1 (actually changes what $1 points
to underneath; $1’s contents are not updated until it is read). The
second substr reuses the cache that is still there.

I suspect we need to rethink the way the magic mechanism interacts with
utf8 caching.

It affects tied variables as well:

$y = "a\x{100}";
sub TIESCALAR{bless[]}
sub FETCH{$y}
tie $x, "";
warn ord substr $x, 0, 1;
$y = "\x{100}";
warn ord substr $x, 0, 1;
__END__
1 at - line 5.
Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xc4) in ord at - line 7.
0 at - line 7.

And substr lvalues:

$x = "a\x{100}";
$l = \substr $x, 0;
warn ord substr $$l, 1, 1;
substr $x, 0, 1, = "\x{100}";
warn ord substr $$l, 1, 1;
__END__
256 at - line 3.
Wide character in substr at - line 5.
panic: sv_pos_u2b_cache cache 4 real 2 for ĀĀ at - line 5.

And nonexistent hash elements:

sub {
$_[0] = "a\x{100}";
warn ord substr $_[0], 1, 1;
$h{k} = "\x{100}"x2;
warn ord substr $_[1], 1, 1;
}->($h{k});
__END__
256 at - line 3.
0 at - line 5.

--

Father Chrysostomos

p5pRT · 2012-08-30T20:24:39Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2012-08-31T01:20:12Z

From @cpansprout

I’ve fixed this in commit 7d1328b by reset utf8 caches in mg_get

--

Father Chrysostomos

p5pRT · 2012-08-31T01:20:14Z

@cpansprout - Status changed from 'open' to 'resolved'

p5pRT · 2012-08-31T09:28:08Z

From @nwc10

On Thu, Aug 30, 2012 at 06:20:13PM -0700, Father Chrysostomos via RT wrote:

I've fixed this in commit 7d1328b by reset utf8 caches in mg_get

This seems like a sensible solution. (I can't spot any flaws in the approach)

Historically, if something breaks because of tie, it usually also breaks
with overloading:

$ cat 114410.pl
#!/perl -w
use strict;

package UTF8Toggle;
use strict;

use overload '""' => 'stringify', fallback => 1;

sub new {
my $class = shift;
my $value = shift;
my $state = shift||0;
return bless [$value, $state], $class;
}

sub stringify {
my $self = shift;
$self->[1] = ! $self->[1];
if ($self->[1]) {
utf8::downgrade($self->[0]);
} else {
utf8::upgrade($self->[0]);
}
$self->[0];
}

package main;

my $u = UTF8Toggle->new(" \x{c2}7 ");

printf "%d\n", ord substr $u, 1;
printf "%d\n", ord substr $u, 1;

__END__
$ ./perl -Ilib 114410.pl
194
panic: sv_pos_u2b_cache cache 5 real 4 for Â7 at 114410.pl line 32.

I'm not sure what the best fix is here. Given that I'd been wondering if the
fix for #114410 was to outlaw caching of tied values, but simply expiring
the cache on the next read works, is the right fix here to trap all points
that call into overload value returning routines and reset the cache?

Nicholas Clark

p5pRT · 2012-08-31T15:29:47Z

From @cpansprout

On Fri Aug 31 02:28:08 2012, nicholas wrote:

On Thu, Aug 30, 2012 at 06:20:13PM -0700, Father Chrysostomos via RT
wrote:

I've fixed this in commit 7d1328b by reset utf8 caches in mg_get

This seems like a sensible solution. (I can't spot any flaws in the
approach)

Historically, if something breaks because of tie, it usually also
breaks
with overloading:

$ cat 114410.pl
#!/perl -w
use strict;

package UTF8Toggle;
use strict;

use overload '""' => 'stringify', fallback => 1;

sub new {
my $class = shift;
my $value = shift;
my $state = shift||0;
return bless [$value, $state], $class;
}

sub stringify {
my $self = shift;
$self->[1] = ! $self->[1];
if ($self->[1]) {
utf8::downgrade($self->[0]);
} else {
utf8::upgrade($self->[0]);
}
$self->[0];
}

package main;

my $u = UTF8Toggle->new(" \x{c2}7 ");

printf "%d\n", ord substr $u, 1;
printf "%d\n", ord substr $u, 1;

__END__
$ ./perl -Ilib 114410.pl
194
panic: sv_pos_u2b_cache cache 5 real 4 for �7 at 114410.pl line 32.

I'm not sure what the best fix is here. Given that I'd been wondering
if the
fix for #114410 was to outlaw caching of tied values, but simply
expiring
the cache on the next read works, is the right fix here to trap all
points
that call into overload value returning routines and reset the cache?

As you may have noticed (if not: git log d8f2f09), I went searching
for instances of sv_len_utf8 that were incorrect.

In the process of doing so, I noticed this in a few places (this one in
pp_sys.c:pp_syswrite):

if (SvGMAGICAL(bufsv) || SvAMAGIC(bufsv)) {
/* Don't call sv_len_utf8 again because it will call magic
or overloading a second time, and we might get back a
different result. */
blen_chars = utf8_length((U8*)buffer, (U8*)buffer + blen);
} else {
/* It's safe, and it may well be cached. */
blen_chars = sv_len_utf8(bufsv);
}

And I slept on it and came to the conclusion that overload couldn’t work
correctly despite my changes (which you have now demonstrated).

I’m also wondering whether it’s even worth creating the utf8 cache to
begin with on magical values, as it will be invalidated almost
immediately. It seems that the extra work to facilitate an optimisation
actually slows things down.

In that case, my sv_len_utf8_nomg should go, and we should have a macro
that does what syswrite already does.

--

Father Chrysostomos

p5pRT · 2012-08-31T19:34:39Z

@cpansprout - Status changed from 'resolved' to 'open'

p5pRT · 2012-08-31T19:35:38Z

From @cpansprout

I’m reopening this as there are still unresolved issues.

--

Father Chrysostomos

p5pRT · 2012-09-06T13:06:51Z

From @nwc10

On Fri, Aug 31, 2012 at 08:29:48AM -0700, Father Chrysostomos via RT wrote:

I'm also wondering whether it's even worth creating the utf8 cache to
begin with on magical values, as it will be invalidated almost
immediately. It seems that the extra work to facilitate an optimisation
actually slows things down.

I'm not sure if any code relies for performance on being able to call it
twice on the same string. I thought that pp_substr() did, but actually
that's taking advantage of the fact that the call takes two positions to
convert.

pp_index() calls sv_pos_u2b() and sv_pos_b2c() on the same SV.
I'm not sure whether (a) that matters (b) whether they can be condensed into
one call.

If no op or clearly related code path ends up calling more than one function
twice, then it seems to make most sense to avoid actually storing the cache
structure on the SV for "active" values (magic and overloading - I think
that's all of them).

Nicholas Clark

p5pRT · 2012-09-09T05:31:40Z

From @cpansprout

On Thu Sep 06 06:06:51 2012, nicholas wrote:

On Fri, Aug 31, 2012 at 08:29:48AM -0700, Father Chrysostomos via RT
wrote:

I'm also wondering whether it's even worth creating the utf8 cache to
begin with on magical values, as it will be invalidated almost
immediately. It seems that the extra work to facilitate an optimisation
actually slows things down.

I'm not sure if any code relies for performance on being able to call it
twice on the same string. I thought that pp_substr() did, but actually
that's taking advantage of the fact that the call takes two positions to
convert.

pp_index() calls sv_pos_u2b() and sv_pos_b2c() on the same SV.
I'm not sure whether (a) that matters (b) whether they can be
condensed into
one call.

If no op or clearly related code path ends up calling more than one
function
twice, then it seems to make most sense to avoid actually storing the
cache
structure on the SV for "active" values (magic and overloading - I think
that's all of them).

Even if it does call it twice, doing so on overloading is still
incorrect, as it will call overloading twice.

So I think what I suggested is the correct fix.

--

Father Chrysostomos

p5pRT · 2012-10-01T19:59:12Z

From @cpansprout

I have resolved the overloading issues in the series of commits ending
with 2b674b0 (which includes many other fixes, too).

--

Father Chrysostomos

p5pRT · 2012-10-01T19:59:12Z

From [Unknown Contact. See original ticket]

I have resolved the overloading issues in the series of commits ending
with 2b674b0 (which includes many other fixes, too).

--

Father Chrysostomos

p5pRT · 2012-10-01T19:59:12Z

@cpansprout - Status changed from 'open' to 'resolved'

p5pRT closed this as completed Oct 1, 2012

p5pRT added Severity Medium distro-Linux type-core labels Oct 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Substr giving wrong results on $1 with utf8 #12321

Substr giving wrong results on $1 with utf8 #12321

p5pRT commented Aug 7, 2012

p5pRT commented Aug 7, 2012

p5pRT commented Aug 30, 2012

p5pRT commented Aug 30, 2012

p5pRT commented Aug 31, 2012

p5pRT commented Aug 31, 2012

p5pRT commented Aug 31, 2012

p5pRT commented Aug 31, 2012

p5pRT commented Aug 31, 2012

p5pRT commented Aug 31, 2012

p5pRT commented Sep 6, 2012

p5pRT commented Sep 9, 2012

p5pRT commented Oct 1, 2012

p5pRT commented Oct 1, 2012

p5pRT commented Oct 1, 2012

Substr giving wrong results on $1 with utf8 #12321

Substr giving wrong results on $1 with utf8 #12321

Comments

p5pRT commented Aug 7, 2012

p5pRT commented Aug 7, 2012

From choroba@matfyz.cz

Created by choroba@matfyz.cz

p5pRT commented Aug 30, 2012

From @cpansprout

p5pRT commented Aug 30, 2012

p5pRT commented Aug 31, 2012

From @cpansprout

p5pRT commented Aug 31, 2012

p5pRT commented Aug 31, 2012

From @nwc10

p5pRT commented Aug 31, 2012

From @cpansprout

p5pRT commented Aug 31, 2012

p5pRT commented Aug 31, 2012

From @cpansprout

p5pRT commented Sep 6, 2012

From @nwc10

p5pRT commented Sep 9, 2012

From @cpansprout

p5pRT commented Oct 1, 2012

From @cpansprout

p5pRT commented Oct 1, 2012

From [Unknown Contact. See original ticket]

p5pRT commented Oct 1, 2012