Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Length-caching bug in utf8::decode #10873

Closed
p5pRT opened this issue Dec 3, 2010 · 12 comments
Closed

Length-caching bug in utf8::decode #10873

p5pRT opened this issue Dec 3, 2010 · 12 comments

Comments

@p5pRT
Copy link

p5pRT commented Dec 3, 2010

Migrated from rt.perl.org#80190 (status was 'resolved')

Searchable as RT80190$

@p5pRT
Copy link
Author

p5pRT commented Dec 3, 2010

From @ikegami

Created by @ikegami

There's a length caching bug in utf8​::decode.

----- BEGIN TEST CODE -----
#!/usr/bin/perl
use strict;
use warnings;

use Test​::More tests => 8;

{ # Baseline.
  my $s = "\xE8\xAB\x86\x0A";
  utf8​::downgrade($s); is(length($s), 4); is($s, "\xE8\xAB\x86\x0A");
  utf8​::decode($s); is(length($s), 2); is($s, "\x{8AC6}\n");
}
{ # Check for length-caching bug.
  my $s = "\xE8\xAB\x86\x0A";
  utf8​::upgrade($s); is(length($s), 4); is($s, "\xE8\xAB\x86\x0A");
  utf8​::decode($s); is(length($s), 2); is($s, "\x{8AC6}\n");
}
----- END TEST CODE -----

----- BEGIN TEST OUTPUT -----
1..8
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
not ok 7
# Failed test at a.pl line 15.
# got​: '4'
# expected​: '2'
ok 8
# Looks like you failed 1 test of 8.
----- END TEST OUTPUT -----

davem says it's still present in blead.

Perl Info

Flags:
    category=library
    severity=low
    module=utf8

Site configuration information for perl 5.12.2:

Configured by eric at Wed Sep  8 16:01:41 EDT 2010.

Summary of my perl5 (revision 5 version 12 subversion 2) configuration:

  Platform:
    osname=linux, osvers=2.6.26-2-686, archname=i686-linux
    uname='linux fmdev10 2.6.26-2-686 #1 smp tue mar 9 17:35:51 utc 2010
i686 gnulinux '
    config_args='-de -Dprefix=/home/eric/usr/perlbrew/perls/perl-5.12.2'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector
-I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector
-I/usr/local/include'
    ccversion='', gccversion='4.3.2', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.7.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.7'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib
-fstack-protector'

Locally applied patches:



@INC for perl 5.12.2:

/home/eric/usr/perlbrew/perls/perl-5.12.2/lib/site_perl/5.12.2/i686-linux
    /home/eric/usr/perlbrew/perls/perl-5.12.2/lib/site_perl/5.12.2
    /home/eric/usr/perlbrew/perls/perl-5.12.2/lib/5.12.2/i686-linux
    /home/eric/usr/perlbrew/perls/perl-5.12.2/lib/5.12.2
    .


Environment for perl 5.12.2:
    HOME=/home/eric
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)

PATH=/home/eric/usr/perlbrew/bin:/home/eric/usr/perlbrew/perls/current/bin:.:/home/eric/bin:/usr/local/bin:/usr/bin:/bin:/usr/games
    PERLBREW_ROOT=/home/eric/usr/perlbrew
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Dec 4, 2010

From @ikegami

On Fri, Dec 3, 2010 at 1​:41 PM, Eric Brine <perlbug-followup@​perl.org>wrote​:

# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=80190 >
There's a length caching bug in utf8​::decode.

I don't know how to invalidate the cache, but it should be done after
SvUTF8_on(sv) in Perl_sv_utf8_decode in sv.c.

@p5pRT
Copy link
Author

p5pRT commented Dec 6, 2010

From @tonycoz

On Fri, Dec 03, 2010 at 07​:22​:34PM -0500, Eric Brine wrote​:

On Fri, Dec 3, 2010 at 1​:41 PM, Eric Brine <perlbug-followup@​perl.org>wrote​:

# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=80190 >
There's a length caching bug in utf8​::decode.

I don't know how to invalidate the cache, but it should be done after
SvUTF8_on(sv) in Perl_sv_utf8_decode in sv.c.

Looks like​:

  MAGIC *mg = NULL;
  utf8_mg_len_cache_update(sv, &mg, -1);

except that it add the magic if it wasn't already there. Maybe​:

  MAGIC *mg = mg_find(sv, PERL_MAGIC_utf8);
  if (mg)
  utf8_mg_len_cache_update(sv, &mg, -1);

Looks like the pos() cache should also be updated/cleared.

Tony

@p5pRT
Copy link
Author

p5pRT commented Dec 6, 2010

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Jan 10, 2011

From @tonycoz

On Mon, Dec 06, 2010 at 04​:50​:52PM +1100, Tony Cook wrote​:

On Fri, Dec 03, 2010 at 07​:22​:34PM -0500, Eric Brine wrote​:

On Fri, Dec 3, 2010 at 1​:41 PM, Eric Brine <perlbug-followup@​perl.org>wrote​:

# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=80190 >
There's a length caching bug in utf8​::decode.

I don't know how to invalidate the cache, but it should be done after
SvUTF8_on(sv) in Perl_sv_utf8_decode in sv.c.

Looks like​:

MAGIC *mg = NULL;
utf8_mg_len_cache_update(sv, &mg, -1);

except that it add the magic if it wasn't already there. Maybe​:

MAGIC *mg = mg_find(sv, PERL_MAGIC_utf8);
if (mg)
utf8_mg_len_cache_update(sv, &mg, -1);

Looks like the pos() cache should also be updated/cleared.

Actually, looking at how this is all implemented, the solution could
be as simple as​:

  SvSETMAGIC(sv);

since Perl_magic_setutf8() clears the saved length and pos cache.

Putting that in XS_utf8_decode() in universal.c would be safest in
terms of the least side-effects on other code, but may leave other
code that calls sv_utf8_decode() with the same problem.

Tony

@p5pRT
Copy link
Author

p5pRT commented Jan 10, 2011

From @Leont

On Mon, Jan 10, 2011 at 1​:54 PM, Tony Cook <tony@​develop-help.com> wrote​:

Actually, looking at how this is all implemented, the solution could
be as simple as​:

 SvSETMAGIC(sv);

since Perl_magic_setutf8() clears the saved length and pos cache.

Putting that in XS_utf8_decode() in universal.c would be safest in
terms of the least side-effects on other code, but may leave other
code that calls sv_utf8_decode() with the same problem.

Adding a sv_utf8_decode_flags() may mitigate that a little, though we
certainly don't have a lack of *_flags functions.

Leon

@p5pRT
Copy link
Author

p5pRT commented Jan 26, 2011

From @tonycoz

On Mon, Jan 10, 2011 at 11​:54​:01PM +1100, Tony Cook wrote​:

Actually, looking at how this is all implemented, the solution could
be as simple as​:

SvSETMAGIC(sv);

since Perl_magic_setutf8() clears the saved length and pos cache.

Putting that in XS_utf8_decode() in universal.c would be safest in
terms of the least side-effects on other code, but may leave other
code that calls sv_utf8_decode() with the same problem.

Unfortunately while adding SvSETMAGIC() to the XS fixes the saved
length problem, several utf8 taint tests fail.

Tony

@p5pRT
Copy link
Author

p5pRT commented Jan 26, 2011

From @tonycoz

0001-perl-80190-utf8-.-calls-weren-t-calling-set-magic-co.patch
From 3949878860f4fa25898f4f5f5d2b0fb46fd4a074 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Sun, 23 Jan 2011 19:16:55 +1100
Subject: [PATCH] [perl #80190] utf8::...() calls weren't calling set magic correctly

---
 MANIFEST       |    1 +
 t/uni/length.t |   29 +++++++++++++++++++++++++++++
 universal.c    |    4 ++++
 3 files changed, 34 insertions(+), 0 deletions(-)
 create mode 100644 t/uni/length.t

diff --git a/MANIFEST b/MANIFEST
index 05ec676..cbb24b6 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -5031,6 +5031,7 @@ t/uni/class.t			See if Unicode classes work (\p)
 t/uni/fold.t			See if Unicode folding works
 t/uni/greek.t			See if Unicode in greek works
 t/uni/latin2.t			See if Unicode in latin2 works
+t/uni/length.t			Test length caching around various utf8::*()
 t/uni/lex_utf8.t		See if Unicode in lexer works
 t/uni/lower.t			See if Unicode casing works
 t/uni/overload.t		See if Unicode overloading works
diff --git a/t/uni/length.t b/t/uni/length.t
new file mode 100644
index 0000000..a3bab41
--- /dev/null
+++ b/t/uni/length.t
@@ -0,0 +1,29 @@
+BEGIN {
+    chdir 't' if -d 't';
+    @INC = qw(../lib .);
+    require "test.pl";
+}
+
+$|=1;
+use Devel::Peek;
+plan tests => 14;
+# from RT #80190
+{ # Baseline.
+    my $s = "\xE8\xAB\x86\x0A";
+    utf8::downgrade($s); is(length($s), 4); is($s, "\xE8\xAB\x86\x0A");
+    utf8::decode($s); is(length($s), 2); is($s, "\x{8AC6}\n");
+}
+{ # Check for length-caching bug.
+    my $s = "\xE8\xAB\x86\x0A";
+    utf8::upgrade($s); is(length($s), 4); is($s, "\xE8\xAB\x86\x0A");
+    utf8::decode($s); is(length($s), 2); is($s, "\x{8AC6}\n");
+}
+
+# some other cases
+{
+    my $s = "\x{8AC6}\x0A";
+    is(length($s), 2); is($s, "\x{8AC6}\x0A");
+    utf8::encode($s); is(length($s), 4); is($s, "\xE8\xAB\x86\x0A");
+    utf8::upgrade($s); is(length($s), 4); is($s, "\xE8\xAB\x86\x0A");
+}
+
diff --git a/universal.c b/universal.c
index 07bbe96..f149f74 100644
--- a/universal.c
+++ b/universal.c
@@ -684,6 +684,7 @@ XS(XS_utf8_encode)
     if (items != 1)
 	croak_xs_usage(cv, "sv");
     sv_utf8_encode(ST(0));
+    SvSETMAGIC(ST(0));
     XSRETURN_EMPTY;
 }
 
@@ -696,6 +697,7 @@ XS(XS_utf8_decode)
     else {
 	SV * const sv = ST(0);
 	const bool RETVAL = sv_utf8_decode(sv);
+	SvSETMAGIC(sv);
 	ST(0) = boolSV(RETVAL);
 	sv_2mortal(ST(0));
     }
@@ -714,6 +716,7 @@ XS(XS_utf8_upgrade)
 	dXSTARG;
 
 	RETVAL = sv_utf8_upgrade(sv);
+	SvSETMAGIC(sv);
 	XSprePUSH; PUSHi((IV)RETVAL);
     }
     XSRETURN(1);
@@ -730,6 +733,7 @@ XS(XS_utf8_downgrade)
         const bool failok = (items < 2) ? 0 : (int)SvIV(ST(1));
         const bool RETVAL = sv_utf8_downgrade(sv, failok);
 
+	SvSETMAGIC(sv);
 	ST(0) = boolSV(RETVAL);
 	sv_2mortal(ST(0));
     }
-- 
1.7.1

@p5pRT
Copy link
Author

p5pRT commented Mar 19, 2011

From @iabyn

On Wed, Jan 26, 2011 at 08​:50​:12PM +1100, Tony Cook wrote​:

On Mon, Jan 10, 2011 at 11​:54​:01PM +1100, Tony Cook wrote​:

Actually, looking at how this is all implemented, the solution could
be as simple as​:

SvSETMAGIC(sv);

since Perl_magic_setutf8() clears the saved length and pos cache.

Putting that in XS_utf8_decode() in universal.c would be safest in
terms of the least side-effects on other code, but may leave other
code that calls sv_utf8_decode() with the same problem.

Unfortunately while adding SvSETMAGIC() to the XS fixes the saved
length problem, several utf8 taint tests fail.

I've now fixed it (and pos issues too) with this commit;

commit 75da9d4
Author​: David Mitchell <davem@​iabyn.com>
AuthorDate​: Sat Mar 19 19​:26​:49 2011 +0000
Commit​: David Mitchell <davem@​iabyn.com>
CommitDate​: Sat Mar 19 19​:41​:55 2011 +0000

  reset pos and utf8 cache when de/encoding utf8 str
 
  When using
  utf8​::upgrade
  utf8​::downgrade
  utf8​::encode
  utf8​::decode
  or the underlying C-level functions
  sv_utf8_upgrade_flags_grow
  sv_utf8_downgrade
  sv_utf8_encode
  sv_utf8_decode
  and
  sv_recode_to_utf8
 
  update the position of the pos magic, if any, and clear the utf8
  length/position-mapping cache.
 
  This fixes [perl #80190].

M lib/utf8.t
M sv.c

--
It's not that I'm afraid to die, I just don't want to be there when it
happens.
  -- Woody Allen

@p5pRT
Copy link
Author

p5pRT commented Mar 19, 2011

@iabyn - Status changed from 'open' to 'resolved'

@p5pRT p5pRT closed this as completed Mar 19, 2011
@p5pRT
Copy link
Author

p5pRT commented Sep 28, 2012

From @cpansprout

On Sat Mar 19 12​:45​:44 2011, davem wrote​:

On Wed, Jan 26, 2011 at 08​:50​:12PM +1100, Tony Cook wrote​:

On Mon, Jan 10, 2011 at 11​:54​:01PM +1100, Tony Cook wrote​:

Actually, looking at how this is all implemented, the solution could
be as simple as​:

SvSETMAGIC(sv);

since Perl_magic_setutf8() clears the saved length and pos cache.

Putting that in XS_utf8_decode() in universal.c would be safest in
terms of the least side-effects on other code, but may leave other
code that calls sv_utf8_decode() with the same problem.

Unfortunately while adding SvSETMAGIC() to the XS fixes the saved
length problem, several utf8 taint tests fail.

I've now fixed it (and pos issues too) with this commit;

commit 75da9d4
Author​: David Mitchell <davem@​iabyn.com>
AuthorDate​: Sat Mar 19 19​:26​:49 2011 +0000
Commit​: David Mitchell <davem@​iabyn.com>
CommitDate​: Sat Mar 19 19​:41​:55 2011 +0000

reset pos and utf8 cache when de/encoding utf8 str

When using
    utf8&#8203;::upgrade
    utf8&#8203;::downgrade
    utf8&#8203;::encode
    utf8&#8203;::decode
or the underlying C\-level functions
    sv\_utf8\_upgrade\_flags\_grow
    sv\_utf8\_downgrade
    sv\_utf8\_encode
    sv\_utf8\_decode
and
    sv\_recode\_to\_utf8

update the position of the pos magic\, if any\, and clear the utf8
length/position\-mapping cache\.

This fixes \[perl \#80190\]\.

M lib/utf8.t
M sv.c

The problem with that approach is that it still leaves us with
utf8​::decode not calling STORE on tied variables.

And for pos() to survive a modification to the scalar makes utf8​::decode
quite unique.

I think a more correct solution is to preserve taint magic explicitly
around a call to SvSETMAGIC.

I will do that soon if nobody objects.

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Oct 1, 2012

From @cpansprout

On Fri Sep 28 09​:49​:35 2012, sprout wrote​:

On Sat Mar 19 12​:45​:44 2011, davem wrote​:

On Wed, Jan 26, 2011 at 08​:50​:12PM +1100, Tony Cook wrote​:

On Mon, Jan 10, 2011 at 11​:54​:01PM +1100, Tony Cook wrote​:

Actually, looking at how this is all implemented, the solution could
be as simple as​:

SvSETMAGIC(sv);

since Perl_magic_setutf8() clears the saved length and pos cache.

Putting that in XS_utf8_decode() in universal.c would be safest in
terms of the least side-effects on other code, but may leave other
code that calls sv_utf8_decode() with the same problem.

Unfortunately while adding SvSETMAGIC() to the XS fixes the saved
length problem, several utf8 taint tests fail.

I've now fixed it (and pos issues too) with this commit;

commit 75da9d4
Author​: David Mitchell <davem@​iabyn.com>
AuthorDate​: Sat Mar 19 19​:26​:49 2011 +0000
Commit​: David Mitchell <davem@​iabyn.com>
CommitDate​: Sat Mar 19 19​:41​:55 2011 +0000

reset pos and utf8 cache when de/encoding utf8 str

When using
    utf8&#8203;::upgrade
    utf8&#8203;::downgrade
    utf8&#8203;::encode
    utf8&#8203;::decode
or the underlying C\-level functions
    sv\_utf8\_upgrade\_flags\_grow
    sv\_utf8\_downgrade
    sv\_utf8\_encode
    sv\_utf8\_decode
and
    sv\_recode\_to\_utf8

update the position of the pos magic\, if any\, and clear the utf8
length/position\-mapping cache\.

This fixes \[perl \#80190\]\.

M lib/utf8.t
M sv.c

The problem with that approach is that it still leaves us with
utf8​::decode not calling STORE on tied variables.

And for pos() to survive a modification to the scalar makes utf8​::decode
quite unique.

I think a more correct solution is to preserve taint magic explicitly
around a call to SvSETMAGIC.

I will do that soon if nobody objects.

Done with 77fc86e.

--

Father Chrysostomos

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant