Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

substr/UTF8 related problem with perl 5.8.3 on linux #7266

Closed
p5pRT opened this issue Apr 25, 2004 · 7 comments
Closed

substr/UTF8 related problem with perl 5.8.3 on linux #7266

p5pRT opened this issue Apr 25, 2004 · 7 comments

Comments

@p5pRT
Copy link

p5pRT commented Apr 25, 2004

Migrated from rt.perl.org#29149 (status was 'resolved')

Searchable as RT29149$

@p5pRT
Copy link
Author

p5pRT commented Apr 25, 2004

From pajas@ufal.ms.mff.cuni.cz

Created by pajas@ufal.ms.mff.cuni.cz

Under some circumstances, substr($data, $pos, 1) on a UTF8 encoded
string returns more than 1 character (e.g. '" " ' instead of just
'"'), see below.

The test-case at the bottom of the report uses Text​::Balanced to
demonstrate the problem at least on my platform - but Damian Conway
wasn't able to reproduce a similar test case on Mac OS X.

Although I wasn't able to craft a test case that wouldn't use
Text​::Balanced, I'm quite sure this is a CORE problem, because perl
debugger reveals, that with my test case substr call in
Text/Balanced.pm at line 865 which reads as

my $initial = substr($$textref,$oppos,1);

returns '" " ' which is 4 characters, even though a single character
substring was requested. This only happens if $$textref contains some
UTF8 data but there is probably something else in T​::B that triggers
this behavior. If I for example put pos($$textref)=0; before that
line, or make a copy of $$textref and use substr on that copy,
the result is as expected.

I marked this bug report with medium severity since the possibility of
tricking substr to return more data from a string that requested,
could possibly have some security impacts.

The test-case tries to extract a perl-code block { $a." " } from a
string that also contains some double-quoted UTF8 char. It also
demonstrates, that everything works if the UTF8 data is replaced by
pure-ascii data (dropping the UTF8 flag whould have the same effect,
though).

#!/usr/bin/perl
use Text​::Balanced 'extract_codeblock';
use charnames '​:full';

($bad,$good) = (
qq({ \$a." " }; "\N{LATIN SMALL LETTER I WITH ACUTE}"), # BAD​: non-ascii
qq({ \$a." " }; "i") # OK​: ascii char
);
foreach my $data ($bad, $good) {
  foreach (extract_codeblock( $data )) {
  print "--> $_\n";
  }
  print "\n\n";
}

__END__

Thanks,

  Petr

Perl Info

Flags:
    category=core
    severity=medium

This perlbug was built using Perl v5.8.3 - Sun Apr 18 10:56:44 CEST 2004
It is being executed now by  Perl v5.8.3 - Sun Apr 18 10:50:10 CEST 2004.

Site configuration information for perl v5.8.3:

Configured by root at Sun Apr 18 10:50:10 CEST 2004.

Summary of my perl5 (revision 5.0 version 8 subversion 3) configuration:
  Platform:
    osname=linux, osvers=2.4.25, archname=i586-linux-thread-multi
    uname='linux baraq2 2.4.25 #4 thu feb 19 16:50:52 cet 2004 i686 i686 i386 gnulinux '
    config_args='-ds -e -Dprefix=/usr -Dvendorprefix=/usr -Dinstallusrbinperl -Dusethreads -Di_db -Di_dbm -Di_ndbm -Di_gdbm -Duseshrplib=true -Doptimize=-O2 -g -march=i586 -mcpu=i686 -fmessage-length=0 -Wall -pipe'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g -march=i586 -mcpu=i686 -fmessage-length=0 -Wall -pipe',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing'
    ccversion='', gccversion='3.3.2 20031216 (prerelease) (SuSE Linux)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =''
    libpth=/lib /usr/lib /usr/local/lib
    libs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.3.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5/5.8.3/i586-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared'

Locally applied patches:
    


@INC for perl v5.8.3:
    /usr/lib/perl5/5.8.3/i586-linux-thread-multi
    /usr/lib/perl5/5.8.3
    /usr/lib/perl5/site_perl/5.8.3/i586-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.3
    /usr/lib/perl5/site_perl/5.8.1/i586-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.1
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.3/i586-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.3
    /usr/lib/perl5/vendor_perl/5.8.1/i586-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.1
    /usr/lib/perl5/vendor_perl
    .


Environment for perl v5.8.3:
    HOME=/home/pajas
    LANG=cs_CZ
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/pajas/bin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/opt/kde3/bin:/usr/lib/java/jre/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Apr 28, 2004

From BQW10602@nifty.com

On 25 Apr 2004 17​:59​:16 -0000
"pajas@​ufal.ms.mff.cuni.cz (via RT)" <perlbug-followup@​perl.org> wrote​:

# New Ticket Created by pajas@​ufal.ms.mff.cuni.cz
# Please include the string​: [perl #29149]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org​:80/rt3/Ticket/Display.html?id=29149 >

Here is a shorter test case w/o Text​::Balanced.

In the case when the text includes non-ASCII character,
&PL_vtbl_utf8 seems upset.
Otherwise (if $$textref is in only ASCII),
it does not matter whether the UTF-8 flag is ON or OFF.

#!perl

my $bad = pack('U*').qq( { \$a." " }; "\xED");
foo(\$bad);

my $good = pack('U*').qq( { \$a." " }; "i");
foo(\$good);

sub foo {
  my $textref = shift;
  $$textref =~ m/\G\s*(\{)/gc;

  my $a = substr($$textref, pos($$textref), 20);
  # 20 ?? When the 3rd arg > 4, the result seems bad.

  $$textref =~ m/\G\s*(\s\$\w+)./gc;
  printf "%s\n", substr($$textref,pos($$textref),1);
  use Devel​::Peek; Dump($$textref);
}
__END__

Output from printf is qq/" " };\n/ for $bad and qq/"\n/ for $good.

SV = PVMG(0x167ebec) at 0x155cc08
  REFCNT = 3
  FLAGS = (PADMY,SMG,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x16823dc " { $a.\" \" }; \"\303\255\""\0 [UTF8 " { $a." " }; "\x{ed}""]
  CUR = 17
  LEN = 18
  MAGIC = 0x169ee8c
  MG_VIRTUAL = &PL_vtbl_utf8
  MG_TYPE = PERL_MAGIC_utf8(w)
  MG_LEN = 16
  MG_PTR = 0x169d1bc
  0​: 6 -> 6
  1​: 7 -> 7
  MAGIC = 0x169d27c
  MG_VIRTUAL = &PL_vtbl_mglob
  MG_TYPE = PERL_MAGIC_regex_global(g)
  MG_LEN = 6
SV = PVMG(0x169f82c) at 0x155cbf0
  REFCNT = 3
  FLAGS = (PADMY,SMG,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x16c0fac " { $a.\" \" }; \"i\""\0 [UTF8 " { $a." " }; "i""]
  CUR = 16
  LEN = 17
  MAGIC = 0x16c027c
  MG_VIRTUAL = &PL_vtbl_utf8
  MG_TYPE = PERL_MAGIC_utf8(w)
  MG_LEN = 16
  MG_PTR = 0x16c0f7c
  0​: 6 -> 6
  1​: 0 -> 0
  MAGIC = 0x16c02ac
  MG_VIRTUAL = &PL_vtbl_mglob
  MG_TYPE = PERL_MAGIC_regex_global(g)
  MG_LEN = 6

Regard,
SADAHIRO Tomoyuki

@p5pRT
Copy link
Author

p5pRT commented Apr 28, 2004

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Apr 29, 2004

From BQW10602@nifty.com

On 25 Apr 2004 17​:59​:16 -0000
"pajas@​ufal.ms.mff.cuni.cz (via RT)" <perlbug-followup@​perl.org> wrote​:

# New Ticket Created by pajas@​ufal.ms.mff.cuni.cz
# Please include the string​: [perl #29149]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org​:80/rt3/Ticket/Display.html?id=29149 >

Here is a shorter test case w/o Text​::Balanced.

In the case when the text includes non-ASCII character,
&PL_vtbl_utf8 seems upset.
Otherwise (if $$textref is in only ASCII),
it does not matter whether the UTF-8 flag is ON or OFF.

Test is more simplified.
This test should fail since perl 5.8.1 (including perl-current)
and succeed with perl 5.8.0.

It smells of something in UTF-8 cache,
but that is very complicated for me...

#!perl
my $bad = "0123456789\xED ";
utf8​::upgrade($bad);
foo($bad);

my $good = "0123456789i ";
utf8​::upgrade($good);
foo($good);

sub foo {
  my $text = shift;
  my $pos = 5;
  pos($text) = $pos;
  my $a = substr($text, $pos, $pos);
  print substr($text,$pos,1) eq $pos ? "ok" : "not ok", " # $text\n";
}
__END__

Regard,
SADAHIRO Tomoyuki

@p5pRT
Copy link
Author

p5pRT commented Apr 29, 2004

From @andk

Blame analysis shows, on the trunk we have the breakage in patch
18530, on the maintanance track it is in patch 18353.

--
andreas

@p5pRT
Copy link
Author

p5pRT commented Apr 29, 2004

From @nwc10

On Thu, Apr 29, 2004 at 10​:53​:17AM +0900, SADAHIRO Tomoyuki wrote​:

Test is more simplified.
This test should fail since perl 5.8.1 (including perl-current)
and succeed with perl 5.8.0.

It smells of something in UTF-8 cache,
but that is very complicated for me...

Yes it is there, and it's complicated for me too.
Actually only the first of the 3 hunks changed in sv.c is this bug - I'm
assuming that the other 2 are needed (somewhat cargo cult based on a
previous bug) but I have no test for these.

Nicholas Clark

Change 22755 by nicholas@​ship-in-a-bottle on 2004/04/29 18​:30​:18

  Fix 29149 - another UTF8 cache bug hit by substr.
  Regression test from​:
 
  Subject​: Re​: [perl #29149] substr/UTF8 related problem with perl 5.8.3 on linux
  From​: SADAHIRO Tomoyuki <bqw10602@​nifty.com>
  Message-Id​: <20040429103926.5BA6.BQW10602@​nifty.com>
  Date​: Thu, 29 Apr 2004 10​:53​:17 +0900

Affected files ...

... //depot/perl/sv.c#739 edit
... //depot/perl/t/op/substr.t#29 edit

Differences ...

==== //depot/perl/sv.c#739 (text) ====

@​@​ -6437,8 +6437,7 @​@​
  s += UTF8SKIP(s);
  if (s >= send)
  s = send;
- if (utf8_mg_pos_init(sv, &mg, &cache, 2, lenp, s, start))
- cache[2] += *offsetp;
+ utf8_mg_pos_init(sv, &mg, &cache, 2, lenp, s, start);
  }
  *lenp = s - start;
  }
@​@​ -6531,6 +6530,11 @​@​

  cache[0] -= ubackw;
  *offsetp = cache[0];
+
+ /* Drop the stale "length" cache */
+ cache[2] = 0;
+ cache[3] = 0;
+
  return;
  }
  }
@​@​ -6568,6 +6572,9 @​@​

  cache[0] = len;
  cache[1] = *offsetp;
+ /* Drop the stale "length" cache */
+ cache[2] = 0;
+ cache[3] = 0;
  }

  *offsetp = len;

==== //depot/perl/t/op/substr.t#29 (xtext) ====

@​@​ -1,6 +1,6 @​@​
#!./perl

-print "1..188\n";
+print "1..189\n";

#P = start of string Q = start of substr R = end of substr S = end of string

@​@​ -640,3 +640,14 @​@​
  $foo = '123456789';
  ok 188, bar eq '123456789';
}
+
+# [perl #29149]
+{
+ my $text = "0123456789\xED ";
+ utf8​::upgrade($text);
+ my $pos = 5;
+ pos($text) = $pos;
+ my $a = substr($text, $pos, $pos);
+ ok 189, substr($text,$pos,1) eq $pos;
+
+}

@p5pRT
Copy link
Author

p5pRT commented May 31, 2008

p5p@spam.wizbit.be - Status changed from 'open' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant