pos() function doesn't handle unicode well #9423

Closed
p5pRT opened this issue Jul 17, 2008 · 6 comments

Comments


p5pRT commented Jul 17, 2008

Migrated from rt.perl.org#57040 (status was 'resolved')

Searchable as RT57040$


p5pRT commented Jul 17, 2008

From mmaslano@redhat.com

The pos() function doesn't return correct values for Unicode strings.
For example:
perl -e '$string = "ěščřžýáíéň"; while ($string =~ /š/gi) { printf "Found š at %d\n", pos($string)-1; }'

In this case it can be solved with 'use utf8'. But the problem remains in
other functions that use pos(), for example expand() from Text::Tabs:
perl -e'chop($ustr="\taa\t..\t\x{100}");for my $s("\t\x{010a}\x{010a}\t..\t","\taa\t..\t",$ustr){ $_=$s;s/\t/print(pos(),$");"\t"/ge; print "\n"}'
All the numbers printed here should be the same.

Perl Info

Flags:
    category=core
    severity=medium

This perlbug was built using Perl 5.10.0 in the Fedora build system.
It is being executed now by Perl 5.10.0 - Wed Jul  2 05:13:09 EDT 2008.

Site configuration information for perl 5.10.0:

Configured by Red Hat, Inc. at Wed Jul  2 05:13:09 EDT 2008.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.18-92.1.6.el5, archname=i386-linux-thread-multi
    uname='linux x86-6 2.6.18-92.1.6.el5 #1 smp fri jun 20 02:36:06 edt 
2008 i686 i686 i386 gnulinux '
    config_args='-des -Doptimize=-O2 -g -pipe -Wall 
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector 
--param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic 
-fasynchronous-unwind-tables -DPERL_USE_SAFE_PUTENV -Dversion=5.10.0 
-Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red 
Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr 
-Dprivlib=/usr/lib/perl5/5.10.0 
-Dsitelib=/usr/local/lib/perl5/site_perl/5.10.0 
-Dvendorlib=/usr/lib/perl5/vendor_perl/5.10.0 
-Darchlib=/usr/lib/perl5/5.10.0/i386-linux-thread-multi 
-Dsitearch=/usr/local/lib/perl5/site_perl/5.10.0/i386-linux-thread-multi 
-Dvendorarch=/usr/lib/perl5/vendor_perl/5.10.0/i386-linux-thread-multi 
-Darchname=i386-linux-thread-multi 
-Dotherlibdirs=/usr/lib/perl5/site_perl/5.10.0 -Dvendorprefix=/usr 
-Dsiteprefix=/usr/local -Duseshrplib -Dusethreads -Duseithreads 
-Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm 
-Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n 
-Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr 
-Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_sethostent_r_proto 
-Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto 
-Ud_setservent_r_proto -Dscriptdir=/usr/bin'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBUGGING 
-fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE 
-D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions 
-fstack-protector --param=ssp-buffer-size=4 -m32 -march=i386 
-mtune=generic -fasynchronous-unwind-tables -DPERL_USE_SAFE_PUTENV',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBUGGING 
-fno-strict-aliasing -pipe -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='4.3.0 20080428 (Red Hat 4.3.0-8)', 
gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', 
lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
    perllibs=-lresolv -lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=/lib/libc-2.8.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.8'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E 
-Wl,-rpath,/usr/lib/perl5/5.10.0/i386-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -pipe -Wall 
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector 
--param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic 
-fasynchronous-unwind-tables -DPERL_USE_SAFE_PUTENV -L/usr/local/lib'

Locally applied patches:
   


@INC for perl 5.10.0:
    /usr/lib/perl5/5.10.0/i386-linux-thread-multi
    /usr/lib/perl5/5.10.0
    /usr/local/lib/perl5/site_perl/5.10.0/i386-linux-thread-multi
    /usr/local/lib/perl5/site_perl/5.10.0
    /usr/lib/perl5/vendor_perl/5.10.0/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.10.0
    /usr/lib/perl5/vendor_perl
    /usr/lib/perl5/site_perl/5.10.0/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.10.0
    .


Environment for perl 5.10.0:
    HOME=/home/marca
    LANG=en_US.UTF-8
    LANGUAGE=
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    
PATH=/usr/lib/qt-3.3/bin:/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/home/marca/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash



p5pRT commented Jul 17, 2008

From @moritz

Marcela Maslanova wrote:

  > Function pos() doesn't return correct values for unicode strings.
  > For example:
  > perl -e '$string = "ěščřžýáíéň";while ($string =~ /š/gi) {printf "Found š at %d\n", pos($string)-1;}';

I don't see the bug here. pos() returns byte offsets if you use the
string with byte semantics (for example, non-upgraded UTF-8), and
codepoint offsets in the case of text semantics (here, under 'use
utf8';). In both cases substr() will work with the same semantics, so
it'll do the right thing.

I don't see how that principle is violated in your example above.
pos() and length() agree that "ěš" is four bytes long:
$ perl -wle 'print length "ěš"'
4

Or am I missing a subtle off-by-one error?
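
To make the byte-versus-character distinction concrete, here is a minimal sketch (not from the original report). It assumes the core Encode module and spells out the UTF-8 bytes of "ěš" by hand, so it does not depend on the source file's encoding. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of "ěš": U+011B is C4 9B, U+0161 is C5 A1.
my $bytes = "\xC4\x9B\xC5\xA1";        # byte semantics: four octets
my $chars = decode('UTF-8', $bytes);   # text semantics: two characters

printf "bytes: length=%d\n", length $bytes;   # 4
printf "chars: length=%d\n", length $chars;   # 2

# pos() reports the end of the last /g match, in the same units as length().
$bytes =~ /\xC5\xA1/g;
$chars =~ /\x{161}/g;
printf "bytes: pos=%d\n", pos $bytes;   # 4
printf "chars: pos=%d\n", pos $chars;   # 2
```

In both cases pos() and length() count in the same units, which is why substr() keeps working with either kind of string.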

  > In this case it could be solved 'use utf8'. But the problem is still in
  > other functions, which are using pos(). For example expand from Text::Tabs:
  > perl -e'chop($ustr="\taa\t..\t\x{100}");for my $s("\t\x{010a}\x{010a}\t..\t","\taa\t..\t",$ustr){ $_=$s;s/\t/print(pos(),$");"\t"/ge; print "\n"}'
  > Here should be all numbers the same.

As a non-golfed version:

for my $s ( "\t\x{010a}\x{010a}\t..\t", "\taa\t..\t" ) {
  $_ = $s;
  s/\t/print(pos(),$");"\t"/ge;
  print "\n"
}

Output:
0 2 4
0 3 6

This looks a bit weird indeed. At least to me ;-)



p5pRT commented Jul 17, 2008

The RT System itself - Status changed from 'new' to 'open'


p5pRT commented Jul 17, 2008

From @ikegami


A simpler test that demonstrates the problem violently:

perl -e"$_=qq{\x{2660}\t}; s/\t/ qq{\t}/ge"

perl -e"$_=qq{\x{2660}\t}; s/\t/pos(); qq{\t}/ge"
Malformed UTF-8 character (unexpected end of string) in match position at -e line 1.

ActivePerl 5.10.0.


p5pRT commented Jul 18, 2008

From @andk

On Thu, 17 Jul 2008 15:55:17 -0400, "Eric Brine" <ikegami@adaelis.com> said:

  > A simpler test that demonstrates the problem violently:
  >
  > perl -e"$_=qq{\x{2660}\t}; s/\t/ qq{\t}/ge"
  >
  > perl -e"$_=qq{\x{2660}\t}; s/\t/pos(); qq{\t}/ge"
  > Malformed UTF-8 character (unexpected end of string) in match position at -e line 1.

This bug disappeared in bleadperl at exactly the same
moment as this patch went in:

Change 33580 by nicholas@nicholas-saigo on 2008/03/26 21:05:20

  The offset for pos is stored as bytes, and converted to (Unicode)
  character position when read, if needed. The code for setting pos
  inside subst was incorrectly converting to character position before
  storing the value. This code appears to have been buggy since it was
  added in 2000 in change 7562.

I think the ticket can be closed.
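
For anyone wanting to check a given perl against this report, the following is a rough sketch of a regression check based on the test strings from the original report. The helper name tab_positions is hypothetical, and the sketch only compares the two result lists rather than assuming specific offsets. On a perl that includes the change above, both lines would be expected to print the same numbers.

```perl
use strict;
use warnings;

# Hypothetical helper: record what pos() reports inside the replacement
# part of s///ge for every tab matched in the given text.
sub tab_positions {
    my ($text) = @_;
    my @seen;
    local $_ = $text;
    s/\t/push @seen, pos(); "\t"/ge;
    return @seen;
}

# Same layout of tabs, once with wide characters and once with ASCII.
my @wide  = tab_positions("\t\x{010a}\x{010a}\t..\t");
my @ascii = tab_positions("\taa\t..\t");

print "wide:  @wide\n";
print "ascii: @ascii\n";
print "pos() disagrees between the two strings\n" if "@wide" ne "@ascii";
```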

Thanks,
--
andreas


p5pRT commented Jul 20, 2008

@rgs - Status changed from 'open' to 'resolved'
