Text::Tabs fails to expand correctly in the presence of UTF8 characters #8853

Closed
p5pRT opened this issue Mar 29, 2007 · 9 comments

Comments


p5pRT commented Mar 29, 2007

Migrated from rt.perl.org#42167 (status was 'resolved')

Searchable as RT42167$


p5pRT commented Mar 29, 2007

From rnorwood@redhat.com

Created by rnorwood@redhat.com

This is a bug report for perl from rnorwood@redhat.com,
generated with the help of perlbug 1.35 running under perl v5.8.8.

-----------------------------------------------------------------
It appears that Text::Tabs doesn't expand tabs properly when the tab comes after UTF8 characters.

perl -CS -MText::Tabs -e 'print expand("\taa\t.\n\t\x{010a}\x{010a}\t."), "\n"'

        aa      .
        ĊĊ       .

My text editor/mailer may munge the UTF8 improperly - essentially the
line with two UTF8 characters gets an extra space before the dot when
run through Text::Tabs::expand.

This also appears to be broken in 5.9.4.

The bug is in Red Hat's bugzilla as:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=217833

Incidentally, the problem appears to stem from the pos() function not
counting UTF8 characters correctly - I haven't delved into the source
deeply enough to figure out why, though. There is an alternative
version of expand() in the Tabs.pm source file (after __END__) that
does not have this bug. Since the repo browser at
http://public.activestate.com/cgi-bin/perlbrowse seems broken right
now ('no space left on device' errors - reported to the email address
listed on that page), I don't have access to the annotations/history
of that file to see why.
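
A minimal sketch of tab expansion that avoids pos() entirely, assuming a fixed tab stop of 8 columns and that every character occupies one column. The name expand_by_column is hypothetical; this is not the code shipped in Tabs.pm, before or after __END__. It only illustrates tracking the column by character count:

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the Text::Tabs implementation: expand tabs by
# tracking the output column in characters instead of calling pos().
sub expand_by_column {
    my ($text, $tabstop) = @_;
    $tabstop ||= 8;                       # assumed default tab stop
    my $out = '';
    # Splitting on /^/m keeps the trailing newline on each line.
    for my $line (split /^/m, $text, -1) {
        my $col = 0;
        # A capturing split returns the tabs as separate fields.
        for my $piece (split /(\t)/, $line) {
            if ($piece eq "\t") {
                my $pad = $tabstop - $col % $tabstop;
                $out .= ' ' x $pad;
                $col += $pad;
            }
            else {
                $out .= $piece;
                $col += length($piece);   # characters, not bytes
            }
        }
    }
    return $out;
}
```

With this approach the tab after two U+010A characters is padded exactly like the tab after "aa", because length() counts characters on a decoded string.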

Perl Info

Flags:
    category=library
    severity=medium

This perlbug was built using Perl v5.8.8 in the Red Hat build system.
It is being executed now by Perl v5.8.8 - Tue Oct  3 11:01:05 EDT 2006.

Site configuration information for perl v5.8.8:

Configured by Red Hat, Inc. at Tue Oct  3 11:01:05 EDT 2006.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.9-34.elsmp, archname=i386-linux-thread-multi
    uname='linux hs20-bc2-2.build.redhat.com 2.6.9-34.elsmp #1 smp fri feb 24 16:56:28 est 2006 i686 i686 i386 gnulinux '
    config_args='-des -Doptimize=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic -fasynchronous-unwind-tables -Dversion=5.8.8 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Darchname=i386-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_sethostent_r_proto -Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto -Ud_setservent_r_proto -Dinc_version_list=5.8.7 5.8.6 5.8.5 -Dscriptdir=/usr/bin'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic -fasynchronous-unwind-tables',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='4.1.1 20060928 (Red Hat 4.1.1-28)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
    perllibs=-lresolv -lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=/lib/libc-2.5.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic -fasynchronous-unwind-tables -L/usr/local/lib'

Locally applied patches:
    


@INC for perl v5.8.8:
    /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.7/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.6/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.5/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.8
    /usr/lib/perl5/site_perl/5.8.7
    /usr/lib/perl5/site_perl/5.8.6
    /usr/lib/perl5/site_perl/5.8.5
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.8/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.7/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.6/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.5/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.8
    /usr/lib/perl5/vendor_perl/5.8.7
    /usr/lib/perl5/vendor_perl/5.8.6
    /usr/lib/perl5/vendor_perl/5.8.5
    /usr/lib/perl5/vendor_perl
    /usr/lib/perl5/5.8.8/i386-linux-thread-multi
    /usr/lib/perl5/5.8.8
    .


Environment for perl v5.8.8:
    HOME=/home/rnorwood
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/opt/oracle/lib
    LOGDIR (unset)
    PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/rnorwood/bin::/sbin:/usr/sbin:/usr/local/sbin:/opt/bin:/home/rnorwood/bin::/sbin:/usr/sbin:/usr/local/sbin:/opt/bin:/home/rnorwood/bin::/sbin:/usr/sbin:/usr/local/sbin:/opt/bin:/home/rnorwood/bin::/sbin:/usr/sbin:/usr/local/sbin:/opt/bin:/home/rnorwood/bin::/sbin:/usr/sbin:/usr/local/sbin:/opt/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash


p5pRT commented Jul 25, 2007

From @smpeters

On Thu Mar 29 09:16:21 2007, rnorwood wrote:

> This is a bug report for perl from rnorwood@redhat.com,
> generated with the help of perlbug 1.35 running under perl v5.8.8.
>
> -----------------------------------------------------------------
> It appears that Text::Tabs doesn't expand tabs properly when the tab
> comes after UTF8 characters.
>
> perl -CS -MText::Tabs -e 'print
> expand("\taa\t.\n\t\x{010a}\x{010a}\t."), "\n"'
>
>         aa      .
>         ĊĊ       .
>
> My text editor/mailer may munge the UTF8 improperly - essentially the
> line with two UTF8 characters gets an extra space before the dot when
> run through Text::Tabs::expand.
>
> This appears to be also broken in 5.9.4.
>
> The bug is in Red Hat's bugzilla as:
>
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=217833
>
> Incidentally, the problem appears to stem from the pos() function not
> counting UTF8 characters correctly - I haven't delved into the source
> deeply enough to figure out why, though. There is an alternative
> version of expand() in the Tabs.pm source file (after __END__) that
> does not have this bug. Since the repo browser at
> http://public.activestate.com/cgi-bin/perlbrowse seems broken right
> now ('no space left on device' errors - reported to the email address
> listed on that page), I don't have access to the annotations/history
> of that file to see why.

With bleadperl, I see

[steve@kirk perl-current]$ ./perl -CS -Ilib -MText::Tabs -e 'print
expand("\taa\t.\n\t\x{010a}\x{010a}\t."), "\n"'
  aa .
  ÄÄ .

So, I'm thinking this problem has been resolved.

Thanks!


p5pRT commented Jul 25, 2007

The RT System itself - Status changed from 'new' to 'open'


p5pRT commented Jul 25, 2007

@smpeters - Status changed from 'open' to 'resolved'

@p5pRT p5pRT closed this as completed Jul 25, 2007

p5pRT commented Jul 25, 2007

From rnorwood@redhat.com

"Steve Peters via RT" <perlbug-followup@perl.org> writes:

> With bleadperl, I see
>
> [steve@kirk perl-current]$ ./perl -CS -Ilib -MText::Tabs -e 'print
> expand("\taa\t.\n\t\x{010a}\x{010a}\t."), "\n"'
> aa .
> ÄÄ .
>
> So, I'm thinking this problem has been resolved.

If the tabs were expanding properly, wouldn't the period characters be
lined up? iow, a tab after the two UTF8 characters should land on the
same spot as a tab after the two standard ascii characters.
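
One way to check the "lined up" criterion mechanically is sketched below. The helper names dot_columns and dots_aligned are hypothetical, and the check assumes the expanded text has already been decoded to characters and that each character is one column wide:

```perl
use strict;
use warnings;

# Hypothetical helper: character column of the first "." on each line of
# already-expanded text (index() counts characters on decoded strings).
sub dot_columns {
    my ($expanded) = @_;
    return map { index($_, '.') } split /\n/, $expanded;
}

# True when the first "." sits in the same column on every line.
sub dots_aligned {
    my @cols = dot_columns(@_);
    return !grep { $_ != $cols[0] } @cols;
}
```

Feeding it the result of expand("\taa\t.\n\t\x{010a}\x{010a}\t.") would return true for a correct expansion and false when the second line carries the extra space described in the report.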

-RN

--
Robin Norwood
Red Hat, Inc.

"The Sage does nothing, yet nothing remains undone."
-Lao Tzu, Te Tao Ching


p5pRT commented Jul 25, 2007

From @Juerd

Robin Norwood skribis 2007-07-25 15:09 (-0400):

> > [steve@kirk perl-current]$ ./perl -CS -Ilib -MText::Tabs -e 'print
> > expand("\taa\t.\n\t\x{010a}\x{010a}\t."), "\n"'
> > aa .
> > ÄÄ .
> > So, I'm thinking this problem has been resolved.
>
> If the tabs were expanding properly, wouldn't the period characters be
> lined up? iow, a tab after the two UTF8 characters should land on the
> same spot as a tab after the two standard ascii characters.

I think you mean Unicode characters, not UTF8 characters.

Unicode characters can be 2 columns wide, but U+010A is not.
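
The distinction can be sketched in a few lines. The helper name char_facts is hypothetical, and using the East_Asian_Width property as a stand-in for "2 columns wide" is an assumption: terminals may decide display width differently.

```perl
use strict;
use warnings;

# Hypothetical helper: for a one-character string, return
# (length in characters, length in UTF-8 bytes, wide flag).
# "Wide" here means East_Asian_Width is Wide or Fullwidth, which is only
# an approximation of what a terminal renders in two columns.
sub char_facts {
    my ($char) = @_;
    my $bytes = $char;
    utf8::encode($bytes);     # in-place conversion of the copy to UTF-8 octets
    my $wide = $char =~ /\A[\p{East_Asian_Width=Wide}\p{East_Asian_Width=Fullwidth}]\z/
        ? 1 : 0;
    return (length($char), length($bytes), $wide);
}

# U+010A: one character, two UTF-8 bytes, not wide.
# U+4E2D: one character, three UTF-8 bytes, wide.
printf "U+%04X: chars=%d bytes=%d wide=%d\n", ord($_), char_facts($_)
    for "\x{010a}", "\x{4e2d}";
```

So for the test string in this ticket, a correct expand() only needs to count characters; column width would matter for input containing wide characters such as CJK ideographs.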

foobar
--
korajn salutojn,

  juerd waalboer: perl hacker <juerd@juerd.nl> <http://juerd.nl/sig>
  convolution: ict solutions and consultancy <sales@convolution.nl>


p5pRT commented Jul 25, 2007

From @demerphq

On 7/25/07, Robin Norwood <rnorwood@redhat.com> wrote:

> "Steve Peters via RT" <perlbug-followup@perl.org> writes:
>
> > With bleadperl, I see
> >
> > [steve@kirk perl-current]$ ./perl -CS -Ilib -MText::Tabs -e 'print
> > expand("\taa\t.\n\t\x{010a}\x{010a}\t."), "\n"'
> > aa .
> > ÄÄ .
> >
> > So, I'm thinking this problem has been resolved.
>
> If the tabs were expanding properly, wouldn't the period characters be
> lined up? iow, a tab after the two UTF8 characters should land on the
> same spot as a tab after the two standard ascii characters.

I agree it's a bug. You can reduce it down to:

./perl -e'chop($ustr="\taa\t..\t\x{100}");for my $s
("\t\x{010a}\x{010a}\t..\t","\taa\t..\t",$ustr){ $_=$s;
s/\t/print(pos(),$");"\t"/ge; print "\n"}'

The output I get from the above is:

0 2 4
0 3 6
0 3 6

All three should be 0 3 6.

Apparently pos() isn't being set correctly (possibly an off-by-one error?)
prior to executing the RHS of an s/// when the string contains (or
maybe the cursor has passed over) UTF-8 multibyte sequences.
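
For readability, the same reduction can be written out as a short script. This is a sketch with a hypothetical helper name (tab_positions); it relies on the behaviour the one-liner already depends on, namely that pos() inside the replacement part of s///e reports where the current match starts.

```perl
use strict;
use warnings;

# Hypothetical helper: record what pos() reports at each tab while the
# replacement side of s///e is being evaluated.
sub tab_positions {
    my ($s) = @_;
    my @seen;
    for ($s) {                            # alias $_ to our private copy
        s/\t/push @seen, pos(); "\t"/ge;  # replace each tab with itself
    }
    return @seen;
}

# Same three inputs as the one-liner. chop() removes the trailing U+0100,
# leaving ASCII-only content in a string that was built with a wide character.
chop(my $ustr = "\taa\t..\t\x{100}");

for my $s ("\t\x{010a}\x{010a}\t..\t", "\taa\t..\t", $ustr) {
    print join(' ', tab_positions($s)), "\n";
}
```

On a perl without the bug, each of the three lines should read 0 3 6, since every input has its tabs at character offsets 0, 3 and 6.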

Cheers,
yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"


p5pRT commented Jul 25, 2007

From rnorwood@redhat.com

demerphq <demerphq@gmail.com> writes:

> On 7/25/07, Robin Norwood <rnorwood@redhat.com> wrote:
>
> > If the tabs were expanding properly, wouldn't the period characters be
> > lined up? iow, a tab after the two UTF8 characters should land on the
> > same spot as a tab after the two standard ascii characters.
>
> I agree it's a bug. You can reduce it down to:
>
> ./perl -e'chop($ustr="\taa\t..\t\x{100}");for my $s
> ("\t\x{010a}\x{010a}\t..\t","\taa\t..\t",$ustr){ $_=$s;
> s/\t/print(pos(),$");"\t"/ge; print "\n"}'
>
> The output I get from the above is:
>
> 0 2 4
> 0 3 6
> 0 3 6
>
> All three should be 0 3 6.
>
> Apparently pos() isn't being set correctly (possibly an off-by-one error?)
> prior to executing the RHS of an s/// when the string contains (or
> maybe the cursor has passed over) UTF-8 multibyte sequences.

Yeah - when I looked at it a couple of months ago it appeared to be
because the pos() function was misbehaving when these characters are
present. But my knowledge of perl internals was too weak to track it
down much further before I was distracted.

Note that at least in 5.8.8's Text::Tabs, there's an older version of
the expand() function hidden after __END__ that doesn't use pos(), and
doesn't have this bug. I think that version is slower, though.

Thanks,

-RN

--
Robin Norwood
Red Hat, Inc.

"The Sage does nothing, yet nothing remains undone."
-Lao Tzu, Te Tao Ching


p5pRT commented Jul 1, 2008

From mmaslano@redhat.com

I'm sorry, but this problem is still there. I tested with perl-5.10.0
and the code of the unexpand is still the same in bleadperl.
