'split'/'index' problem for utf8 #6547

Closed
p5pRT opened this issue May 29, 2003 · 11 comments

@p5pRT

p5pRT commented May 29, 2003

Migrated from rt.perl.org#22375 (status was 'resolved')

Searchable as RT22375$

@p5pRT

p5pRT commented May 29, 2003

From perl@geez.org

Greetings,

I've found some really odd behavior with either split or index;
I'm not sure which. The script below demonstrates the problem.
In short, index does not return the right value when it is given
a string variable rather than a constant string (that is, $string
vs. "abcde"). It appears that the order the chars were in before
they were split affects the outcome. The chars are used as the
substring argument of split.

It's really weird, and unfortunately it is also blocking real-world
work.

thanks,

/Daniel

I'm using Perl 5.8.0 on a Redhat 8 Linux:

begin 775 split-utf-break.pl.gz
M'XL("#"$UCX``W-P;&ET+75T9BUB<F5A:RYP;`#%4L%.XS`4/..O>*252J5"
M1)$J1+>PA^6P%T[<6!2Y[0NQUK4C^X52K;C3/^'87^J?\.PTI>U>N!'Y$(_'
M*AD&^&<->\9]J17!":1I;\ON\N72*4.0W*,G99[@Q^;L^G3G^6-JN78A%Y[D
MY"^,V.8]KHU9;AW*2<$&6\\N_!-'&WF`AW;V"*-K2'J@S!1?F+FOP;$RZ/:8
M.^(I>N*HH6U=/QDW,=&K$"W1`H<E2H)Q14!.&E]:+H<*A,V4&HG0>>;NMK1<
ML2^+?*$BE==A5PE+-B?^X?QQ;\0.$SJ!T-EA=(Z;I%_N^7N:S++;NU]9)L0]
M%Y=;K>T\Q%0^-FDK*D.]!??\&SSB%1.;26K7O4$$IV0XQNR'%'VXB=BRQBX"
M=L'8ITC\'O^+'%XX%!XP)L2=):S3A;1:>H)GJ2L,^24D@P2DF8*Q%'9]WGG`
MEQ(GA-.S.''I[%CC#)XLURWG<@'S`DU4RY5CN?HG@MS9642;)@4[.)S9YZ#T
MEQ(GA-.S.''I[%CC#)XLURWG<@​'S`DU4RY5CN?HG@​MS9642;)@​4[.)S9YZ#T
)`6"B()NS`P``
`
end

@p5pRT

p5pRT commented May 29, 2003

From @nwc10

On Thu, May 29, 2003 at 10:19:27PM -0000, Daniel Yacob wrote:

> I'm using Perl 5.8.0 on a Redhat 8 Linux:

The usual answer to that is "try running with LANG=C" (RedHat defaults
to utf8 locales, which are causing trouble).
But from your script, that doesn't look like the problem.
Annoyingly I can't replicate your described output on Debian with 5.8.0.

I see 2, (the "right" answer, if I understand things correctly) not 6
(what you see, and buggy)

Nicholas Clark
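
A note on the two values: one way a 6 could appear where a 2 is expected is if a byte offset is reported in place of a character offset. The sketch below illustrates only that arithmetic. The sample characters are an assumption (three Ethiopic code points that each occupy 3 bytes in UTF-8); the data in the attached script is not shown in this thread, so this may not be what is happening there.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical data: U+1200, U+1208 and U+1210 each take 3 bytes in UTF-8.
my $string = join '', map { chr($_) } (0x1200, 0x1208, 0x1210);
my $needle = chr(0x1210);

# index() on a character string counts in characters.
my $char_offset = index($string, $needle);

# The same search on the UTF-8 encoded bytes counts in bytes.
my $byte_offset = index(encode('UTF-8', $string), encode('UTF-8', $needle));

print "character offset: $char_offset\n";   # 2
print "byte offset: $byte_offset\n";        # 6
```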

@p5pRT

p5pRT commented May 29, 2003

From perl@geez.org

Thanks for the test. I set LC_ALL=C and LANG=C and still get
the value 6. With these settings the following warning appears
four times: "Wide character in print at ./split-utf-break.pl"

So it actually worked better with a UTF8 locale (am_ET.UTF-8).
Is there any additional info I can send you?

thanks,

/Daniel
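
As an aside on the "Wide character in print" warning: it concerns how characters are written to the output handle rather than what index returns. A minimal sketch of the usual way to silence it, assuming the script prints to STDOUT and that UTF-8 output is wanted (the sample character is hypothetical):

```perl
use strict;
use warnings;

# Ask perl to encode characters as UTF-8 when writing to STDOUT.
# Without an encoding layer, printing a code point above 0xFF
# produces the "Wide character in print" warning.
binmode(STDOUT, ':encoding(UTF-8)');

my $char = chr(0x1200);   # hypothetical sample character
print "length in characters: ", length($char), "\n";
print $char, "\n";
```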

@p5pRT

p5pRT commented May 30, 2003

From @andk

Nicholas Clark <nick@ccl4.org> writes:

> Annoyingly I can't replicate your described output on Debian with 5.8.0.

I can and I'm also running Debian.

> I see 2, (the "right" answer, if I understand things correctly) not 6
> (what you see, and buggy)

The "6" was coming with bleadperl 18530 and is also what maint gives.

--
andreas

Summary of my perl5 (revision 5.0 version 9 subversion 0 patch 18374) configuration:
  Platform:
  osname=linux, osvers=2.4.18-k7, archname=i686-linux-64int
  uname='linux k75 2.4.18-k7 #1 sun apr 14 13:19:11 est 2002 i686 gnulinux '
  config_args='-Dprefix=/home/src/perl/repoperls/installed-perls/perl/pKfBBXB/perl-5.8.0@18530 -Dinstallusrbinperl=n -Uversiononly -Doptimize=-g -des -Duse64bitint -Dusedevel'
  hint=recommended, useposix=true, d_sigaction=define
  usethreads=undef useithreads=undef usemultiplicity=undef
  useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
  use64bitint=define use64bitall=undef uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler:
  cc='cc', ccflags ='-DDEBUGGING -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-g',
  cppflags='-DDEBUGGING -fno-strict-aliasing -I/usr/local/include'
  ccversion='', gccversion='3.2.3 20030415 (Debian prerelease)', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries:
  ld='cc', ldflags =' -L/usr/local/lib'
  libpth=/usr/local/lib /lib /usr/lib
  libs=-lnsl -ldb -ldl -lm -lc -lcrypt -lutil -lrt
  perllibs=-lnsl -ldl -lm -lc -lcrypt -lutil -lrt
  libc=/lib/libc-2.3.1.so, so=so, useshrplib=false, libperl=libperl.a
  gnulibc_version='2.3.1'
  Dynamic Linking:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
  cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Characteristics of this binary (from libperl):
  Compile-time options: DEBUGGING USE_64_BIT_INT USE_LARGE_FILES
  Locally applied patches:
  DEVEL18374
  Built under linux
  Compiled at May 30 2003 07:05:58
  @INC:
  /home/src/perl/repoperls/installed-perls/perl/pKfBBXB/perl-5.8.0@18530/lib/5.9.0/i686-linux-64int
  /home/src/perl/repoperls/installed-perls/perl/pKfBBXB/perl-5.8.0@18530/lib/5.9.0
  /home/src/perl/repoperls/installed-perls/perl/pKfBBXB/perl-5.8.0@18530/lib/site_perl/5.9.0/i686-linux-64int
  /home/src/perl/repoperls/installed-perls/perl/pKfBBXB/perl-5.8.0@18530/lib/site_perl/5.9.0
  /home/src/perl/repoperls/installed-perls/perl/pKfBBXB/perl-5.8.0@18530/lib/site_perl

@p5pRT

p5pRT commented Jul 6, 2012

From @doy

I can't reproduce this, even using the system perl on Debian (5.12.4).
Can anyone else reproduce it?

-doy

@p5pRT

p5pRT commented Jul 6, 2012

From @cpansprout

On Fri Jul 06 13:25:00 2012, doy wrote:

> I can't reproduce this, even using the system perl on Debian (5.12.4).
> Can anyone else reproduce it?

I can’t, on 5.8.1, 5.10.1 and 5.17.2.

--

Father Chrysostomos

@p5pRT

p5pRT commented Jul 6, 2012

From @doy

Closing this, since we can't reproduce it. If someone is able to
reproduce it, feel free to reopen this ticket.

-doy

@p5pRT

p5pRT commented Jul 6, 2012

@doy - Status changed from 'open' to 'resolved'

@p5pRT p5pRT closed this as completed Jul 6, 2012
@p5pRT

p5pRT commented Jul 6, 2012

From @nwc10

On Fri, Jul 06, 2012 at 01:54:53PM -0700, Jesse Luehrs via RT wrote:

> Closing this, since we can't reproduce it. If someone is able to
> reproduce it, feel free to reopen this ticket.

The test case in the ticket fails on the revision that Andreas mentions in
a comment in the ticket (bleadperl 18530, i.e. 7e8c5da).

Adapting the test case to die if the two values are not equal permits
bisecting, which finds that this commit fixed it:

d69d2d9 is the first bad commit
commit d69d2d9
Author: Jarkko Hietaniemi <jhi@iki.fi>
Date: Fri May 30 05:47:15 2003 +0000

  Fix for "#22375 'split'/'index' problem for utf8".

  p4raw-id: //depot/perl@19640

:100644 100644 d82e354341db1415bc03834f7cf84763568a16b8 310ba50465ec1a2c866438208805f6bcf626227a M sv.c
:040000 040000 6b5b18db188316b074b9cafdbe453ee41c29ca94 f76a7e8cef2fa3fbc79f3d2fd79ea65f83977269 M t
bisect run success
That took 1550 seconds
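
A bisect-friendly check along the lines described above might look like the following. This is a hypothetical stand-in, not the adapted test case from the ticket: the sample characters and the lookup order are assumptions, and it is not guaranteed to trigger the bug on the affected revisions. The point is only the shape: die (which gives a non-zero exit status) on a wrong answer, so that an automated bisect can classify each revision. When a bisect hunts for a fix rather than a regression, the good/bad labels are swapped, which is why a fixing commit can be reported as the "first bad commit".

```perl
use strict;
use warnings;

# Hypothetical data: five distinct characters, each 3 bytes in UTF-8.
my $string = join '', map { chr($_) } (0x1200, 0x1208, 0x1210, 0x1218, 0x1220);
my @chars  = split //, $string;

# Look the characters up from last to first and then first to last,
# so that successive lookups move through the string in both directions.
# Because every character is distinct, character $i must be found at $i.
for my $i (reverse(0 .. $#chars), 0 .. $#chars) {
    my $got = index($string, $chars[$i]);
    die "index mismatch: expected $i, got $got\n" unless $got == $i;
}

print "ok\n";
```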

The actual fix is tiny:

diff --git a/sv.c b/sv.c
index d82e354..310ba50 100644
--- a/sv.c
+++ b/sv.c
@@ -5952,8 +5952,6 @@ Perl_sv_pos_b2u(pTHX_ register SV* sv, I32* offsetp)
                        }
 
                        cache[0] -= ubackw;
-
-                       return;
                    }
                }
            }



With some more bisecting, it turns out that the bug was introduced *at* the commit that Andreas mentioned in the ticket:

7e8c5da is the first bad commit
commit 7e8c5da
Author: Hugo van der Sanden <hv@crypt.org>
Date: Tue Jan 21 01:37:03 2003 +0000

  integrate (by hand) #18353 and #18359 from maint-5.8:
  Introduce a cache for UTF-8 data: length and byte<->char offset
  mapping are stored in a new type of magic. Speeds up length(),
  substr(), index(), rindex(), pos(), and some parts of s///.
 
  The speedup varies a lot (on the usual suspects: what is the
  access pattern of the data, compiler, CPU), but should be at
  least one order of magnitude, and getting to the same magnitude
  as byte string speeds, and in some cases (length on unchanged data)
  even reaching the byte string speed. On the other hand, in some
  cases (index) the byte speed is still faster by a factor of five
  or so, but the bottleneck there does not seem to be any more
  the byte<->char offset mapping (instead, the fbm_instr() speed).
 
  There is one cache slot for the length, and only two for the
  byte<->char offset mapping (the first one for the start->offset,
  and the second for the offset->offset+length, when talking
  in substr() terms).
 
  Code this hairy is bound to have hairy trolls hiding under it.
  [...]
  A small tweak on top of #18353: don't display mg_len bytes of
  mg_ptr for PERL_MAGIC_utf8 because that's not what's there.
 
  p4raw-id: //depot/perl@18530

Nicholas Clark
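
To make the byte<->char offset cache described in the commit message more concrete, here is a toy model in Perl. It is not the code in sv.c: the package name `ToyOffsetCache`, its single cache slot, and the method name `b2u` are invented for illustration (the name only echoes `sv_pos_b2u`). It shows the general idea of remembering one (character offset, byte offset) pair and resuming the count from there instead of from the start of the string.

```perl
use strict;
use warnings;

package ToyOffsetCache;
use Encode ();

sub new {
    my ($class, $text) = @_;
    my $self = {
        bytes    => Encode::encode('UTF-8', $text),
        char_off => 0,   # cached character offset
        byte_off => 0,   # byte offset that corresponds to char_off
    };
    return bless $self, $class;
}

# Convert a byte offset (assumed to lie on a character boundary)
# into a character offset.
sub b2u {
    my ($self, $target) = @_;
    my ($chars, $pos) = (0, 0);

    # Resume from the cached pair only when it is at or before the target;
    # otherwise this toy simply recounts from the start.
    if ($self->{byte_off} <= $target) {
        ($chars, $pos) = ($self->{char_off}, $self->{byte_off});
    }

    while ($pos < $target) {
        $pos++;    # step over the lead byte
        # Skip continuation bytes, which have the bit pattern 10xxxxxx.
        $pos++ while $pos < $target
            && (ord(substr($self->{bytes}, $pos, 1)) & 0xC0) == 0x80;
        $chars++;
    }

    # Remember the pair for the next call.
    @{$self}{qw(char_off byte_off)} = ($chars, $pos);
    return $chars;
}

package main;

# Hypothetical data: three characters of 3 bytes each.
my $cache = ToyOffsetCache->new(join '', map { chr($_) } (0x1200, 0x1208, 0x1210));
print $cache->b2u(6), "\n";   # 2: counted from the start
print $cache->b2u(9), "\n";   # 3: resumed from the cached (2, 6) pair
print $cache->b2u(3), "\n";   # 1: cache is past the target, so recount
```

The toy always returns the freshly computed character count. Any shortcut path in a cache like this has to do the same, since returning early without storing the converted value would hand the caller's byte offset back unchanged.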
