locale, utf8 and string functions #9974

Closed
p5pRT opened this issue Nov 20, 2009 · 9 comments

p5pRT commented Nov 20, 2009

Migrated from rt.perl.org#70688 (status was 'resolved')

Searchable as RT70688$

p5pRT commented Nov 20, 2009

From wanradt@gmail.com

Created by wanradt@gmail.com

Synopsis: perl does not use locale information in string functions.

Example:
---
#!/usr/bin/perl

use locale;
print uc("abcõäöüšž"), "\n";
---

Expected output: ABCÕÄÖÜŠŽ
Real output:     ABCõäöüšž

Improved example which shows that perl is getting/setting the locale,
but does not use it in string functions:

---
#!/usr/bin/perl

use POSIX qw(locale_h);
use locale;

print setlocale(LC_CTYPE), "\n";
print uc("abcõäöüšž"), "\n";

print setlocale(LC_CTYPE, "et_EE.UTF-8"), "\n";
print uc("abcõäöüšž"), "\n";

print setlocale(LC_CTYPE, "en_GB.UTF-8"), "\n";
print uc("abcõäöüšž"), "\n";
---

Output:
et_EE.UTF-8
ABCõäöüšž
et_EE.UTF-8
ABCõäöüšž
en_GB.UTF-8
ABCõäöüšž

So non-ASCII characters are not uppercased. The same applies, of
course, to every string function that depends on the locale.

Conclusion: something is wrong here; to me it looks like a bug.

Remark: I marked the severity as 'medium', but for me this and other
Unicode-related issues are high priority. I have struggled with this
kind of bad behaviour for years. And, to be honest, it is annoying.

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl 5.10.0:

Configured by Debian Project at Mon Sep 21 08:42:41 UTC 2009.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.24-24-server, archname=i486-linux-gnu-thread-multi
    uname='linux palmer 2.6.24-24-server #1 smp tue aug 18 17:46:20
utc 2009 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN
-Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr
-Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10
-Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5
-Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local
-Dsitelib=/usr/local/share/perl/5.10.0
-Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1
-Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1
-Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl
-Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio
-Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib
-Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN
-fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing
-pipe -I/usr/local/include'
    ccversion='', gccversion='4.4.1', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.10.1.so, so=so, useshrplib=true, libperl=libperl.so.5.10.0
    gnulibc_version='2.10.1'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'

Locally applied patches:



@INC for perl 5.10.0:
    /etc/perl
    /usr/local/lib/perl/5.10.0
    /usr/local/share/perl/5.10.0
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    .


Environment for perl 5.10.0:
    HOME=/home/wanradt
    LANG=et_EE.UTF-8
    LANGUAGE=et_EE:et:en_GB:en
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/wanradt/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
    PERL_BADLANG (unset)
    SHELL=/bin/bash

p5pRT commented Nov 23, 2009

From @rgarcia

2009/11/20 WK <perlbug-followup@perl.org>:

> Synopsis: perl does not use locale information in string functions.
>
> Example:
> ---
> #!/usr/bin/perl
>
> use locale;
> print uc("abcõäöüšž"), "\n";
> ---
>
> Expected output: ABCÕÄÖÜŠŽ
> Real output:     ABCõäöüšž

Works here:
$ LC_ALL=fr_FR.utf8 perl -CS -Mutf8 -Mlocale -E 'say uc("é")'
É
(The -CS -Mutf8 are needed because my terminal is in UTF-8.)

Either your strings aren't encoded in UTF-8, or your locale is buggy
(which happens).

> Remark: I marked the severity as 'medium', but for me this and other
> Unicode-related issues are high priority. I have struggled with this
> kind of bad behaviour for years. And, to be honest, it is annoying.

I personally think that locales are a pain to work with. I also fail
to see how locales are Unicode-related: the locale system predates
Unicode by years, and Unicode doesn't integrate with locales at all.

p5pRT commented Nov 23, 2009

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Nov 23, 2009

From @ap

* WK <perlbug-followup@perl.org> [2009-11-23 13:10]:

> Synopsis: perl does not use locale information in string functions.

Correct synopsis: you forgot to put `use utf8` in your code, so
Perl does not know that your source is UTF-8 encoded.
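
A minimal sketch of the difference `use utf8` makes, assuming the script is saved as UTF-8. To stay independent of the file's own encoding, the bytes are written out with escapes: `$bytes` stands in for what a literal holds without `use utf8`, and `$chars` for what it holds with it. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "abcõ" as raw UTF-8 bytes: what a string literal contains when the
# source file is UTF-8 but `use utf8` is not in effect.
my $bytes = "abc\xC3\xB5";

# The same text decoded into characters: what the literal contains
# under `use utf8`.
my $chars = decode('UTF-8', $bytes);

# length() counts bytes in the first case, characters in the second.
printf "bytes: length %d, chars: length %d\n", length($bytes), length($chars);

# With no locale and no unicode_strings feature in effect, uc() on the
# byte string changes only the ASCII letters; on the character string
# it applies Unicode case mapping.
my $uc_bytes = uc $bytes;
my $uc_chars = uc $chars;

printf "uc on bytes (hex): %s\n", unpack('H*', $uc_bytes);

binmode STDOUT, ':encoding(UTF-8)';   # encode characters on output
print "uc on characters: $uc_chars\n";
```

The two bytes of "õ" are unrelated to each other as far as `uc` is concerned, so they pass through unchanged; only the decoded character has an uppercase form.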

> Conclusion: something is wrong here; to me it looks like a bug.

Yes, but not in Perl.

> Remark: I marked the severity as 'medium', but for me this and other
> Unicode-related issues are high priority. I have struggled with this
> kind of bad behaviour for years. And, to be honest, it is annoying.

Then fix it. :-)

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT commented Jan 17, 2010

From doom@kzsu.stanford.edu

Yes, I can get the second code example to work with the
addition of:

use utf8;
binmode STDOUT, ':encoding(utf8)'; # silences wide character warnings
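
For reference, a complete script with those two additions might look like the sketch below, assuming the file is saved as UTF-8. It leaves out `use locale`, on the assumption that Unicode case mapping is enough once the literal is a character string; that omission is a choice made for this sketch, not something established above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                             # string literals in this file are UTF-8 text

binmode STDOUT, ':encoding(UTF-8)';   # encode characters on output

# The literal is now a string of 9 characters, so uc() can map each one.
my $upper = uc "abcõäöüšž";
print $upper, "\n";                   # prints ABCÕÄÖÜŠŽ
```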

But why is it necessary to explicitly state "use utf8"?
He's already said "use locale", and since his locale specifies
utf8, shouldn't that be enough?

  LANG=et_EE.UTF-8

And why should it make any difference if the string is in utf8
or latin-1? Shouldn't "uc" work with either one?

I quote from perllocale:

  The use locale pragma
  By default, Perl ignores the current locale. The "use locale" pragma
  tells Perl to use the current locale for some operations:

  · Regular expressions and case-modification functions (uc(), lc(),
  ucfirst(), and lcfirst()) use "LC_CTYPE"

Circa perl 5.8, a "use locale" was all it took to get "uc" and
friends to work correctly (though as I remember it, I was using
latin-1 text, and possibly a latin-1 locale).

Why the change?

p5pRT commented Jan 17, 2010

From doom@kzsu.stanford.edu

This does what I would've thought that "use locale" should do (but I
could've sworn I'd tried it before without success... and it has a
reputation for being buggy):
  use encoding ':locale';
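
A rough sketch of doing the same thing by hand: ask the locale which character encoding it names, then apply that encoding to the I/O handles. It assumes the core modules POSIX, I18N::Langinfo, and Encode are available; the fallback to UTF-8 is this sketch's own assumption and may not suit every setup. String literals in the source are a separate matter and are not affected by this.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(find_encoding);

# Adopt the locale from the environment, then ask which codeset it uses.
setlocale(LC_CTYPE, "");
my $codeset = langinfo(CODESET);

# Encode may not recognise every codeset name; fall back to UTF-8
# (an assumption of this sketch) when it does not.
my $encoding = find_encoding($codeset) || find_encoding('UTF-8');

# Decode input bytes and encode output characters with that encoding.
binmode STDIN,  ':encoding(' . $encoding->name . ')';
binmode STDOUT, ':encoding(' . $encoding->name . ')';

print "locale codeset: $codeset, using encoding: ", $encoding->name, "\n";
```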

p5pRT commented Jul 6, 2012

From @doy

Aristotle is correct here, this is not a bug. The "it seemed to work
circa 5.8" issue is likely related to the Unicode changes in 5.8.0 that
were reverted in 5.8.1 due to being a terrible idea.

-doy

p5pRT commented Jul 6, 2012

@doy - Status changed from 'open' to 'resolved'
