range operator does not use locale to create alphabet #9977

p5pRT · 2009-11-22T23:06:23Z

Migrated from rt.perl.org#70732 (status was 'rejected')

Searchable as RT70732$

p5pRT · 2009-11-22T23:06:25Z

From wanradt@gmail.com

Created by wanradt@gmail.com

When trying to make a list of the alphabet with the range operator (..), it
seems not to use the locale (et_EE.UTF-8) information. In the Estonian alphabet
the correct sequence would be R S Š Z, whereas in English it is R S T U V W X Y
Z.

In the example code I used lc to make sure that locale information is
picked up for character operations:

#!/usr/bin/perl

use strict;
use utf8;
use locale;
use POSIX;
use open ':std', ':locale';

print setlocale( LC_CTYPE ), "\n\n";

my @real = qw(R S Š Z);
my @fake = 'R'..'Z';
print "@real\n";
print "@fake\n";

__END__
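
For context, here is a minimal sketch of why the second list comes out in ASCII order. On string operands the range operator steps with Perl's magical string increment (the same one `++` uses, as documented in perlop), which only knows the ranges a-z, A-Z and 0-9 and never consults the locale. The variable names are illustrative only.

```perl
use strict;
use warnings;

# 'R' .. 'Z' is produced by the magical string increment, which is
# defined only over the ASCII ranges a-z, A-Z and 0-9.
my @range = ('R' .. 'Z');

# Build the same list by hand with ++ to make the mechanism visible.
my @by_hand;
my $letter = 'R';
while (1) {
    push @by_hand, $letter;
    last if $letter eq 'Z';
    $letter++;    # magical increment: 'R' becomes 'S', and so on
}

print "@range\n";
print "@by_hand\n";
```

Both lines print the same sequence, with or without `use locale` in effect.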

Perl Info


Flags:
    category=core
    severity=medium

Site configuration information for perl 5.10.0:

Configured by Debian Project at Thu Oct  1 22:38:45 UTC 2009.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.24-23-server, archname=i486-linux-gnu-thread-multi
    uname='linux vernadsky 2.6.24-23-server #1 smp wed apr 1 22:22:14
utc 2009 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN
-Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr
-Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10
-Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5
-Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local
-Dsitelib=/usr/local/share/perl/5.10.0
-Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1
-Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1
-Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl
-Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio
-Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib
-Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN
-fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing
-pipe -I/usr/local/include'
    ccversion='', gccversion='4.4.1', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.10.1.so, so=so, useshrplib=true, libperl=libperl.so.5.10.0
    gnulibc_version='2.10.1'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'

Locally applied patches:



@INC for perl 5.10.0:
    /etc/perl
    /usr/local/lib/perl/5.10.0
    /usr/local/share/perl/5.10.0
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    .


Environment for perl 5.10.0:
    HOME=/home/wanradt
    LANG=et_EE.UTF-8
    LANGUAGE=
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/wanradt/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
    PERL_BADLANG (unset)
    SHELL=/bin/bash


-- 
Kõike hääd,

G

p5pRT · 2009-11-23T12:36:22Z

From @rgarcia

2009/11/23 WK <perlbug-followup@perl.org>:

When trying to make a list of the alphabet with the range operator (..), it
seems not to use the locale (et_EE.UTF-8) information. In the Estonian alphabet
the correct sequence would be R S Š Z, whereas in English it is R S T U V W X Y
Z.

I would say, let's not go there : locales are a dying system, and that
functionality could go in a CPAN module. Most languages use a
different alphabet anyway, diacritics being sometimes considered as
separate letters, sometimes not; ordering sometimes changes, too.

p5pRT · 2009-11-23T12:36:23Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2009-11-23T17:08:18Z

From @ikegami

On Mon, Nov 23, 2009 at 7:35 AM, Rafael Garcia-Suarez <rgs@consttype.org> wrote:

2009/11/23 WK <perlbug-followup@perl.org>:

When trying to make a list of the alphabet with the range operator (..), it
seems not to use the locale (et_EE.UTF-8) information. In the Estonian alphabet
the correct sequence would be R S Š Z, whereas in English it is R S T U V W X Y
Z.

I would say, let's not go there : locales are a dying system, and that
functionality could go in a CPAN module. Most languages use a
different alphabet anyway, diacritics being sometimes considered as
separate letters, sometimes not; ordering sometimes changes, too.

Is that his point? Or are you saying that locale != language?

p5pRT · 2010-01-17T04:04:43Z

From doom@kzsu.stanford.edu

rgs@consttype.org wrote:

I would say, let's not go there : locales are a dying system, and that
functionality could go in a CPAN module.

Any such code (whether or not it's a CPAN module) will need some way to
determine user defaults intelligently, and the locale is an obvious
thing to check.

Yes, language should be an attribute of the text, not the user, but the
default language really is an attribute of the user.

p5pRT · 2010-01-18T13:16:14Z

From wanradt@gmail.com

rgs@consttype.org wrote:

I would say, let's not go there : locales are a dying system, and that
functionality could go in a CPAN module.

I'm sorry, but I had hoped to get some feedback on my report. Until today's
letter from Joe I had no clue that there were any answers to it.

So, I'd like to ask what to use instead of this "dying system". For
me locales are one simple solution, but I am open to using any
other (systematic) approach.

We already have (had?) support for locales in Perl. Why break it? I'd
rather fix it.

--
Wbr,
Kõike hääd,

WK

p5pRT · 2010-01-18T13:32:02Z

@rgs - Status changed from 'open' to 'rejected'

p5pRT · 2010-01-18T13:32:17Z

From @rgarcia

2010/1/17 WK <wanradt@gmail.com>:

rgs@consttype.org wrote:

I would say, let's not go there : locales are a dying system, and that
functionality could go in a CPAN module.

I'm sorry, but I had hoped to get some feedback on my report. Until today's
letter from Joe I had no clue that there were any answers to it.

So, I'd like to ask what to use instead of this "dying system". For
me locales are one simple solution, but I am open to using any
other (systematic) approach.

You should write code to handle alphabetical ordering respecting the
languages and contexts you want to process. This does not exist in
Perl yet, as far as I know, except for a few languages.

We already have (had?) support for locales in Perl. Why break it? I'd
rather fix it.

Locales are broken by design. They were invented at a time when all
strings were char*, and all letters were one byte in a given character
set. That got a bit better afterwards, but in a hackish way. So, for a
start, locales won't properly handle languages with letters
represented by more than one character.
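
As an illustration of language-aware ordering written outside the locale system, here is a hedged sketch using Unicode::Collate::Locale. The module is not mentioned in this thread; the sketch assumes it is installed and that its `et` (Estonian) tailoring is available.

```perl
use strict;
use warnings;
use utf8;    # the letter list below contains non-ASCII characters
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

# Assumption: the 'et' tailoring ships with the installed module version.
my $collator = Unicode::Collate::Locale->new(locale => 'et');

# Letters from the Estonian alphabet, deliberately out of order.
my @letters = qw(x y ü ö ä õ ž z š s t);
my @sorted  = $collator->sort(@letters);

print "@sorted\n";
```

The collator carries its own tailoring data, so the result does not depend on the process locale or on LC_COLLATE.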

p5pRT · 2010-01-18T21:43:40Z

From doom@kzsu.stanford.edu

On Mon Jan 18 05:32:17 2010, rgs@consttype.org wrote:

2010/1/17 WK <wanradt@gmail.com>:

rgs@consttype.org wrote:

I would say, let's not go there : locales are a dying system, and
that functionality could go in a CPAN module.

So, I'd like to ask what to use instead of this "dying system".
For me locales are one simple solution, but I am open to using
any other (systematic) approach.

You should write code to handle alphabetical ordering respecting the
languages and contexts you want to process.

That's what he's trying to do.

This does not exist in Perl yet, as far as I know, except for a
few languages.

And it would exist for a few more if the handling of locales wasn't so
badly broken.

We already have (had?) support for locales in Perl. Why break it? I'd
rather fix it.

Locales are broken by design. They were invented at a time when all
strings were char*, and all letters were one byte in a given character
set. That got a bit better afterwards, but in a hackish way. So, for a
start, locales won't properly handle languages with letters
represented by more than one character.

Locales may very well be a dying system, but they aren't dead yet,
because we have nothing to replace them. For example, if you want to
write portable perl code how are you supposed to find out what encoding
the user is expecting in output? It might be utf8, it might be latin-1,
it might be something else. One place the user can specify this (on some
systems) is in their choice of locale. Is there something else a
programmer should check?
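
For reference, one way to read the encoding implied by the user's locale is sketched below. It assumes a POSIX-like system where I18N::Langinfo is available, and it is only an illustration of the locale route described above, not an alternative to it.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Adopt the locale from the environment (LANG, LC_ALL, LC_CTYPE).
setlocale(LC_CTYPE, '');

# Ask the C library which character encoding that locale implies.
my $codeset = langinfo(CODESET);

print "codeset: $codeset\n";
```

The returned name could then be fed to an `:encoding(...)` layer, provided the Encode module recognizes it; that mapping is not guaranteed for every platform's codeset names.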

p5pRT · 2010-01-18T21:47:34Z

From @davidnicol

On Sun, Jan 17, 2010 at 9:52 AM, WK <wanradt@gmail.com> wrote:

We already have (had?) support for locales in Perl. Why break it? I'd
rather fix it.

--
Wbr,
Kõike hääd,

WK

The following has not been tested in any way:

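use strict;
use warnings;
use utf8;    # the alphabet below contains non-ASCII letters
binmode STDOUT, ':encoding(UTF-8)';

# Build a range function for an explicitly given alphabet.
sub ranger(@) {
    my @letters = @_;
    # Map each letter to its position in the alphabet.
    my %lookup = map { $letters[$_] => $_ } 0 .. $#letters;
    # Return a closure that slices the alphabet between two letters.
    return sub ($$) {
        @letters[ $lookup{ $_[0] } .. $lookup{ $_[1] } ];
    };
}

my $EErange = ranger qw{ A B C D E F G H I J K L M N O P Q R S Š Z Ž
                         T U V W Õ Ä Ö Ü X Y };

print "$_\n" for $EErange->('S', 'Z');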

p5pRT · 2010-01-19T01:24:39Z

From doom@kzsu.stanford.edu

Yes, that looks good...

davidnicol@gmail.com wrote:

my $EErange = ranger qw { A B C D E F G H I J K L M N O P Q R S Š Z Ž
T U V W Õ Ä Ö Ü X Y };

But how do you know what alphabet to pass in to the routine?

The client programmer is now supposed to hack their own version of the
intelligence that was built into the locale system?

p5pRT · 2010-01-19T19:31:39Z

From wanradt@gmail.com

2010/1/18 Rafael Garcia-Suarez <rgs@consttype.org>:

Locales are broken by design. They were invented at a time when all
strings were char*, and all letters were one byte in a given character
set. That got a bit better afterwards, but in a hackish way. So, for a
start, locales won't properly handle languages with letters
represented by more than one character.

You are probably right in many respects, but I'd like to add a few points
from my side:

- as long as we don't have a better system to replace locales, we are stuck without them

- latin1 is also broken (a hack) from today's point of view, so maybe let's throw
it away too and have everyone use Unicode ;) (meaning: we don't do
that, and on purpose)

- sorting respects my locale (I mean UTF-8 and multibyte chars too), so
why not the range operator? This little example works fine:

#!/usr/bin/perl

use strict;
use warnings;
use locale;

use utf8;
binmode STDIN, ":utf8";
binmode STDOUT, ":utf8";

my @a = qw(x y ü ö ä õ ž z š s); # chars are in opposite order

print "$_ " foreach sort @a; # comes: s š z ž õ ä ö ü x y
print "\n";

{
    no locale;
    print "$_ " foreach sort @a; # comes: s x y z ä õ ö ü š ž
    print "\n";
}

- although I have just average knowledge of Perl, I avoid hackish
solutions where possible. I prefer a systematic approach. Using the locale is
systematic from my point of view, even if it is still a hack on the inside.

- locales may be broken by design, but from the user's point of view it is
very convenient to read the user's preferences for data formatting rules
from the environment. So, if we have such a mechanism, it is easy to
swap the locale for another, future solution some day. And a locale contains
so much information besides the problem with multi-byte characters...

For my little problem there are certainly workarounds, but they
just prove the point: the shortcut is broken.

--
Wbr,
Kõike hääd,

WK
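
The observation that sort honours the locale suggests one possible workaround: derive the range from a sorted candidate list. The sketch below is an illustration only. `range_by_order` is a hypothetical helper, not an existing Perl function, and the candidates are kept to ASCII because non-ASCII letters would additionally need a UTF-8 locale to collate sensibly.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): sort the candidate letters
# with the supplied comparator, then slice between the two endpoints.
sub range_by_order {
    my ($from, $to, $cmp, @letters) = @_;
    my @sorted = sort { $cmp->($a, $b) } @letters;
    my %pos;
    @pos{@sorted} = 0 .. $#sorted;
    # An endpoint that is not among the candidates gives an empty range.
    return () unless exists $pos{$from} && exists $pos{$to};
    return @sorted[ $pos{$from} .. $pos{$to} ];
}

# Comparator compiled under "use locale", so cmp follows LC_COLLATE.
my $locale_cmp = do { use locale; sub { $_[0] cmp $_[1] } };

my @range = range_by_order('R', 'Z', $locale_cmp, 'A' .. 'Z');
print "@range\n";
```

What this prints depends on the collation rules of the locale in effect, which is the point of the exercise. Passing the comparator in as an argument means the same helper could be driven by any other ordering source.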

p5pRT closed this as completed Jan 18, 2010

p5pRT added Severity Medium distro-Linux type-core labels Oct 18, 2019
