range operator does not use locale to create alphabet #9977

Closed
p5pRT opened this issue Nov 22, 2009 · 12 comments


p5pRT commented Nov 22, 2009

Migrated from rt.perl.org#70732 (status was 'rejected')

Searchable as RT70732$


p5pRT commented Nov 22, 2009

From wanradt@gmail.com

Created by wanradt@gmail.com

When I try to build a list of the alphabet with the range operator (..), it
does not seem to use the locale (et_EE.UTF-8) information. In the Estonian
alphabet the correct sequence would be: R S Š Z, whereas in English it is
R S T U V W X Y Z.

In the example code I used lc to make sure that locale information is
picked up for character operations:

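#!/usr/bin/perl

use strict;
use utf8;
use locale;
use POSIX;
use open ':std', ':locale';

# Show which LC_CTYPE locale is in effect.
print setlocale( LC_CTYPE ), "\n\n";

my @real = qw(R S Š Z);   # the sequence expected for Estonian
my @fake = 'R'..'Z';      # what the range operator actually produces
print "@real\n";
print "@fake\n";

__END__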

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl 5.10.0:

Configured by Debian Project at Thu Oct  1 22:38:45 UTC 2009.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.24-23-server, archname=i486-linux-gnu-thread-multi
    uname='linux vernadsky 2.6.24-23-server #1 smp wed apr 1 22:22:14
utc 2009 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN
-Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr
-Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10
-Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5
-Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local
-Dsitelib=/usr/local/share/perl/5.10.0
-Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1
-Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1
-Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl
-Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio
-Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib
-Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN
-fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing
-pipe -I/usr/local/include'
    ccversion='', gccversion='4.4.1', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.10.1.so, so=so, useshrplib=true, libperl=libperl.so.5.10.0
    gnulibc_version='2.10.1'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'

Locally applied patches:



@INC for perl 5.10.0:
    /etc/perl
    /usr/local/lib/perl/5.10.0
    /usr/local/share/perl/5.10.0
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    .


Environment for perl 5.10.0:
    HOME=/home/wanradt
    LANG=et_EE.UTF-8
    LANGUAGE=
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/wanradt/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
    PERL_BADLANG (unset)
    SHELL=/bin/bash


-- 
Kõike hääd,

G


p5pRT commented Nov 23, 2009

From @rgarcia

2009/11/23 WK <perlbug-followup@perl.org>:

> When I try to build a list of the alphabet with the range operator (..), it
> does not seem to use the locale (et_EE.UTF-8) information. In the Estonian
> alphabet the correct sequence would be: R S Š Z, whereas in English it is
> R S T U V W X Y Z.

I would say, let's not go there: locales are a dying system, and that
functionality could go in a CPAN module. Most languages use a
different alphabet anyway, diacritics being sometimes considered as
separate letters, sometimes not; ordering sometimes changes, too.


p5pRT commented Nov 23, 2009

The RT System itself - Status changed from 'new' to 'open'


p5pRT commented Nov 23, 2009

From @ikegami

On Mon, Nov 23, 2009 at 7:35 AM, Rafael Garcia-Suarez <rgs@consttype.org> wrote:

> 2009/11/23 WK <perlbug-followup@perl.org>:
>
> > When I try to build a list of the alphabet with the range operator (..), it
> > does not seem to use the locale (et_EE.UTF-8) information. In the Estonian
> > alphabet the correct sequence would be: R S Š Z, whereas in English it is
> > R S T U V W X Y Z.
>
> I would say, let's not go there: locales are a dying system, and that
> functionality could go in a CPAN module. Most languages use a
> different alphabet anyway, diacritics being sometimes considered as
> separate letters, sometimes not; ordering sometimes changes, too.

Is that his point? Or are you saying that locale != language?


p5pRT commented Jan 17, 2010

From doom@kzsu.stanford.edu

rgs@consttype.org wrote:

> I would say, let's not go there: locales are a dying system, and that
> functionality could go in a CPAN module.

Any such code (whether or not it's a CPAN module) will need some way to
determine user defaults intelligently, and the locale is an obvious
thing to check.

Yes, language should be an attribute of the text, not the user, but the
default language really is an attribute of the user.


p5pRT commented Jan 18, 2010

From wanradt@gmail.com

rgs@consttype.org wrote:

> I would say, let's not go there: locales are a dying system, and that
> functionality could go in a CPAN module.

I'm sorry, but I had hoped to get some feedback on my report. Until Joe's
letter today I had no idea that there were any answers to it.

So, I'd like to ask what to use instead of this "dying system". For
me locales are a simple solution, but I am open to using any other
(systematic) approach.

Perl already has (had?) support for locales. Why break it? I'd
rather fix it.

--
Wbr,
Kõike hääd,

WK


p5pRT commented Jan 18, 2010

@rgs - Status changed from 'open' to 'rejected'

@p5pRT p5pRT closed this as completed Jan 18, 2010

p5pRT commented Jan 18, 2010

From @rgarcia

2010/1/17 WK <wanradt@gmail.com>:

> rgs@consttype.org wrote:
>
> > I would say, let's not go there: locales are a dying system, and that
> > functionality could go in a CPAN module.
>
> I'm sorry, but I had hoped to get some feedback on my report. Until Joe's
> letter today I had no idea that there were any answers to it.
>
> So, I'd like to ask what to use instead of this "dying system". For
> me locales are a simple solution, but I am open to using any other
> (systematic) approach.

You should write code to handle alphabetical ordering respecting the
languages and contexts you want to process. This does not exist in
Perl yet, as far as I know, except for a few languages.

> Perl already has (had?) support for locales. Why break it? I'd
> rather fix it.

Locales are broken by design. They were invented at a time when all
strings were char*, and all letters were one byte in a given character
set. That got a bit better afterwards, but in a hackish way. So, for a
start, locales won't properly handle languages with letters
represented by more than one character.


p5pRT commented Jan 18, 2010

From doom@kzsu.stanford.edu

On Mon Jan 18 05:32:17 2010, rgs@consttype.org wrote:

> 2010/1/17 WK <wanradt@gmail.com>:
>
> > rgs@consttype.org wrote:
> >
> > > I would say, let's not go there: locales are a dying system, and
> > > that functionality could go in a CPAN module.
> >
> > So, I'd like to ask what to use instead of this "dying system".
> > For me locales are a simple solution, but I am open to using
> > any other (systematic) approach.
>
> You should write code to handle alphabetical ordering respecting the
> languages and contexts you want to process.

That's what he's trying to do.

> This does not exist in Perl yet, as far as I know, except for a
> few languages.

And it would exist for a few more if the handling of locales wasn't so
badly broken.

> > Perl already has (had?) support for locales. Why break it? I'd
> > rather fix it.
>
> Locales are broken by design. They were invented at a time when all
> strings were char*, and all letters were one byte in a given character
> set. That got a bit better afterwards, but in a hackish way. So, for a
> start, locales won't properly handle languages with letters
> represented by more than one character.

Locales may very well be a dying system, but they aren't dead yet,
because we have nothing to replace them. For example, if you want to
write portable Perl code, how are you supposed to find out what encoding
the user is expecting in output? It might be utf8, it might be latin-1,
it might be something else. One place the user can specify this (on some
systems) is in their choice of locale. Is there something else a
programmer should check?
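
As one possible illustration of reading the expected encoding from the locale, here is a minimal sketch using the core I18N::Langinfo module. It assumes a POSIX-like platform where nl_langinfo is available; the variable name $codeset is only illustrative, and the value reported depends entirely on the system and environment.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Adopt the locale settings from the environment (LANG, LC_ALL, LC_CTYPE).
setlocale(LC_CTYPE, '');

# Ask the C library which character encoding the current locale uses.
# Typical values are "UTF-8" or "ISO-8859-1"; the name is system-dependent.
my $codeset = langinfo(CODESET);

print "Locale codeset: $codeset\n";
```

The reported name could then be handed to an :encoding(...) layer on the output handle, although not every codeset name a C library reports is guaranteed to be known to Perl's Encode module.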


p5pRT commented Jan 18, 2010

From @davidnicol

On Sun, Jan 17, 2010 at 9:52 AM, WK <wanradt@gmail.com> wrote:

> Perl already has (had?) support for locales. Why break it? I'd
> rather fix it.
>
> --
> Wbr,
> Kõike hääd,
>
> WK

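The following has not been tested in any way:

# Build a range function for a custom alphabet. The returned closure
# takes a start and an end letter and returns the letters between them.
sub ranger(@) {
    my @letters = @_;
    # Map each letter to its position in the given alphabet.
    my %lookup = map { $letters[$_] => $_ } 0 .. $#letters;
    return sub ($$) {
        @letters[ $lookup{ $_[0] } .. $lookup{ $_[1] } ];
    };
}

my $EErange = ranger qw { A B C D E F G H I J K L M N O P Q R S Š Z Ž
T U V W Õ Ä Ö Ü X Y };

print "$_\n" for $EErange->('S','Z');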


p5pRT commented Jan 19, 2010

From doom@kzsu.stanford.edu

Yes, that looks good...

davidnicol@gmail.com wrote:

> my $EErange = ranger qw { A B C D E F G H I J K L M N O P Q R S Š Z Ž
> T U V W Õ Ä Ö Ü X Y };

But how do you know what alphabet to pass in to the routine?

The client programmer is now supposed to hack their own version of the
intelligence that was built into the locale system?
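
One way the alphabet order could be looked up rather than hard-coded, as a rough sketch: derive a language code from the locale name and let Unicode::Collate::Locale supply the ordering. The helpers language_of and collated_range are hypothetical names introduced here, the locale name is fixed to the reporter's et_EE.UTF-8 for illustration, and the result relies on the module shipping an "et" tailoring.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate::Locale;

# Hypothetical helper: reduce a POSIX locale name such as "et_EE.UTF-8"
# to its language code ("et"). Returns undef for names like "C" or "POSIX".
sub language_of {
    my ($locale_name) = @_;
    my ($lang) = $locale_name =~ /^([A-Za-z]{2,3})(?![A-Za-z])/;
    return defined $lang ? lc $lang : undef;
}

# Hypothetical helper: sort the candidate letters with the collator and
# keep those that fall between $from and $to, inclusive.
sub collated_range {
    my ($collator, $from, $to, @candidates) = @_;
    return grep { $collator->ge($_, $from) && $collator->le($_, $to) }
           $collator->sort(@candidates);
}

# Fixed here for illustration; real code might read LC_ALL, LC_COLLATE or LANG.
my $locale_name = 'et_EE.UTF-8';
my $lang        = language_of($locale_name);

# An unknown language falls back to the module's default collation.
my $collator = Unicode::Collate::Locale->new(locale => defined $lang ? $lang : 'default');

my @range = collated_range($collator, 'R', 'Z', 'A' .. 'Z', qw(Š Ž Õ Ä Ö Ü));

binmode STDOUT, ':encoding(UTF-8)';
print "@range\n";
```

With the Estonian tailoring in effect this should yield R S Š Z. The caller still has to supply the candidate letters, so this only answers the ordering half of the question.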


p5pRT commented Jan 19, 2010

From wanradt@gmail.com

2010/1/18 Rafael Garcia-Suarez <rgs@consttype.org>:

> Locales are broken by design. They were invented at a time when all
> strings were char*, and all letters were one byte in a given character
> set. That got a bit better afterwards, but in a hackish way. So, for a
> start, locales won't properly handle languages with letters
> represented by more than one character.

You are probably right in many respects, but I'd like to add a few points
from my side:

- as long as we don't have a better system to replace locales, we are left
out in the cold without them

- latin1 is also broken, or a hack, from today's point of view, so maybe
let's throw that away too and have everyone use Unicode ;) (meaning: we
don't do anything like that, and on purpose)

- sorting respects my locale (UTF-8 and multibyte characters included), so
why not the range operator? This little example works fine:

#!/usr/bin/perl

use strict;
use warnings;
use locale;

use utf8;
binmode STDIN, ":utf8";
binmode STDOUT, ":utf8";

my @a = qw(x y ü ö ä õ ž z š s); # chars are in opposite order

print "$_ " foreach sort @a; # gives: s š z ž õ ä ö ü x y
print "\n";

{
  no locale;
  print "$_ " foreach sort @a; # gives: s x y z ä õ ö ü š ž
  print "\n";
}

- although I have only average knowledge of Perl, I avoid hackish
approaches wherever possible. I prefer a systematic approach. Using locale
is systematic from my point of view, even if it is still a hack on the inside.

- locale may be broken architecturally, but from the user's side it is
very convenient to read the user's preferences for data formatting rules
from the environment. And if we have such a mechanism, it will be easy to
swap locale for some future solution one day. Locale also carries
a lot of information beyond the problem with multi-byte characters...

For my little problem there are certainly workarounds, but they
just prove the point: the shortcut is broken.

--
Wbr,
Kõike hääd,

WK
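
The contrast described above can be traced to how the two operations are defined: under "use locale", sort compares strings using the locale's collation, while a range over strings steps with Perl's magical string increment, which only walks ASCII letters. A minimal sketch of the second half, with purely illustrative variable names:

```perl
use strict;
use warnings;

# A range over strings...
my @by_range = ('R' .. 'Z');

# ...visits the same values as repeatedly applying ++ to the start string.
my @by_increment;
my $letter = 'R';
while (1) {
    push @by_increment, $letter;
    last if $letter eq 'Z';
    $letter++;    # magical string increment: 'R' -> 'S' -> ... -> 'Z'
}

print "@by_range\n";
print "@by_increment\n";
```

Both lines print R S T U V W X Y Z regardless of the locale in effect, because the increment never consults collation data.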
