broken Locale::Language in a UTF environment #5943

Closed
p5pRT opened this issue Sep 19, 2002 · 6 comments

@p5pRT

p5pRT commented Sep 19, 2002

Migrated from rt.perl.org#17439 (status was 'resolved')

Searchable as RT17439$

@p5pRT

p5pRT commented Sep 19, 2002

From marty@kasei.com

Created by marty@kasei.com

This is a bug report for perl from marty@kasei.com,
generated with the help of perlbug 1.34 running under perl v5.8.0.

-----------------------------------------------------------------
When running in a UTF environment, Locale::Language doesn't load:

LANG=en_GB.utf8 perl -we 'use Locale::Language'

Malformed UTF-8 character (unexpected end of string) at /usr/share/perl/5.8.0/Locale/Language.pm line 115, <DATA> line 109.
Malformed UTF-8 character (unexpected end of string) at /usr/share/perl/5.8.0/Locale/Language.pm line 117, <DATA> line 109.
Malformed UTF-8 character (unexpected non-continuation byte 0x6c, immediately after start byte 0xe5) in lc at /usr/share/perl/5.8.0/Locale/Language.pm line 117, <DATA> line 109.
Malformed UTF-8 character (unexpected end of string) at /usr/share/perl/5.8.0/Locale/Language.pm line 115, <DATA> line 178.
Malformed UTF-8 character (unexpected end of string) at /usr/share/perl/5.8.0/Locale/Language.pm line 117, <DATA> line 178.
Malformed UTF-8 character (unexpected non-continuation byte 0x6b, immediately after start byte 0xfc) in lc at /usr/share/perl/5.8.0/Locale/Language.pm line 117, <DATA> line 178.

The fix:

Inline Patch
--- lib/Locale/Language.pm.orig 2002-09-19 15:17:16.000000000 +0200
+++ lib/Locale/Language.pm      2002-09-19 15:17:41.000000000 +0200
@@ -231,7 +231,7 @@
 my:Burmese
 
 na:Nauru
-nb:Norwegian Bokmål
+nb:Norwegian Bokmal
 nd:Ndebele, North
 ne:Nepali
 ng:Ndonga
@@ -300,7 +300,7 @@
 uz:Uzbek
 
 vi:Vietnamese
-vo:Volapük
+vo:Volapuk
 
 wo:Wolof
Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl v5.8.0:

Configured by Debian Project at Sat Sep 14 18:17:32 UTC 2002.

Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.4.19, archname=i386-linux-thread-multi
    uname='linux cyberhq 2.4.19 #1 smp sun aug 4 11:30:45 pdt 2002 i686 unknown unknown gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i386-linux -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8.0 -Darchlib=/usr/lib/perl/5.8.0 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.0 -Dsitearch=/usr/local/lib/perl/5.8.0 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.0 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O3',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='2.95.4 20011002 (Debian prerelease)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lgdbm -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.2.5.so, so=so, useshrplib=true, libperl=libperl.so.5.8.0
    gnulibc_version='2.2.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    


@INC for perl v5.8.0:
    /home/marty/Perl
    /etc/perl
    /usr/local/lib/perl/5.8.0
    /usr/local/share/perl/5.8.0
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.8.0
    /usr/share/perl/5.8.0
    /usr/local/lib/site_perl
    .


Environment for perl v5.8.0:
    HOME=/home/marty
    LANG=en_GB.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/marty/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games:/sbin:/usr/sbin:/usr/local/sbin
    PERLLIB=/home/marty/Perl
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT

p5pRT commented Sep 20, 2002

From marty+p5p@kasei.com

I should have added a better explanation to this bug report and the
proposed fix, so here goes:

The DATA in Locale::Language contains 2 Latin-1 characters.
When Perl 5.8 is running in a UTF8 locale it expects the DATA to be UTF8
so it dies when it finds a malformed character.
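
To make the failure concrete, here is a small sketch in Python (not Perl, and not part of Locale::Language) of why a single Latin-1 byte trips a UTF-8 decoder. The byte values 0xE5 and 0xFC are the ones named in the warnings of the original report; the helper names `try_utf8` and `report` are invented for this illustration.

```python
# Latin-1 encodes each of these characters as one byte; the values match
# the "start byte 0xe5" and "start byte 0xfc" warnings in the report.
NAMES = ["Norwegian Bokm\u00e5l", "Volap\u00fck"]


def try_utf8(raw: bytes) -> str:
    """Decode raw bytes as UTF-8 and describe what happened."""
    try:
        return "ok: " + raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        bad = raw[exc.start]  # first byte the decoder could not accept
        return f"malformed at offset {exc.start} (byte 0x{bad:02x})"


def report():
    """Compare the Latin-1 and UTF-8 encodings of each language name."""
    lines = []
    for name in NAMES:
        latin1 = name.encode("latin-1")  # the kind of bytes the DATA section held
        utf8 = name.encode("utf-8")      # what a UTF-8 reader expects
        lines.append(f"latin-1 bytes: {try_utf8(latin1)}")
        lines.append(f"utf-8 bytes:   {try_utf8(utf8)}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

In both names the Latin-1 byte is followed by a plain ASCII letter, which is not a valid UTF-8 continuation byte, so a strict decoder has to reject the sequence.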

Adding 'use bytes' to Locale::Language would stop the death, but the
included non-ASCII characters don't work properly on non-Latin1 systems.
So I think it is better to replace the 2 problem characters with ASCII.

Here's my suggested patch. I've tried to ensure I've included the
actual Latin1 characters in this email, but as I don't use a Latin1
system they will probably be converted when I send this: sorry.

Inline Patch
--- lib/Locale/Language.pm.orig 2002-09-19 15:17:16.000000000 +0200
+++ lib/Locale/Language.pm      2002-09-19 15:17:41.000000000 +0200
@@ -231,7 +231,7 @@
 my:Burmese
 
 na:Nauru
-nb:Norwegian Bokm?l
+nb:Norwegian Bokmal
 nd:Ndebele, North
 ne:Nepali
 ng:Ndonga
@@ -300,7 +300,7 @@
 uz:Uzbek
 
 vi:Vietnamese
-vo:Volap?k
+vo:Volapuk
 
 wo:Wolof

--- ./lib/Locale/Codes/t/languages.t.orig 2002-09-19 15:17:16.000000000 +0200
+++ ./lib/Locale/Codes/t/languages.t 2002-09-19 15:17:16.000000000 +0200
@@ -47,7 +47,7 @@
  'code2language("nd") eq "Ndebele, North"',
  'code2language("ng") eq "Ndonga"',
  'code2language("nn") eq "Norwegian Nynorsk"',
- 'code2language("nb") eq "Norwegian Bokm?l"',
+ 'code2language("nb") eq "Norwegian Bokmal"',
  'code2language("ny") eq "Chichewa; Nyanja"',
  'code2language("oc") eq "Occitan (post 1500)"',
  'code2language("os") eq "Ossetian; Ossetic"',

-- 

Marty

@p5pRT

p5pRT commented Sep 20, 2002

From nick.ing-simmons@elixent.com

Marty Pauley <marty+p5p@kasei.com> writes:

> I should have added a better explanation to this bug report and the
> proposed fix, so here goes:
>
> The DATA in Locale::Language contains 2 Latin-1 characters.
> When Perl 5.8 is running in a UTF8 locale it expects the DATA to be UTF8
> so it dies when it finds a malformed character.
>
> Adding 'use bytes' to Locale::Language would stop the death, but the
> included non-ASCII characters don't work properly on non-Latin1 systems.
> So I think it is better to replace the 2 problem characters with ASCII.

Why not \x{00xx} escape ? - would be more robust for patching as well.
As mailers (including mine) are variously mangling these diffs.

> Here's my suggested patch. I've tried to ensure I've included the
> actual Latin1 characters in this email, but as I don't use a Latin1
> system they will probably be converted when I send this: sorry.
>
> --- lib/Locale/Language.pm.orig 2002-09-19 15:17:16.000000000 +0200
> +++ lib/Locale/Language.pm 2002-09-19 15:17:41.000000000 +0200
> @@ -231,7 +231,7 @@
>  my:Burmese
>
>  na:Nauru
> -nb:Norwegian Bokm?l
> +nb:Norwegian Bokmal
>  nd:Ndebele, North
>  ne:Nepali
>  ng:Ndonga
> @@ -300,7 +300,7 @@
>  uz:Uzbek
>
>  vi:Vietnamese
> -vo:Volap?k
> +vo:Volapuk
>
>  wo:Wolof
>
> --- ./lib/Locale/Codes/t/languages.t.orig 2002-09-19 15:17:16.000000000 +0200
> +++ ./lib/Locale/Codes/t/languages.t 2002-09-19 15:17:16.000000000 +0200
> @@ -47,7 +47,7 @@
>   'code2language("nd") eq "Ndebele, North"',
>   'code2language("ng") eq "Ndonga"',
>   'code2language("nn") eq "Norwegian Nynorsk"',
> - 'code2language("nb") eq "Norwegian Bokm?l"',
> + 'code2language("nb") eq "Norwegian Bokmal"',
>   'code2language("ny") eq "Chichewa; Nyanja"',
>   'code2language("oc") eq "Occitan (post 1500)"',
>   'code2language("os") eq "Ossetian; Ossetic"',

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/

@p5pRT

p5pRT commented Sep 23, 2002

From marty+p5p@kasei.com

On Fri Sep 20 16:22:02 2002, Nick Ing-Simmons wrote:

> Marty Pauley <marty+p5p@kasei.com> writes:
>
> > Adding 'use bytes' to Locale::Language would stop the death, but the
> > included non-ASCII characters don't work properly on non-Latin1 systems.
> > So I think it is better to replace the 2 problem characters with ASCII.
>
> Why not \x{00xx} escape ? - would be more robust for patching as well.
> As mailers (including mine) are variously mangling these diffs.

The Language.pm file contains the Latin1 characters in the DATA section
so I can't use the escape sequences there. But the other reason was
more important to me: the Latin1 characters cause bad things to happen
when used in a non-Latin1 environment; in EUC-JP for example, they
either don't display at all, or they merge with the next character and
display some obscure kanji.
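
The EUC-JP problem can be sketched in Python as well (an illustration only, not part of the patch). In EUC-JP, bytes 0xA1 through 0xFE act as the first byte of a two-byte character, so a lone Latin-1 0xE5 or 0xFC claims the ASCII letter that follows it. A strict decoder rejects that pair; a lenient terminal may drop it or draw an unrelated glyph, which is consistent with the behaviour described above. The helper name `decodes_as` is hypothetical.

```python
def decodes_as(raw: bytes, encoding: str) -> bool:
    """Return True if raw is valid in the given encoding under strict decoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Latin-1 spellings (as in the old DATA section) and the ASCII spellings
# proposed in the patch.
old_names = [
    "Norwegian Bokm\u00e5l".encode("latin-1"),
    "Volap\u00fck".encode("latin-1"),
]
new_names = [b"Norwegian Bokmal", b"Volapuk"]

for raw in old_names + new_names:
    results = {enc: decodes_as(raw, enc) for enc in ("latin-1", "utf-8", "euc_jp")}
    print(raw, results)
```

The ASCII spellings decode cleanly under all three encodings, which is the property the patch relies on; the Latin-1 spellings are only valid as Latin-1.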

--
Marty

@p5pRT

p5pRT commented Sep 26, 2002

From @hvds

Marty Pauley <marty+p5p@kasei.com> wrote:
:Here's my suggested patch. I've tried to ensure I've included the
:actual Latin1 characters in this email, but as I don't use a Latin1
:system they will probably be converted when I send this: sorry.

Thanks, applied as change #17927.

Sending the patch as an attachment, either instead of or as well as
the inline version, is usually the best way to ensure the integrity
of the patch when you are unsure what your mailer will do to it.

Hugo

@p5pRT

p5pRT commented May 9, 2003

@cwest - Status changed from 'new' to 'resolved'
