perl::Encode doesn't handle UTF-8 NFD strings #6485

Closed
p5pRT opened this issue May 5, 2003 · 11 comments

Comments

@p5pRT

p5pRT commented May 5, 2003

Migrated from rt.perl.org#22111 (status was 'rejected')

Searchable as RT22111$

@p5pRT

p5pRT commented May 5, 2003

From debianbugs@j3e.de

Created by debianbugs@j3e.de

The encode function of Perl is not able to convert from UTF-8 that is in normalization form D (NFD). Normalization is handled by Unicode::Normalize. To use encode, one has to use the workaround
from_to(encode_utf8(NFC(decode_utf8($string))),"utf8","...")
but encode should treat NFD-encoded strings correctly.
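
A minimal sketch of that workaround, using only the core Encode and Unicode::Normalize modules. It assumes the target encoding is ISO-8859-1 (the report leaves the target elided) and borrows the sample word used later in this thread; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFC NFD);

# "äpfel" in NFD ('a' followed by U+0308 COMBINING DIAERESIS), as UTF-8 octets.
my $nfd_bytes = encode('UTF-8', NFD("\x{E4}pfel"));

# Decode to characters first; this is the string the report is about.
my $chars = decode('UTF-8', $nfd_bytes);

# Encoding the decomposed string directly: U+0308 has no latin1 mapping,
# so it is replaced by the encoding's substitution character.
my $lossy = encode('iso-8859-1', $chars);

# Composing to NFC first yields U+00E4, which latin1 can represent.
my $latin1 = encode('iso-8859-1', NFC($chars));

printf "without NFC: %vX\n", $lossy;
printf "with NFC:    %vX\n", $latin1;    # E4.70.66.65.6C
```

The explicit NFC() call is the only difference between the two paths, which is the step the report would like Encode to perform on its own.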

Bjoern

Perl Info

Flags:
    category=library
    severity=medium

This perlbug was built using Perl v5.8.0 - Mon Sep  9 18:12:37 UTC 2002
It is being executed now by  Perl v5.8.0 - Mon Sep  9 18:02:36 UTC 2002.

Site configuration information for perl v5.8.0:

Configured by root at Mon Sep  9 18:02:36 UTC 2002.

Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.4.19, archname=i586-linux-thread-multi
    uname='linux bloembergen 2.4.19 #1 mon apr 15 08:57:26 gmt 2002 i686 unknown '
    config_args='-ds -e -Dprefix=/usr -Dusethreads -Di_db -Di_dbm -Di_ndbm -Di_gdbm -Duseshrplib=true'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O3 --pipe',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing'
    ccversion='', gccversion='3.2', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =''
    libpth=/lib /usr/lib /usr/local/lib
    libs=-lnsl -ldl -lm -lpthread -lc -lcrypt -lutil
    perllibs=-lnsl -ldl -lm -lpthread -lc -lcrypt -lutil
    libc=, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.2.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5/5.8.0/i586-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared'

Locally applied patches:
    


@INC for perl v5.8.0:
    /usr/lib/perl5/5.8.0/i586-linux-thread-multi
    /usr/lib/perl5/5.8.0
    /usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.0
    /usr/lib/perl5/site_perl
    .


Environment for perl v5.8.0:
    HOME=/home/bjacke
    LANG=de_DE.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/home/bjacke/lib
    LOGDIR (unset)
    PATH=/home/bjacke/bin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/usr/openwin/bin:.
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT

p5pRT commented May 6, 2003

From dankogai@dan.co.jp

On Tuesday, May 6, 2003, at 06:20 AM, debianbugs@j3e.de (via RT) wrote:

> # New Ticket Created by debianbugs@j3e.de
> # Please include the string: [perl #22111]
> # in the subject line of all future correspondence about this issue.
> # <URL: http://rt.perl.org/rt2/Ticket/Display.html?id=22111 >
>
> This is a bug report for perl from debianbugs@j3e.de,
> generated with the help of perlbug 1.34 running under perl v5.8.0.
>
> -----------------------------------------------------------------
> [Please enter your report here]
>
> the encode function of perl is not able to convert from UTF-8 which is
> in normalization form D (NFD). Normalization is handled by
> Unicode::Normalize. To use encode one has to use the workaround
> from_to(encode_utf8(NFC(decode_utf8($string))),"utf8","...")
> but encode should correctly treat NFD encoded strings.
>
> Bjoern

If perl were an application like, say, a word processor, I would agree
that perl and Encode should handle Normalization internally and
transparently so "canonically-equivalent" strings compare as equal.
But perl is a PROGRAMMING LANGUAGE so you have to be able to treat
different (though maybe equivalent Unicode-wise) things differently by
default. Otherwise you can't even implement a new normalization in perl.
So I do not consider this a bug, since perl 5.8 comes with both
Encode and Unicode::Normalize.

If you want to do it transparently, you can always use Encode::Encoding
to implement your own. Here is an example.

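package Encode::UTF8::NFD;
use strict;
use base qw(Encode::Encoding);
use Encode ();             # Encode::decode/encode are called fully qualified below
use Unicode::Normalize;    # exports NFC and NFD
__PACKAGE__->Define('utf8-nfd');

sub decode($$;$){
  my ($obj, $str, $chk) = @_;
  # Qualified call: a bare decode() here would call this method itself.
  $str = NFD(Encode::decode('utf8' => $str));
  $_[1] = '' if $chk; # this is what in-place edit means
  return $str;
}

sub encode($$;$){
  my ($obj, $str, $chk) = @_;
  # Qualified call: a bare encode() here would call this method itself.
  $str = Encode::encode('utf8' => NFC($str));
  $_[1] = '' if $chk; # this is what in-place edit means
  return $str;
}

1;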

Normalization is not an "easy thing that should be done easily". It is
definitely a "hard thing that should be possible" and it is possible
already.
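
For the latin1 case raised in the report, the same Encode::Encoding approach might be sketched as below. This is only an illustration under assumptions: the package name `My::Latin1Composed` and the encoding name `x-latin1-composed` are made up for this sketch and are not part of Encode.

```perl
use strict;
use warnings;

# Hypothetical encoding class: composes to NFC before emitting ISO-8859-1.
package My::Latin1Composed {
    use parent qw(Encode::Encoding);
    use Encode ();
    use Unicode::Normalize qw(NFC);

    __PACKAGE__->Define('x-latin1-composed');

    sub encode {
        my ($self, $str, $chk) = @_;
        # Compose first so 'a' + U+0308 becomes U+00E4, which latin1 has.
        my $octets = Encode::encode('iso-8859-1', NFC($str));
        $_[1] = '' if $chk;    # same in-place edit convention as above
        return $octets;
    }

    sub decode {
        my ($self, $octets, $chk) = @_;
        my $str = Encode::decode('iso-8859-1', $octets);
        $_[1] = '' if $chk;
        return $str;
    }
}

package main;

# The registered name can now be used like any other encoding name.
my $octets = Encode::encode('x-latin1-composed', "a\x{308}pfel");
printf "%vX\n", $octets;    # E4.70.66.65.6C
```

Keeping the normalization inside an explicitly named encoding leaves the stock `iso-8859-1` converter untouched, so callers opt in by name.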

Dan the Encode Maintainer

@p5pRT

p5pRT commented May 6, 2003

From @jhi

> this gives a chance to workaround this bug (yes, I think it is).

And I think it is not. Normalization should not be done magically.
If the Unicode string has been normalized to form D and some
originally composed characters have been decomposed, it is no longer
the same string as the original.

> well, see: from_to claims to convert from encoding1 to encoding2.
> encoding1 in this case is utf-8. Also the non-composed UTF-8 is
> perfectly valid UTF-8 and there's absolutely no reason, why
> from_to($string,"utf8","latin1") should not work just because I used
> the NFD form and not the NFC form. Your example is just a way to work

You are assuming the equivalence of (pre)composed characters and
their decomposed forms. Perl doesn't do this at any level.

> around this bug but from_to should not care if the initial string is
> NFC or NFD.

--
Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen

@p5pRT

p5pRT commented May 6, 2003

From debianbugs@j3e.de

On 2003-05-06 at 21:21 +0900 Dan Kogai sent off:

> If perl were an application like, say, a word processor, I would agree
> that perl and Encode should handle Normalization internally and
> transparently so "canonically-equivalent" strings compare as equal.
> But perl is a PROGRAMMING LANGUAGE so you have to be able to treat
> different (though maybe equivalent Unicode-wise) things differently by
> default. Otherwise you can't even implement a new normalization in perl.
> So I do not consider this a bug, since perl 5.8 comes with both
> Encode and Unicode::Normalize.

this gives a chance to work around this bug (yes, I think it is).

> If you want to do it transparently, you can always use Encode::Encoding
> to implement your own. Here is an example.

well, see: from_to claims to convert from encoding1 to encoding2.
encoding1 in this case is utf-8. Also the non-composed UTF-8 is
perfectly valid UTF-8 and there's absolutely no reason why
from_to($string,"utf8","latin1") should not work just because I used
the NFD form and not the NFC form. Your example is just a way to work
around this bug but from_to should not care if the initial string is
NFC or NFD.

Bjoern

@p5pRT

p5pRT commented May 6, 2003

From BQW10602@nifty.com

On Tue, 6 May 2003 14:46:06 +0200
Bjoern Jacke <debianbugs@j3e.de> wrote:

(snip)

> > If you want to do it transparently, you can always use Encode::Encoding
> > to implement your own. Here is an example.
>
> well, see: from_to claims to convert from encoding1 to encoding2.
> encoding1 in this case is utf-8. Also the non-composed UTF-8 is
> perfectly valid UTF-8 and there's absolutely no reason, why
> from_to($string,"utf8","latin1") should not work just because I used
> the NFD form and not the NFC form. Your example is just a way to work
> around this bug but from_to should not care if the initial string is
> NFC or NFD.
>
> Bjoern

You must suffer some information loss
when you convert Unicode to a legacy (non-Unicode)
encoding whose repertoire is a subset of Unicode.
Legacy encodings, of course, include latin1.

"Normalizability" (normalization behavior) of a legacy
encoding is defined in UAX #15.

http://www.unicode.org/reports/tr15/#Legacy_Encodings

According to this annex, Latin1 is unnormalizable except in NFC.
So latin1 is not appropriate for NFD, NFKC, and NFKD.
Actually, a legacy encoding may be unnormalizable in all the
normalization forms; e.g. encodings specified by JIS X 0208.

In a sense, the legacy encoding is just a *legacy*;
i.e., that would not be reproduced any more.
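
The loss can be made visible rather than silent. The sketch below asks Encode to croak on unmappable characters and reports whether a string survives the trip into latin1; `maps_cleanly` is a hypothetical helper written for this illustration, not an Encode function.

```perl
use strict;
use warnings;
use Encode ();
use Unicode::Normalize qw(NFD);

my $nfc = "\x{E4}pfel";    # U+00E4 followed by "pfel"
my $nfd = NFD($nfc);       # 'a', U+0308, then "pfel"

# Hypothetical helper: true if every character of $str maps into $encoding.
sub maps_cleanly {
    my ($encoding, $str) = @_;
    my $copy = $str;       # a CHECK value may modify its argument; use a copy
    my $ok = eval { Encode::encode($encoding, $copy, Encode::FB_CROAK()); 1 };
    return $ok ? 1 : 0;
}

printf "NFC -> latin1: %s\n", maps_cleanly('iso-8859-1', $nfc) ? 'ok' : 'unmappable';
printf "NFD -> latin1: %s\n", maps_cleanly('iso-8859-1', $nfd) ? 'ok' : 'unmappable';
```

Here the composed string maps cleanly, while the decomposed one fails on the combining diaeresis, which has no latin1 code point.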

SADAHIRO Tomoyuki

@p5pRT

p5pRT commented May 7, 2003

From nick.ing-simmons@elixent.com

Bjoern Jacke <debianbugs@j3e.de> writes:

> > If you want to do it transparently, you can always use Encode::Encoding
> > to implement your own. Here is an example.
>
> well, see: from_to claims to convert from encoding1 to encoding2.
> encoding1 in this case is utf-8. Also the non-composed UTF-8 is
> perfectly valid UTF-8 and there's absolutely no reason, why
> from_to($string,"utf8","latin1") should not work just because I used
> the NFD form and not the NFC form. Your example is just a way to work
> around this bug but from_to should not care if the initial string is
> NFC or NFD.

Most of perl's encodings are octet-sequence/octet-sequence converters,
which are easy to code, compact, reasonably fast and ... dumb!
I also probably gave more thought to the decode step (from some form to
Unicode) than to the encode step - for decode, producing NFC is natural.

Perhaps it makes sense to add a tweak to the encode side so that if no encoding
exists for the code point and the code-point sequence is not normalized, it tries
to normalize?
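
In user code, that idea might be sketched as a wrapper that encodes strictly first and only normalizes when that fails. The function `encode_with_nfc_fallback` is hypothetical and written for this illustration; it is not how Encode behaves, and nothing in this thread says such a tweak was adopted.

```perl
use strict;
use warnings;
use Encode ();
use Unicode::Normalize qw(NFC);

# Hypothetical helper: strict encode first; retry on the NFC form on failure.
sub encode_with_nfc_fallback {
    my ($encoding, $str) = @_;
    # Croak on unmappable characters, and leave the source string untouched.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

    my $octets = eval { Encode::encode($encoding, $str, $check) };
    return $octets if defined $octets;

    # Still dies if a character has no mapping even after composition.
    return Encode::encode($encoding, NFC($str), $check);
}

my $octets = encode_with_nfc_fallback('iso-8859-1', "a\x{308}pfel");
printf "%vX\n", $octets;    # E4.70.66.65.6C
```

Strings that already map are encoded exactly as given, so normalization is applied only when the strict attempt fails.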

> Bjoern
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/

@p5pRT

p5pRT commented May 13, 2003

From debianbugs@j3e.de

On 2003-05-06 at 15:58 +0300 Jarkko Hietaniemi sent off:

> > well, see: from_to claims to convert from encoding1 to encoding2.
> > encoding1 in this case is utf-8. Also the non-composed UTF-8 is
> > perfectly valid UTF-8 and there's absolutely no reason, why
> > from_to($string,"utf8","latin1") should not work just because I used
> > the NFD form and not the NFC form. Your example is just a way to work
>
> You are assuming the equivalence of (pre)composed characters and
> their decomposed forms. Perl doesn't do this at any level.

you are not right.

use utf8;                # the string literal below is UTF-8 source text
use Unicode::Normalize;  # exports NFC and NFD
$string = "äpfel";
$string_nfd=NFD($string);
$string_nfc=NFC($string);
if ($string_nfd eq $string_nfc) {
  print "This will be printed!";
}
if (NFD($string_nfd) eq $string_nfc) {
  print "This will *not* be printed!";
}

I still say that this is a bug and encode should be able to convert
NFD("Äpfel") to latin1. If you say it shouldn't, it's like saying an
English translator shouldn't be able to translate American English
just because they have a few different words than the British folks.

@p5pRT

p5pRT commented May 13, 2003

From @jhi

On Tue, May 13, 2003 at 01:43:40PM +0200, Bjoern Jacke wrote:

> On 2003-05-06 at 15:58 +0300 Jarkko Hietaniemi sent off:
>
> > > well, see: from_to claims to convert from encoding1 to encoding2.
> > > encoding1 in this case is utf-8. Also the non-composed UTF-8 is
> > > perfectly valid UTF-8 and there's absolutely no reason, why
> > > from_to($string,"utf8","latin1") should not work just because I used
> > > the NFD form and not the NFC form. Your example is just a way to work
> >
> > You are assuming the equivalence of (pre)composed characters and
> > their decomposed forms. Perl doesn't do this at any level.
>
> you are not right.
>
> $string = "äpfel";
> $string_nfd=NFD($string);
> $string_nfc=NFC($string);
> if ($string_nfd eq $string_nfc) {
>   print "This will be printed!";
> }
> if (NFD($string_nfd) eq $string_nfc) {
>   print "This will *not* be printed!";
> }

I am confused. The above prints nothing (no surprise there since the
bytes 0x61 0xcc 0x88 0x70 0x66 0x65 0x6c are very different from the
bytes 0xc3 0xa4 0x70 0x66 0x65 0x6c). Are you saying it should test
true in the first case? If so, I strongly disagree.

> I still say that this is a bug and encode should be able to convert

There is no Encode in the above.

> NFD("Äpfel") to latin1. If you say it shouldn't it's like saying an
> English translator shouldn't be able to translate American English,
> just because they have a few different words than the British folks.

I am sorry but I think you are simply flat out wrong and I do not feel
like arguing about this any more. Perl works at the level of bytes
and characters, not at the level of character equivalences-- that a
native Latin1 character should be equivalent to a somehow decomposed
Unicode presentation of the same character. They are not.
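
The two byte sequences quoted earlier in this message, and the behaviour of `eq`, can be reproduced with a short script. This is a sketch using the core Encode and Unicode::Normalize modules; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC NFD);

my $nfc = NFC("\x{E4}pfel");    # U+00E4 followed by "pfel"
my $nfd = NFD($nfc);            # 'a', U+0308, then "pfel"

# UTF-8 octets of each form, printed as dot-separated hex.
printf "NFC bytes: %vX\n", encode('UTF-8', $nfc);    # C3.A4.70.66.65.6C
printf "NFD bytes: %vX\n", encode('UTF-8', $nfd);    # 61.CC.88.70.66.65.6C

# eq compares code points, so the two forms differ until one is normalized.
my $raw_equal  = $nfc eq $nfd      ? 'equal' : 'different';
my $norm_equal = $nfc eq NFC($nfd) ? 'equal' : 'different';
print "eq on raw strings: $raw_equal\n";     # different
print "eq after NFC:      $norm_equal\n";    # equal
```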

--
Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen

@p5pRT

p5pRT commented May 13, 2003

From nick.ing-simmons@elixent.com

Jarkko Hietaniemi <jhi@iki.fi> writes:

> There is no Encode in the above.
>
> > NFD("Äpfel") to latin1. If you say it shouldn't it's like saying an
> > English translator shouldn't be able to translate American English,
> > just because they have a few different words than the British folks.
>
> I am sorry but I think you are simply flat out wrong and I do not feel
> like arguing about this any more. Perl works at the level of bytes
> and characters, not at the level of character equivalences-- that a
> native Latin1 character should be equivalent to a somehow decomposed
> Unicode presentation of the same character. They are not.

For what it is worth, Encode works at the character level as well.
Some decomposed (NFD) chars are to some extent representable in latin1
in that (for example) Ä could be 'A' and <U00A8> # DIAERESIS - with
a little laxity perhaps, but it is possible.

If Encode or perl coerced the normalization then these would get lost.
So the current scheme makes easy things easy (possibly with a call
to NFC() if necessary) and hard things possible.

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/

@p5pRT

p5pRT commented Aug 24, 2003

From BQW10602@nifty.com

On Wed, 07 May 2003 08:42:25 +0100
Nick Ing-Simmons <nick.ing-simmons@elixent.com> wrote:

> Bjoern Jacke <debianbugs@j3e.de> writes:
>
> > well, see: from_to claims to convert from encoding1 to encoding2.
> > encoding1 in this case is utf-8. Also the non-composed UTF-8 is
> > perfectly valid UTF-8 and there's absolutely no reason, why
> > from_to($string,"utf8","latin1") should not work just because I used
> > the NFD form and not the NFC form. Your example is just a way to work
> > around this bug but from_to should not care if the initial string is
> > NFC or NFD.
>
> Most of perl's encodings are octet-sequence/octet-sequence converters.
> Which are easy to code, compact, reasonably fast and ... dumb!
> I also probably gave more thought to decode (from some form to Unicode)
> rather than encode step - for decode producing NFC is natural.
>
> Perhaps it makes sense to add a tweak to encode side so that if no encoding
> exists for the code point and code-point sequence is not normalized it tries
> to normalize?

For transcoding and normalization at once, I wrote a tiny module,
which is somewhat broken, though:

(1) Module name?
(2) Is '//' good as a separator between an encoding name and
  a normalization form name? (At least, it would be bad
  if there were an encoding name including '/'.)
(3) Is the result exactly normalized? (This point must be the
  most important. Sufficient verification still needs to be done.)
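
To make question (2) concrete, here is a rough sketch of how an "ENCODING//FORM" specification could be split and applied on the encode side. It does not reproduce the module linked below: `encode_normalized` is a hypothetical helper, and only `Encode::encode` and `Unicode::Normalize::normalize` are real library calls.

```perl
use strict;
use warnings;
use Encode ();
use Unicode::Normalize qw(normalize);

# Hypothetical helper: "ENCODING//FORM" means normalize to FORM, then encode.
sub encode_normalized {
    my ($spec, $str) = @_;
    # Split on the first '//' only; a plain encoding name yields no form.
    my ($encoding, $form) = split m{//}, $spec, 2;
    $str = normalize($form, $str) if defined $form && length $form;
    return Encode::encode($encoding, $str);
}

my $octets = encode_normalized('iso-8859-1//NFC', "a\x{308}pfel");
printf "%vX\n", $octets;    # E4.70.66.65.6C
```

The split only works cleanly because no encoding name used here contains '/', which is the concern raised in (2).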

http://homepage1.nifty.com/nomenclator/perl/Encode-UnicodeNormalization-0.00.tar.gz

HTML (POD)
http://homepage1.nifty.com/nomenclator/perl/Encode-UnicodeNormalization.html

Regards,
SADAHIRO Tomoyuki

@p5pRT

p5pRT commented Sep 29, 2010

@cpansprout - Status changed from 'open' to 'rejected'
