Perl needs to normalize its identifiers #11573

p5pRT · 2011-08-11T19:39:21Z

Migrated from rt.perl.org#96814 (status was 'open')

Searchable as RT96814$

p5pRT · 2011-08-11T19:39:24Z

From tchrist@perl.com

Python runs its Unicode identifiers through NFD transforms, although Perl,
Ruby, and Java do not. That means a user has to know which form all his
idents are in, and which form his editor condescended to enter for him,
even though he cannot see which is which in his editor. This is prone to
bugs and errors, some of which will go long unnoticed.

*You* cannot tell which one got entered, and *you* cannot see which is
which, but Perl distinguishes otherwise identical things.

How can this possibly not be a bug?

I can figure out a tie map for hashes to make this work right, so that your
strings are autonormalized, but I cannot figure out how to do that sort of
magic to lookups in stashes, let alone in pads.

Since this is something each user must take special care to do "right"
every single time, or else he gets bugs, it is something that Perl should
be doing for him, based on the proven principle that nothing too important
to risk being forgotten should be *able* to be forgotten.

--tom

Summary of my perl5 (revision 5 version 14 subversion 0) configuration:

Platform:
osname=openbsd, osvers=4.4, archname=OpenBSD.i386-openbsd
uname='openbsd chthon 4.4 generic#0 i386 '
config_args='-des'
hint=recommended, useposix=true, d_sigaction=define
useithreads=undef, usemultiplicity=undef
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=undef, use64bitall=undef, uselongdouble=undef
usemymalloc=y, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
optimize='-O2',
cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
ccversion='', gccversion='3.3.5 (propolice)', gccosandvers='openbsd4.4'
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='cc', ldflags ='-Wl,-E -fstack-protector -L/usr/local/lib'
libpth=/usr/local/lib /usr/lib
libs=-lgdbm -lm -lutil -lc
perllibs=-lm -lutil -lc
libc=/usr/lib/libc.so.48.0, so=so, useshrplib=false, libperl=libperl.a
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
cccdlflags='-DPIC -fPIC ', lddlflags='-shared -fPIC -L/usr/local/lib -fstack-protector'

Characteristics of this binary (from libperl):
Compile-time options: MYMALLOC PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP
PERL_PRESERVE_IVUV USE_LARGE_FILES USE_PERLIO
USE_PERL_ATOF
Built under openbsd
Compiled at Jun 11 2011 11:48:28
%ENV:
PERL_UNICODE="SA"
@INC:
/usr/local/lib/perl5/site_perl/5.14.0/OpenBSD.i386-openbsd
/usr/local/lib/perl5/site_perl/5.14.0
/usr/local/lib/perl5/5.14.0/OpenBSD.i386-openbsd
/usr/local/lib/perl5/5.14.0
/usr/local/lib/perl5/site_perl/5.12.3
/usr/local/lib/perl5/site_perl/5.11.3
/usr/local/lib/perl5/site_perl/5.10.1
/usr/local/lib/perl5/site_perl/5.10.0
/usr/local/lib/perl5/site_perl/5.8.7
/usr/local/lib/perl5/site_perl/5.8.0
/usr/local/lib/perl5/site_perl/5.6.0
/usr/local/lib/perl5/site_perl/5.005
/usr/local/lib/perl5/site_perl
.

p5pRT · 2011-08-12T07:26:33Z

From @Hugmeir

On Thu, Aug 11, 2011 at 4:39 PM, tchrist1 <perlbug-followup@perl.org> wrote:

> Python runs its Unicode identifiers through NFD transforms, although Perl,
> Ruby, and Java do not.

Does Python use NFD? PEP 3131 recommends either NFC or NFKC, but I haven't
gotten too far into the accompanying discussion.

In any case, I agree that this needs to change, but I have doubts on how it
would be called from Perl-space. 'use normalization qw< NFD >;' implies that
all of the source is normalized, including string literals, so you'd
actually need to do something like 'use normalization identifiers =>
"NFD";' to avoid confusion... But that gives the impression that you can
also normalize other areas. And what about symbolic references, should those
be normalized too? Can you opt(in|out) of that? :)

> I can figure out a tie map for hashes to make this work right, so that your
> strings are autonormalized, but I cannot figure out how to do that sort of
> magic to lookups in stashes, let alone in pads.

Tying stashes is broken, so that won't do for the moment. Without giving it
much thought, I imagine we could "simply" add checks in the core, or maybe
install store/fetch hooks for GVs/pads, if those aren't a hugely terrible
idea.

Unrelated to the bug report, what does Python do with bidi control
characters? The PEP thread has a couple of suggestions (
http://mail.python.org/pipermail/python-3000/2007-May/007750.html,
http://mail.python.org/pipermail/python-3000/2007-May/007823.html,
http://mail.python.org/pipermail/python-3000/2007-May/007826.html) but I
don't know what they ended up implementing.

p5pRT · 2011-08-12T07:26:34Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2011-08-12T08:12:01Z

From tchrist@perl.com

"Brian Fraser via RT" <perlbug-followup@perl.org> wrote
on Fri, 12 Aug 2011 00:26:34 PDT:

>> Python runs its Unicode identifiers through NFD transforms, although
>> Perl, Ruby, and Java do not.

> Does Python use NFD? PEP 3131 recommends either NFC or NFKC, but I haven't
> gotten too far into the accompanying discussion.

Sorry, you're right, it's NFC:

#!/usr/bin/env python3.2
# -*- coding: UTF-8 -*-
écran = "NFD screen"
écran = "NFC screen"
print("First screen is", écran)
print("Second screen is", écran)

prints out

First screen is NFC screen
Second screen is NFC screen
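
Because the two spellings are indistinguishable on screen, the same experiment can be restated with explicit escapes. This is only a sketch: the variable names are made up for illustration, and it relies on PEP 3131's rule that identifiers are converted to NFKC while parsing, which coincides with NFC for this particular name.

```python
import unicodedata

# Spell the two forms with escapes so they cannot be confused on screen.
nfc_name = "\u00e9cran"    # e-acute as one precomposed code point
nfd_name = "e\u0301cran"   # e followed by COMBINING ACUTE ACCENT

# As plain strings the two spellings differ, and NFC maps one onto the other.
assert nfc_name != nfd_name
assert unicodedata.normalize("NFC", nfd_name) == nfc_name

# Compile source text that assigns through both spellings.
source = f'{nfd_name} = "NFD screen"\n{nfc_name} = "NFC screen"\n'
namespace = {}
exec(source, namespace)

# Ignore the __builtins__ entry that exec adds to the namespace.
user_names = sorted(k for k in namespace if not k.startswith("__"))
print(user_names)            # one entry only: both spellings name the same variable
print(namespace[nfc_name])   # the second assignment overwrote the first
```

Only one user-visible name ends up in the namespace, which matches the "NFC screen" printed twice above.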

I was worried about how this plays with Apple's HFS+, given
that it uses NFD. If you have a module named Écran, I get nervous
about how it gains a code point in length in the filesystem.
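
The length change is easy to see in a small sketch. It uses standard NFD from Python's `unicodedata` as a stand-in for what the filesystem does, which is an assumption: HFS+ applies its own decomposition rules, so the real on-disk name may differ in other cases.

```python
import unicodedata

module_name = "\u00c9cran"   # Écran with a precomposed capital E-acute (NFC)

# Assumption: standard NFD approximates the decomposition the filesystem applies.
on_disk = unicodedata.normalize("NFD", module_name)

print(len(module_name), len(on_disk))   # code point counts before and after
print(module_name == on_disk)           # a naive string comparison of the two
```

A lookup that compares the requested name with the directory entry code point by code point would therefore miss the file unless one side is normalized first.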

> In any case, I agree that this needs to change, but I have doubts on how it
> would be called from Perl-space. 'use normalization qw< NFD >;' implies that
> all of the source is normalized, including string literals, so you'd
> actually need to do something like 'use normalization identifiers =>
> "NFD";' to avoid confusion... But that gives the impression that you can
> also normalize other areas. And what about symbolic references, should those
> be normalized too? Can you opt(in|out) of that? :)

I agree that it has to be just for identifiers, not string literals,
because there are times you need to compare with something exactly.

$nfd = "écran";
$nfc = "écran";

Those need to be distinct.

I think the solution for hashes should probably be a tie layer
that normalizes its keys. That doesn't require any core changes.
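
As an illustration of the idea only, here is a rough Python analogue of such a key-normalizing layer. It is not a Perl tie implementation; the class name `NormalizedDict` and the choice of NFC as the default form are assumptions made for the sketch.

```python
import unicodedata
from collections.abc import MutableMapping


class NormalizedDict(MutableMapping):
    """Hypothetical mapping that folds every string key to one normalization form."""

    def __init__(self, form="NFC"):
        self._form = form
        self._data = {}

    def _key(self, key):
        # Only strings are normalized; other hashable keys pass through unchanged.
        if isinstance(key, str):
            return unicodedata.normalize(self._form, key)
        return key

    def __getitem__(self, key):
        return self._data[self._key(key)]

    def __setitem__(self, key, value):
        self._data[self._key(key)] = value

    def __delitem__(self, key):
        del self._data[self._key(key)]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


screens = NormalizedDict()
screens["e\u0301cran"] = "stored via the NFD spelling"
print(screens["\u00e9cran"])   # looked up via the NFC spelling
print(len(screens))            # both spellings share one entry
```

The string values stay untouched, so literals that must be compared exactly are not affected; only the keys are folded.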

>> I can figure out a tie map for hashes to make this work right, so that your
>> strings are autonormalized, but I cannot figure out how to do that sort of
>> magic to lookups in stashes, let alone in pads.

> Tying stashes is broken, so that won't do for the moment.

I was kinda just kidding, because I did remember this.

> Without giving it much thought, I imagine we could "simply" add checks
> in the core, or maybe install store/fetch hooks for GVs/pads, if those
> aren't a hugely terrible idea.

> Unrelated to the bug report, what does Python do with bidi control
> characters? The PEP thread has a couple of suggestions (
> http://mail.python.org/pipermail/python-3000/2007-May/007750.html,
> http://mail.python.org/pipermail/python-3000/2007-May/007823.html,
> http://mail.python.org/pipermail/python-3000/2007-May/007826.html) but I
> don't know what they ended up implementing.

Haven't looked at that. Bidi is ugly, since Perl stuff goes left to
right, and an RTL string could flip around weak bidi mirrors so they
look different.

Interesting:

I'll repeat that UTR#39 explicitly discourages support
for formatting characters in identifiers.

And this one

http://mail.python.org/pipermail/python-3000/2007-May/007725.html

points out that Java can get away with this because they have all these
default-ignorables they let by in source code. Yes, you can put nulls and
bells all over your Java source and the compiler will ignore them outside
literals. Scary.

This

http://mail.python.org/pipermail/python-3000/2007-May/007833.html

seems as far as they got. I don't see any resolution. Too tired to
hack out stupid bidi tricks right now to test.
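
One low-effort probe, offered as a sketch rather than an answer to what the PEP discussion settled on, is to ask Python itself whether a name containing a bidi formatting character counts as an identifier. The probe names are made up; `str.isidentifier` and `unicodedata.category` are standard library calls.

```python
import unicodedata

# Two bidi formatting characters to splice into an otherwise plain name.
probes = {
    "RLM": "\u200f",   # RIGHT-TO-LEFT MARK
    "RLO": "\u202e",   # RIGHT-TO-LEFT OVERRIDE
}

results = {}
for label, ch in probes.items():
    name = "ab" + ch + "cd"
    # Record the general category and whether Python accepts the name.
    results[label] = (unicodedata.category(ch), name.isidentifier())
    print(label, results[label])
```

Both characters are in general category Cf (format), and a name containing either one is reported as not being a valid identifier, so they are rejected rather than silently ignored.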

Hm, I wonder whether this has anything useful to say about the matter,
since they've had to think about it for URLs:

http://www.w3.org/International/iri-edit/draft-duerst-iri-05.txt

--tom

p5pRT · 2011-08-12T09:23:56Z

From @nwc10

On Fri, Aug 12, 2011 at 02:10:55AM -0600, Tom Christiansen wrote:

> I was worried about how this plays with Apple's HFS+, given
> that it uses NFD. If you have a module named Écran, I get nervous
> about how it gains a code point in length in the filesystem.

Strictly it doesn't:

http://developer.apple.com/library/mac/technotes/tn/tn1150.html#UnicodeSubtleties

IMPORTANT:

An implementation must not use the Unicode utilities implemented
by its native platform (for decomposition and comparison), unless
those algorithms are equivalent to the HFS Plus algorithms defined
here, and are guaranteed to be so forever. This is rarely the
case. Platform algorithms tend to evolve with the Unicode
standard. The HFS Plus algorithms cannot evolve because such
evolution would invalidate existing HFS Plus volumes.

It's a snapshot of NFD - I think even a snapshot of a late NFD *draft*.
And it's not allowed to change.

Which I think was an issue Father C raised - Unicode evolves, therefore
normalisation changes. Should Perl snapshot a particular normalisation and
keep that as canonical forever? Or should we run the (small) risk that
(dangerously written) scripts will change behaviour as a side effect of
running on a perl (newer or older) that doesn't use the same Unicode database?

This doesn't seem to be addressed at all in PEP 3131, so I'm assuming that
there isn't a working Python solution to adopt.

> This
> http://mail.python.org/pipermail/python-3000/2007-May/007833.html
> seems as far as they got. I don't see any resolution. Too tired to
> hack out stupid bidi tricks right now to test.

Shame.

Does any language have a working implementation of normalised Unicode
identifiers?

Nicholas Clark

p5pRT · 2011-08-12T13:27:21Z

From tchrist@perl.com

Nicholas Clark <nick@ccl4.org> wrote
on Fri, 12 Aug 2011 10:23:09 BST:

> On Fri, Aug 12, 2011 at 02:10:55AM -0600, Tom Christiansen wrote:

>> I was worried about how this plays with Apple's HFS+, given
>> that it uses NFD. If you have a module named Écran, I get nervous
>> about how it gains a code point in length in the filesystem.

> Strictly it doesn't:

...

> It's a snapshot of NFD - I think even a snapshot of a late NFD *draft*.
> And it's not allowed to change.

I usually hedge that by saying that it's quasi-NFD. I don't know any
module that implements it, so it's really annoying to predict. I hate
the poke-it-and-see-what-shows-up approach, but maybe that's all one
can do.

> Which I think was an issue Father C raised - Unicode evolves, therefore
> normalisation changes. Should Perl snapshot a particular normalisation and
> keep that as canonical forever? Or should we run the (small) risk that
> (dangerously written) scripts will change behaviour as a side effect of
> running on a perl (newer or older) that doesn't use the same Unicode database?

Is the fear that an unassigned code point would later get assigned something
that changes under normalization? If people are using unassigned code points,
then I suppose this may happen, but I can't see any other way. That's because
of Unicode's strong stability guarantee on normalization. The key point is
the last of the lines I quote below:

http://unicode.org/policies/stability_policy.html

Unlike many other standards, the Unicode Standard is continually
expanding—new characters are added to meet a variety of uses, ranging from
technical symbols to letters for archaic languages. Character properties
are also expanded or revised to meet implementation requirements.

In each new version of the Unicode Standard, the Unicode Consortium may add
characters or make certain changes to characters that were encoded in a
previous version of the standard. However, the Consortium imposes
limitations on the types of changes that can be made, in an effort to
minimize the impact on existing implementations.

...

Normalization Stability

Strong Normalization Stability
Applicable Version: Unicode 4.1+

If a string contains only characters from a given version of Unicode, and it
is put into a normalized form in accordance with that version of Unicode,
then the results will be identical to the results of putting that string
into a normalized form in accordance with any subsequent version of Unicode.

More formally, given versions V and U of Unicode, and any string S
which only contains characters assigned according to both V and U, the
following are always true:

toNFCV(S) = toNFCU(S)
toNFDV(S) = toNFDU(S)
toNFKCV(S) = toNFKCU(S)
toNFKDV(S) = toNFKDU(S)

In particular, once a character is encoded, its canonical combining
class and decomposition mapping will not be changed in any way.

Now, HFS+ came out in 1998, but the stability guarantee only applies to
Unicode version 4.1 and up, and 4.1 itself came out 2005-03-31.
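
A spot check of the cross-version claim is possible from Python, which ships a frozen Unicode 3.2 database as `unicodedata.ucd_3_2_0` next to its current one. This is a sketch under assumptions: the sample strings are arbitrary, and 3.2 predates the 4.1 cutoff of the formal guarantee, so agreement on these samples shows only that common Latin text was already stable, not that every string is.

```python
import unicodedata

legacy = unicodedata.ucd_3_2_0   # frozen Unicode 3.2 database bundled with CPython
samples = ["\u00e9cran", "e\u0301cran", "ni\u00f1o", "nin\u0303o"]

# Collect every (form, string) pair whose result differs between the two databases.
mismatches = [
    (form, s)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
    for s in samples
    if legacy.normalize(form, s) != unicodedata.normalize(form, s)
]

print(unicodedata.unidata_version)   # version of the current database
print(mismatches)                    # empty when both databases agree on the samples
```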

> This doesn't seem to be addressed at all in PEP 3131, so I'm assuming that
> there isn't a working Python solution to adopt.

I can't see that they've done anything about bidis.

> Does any language have a working implementation of normalised
> Unicode identifiers?

What exactly do you mean by this? As I said, Python runs them
through NFC. This may have ramifications on HFS+. Python
issue 11230 is about being able to import library modules
with non-ASCII names:

http://bugs.python.org/issue11230

And in particular

http://bugs.python.org/msg128724

which reads:

Short answer:

In Python 3.2, « import héhé » doesn't work on Windows, but you can have non-ASCII paths in sys.path.

Longer answer:

I fixed the import machinery to handle correctly non-ASCII characters
in module *paths*. But the import machinery is unable to handle
non-ASCII characters in module *names*: it fails if the filesystem
encoding is not UTF-8 (eg. it fails on Windows). There is another
exception: Python doesn't support (yet) non encodable module paths on
Windows. On Windows, you can use any character in directory names, but
Python 3.2 encodes paths to the filesystem encoding (ANSI code page)
which is a smaller charset. In practical, this Windows specific
limitation on module paths doesn't really matter.

I plan to fix all these issues in Python 3.3: see #3080.

--

> Could you please make it clear in documentation and web pages,
> that this feature is not working yet.

What's New in Python 3.2 documentation has this sentence: "Python’s
import mechanism can now load modules installed in directories with
non-ASCII characters in the path name. This solved an aggravating
problem with home directories for users with non-ASCII characters in
their usernames." which is correct.

Which web page should updated/fixed?

So I don't think they have it working in module names either. Besides
Perl, all of Python, Ruby, Java, and Go offer Unicode identifiers, with
various restrictions.

* Python does seem to do the IDS/IDC thing, so you might see idents
with combining marks, but these are run through NFC so tend to go
away for the common cases.

* Java I know to have filesystem issues, but Java also allows
random control characters in its identifiers, which it completely
ignores and which do not become part of those names.

* In contrast Go does not seem to use IDS/IDC, because you get compiler
errors if you have combining marks (NFD forms):

% 6g idents.go
idents.go:4: invalid identifier character 0x301
idents.go:5: invalid identifier character 0x301

% uniquote -x < idents.go
package main
func main() {
var \x{E9}cran = "NFC screen"
var e\x{301}cran = "NFD screen"
println("tes \x{E9}crans sont ", \x{E9}cran, " and ", e\x{301}cran)
}

So it doesn't mind E9, but dislikes 301.

(BTW, I keep making errors in Python because of there being no strict
vars declaration that I can find the equivalent of, whereas with
Go you don't have that problem.)

* I haven't poked at Ruby hard enough to know what it does here
with external names. But internally, NFC and NFD forms are
distinct instead of normalized:

% ruby ident.ruby
nfc
nfd

% uniquote -x < ident.ruby
#!/usr/bin/env ruby
#coding: utf-8
ni\x{F1}o = "nfc"
nin\x{303}o = "nfd"
puts ni\x{F1}o
puts nin\x{303}o
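
For contrast with the Go errors in the list above, Python's own identifier test can be asked about the same decomposed spelling. This is a small sketch using only `str.isidentifier`; the variable names are illustrative.

```python
nfd_name = "e\u0301cran"   # the decomposed spelling that 6g rejected

# A combining mark is allowed in continue position but not as the first character.
mark_inside = nfd_name.isidentifier()
mark_first = "\u0301cran".isidentifier()

print(mark_inside, mark_first)
```

The first result is True and the second is False, consistent with Python accepting combining marks as identifier-continue characters and then normalizing them away.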

--tom

p5pRT added Severity Low distro-openbsd labels Oct 19, 2019

xenu removed the Severity Low label Dec 29, 2021
