Report information
Id: 96814
Status: open
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors: tom christiansen <tchrist [at] perl.com>
Cc:
AdminCc:

Operating System: openbsd
PatchStatus: (no value)
Severity: low
Type: unknown
Perl Version: (no value)
Fixed In: (no value)



Subject: Perl needs to normalize its identifiers
Date: Thu, 11 Aug 2011 13:38:32 -0600
To: Perl5 Porters Mailing List <perl5-porters [...] perl.org>, perlbug [...] perl.org
From: Tom Christiansen <tchrist [...] perl.com>
Python runs its Unicode identifiers through NFD transforms, although Perl, Ruby, and Java do not.

That means a user has to know which form all his idents are in, and which form his editor condescended to enter for him, even though he cannot see which is which in his editor. This is prone to bugs and errors, some of which will go long unnoticed.

*You* cannot tell which one got entered, and *you* cannot see which is which, but Perl distinguishes otherwise identical things. How can this possibly not be a bug?

I can figure out a tie map for hashes to make this work right, so that your strings are autonormalized, but I cannot figure out how to do that sort of magic to lookups in stashes, let alone in pads.

Since this is something each user must take special care to do "right" every single time, or else he gets bugs, it is something that Perl should be doing for him, based on the proven principle that nothing too important to risk being forgotten should be *able* to be forgotten.

--tom

Summary of my perl5 (revision 5 version 14 subversion 0) configuration:

  Platform:
    osname=openbsd, osvers=4.4, archname=OpenBSD.i386-openbsd
    uname='openbsd chthon 4.4 generic#0 i386 '
    config_args='-des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=y, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='3.3.5 (propolice)', gccosandvers='openbsd4.4'
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags ='-Wl,-E -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib
    libs=-lgdbm -lm -lutil -lc
    perllibs=-lm -lutil -lc
    libc=/usr/lib/libc.so.48.0, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
    cccdlflags='-DPIC -fPIC ', lddlflags='-shared -fPIC -L/usr/local/lib -fstack-protector'

Characteristics of this binary (from libperl):
  Compile-time options: MYMALLOC PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP
                        PERL_PRESERVE_IVUV USE_LARGE_FILES USE_PERLIO
                        USE_PERL_ATOF
  Built under openbsd
  Compiled at Jun 11 2011 11:48:28
  %ENV:
    PERL_UNICODE="SA"
  @INC:
    /usr/local/lib/perl5/site_perl/5.14.0/OpenBSD.i386-openbsd
    /usr/local/lib/perl5/site_perl/5.14.0
    /usr/local/lib/perl5/5.14.0/OpenBSD.i386-openbsd
    /usr/local/lib/perl5/5.14.0
    /usr/local/lib/perl5/site_perl/5.12.3
    /usr/local/lib/perl5/site_perl/5.11.3
    /usr/local/lib/perl5/site_perl/5.10.1
    /usr/local/lib/perl5/site_perl/5.10.0
    /usr/local/lib/perl5/site_perl/5.8.7
    /usr/local/lib/perl5/site_perl/5.8.0
    /usr/local/lib/perl5/site_perl/5.6.0
    /usr/local/lib/perl5/site_perl/5.005
    /usr/local/lib/perl5/site_perl
    .
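The problem in the report, two names that render identically yet compare unequal, can be reproduced in a few lines. The following is only a sketch in Python (not Perl), using the standard `unicodedata` module; the sample strings and the helper name `same_identifier` are made up for illustration.

```python
import unicodedata

# Two spellings of the same visible word (illustrative data, not from the report).
nfc = "\u00e9cran"    # 'é' as one precomposed code point, U+00E9
nfd = "e\u0301cran"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

def same_identifier(a, b, form="NFC"):
    """Hypothetical helper: compare two names after normalizing both to one form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(nfc == nfd)                  # raw comparison of the code point sequences
print(len(nfc), len(nfd))          # the decomposed spelling is one code point longer
print(same_identifier(nfc, nfd))   # comparison after normalization
```

A language that does not normalize its identifiers behaves like the raw comparison; one that does behaves like `same_identifier`.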
CC: bugs-bitbucket [...] rt.perl.org
Subject: Re: [perl #96814] Perl needs to normalize its identifiers
Date: Fri, 12 Aug 2011 04:26:00 -0300
To: perl5-porters [...] perl.org
From: Brian Fraser <fraserbn [...] gmail.com>
On Thu, Aug 11, 2011 at 4:39 PM, tchrist1 <perlbug-followup@perl.org> wrote:

> Python runs its Unicode identifiers through NFD transforms, although Perl,
> Ruby, and Java do not.

Does Python use NFD? PEP 3131 recommends either NFC or NFKC, but I haven't gotten too far into the accompanying discussion.

In any case, I agree that this needs to change, but I have doubts on how it would be called from Perl-space. 'use normalization qw< NFD >;' implies that all of the source is normalized, including string literals, so you'd actually need to do something like 'use normalization identifiers => "NFD";' to avoid confusion... But that gives the impression that you can also normalize other areas. And what about symbolic references, should those be normalized too? Can you opt(in|out) of that? :)

> I can figure out a tie map for hashes to make this work right, so that your
> strings are autonormalized, but I cannot figure out how to do that sort of
> magic to lookups in stashes, let alone in pads.

Tying stashes is broken, so that won't do for the moment. Without giving it much thought, I imagine we could "simply" add checks in the core, or maybe install store/fetch hooks for GVs/pads, if those aren't a hugely terrible idea.

Unrelated to the bug report, what does Python do with bidi control characters? The PEP thread has a couple of suggestions (http://mail.python.org/pipermail/python-3000/2007-May/007750.html, http://mail.python.org/pipermail/python-3000/2007-May/007823.html, http://mail.python.org/pipermail/python-3000/2007-May/007826.html) but I don't know what they ended up implementing.
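The question of which form Python applies can be checked empirically. The sketch below compiles a one-line program whose variable name is spelled in NFD and then inspects the name that actually lands in the namespace. The Python language reference describes the conversion as NFKC; for this particular name NFC and NFKC give the same result. The variable names `source`, `namespace`, and `stored` are illustrative only.

```python
# The source text spells the identifier in NFD: 'e' + U+0301, then "cran".
source = 'e\u0301cran = "NFD screen"\n'

namespace = {}
exec(source, namespace)

# Collect the user-defined names, skipping the __builtins__ entry exec() adds.
stored = [name for name in namespace if name != "__builtins__"]

# Show the code points of the name as the compiler stored it.
print([f"U+{ord(ch):04X}" for ch in stored[0]])

# Lookup by the precomposed spelling succeeds, because the name was normalized.
print(namespace["\u00e9cran"])
```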
Subject: Re: [perl #96814] Perl needs to normalize its identifiers
Date: Fri, 12 Aug 2011 02:10:55 -0600
To: perlbug-followup [...] perl.org
From: Tom Christiansen <tchrist [...] perl.com>
"Brian Fraser via RT" <perlbug-followup@perl.org> wrote on Fri, 12 Aug 2011 00:26:34 PDT:

>> Python runs its Unicode identifiers through NFD transforms, although
>> Perl, Ruby, and Java do not.

> Does Python use NFD? PEP 3131 recommends either NFC or NFKC, but I haven't
> gotten too far into the accompanying discussion.

Sorry, you're right, it's NFC:

    #!/usr/bin/env python3.2
    # -*- coding: UTF-8 -*-

    écran = "NFD screen"
    écran = "NFC screen"

    print("First screen is", écran)
    print("Second screen is", écran)

prints out

    First screen is NFC screen
    Second screen is NFC screen

I was worried about how this plays with Apple's HFS+, given that it uses NFD. If you have a module named Écran, I get nervous about how it gains a code point in length in the filesystem.

> In any case, I agree that this needs to change, but I have doubts on how it
> would be called from Perl-space. 'use normalization qw< NFD >;' implies that
> all of the source is normalized, including string literals, so you'd
> actually need to do something like 'use normalization identifiers =>
> "NFD";' to avoid confusion... But that gives the impression that you can
> also normalize other areas. And what about symbolic references, should those
> be normalized too? Can you opt(in|out) of that? :)

I agree that it has to be just for identifiers, not string literals, because there are times you need to compare with something exactly.

    $nfd = "écran";
    $nfc = "écran";

Those need to be distinct. I think the solution for hashes should probably be a tie layer that normalizes its keys. That doesn't require any core changes.

>> I can figure out a tie map for hashes to make this work right, so that your
>> strings are autonormalized, but I cannot figure out how to do that sort of
>> magic to lookups in stashes, let alone in pads.

> Tying stashes is broken, so that won't do for the moment.

I was kinda just kidding, because I did remember this.

> Without giving it much thought, I imagine we could "simply" add checks
> in the core, or maybe install store/fetch hooks for GVs/pads, if those
> aren't a hugely terrible idea.

> Unrelated to the bug report, what does Python do with bidi control
> characters? The PEP thread has a couple of suggestions (

Haven't looked at that. Bidi is ugly, since Perl stuff goes left to right, and an RTL string could flip around weak bidi mirrors so they look different.

Interesting:

>> I'll repeat that UTR#39 explicitly discourages support
>> for formatting characters in identifiers.

And this one

    http://mail.python.org/pipermail/python-3000/2007-May/007725.html

points out that Java can get away with this because they have all these default-ignorables they let by in source code. Yes, you can put nulls and bells all over your Java source and the compiler will ignore them outside literals. Scary.

This

    http://mail.python.org/pipermail/python-3000/2007-May/007833.html

seems as far as they got. I don't see any resolution. Too tired to hack out stupid bidi tricks right now to test.

Hm, I wonder whether this has anything useful to say about the matter, since they've had to think about it for URLs:

    http://www.w3.org/International/iri-edit/draft-duerst-iri-05.txt

--tom
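The "tie layer that normalizes its keys" suggested for hashes can be sketched without touching any language core. The example below is a Python analogue, not a Perl `tie` implementation: the class name `NormalizedKeyDict` and its `form` parameter are hypothetical, and it simply normalizes every string key on the way in and on lookup. String values, like the `$nfd` and `$nfc` literals above, are left untouched.

```python
import unicodedata
from collections.abc import MutableMapping

class NormalizedKeyDict(MutableMapping):
    """Hypothetical mapping that keeps all string keys in one normalization form."""

    def __init__(self, form="NFC"):
        self._form = form
        self._data = {}

    def _norm(self, key):
        # Only strings are normalized; other key types pass through unchanged.
        return unicodedata.normalize(self._form, key) if isinstance(key, str) else key

    def __getitem__(self, key):
        return self._data[self._norm(key)]

    def __setitem__(self, key, value):
        self._data[self._norm(key)] = value

    def __delitem__(self, key):
        del self._data[self._norm(key)]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

screens = NormalizedKeyDict()
screens["e\u0301cran"] = "stored under the NFD spelling"
print(screens["\u00e9cran"])   # fetched through the NFC spelling
print(len(screens))            # both spellings share a single entry
```

This covers ordinary hashes only; it says nothing about how symbol-table or lexical-pad lookups would be normalized, which is the harder part raised in the thread.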
CC: perlbug-followup [...] perl.org
Subject: Re: [perl #96814] Perl needs to normalize its identifiers
Date: Fri, 12 Aug 2011 10:23:09 +0100
To: Tom Christiansen <tchrist [...] perl.com>
From: Nicholas Clark <nick [...] ccl4.org>
On Fri, Aug 12, 2011 at 02:10:55AM -0600, Tom Christiansen wrote:

> I was worried about how this plays with Apple's HFS+, given
> that it uses NFD. If you have a module named Écran, I get nervous
> about how it gains a code point in length in the filesystem.

Strictly it doesn't:

http://developer.apple.com/library/mac/technotes/tn/tn1150.html#UnicodeSubtleties

    IMPORTANT: An implementation must not use the Unicode utilities implemented by its native platform (for decomposition and comparison), unless those algorithms are equivalent to the HFS Plus algorithms defined here, and are guaranteed to be so forever. This is rarely the case. Platform algorithms tend to evolve with the Unicode standard. The HFS Plus algorithms cannot evolve because such evolution would invalidate existing HFS Plus volumes.

It's a snapshot of NFD - I think even a snapshot of a late NFD *draft*. And it's not allowed to change.

Which I think was an issue Father C raised - Unicode evolves, therefore normalisation changes. Should Perl snapshot a particular normalisation and keep that as canonical forever? Or should we run the (small) risk that (dangerously written) scripts will change behaviour as a side effect of running on a perl (newer or older) that doesn't use the same Unicode database?

This doesn't seem to be addressed at all in PEP 3131, so I'm assuming that there isn't a working Python solution to adopt.

> This
>
> http://mail.python.org/pipermail/python-3000/2007-May/007833.html
>
> seems as far as they got. I don't see any resolution. Too tired to
> hack out stupid bidi tricks right now to test.

Shame.

Does any language have a working implementation of normalised Unicode identifiers?

Nicholas Clark
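One rough way to gauge the risk of behaviour changing between Unicode databases is to flag names containing code points that the running database does not assign. This rests on an assumption: that, under the Unicode normalization stability policy (Unicode 4.1 and later), only such code points can normalize differently in a later version. The sketch is in Python; the helper name `version_sensitive_names` and the sample identifiers are hypothetical.

```python
import unicodedata

def version_sensitive_names(names):
    """Return the names containing a code point of category Cn (unassigned or
    noncharacter) according to this interpreter's Unicode database."""
    return [
        name for name in names
        if any(unicodedata.category(ch) == "Cn" for ch in name)
    ]

print("Unicode data version:", unicodedata.unidata_version)

# U+0378 is assumed here to be unassigned; that depends on the database version.
idents = ["\u00e9cran", "e\u0301cran", "x\u0378"]
print(version_sensitive_names(idents))
```

Names built only from assigned characters pass the check whichever spelling they use, so a tool could warn about the rest instead of freezing one Unicode version forever.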
CC: perlbug-followup [...] perl.org
Subject: Re: [perl #96814] Perl needs to normalize its identifiers
Date: Fri, 12 Aug 2011 07:26:21 -0600
To: Nicholas Clark <nick [...] ccl4.org>
From: Tom Christiansen <tchrist [...] perl.com>
Nicholas Clark <nick@ccl4.org> wrote on Fri, 12 Aug 2011 10:23:09 BST:

>On Fri, Aug 12, 2011 at 02:10:55AM -0600, Tom Christiansen wrote:

>> I was worried about how this plays with Apple's HFS+, given
>> that it uses NFD. If you have a module named Écran, I get nervous
>> about how it gains a code point in length in the filesystem.

>Strictly it doesn't:

...

> It's a snapshot of NFD - I think even a snapshot of a late NFD *draft*.
> And it's not allowed to change.

I usually hedge that by saying that it's quasi-NFD. I don't know any module that implements it, so it's really annoying to predict. I hate the poke-it-and-see-what-shows-up approach, but maybe that's all one can do.

> Which I think was an issue Father C raised - Unicode evolves, therefore
> normalisation changes. Should Perl snapshot a particular normalisation and
> keep that as canonical forever? Or should we run the (small) risk that
> (dangerously written) scripts will change behaviour as a side effect of
> running on a perl (newer or older) that doesn't use the same Unicode database.

Is the fear that an unassigned code point would later get assigned something that changes under normalization? If people are using unassigned code points, then I suppose this may happen, but I can't see any other way.

That's because of Unicode's strong stability guarantee on normalization. The key point is the last of the lines I quote below:

http://unicode.org/policies/stability_policy.html

    Unlike many other standards, the Unicode Standard is continually expanding - new characters are added to meet a variety of uses, ranging from technical symbols to letters for archaic languages. Character properties are also expanded or revised to meet implementation requirements.

    In each new version of the Unicode Standard, the Unicode Consortium may add characters or make certain changes to characters that were encoded in a previous version of the standard. However, the Consortium imposes limitations on the types of changes that can be made, in an effort to minimize the impact on existing implementations.

    ...

    Normalization Stability

    Strong Normalization Stability
    Applicable Version: Unicode 4.1+

    If a string contains only characters from a given version of Unicode, and it is put into a normalized form in accordance with that version of Unicode, then the results will be identical to the results of putting that string into a normalized form in accordance with any subsequent version of Unicode.

    More formally, given versions V and U of Unicode, and any string S which only contains characters assigned according to both V and U, the following are always true:

        toNFC_V(S) = toNFC_U(S)
        toNFD_V(S) = toNFD_U(S)
        toNFKC_V(S) = toNFKC_U(S)
        toNFKD_V(S) = toNFKD_U(S)

    In particular, once a character is encoded, its canonical combining class and decomposition mapping will not be changed in any way.

Now, HFS+ came out in 1998, but the stability guarantee only applies to Unicode version 4.1 and up, and 4.1 itself came out 2005-03-31.

> This doesn't seem to be addressed at all in PEP 3131, so I'm assuming that
> there isn't a working Python solution to adopt.

I can't see that they've done anything about bidis.

> Does any language have a working implementation of normalised
> Unicode identifiers?

What exactly do you mean by this? As I said, Python runs them through NFC. This may have ramifications on HFS+.

Python issue 11230 is about being able to import library modules with non-ASCII names, at

    http://bugs.python.org/issue11230

And in particular

    http://bugs.python.org/msg128724

which reads:

    Short answer: In Python 3.2, « import héhé » doesn't work on Windows, but you can have non-ASCII paths in sys.path.

    Longer answer: I fixed the import machinery to handle correctly non-ASCII characters in module *paths*. But the import machinery is unable to handle non-ASCII characters in module *names*: it fails if the filesystem encoding is not UTF-8 (eg. it fails on Windows).

    There is another exception: Python doesn't support (yet) non encodable module paths on Windows. On Windows, you can use any character in directory names, but Python 3.2 encodes paths to the filesystem encoding (ANSI code page) which is a smaller charset. In practical, this Windows specific limitation on module paths doesn't really matter.

    I plan to fix all these issues in Python 3.3: see #3080.

    --

    > Could you please make it clear in documentation and web pages,
    > that this feature is not working yet.

    What's New in Python 3.2 documentation has this sentence:

    "Python’s import mechanism can now load modules installed in directories with non-ASCII characters in the path name. This solved an aggravating problem with home directories for users with non-ASCII characters in their usernames."

    which is correct. Which web page should updated/fixed?

So I don't think they have it working in module names either.

Besides Perl, all of Python, Ruby, Java, and Go offer Unicode identifiers, with various restrictions.

  * Python does seem to do the IDS/IDC thing, so you might see idents with combining marks, but these are run through NFC so tend to go away for the common cases.

  * Java I know to have filesystem issues, but Java also allows for random control characters in its identifiers, which it completely ignores and which do not become part of those names.

  * In contrast Go does not seem to use IDS/IDC, because you get compiler errors if you have combining marks (NFD forms):

        % 6g idents.go
        idents.go:4: invalid identifier character 0x301
        idents.go:5: invalid identifier character 0x301

        % uniquote -x < idents.go
        package main
        func main() {
            var \x{E9}cran = "NFC screen"
            var e\x{301}cran = "NFD screen"
            println("tes \x{E9}crans sont ", \x{E9}cran, " and ", e\x{301}cran)
        }

    So it doesn't mind E9, but dislikes 301. (BTW, I keep making errors in Python because of there being no strict vars declaration that I can find the equivalent of, whereas with Go you don't have that problem.)

  * I haven't poked at Ruby hard enough to know what it does here with external names. But internally, NFC and NFD forms are distinct instead of normalized:

        % ruby ident.ruby
        nfc
        nfd

        % uniquote -x < ident.ruby
        #!/usr/bin/env ruby
        #coding: utf-8

        ni\x{F1}o = "nfc"
        nin\x{303}o = "nfd"

        puts ni\x{F1}o
        puts nin\x{303}o

--tom
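The two policies seen in this survey, rejecting combining marks outright (as the Go output suggests) versus folding them away through normalization (as Python does), can be contrasted in a short sketch. It is written in Python with the standard `unicodedata` module; the helper name `marks_in` is hypothetical, and treating every character of general category M as a rejected mark is an assumption about what the Go compiler objects to, based only on the errors shown above.

```python
import unicodedata

def marks_in(name):
    """Hypothetical helper: list (index, code point) for each mark character
    (general category Mn, Mc, or Me) found in name."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(name)
        if unicodedata.category(ch).startswith("M")
    ]

for name in ["\u00e9cran", "e\u0301cran", "ni\u00f1o", "nin\u0303o"]:
    marks = marks_in(name)
    folded = unicodedata.normalize("NFC", name)
    # Strict policy: any mark is an error. Normalizing policy: compare the folded form.
    print(ascii(name), "marks:", marks, "NFC form:", ascii(folded))
```

Under the strict policy the NFD spellings are simply errors, so the two forms can never coexist; under the normalizing policy they silently become the same name.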

