Bogus error message "Malformed UTF-8 character" when using a non-word Unicode character #9862

Closed
p5pRT opened this issue Sep 6, 2009 · 23 comments

Comments

@p5pRT

p5pRT commented Sep 6, 2009

Migrated from rt.perl.org#69032 (status was 'resolved')

Searchable as RT69032$

@p5pRT
Author

p5pRT commented Sep 6, 2009

From @moritz

Created by moritz@faui2k3.org

As pointed out on <http://www.perlmonks.org/?node_id=793800>, a program which
tries to use a non-ASCII non-alphanumeric character in a variable name throws
an error "Malformed UTF-8 character (unexpected end of string) at", even
though the file is in perfectly fine UTF-8.

Example:
$ cat foo.pl
use utf8;
my $» = 1;
$ perl foo.pl
Malformed UTF-8 character (unexpected end of string) at foo.pl line 2.
Unrecognized character \xBB in column 5 at foo.pl line 2.

A better error message might be "Character '%s' not allowed in identifier",
or something like that.
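
A minimal sketch of the kind of check that could produce the suggested message. This is not how perl's tokeniser works: `bad_sigil_chars` is a hypothetical helper, and treating `\p{XID_Start}` as the test for "allowed" is an assumption made only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: scan decoded source text for a "$" that is
# immediately followed by a non-ASCII character which cannot start an
# identifier, and build the suggested message for each one found.
sub bad_sigil_chars {
    my ($src) = @_;
    my @messages;
    while ($src =~ /\$([^\x00-\x7F])/g) {
        my $char = $1;
        # Assumption: XID_Start stands in for "allowed to start a name".
        next if $char =~ /\A\p{XID_Start}\z/;
        push @messages,
            sprintf("Character '\\x{%04X}' not allowed in identifier", ord($char));
    }
    return @messages;
}

# U+00BB is the character from the report; U+03C0 (pi) is a letter.
print "$_\n" for bad_sigil_chars("my \$\x{BB} = 1; my \$\x{3C0} = 2;");
```

Only the U+00BB variable is reported; the pi variable passes because pi is a letter.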

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.10.0:

Configured by Debian Project at Fri Aug 28 22:23:22 UTC 2009.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.30.5-dsa-amd64,
archname=x86_64-linux-gnu-thread-multi
    uname='linux brahms 2.6.30.5-dsa-amd64 #1 smp mon aug 17 02:18:43
cest 2009 x86_64 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN
-Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr
-Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10
-Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5
-Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local
-Dsitelib=/usr/local/share/perl/5.10.0
-Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1
-Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1
-Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl
-Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio
-Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib
-Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN
-fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing
-pipe -I/usr/local/include'
    ccversion='', gccversion='4.3.2', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.7.so, so=so, useshrplib=true, libperl=libperl.so.5.10.0
    gnulibc_version='2.7'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'

Locally applied patches:



@INC for perl 5.10.0:
    /home/moritz/cpan/lib
    /etc/perl
    /usr/local/lib/perl/5.10.0
    /usr/local/share/perl/5.10.0
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    .


Environment for perl 5.10.0:
    HOME=/home/moritz
    LANG=en_US.UTF-8
    LANGUAGE=C
    LC_CTYPE=de_DE.UTF-8
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)

PATH=/bin:/sbin:/usr/bin:/usr/sbin:/home/moritz/bin:/usr/games:/usr/local/Eiffel54/studio/spec/linux-glibc2.1/bin:/usr/bin/X11:/usr/local/bin:/usr/local/Wolfram/Mathematica/5.0/Executables/:/mnt/ex/moritz/matlab/bin
    PERL5LIB=/home/moritz/cpan/lib
    PERL6LIB=/home/moritz/src/svg-plot/lib:/home/moritz/src/svg/lib
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Author

p5pRT commented Sep 7, 2009

From john.imrie@vodafoneemail.co.uk

Moritz Lenz (via RT) wrote​:

# New Ticket Created by Moritz Lenz
# Please include the string​: [perl #69032]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=69032 >

This is a bug report for perl from moritz@​faui2k3.org,
generated with the help of perlbug 1.36 running under perl 5.10.0.

-----------------------------------------------------------------
[Please enter your report here]

As pointed out on <http​://www.perlmonks.org/?node_id=793800>, a program
which
tries to use a non-ASCII non-alphanumeric character in a variable name
throws
an error "Malformed UTF-8 character (unexpected end of string) at", even
though the file is in perfectly fine UTF-8.

Example​:
$ cat foo.pl
use utf8;
my $» = 1;
$ perl foo.pl
Malformed UTF-8 character (unexpected end of string) at foo.pl line 2.
Unrecognized character \xBB in column 5 at foo.pl line 2.

Humm. Do we actually allow *any* unicode codepoint in an identifier or
only those matching \p{ID_Start}\p{ID_Continue}* ?


@p5pRT
Author

p5pRT commented Sep 7, 2009

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Author

p5pRT commented Sep 7, 2009

From @ikegami

On Sun, Sep 6, 2009 at 2:35 PM, Moritz Lenz <perlbug-followup@perl.org> wrote:

# New Ticket Created by Moritz Lenz
# Please include the string​: [perl #69032]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=69032 >

This is a bug report for perl from moritz@​faui2k3.org,
generated with the help of perlbug 1.36 running under perl 5.10.0.

-----------------------------------------------------------------
[Please enter your report here]

As pointed out on <http​://www.perlmonks.org/?node_id=793800>, a program
which
tries to use a non-ASCII non-alphanumeric character in a variable name
throws
an error "Malformed UTF-8 character (unexpected end of string) at", even
though the file is in perfectly fine UTF-8.

Example​:
$ cat foo.pl
use utf8;
my $» = 1;
$ perl foo.pl
Malformed UTF-8 character (unexpected end of string) at foo.pl line 2.
Unrecognized character \xBB in column 5 at foo.pl line 2.

A better error message might be "Character '%s' not allowed in
identifier", or
something like that.

There are two problems.

If the character that follows "$" is not normally allowed as part of an
identifier, it is taken to be the entire name of a special package var (like
$_, $$, $[, etc). The first problem is that the tokeniser treats the *byte*
following "$" as the name of the special variable, leaving a partial UTF-8
character for the tokeniser to find. That accounts for the first error
message.

The second problem is the poor error message "Unrecognized character %s".
Even "Unknown operator %s" would be more helpful. It would be even more
useful to assume an 8+ bit character following an identifier was meant to be
part of the identifier, in which case the message should be as Moritz
suggested.
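
The first problem can be illustrated outside the tokeniser. The sketch below is not perl's own code: `bytes_left_after_first` is a hypothetical helper that encodes one character as UTF-8, drops the first byte (as if that byte alone had been taken as the variable name), and shows what is left for the next token.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode one character as UTF-8, discard the first
# byte, and return the remaining bytes in \xHH notation.
sub bytes_left_after_first {
    my ($char) = @_;
    my $bytes = encode('UTF-8', $char);
    return join ' ',
        map { sprintf('\\x%02X', ord($_)) } split(//, substr($bytes, 1));
}

print bytes_left_after_first("\x{BB}"),   "\n";   # U+00BB, from the report
print bytes_left_after_first("\x{2660}"), "\n";   # U+2660 BLACK SPADE SUIT
```

For U+00BB the leftover is `\xBB`, and for U+2660 it starts with `\x99`. In both cases the first leftover byte is a lone continuation byte, and it is the byte named in the corresponding "Unrecognized character" message.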

@p5pRT
Author

p5pRT commented Sep 7, 2009

From @ikegami

On Mon, Sep 7, 2009 at 4​:53 PM, Eric Brine <ikegami@​adaelis.com> wrote​:

On Sun, Sep 6, 2009 at 2​:35 PM, Moritz Lenz <perlbug-followup@​perl.org>wrote​:

# New Ticket Created by Moritz Lenz
# Please include the string​: [perl #69032]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=69032 >

As pointed out on <http​://www.perlmonks.org/?node_id=793800>, a program

which
tries to use a non-ASCII non-alphanumeric character in a variable name
throws
an error "Malformed UTF-8 character (unexpected end of string) at", even
though the file is in perfectly fine UTF-8.

A better error message might be "Character '%s' not allowed in

identifier", or
something like that.

There are two problems.

If the character that follows "$" is not normally allowed as part of an
identifier, it is taken to be the entire name of a special package var (like
$_, $$, $[, etc). The first problem is that the tokeniser treats the *byte*
following "$" as the name of the special variable, leaving a partial UTF-8
character for the tokeniser to find. That accounts for the first error
message.

The second problem is the poor error message "Unrecognized character %s".
Even "Unknown operator %s" would be more helpful. It would be even more
useful to assume an 8+ bit character following an identifier was meant to be
part of the identifier, in which case the message should be as Moritz
suggested.

perl -CO -E"say qq{use utf8; my \$i\x{2660};}" | perl
Unrecognized character \xE2 in column 16 at - line 1.

Moritz suggests "Character \x2660 not allowed in identifier in column 16 at
- line 1."

perl -CO -E"say qq{use utf8; 0+\x{2660};}" | perl
Unrecognized character \xE2 in column 13 at - line 1.

The identified character is wrong (E2 instead of 2660). Otherwise, this is
consistent with "use utf8" absent.

perl -CO -E"say qq{use utf8; my \$\x{2660};}" | perl
Malformed UTF-8 character (unexpected end of string) at - line 1.
Unrecognized character \x99 in column 15 at - line 1.

If consistent with "use utf8" absent, it would be "Can't use global $♠ in
"my" at - line 1, near "my $♠""

perl -CO -E"say qq{use utf8; \$\x{2660};}" | perl
Malformed UTF-8 character (unexpected end of string) at - line 1.
Unrecognized character \x99 in column 12 at - line 1.

If consistent with "use utf8" absent, this would not be an error at all.

perl -CO -E"say qq{use utf8; my \$\x{2660}i;}" | perl
Malformed UTF-8 character (unexpected end of string) at - line 1.
Unrecognized character \x99 in column 15 at - line 1.

If consistent with "use utf8" absent, it would be "Bareword found where
operator expected at - line 1, near "$♠i" (Missing operator before i?)"

- Eric "ikegami" Brine

@p5pRT
Author

p5pRT commented Apr 17, 2011

From perl-diddler@tlinx.org

Created by perl-diddler@tlinx.org

I was trying to use the utf-8 character U+2424, called
'SYMBOL FOR NEWLINE' as an identifier containing "\n" as
follows​:

0 #!/usr/bin/perl -w
1 use strict;
2 use utf8;
3 use Readonly my $␤ => "\n";
4 print "line1$␤line2\n";

The unicode character (which may not display correctly, in
whatever viewer you are using, so 'beware') is on lines 3
and 4. On line 3, it's right after the "$" and before the "<tab>=>".
On line 4, it's also after the "$" between "line1" and "line2".

A Hexdump of the above​:
00000000 23 21 2f 75 73 72 2f 62 69 6e 2f 70 65 72 6c 20 |#!/usr/bin/perl |
00000010 2d 77 0a 75 73 65 20 73 74 72 69 63 74 3b 0a 75 |-w.use strict;.u|
00000020 73 65 20 75 74 66 38 3b 0a 75 73 65 20 52 65 61 |se utf8;.use Rea|
00000030 64 6f 6e 6c 79 20 6d 79 20 24 e2 90 a4 09 3d 3e |donly my $....=>|
00000040 20 22 5c 6e 22 3b 0a 70 72 69 6e 74 20 22 6c 69 | "\n";.print "li|
00000050 6e 65 31 24 e2 90 a4 6c 69 6e 65 32 5c 6e 22 3b |ne1$...line2\n";|
00000060 0a 0a |..|
00000062

This shows U+2424 correctly encoded as the bytes e2 90 a4 at offsets 0x3A
and 0x54.
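
The byte sequence in the hexdump can be checked by hand. A minimal sketch, covering only the three-byte range of UTF-8 (U+0800 to U+FFFF); `utf8_three_bytes` is a name made up for this example.

```perl
use strict;
use warnings;

# Encode a code point in U+0800..U+FFFF as three UTF-8 bytes by hand,
# using the layout 1110xxxx 10xxxxxx 10xxxxxx. Other ranges are out of
# scope for this sketch.
sub utf8_three_bytes {
    my ($cp) = @_;
    die "code point out of range for this sketch\n"
        if $cp < 0x800 || $cp > 0xFFFF;
    return (
        0xE0 | ($cp >> 12),            # top 4 bits
        0x80 | (($cp >> 6) & 0x3F),    # middle 6 bits
        0x80 | ($cp & 0x3F),           # low 6 bits
    );
}

printf "U+2424 => %s\n",
    join ' ', map { sprintf('%02x', $_) } utf8_three_bytes(0x2424);
```

This prints `U+2424 => e2 90 a4`, the same three bytes that appear after each `24` (`$`) in the dump.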

However, when I try to run this, I get​:
Malformed UTF-8 character (unexpected end of string) at /tmp/ptest.pl line 4.
Unrecognized character \x90 in column 18 at /tmp/ptest.pl line 4.

Note, FWIW, I've successfully used other characters in the same
way in another program like this (a fragment from another prog)​:

----
use utf8;
binmode STDOUT, 'encoding(UTF-8)';
use Readonly; BEGIN{*RO=\&Readonly::Readonly}

my %constants = (
  'Phi' => .5*(5.**.5-1),
  'Φ' => .5*(5.**.5-1),
  'phi' => .5*(1.+5.**.5),
  'ɸ' => .5*(1.+5.**.5),
  'pi' => 4*atan2(1,1),
  'π' => 4*atan2(1,1),
);

sub init_constants (;$) {
  my $no_banner=$_[0];
  print "Constants: " unless $no_banner;
  my $sep="";

  foreach my $k (keys %constants){
    my $v=$constants{$k};
    print $sep,"\$",$k unless $no_banner;
    $sep=", ";
    RO $$k => $v;
  }

  print "\n" unless $no_banner;
}
&init_constants;
----

So I'm surprised at this specific failure, since in looking at the hex,
the character encoding appears correct.

Let me know if you have any questions.
(An excellent UTF-8 character util that, unfortunately, is Windows only
can be gotten from http://www.babelstone.co.uk/Software/BabelMap.html.)

Perl Info

Flags:
    category=core
    severity=medium

This perlbug was built using Perl 5.10.0 - Fri Jul 30 00:12:10 UTC 2010
It is being executed now by  Perl 5.10.0 - Thu Sep 16 16:14:28 UTC 2010.

Site configuration information for perl 5.10.0:

Configured by abuild at Thu Sep 16 16:14:28 UTC 2010.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.31, archname=x86_64-linux-thread-multi
    uname='linux build35 2.6.31 #1 smp 2010-01-06 16:07:25 +0100 x86_64 x86_64 x86_64 gnulinux '
    config_args='-ds -e -Dprefix=/usr -Dvendorprefix=/usr -Dinstallusrbinperl -Dusethreads -Di_db -Di_dbm -Di_ndbm -Di_gdbm -Duseshrplib=true -DEBUGGING=both -Doptimize=-fmessage-length=0 -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -Wall -pipe -Accflags=-DPERL_USE_SAFE_PUTENV'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DPERL_USE_SAFE_PUTENV -DDEBUGGING -fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-fmessage-length=0 -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -Wall -pipe -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DPERL_USE_SAFE_PUTENV -DDEBUGGING -fno-strict-aliasing -pipe'
    ccversion='', gccversion='4.4.1 [gcc-4_4-branch revision 150839]', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib64'
    libpth=/lib64 /usr/lib64 /usr/local/lib64
    libs=-lm -ldl -lcrypt -lpthread
    perllibs=-lm -ldl -lcrypt -lpthread
    libc=/lib64/libc-2.10.1.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.10.1'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.10.0/x86_64-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib64'

Locally applied patches:
    


@INC for perl 5.10.0:
    /usr/local/lib/perl/5.8
    /usr/lib/perl5/5.10.0/x86_64-linux-thread-multi
    /usr/lib/perl5/5.10.0
    /usr/lib/perl5/site_perl/5.10.0/x86_64-linux-thread-multi
    /usr/lib/perl5/site_perl/5.10.0
    /usr/lib/perl5/vendor_perl/5.10.0/x86_64-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.10.0
    /usr/lib/perl5/vendor_perl
    .


Environment for perl 5.10.0:
    HOME=/home/law
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LC_CTYPE=en_US.UTF-8
    LD_LIBRARY_PATH=/usr/lib64/mpi/gcc/openmpi/lib64
    LOGDIR (unset)
    PATH=.:/sbin:/usr/local/sbin:/usr/lib64/mpi/gcc/openmpi/bin:/home/law/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/games:/opt/kde3/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/usr/lib/qt3/bin:/usr/sbin
    PERL5LIB=/usr/local/lib/perl/5.8
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Author

p5pRT commented Apr 18, 2011

From tchrist@perl.com

Linda Walsh (via RT) <perlbug-followup@​perl.org> wrote
  on Sun, 17 Apr 2011 01​:22​:25 PDT​:

I was trying to use the utf-8 character U+2424, called
'SYMBOL FOR NEWLINE' as an identifier containing "\n" as
follows​:

0 #!/usr/bin/perl -w
1 use strict;
2 use utf8;
3 use Readonly my $␤ => "\n";
4 print "line1$␤line2\n";

That isn't allowed. U+2424 isn't an ID_Start character (IDS)
nor even an ID_Continue character. In fact, it's not a \w
but a \p{Symbol}, which is not legal in an identifier.

  % perl -lE 'say "\x{2424}" =~ /\p{IDS}/ || 0'
  0
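
The same kind of check can be run for several properties at once. A small sketch, assuming a perl recent enough to know these property names; `identifier_props` is a hypothetical helper, not part of any tool mentioned here.

```perl
use strict;
use warnings;

# Report, for one code point, the properties discussed above.
sub identifier_props {
    my ($cp) = @_;
    my $char = chr $cp;
    utf8::upgrade($char);   # make \w use Unicode rules below U+0100 too
    return {
        ID_Start    => ($char =~ /\A\p{ID_Start}\z/    ? 1 : 0),
        ID_Continue => ($char =~ /\A\p{ID_Continue}\z/ ? 1 : 0),
        Word        => ($char =~ /\A\w\z/              ? 1 : 0),
        Symbol      => ($char =~ /\A\p{Symbol}\z/      ? 1 : 0),
    };
}

# SYMBOL FOR NEWLINE, GREEK SMALL LETTER PI, POUND SIGN
for my $cp (0x2424, 0x03C0, 0x00A3) {
    my $p = identifier_props($cp);
    printf "U+%04X  %s\n", $cp,
        join ' ', map { sprintf('%s=%d', $_, $p->{$_}) } sort keys %$p;
}
```

U+2424 and U+00A3 come out as symbols with no identifier properties, while U+03C0 has ID_Start, ID_Continue and `\w` all set.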

At http://training.perl.com/scripts/uniprops, you can get a tool
that may help for this; it's now updated for 5.14.

  % uniprops -a 2424
  U+2424 ‹␤› \N{SYMBOL FOR NEWLINE}
  \pS \p{So}
  All Any Assigned InControlPictures Common Zyyy Control_Pictures So S Gr_Base Grapheme_Base Graph GrBase Other_Symbol
  Pat_Syn Pattern_Syntax PatSyn Print Symbol
  Age=1.1 Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=Control_Pictures Canonical_Combining_Class=0
  Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None
  DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX
  Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup
  Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
  Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0
  Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0
  Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 SC=Zyyy Script=Zyyy Sentence_Break=Other SB=XX Sentence_Break=XX
  Word_Break=Other WB=XX Word_Break=XX _X_Begin

The unicode character (which may not display correctly, in
whatever viewer you are using, so 'beware') is on lines 3
and 4. On line 3, it's right after the "$" and before the "<tab>=>".
On line 4, it's also after the "$" between "line1" and "line2".

A Hexdump of the above​:
00000000 23 21 2f 75 73 72 2f 62 69 6e 2f 70 65 72 6c 20 |#!/usr/bin/perl |
00000010 2d 77 0a 75 73 65 20 73 74 72 69 63 74 3b 0a 75 |-w.use strict;.u|
00000020 73 65 20 75 74 66 38 3b 0a 75 73 65 20 52 65 61 |se utf8;.use Rea|
00000030 64 6f 6e 6c 79 20 6d 79 20 24 e2 90 a4 09 3d 3e |donly my $....=>|
00000040 20 22 5c 6e 22 3b 0a 70 72 69 6e 74 20 22 6c 69 | "\n";.print "li|
00000050 6e 65 31 24 e2 90 a4 6c 69 6e 65 32 5c 6e 22 3b |ne1$...line2\n";|
00000060 0a 0a |..|
00000062

Shows U+2424 correctly encoded as "0xe290a4" at hexaddrs 03A
and 0x54.

Eek! Hexdumps! Non-logical characters! The horror!

At http://training.perl.com/scripts/uniquote, you can get a tool that
will help with this, with the first form being the best. Isn't that much
easier to read??

  % uniquote -v /tmp/lw
  #!/usr/bin/perl -w
  use strict;
  use utf8;
  use Readonly my $\N{SYMBOL FOR NEWLINE} => "\n";
  print "line1$\N{SYMBOL FOR NEWLINE}line2\n";

  % uniquote -x /tmp/lw
  #!/usr/bin/perl -w
  use strict;
  use utf8;
  use Readonly my $\x{2424} => "\n";
  print "line1$\x{2424}line2\n";

  % uniquote -b /tmp/lw
  #!/usr/bin/perl -w
  use strict;
  use utf8;
  use Readonly my $\xE2\x90\xA4 => "\n";
  print "line1$\xE2\x90\xA4line2\n";

However, when I try to run this, I get​:
Malformed UTF-8 character (unexpected end of string) at /tmp/ptest.pl line 4.
Unrecognized character \x90 in column 18 at /tmp/ptest.pl line 4.

The bug, and there is a bug, is that it should be reporting that
U+2424 is not a valid identifier character. It should not be
grinching about \x90. This is a known problem, although I don't
know its bugno.

Note, FWIW, I've successfully used other characters in the same
way in another program like this (a fragment from another prog)​:

----
use utf8;
binmode STDOUT, 'encoding(UTF-8)';
use Readonly; BEGIN{*RO=\&Readonly::Readonly}

my %constants = (
'Phi' => .5*(5.**.5-1),
'Φ' => .5*(5.**.5-1),
'phi' => .5*(1.+5.**.5),
'ɸ' => .5*(1.+5.**.5),
'pi' => 4*atan2(1,1),
'π' => 4*atan2(1,1),
);

sub init_constants (;$) {
my $no_banner=$_[0];
print "Constants: " unless $no_banner;
my $sep="";

foreach my $k (keys %constants){
my $v=$constants{$k};
print $sep,"\$",$k unless $no_banner;
$sep=", ";
RO $$k => $v;
}

print "\n" unless $no_banner;
}
&init_constants;

First of all, those are both identifier (IDS) characters​:

  % uniprops pi phi
  U+03C0 ‹π› \N{GREEK SMALL LETTER PI}
  \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
  All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek InGreek Cased Cased_Letter LC Changes_When_Casemapped CWCM
  Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase Grek Greek_And_Coptic ID_Continue
  IDC ID_Start IDS Letter L_ Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS
  U+0278 ‹ɸ› \N{LATIN SMALL LETTER PHI}
  \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
  All Any Alnum Alpha Alphabetic Assigned InIPA_Extensions Cased Cased_Letter LC Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue

Secondly, you're using $$k. That's a symbolic dereference.
That means you can get at the symbol table without the persnickety
lexer kvetching all over your lunch​:

  % perl -E '$a = `cat /bin/cat`; $$a = length($a); say $$a'
  43296

I use this all the time.

  my $file = "/tmp/foo";
  open($file, "<", $file);
  while (<$file>) {
      if ( ... ) {
          warn "crudola";
      }
  }

so that I get proper filenames in my warn/die messages.

That doesn't change that U+2424 isn't an identifier character.
You can still use it as a variable name, provided you use
symbolic dereferences to get at it​:

  % perl -E '$name = "\x{2424}"; $$name = `whoami`; print $$name'
  tchrist

  % perl -E '$name = "\x{2424}"; say $name'
  ␤
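
The one-liners above run without `strict`. Under `use strict`, the same trick needs a scoped `no strict 'refs'`. A minimal sketch; `set_by_name` and `get_by_name` are hypothetical helper names, and placing the variables in `main::` is a choice made for this example.

```perl
use strict;
use warnings;

# Hypothetical helpers: store and fetch a package variable in main::
# whose name is an arbitrary string, via symbolic references. The
# lexer never sees the name, so its identifier rules do not apply.
sub set_by_name {
    my ($name, $value) = @_;
    no strict 'refs';
    ${"main::$name"} = $value;
    return;
}

sub get_by_name {
    my ($name) = @_;
    no strict 'refs';
    return ${"main::$name"};
}

set_by_name("\x{2424}", "\n");
print "line1", get_by_name("\x{2424}"), "line2\n";
```

This prints `line1` and `line2` on separate lines, which is the effect the original `$␤` program was after.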

So I'm surprised at this specific failure, since in looking at the hex,
the character encoding appears correct.

Don't look at hex. Look at code points, with uniquote -x or -v.

Let me know if you have any questions.
(an excellent utf-8 character util, that, unfortunately, is windows only,
can be gotten from http://www.babelstone.co.uk/Software/BabelMap.html).

I have a lot of to-me-excellent Unicode tools in
http://training.perl.com/scripts/:

  leo nfd rename unichars uniquote
  macroman nfkc tcgrep uninames uwc
  nfc nfkd ucsort uniprops

No guarantees, though. :)

--tom

@p5pRT
Author

p5pRT commented Apr 18, 2011

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Author

p5pRT commented Apr 18, 2011

From @cpansprout

Tom Christiansen wrote​:

That isn't allowed. U+2424 isn't an ID_Start character (IDS)
nor even an ID_Continue character. In fact, it's not a \w
but a \p{Symbol}, which is not legal in an identifier.

But what about punctuation variables?

In a Latin-1 script, one can write $£.
It doesn’t work in a utf8 script.

In a UTF-8 terminal​:

$ perl -e 'use utf8; print q\$£\' | perl
$ perl -CO -e 'use utf8; print q\$£\' | perl -Mutf8
Malformed UTF-8 character (unexpected end of string) at - line 1.
Unrecognized character \xA3 in column 2 at - line 1.

So yes, this is a bug.
 

@p5pRT
Author

p5pRT commented Apr 18, 2011

From tchrist@perl.com

But what about punctuation variables?

Oh blech. Yes, I've known of this "hole".
I chose not to complain about it. :)

In a Latin-1 script, one can write $£.
It doesn’t work in a utf8 script.

In a UTF-8 terminal​:

Um, why should that matter?

$ perl -e 'use utf8; print q\$£\' | perl
$ perl -CO -e 'use utf8; print q\$£\' | perl -Mutf8
Malformed UTF-8 character (unexpected end of string) at - line 1.
Unrecognized character \xA3 in column 2 at - line 1.

So yes, this is a bug.

Are you saying that just because Perl allows one-character ASCII (and
Latin-1) punctuation characters, it should allow any single code point
variable no matter what it is?

Are you sure?

Or are you just saying that Latin-1 should be grandfathered, since
a

Anyway, that backslash as a delimiter for q// is simply wicked.
This is a much clearer demo​:

  Given that​:
 
  A DOLLAR SIGN is code point U+0024.
  A POUND SIGN is code point U+00A3.

  % perl -C0 -E 'say "\x{24}\x{A3} = 1"' | uniquote -b
  $\xA3 = 1

  % perl -C0 -E 'say "\x{24}\x{A3} = 1"' | perl -C0 -c
  - syntax OK

but

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | uniquote -v
  $\N{POUND SIGN} = 1

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | uniquote -x
  $\x{A3} = 1

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | uniquote -b
  $\xC2\xA3 = 1
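
The difference between the two byte forms can be reproduced without the command-line switches. A small sketch using the core Encode module; `quoted_bytes` is a hypothetical helper that only imitates the output format shown for `uniquote -b` and is not that tool.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode $text in the given encoding and render every byte outside
# printable ASCII as a \xHH escape.
sub quoted_bytes {
    my ($encoding, $text) = @_;
    my $bytes = encode($encoding, $text);
    $bytes =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord($1))/ge;
    return $bytes;
}

my $source = "\$\x{A3} = 1";                        # DOLLAR SIGN, POUND SIGN
print quoted_bytes('ISO-8859-1', $source), "\n";   # one byte after the "$"
print quoted_bytes('UTF-8',      $source), "\n";   # two bytes after the "$"
```

In Latin-1 the pound sign is the single byte `\xA3`, so a tokeniser that takes one byte after `$` gets the whole character. In UTF-8 it is `\xC2\xA3`, and taking one byte splits it.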

So then​:

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | perl -C0 -Mutf8 | & uniquote -b
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  Unrecognized character \xA3; marked by <-- HERE after $\xC2<-- HERE near column 2 at - line 1.
  Exit 255

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | perl -CS -Mutf8 | & uniquote -b
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  Unrecognized character \xA3; marked by <-- HERE after $\xC2<-- HERE near column 2 at - line 1.
  Exit 255

Notice that we are generating illegal UTF-8. That's wrong. But at
least that bug is fixed in blead, kinda​:

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | perl -CS -Mutf8 | & uniquote -x
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  /home/tchrist/scripts/uniquote​: utf8 "\xC2" does not map to Unicode at standard input line 2
  Exit 1

That was bad.

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -CS -Mutf8 | & uniquote -x
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  Unrecognized character \xA3; marked by <-- HERE after $\x{C2}<-- HERE near column 2 at - line 1.
  Exit 255

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -C0 -Mutf8 | & uniquote -b
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  Unrecognized character \xA3; marked by <-- HERE after $\xC2<-- HERE near column 2 at - line 1.
  Exit 255

But it's fixed so as not to generate illegal UTF-8 anymore when
the std streams are in that encoding​:

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -CS -Mutf8 | & uniquote -b
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  Unrecognized character \xA3; marked by <-- HERE after $\xC3\x82<-- HERE near column 2 at - line 1.
  Exit 255

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -CS -Mutf8 | & uniquote -x
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  Unrecognized character \xA3; marked by <-- HERE after $\x{C2}<-- HERE near column 2 at - line 1.
  Exit 255

  % perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -CS -Mutf8 | & uniquote -v
  Malformed UTF-8 character (unexpected end of string) at - line 1.
  Unrecognized character \xA3; marked by <-- HERE after $\N{LATIN CAPITAL LETTER A WITH CIRCUMFLEX}<-- HERE near column 2 at - line 1.
  Exit 255

So it is a *slight* improvement, eh? :)

--tom

@p5pRT
Author

p5pRT commented Apr 18, 2011

From @cpansprout

On Apr 17, 2011, at 6​:41 PM, Tom Christiansen wrote​:

But what about punctuation variables?

Oh blech. Yes, I've known of this "hole".
I chose not to complain about it. :)

In a Latin-1 script, one can write $£.
It doesn’t work in a utf8 script.

In a UTF-8 terminal​:

Um, why should that matter?

Just so you know what I’m feeding to perl.

$ perl -e 'use utf8; print q\$£\' | perl
$ perl -CO -e 'use utf8; print q\$£\' | perl -Mutf8
Malformed UTF-8 character (unexpected end of string) at - line 1.
Unrecognized character \xA3 in column 2 at - line 1.

So yes, this is a bug.

Are you saying that just because Perl allows one-character ASCII (and
Latin-1) punctuation characters, it should allow any single code point
variable no matter what it is?

Yes.

Are you sure?

No, but it makes sense to me that way. Non-\w vars should also be forced into main. Er, maybe this is not such a good idea, because of the whole IDS vs XIDS vs alphanumeric mess. :-)

Or are you just saying that Latin-1 should be grandfathered, since
a

Maybe they should, but I don’t

Anyway, that backslash as a delimiter for q// is simply wicked.

:-)

It stands out, doesn’t it?

This is a much clearer demo​:

Given that​:

   A DOLLAR SIGN is code point U+0024.
   A POUND SIGN is code point U+00A3.

% perl -C0 -E 'say "\x{24}\x{A3} = 1"' | uniquote -b
$\xA3 = 1

% perl -C0 -E 'say "\x{24}\x{A3} = 1"' | perl -C0 -c
- syntax OK

but

% perl -CS -E 'say "\x{24}\x{A3} = 1"' | uniquote -v
$\N{POUND SIGN} = 1

% perl -CS -E 'say "\x{24}\x{A3} = 1"' | uniquote -x
$\x{A3} = 1

% perl -CS -E 'say "\x{24}\x{A3} = 1"' | uniquote -b
$\xC2\xA3 = 1

So then​:

% perl -CS -E 'say "\x{24}\x{A3} = 1"' | perl -C0 -Mutf8 | & uniquote -b
Malformed UTF-8 character (unexpected end of string) at - line 1.
Unrecognized character \xA3; marked by <-- HERE after $\xC2<-- HERE near column 2 at - line 1.
Exit 255

% perl -CS -E 'say "\x{24}\x{A3} = 1"' | perl -CS -Mutf8 | & uniquote -b
Malformed UTF-8 character (unexpected end of string) at - line 1.
Unrecognized character \xA3; marked by <-- HERE after $\xC2<-- HERE near column 2 at - line 1.
Exit 255

Notice that we are generating illegal UTF-8. That's wrong. But at
least that bug is fixed in blead, kinda​:

% perl -CS -E 'say "\x{24}\x{A3} = 1"' | perl -CS -Mutf8 | & uniquote -x
Malformed UTF-8 character (unexpected end of string) at - line 1.
/home/tchrist/scripts/uniquote​: utf8 "\xC2" does not map to Unicode at standard input line 2
Exit 1

That was bad.

% perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -CS -Mutf8 | & uniquote -x
Malformed UTF-8 character (unexpected end of string) at - line 1.
Unrecognized character \xA3; marked by <-- HERE after $\x{C2}<-- HERE near column 2 at - line 1.
Exit 255

% perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -C0 -Mutf8 | & uniquote -b
Malformed UTF-8 character (unexpected end of string) at - line 1.
Unrecognized character \xA3; marked by <-- HERE after $\xC2<-- HERE near column 2 at - line 1.
Exit 255

But it's fixed so as not to generate illegal UTF-8 anymore when
the std streams are in that encoding​:

% perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -CS -Mutf8 | & uniquote -b
Malformed UTF-8 character (unexpected end of string) at - line 1.
Unrecognized character \xA3; marked by <-- HERE after $\xC3\x82<-- HERE near column 2 at - line 1.
Exit 255

% perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -CS -Mutf8 | & uniquote -x
Malformed UTF-8 character (unexpected end of string) at - line 1.
Unrecognized character \xA3; marked by <-- HERE after $\x{C2}<-- HERE near column 2 at - line 1.
Exit 255

% perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -CS -Mutf8 | & uniquote -v
Malformed UTF-8 character (unexpected end of string) at - line 1.
Unrecognized character \xA3; marked by <-- HERE after $\N{LATIN CAPITAL LETTER A WITH CIRCUMFLEX}<-- HERE near column 2 at - line 1.
Exit 255

So it is a *slight* improvement, eh? :)

--tom

@p5pRT
Author

p5pRT commented Apr 18, 2011

From lawalsh@tlinx.org

tchrist1 via RT wrote​:

Linda Walsh (via RT) <perlbug-followup@​perl.org> wrote
on Sun, 17 Apr 2011 01​:22​:25 PDT​:

I was trying to use the utf-8 character U+2424, called
'SYMBOL FOR NEWLINE' as an identifier containing "\n" as
follows​:

0 #!/usr/bin/perl -w
1 use strict;
2 use utf8;
3 use Readonly my $␤ => "\n";
4 print "line1$␤line2\n";

That isn't allowed. U+2424 isn't an ID_Start character (IDS)
nor even an ID_Continue character. In fact, it's not a \w
but a \p{Symbol}, which is not legal in an identifier.


  I already thought of that and rejected it as irrelevant.

  Your reasoning doesn't jive with the error message.

  If it wasn't allowed as an identifier, it would say 'invalid identifier'. That's not what this is. It's a UTF-8 parsing error: "Malformed UTF-8 character". It's not
a malformed UTF-8 character. That's what the bug is reporting.

  As for whether or not a symbol can be in a variable name, perlvar
lists a whole bunch of variable names that are $<symbol>, like:
$' $+ $. $/ $| $\ (do I have to put in the whole list?). They are *syntactically* valid
as variable names. From a practical standpoint, other than a few (one?) like $£, there aren't any Latin-1 symbols that aren't reserved. But as Father Chrysostomos mentions, $£ is a valid variable in Latin-1 but fails in UTF-8: not because it is an invalid variable
name, but because the parser thinks it is invalid UTF-8, which it is not.

  BTW, later, you write​:

Secondly, you're using $$k. That's a symbolic dereference.
That means you can get at the symbol table without the persnickety
lexer kvetching all over your lunch....


  That was just the way that program was written. It's not essential for
any of those characters. (You might try testing actual behavior before commenting
about why it is allowed (or wouldn't be) in an actual variable name...)


perl -e '
use utf8;
my $π="pi as id";
print $π . "\n";
'
pi as id


  The lexer is happy with '$π' and the other ones (phi and capital phi).
Using '$$' was just a clean way for me to pre-install them as symbols
usable in a calculator, i.e. I can type $pi or $π and it will give value for pi.

  Actually, I want to type just 'π', but that's currently broken due to a
bug in 'use constant'...(not my day for UTF-8)...

I have a lot of to-me-excellent Unicode tools in ...


  I'm sure!

  :-)

@p5pRT

p5pRT commented Apr 18, 2011

From tchrist@perl.com

Secondly, you're using $$k. That's a symbolic dereference.
That means you can get at the symbol table without the persnickety
lexer kvetching all over your lunch....

That was just the way that program was written. It's not essential for
any of those characters. (You might try testing actual behavior before commenting
about why it is allowed (or wouldn't be) in an actual variable name...)

I beg your pardon, but I most certainly have "tested actual behavior".
Whyever would you think I hadn't? I happen to use UTF-8 identifiers all
the time. I also knew about the issue with non-IDS/IDC chars not giving
good error messages.

perl -e '
use utf8;
my $π="pi as id";
print $π . "\n";
'
pi as id

The lexer is happy with '$π' and the other ones (phi and capital phi).

--tom

@p5pRT

p5pRT commented Apr 22, 2011

From perl-diddler@tlinx.org

Father Chrysostomos via RT wrote​:

So yes, this is a bug.


  Is this a separate bug or another instance of the same bug?

I tried​:


#!/usr/bin/perl -w
use strict;
use Readonly; sub RO(\[$@%]@) {goto &Readonly};
use utf8;

RO my $Tclear﹠home => `tput clear`; # U+FE60 Small Ampersand
RO my $Tclear＆home2 => `tput clear`; # U+FF06 FullWidth Ampersand


Neither work, and both fail with​:

Unrecognized character \xEF in column 14 at ./amp.pl line [6|7].

@p5pRT

p5pRT commented Apr 22, 2011

From tchrist@perl.com

Linda Walsh <perl-diddler@​tlinx.org> wrote
  on Fri, 22 Apr 2011 08​:54​:52 PDT​:

RO my $Tclear﹠home => `tput clear`; # U+FE60 Small Ampersand
RO my $Tclear＆home2 => `tput clear`; # U+FF06 FullWidth Ampersand

-----
Neither work, and both fail with​:

Unrecognized character \xEF in column 14 at ./amp.pl line [6|7].

The bug is not that they fail; they *should* fail. The only bug is that
the error message is in bytes not characters.

via `uniquote`​:

RO my $Tclear\N{U+FE60}home => `tput clear`; # U+FE60 Small Ampersand
RO my $Tclear\N{U+FF06}home2 => `tput clear`; # U+FF06 FullWidth Ampersand

or via `uniquote -v`​:

RO my $Tclear\N{SMALL AMPERSAND}home => `tput clear`; # U+FE60 Small Ampersand
RO my $Tclear\N{FULLWIDTH AMPERSAND}home2 => `tput clear`; # U+FF06 FullWidth Amp

  % uniprops fe60 ff06
  U+FE60 ‹﹠› \N{SMALL AMPERSAND}
  \pP \p{Po}
  All Any Assigned InSmallFormVariants Changes_When_NFKC_Casefolded CWKCF Common Zyyy Po
  P Gr_Base Grapheme_Base Graph GrBase Other_Punctuation Punct Print Punctuation
  Small_Form_Variants
  U+FF06 ‹＆› \N{FULLWIDTH AMPERSAND}
  \pP \p{Po}
  All Any Assigned InHalfwidthAndFullwidthForms Changes_When_NFKC_Casefolded CWKCF
  Common Zyyy Po P Gr_Base Grapheme_Base Graph GrBase Halfwidth_And_Fullwidth_Forms
  Other_Punctuation Punct Print Punctuation

As you see, those are not IDC code points, so do not belong in an
identifier. This is made clear here​:

  % perl -E 'say "\x{fe60}" =~ /\p{IDC}/ || 0'
  0
  % perl -E 'say "\x{ff06}" =~ /\p{IDC}/ || 0'
  0

So what bug are you thinking this is?

It is not a bug that those are illegal characters. They are.

It is *only* a bug saying that the character is \xEF
instead of saying it is \x{FE60} or \x{FF06}.

--tom
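
To see where the `\xEF` in that message comes from, one can print the UTF-8 encoding of the two code points. The following is a minimal sketch using the core `Encode` module; `utf8_bytes_hex` is a hypothetical helper name introduced only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the UTF-8 bytes of a code point as space-separated hex pairs.
# utf8_bytes_hex() is a hypothetical helper for illustration only.
sub utf8_bytes_hex {
    my ($cp) = @_;
    my $bytes = encode('UTF-8', chr $cp);
    return join ' ', map { sprintf('%02X', $_) } unpack('C*', $bytes);
}

# U+FE60 SMALL AMPERSAND and U+FF06 FULLWIDTH AMPERSAND
for my $cp (0xFE60, 0xFF06) {
    printf "U+%04X => %s\n", $cp, utf8_bytes_hex($cp);
}
# prints:
# U+FE60 => EF B9 A0
# U+FF06 => EF BC 86
```

Both characters encode to three bytes beginning with EF, so a message that names only the first byte cannot tell them apart, whereas `\x{FE60}` or `\x{FF06}` would.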

@p5pRT

p5pRT commented Apr 22, 2011

From perl-diddler@tlinx.org

tchrist1 via RT wrote​:

Linda Walsh <perl-diddler@​tlinx.org> wrote
on Fri, 22 Apr 2011 08​:54​:52 PDT​:

RO my $Tclear﹠home => `tput clear`; # U+FE60 Small Ampersand
RO my $Tclear＆home2 => `tput clear`; # U+FF06 FullWidth Ampersand

-----
Neither work, and both fail with​:

Unrecognized character \xEF in column 14 at ./amp.pl line [6|7].

The bug is not that they fail; they *should* fail. The only bug is that
the error message is in bytes not characters.

I really think you are getting hung up on the props for the characters.

I don't see them as being useful to enforce in the context we are using them.

When someone goes and looks at unicode characters, all the 'props'
are NOT listed. They are detailed arcana that will confuse most users.

I know you may not like that answer, since it seems to be something that you
think is really important, but I don't think the vast majority of users will
see it that way -- they will just wonder why a perfectly valid value doesn't
work.

Example -- I want to use ":" in song titles -- so I use the
FULL Width "：" -- imagine if MS enforced your props and threw it out for
no good reason. ":" is banned because it is used/needed.

The other symbols I am mentioning are ones that I'm using in place of
ones that perl has claimed for its own operator set. It's bad precedent
and **non-perl**, to reserve a bunch of things needlessly.

The basic design philosophy of perl is "Do what I mean"(perlsyn), not
"do the right thing". To adhere to doing the 'right thing' over doing
the 'useful thing' that would be what 99% of the users would expect and
want it to do would be a harmful design decision with no apparent benefit.

@p5pRT

p5pRT commented Oct 6, 2011

From @cpansprout

On Sun Sep 06 11​:35​:30 2009, moritz wrote​:

This is a bug report for perl from moritz@​faui2k3.org,
generated with the help of perlbug 1.36 running under perl 5.10.0.

-----------------------------------------------------------------
[Please enter your report here]

As pointed out on <http​://www.perlmonks.org/?node_id=793800>, a
program
which
tries to use a non-ASCII non-alphanumeric character in a variable name
throws
an error "Malformed UTF-8 character (unexpected end of string) at",
even
though the file is in perfectly fine UTF-8.

Example​:
$ cat foo.pl
use utf8;
my $» = 1;
$ perl foo.pl
Malformed UTF-8 character (unexpected end of string) at foo.pl line 2.
Unrecognized character \xBB in column 5 at foo.pl line 2.

Unicode punctuation variables work now, as of dfb1828 and the
preceding commits.

But another issue that came up later in the ticket, that $♠♠ produces
‘Unrecognized character \xE2’ instead of mentioning the Unicode code
point, is still not fixed.

@p5pRT

p5pRT commented Oct 7, 2011

From @Hugmeir

On Thu, Oct 6, 2011 at 6​:50 PM, Father Chrysostomos via RT <
perlbug-followup@​perl.org> wrote​:

On Sun Sep 06 11​:35​:30 2009, moritz wrote​:

This is a bug report for perl from moritz@​faui2k3.org,
generated with the help of perlbug 1.36 running under perl 5.10.0.

-----------------------------------------------------------------
[Please enter your report here]

As pointed out on <http​://www.perlmonks.org/?node_id=793800>, a
program
which
tries to use a non-ASCII non-alphanumeric character in a variable name
throws
an error "Malformed UTF-8 character (unexpected end of string) at",
even
though the file is in perfectly fine UTF-8.

Example​:
$ cat foo.pl
use utf8;
my $» = 1;
$ perl foo.pl
Malformed UTF-8 character (unexpected end of string) at foo.pl line 2.
Unrecognized character \xBB in column 5 at foo.pl line 2.

Unicode punctuation variables work now, as of dfb1828 and the
preceding commits.

Basically, as of right now in blead, variables of length one match (?&sigil)
\p{Any} (?=\z) instead of (?&sigil) \C (?=\z). Do we really want \p{Any},
which allows a whole bunch of problematic characters? If not, what do we
restrict it to?

But another issue that came up later in the ticket, that $♠♠ produces
‘Unrecognized character \xE2’ instead of mentioning the Unicode code
point, is still not fixed.

This is fixed in the other gsoc branch thingy, so maybe in a couple of
whiles it will. Hopefully!

Incidentally, Father C, mad props for cleaning up the gv/stash stuff!

@p5pRT

p5pRT commented Oct 7, 2011

From @cpansprout

On Thu Oct 06 19​:52​:11 2011, Hugmeir wrote​:

On Thu, Oct 6, 2011 at 6​:50 PM, Father Chrysostomos via RT <
perlbug-followup@​perl.org> wrote​:

On Sun Sep 06 11​:35​:30 2009, moritz wrote​:

This is a bug report for perl from moritz@​faui2k3.org,
generated with the help of perlbug 1.36 running under perl 5.10.0.

-----------------------------------------------------------------
[Please enter your report here]

As pointed out on <http​://www.perlmonks.org/?node_id=793800>, a
program
which
tries to use a non-ASCII non-alphanumeric character in a variable name
throws
an error "Malformed UTF-8 character (unexpected end of string) at",
even
though the file is in perfectly fine UTF-8.

Example​:
$ cat foo.pl
use utf8;
my $» = 1;
$ perl foo.pl
Malformed UTF-8 character (unexpected end of string) at foo.pl line 2.
Unrecognized character \xBB in column 5 at foo.pl line 2.

Unicode punctuation variables work now, as of dfb1828 and the
preceding commits.

Basically, as of right now in blead, variables of length one match
(?&sigil)
\p{Any} (?=\z) instead of (?&sigil) \C (?=\z). Do we really want \p{Any},
which allows a whole bunch of problematic characters? If not, what do we
restrict it to?

\S or whatever Unicode equivalent Tom Christiansen says is more appropriate.
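
As a rough illustration of what that choice means, the sketch below compares `\p{Any}` with `\P{White_Space}` for three sample code points. Both the helper name `accepted_by` and the use of `\P{White_Space}` as the Unicode stand-in for `\S` are assumptions made for this example; this is not what the lexer itself does. The property form is used because a bare `\S` applied to `chr(0xA0)` can fall back to native 8-bit rules.

```perl
use strict;
use warnings;

# Report which candidate classes accept a single code point.
# accepted_by() and the candidate list are hypothetical, for illustration.
sub accepted_by {
    my ($cp) = @_;
    my $ch = chr $cp;
    my @hits;
    push @hits, 'Any'            if $ch =~ /\A\p{Any}\z/;
    push @hits, 'non-whitespace' if $ch =~ /\A\P{White_Space}\z/;
    return join ',', @hits;
}

# U+00A3 POUND SIGN, U+2660 BLACK SPADE SUIT, U+00A0 NO-BREAK SPACE
for my $cp (0x00A3, 0x2660, 0x00A0) {
    printf "U+%04X => %s\n", $cp, accepted_by($cp);
}
```

Under these assumptions, a whitespace-excluding class would reject the non-breaking space but still accept ♠, so it addresses only part of the concern about `\p{Any}`.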

I probably pushed the changes too soon, but I didn’t discover this till
afterwards.

Also, my $♠ is now permitted, which is a bug.

And $  (that’s a non-breaking space, but Firefox is untrustworthy), too.

But another issue that came up later in the ticket, that $♠♠ produces
‘Unrecognized character \xE2’ instead of mentioning the Unicode code
point, is still not fixed.

This is fixed in the other gsoc branch thingy, so maybe in a couple of
whiles it will. Hopefully!

Incidentally, Father C, mad props for cleaning up the gv/stash stuff!

I still need to write a summary explaining why some parts were modified
or omitted.

@p5pRT

p5pRT commented Oct 7, 2011

From @cpansprout

On Thu Oct 06 19​:52​:11 2011, Hugmeir wrote​:

On Thu, Oct 6, 2011 at 6​:50 PM, Father Chrysostomos via RT <
perlbug-followup@​perl.org> wrote​:

On Sun Sep 06 11​:35​:30 2009, moritz wrote​:

This is a bug report for perl from moritz@​faui2k3.org,
generated with the help of perlbug 1.36 running under perl 5.10.0.

-----------------------------------------------------------------
[Please enter your report here]

As pointed out on <http​://www.perlmonks.org/?node_id=793800>, a
program
which
tries to use a non-ASCII non-alphanumeric character in a variable name
throws
an error "Malformed UTF-8 character (unexpected end of string) at",
even
though the file is in perfectly fine UTF-8.

Example​:
$ cat foo.pl
use utf8;
my $» = 1;
$ perl foo.pl
Malformed UTF-8 character (unexpected end of string) at foo.pl line 2.
Unrecognized character \xBB in column 5 at foo.pl line 2.

Unicode punctuation variables work now, as of dfb1828 and the
preceding commits.

Basically, as of right now in blead, variables of length one match
(?&sigil)
\p{Any} (?=\z) instead of (?&sigil) \C (?=\z). Do we really want \p{Any},
which allows a whole bunch of problematic characters? If not, what do we
restrict it to?

But another issue that came up later in the ticket, that $♠♠ produces
‘Unrecognized character \xE2’ instead of mentioning the Unicode code
point, is still not fixed.

This is fixed in the other gsoc branch thingy, so maybe in a couple of
whiles it will. Hopefully!

OK, where do I start? (I actually want to finish reimplementing $[
first, so it may be a while.)

Incidentally, Father C, mad props for cleaning up the gv/stash stuff!

@p5pRT

p5pRT commented Mar 29, 2012

From @cpansprout

On Thu Oct 06 20​:39​:42 2011, sprout wrote​:

Also, my $♠ is now permitted, which is a bug.

I’ve made a separate ticket for that, #111980.

And $  (that’s a non-breaking space, but Firefox is untrustworthy), too.

When we deal with Unicode brackets, we can deal with Unicode whitespace,
too. See my note at
<https://rt-archive.perl.org/perl5/Ticket/Display.html?id=89032#txn-1097256>.

But another issue that came up later in the ticket, that $♠♠ produces
‘Unrecognized character \xE2’ instead of mentioning the Unicode code
point, is still not fixed.

This is fixed in the other gsoc branch thingy, so maybe in a couple of
whiles it will. Hopefully!

It was integrated recently. See
<https://rt-archive.perl.org/perl5/Ticket/Display.html?id=107008#txn-1099872>.
I’m not sure which patch did it, but this bug is now fixed.

--

Father Chrysostomos

@p5pRT

p5pRT commented Mar 29, 2012

@cpansprout - Status changed from 'open' to 'resolved'

@p5pRT p5pRT closed this as completed Mar 29, 2012
@p5pRT

p5pRT commented Mar 29, 2012

From @nwc10

On Thu Mar 29 00​:14​:09 2012, sprout wrote​:

On Thu Oct 06 20​:39​:42 2011, sprout wrote​:

But another issue that came up later in the ticket, that $♠♠
produces
‘Unrecognized character \xE2’ instead of mentioning the Unicode code
point, is still not fixed.

This is fixed in the other gsoc branch thingy, so maybe in a couple of
whiles it will. Hopefully!

It was integrated recently. See
<https://rt-archive.perl.org/perl5/Ticket/Display.html?id=107008#txn-1099872>.
I’m not sure which patch did it, but this bug is now fixed.

While I was doing something else, I set off a bisect run.
The answer is​:

HEAD is now at 734ab32 toke.c​: S_no_op cleanup
good - non-zero exit from ./perl -Ilib -e eval "my \$\x{2660}\x{2660}";
die $@ unless $@ =~ /Unrecognized character \\x\{2660\}/
e2f06df is the first bad commit
commit e2f06df
Author​: Brian Fraser <fraserbn@​gmail.com>
Date​: Sat Aug 6 07​:55​:06 2011 +0100

  toke.c​: 'Unrecognized character' croak cleanup.

:040000 040000 cab624cfbcf5d9693603b516d54d74126e2db1e6
2ac92c7f6ba76f070c2a2f2680c9a3fa909a2104 M t
:100644 100644 3a3cddb7606c1ccee8fe60d376dedec91459d7c2
c0a5cdaf09292fd3ed2e484b9526e7f087371080 M toke.c
bisect run success
That took 1528 seconds

Nicholas Clark
