Bogus error message "Malformed UTF-8 character" when using a non-word Unicode character #9862
Comments
From @moritz
Created by moritz@faui2k3.org
As pointed out on <http://www.perlmonks.org/?node_id=793800>, a program Example: A better error message might be "Character '%s' not allowed in Perl Info
|
From john.imrie@vodafoneemail.co.uk
Moritz Lenz (via RT) wrote:
Humm. Do we actually allow *any* unicode codepoint in an identifier or ______________________________________________ |
The RT System itself - Status changed from 'new' to 'open' |
From @ikegami
On Sun, Sep 6, 2009 at 2:35 PM, Moritz Lenz <perlbug-followup@perl.org> wrote:
There are two problems. If the character that follows "$" is not normally allowed as part of an The second problem is the poor error message "Unrecognized character %s". |
From @ikegami
On Mon, Sep 7, 2009 at 4:53 PM, Eric Brine <ikegami@adaelis.com> wrote:
Moritz suggests "Character \x2660 not allowed in identifier in column 16 at
The identified character is wrong (E2 instead of 2660). Otherwise, this is
If consistent with "use utf8" absent, it would be "Can't use global $♠ in
If consistent with "use utf8" absent, this would not be an error at all.
If consistent with "use utf8" absent, it would be "Bareword found where - Eric "ikegami" Brine |
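A minimal sketch of why the message names E2 rather than 2660: when the source is treated as bytes, the first thing seen after the sigil is the lead byte of the character's UTF-8 encoding, not the code point. The helper name `utf8_bytes_hex` is an illustration only and is not part of perl or of this ticket.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 bytes of one code point
# as space-separated uppercase hex pairs.
sub utf8_bytes_hex {
    my ($code_point) = @_;
    my $bytes = encode('UTF-8', chr $code_point);
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# U+2660 (spade), U+2424 (symbol for newline), U+FE60 (small ampersand)
printf "U+%04X => %s\n", $_, utf8_bytes_hex($_) for 0x2660, 0x2424, 0xFE60;
```

The lead bytes printed here (E2 for U+2660 and U+2424, EF for U+FE60) are the values that show up in the "Unrecognized character" messages quoted in this thread.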
From perl-diddler@tlinx.org
Created by perl-diddler@tlinx.org
I was trying to use the utf-8 character U+2424, called 0 #!/usr/bin/perl -w The unicode character (which may not display correctly, in A Hexdump of the above: Shows U+2424 correctly encoded as "0xe290a4" at hexaddrs 03A However, when I try to run this, I get: Note, FWIW, I've successfully used other characters in the same ---- my %constants = ( sub init_constants (;$) { foreach my $k (keys %constants){ print "\n" unless $no_banner; So I'm surprised at this specific failure, since in looking at the hex, Let me know if you have any questions. Perl Info
|
From tchrist@perl.com
Linda Walsh (via RT) <perlbug-followup@perl.org> wrote
That isn't allowed. U+2424 isn't an ID_Start character (IDS) % perl -lE 'say "\x{2424}" =~ /\p{IDS}/ || 0' At http://training.perl.com/scripts/uniprops, you can get a tool % uniprops -a 2424
Eek! Hexdumps! Non-logical characters! The horror! At http://training.perl.com/scripts/uniquote, you can get a tool that % uniquote -v /tmp/lw % uniquote -x /tmp/lw % uniquote -b /tmp/lw
The bug, and there is a bug, is that it should be reporting that
First of all, those are both identifier (IDS) characters: % uniprops pi phi Secondly, you're using $$k. That's a symbolic dereference. % perl -E '$a = `cat /bin/cat`; $$a = length( I use this all the time. my $file = "/tmp/foo"; so that I get proper filenames in my warn/die messages. That doesn't change that U+2424 isn't an identifier character. % perl -E '$name = "\x{2424}"; $$name = `whoami`; print $$name' % perl -E '$name = "\x{2424}"; say $name'
Don't look at hex. Look at code points, with uniquote -x or -v.
I have a lot of to-me-excellent Unicode tools in leo nfd rename unichars uniquote No guarantees, though. :) --tom |
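The distinction drawn above between a symbolic dereference and an identifier written literally in source can be sketched as follows. A symbolic reference looks the variable up by string at run time, so the lexer's identifier rules never apply to the name. The helper `roundtrip_by_name` is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: store a value in a package scalar chosen by name
# at run time, then read it back through the same symbolic reference.
sub roundtrip_by_name {
    my ($name, $value) = @_;
    no strict 'refs';            # symbolic references are refused under strict refs
    ${"main::$name"} = $value;   # the name is plain string data, never lexed
    return ${"main::$name"};
}

# U+2424 is not an identifier character, yet it works as a symbol-table key.
print roundtrip_by_name("\x{2424}", "stored"), "\n";
```

This only shows that the symbol table accepts the name; it says nothing about whether `$` followed by U+2424 should be accepted when written directly in source, which is the question the ticket is about.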
The RT System itself - Status changed from 'new' to 'open' |
From @cpansprout
Tom Christiansen wrote:
But what about punctuation variables? In a Latin-1 script, one can write $£. In a UTF-8 terminal: $ perl -e 'use utf8; print q\$£\' | perl So yes, this is a bug. |
From tchrist@perl.com
Oh blech. Yes, I've known of this "hole".
Um, why should that matter?
Are you saying that just because Perl allows one-character ASCII (and Are you sure? Or are you just saying that Latin-1 should be grandfathered, since Anyway, that backslash as a delimiter for q// is simply wicked. Given that: % perl -C0 -E 'say "\x{24}\x{A3} = 1"' | uniquote -b % perl -C0 -E 'say "\x{24}\x{A3} = 1"' | perl -C0 -c but % perl -CS -E 'say "\x{24}\x{A3} = 1"' | uniquote -v % perl -CS -E 'say "\x{24}\x{A3} = 1"' | uniquote -x % perl -CS -E 'say "\x{24}\x{A3} = 1"' | uniquote -b So then: % perl -CS -E 'say "\x{24}\x{A3} = 1"' | perl -C0 -Mutf8 | & uniquote -b % perl -CS -E 'say "\x{24}\x{A3} = 1"' | perl -CS -Mutf8 | & uniquote -b Notice that we are generating illegal UTF-8. That's wrong. But at % perl -CS -E 'say "\x{24}\x{A3} = 1"' | perl -CS -Mutf8 | & uniquote -x That was bad. % perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -CS -Mutf8 | & uniquote -x % perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -C0 -Mutf8 | & uniquote -b But it's fixed so as not to generate illegal UTF-8 anymore when % perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -CS -Mutf8 | & uniquote -b % perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -CS -Mutf8 | & uniquote -x % perl -CS -E 'say "\x{24}\x{A3} = 1"' | blead -CS -Mutf8 | & uniquote -v So it is a *slight* improvement, eh? :) --tom |
From @cpansprout
On Apr 17, 2011, at 6:41 PM, Tom Christiansen wrote:
Just so you know what I’m feeding to perl.
Yes.
No, but it makes sense to me that way. Non-\w vars should also be forced into main. Er, maybe this is not such a good idea, because of the whole IDS vs XIDS vs alphanumeric mess. :-)
Maybe they should, but I don’t
:-) It stands out, doesn’t it?
|
From lawalsh@tlinx.org
tchrist1 via RT wrote:
I already thought of that and rejected it as irrelevant. Your reasoning doesn't jibe with the error message. If it wasn't allowed as an identifier it would say 'invalid identifier'. That's not what this is. It's a UTF-8 parsing error "Malformed UTF-8 character". It's not As for whether or not a symbol can be in a variable name, the variables, perlvar BTW, later, you write:
That was just the way that program was written. It's not essential for perl -e ' The lexer is happy with '$π' and the other ones (phi and capital phi). Actually, I want to type just 'π', but that's currently broken due to a
I'm sure! :-) |
From tchrist@perl.com
I beg your pardon, but I most certainly have "tested actual behavior".
--tom |
From perl-diddler@tlinx.org
Father Chrysostomos via RT wrote:
Is this a separate bug or another instance of the same bug? I tried: #!/usr/bin/perl -w RO my $Tclear﹠home => `tput clear`; # U+FE60 Small Ampersand Neither work, and both fail with: Unrecognized character \xEF in column 14 at ./amp.pl line [6|7]. |
From tchrist@perl.com
Linda Walsh <perl-diddler@tlinx.org> wrote
The bug is not that they fail; they *should* fail. The only bug is that via `uniquote`:
or via `uniquote -v`:
% uniprops fe60 ff06 As you see, those are not IDC code points, so do not belong in an % perl -E 'say "\x{fe60}" =~ /\p{IDC}/ || 0' So what bug are you thinking this is? It is not a bug that those are illegal characters. They are. It is *only* a bug saying that the character is \xEF --tom |
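The property checks referred to in this comment can be reproduced without the `uniprops` tool. The following is a sketch using Perl's built-in `\p{ID_Start}` and `\p{ID_Continue}` properties (the long names for IDS and IDC); the helper `id_props` is a made-up name for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a code point carries the Unicode
# ID_Start and ID_Continue properties.
sub id_props {
    my ($code_point) = @_;
    my $char = chr $code_point;
    return {
        id_start    => ($char =~ /\p{ID_Start}/    ? 1 : 0),
        id_continue => ($char =~ /\p{ID_Continue}/ ? 1 : 0),
    };
}

# U+03C0 (pi), U+2424, U+FE60 and U+FF06 (the ampersand variants)
for my $code_point (0x03C0, 0x2424, 0xFE60, 0xFF06) {
    my $props = id_props($code_point);
    printf "U+%04X ID_Start=%d ID_Continue=%d\n",
        $code_point, $props->{id_start}, $props->{id_continue};
}
```

Pi reports 1 for both properties, while U+2424 and the two ampersand variants report 0 for both, which is consistent with the argument that rejecting them is correct and only the wording of the error message is at fault.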
From perl-diddler@tlinx.org
tchrist1 via RT wrote:
I really think you are getting hung up on the props for the characters. I don't see them as being useful to enforce in the context we are using them. When someone goes and looks at unicode characters, all the 'props' I know you may not like that answer, since it seems to be something that you Example -- I want to use ":" in song titles -- so I use the The other symbols I am mentioning are ones that I'm using in place of The basic design philosophy of perl is "Do what I mean"(perlsyn), not |
From @cpansprout
On Sun Sep 06 11:35:30 2009, moritz wrote:
Unicode punctuation variables work now, as of dfb1828 and the But another issue that came up later in the ticket, that $♠♠ produces |
From @Hugmeir
On Thu, Oct 6, 2011 at 6:50 PM, Father Chrysostomos via RT <
Basically, as of right now in blead, variables of length one match (?&sigil)
This is fixed in the other gsoc branch thingy, so maybe in a couple of Incidentally, Father C, mad props for cleaning up the gv/stash stuff! |
From @cpansprout
On Thu Oct 06 19:52:11 2011, Hugmeir wrote:
\S or whatever Unicode equivalent Tom Christiansen says is more appropriate. I probably pushed the changes too soon, but I didn’t discover this till Also, my $♠ is now permitted, which is a bug. And $ (that’s a non-breaking space, but Firefox is untrustworthy), too.
I still need to write a summary explaining why some parts were modified |
From @cpansprout
On Thu Oct 06 19:52:11 2011, Hugmeir wrote:
OK, where do I start? (I actually want to finish reimplementing $[
|
From @cpansprout
On Thu Oct 06 20:39:42 2011, sprout wrote:
I’ve made a separate ticket for that, #111980.
When we deal with Unicode brackets, we can deal with Unicode whitespace,
It was integrated recently. See -- Father Chrysostomos |
@cpansprout - Status changed from 'open' to 'resolved' |
From @nwc10
On Thu Mar 29 00:14:09 2012, sprout wrote:
While I was doing something else, I set off a bisect run. HEAD is now at 734ab32 toke.c: S_no_op cleanup toke.c: 'Unrecognized character' croak cleanup. :040000 040000 cab624cfbcf5d9693603b516d54d74126e2db1e6 Nicholas Clark |
Migrated from rt.perl.org#69032 (status was 'resolved')
Searchable as RT69032$