eval() of non-ASCII bytes under unicode_eval and unicode_strings doesn't give them Latin1 meanings #9040
From zefram@fysh.org
Created by zefram@fysh.org
$ perl -we '$a="require x\x{f1}y::z"; eval What I show above as "ZZ" was originally a sequence of two non-ASCII The phenomenon we see here is that the syntax of Perl, as judged by What, exactly, is Perl's identifier syntax? Is U+00f1 a valid identifier Perl Info
|
From nospam-abuse@bloodgate.com
Moin, On Saturday 22 September 2007 23:55:20 Zefram wrote:
The sequence C3B1 is UTF-8 for "character 0xf1" so that is right.
When you don't do "use utf8;" your script is expected to be in latin1 However, it seems eval() (or require?) doesn't know about this. Plus, I am #!perl still fails to compile with: Unrecognized character \x82 at t.pl line 5. perldoc perlsyn (in 5.8.8) doesn't seem to say anything about identifiers. perldoc utf8 says: Enabling the "utf8" pragma has the following effect: Bytes in the source text that have their high‐bit set will be But it doesn't seem to work in v5.8.8 at least. All the best, Tels -- "Spammed if you do, spammed if you don't." -- Murphy's Law |
The RT System itself - Status changed from 'new' to 'open' |
From @rgs
On 23/09/2007, Tels <nospam-abuse@bloodgate.com> wrote:
Right, there can be double encoding. That will need to be fixed.
Identifiers must start with letters; € isn't one. [rafael@stcosmo ~]$ bleadperl -Mutf8 -le '$à=42;print $à' |
From nospam-abuse@bloodgate.com
Moin, On Monday 24 September 2007 10:42:37 Rafael Garcia-Suarez wrote:
Ok.
Wouldn't perlsyn be a good place to document this tidbit, then? And, of course, I tried that with "$a€", too, see below :P
v5.8.8: # perl -Mutf8 -le '$à=42;print $à' That mighty Euro seems to be special, it is not allowed even after a letter, All the best, Tels -- "Not King yet." |
From @Juerd
Rafael Garcia-Suarez skribis 2007-09-24 10:42 (+0200):
Still, the character is not \x82 but \x{20ac}, so the error message is \x82 isn't even the first byte of the UTF-8 encoding of \x{20ac}. It's Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig> |
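The byte values discussed so far (C3 B1 for U+00F1, and \x82 in the error message for the Euro sign) can be checked with a short script. This is only an illustrative sketch; the helper name utf8_hex is made up for this example and does not come from the ticket.

```perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-8 encoding of one code point
# as space-separated hex bytes.
sub utf8_hex {
    my ($codepoint) = @_;
    my $bytes = chr($codepoint);
    utf8::encode($bytes);    # in place: characters -> UTF-8 octets
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $n_tilde = utf8_hex(0xF1);      # U+00F1, the character in the original report
my $euro    = utf8_hex(0x20AC);    # U+20AC EURO SIGN

print "U+00F1 -> $n_tilde\n";      # C3 B1
print "U+20AC -> $euro\n";         # E2 82 AC
```

In this encoding 0x82 is the second of the three bytes for U+20AC, which matches the observation that the error message names neither the character nor the first byte of its encoding.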
From @Juerd
Tels skribis 2007-09-24 14:58 (+0200):
There's more to it than just the first character. IIRC, identifiers are [[:alpha:]_]\w+ Euro isn't in there. Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig> |
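As a rough check of the recalled rule, the sketch below tests a few candidate names against a regex under Unicode rules (the /u flag, Perl 5.14 or later). Two assumptions are mine, not the thread's: the pattern uses \w* rather than \w+ so that one-character names match, and a regex match only approximates what the tokeniser accepts.

```perl
use strict;
use warnings;

# Assumed reading of the recalled identifier pattern, applied with
# Unicode rules regardless of how the string is stored internally.
my $identifier = qr/\A[[:alpha:]_]\w*\z/u;

my @candidates = (
    [ 'a',          "a"         ],
    [ 'U+00E0',     "\x{e0}"    ],    # a with grave accent, a letter
    [ 'a + U+20AC', "a\x{20ac}" ],    # letter followed by the Euro sign
    [ 'U+20AC',     "\x{20ac}"  ],    # the Euro sign alone
);

my %matches;
for my $candidate (@candidates) {
    my ($label, $name) = @$candidate;
    $matches{$label} = $name =~ $identifier ? 1 : 0;
    printf "%-12s %s\n", $label, $matches{$label} ? 'matches' : 'does not match';
}
```

The Euro sign is a currency symbol rather than a word character, so both names containing it are rejected, whether it comes first or follows a letter.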
From @chipdude
On Mon, Sep 24, 2007 at 10:42:37AM +0200, Rafael Garcia-Suarez wrote:
I disagree. The only extra encoding is manual here. The call to I call no bug.
-- |
From zefram@fysh.org
Chip Salzenberg wrote:
utf8::upgrade isn't a user-visible encoding step. It changes how the -zefram |
From @chipdude
On Wed, Aug 26, 2009 at 11:19:39PM +0100, Zefram wrote:
You have just agreed with me. "Change of representation" = "encoding". Perl's parser takes bytes and gives them meaning. If you change the bytes, |
From zefram@fysh.org
Chip Salzenberg wrote:
utf8::upgrade affects *internal* encoding. Not the user-visible content
String eval is an operation on a string. A string of *characters*, in Obviously it's internally working with bytes. A Perl source file on I believe the bug here is that the Perl parser is not consistently -zefram |
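The distinction between string content and internal representation can be seen in a small sketch. It assumes a perl on an ASCII platform with no locale in effect, and it deliberately leaves the unicode_strings feature off (so no "use v5.12" or later); the variable names are illustrative only.

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" here.

my $down = "require x\x{f1}y::z";   # stored one byte per character
my $up   = $down;
utf8::upgrade($up);                 # same characters, UTF-8 internal storage

# At the string level the two values are indistinguishable.
my $same_value  = $down eq $up ? 1 : 0;
my $same_length = length($down) == length($up) ? 1 : 0;

# The internal flag differs.
my @flags = map { utf8::is_utf8($_) ? 1 : 0 } $down, $up;

# Character 9 is U+00F1; with default regex rules its \w status
# depends on the representation.
my @is_word = map { substr($_, 9, 1) =~ /\w/ ? 1 : 0 } $down, $up;

print "same value: $same_value, same length: $same_length\n";
print "utf8 flag (down, up): @flags\n";
print "\\w match (down, up): @is_word\n";
```

The two strings compare equal and have the same length, yet only the upgraded copy has U+00F1 treated as a word character. The ticket reports the same kind of representation-dependent behaviour in the parser.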
From @chipdude
On Wed, Aug 26, 2009 at 11:47:11PM +0100, Zefram wrote:
"User-visible" is a vague term, because the utf8 flag *is* visible.
Does it? If so, then it's a documentation bug.
That's not a bug, it's a feature. (I'm mostly serious about that.) |
From zefram@fysh.org
Chip Salzenberg wrote:
I refer to, for example, perldata(1): Values are usually referred to by name, or through a named reference. Clearly referring to characters there, not bytes. It's not so clear If an appropriate encoding is specified, identifiers within the The internal Latin-1 encoding of a downgraded string seems an appropriate
I don't see how the inconsistency can ever be a good thing. -zefram |
From @demerphq
2009/8/27 Chip Salzenberg <chip@pobox.com>:
No. It is *not*. We operate on characters, not bytes. Bytes are
No it is not. It is a bug.
I don't agree. It *is* worth fixing. Yves -- |
From @chipdude
On Thu, Aug 27, 2009 at 12:19:26AM +0100, Zefram wrote:
It does appear that L<perlfunc/eval> needs a note.
Calling it "inconsistency" misses the point. The C<eval> operator simply do EXPR Uses the value of EXPR as a filename and executes the There's no way ever to fully lift parsing out of the world of bytes. |
From @demerphq
2009/8/27 Chip Salzenberg <chip@pobox.com>:
I think *this* documentation is buggy. Not the other way around. The exact same file /in terms of bytes/ will NOT do the same thing on We have debated on p5p the subtleties of encoding, characters, As I said earlier, bytes are meaningless. They are numbers. We don't cheers, -- |
From @jandubois
On Wed, 26 Aug 2009, demerphq wrote:
Did anybody summarize these conclusions somewhere? Or can you at least Getting to "full Unicode semantics at every level" sounds like a huge Cheers, |
From @demerphq
2009/8/27 Jan Dubois <jand@activestate.com>:
I'll try to put together a summary. The general agreement concerned
Is that not a good thing? Forget the amount of work for a moment. What cheers, -- |
From @rgs
2009/8/27 Chip Salzenberg <chip@pobox.com>:
Well no it's not. The UTF8 flag shouldn't have any effect on anything
That's a bug in my book. Perl's parser (or to be more precise, Perltodo states : =head2 Properly Unicode safe tokeniser and pads. The tokeniser isn't actually very UTF-8 clean. C<use utf8;> is a hack - |
From @davidnicol
On Wed, Aug 26, 2009 at 7:16 PM, Jan Dubois <jand@activestate.com> wrote:
Except that the competition, by which I mean at least Python and e262, Yes, major reengineering is required, nobody wants to try to cross an Completely decoupled byte and character storage implementations means It seems like a good way to get there is by, for instance, perl to |
From @chipdude
On Thu, Aug 27, 2009 at 02:43:51AM +0200, demerphq wrote:
The robot devil is, as always, in the details.
Indeed. In the specific case of eval, for example, idealism and the gains If the limited goal is to make utf8::upgrade a NOP -- basically to make our If the ambitious goal is to allow the lexer to identify and use arbitrary Finally, and most importantly, consider: What does C<use bytes> *mean* inside an eval of a utf8 string? How about In short: Forget big fish vs. small fish. This isn't even a fish, it's just |
From @khwilliamson
demerphq wrote:
One principle that I think there was consensus on (and I certainly hope |
From marvin@rectangular.com
On Fri, Aug 28, 2009 at 07:52:51AM -0600, karl williamson wrote:
Practically speaking, I think it's unrealistic to do anything ambitious with That said, I'm grateful that for those of us who have that expertise, it *is* Looking forward... what if this directive implied a source file encoding of use 5.012; Marvin Humphrey |
From @nwc10
On Fri, Aug 28, 2009 at 08:00:04AM -0700, Marvin Humphrey wrote:
What would I use, if I had a script written in some other encoding, but Nicholas Clark |
From marvin@rectangular.com
On Fri, Aug 28, 2009 at 04:02:19PM +0100, Nicholas Clark wrote:
Obviously, I am implying that such a script would need to be updated. You are Marvin Humphrey |
From @nwc10
On Fri, Aug 28, 2009 at 08:04:53AM -0700, Marvin Humphrey wrote:
This would reduce functionality by conflating two orthogonal features. This Particularly as use 5.010 had no such semantic overloading. Nicholas Clark |
From zefram@fysh.org
Nicholas Clark wrote:
+1
$ perl -e 'say "foo"' I'm afraid that boat has already sailed. -zefram |
From @rjbs
How does the creation of evalbytes and unicode_eval affect this ticket, if at all? |
From @cpansprout
On Thu Mar 01 18:46:16 2012, rjbs wrote:
It isn’t enough. I believe the patches for which #107008 exists will -- Father Chrysostomos |
From @cpansprout
On Fri Mar 02 08:59:58 2012, sprout wrote:
The example shown in the original post still fails the same way. It -- Father Chrysostomos |
This is still a problem in 5.35.10. To save reading through the ticket: the real issue is the first example in the original post, though not for the reason I think people have given. The claim is that source without 'use utf8' is presumed to be encoded as Latin1. In fact, non-ASCII bytes are not assumed to be Latin1; they are just anonymous bytes with no properties beyond their code points: they aren't \w, aren't \s, aren't controls .... One would expect unicode_eval or unicode_strings to give them their Latin1 meanings, but that doesn't happen. |
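To try this on a given perl, a sketch along the following lines evals the string from the original post in both internal representations with unicode_eval and unicode_strings enabled, and prints whatever error each attempt produces. It assumes Perl 5.16 or later and that no module named x\x{f1}y::z is installed; the variable names are illustrative. No expected output is shown, because the exact messages vary between versions.

```perl
use strict;
use warnings;
use feature qw(unicode_eval unicode_strings);   # needs Perl 5.16 or later

my $down = "require x\x{f1}y::z";   # native 8-bit representation
my $up   = $down;
utf8::upgrade($up);                 # same characters, UTF-8 representation

my %error;
for my $case ([ downgraded => $down ], [ upgraded => $up ]) {
    my ($label, $code) = @$case;
    eval $code;                     # expected to fail: no such module exists
    my $err = $@;
    chomp $err;
    # Escape non-printable and non-ASCII characters for safe printing.
    $err =~ s/([^\x20-\x7e])/sprintf '\\x{%x}', ord $1/ge;
    $error{$label} = $err;
    print "$label: $err\n";
}
```

If the two messages differ in kind (for example, one complaining about a character and the other about a missing file), the parser is still giving the same characters different meanings depending on representation.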
Migrated from rt.perl.org#45673 (status was 'open')
Searchable as RT45673$