eval() of non-ASCII bytes under unicode_eval and unicode_strings doesn't give them Latin1 meanings #9040

Open
p5pRT opened this issue Sep 22, 2007 · 31 comments

p5pRT commented Sep 22, 2007

Migrated from rt.perl.org#45673 (status was 'open')

Searchable as RT45673$

p5pRT commented Sep 22, 2007

From zefram@fysh.org

Created by zefram@fysh.org

$ perl -we '$a="require x\x{f1}y::z"; eval $a; print $@'
Warning: Use of "require" without parentheses is ambiguous at (eval 1) line 1.
Unrecognized character \xF1 at (eval 1) line 1.
$ perl -we '$a="require x\x{f1}y::z"; utf8::upgrade($a); eval $a; print $@'
Can't locate xZZy/z.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.8.8 /usr/local/share/perl/5.8.8 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8 /usr/local/lib/site_perl /usr/local/lib/perl/5.8.4 /usr/local/share/perl/5.8.4 .) at (eval 1) line 3.
$

What I show above as "ZZ" was originally a sequence of two non-ASCII
characters​: U+00c3 (Latin capital letter A with tilde) and U+00b1
(plus-minus sign). I've replaced them with ASCII characters to avoid
unpredictable manglement.

The phenomenon we see here is that the syntax of Perl, as judged by
eval(), varies according to whether the input string is physically
encoded in UTF8. If it is so encoded then U+00f1, Latin small letter N
with tilde, is an acceptable identifier character, and so can be part
of a module name. If not, then the very same character is invalid in
that context and causes a syntax error.

What, exactly, is Perl's identifier syntax? Is U+00f1 a valid identifier
character?

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl v5.8.8:

Configured by Debian Project at Wed Dec  6 23:17:41 UTC 2006.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.18.3, archname=i486-linux-gnu-thread-multi
    uname='linux saens 2.6.18.3 #1 smp sat nov 25 13:39:52 est 2006 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.8 -Dsitearch=/usr/local/lib/perl/5.8.8 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.8 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.1.2 20061115 (prerelease) (Debian 4.1.1-20)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.3.6.so, so=so, useshrplib=true, libperl=libperl.so.5.8.8
    gnulibc_version='2.3.6'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    


@INC for perl v5.8.8:
    /etc/perl
    /usr/local/lib/perl/5.8.8
    /usr/local/share/perl/5.8.8
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.8
    /usr/share/perl/5.8
    /usr/local/lib/site_perl
    /usr/local/lib/perl/5.8.4
    /usr/local/share/perl/5.8.4
    .


Environment for perl v5.8.8:
    HOME=/home/zefram
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/zefram/pub/i686-pc-linux-gnu/bin:/home/zefram/pub/common/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/local/bin:/usr/games
    PERL_BADLANG (unset)
    SHELL=/usr/bin/zsh

p5pRT commented Sep 23, 2007

From nospam-abuse@bloodgate.com

Moin,

On Saturday 22 September 2007 23​:55​:20 Zefram wrote​:

# New Ticket Created by Zefram
[snip]

$ perl -we '$a="require x\x{f1}y::z"; eval $a; print $@'
Warning: Use of "require" without parentheses is ambiguous at (eval 1) line 1.
Unrecognized character \xF1 at (eval 1) line 1.
$ perl -we '$a="require x\x{f1}y::z"; utf8::upgrade($a); eval $a; print $@'
Can't locate xZZy/z.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.8.8 /usr/local/share/perl/5.8.8 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8 /usr/local/lib/site_perl /usr/local/lib/perl/5.8.4 /usr/local/share/perl/5.8.4 .) at (eval 1) line 3.
$

What I show above as "ZZ" was originally a sequence of two non-ASCII
characters​: U+00c3 (Latin capital letter A with tilde) and U+00b1
(plus-minus sign). I've replaced them with ASCII characters to avoid
unpredictable manglement.

The sequence C3B1 is UTF-8 for "character 0xf1" so that is right.
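The byte arithmetic in that remark can be checked directly. A minimal sketch, assuming only the core Encode module; the variable names are illustrative and not taken from the report:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00F1 (n with tilde) as a one-character Perl string.
my $char = "\x{f1}";

# Encode to UTF-8 octets and render each octet in hex.
my $octets = encode("UTF-8", $char);
my $hex    = join " ", map { sprintf "%02X", ord } split //, $octets;
print "UTF-8 octets: $hex\n";           # C3 B1

# Misreading those octets as Latin-1 yields the two characters
# that the report shows as "ZZ".
my $misread = decode("ISO-8859-1", $octets);
my $points  = join " ", map { sprintf "U+%04X", ord } split //, $misread;
print "Read as Latin-1: $points\n";     # U+00C3 U+00B1
```

So the "ZZ" in the error message is what the two UTF-8 octets of U+00F1 look like when each is taken as a Latin-1 character.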

The phenomenon we see here is that the syntax of Perl, as judged by
eval(), varies according to whether the input string is physically
encoded in UTF8. If it is so encoded then U+00f1, Latin small letter N
with tilde, is an acceptable identifier character, and so can be part
of a module name. If not, then the very same character is invalid in
that context and causes a syntax error.

What, exactly, is Perl's identifier syntax? Is U+00f1 a valid identifier
character?

When you don't do "use utf8;" your script is expected to be in Latin-1
(ISO-8859-1). (We leave "use locale" out of this for now.) Under use utf8,
it can contain any UTF-8.

However, it seems eval() (or require?) doesn't know about this. Plus, I am
not entirely sure how much Unicode you can use in identifiers as something
like this​:

  #!perl
  use utf8;
  my $€ = 1;

still fails to compile with​:

  Unrecognized character \x82 at t.pl line 5.

perldoc perlsyn (in 5.8.8) doesn't seem to say anything about identifiers.

perldoc utf8 says​:

  Enabling the "utf8" pragma has the following effect​:

  Bytes in the source text that have their high‐bit set will be
  treated as being part of a literal UTF−8 character. This
  includes most literals such as identifier names, string
  constants, and constant regular expression patterns.

But it doesn't seem to work in v5.8.8 at least.

All the best,

Tels

--
Signed on Sun Sep 23 18​:05​:15 2007 with key 0x93B84C15.
Get one of my photo posters​: http​://bloodgate.com/posters
PGP key on http​://bloodgate.com/tels.asc or per email.

"Spammed if you do, spammed if you don't."

  -- Murphy's Law

p5pRT commented Sep 23, 2007

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Sep 24, 2007

From @rgs

On 23/09/2007, Tels <nospam-abuse@​bloodgate.com> wrote​:

When you don't do "use utf8;" your script is expected to be in Latin-1
(ISO-8859-1). (We leave "use locale" out of this for now.) Under use utf8,
it can contain any UTF-8.

However, it seems eval() (or require?) doesn't know about this.

Right, there can be double encoding. That will need to be fixed.

Plus, I am
not entirely sure how much Unicode you can use in identifiers as something
like this​:

    #!perl
    use utf8;
    my $€ = 1;

still fails to compile with​:

    Unrecognized character \x82 at t.pl line 5.

perldoc perlsyn (in 5.8.8) doesn't seem to say anything about identifiers.

Identifiers must start with letters; € isn't one.

[rafael@​stcosmo ~]$ bleadperl -Mutf8 -le '$à=42;print $à'
42
[rafael@​stcosmo ~]$ bleadperl -le '$à=42;print $à'
Unrecognized character \xA0 in column 3 at -e line 1.

p5pRT commented Sep 24, 2007

From nospam-abuse@bloodgate.com

Moin,

On Monday 24 September 2007 10​:42​:37 Rafael Garcia-Suarez wrote​:

On 23/09/2007, Tels <nospam-abuse@​bloodgate.com> wrote​:

When you don't do "use utf8;" your script is expected to be in Latin-1
(ISO-8859-1). (We leave "use locale" out of this for now.) Under use
utf8, it can contain any UTF-8.

However, it seems eval() (or require?) doesn't know about this.

Right, there can be double encoding. That will need to be fixed.

Ok.

Plus, I am
not entirely sure how much Unicode you can use in identifiers as
something like this​:

    #!perl
    use utf8;
    my $€ = 1;

still fails to compile with​:

    Unrecognized character \x82 at t.pl line 5.

perldoc perlsyn (in 5.8.8) doesn't seem to say anything about
identifiers.

Identifiers must start with letters; € isn't one.

Wouldn't perlsyn be a good place to document this tidbit, then?

And, of course, I tried that with "$a€", too, see below :P

[rafael@​stcosmo ~]$ bleadperl -Mutf8 -le '$à=42;print $à'
42
[rafael@​stcosmo ~]$ bleadperl -le '$à=42;print $à'
Unrecognized character \xA0 in column 3 at -e line 1.

v5.8.8​:

  # perl -Mutf8 -le '$à=42;print $à'
  42
  # perl -Mutf8 -le '$aà=42;print $aà'
  42
  # perl -Mutf8 -le '$a€=42;print $a€'
  Unrecognized character \xE2 at -e line 1.
  # perl -Mutf8 -le '$€=42;print $€'
  Unrecognized character \x82 at -e line 1.

That mighty Euro seems to be special, it is not allowed even after a letter,
and it's sometimes recognized as \x82 and sometimes as \xE2. Huh?

All the best,

Tels

--
Signed on Mon Sep 24 14​:54​:04 2007 with key 0x93B84C15.
View my photo gallery​: http​://bloodgate.com/photos
PGP key on http​://bloodgate.com/tels.asc or per email.

"Not King yet."

p5pRT commented Sep 24, 2007

From @Juerd

Rafael Garcia-Suarez skribis 2007-09-24 10​:42 (+0200)​:

    use utf8;
    my $€ = 1;

still fails to compile with​:
Unrecognized character \x82 at t.pl line 5.
Identifiers must start with letters; € isn't one.

Still, the character is not \x82 but \x{20ac}, so the error message is
incorrect.

\x82 isn't even the first byte of the UTF-8 encoding of \x{20ac}. It's
the second. Perhaps the first byte (\xe2) is accepted as latin1,
even though utf8.pm is in effect.
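The byte positions are easy to confirm. A small sketch, assuming the core Encode module, that lists the UTF-8 octets of U+20AC in order:

```perl
use strict;
use warnings;
use Encode qw(encode);

# The UTF-8 octets of the euro sign, in order, written the way
# the "Unrecognized character" messages write them.
my @octets = map { sprintf "\\x%02X", ord }
             split //, encode("UTF-8", "\x{20ac}");
print "@octets\n";    # \xE2 \x82 \xAC
```

The `\xE2` reported for `$a€` is the first octet and the `\x82` reported for `$€` is the second, which fits the guess that in the second case the first octet was consumed before the error was raised.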
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

  Juerd Waalboer​: Perl hacker <#####@​juerd.nl> <http​://juerd.nl/sig>
  Convolution​: ICT solutions and consultancy <sales@​convolution.nl>

p5pRT commented Sep 24, 2007

From @Juerd

Tels skribis 2007-09-24 14​:58 (+0200)​:

Identifiers must start with letters; € isn't one.
Wouldn't perlsyn be a good place to document this tidbit, then?
And, of course, I tried that with "$a€", too, see below :P

There's more to it than just the first character.

IIRC, identifiers are [[:alpha:]_]\w+

Euro isn't in there.
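That recollection can be tried against the four names tested earlier in the thread. This is only a sketch of the recalled pattern, not Perl's actual tokenizer rule: `looks_like_identifier` is a hypothetical helper, and `\w*` is used in place of `\w+` on the assumption that one-character names such as `$a` should also pass.

```perl
use strict;
use warnings;

# Approximation of the recalled rule; \w* instead of \w+ so that
# single-character names are accepted (an assumption, see above).
my $ident = qr/\A[[:alpha:]_]\w*\z/;

sub looks_like_identifier {
    my ($name) = @_;
    utf8::upgrade($name);    # work on a copy; forces character matching rules
    return $name =~ $ident ? 1 : 0;
}

for my $name ("\x{e0}", "a\x{e0}", "a\x{20ac}", "\x{20ac}") {
    my $points = join " ", map { sprintf "U+%04X", ord } split //, $name;
    printf "%-14s %s\n", $points,
        looks_like_identifier($name) ? "accepted" : "rejected";
}
```

The two names containing U+00E0 are accepted and the two containing U+20AC are rejected, which lines up with the 5.8.8 results quoted above: the euro sign is a currency symbol, so it is neither alphabetic nor a word character.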
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

  Juerd Waalboer​: Perl hacker <#####@​juerd.nl> <http​://juerd.nl/sig>
  Convolution​: ICT solutions and consultancy <sales@​convolution.nl>

p5pRT commented Aug 26, 2009

From @chipdude

On Mon, Sep 24, 2007 at 10​:42​:37AM +0200, Rafael Garcia-Suarez wrote​:

On 23/09/2007, Tels <nospam-abuse@​bloodgate.com> wrote​:

When you don't do "use utf8;" your script is expected to be in Latin-1
(ISO-8859-1). (We leave "use locale" out of this for now.) Under use utf8,
it can contain any UTF-8.

However, it seems eval() (or require?) doesn't know about this.

Right, there can be double encoding. That will need to be fixed.

I disagree. The only extra encoding is manual here. The call to
utf8​::upgrade is performing that encoding step explicitly at the user's
request.

I call no bug.

Plus, I am
not entirely sure how much Unicode you can use in identifiers as something
like this​:

    #!perl
    use utf8;
    my $€ = 1;

still fails to compile with​:

    Unrecognized character \x82 at t.pl line 5.

perldoc perlsyn (in 5.8.8) doesn't seem to say anything about identifiers.

Identifiers must start with letters; € isn't one.

[rafael@​stcosmo ~]$ bleadperl -Mutf8 -le '$à=42;print $à'
42
[rafael@​stcosmo ~]$ bleadperl -le '$à=42;print $à'
Unrecognized character \xA0 in column 3 at -e line 1.

--
Chip Salzenberg

p5pRT commented Aug 26, 2009

From zefram@fysh.org

Chip Salzenberg wrote​:

I disagree. The only extra encoding is manual here. The call to
utf8​::upgrade is performing that encoding step explicitly at the user's
request.

utf8​::upgrade isn't a user-visible encoding step. It changes how the
string is represented internally, but leaves the string containing the
same characters as before. Later operations on the string ought to
be responding to the same character sequence in the same way. Now,
if I'd done Encode​::encode("UTF-8", ...), *that* would be a manual,
explicitly-requested, encoding step, and I'd expect that to produce a
string with different behaviour from the input.

-zefram
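The distinction drawn here between `utf8::upgrade` and `Encode::encode` can be seen at the string level. A minimal sketch with illustrative variable names, assuming no `use utf8` and no feature bundles in scope:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $native = "x\x{f1}y";            # three characters, one byte each internally

my $upgraded = $native;
utf8::upgrade($upgraded);           # same characters, UTF-8 internal storage

my $encoded = encode("UTF-8", $native);   # a different string: four octets

printf "upgraded: eq=%d length=%d utf8-flag=%d\n",
    ($native eq $upgraded) ? 1 : 0, length($upgraded),
    utf8::is_utf8($upgraded) ? 1 : 0;
printf "encoded:  eq=%d length=%d\n",
    ($native eq $encoded) ? 1 : 0, length($encoded);
```

The upgraded copy compares equal to the original and has the same length; only the internal flag differs. The encoded copy is a different, longer string. The report is that string eval nonetheless treats the first two differently.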

p5pRT commented Aug 26, 2009

From @chipdude

On Wed, Aug 26, 2009 at 11​:19​:39PM +0100, Zefram wrote​:

Chip Salzenberg wrote​:

I disagree. The only extra encoding is manual here. The call to
utf8​::upgrade is performing that encoding step explicitly at the user's
request.

utf8​::upgrade isn't a user-visible encoding step. It changes how the
string is represented internally [...]

You have just agreed with me. "Change of representation" = "encoding".

Perl's parser takes bytes and gives them meaning. If you change the bytes,
you can't expect Perl's parser to ignore that.
--
Chip Salzenberg

p5pRT commented Aug 26, 2009

From zefram@fysh.org

Chip Salzenberg wrote​:

You have just agreed with me. "Change of representation" = "encoding".

utf8​::upgrade affects *internal* encoding. Not the user-visible content
of the string.

Perl's parser takes bytes and gives them meaning. If you change the bytes,
you can't expect Perl's parser to ignore that.

String eval is an operation on a string. A string of *characters*, in
current Perl. The Perl parser claims to ascribe meaning to characters,
not to bytes per se.

Obviously it's internally working with bytes. A Perl source file on
disk is really a sequence of bytes, and the interpretation of those
bytes as characters is influenced by the "use utf8" pragma. In the case
of string eval, the Perl string object already knows what characters it
represents, so without any pragma it already knows whether the internal
byte sequence needs to be interpreted via Latin-1 or UTF-8. ("use utf8"
in a string eval seems meaningless.)

I believe the bug here is that the Perl parser is not consistently
responding to the character sequence. This is presumably due to it
being implemented at the byte level, with insufficient abstraction.

-zefram

p5pRT commented Aug 26, 2009

From @chipdude

On Wed, Aug 26, 2009 at 11​:47​:11PM +0100, Zefram wrote​:

Chip Salzenberg wrote​:

You have just agreed with me. "Change of representation" = "encoding".

utf8​::upgrade affects *internal* encoding. Not the user-visible content
of the string.

"User-visible" is a vague term, because the utf8 flag *is* visible.

The Perl parser claims to ascribe meaning to characters [...]

Does it? If so, then it's a documentation bug.

I believe the bug here is that the Perl parser is not consistently
responding to the character sequence. This is presumably due to it
being implemented at the byte level, with insufficient abstraction.

That's not a bug, it's a feature. (I'm mostly serious about that.)
And it's not worth "fixing". (I'm entirely serious about that.)
--
Chip Salzenberg

p5pRT commented Aug 26, 2009

From zefram@fysh.org

Chip Salzenberg wrote​:

The Perl parser claims to ascribe meaning to characters [...]

Does it? If so, then it's a documentation bug.

I refer to, for example, perldata(1)​:

  Values are usually referred to by name, or through a named reference.
  [...] Usually this name is a single identifier, that is, a string
  beginning with a letter or underscore, and containing letters,
  underscores, and digits.

Clearly referring to characters there, not bytes. It's not so clear
about what qualifies as a "letter". perlunicode(1) expounds a bit more​:

  If an appropriate encoding is specified, identifiers within the
  Perl script may contain Unicode alphanumeric characters, including
  ideographs. Perl does not currently attempt to canonicalize
  variable names.

The internal Latin-1 encoding of a downgraded string seems an appropriate
encoding for the representation of U+f1, a Unicode letter character.

That's not a bug, it's a feature. (I'm mostly serious about that.)

I don't see how the inconsistency can ever be a good thing.

-zefram

p5pRT commented Aug 26, 2009

From @demerphq

2009/8/27 Chip Salzenberg <chip@​pobox.com>​:

On Wed, Aug 26, 2009 at 11​:47​:11PM +0100, Zefram wrote​:

Chip Salzenberg wrote​:

You have just agreed with me. "Change of representation" = "encoding".

utf8​::upgrade affects *internal* encoding. Not the user-visible content
of the string.

"User-visible" is a vague term, because the utf8 flag *is* visible.

The Perl parser claims to ascribe meaning to characters [...]

Does it? If so, then it's a documentation bug.

No. It is *not*. We operate on characters, not bytes. Bytes are
meaningless. Characters are not. We go to *great* trouble to operate
on characters, not on bytes. Reverting to bytes is a HUGE step
backwards and contradicts MANY MANY things that happened in perl 5.10,
were planned for perl 5.12, and were in core long before either.
Some things to consider: the regex engine operates on characters. The
behaviour of Perl on EBCDIC machines should be the same as it is on
Latin-1 or Unicode machines. Thus to claim that perl operates on
bytes contradicts MASSIVE amounts of code in the core.

I believe the bug here is that the Perl parser is not consistently
responding to the character sequence. This is presumably due to it
being implemented at the byte level, with insufficient abstraction.

That's not a bug, it's a feature. (I'm mostly serious about that.)

No it is not. It is a bug.

And it's not worth "fixing". (I'm entirely serious about that.)

I don't agree. It *is* worth fixing.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented Aug 26, 2009

From @chipdude

On Thu, Aug 27, 2009 at 12​:19​:26AM +0100, Zefram wrote​:

Chip Salzenberg wrote​:

The Perl parser claims to ascribe meaning to characters [...]

Does it? If so, then it's a documentation bug.

I refer to, for example, perldata(1) [and] perlunicode(1)

It does appear that L<perlfunc/eval> needs a note.

That's not a bug, it's a feature. (I'm mostly serious about that.)
I don't see how the inconsistency can ever be a good thing.

Calling it "inconsistency" misses the point. The C<eval> operator simply
never got a byte->character behavioral change when much (not all!) of the
rest of Perl did. Consider this, from perlfunc​:

  do EXPR Uses the value of EXPR as a filename and executes the
          contents of the file as a Perl script.

              do 'stat.pl';

          is just like

              eval `cat stat.pl`;

          except that it's more efficient and concise, keeps track of the
          current filename for error messages, searches the @INC
          directories, and updates %INC if the file is found.

There's no way ever to fully lift parsing out of the world of bytes.
I think that's OK.
--
Chip Salzenberg

p5pRT commented Aug 26, 2009

From @demerphq

2009/8/27 Chip Salzenberg <chip@​pobox.com>​:

On Thu, Aug 27, 2009 at 12​:19​:26AM +0100, Zefram wrote​:

Chip Salzenberg wrote​:

The Perl parser claims to ascribe meaning to characters [...]

Does it? If so, then it's a documentation bug.

I refer to, for example, perldata(1) [and] perlunicode(1)

It does appear that L<perlfunc/eval> needs a note.

That's not a bug, it's a feature. (I'm mostly serious about that.)
I don't see how the inconsistency can ever be a good thing.

Calling it "inconsistency" misses the point. The C<eval> operator simply
never got a byte->character behavioral change when much (not all!) of the
rest of Perl did. Consider this, from perlfunc​:

  do EXPR Uses the value of EXPR as a filename and executes the
          contents of the file as a Perl script.

              do 'stat.pl';

          is just like

              eval `cat stat.pl`;

          except that it's more efficient and concise, keeps track of the
          current filename for error messages, searches the @INC
          directories, and updates %INC if the file is found.

There's no way ever to fully lift parsing out of the world of bytes.
I think that's OK.

I think *this* documentation is buggy. Not the other way around.

The exact same file /in terms of bytes/ will NOT do the same thing on
EBCDIC as it will on non-EBCDIC. The exact same sequence of bytes will
not match the same way if it is "unicode" or "non-unicode". If we stop
paying attention to encoding we will end up in a very, very serious
world of pain. The plan /was/ to switch to Unicode semantics in ALL
respects. This means that bytes are irrelevant; characters are what
matter. Not following through on this plan would be IMO a huge step
backwards.

We have debated on p5p the subtleties of encoding, characters,
semantics, etc in the last few years, and came to some kind of general
consensus that the way forward was to assume full unicode semantics at
every level, as every other option sucks much much worse. Perhaps you
missed these debates, or their conclusions. I for one would not
welcome reopening these debates.

As I said earlier, bytes are meaningless. They are numbers. We don't
code in numbers, we code in characters. To go back to thinking of code
as numbers would be like returning to the dark ages from the age of
enlightenment.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented Aug 27, 2009

From @jandubois

On Wed, 26 Aug 2009, demerphq wrote​:

We have debated on p5p the subtleties of encoding, characters,
semantics, etc in the last few years, and came to some kind of general
consensus that the way forward was to assume full unicode semantics at
every level, as every other option sucks much much worse. Perhaps you
missed these debates, or their conclusions. I for one would not
welcome reopening these debates.

Did anybody summarize these conclusions somewhere? Or can you at least
point to the key list messages that give an overview on what was agreed
to before?

Getting to "full Unicode semantics at every level" sounds like a huge
undertaking. Unless we get rid of the SvUTF8 flag and indiscriminately
store all strings internally as UTF8, we would have to modify
virtually *all* APIs that currently take char* arguments and replace
them with SV*s, including all the OS level wrappings, like access
to the environment and file system.

Cheers,
-Jan

p5pRT commented Aug 27, 2009

From @demerphq

2009/8/27 Jan Dubois <jand@​activestate.com>​:

On Wed, 26 Aug 2009, demerphq wrote​:

We have debated on p5p the subtleties of encoding, characters,
semantics, etc in the last few years, and came to some kind of general
consensus that the way forward was to assume full unicode semantics at
every level, as every other option sucks much much worse. Perhaps you
missed these debates, or their conclusions. I for one would not
welcome reopening these debates.

Did anybody summarize these conclusions somewhere? Or can you at least
point to the key list messages that give an overview on what was agreed
to before?

I'll try to put together a summary. The general agreement concerned
using Unicode case folding rules everywhere and eliminating the nasty
"latin1" versus "unicode" difference in behaviour in various
subsystems. Perhaps "everywhere" is too general.

Getting to "full Unicode semantics at every level" sounds like a huge
undertaking. Unless we get rid of the SvUTF8 flag and indiscriminately
store all strings internally as UTF8, we would have to modify
virtually *all* APIs that currently take char* arguments and replace
them with SV*s, including all the OS level wrappings, like access
to the environment and file system.

Is that not a good thing? Forget the amount of work for a moment. What
is the right design decision?

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented Aug 27, 2009

From @rgs

2009/8/27 Chip Salzenberg <chip@​pobox.com>​:

On Wed, Aug 26, 2009 at 11​:19​:39PM +0100, Zefram wrote​:

Chip Salzenberg wrote​:

I disagree.  The only extra encoding is manual here.  The call to
utf8​::upgrade is performing that encoding step explicitly at the user's
request.

utf8​::upgrade isn't a user-visible encoding step.  It changes how the
string is represented internally [...]

You have just agreed with me.  "Change of representation" = "encoding".

Well no it's not. The UTF8 flag shouldn't have any effect on anything
at perl level. It does, currently, and that can't be changed without
breaking backwards compatibility (notably in uc, lc, //i...), but I
think it's important enough to be changed for 5.12.
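The `uc` case mentioned here is the usual illustration of the flag leaking into Perl-level behaviour. A hedged sketch: it assumes no `use feature 'unicode_strings'`, no `use v5.12` or later, and no `use locale` in scope, and it deliberately does not claim a fixed result for the native string, since that depends on those settings.

```perl
use strict;
use warnings;

my $native = "\x{e0}";              # U+00E0, stored as a single byte

my $upgraded = $native;
utf8::upgrade($upgraded);           # same character, UTF-8 internal storage

printf "equal before uc: %s\n", ($native eq $upgraded) ? "yes" : "no";
printf "uc(native)   = U+%04X\n", ord(uc($native));
printf "uc(upgraded) = U+%04X\n", ord(uc($upgraded));
```

The two strings compare equal, and the upgraded one upper-cases to U+00C0. If the native one stays at U+00E0 on a given perl, that is the internal representation showing through; adding `use feature 'unicode_strings'` (available from 5.12) is the documented way to ask for the same result in both cases.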

Perl's parser takes bytes and gives them meaning.  If you change the bytes,
you can't expect Perl's parser to ignore that.

That's a bug in my book. Perl's parser (or to be more precise,
tokenizer) should take characters, not bytes; it doesn't always.

Perltodo states :

=head2 Properly Unicode safe tokeniser and pads.

The tokeniser isn't actually very UTF-8 clean. C<use utf8;> is a hack -
variable names are stored in stashes as raw bytes, without the utf-8 flag
set. The pad API only takes a C<char *> pointer, so that's all bytes too. The
tokeniser ignores the UTF-8-ness of C<PL_rsfp>, or any SVs returned from
source filters. All this could be fixed.

p5pRT commented Aug 27, 2009

From @davidnicol

On Wed, Aug 26, 2009 at 7​:16 PM, Jan Dubois<jand@​activestate.com> wrote​:

Getting to "full Unicode semantics at every level" sounds like a huge
undertaking. Unless we get rid of the SvUTF8 flag and indiscriminately
store all strings internally as UTF8, we would have to modify
virtually *all* APIs that currently take char* arguments and replace
them with SV*s, including all the OS level wrappings, like access
to the environment and file system.

Cheers,
-Jan

Except that the competition, by which I mean at least Python and e262,
are already there.

Yes, major reengineering is required, nobody wants to try to cross an
ocean in a knarr made entirely of band-aids.

Completely decoupled byte and character storage implementations means
rethinking how scalar values are represented.

It seems like a good way to get there is by, for instance, perl to
e262 translation software, taking v8+k7 as a compilation target. With
perl 5.14 as a front-end to a different abstraction, further debate
could be limited to feature design and implementation.

p5pRT commented Aug 27, 2009

From @chipdude

On Thu, Aug 27, 2009 at 02​:43​:51AM +0200, demerphq wrote​:

The general agreement concerned using Unicode case folding rules
everywhere and eliminating the nasty "latin1" versus "unicode" difference
in behaviour in various subsystems.

The robot devil is, as always, in the details.

Perhaps "everywhere" is too general.

Indeed. In the specific case of eval, for example, idealism and the gains
(whatever they may be) from Unicode-aware parsing should be measured against
the bedrock fact that Perl assumes, deeply, that source code is a sequence of
bytes.

If the limited goal is to make utf8::upgrade a NOP -- basically to make our
lexer work with a series of characters, none of which falls outside the
range 0-255 -- nothing of any real value is gained.

If the ambitious goal is to allow the lexer to identify and use arbitrary
Unicode characters, well, first that's a big job (not that I can tell anyone
how to spend their time); but second, it *still* gains us nothing of any
real value. Our lexer and parser are entirely happy with multi-byte
operators like "cmp". No significant work is required to allow them to work
with multi-byte operators that happen to be the UTF-8 sequence for \N{OPEN
SMILEY FACE}.

Finally, and most importantly, consider​:

What does C<use bytes> *mean* inside an eval of a utf8 string? How about
C<use utf8>? If you're about to tell me that the rules will be X if it's
utf8 string and Y if it isn't, then you've broken the very identity that you
wanted to retain!

In short​: Forget big fish vs. small fish. This isn't even a fish, it's just
a painting of a fish. Ceci n'est pas un fish. Let's go find real fish to fry.
--
Chip Salzenberg

p5pRT commented Aug 28, 2009

From @khwilliamson

demerphq wrote​:

2009/8/27 Jan Dubois <jand@​activestate.com>​:

On Wed, 26 Aug 2009, demerphq wrote​:

We have debated on p5p the subtleties of encoding, characters,
semantics, etc in the last few years, and came to some kind of general
consensus that the way forward was to assume full unicode semantics at
every level, as every other option sucks much much worse. Perhaps you
missed these debates, or their conclusions. I for one would not
welcome reopening these debates.
Did anybody summarize these conclusions somewhere? Or can you at least
point to the key list messages that give an overview on what was agreed
to before?

I'll try to put together a summary. The general agreement concerned
using Unicode case folding rules everywhere and eliminating the nasty
"latin1" versus "unicode" difference in behaviour in various
subsystems. Perhaps "everywhere" is too general.

One principle that I think there was consensus on (and I certainly hope
no one disputes it) is that the way Perl stores something internally
should have no effect on the user-level semantics (unless one is really
digging, like in 'use bytes'). This is sadly not currently the case.

p5pRT commented Aug 28, 2009

From marvin@rectangular.com

On Fri, Aug 28, 2009 at 07​:52​:51AM -0600, karl williamson wrote​:

One principle that I think there was consensus on (and I certainly hope
no one disputes it) is that the way Perl stores something internally
should have no effect on the user-level semantics (unless one is really
digging, like in 'use bytes'). This is sadly not currently the case.

Practically speaking, I think it's unrealistic to do anything ambitious with
UTF-8 without understanding the role of the SVf_UTF8 flag and being able to
troubleshoot using Devel​::Peek, etc. There are too many opportunities to make
mistakes.

That said, I'm grateful that for those of us who have that expertise, it *is*
possible to do ambitious things with UTF-8. :) I understand the backwards
compatibility constraints under which the system was designed, and I'm very
impressed by what was achieved.

Looking forward... what if this directive implied a source file encoding of
UTF-8? :)

  use 5.012;

Marvin Humphrey

@p5pRT
Author

p5pRT commented Aug 28, 2009

From @nwc10

On Fri, Aug 28, 2009 at 08​:00​:04AM -0700, Marvin Humphrey wrote​:

Looking forward... what if this directive implied a source file encoding of
UTF-8? :)

use 5.012;

What would I use, if I had a script written in some other encoding, but
needed to enforce a requirement for v5.12.0 or later?

Nicholas Clark

@p5pRT
Author

p5pRT commented Aug 28, 2009

From marvin@rectangular.com

On Fri, Aug 28, 2009 at 04​:02​:19PM +0100, Nicholas Clark wrote​:

Looking forward... what if this directive implied a source file encoding of
UTF-8? :)

use 5.012;

What would I use, if I had a script written in some other encoding, but
needed to enforce a requirement for v5.12.0 or later?

Obviously, I am implying that such a script would need to be updated. You are
already modding it by adding the "use" directive, no?

Marvin Humphrey

@p5pRT
Author

p5pRT commented Aug 28, 2009

From @nwc10

On Fri, Aug 28, 2009 at 08​:04​:53AM -0700, Marvin Humphrey wrote​:

On Fri, Aug 28, 2009 at 04​:02​:19PM +0100, Nicholas Clark wrote​:

Looking forward... what if this directive implied a source file encoding of
UTF-8? :)

use 5.012;

What would I use, if I had a script written in some other encoding, but
needed to enforce a requirement for v5.12.0 or later?

Obviously, I am implying that such a script would need to be updated. You are
already modding it by adding the "use" directive, no?

This would reduce functionality by conflating two orthogonal features. This
reduces choice. I think that forcing the conversion is a bad thing.

Particularly as use 5.010 had no such semantic overloading.

Nicholas Clark

@p5pRT
Author

p5pRT commented Aug 28, 2009

From zefram@fysh.org

Nicholas Clark wrote​:

This would reduce functionality by conflating two orthogonal features. This
reduces choice. I think that forcing the conversion is a bad thing.

+1

Particularly as use 5.010 had no such semantic overloading.

$ perl -e 'say "foo"'
Unquoted string "say" may clash with future reserved word at -e line 1.
String found where operator expected at -e line 1, near "say "foo""
  (Do you need to predeclare say?)
syntax error at -e line 1, near "say "foo""
Execution of -e aborted due to compilation errors.
$ perl -e 'use 5.010; say "foo"'
foo

I'm afraid that boat has already sailed.

-zefram

@p5pRT
Author

p5pRT commented Mar 2, 2012

From @rjbs

How does the creation of evalbytes and unicode_eval affect this ticket, if at all?

@p5pRT
Author

p5pRT commented Mar 2, 2012

From @cpansprout

On Thu Mar 01 18​:46​:16 2012, rjbs wrote​:

How does the creation of evalbytes and unicode_eval affect this
ticket, if at all?

It isn’t enough. I believe the patches for which #107008 exists will
fix it, but if I knew for certain that would only be because I had
already integrated them (which I haven’t). :-)

--

Father Chrysostomos

@p5pRT
Author

p5pRT commented Mar 29, 2012

From @cpansprout

On Fri Mar 02 08​:59​:58 2012, sprout wrote​:

On Thu Mar 01 18​:46​:16 2012, rjbs wrote​:

How does the creation of evalbytes and unicode_eval affect this
ticket, if at all?

It isn’t enough. I believe the patches for which #107008 exists will
fix it, but if I knew for certain that would only be because I had
already integrated them (which I haven’t). :-)

The example shown in the original post still fails the same way. It
probably has something to do with require() being a syscall, so it
doesn’t respect utf8-ness (see the tickets linked to #105914). However,
require() should be able to preserve the utf8ness at least for reporting
failure. So this particular example is not resolved yet (nor do I think
it pressing enough for 5.16). But the bug described in the original
post (ignoring the example given) is fixed.

--

Father Chrysostomos

@khwilliamson
Contributor

This is still a problem in 5.35.10. To save others from reading through the whole ticket: the real issue is the first example in the original post, but it is not what I think people have said it is. The claim is that source without 'use utf8' is presumed to be encoded as Latin1. In fact, non-ASCII bytes are not assumed to be Latin1; they are just anonymous bytes with no properties other than their code points: they aren't \w, aren't \s, aren't controls, and so on. One would think that unicode_eval or unicode_strings would give these bytes their Latin1 meanings, but that doesn't happen.
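
A rough way to check this on a given perl, sketched on the assumption that it is 5.16 or later (so that `unicode_eval` exists) and that no module named like the one in the original post is installed. The `%error` hash and the case labels are only illustrative. The script evaluates the original example twice under `unicode_eval` and `unicode_strings`, once as a byte string and once upgraded, and prints whatever error each form produces:

```perl
use strict;
use warnings;
use feature qw(unicode_eval unicode_strings);

# The code from the original post: U+00F1 inside a module name.
my $bytes = "require x\x{f1}y::z";   # non-ASCII character stored as one byte
my $chars = $bytes;
utf8::upgrade($chars);               # same characters, UTF-8 internally

# Evaluate both forms and record the resulting error message.
my %error;
for my $case ([bytes => $bytes], [upgraded => $chars]) {
    my ($label, $code) = @$case;
    eval $code;
    $error{$label} = $@;
}

print "bytes:    $error{bytes}";
print "upgraded: $error{upgraded}";
```

Both forms fail, since the module does not exist. The point of interest is whether the two messages are the same kind of failure: a syntax error for one form and a failed module lookup for the other would mean the parse still depends on the internal encoding.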

@khwilliamson changed the title from "parsing in eval() varies with UTF8ness" to "eval() of non-ASCII bytes under unicode_eval and unicode_strings doesn't give them Latin1 meanings" on Mar 22, 2022