Deparse forgets use utf8 #11334

Open
p5pRT opened this issue May 14, 2011 · 14 comments

p5pRT commented May 14, 2011

Migrated from rt.perl.org#90590 (status was 'open')

Searchable as RT90590$

p5pRT commented May 14, 2011

From tchrist@perl.com

This should certainly be emitting a use utf8 at the top​:

  % perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
  BEGIN {
  $^H{'feature_unicode'} = q(1);
  $^H{'feature_say'} = q(1);
  $^H{'feature_state'} = q(1);
  $^H{'feature_switch'} = q(1);
  }
  say('α-ω');
  -e syntax OK

--tom

  Summary of my perl5 (revision 5 version 12 subversion 3) configuration​:
 
  Platform​:
  osname=openbsd, osvers=4.4, archname=OpenBSD.i386-openbsd
  uname='openbsd chthon 4.4 generic#0 i386 '
  config_args='-des'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=undef, usemultiplicity=undef
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=undef, use64bitall=undef, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
  optimize='-O2',
  cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
  ccversion='', gccversion='3.3.5 (propolice)', gccosandvers='openbsd4.4'
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags ='-Wl,-E -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /usr/lib
  libs=-lgdbm -lm -lutil -lc
  perllibs=-lm -lutil -lc
  libc=/usr/lib/libc.so.48.0, so=so, useshrplib=false, libperl=libperl.a
  gnulibc_version=''
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
  cccdlflags='-DPIC -fPIC ', lddlflags='-shared -fPIC -L/usr/local/lib -fstack-protector'

  Characteristics of this binary (from libperl)​:
  Compile-time options​: PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP
  USE_LARGE_FILES USE_PERLIO USE_PERL_ATOF
  Built under openbsd
  Compiled at Feb 14 2011 07​:32​:03
  %ENV​:
  PERL_UNICODE="SA"
  @​INC​:
  /usr/local/lib/perl5/site_perl/5.12.3/OpenBSD.i386-openbsd
  /usr/local/lib/perl5/site_perl/5.12.3
  /usr/local/lib/perl5/5.12.3/OpenBSD.i386-openbsd
  /usr/local/lib/perl5/5.12.3
  /usr/local/lib/perl5/site_perl/5.11.3
  /usr/local/lib/perl5/site_perl/5.10.1
  /usr/local/lib/perl5/site_perl/5.10.0
  /usr/local/lib/perl5/site_perl/5.8.7
  /usr/local/lib/perl5/site_perl/5.8.0
  /usr/local/lib/perl5/site_perl/5.6.0
  /usr/local/lib/perl5/site_perl/5.005
  /usr/local/lib/perl5/site_perl
  .

p5pRT commented Jan 5, 2012

From @cpansprout

On Sat May 14 14​:14​:45 2011, tom christiansen wrote​:

This should certainly be emitting a use utf8 at the top​:

% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK

If it were to put ‘use utf8’ at the top, would it not make sense for it
to output a stream of utf8 bytes without -CS?

After all, ‘use utf8’ indicates that the *bytes* that follow consist of
utf8.

And should it behave differently depending on whether the output is
going straight to STDERR/OUT (whichever it is) or being returned by
coderef2text?

--

Father Chrysostomos

p5pRT commented Jan 5, 2012

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Jan 5, 2012

From @cpansprout

On Thu Jan 05 14​:04​:42 2012, sprout wrote​:

On Sat May 14 14​:14​:45 2011, tom christiansen wrote​:

This should certainly be emitting a use utf8 at the top​:

% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK

If it were to put ‘use utf8’ at the top, would it not make sense for it
to output a stream of utf8 bytes without -CS?

After all, ‘use utf8’ indicates that the *bytes* that follow consist of
utf8.

And should it behave differently depending on whether the output is
going straight to STDERR/OUT (whichever it is) or being returned by
coderef2text?

Don’t forget that (under use v5.16) eval("'\x{100}'") does the same
thing as evalbytes("use utf8; '\xc4\x80'").
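
The equivalence mentioned here can be checked directly. This is a minimal sketch, assuming perl 5.16 or later, where `use v5.16` turns on the `unicode_eval` and `evalbytes` features; the variable names are only illustrative.

```perl
use v5.16;    # enables the unicode_eval and evalbytes features
use warnings;

# eval receives a *character* string: a quoted literal containing U+0100.
my $from_chars = eval("'\x{100}'");

# evalbytes receives a *byte* string; 'use utf8' tells the parser that the
# bytes C4 80 are the UTF-8 encoding of U+0100.
my $from_bytes = evalbytes("use utf8; '\xc4\x80'");

say $from_chars eq $from_bytes ? "same string" : "different strings";
say "code point: ", ord($from_chars);
```

In other words, a character string fed to `eval` and its UTF-8 encoding fed to `evalbytes` with `use utf8` describe the same program, which is why the choice between the two matters for what coderef2text should return.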

--

Father Chrysostomos

p5pRT commented Jan 5, 2012

From @ikegami

On Thu, Jan 5, 2012 at 5​:04 PM, Father Chrysostomos via RT <
perlbug-followup@​perl.org> wrote​:

On Sat May 14 14​:14​:45 2011, tom christiansen wrote​:

This should certainly be emitting a use utf8 at the top​:

% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK

If it were to put ‘use utf8’ at the top, would it not make sense for it
to output a stream of utf8 bytes without -CS?

That's not relevant. The issue is that the string built by the program
generated by Deparse is different than the string built by the original
program.

$ perl -E'$_="\N{U+3b1}-\N{U+3c9}"; say length;'
3

$ perl -MO=Deparse -E'$_="\N{U+3b1}-\N{U+3c9}"; say length;' | perl
Wide character in print at
/home/eric/usr/perlbrew/perls/perl-5.14.0t/lib/5.14.0/B/Deparse.pm line
1213.
-e syntax OK
5
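
The 3-versus-5 difference can be reproduced without a pipe. The sketch below assumes perl 5.16 or later (for `evalbytes`) and uses the core Encode module to stand in for the bytes that reach the second perl; the variable names are illustrative.

```perl
use v5.16;    # for evalbytes
use warnings;
use Encode qw(encode);

my $chars = "\N{U+3b1}-\N{U+3c9}";      # the string the original program builds
my $bytes = encode('UTF-8', $chars);    # the bytes that end up in the piped source

# Without 'use utf8', every byte of the literal counts as one character.
my $len_plain = evalbytes("length('$bytes')");

# With 'use utf8', the literal is decoded back into the original characters.
my $len_utf8 = evalbytes("use utf8; length('$bytes')");

say "original: ", length($chars);
say "reparsed without use utf8: $len_plain";
say "reparsed with use utf8: $len_utf8";
```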

p5pRT commented Jan 5, 2012

From @cpansprout

On Thu Jan 05 14​:24​:41 2012, ikegami@​adaelis.com wrote​:

On Thu, Jan 5, 2012 at 5​:04 PM, Father Chrysostomos via RT <
perlbug-followup@​perl.org> wrote​:

On Sat May 14 14​:14​:45 2011, tom christiansen wrote​:

This should certainly be emitting a use utf8 at the top​:

% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK

If it were to put ‘use utf8’ at the top, would it not make sense for it
to output a stream of utf8 bytes without -CS?

That's not relevant. The issue is that the string built by the program
generated by Deparse is different than the string built by the original
program.

$ perl -E'$_="\N{U+3b1}-\N{U+3c9}"; say length;'
3

$ perl -MO=Deparse -E'$_="\N{U+3b1}-\N{U+3c9}"; say length;' | perl
Wide character in print at
/home/eric/usr/perlbrew/perls/perl-5.14.0t/lib/5.14.0/B/Deparse.pm line
1213.
-e syntax OK
5

Note the wide character warning. If one were to eval() the string
instead of outputting it, it would produce the same result.

It makes sense to me to *encode* output as utf8 by default, with ‘use
utf8’. But unless we are going to encode it before it reaches the
PerlIO layer (because when we have ‘use utf8’, the bytes, not the
characters, make up the source code), it doesn’t make sense (to me) to
add ‘use utf8’.
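
One way to picture "encode before it reaches the PerlIO layer" is to do the encoding by hand on coderef2text output. This is only a sketch of the idea, not how B::Deparse behaves; it assumes perl 5.16 or later for `evalbytes`, and the variable names are made up for illustration.

```perl
use v5.16;    # for evalbytes
use warnings;
use B::Deparse;
use Encode qw(encode);

my $original = sub { return "\N{U+3b1}-\N{U+3c9}" };

# coderef2text returns the body as a block, so prepend 'sub'.
my $text = 'sub ' . B::Deparse->new->coderef2text($original);

# Encode first, then declare the encoding: the result is a byte string
# whose literals are UTF-8, which is what 'use utf8' promises.
my $source = "use utf8;\n" . encode('UTF-8', $text);

binmode STDOUT;    # raw bytes, so printing cannot trigger "Wide character"
print $source, "\n";

# The byte string can be compiled again with evalbytes.
my $rebuilt = evalbytes($source) or die $@;
```

Whether the deparsed literal contains the characters themselves or `\x{...}` escapes depends on the perl version; the encoding step is harmless in the second case because the text is then plain ASCII.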

--

Father Chrysostomos

p5pRT commented Jan 5, 2012

From @ikegami

On Thu, Jan 5, 2012 at 5​:39 PM, Father Chrysostomos via RT <
perlbug-followup@​perl.org> wrote​:

unless we are going to encode it before it reaches the
PerlIO layer (because when we have ‘use utf8’, the bytes, not the
characters, make up the source code), it doesn’t make sense (to me) to
add ‘use utf8’.

Agree.

It makes sense to me to *encode* output as utf8 by default, with ‘use
utf8’.

Agree.

p5pRT commented Jan 6, 2012

From @ap

* Father Chrysostomos via RT <perlbug-followup@​perl.org> [2012-01-05 23​:05]​:

On Sat May 14 14​:14​:45 2011, tom christiansen wrote​:

This should certainly be emitting a use utf8 at the top​:

% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK

If it were to put ‘use utf8’ at the top, would it not make sense for
it to output a stream of utf8 bytes without -CS?

After all, ‘use utf8’ indicates that the *bytes* that follow consist
of utf8.

And should it behave differently depending on whether the output is
going straight to STDERR/OUT (whichever it is) or being returned by
coderef2text?

I think Deparse needs to do better for strings. The output here should
look like this instead​:

  BEGIN {
  $^H{'feature_unicode'} = q(1);
  $^H{'feature_say'} = q(1);
  $^H{'feature_state'} = q(1);
  $^H{'feature_switch'} = q(1);
  }
  say("\x{03B1}-\x{03C9}");

That would be independent of encodings (well, beyond… ASCII I guess) as
well as semantically explicit.

Currently Deparse actually does the opposite transform for strings – if
you have a "\x{03B1}" in your source it will claim Perl saw a 'α'. That
is correct but not truly in the spirit of “showing you what Perl thought
you meant”, esp. when you consider that it gives you no (easy) way to
tell whether that 'ñ' was really "\x{F1}" or actually "n\x{0303}".

(Maybe it should even use \N by default. In fact I would be sure, if it
weren’t for the verbosity that this entails. As things are, I’d say that
it should be requestable by argument instead and can in any case be left
out for later.)
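
To make the escaped form concrete, here is a minimal sketch of a function that renders any string as an ASCII-only double-quoted literal. `escape_literal` is a hypothetical helper written for this illustration; it is not part of B::Deparse and makes no claim about how Deparse would implement this.

```perl
use strict;
use warnings;

# Hypothetical helper: render a string as an ASCII-only "..." literal.
sub escape_literal {
    my ($str) = @_;
    my $body = join '', map {
        my $cp = ord;
        $cp < 0x20 || $cp > 0x7E ? sprintf('\\x{%04X}', $cp)   # control or non-ASCII
      : /["\\\$\@]/              ? "\\$_"                       # special inside "..."
      :                            $_;                          # printable ASCII
    } split //, $str;
    return qq{"$body"};
}

print escape_literal("\x{3b1}-\x{3c9}"), "\n";
print escape_literal("\x{F1}"), "\n";      # precomposed n with tilde
print escape_literal("n\x{303}"), "\n";    # n followed by combining tilde
```

The last two calls show the point about 'ñ': the two spellings look the same when printed as characters but are distinguishable once every non-ASCII code point is written as an escape.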

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT commented Jan 6, 2012

From @cpansprout

On Thu Jan 05 16​:33​:41 2012, aristotle wrote​:

* Father Chrysostomos via RT <perlbug-followup@​perl.org> [2012-01-05
23​:05]​:

On Sat May 14 14​:14​:45 2011, tom christiansen wrote​:

This should certainly be emitting a use utf8 at the top​:

% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK

If it were to put ‘use utf8’ at the top, would it not make sense for
it to output a stream of utf8 bytes without -CS?

After all, ‘use utf8’ indicates that the *bytes* that follow consist
of utf8.

And should it behave differently depending on whether the output is
going straight to STDERR/OUT (whichever it is) or being returned by
coderef2text?

I think Deparse needs to do better for strings. The output here should
look like this instead​:

BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say("\x{03B1}-\x{03C9}");

That would be independent of encodings (well, beyond… ASCII I guess) as
well as semantically explicit.

Currently Deparse actually does the opposite transform for strings – if
you have a "\x{03B1}" in your source it will claim Perl saw a 'α'. That
is correct but not truly in the spirit of “showing you what Perl thought
you meant”, esp. when you consider that it gives you no (easy) way to
tell whether that 'ñ' was really "\x{F1}" or actually "n\x{0303}".

What about symbol names?

--

Father Chrysostomos

p5pRT commented Jan 6, 2012

From @ap

* Father Chrysostomos via RT <perlbug-followup@​perl.org> [2012-01-06 01​:45]​:

What about symbol names?

They don’t have an easy answer I can think of.

The subset of lexical variable names sort of has one, insofar as they
are irrelevant beyond the question of whether they’re identical or not,
so B::Deobfuscate demonstrates one way of dealing with them: they could
be replaced with some unambiguous representation of the original names
(when so requested by some switch).

But no generalised solution for all identifiers comes to mind.

Then again identifiers should be getting normalised anyway (which Brian
is working on anyhow, I think?), so the question may be less pressing
for them in the first place than it is for strings, which the parser
obviously has to retain faithfully.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT commented Dec 11, 2014

From @cpansprout

This ticket is about whether B::Deparse output should use "\x{100}" or "Ā", whether the latter should be encoded, and whether the output should include ‘use utf8’.

On Thu Jan 05 19​:30​:00 2012, aristotle wrote​:

* Father Chrysostomos via RT <perlbug-followup@​perl.org> [2012-01-06 01​:45]​:

What about symbol names?

They don’t have an easy answer I can think of.

The subset of lexical variable names sort of has one, insofar as they
are irrelevant beyond the question of whether they’re identical or not,
so B::Deobfuscate demonstrates one way of dealing with them: they could
be replaced with some unambiguous representation of the original names
(when so requested by some switch).

But no generalised solution for all identifiers comes to mind.

To make things more complex​: What about /(?<айдэнтыфайер>)/? You can’t escape those characters, because you get a syntax error. You can’t change them, because they correspond to hash keys.
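
The "they correspond to hash keys" part can be seen with a short script. This sketch assumes a perl recent enough to accept non-ASCII capture names and a UTF-8 encoded source file; the matched text is arbitrary.

```perl
use strict;
use warnings;
use utf8;    # this script's own source is UTF-8

binmode STDOUT, ':encoding(UTF-8)';

if ('x' =~ /(?<айдэнтыфайер>x)/) {
    # The capture name shows up verbatim as a key of %+.
    my ($name) = keys %+;
    print "$name => $+{$name}\n";
}
```

Any rewriting of the name in the deparsed pattern would therefore also have to be applied to every `$+{...}` lookup that refers to it.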

Also, the question as to whether coderef2text output should be evallable or evalbytesable is still unanswered.

(My gut feeling is that output from -MO=Deparse should be a stream of bytes, so it can be output without wide char warnings​:

$ ./perl -Ilib -MO=Deparse -e 'use utf8; our $фу'
Wide character in print at lib/B/Deparse.pm line 1588.
use utf8;
our $фу;
-e syntax OK

But that coderef2text should be a Unicode string so it can be fed to ‘eval’.)

--

Father Chrysostomos

p5pRT commented Dec 11, 2014

From @cpansprout

On Wed Dec 10 22​:00​:32 2014, sprout wrote​:

This ticket is about whether B::Deparse output should use "\x{100}" or
"Ā", whether the latter should be encoded, and whether the output
should include ‘use utf8’.

On Thu Jan 05 19​:30​:00 2012, aristotle wrote​:

* Father Chrysostomos via RT <perlbug-followup@​perl.org> [2012-01-06
01​:45]​:

What about symbol names?

They don’t have an easy answer I can think of.

The subset of lexical variable names sort of has one, insofar as they
are irrelevant beyond the question of whether they’re identical or not,
so B::Deobfuscate demonstrates one way of dealing with them: they could
be replaced with some unambiguous representation of the original names
(when so requested by some switch).

But no generalised solution for all identifiers comes to mind.

To make things more complex​: What about /(?<айдэнтыфайер>)/? You
can’t escape those characters, because you get a syntax error. You
can’t change them, because they correspond to hash keys.

Also, the question as to whether coderef2text output should be
evallable or evalbytesable is still unanswered.

(My gut feeling is that output from -MO=Deparse should be a stream of
bytes, so it can be output without wide char warnings​:

$ ./perl -Ilib -MO=Deparse -e 'use utf8; our $фу'
Wide character in print at lib/B/Deparse.pm line 1588.
use utf8;
our $фу;
-e syntax OK

But that coderef2text should be a Unicode string so it can be fed to
‘eval’.)

And here is a similar issue​:

  use utf8;
  my $e = "Böck";
  ok(utf8::is_utf8($e),"got a unicode string - rt75680");

I recently made it so that the "Böck" is output with an escape, just to avoid malformation errors. (It was being emitted as Latin-1, so the output fed back to perl resulted in corrupt strings.)

But now the problem is that the test (from t/re/pat.t) fails, because we no longer have a utf8-flagged string. Granted, this test is too sensitive, in that it is checking the internal storage of a scalar. But this is a *core* test that just ensures that the tests that follow are testing what we think they are testing. This is another case where the core tests don’t lend themselves to being deparsed and re-run.

--

Father Chrysostomos

p5pRT commented Dec 13, 2014

From @ap

* Father Chrysostomos via RT <perlbug-followup@​perl.org> [2014-12-11 07​:05]​:

To make things more complex​: What about /(?<айдэнтыфайер>)/? You
can’t escape those characters, because you get a syntax error. You
can’t change them, because they correspond to hash keys.

Ugh. *scrunchface* Your nose for lurking evil is just too good… :-)

Now, what answer do you expect? If that leaves no other option, then it
leaves no other option. If there is only one way it can work, then that
is the way it has to work. It would still be nice to get string literals
with escapes… But quite evidently now they are a special case, with the
general case going the other way.

Also, the question as to whether coderef2text output should be
evallable or evalbytesable is still unanswered.

(My gut feeling is that output from -MO=Deparse should be a stream of
bytes, so it can be output without wide char warnings​:

$ ./perl -Ilib -MO=Deparse -e 'use utf8; our $фу'
Wide character in print at lib/B/Deparse.pm line 1588.
use utf8;
our $фу;
-e syntax OK

Certainly.

But that coderef2text should be a Unicode string so it can be fed to
‘eval’.)

Seems a wash outside of the usability issue that people are probably
more likely to use `eval` and not even know about `evalbytes`, so yeah,
I suppose.

* Father Chrysostomos via RT <perlbug-followup@​perl.org> [2014-12-11 07​:20]​:

And here is a similar issue​:

    use utf8;
    my $e = "Böck";
    ok(utf8::is_utf8($e),"got a unicode string - rt75680");

I recently made it so that the "Böck" is output with an escape, just
to avoid malformation errors. (It was being emitted as Latin-1, so the
output fed back to perl resulted in corrupt strings.)

But now the problem is that the test (from t/re/pat.t) fails, because
we no longer have a utf8-flagged string. Granted, this test is too
sensitive, in that it is checking the internal storage of a scalar.
But this is a *core* test that just ensures that the tests that follow
are testing what we think they are testing. This is another case where
the core tests don’t lend themselves to being deparsed and re-run.

Is that testing the regexp engine or the parser? If it’s not testing the
parser – and it looks to me like it isn’t – then why is it testing how
the string was parsed, instead of just forcibly up- and downgrading it
as needed to ensure the UTF-8 flag value required by following tests?

It should still assert that the flag has the required value, of course,
just not rely on parser internals to set it.

I don’t like perl making promises that particular forms of writing the
same string as a literal will reliably yield a particular UTF-8 flag
value, and user code should not be relying on that. Of course, due to
the imperfect state of the platform, some code has to care about the
state of the UTF-8 flag, even though ideally none ever would. But even
code which has such legitimate needs should not be relying on exactly
how literals are parsed, IMO. It should upgrade or downgrade explicitly.
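
A minimal sketch of what "upgrade or downgrade explicitly" could look like in a test, using the always-available `utf8::upgrade`, `utf8::downgrade` and `utf8::is_utf8` functions. The string is the one from the test quoted above, written with an escape so that the result does not depend on how the literal was parsed.

```perl
use strict;
use warnings;

my $e = "B\x{f6}ck";    # all code points below 256, so either storage is possible

utf8::upgrade($e);      # force the UTF-8 flagged representation
print "after upgrade: ", (utf8::is_utf8($e) ? "flagged" : "not flagged"), "\n";

my $d = $e;
utf8::downgrade($d);    # force the byte representation; dies if that is impossible
print "after downgrade: ", (utf8::is_utf8($d) ? "flagged" : "not flagged"), "\n";

# The two scalars differ only in internal storage, not in value.
print "equal: ", ($e eq $d ? "yes" : "no"), "\n";
```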

So as far as I care, this is a bug in the test. Not a bug in Deparse.
As far as I care, Deparse here is working correctly (after your fix).

  — • —

OTOH, if the test *were* trying to test the parser, I would say this is
somewhat of a conundrum case. I would still maintain that Deparse works
correctly. It’s just that the test tests something that depends on the
exact form of the source – which Deparse will never be able to promise
to preserve in pristine perfection.

It’s one thing for Deparse to preserve the exact semantics of a program.
It certainly ought to try its damnedest to do that, even if that is not
attainable in the general case. I.e. deviations from this ideal are bugs
even if they have to be considered unfixable.

But because there are many semantically identical representations of any
one thing in a program, and the verbatim original representation is not
preserved, Deparse is, by its very nature, inapplicable to the class of
programs whose semantics depend on the specific choice among multiple
possible representations.

So at best you can test that Deparse re-deparses them consistently after
they have been parsed again, as I recently mused elsewhere. (It ought to
always roundtrip identically when fed its own output.)
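
The consistency check described here can be sketched as follows: deparse a sub, compile the deparsed text, deparse the result, and compare the two texts. The sub body is an arbitrary example, and the script reports the comparison rather than assuming its outcome.

```perl
use strict;
use warnings;
use B::Deparse;

my $deparse = B::Deparse->new;

# First pass: deparse a freshly compiled sub.
my $first = $deparse->coderef2text(sub { my $x = shift; return $x * 2 });

# Second pass: compile the deparsed text, then deparse the result.
my $recompiled = eval "sub $first" or die $@;
my $second     = $deparse->coderef2text($recompiled);

print $first eq $second ? "stable\n" : "differs\n";
```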

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

@khwilliamson
Contributor

What this does in 5.35.10 is

  BEGIN { $^W = 1; }
  use feature 'current_sub', 'bitwise', 'evalbytes', 'fc', 'postderef_qq', 'say', 'state', 'switch', 'unicode_strings', 'unicode_eval';
  say("\x{3b1}-\x{3c9}");
  -e syntax OK

And we get this

  $ perl -CS -MO=Deparse,-p -E 'use utf8; qr/(?<айдэнтыфайер>)/'
  BEGIN { $^W = 1; }
  use feature 'current_sub', 'bitwise', 'evalbytes', 'fc', 'postderef_qq', 'say', 'state', 'switch', 'unicode_strings', 'unicode_eval';
  use utf8;
  qr/(?<\x{430}\x{439}\x{434}\x{44d}\x{43d}\x{442}\x{44b}\x{444}\x{430}\x{439}\x{435}\x{440}>)/u;

This is related to http://nntp.perl.org/group/perl.perl5.porters/262961
