Deparse forgets use utf8 #11334

p5pRT · 2011-05-14T21:14:43Z

Migrated from rt.perl.org#90590 (status was 'open')

Searchable as RT90590$

p5pRT · 2011-05-14T21:14:45Z

From tchrist@perl.com

This should certainly be emitting a use utf8 at the top:

% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
$^H{'feature_unicode'} = q(1);
$^H{'feature_say'} = q(1);
$^H{'feature_state'} = q(1);
$^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK

--tom

Summary of my perl5 (revision 5 version 12 subversion 3) configuration:

Platform:
osname=openbsd, osvers=4.4, archname=OpenBSD.i386-openbsd
uname='openbsd chthon 4.4 generic#0 i386 '
config_args='-des'
hint=recommended, useposix=true, d_sigaction=define
useithreads=undef, usemultiplicity=undef
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=undef, use64bitall=undef, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
optimize='-O2',
cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
ccversion='', gccversion='3.3.5 (propolice)', gccosandvers='openbsd4.4'
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='cc', ldflags ='-Wl,-E -fstack-protector -L/usr/local/lib'
libpth=/usr/local/lib /usr/lib
libs=-lgdbm -lm -lutil -lc
perllibs=-lm -lutil -lc
libc=/usr/lib/libc.so.48.0, so=so, useshrplib=false, libperl=libperl.a
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
cccdlflags='-DPIC -fPIC ', lddlflags='-shared -fPIC -L/usr/local/lib -fstack-protector'

Characteristics of this binary (from libperl):
Compile-time options: PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP
USE_LARGE_FILES USE_PERLIO USE_PERL_ATOF
Built under openbsd
Compiled at Feb 14 2011 07:32:03
%ENV:
PERL_UNICODE="SA"
@INC:
/usr/local/lib/perl5/site_perl/5.12.3/OpenBSD.i386-openbsd
/usr/local/lib/perl5/site_perl/5.12.3
/usr/local/lib/perl5/5.12.3/OpenBSD.i386-openbsd
/usr/local/lib/perl5/5.12.3
/usr/local/lib/perl5/site_perl/5.11.3
/usr/local/lib/perl5/site_perl/5.10.1
/usr/local/lib/perl5/site_perl/5.10.0
/usr/local/lib/perl5/site_perl/5.8.7
/usr/local/lib/perl5/site_perl/5.8.0
/usr/local/lib/perl5/site_perl/5.6.0
/usr/local/lib/perl5/site_perl/5.005
/usr/local/lib/perl5/site_perl
.

p5pRT · 2012-01-05T22:04:42Z

From @cpansprout

On Sat May 14 14:14:45 2011, tom christiansen wrote:

This should certainly be emitting a use utf8 at the top:

% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK

If it were to put ‘use utf8’ at the top, would it not make sense for it
to output a stream of utf8 bytes without -CS?

After all, ‘use utf8’ indicates that the *bytes* that follow consist of
utf8.

And should it behave differently depending on whether the output is
going straight to STDERR/OUT (whichever it is) or being returned by
coderef2text?

--

Father Chrysostomos

p5pRT · 2012-01-05T22:04:43Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2012-01-05T22:12:54Z

From @cpansprout

On Thu Jan 05 14:04:42 2012, sprout wrote:

On Sat May 14 14:14:45 2011, tom christiansen wrote:
This should certainly be emitting a use utf8 at the top:
% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK
If it were to put ‘use utf8’ at the top, would it not make sense for it
to output a stream of utf8 bytes without -CS?

After all, ‘use utf8’ indicates that the *bytes* that follow consist of
utf8.

And should it behave differently depending on whether the output is
going straight to STDERR/OUT (whichever it is) or being returned by
coderef2text?

Don’t forget that (under use v5.16) eval("'\x{100}'") does the same
thing as evalbytes("use utf8; '\xc4\x80'").
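The equivalence mentioned above can be checked directly. This is a minimal sketch, assuming Perl 5.16 or later so that `use v5.16` turns on the `unicode_eval` and `evalbytes` features; the variable names are illustrative only.

```perl
use v5.16;   # enables the 'unicode_eval' and 'evalbytes' features

# Character-string eval: the source text is a sequence of characters.
my $from_eval = eval "'\x{100}'";

# Byte-string eval: the source text is UTF-8 bytes, and 'use utf8'
# tells the parser to decode them.
my $from_evalbytes = evalbytes "use utf8; '\xc4\x80'";

say $from_eval eq $from_evalbytes ? "same string" : "different strings";
say sprintf "U+%04X, length %d", ord($from_evalbytes), length($from_evalbytes);
```

Both forms should yield the one-character string U+0100, which is why the choice between emitting characters and emitting encoded bytes plus `use utf8` matters for whoever consumes the Deparse output.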

--

Father Chrysostomos

p5pRT · 2012-01-05T22:24:41Z

From @ikegami

On Thu, Jan 5, 2012 at 5:04 PM, Father Chrysostomos via RT <
perlbug-followup@perl.org> wrote:

On Sat May 14 14:14:45 2011, tom christiansen wrote:
This should certainly be emitting a use utf8 at the top:
% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK
If it were to put ‘use utf8’ at the top, would it not make sense for it
to output a stream of utf8 bytes without -CS?

That's not relevant. The issue is that the string built by the program
generated by Deparse is different than the string built by the original
program.

$ perl -E'$_="\N{U+3b1}-\N{U+3c9}"; say length;'
3

$ perl -MO=Deparse -E'$_="\N{U+3b1}-\N{U+3c9}"; say length;' | perl
Wide character in print at
/home/eric/usr/perlbrew/perls/perl-5.14.0t/lib/5.14.0/B/Deparse.pm line
1213.
-e syntax OK
5
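Why the second run reports 5 can be reproduced in-process. The following is only a sketch, assuming Perl 5.16 or later and the core `Encode` module; it stands in for the pipe by encoding the literal to UTF-8 bytes and evaluating those bytes with and without `use utf8`.

```perl
use v5.16;
use Encode qw(encode);

# The deparsed literal as a character string, then as the UTF-8 bytes
# that actually travel down the pipe.
my $literal = qq{"\x{3b1}-\x{3c9}"};
my $bytes   = encode('UTF-8', $literal);

# Without 'use utf8' every byte of the source becomes one character.
my $without_utf8 = evalbytes($bytes);

# With 'use utf8' the parser decodes the bytes back into characters.
my $with_utf8 = evalbytes("use utf8; " . $bytes);

say length $without_utf8;
say length $with_utf8;
```

Alpha and omega each take two bytes in UTF-8 and the hyphen takes one, so the first length is 5 and the second is 3.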

p5pRT · 2012-01-05T22:39:37Z

From @cpansprout

On Thu Jan 05 14:24:41 2012, ikegami@adaelis.com wrote:

On Thu, Jan 5, 2012 at 5:04 PM, Father Chrysostomos via RT <
perlbug-followup@perl.org> wrote:
On Sat May 14 14:14:45 2011, tom christiansen wrote:
This should certainly be emitting a use utf8 at the top:
% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK
If it were to put ‘use utf8’ at the top, would it not make sense for it
to output a stream of utf8 bytes without -CS?
That's not relevant. The issue is that the string built by the program
generated by Deparse is different than the string built by the original
program.

$ perl -E'$_="\N{U+3b1}-\N{U+3c9}"; say length;'
3

$ perl -MO=Deparse -E'$_="\N{U+3b1}-\N{U+3c9}"; say length;' | perl
Wide character in print at
/home/eric/usr/perlbrew/perls/perl-5.14.0t/lib/5.14.0/B/Deparse.pm line
1213.
-e syntax OK
5

Note the wide character warning. If one were to eval() the string
instead of outputting it, it would produce the same result.

It makes sense to me to *encode* output as utf8 by default, with ‘use
utf8’. But unless we are going to encode it before it reaches the
PerlIO layer (because when we have ‘use utf8’, the bytes, not the
characters, make up the source code), it doesn’t make sense (to me) to
add ‘use utf8’.

--

Father Chrysostomos

p5pRT · 2012-01-05T22:43:58Z

From @ikegami

On Thu, Jan 5, 2012 at 5:39 PM, Father Chrysostomos via RT <
perlbug-followup@perl.org> wrote:

unless we are going to encode it before it reaches the
PerlIO layer (because when we have ‘use utf8’, the bytes, not the
characters, make up the source code), it doesn’t make sense (to me) to
add ‘use utf8’.

Agree.

It makes sense to me to *encode* output as utf8 by default, with ‘use utf8’.

Agree.

p5pRT · 2012-01-06T00:33:41Z

From @ap

* Father Chrysostomos via RT <perlbug-followup@perl.org> [2012-01-05 23:05]:

On Sat May 14 14:14:45 2011, tom christiansen wrote:
This should certainly be emitting a use utf8 at the top:
% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK
If it were to put ‘use utf8’ at the top, would it not make sense for
it to output a stream of utf8 bytes without -CS?

After all, ‘use utf8’ indicates that the *bytes* that follow consist
of utf8.

And should it behave differently depending on whether the output is
going straight to STDERR/OUT (whichever it is) or being returned by
coderef2text?

I think Deparse needs to do better for strings. The output here should
look like this instead:

BEGIN {
$^H{'feature_unicode'} = q(1);
$^H{'feature_say'} = q(1);
$^H{'feature_state'} = q(1);
$^H{'feature_switch'} = q(1);
}
say("\x{03B1}-\x{03C9}");

That would be independent of encodings (well, beyond… ASCII I guess) as
well as semantically explicit.
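One way such escaping could be done is sketched below. `ascii_literal` is a hypothetical helper written for illustration and is not how B::Deparse itself is implemented; it assumes the goal is an ASCII-only double-quoted literal.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of B::Deparse: render a character string
# as an ASCII-only double-quoted Perl literal.
sub ascii_literal {
    my ($str) = @_;
    my $body = $str;
    # Protect characters that are special inside double quotes.
    $body =~ s/([\\"\$\@])/\\$1/g;
    # Replace everything outside printable ASCII with a \x{...} escape.
    $body =~ s/([^\x20-\x7e])/sprintf('\\x{%04X}', ord $1)/ge;
    return qq("$body");
}

my $literal   = ascii_literal("\x{3b1}-\x{3c9}");
my $roundtrip = eval $literal;

print "$literal\n";
print length($roundtrip), "\n";
```

Because the literal contains only ASCII, it evaluates to the same three-character string whether or not the consumer declares `use utf8` and whatever encoding the output stream uses.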

Currently Deparse actually does the opposite transform for strings – if
you have a "\x{03B1}" in your source it will claim Perl saw a 'α'. That
is correct but not truly in the spirit of “showing you what Perl thought
you meant”, esp. when you consider that it gives you no (easy) way to
tell whether that 'ñ' was really "\x{F1}" or actually "n\x{0303}".
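The ambiguity is easy to demonstrate. A small sketch, assuming the core `Unicode::Normalize` module, with ñ written once precomposed (U+00F1) and once as 'n' plus a combining tilde:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $precomposed = "\x{F1}";      # LATIN SMALL LETTER N WITH TILDE
my $decomposed  = "n\x{303}";    # 'n' followed by COMBINING TILDE

printf "precomposed: length %d\n", length $precomposed;
printf "decomposed:  length %d\n", length $decomposed;
print  "equal as strings: ", ($precomposed eq $decomposed ? "yes" : "no"), "\n";
print  "equal after NFC:  ", (NFC($decomposed) eq $precomposed ? "yes" : "no"), "\n";
```

The two render identically on most terminals yet compare unequal, so a literal glyph in deparsed output hides which one the program actually contains.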

(Maybe it should even use \N by default. In fact I would be sure, if it
weren’t for the verbosity that this entails. As things are, I’d say that
it should be requestable by argument instead and can in any case be left
out for later.)

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT · 2012-01-06T00:42:59Z

From @cpansprout

On Thu Jan 05 16:33:41 2012, aristotle wrote:

* Father Chrysostomos via RT <perlbug-followup@perl.org> [2012-01-05
23:05]:
On Sat May 14 14:14:45 2011, tom christiansen wrote:
This should certainly be emitting a use utf8 at the top:
% perl -CS -MO=Deparse,-p -E 'say "\N{U+3b1}-\N{U+3c9}"'
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say('α-ω');
-e syntax OK
If it were to put ‘use utf8’ at the top, would it not make sense for
it to output a stream of utf8 bytes without -CS?

After all, ‘use utf8’ indicates that the *bytes* that follow consist
of utf8.

And should it behave differently depending on whether the output is
going straight to STDERR/OUT (whichever it is) or being returned by
coderef2text?
I think Deparse needs to do better for strings. The output here should
look like this instead:
BEGIN {
    $^H{'feature_unicode'} = q(1);
    $^H{'feature_say'} = q(1);
    $^H{'feature_state'} = q(1);
    $^H{'feature_switch'} = q(1);
}
say("\x{03B1}-\x{03C9}");
That would be independent of encodings (well, beyond… ASCII I guess) as
well as semantically explicit.

Currently Deparse actually does the opposite transform for strings – if
you have a "\x{03B1}" in your source it will claim Perl saw a 'α'. That
is correct but not truly in the spirit of “showing you what Perl thought
you meant”, esp. when you consider that it gives you no (easy) way to
tell whether that 'ñ' was really "\x{F1}" or actually "n\x{0303}".

What about symbol names?

--

Father Chrysostomos

p5pRT · 2012-01-06T03:30:00Z

From @ap

* Father Chrysostomos via RT <perlbug-followup@perl.org> [2012-01-06 01:45]:

What about symbol names?

They don’t have an easy answer I can think of.

The subset of lexical variable names sort of has one, insofar as they
are irrelevant beyond the question of whether they’re identical or not,
so B::Deobfuscate demonstrates one way of dealing with them: they could
be replaced with some unambiguous representation of the original names
(when so requested by some switch).

But no generalised solution for all identifiers comes to mind.

Then again identifiers should be getting normalised anyway (which Brian
is working on anyhow, I think?), so the question may be less pressing
for them in the first place than it is for strings, which the parser
obviously has to retain faithfully.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT · 2014-12-11T06:00:32Z

From @cpansprout

This ticket is about whether B::Deparse output should use "\x{100}" or "Ā" and whether the latter should be encoded or not and whether the output should include ‘use utf8’.

On Thu Jan 05 19:30:00 2012, aristotle wrote:

* Father Chrysostomos via RT <perlbug-followup@perl.org> [2012-01-06 01:45]:

What about symbol names?

They don’t have an easy answer I can think of.

The subset of lexical variable names sort of has one, insofar as they
are irrelevant beyond the question of whether they’re identical or not,
so B::Deobfuscate demonstrates one way of dealing with them: they could
be replaced with some unambiguous representation of the original names
(when so requested by some switch).

But no generalised solution for all identifiers comes to mind.

To make things more complex: What about /(?<айдэнтыфайер>)/? You can’t escape those characters, because you get a syntax error. You can’t change them, because they correspond to hash keys.
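The hash-key correspondence can be seen with a short sketch. It assumes a reasonably recent perl with Unicode support in capture-group names; the group name имя is chosen purely for illustration.

```perl
use strict;
use warnings;
use utf8;   # the source below contains a non-ASCII capture-group name

my ($value, @names);
if ("abc" =~ /(?<имя>b)/) {
    $value = $+{'имя'};   # the group name is the key into %+
    @names = keys %+;
}

print "captured: $value\n";
print "number of named groups: ", scalar(@names), "\n";
```

Since user code looks the capture up under the exact name, a deparser cannot rename the group without also changing the program's behaviour.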

Also, the question as to whether coderef2text output should be evallable or evalbytesable is still unanswered.

(My gut feeling is that output from -MO=Deparse should be a stream of bytes, so it can be output without wide char warnings:

$ ./perl -Ilib -MO=Deparse -e 'use utf8; our $фу'
Wide character in print at lib/B/Deparse.pm line 1588.
use utf8;
our $фу;
-e syntax OK

But that coderef2text should be a Unicode string so it can be fed to ‘eval’.)
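The two kinds of consumer can be sketched as follows, assuming the core `B::Deparse` and `Encode` modules. Whether the deparsed text contains literal non-ASCII characters or `\x{...}` escapes depends on the perl version, and the sketch works either way.

```perl
use strict;
use warnings;
use B::Deparse;
use Encode qw(encode);

my $code = sub { return "\x{3b1}-\x{3c9}" };
my $text = B::Deparse->new->coderef2text($code);   # character string

# In-process consumer: feed the characters straight to eval.
my $rebuilt = eval "sub $text" or die $@;
print length($rebuilt->()), "\n";

# External consumer: encode to bytes before printing, so no
# "Wide character" warning can occur.
print encode('UTF-8', $text), "\n";
```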

--

Father Chrysostomos

p5pRT · 2014-12-11T06:16:32Z

From @cpansprout

On Wed Dec 10 22:00:32 2014, sprout wrote:

This ticket is about whether B::Deparse output should use "\x{100}" or
"Ā" and whether the latter should be encoded or not and whether the
output should include ‘use utf8’.

On Thu Jan 05 19:30:00 2012, aristotle wrote:

* Father Chrysostomos via RT <perlbug-followup@perl.org> [2012-01-06
01:45]:

What about symbol names?

They don’t have an easy answer I can think of.

The subset of lexical variable names sort of has one, insofar as they
are irrelevant beyond the question of whether they’re identical or
not,
so B::Deobfuscate demonstrates one way of dealing with them: they
could
be replaced with some unambiguous representation of the original
names
(when so requested by some switch).

But no generalised solution for all identifiers comes to mind.

To make things more complex: What about /(?<айдэнтыфайер>)/? You
can’t escape those characters, because you get a syntax error. You
can’t change them, because they correspond to hash keys.

Also, the question as to whether coderef2text output should be
evallable or evalbytesable is still unanswered.

(My gut feeling is that output from -MO=Deparse should be a stream of
bytes, so it can be output without wide char warnings:

$ ./perl -Ilib -MO=Deparse -e 'use utf8; our $фу'
Wide character in print at lib/B/Deparse.pm line 1588.
use utf8;
our $фу;
-e syntax OK

But that coderef2text should be a Unicode string so it can be fed to
‘eval’.)

And here is a similar issue:

use utf8;
my $e = "Böck";
ok(utf8::is_utf8($e),"got a unicode string - rt75680");

I recently made it so that the "Böck" is output with an escape, just to avoid malformation errors. (It was being emitted as Latin-1, so the output fed back to perl resulted in corrupt strings.)

But now the problem is that the test (from t/re/pat.t) fails, because we no longer have a utf8-flagged string. Granted, this test is too sensitive, in that it is checking the internal storage of a scalar. But this is a *core* test that just ensures that the tests that follow are testing what we think they are testing. This is another case where the core tests don’t lend themselves to being deparsed and re-run.

--

Father Chrysostomos

p5pRT · 2014-12-13T09:45:59Z

From @ap

* Father Chrysostomos via RT <perlbug-followup@perl.org> [2014-12-11 07:05]:

To make things more complex: What about /(?<айдэнтыфайер>)/? You
can’t escape those characters, because you get a syntax error. You
can’t change them, because they correspond to hash keys.

Ugh. *scrunchface* Your nose for lurking evil is just too good… :-)

Now, what answer do you expect? If that leaves no other option, then it
leaves no other option. If there is only one way it can work, then that
is the way it has to work. It would still be nice to get string literals
with escapes… But quite evidently now they are a special case, with the
general case going the other way.

Also, the question as to whether coderef2text output should be
evallable or evalbytesable is still unanswered.

(My gut feeling is that output from -MO=Deparse should be a stream of
bytes, so it can be output without wide char warnings:

$ ./perl -Ilib -MO=Deparse -e 'use utf8; our $фу'
Wide character in print at lib/B/Deparse.pm line 1588.
use utf8;
our $фу;
-e syntax OK

Certainly.

But that coderef2text should be a Unicode string so it can be fed to
‘eval’.)

Seems a wash outside of the usability issue that people are probably
more likely to use `eval` and not even know about `evalbytes`, so yeah,
I suppose.

* Father Chrysostomos via RT <perlbug-followup@perl.org> [2014-12-11 07:20]:

And here is a similar issue:
    use utf8;
    my $e = "Böck";
    ok(utf8::is_utf8($e),"got a unicode string - rt75680");
I recently made it so that the "Böck" is output with an escape, just
to avoid malformation errors. (It was being emitted as Latin-1, so the
output fed back to perl resulted in corrupt strings.)

But now the problem is that the test (from t/re/pat.t) fails, because
we no longer have a utf8-flagged string. Granted, this test is too
sensitive, in that it is checking the internal storage of a scalar.
But this is a *core* test that just ensures that the tests that follow
are testing what we think they are testing. This is another case where
the core tests don’t lend themselves to being deparsed and re-run.

Is that testing the regexp engine or the parser? If it’s not testing the
parser – and it looks to me like it isn’t – then why is it testing how
the string was parsed, instead of just forcibly up- and downgrading it
as needed to ensure the UTF-8 flag value required by following tests?

It should still assert that the flag has the required value, of course,
just not rely on parser internals to set it.

I don’t like perl making promises that particular forms of writing the
same string as a literal will reliably yield a particular UTF-8 flag
value, and user code should not be relying on that. Of course, due to
the imperfect state of the platform, some code has to care about the
state of the UTF-8 flag, even though ideally none ever would. But even
code which has such legitimate needs should not be relying on exactly
how literals are parsed, IMO. It should upgrade or downgrade explicitly.
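A sketch of what upgrading and downgrading explicitly might look like in such a test, using the built-in `utf8::upgrade`, `utf8::downgrade` and `utf8::is_utf8` functions; the variable names are illustrative.

```perl
use strict;
use warnings;

my $e        = "B\x{f6}ck";   # how the parser stores this is not relied upon
my $original = $e;

utf8::upgrade($e);            # force the UTF-8 internal representation
print "flag after upgrade:   ", (utf8::is_utf8($e) ? 1 : 0), "\n";
my $equal_when_upgraded = $e eq $original;

utf8::downgrade($e);          # force the byte representation again
print "flag after downgrade: ", (utf8::is_utf8($e) ? 1 : 0), "\n";
my $equal_when_downgraded = $e eq $original;

print "string value unchanged throughout\n"
    if $equal_when_upgraded && $equal_when_downgraded;
```

The flag is set by the test itself, so the assertion holds no matter which spelling of the literal the source (or the deparsed source) happens to use.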

So as far as I care, this is a bug in the test. Not a bug in Deparse.
As far as I care, Deparse here is working correctly (after your fix).

— • —

OTOH, if the test *were* trying to test the parser, I would say this is
somewhat of a conundrum case. I would still maintain that Deparse works
correctly. It’s just that the test tests something that depends on the
exact form of the source – which Deparse will never be able to promise
to preserve in pristine perfection.

It’s one thing for Deparse to preserve the exact semantics of a program.
It certainly ought to try its damnedest to do that, even if that is not
attainable in the general case. I.e. deviations from this ideal are bugs
even if they have to be considered unfixable.

But because there are many semantically identical representations of any
one thing in a program, and the verbatim original representation is not
preserved, Deparse is, by its very nature, inapplicable to the class of
programs whose semantics depend on the specific choice among multiple
possible representations.

So at best you can test that Deparse re-deparses them consistently after
they have been parsed again, as I recently mused elsewhere. (It ought to
always roundtrip identically when fed its own output.)
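Such a roundtrip check could look roughly like this. It is a sketch only, assuming the core `B::Deparse` module and a trivial sub; it reports whether the second deparse matches the first rather than presuming that it does.

```perl
use strict;
use warnings;
use B::Deparse;

my $deparser = B::Deparse->new;

# First pass: deparse the original code.
my $first = $deparser->coderef2text(sub { my $x = shift; return $x * 2 + 1 });

# Parse the deparsed text again, then deparse the result.
my $reparsed = eval "sub $first" or die $@;
my $second   = $deparser->coderef2text($reparsed);

print $first eq $second ? "roundtrip is stable\n" : "roundtrip differs\n";
```

The comparison is between two deparsed forms, never against the original source, so it does not depend on Deparse preserving the author's spelling.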

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

khwilliamson · 2022-04-11T21:08:34Z

What this does in 5.35.10 is

BEGIN { $^W = 1; }
use feature 'current_sub', 'bitwise', 'evalbytes', 'fc', 'postderef_qq', 'say', 'state', 'switch', 'unicode_strings', 'unicode_eval';
say("\x{3b1}-\x{3c9}");
-e syntax OK

And we get this

perl -CS -MO=Deparse,-p -E 'use utf8; qr/(?<айдэнтыфайер>)/'
BEGIN { $^W = 1; }
use feature 'current_sub', 'bitwise', 'evalbytes', 'fc', 'postderef_qq', 'say', 'state', 'switch', 'unicode_strings', 'unicode_eval';
use utf8;
qr/(?<\x{430}\x{439}\x{434}\x{44d}\x{43d}\x{442}\x{44b}\x{444}\x{430}\x{439}\x{435}\x{440}>)/u;

This is related to http://nntp.perl.org/group/perl.perl5.porters/262961

p5pRT added Severity Low distro-openbsd labels Oct 19, 2019

p5pRT mentioned this issue Oct 19, 2019

substitution within (?{}) causes segmentation fault #15353

Open

p5pRT added the khw label Oct 25, 2019

toddr removed the khw label Oct 25, 2019

xenu removed the Severity Low label Dec 29, 2021
