dump.c cannot dump Unicode stash names #11762

p5pRT · 2011-11-21T00:23:45Z

Migrated from rt.perl.org#104116 (status was 'resolved')

Searchable as RT104116$

From @cpansprout

$ ./perl -Ilib -Mutf8 -MDevel::Peek -e '*fòò:: = *bǎr::; Dump \%fòò::'
SV = IV(0x8666cc) at 0x8666d0
REFCNT = 1
FLAGS = (TEMP,ROK)
RV = 0x867e40
SV = PVHV(0x808330) at 0x867e40
REFCNT = 2
FLAGS = (OOK,SHAREKEYS)
ARRAY = 0x28c610
KEYS = 0
FILL = 0
MAX = 7
RITER = -1
EITER = 0x0
NAME = "bǎr"
NAMECOUNT = 2
ENAME = "bǎr", "f??"

Those question marks represent Latin-1 bytes that my UTF-8 terminal could not render. bǎr is output in UTF-8.

Flags:
category=core
severity=low

Site configuration information for perl 5.15.4:

Configured by sprout at Wed Nov 2 09:06:14 PDT 2011.

Summary of my perl5 (revision 5 version 15 subversion 4) configuration:
Snapshot of: f364061
Platform:
osname=darwin, osvers=10.5.0, archname=darwin-thread-multi-2level
uname='darwin pint.local 10.5.0 darwin kernel version 10.5.0: fri nov 5 23:20:39 pdt 2010; root:xnu-1504.9.17~1release_i386 i386 '
config_args='-de -Doptimize=-g -Dusedevel -Duseithreads -Dmad'
hint=recommended, useposix=true, d_sigaction=define
useithreads=define, usemultiplicity=define
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=undef, use64bitall=undef, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-fno-common -DPERL_DARWIN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
optimize='-g',
cppflags='-fno-common -DPERL_DARWIN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
ccversion='', gccversion='4.2.1 (Apple Inc. build 5664)', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags =' -fstack-protector -L/usr/local/lib'
libpth=/usr/local/lib /usr/lib
libs=-ldbm -ldl -lm -lutil -lc
perllibs=-ldl -lm -lutil -lc
libc=, so=dylib, useshrplib=false, libperl=libperl.a
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup -L/usr/local/lib -fstack-protector'

Locally applied patches:

@INC for perl 5.15.4:
/usr/local/lib/perl5/site_perl/5.15.4/darwin-thread-multi-2level
/usr/local/lib/perl5/site_perl/5.15.4
/usr/local/lib/perl5/5.15.4/darwin-thread-multi-2level
/usr/local/lib/perl5/5.15.4
/usr/local/lib/perl5/site_perl
.

Environment for perl 5.15.4:
DYLD_LIBRARY_PATH (unset)
HOME=/Users/sprout
LANG=en_US.UTF-8
LANGUAGE (unset)
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:/usr/local/bin
PERL_BADLANG (unset)
SHELL=/bin/bash

p5pRT · 2012-02-03T04:40:07Z

From @Hugmeir

On Sun Nov 20 16:23:45 2011, sprout wrote:

$ ./perl -Ilib -Mutf8 -MDevel::Peek -e '*fòò:: = *bǎr::; Dump \%fòò::'
SV = IV(0x8666cc) at 0x8666d0
REFCNT = 1
FLAGS = (TEMP,ROK)
RV = 0x867e40
SV = PVHV(0x808330) at 0x867e40
REFCNT = 2
FLAGS = (OOK,SHAREKEYS)
ARRAY = 0x28c610
KEYS = 0
FILL = 0
MAX = 7
RITER = -1
EITER = 0x0
NAME = "bǎr"
NAMECOUNT = 2
ENAME = "bǎr", "f??"

Those question marks represent Latin-1 bytes that my UTF-8 terminal
could not render. bǎr is output in UTF-8.

Howdy all. I'm looking for opinions on how to go about fixing this.
Usually, when Devel::Peek finds something with the UTF-8 flag on, it'll
display it like this:
$ perl -MDevel::Peek -E 'Dump "\x{30cb}"'

SV = PV(0x90d8a94) at 0x90f5b4c
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK,UTF8)
PV = 0x90fdc74 "\343\203\213"\0 [UTF8 "\x{30cb}"]
CUR = 3
LEN = 12

That is, "escaped-bytestring"\0 [UTF8 "escaped-character-string"].
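A minimal C sketch of that two-part convention, for illustration only. The helper names `escape_bytes`, `decode_utf8` and `escape_chars` are hypothetical and are not functions from dump.c; the decoder assumes well-formed UTF-8 and does not reject overlong forms.

```c
#include <stdio.h>
#include <stddef.h>

/* Render raw bytes: printable ASCII as-is, everything else as 3-digit octal. */
static size_t escape_bytes(const unsigned char *s, size_t len,
                           char *out, size_t outsz)
{
    size_t n = 0, i;
    if (outsz == 0) return 0;
    for (i = 0; i < len && n + 5 <= outsz; i++) {
        if (s[i] >= 0x20 && s[i] < 0x7f && s[i] != '"' && s[i] != '\\')
            out[n++] = (char)s[i];
        else
            n += (size_t)snprintf(out + n, outsz - n, "\\%03o", (unsigned)s[i]);
    }
    out[n] = '\0';
    return n;
}

/* Decode one UTF-8 sequence; returns bytes consumed, or 0 if malformed. */
static size_t decode_utf8(const unsigned char *s, size_t len, unsigned long *cp)
{
    size_t need, i;
    unsigned long c;
    if (len == 0) return 0;
    c = s[0];
    if (c < 0x80) need = 1;
    else if ((c & 0xe0) == 0xc0) { need = 2; c &= 0x1f; }
    else if ((c & 0xf0) == 0xe0) { need = 3; c &= 0x0f; }
    else if ((c & 0xf8) == 0xf0) { need = 4; c &= 0x07; }
    else return 0;
    if (need > len) return 0;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xc0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3f);
    }
    *cp = c;
    return need;
}

/* Render decoded characters: printable ASCII as-is, others as \x{...}. */
static size_t escape_chars(const unsigned char *s, size_t len,
                           char *out, size_t outsz)
{
    size_t n = 0, i = 0;
    if (outsz == 0) return 0;
    while (i < len && n + 13 <= outsz) {
        unsigned long cp;
        size_t used = decode_utf8(s + i, len - i, &cp);
        if (used == 0) break;               /* stop at malformed input */
        if (cp >= 0x20 && cp < 0x7f && cp != '"' && cp != '\\')
            out[n++] = (char)cp;
        else
            n += (size_t)snprintf(out + n, outsz - n, "\\x{%lx}", cp);
        i += used;
    }
    out[n] = '\0';
    return n;
}
```

Given the three bytes 0xE3 0x83 0x8B, `escape_bytes` produces `\343\203\213` and `escape_chars` produces `\x{30cb}`, matching the two halves of the PV line shown above.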

Should it also follow that convention for stash names and the like? Or
should it just show the escaped character string? Neither of those, and
output UTF8 when possible? Or something else entirely?

To get the point across, here's something like what the output would
look like for the three options:

$ perl -MDevel::Peek -E '*{"f\xe9::"} = *{"b\x{30cb}::"}; Dump \%{"f\xe9::"}'

First option
SV = IV(0x8d62b38) at 0x8d62b3c
REFCNT = 1
FLAGS = (TEMP,ROK)
RV = 0x8e278bc
SV = PVHV(0x8d1b39c) at 0x8e278bc
REFCNT = 2
FLAGS = (OOK,SHAREKEYS)
ARRAY = 0x8e2399c
KEYS = 0
FILL = 0
MAX = 7
RITER = -1
EITER = 0x0
NAME = "b\343\203\213" [UTF8 "b\x{30cb}"]
NAMECOUNT = 2
ENAME = "b\343\203\213" [UTF8 "b\x{30cb}"], "f\351"

Second option
SV = IV(0x8d62b38) at 0x8d62b3c
REFCNT = 1
FLAGS = (TEMP,ROK)
RV = 0x8e278bc
SV = PVHV(0x8d1b39c) at 0x8e278bc
REFCNT = 2
FLAGS = (OOK,SHAREKEYS)
ARRAY = 0x8e2399c
KEYS = 0
FILL = 0
MAX = 7
RITER = -1
EITER = 0x0
NAME = "b\x{30cb}"
NAMECOUNT = 2
ENAME = "b\x{30cb}", "f\351"

Third option
SV = IV(0x8d62b38) at 0x8d62b3c
REFCNT = 1
FLAGS = (TEMP,ROK)
RV = 0x8e278bc
SV = PVHV(0x8d1b39c) at 0x8e278bc
REFCNT = 2
FLAGS = (OOK,SHAREKEYS)
ARRAY = 0x8e2399c
KEYS = 0
FILL = 0
MAX = 7
RITER = -1
EITER = 0x0
NAME = "bニ"
NAMECOUNT = 2
ENAME = "bニ", "fé"

Personally, I think the first option sucks -- it sort of starts alright
but breaks down easily for other types, like a coderef, which gets
printed like "STASH" :: "NAME", and would become "STASH" [UTF8
"STASH"] :: "NAME" [UTF8 "NAME"] if both of those were in UTF8. But
admittedly this is a stylistic concern more than anything.

Meanwhile, the third makes debugging dependent on having a font that
can display all symbols and not getting anything invisible in your
names -- otherwise, good luck discerning \xe9 and e\N{COMBINING ACUTE
ACCENT}, or why ref(bless {}, "Yep") ne ref(bless {}, "Ye\0p").

So I will probably go for the second, unless someone has objections
and/or a better idea.

p5pRT · 2012-02-03T04:40:08Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2012-02-03T06:47:28Z

From @cpansprout

On Thu Feb 02 20:40:07 2012, Hugmeir wrote:

So I will probably go for the second, unless someone has objections
and/or a better idea.

2 sounds good.

--

Father Chrysostomos

p5pRT · 2012-02-03T13:41:19Z

From @nwc10

On Thu, Feb 02, 2012 at 08:40:08PM -0800, Brian Fraser via RT wrote:

Second option
SV = IV(0x8d62b38) at 0x8d62b3c
REFCNT = 1
FLAGS = (TEMP,ROK)
RV = 0x8e278bc
SV = PVHV(0x8d1b39c) at 0x8e278bc
REFCNT = 2
FLAGS = (OOK,SHAREKEYS)
ARRAY = 0x8e2399c
KEYS = 0
FILL = 0
MAX = 7
RITER = -1
EITER = 0x0
NAME = "b\x{30cb}"
NAMECOUNT = 2
ENAME = "b\x{30cb}", "f\351"

You're intentionally using octal to distinguish things-as-bytes from hex for
things-as-UTF-8? Or is that just a side effect of the values chosen?

Because I'm thinking that some (documented, unambiguous) convention like that
would make for better reading than an explicit longhand character sequence
all the time.

Personally, I think the first option sucks -- it sort of starts alright
but breaks down easily for other types, like a coderef, which gets
printed like "STASH" :: "NAME", and would become "STASH" [UTF8
"STASH"] :: "NAME" [UTF8 "NAME"] if both of those were in UTF8. But
admittedly this is a stylistic concern more than anything.

Meanwhile, the third makes debugging dependent on having a font that
can display all symbols and not getting anything invisible in your
names -- otherwise, good luck discerning \xe9 and e\N{COMBINING ACUTE
ACCENT}, or why ref(bless {}, "Yep") ne ref(bless {}, "Ye\0p").

Yes, but in both cases the "style" is really about conveying information
accurately without clutter, so it's important.

So I will probably go for the second, unless someone has objections
and/or a better idea.

Yes, the second looks the best idea (so far)
We can change it if someone has a better idea. The dump format isn't
sacrosanct.

Nicholas Clark

p5pRT · 2012-02-03T16:48:38Z

From @demerphq

On 3 February 2012 14:40, Nicholas Clark <nick@ccl4.org> wrote:

ENAME = "b\x{30cb}", "f\351"

You're intentionally using octal to distinguish things-as-bytes from hex for
things-as-UTF-8? Or is that just a side effect of the values chosen?

As an aside, there are a number of bits of code that use octal for
codepoints <= 255, and hex for codepoints > 255.

I personally hate it; for some reason I don't think in octal anywhere
near as well as in hex, and I find it really confusing when the same line
has both. The code for emitting a quoted escaped string in perl
supports a few modes, we could decide to use whatever we want.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2012-02-04T10:12:36Z

From @demerphq

On 3 February 2012 17:48, demerphq <demerphq@gmail.com> wrote:

On 3 February 2012 14:40, Nicholas Clark <nick@ccl4.org> wrote:

ENAME = "b\x{30cb}", "f\351"

You're intentionally using octal to distinguish things-as-bytes from hex for
things-as-UTF-8? Or is that just a side effect of the values chosen?

As an aside, there are a number of bits of code that use octal for
codepoints <= 255, and hex for codepoints > 255.

I personally hate it; for some reason I don't think in octal anywhere
near as well as in hex, and I find it really confusing when the same line
has both. The code for emitting a quoted escaped string in perl
supports a few modes, we could decide to use whatever we want.

I should have added that the argument I have seen in at least one
place in the code (as a comment) is that octal is used for low byte
escapes because it is shorter. IOW, 100 nulls will be 200 chars long,
whereas with unbraced hex it would be 400, and with braces 500.
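The length arithmetic can be checked with a short C sketch. The helper name `escaped_length` is hypothetical, and the three escape spellings (`\0`, `\x00`, `\x{0}`) are assumed from the figures quoted above.

```c
#include <stddef.h>
#include <string.h>

/* Characters needed to print `count` copies of one escape sequence. */
static size_t escaped_length(const char *escape, size_t count)
{
    return strlen(escape) * count;
}
```

For 100 NULs this gives 200 for `\0` (2 characters each), 400 for `\x00` (4 each), and 500 for `\x{0}` (5 each).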

I personally think for stuff like this the rule should be, if there is
a named escape use it, if it is null use \0, otherwise use braced hex
if it is codepoints, and unbraced hex (2 digit) if it is bytes being
output.
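A rough C sketch of that rule for a single value, under stated assumptions: `escape_one` and the `named` table are hypothetical names, the set of named escapes is a guess (the thread does not list them), and the caller says whether the value is a decoded code point or a raw byte.

```c
#include <stdio.h>
#include <stddef.h>

struct named_escape { unsigned long value; const char *text; };

/* Assumed set of named escapes; adjust to taste. */
static const struct named_escape named[] = {
    { '\n', "\\n" }, { '\t', "\\t" }, { '\r', "\\r" },
    { '\f', "\\f" }, { '\a', "\\a" }, { 0x1b, "\\e" },
    { '\\', "\\\\" }, { '"', "\\\"" },
};

/* Write the escaped form of v into out (16 bytes is enough).
   is_codepoint != 0: v is a character; 0: v is a raw byte. */
static void escape_one(unsigned long v, int is_codepoint,
                       char *out, size_t outsz)
{
    size_t i;
    for (i = 0; i < sizeof named / sizeof named[0]; i++) {
        if (named[i].value == v) {          /* named escape wins */
            snprintf(out, outsz, "%s", named[i].text);
            return;
        }
    }
    if (v == 0)
        snprintf(out, outsz, "\\0");        /* NUL gets the short form */
    else if (v >= 0x20 && v < 0x7f)
        snprintf(out, outsz, "%c", (int)v); /* printable ASCII unchanged */
    else if (is_codepoint)
        snprintf(out, outsz, "\\x{%lx}", v);        /* braced hex */
    else
        snprintf(out, outsz, "\\x%02lx", v & 0xffUL); /* 2-digit hex */
}
```

Under this sketch the earlier example `"f\351"` would be printed as `"f\xe9"`, while the code point U+30CB stays `\x{30cb}`, so braces alone tell characters from bytes.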

I also think that Dump output should be in ASCII unless requested to
do otherwise.

Also, I will note that the regex engine debug output does not use \ as
the escape character (anymore (for a long time)), it uses % so as to
make it absolutely clear whether we are talking about an escape from
dumping, or an escape in the pattern. So there is precedent for
having diagnostics be a little different from the normal rules of
perl.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2012-02-04T12:34:54Z

From @nwc10

On Sat, Feb 04, 2012 at 11:11:51AM +0100, demerphq wrote:

On 3 February 2012 17:48, demerphq <demerphq@gmail.com> wrote:

On 3 February 2012 14:40, Nicholas Clark <nick@ccl4.org> wrote:

ENAME = "b\x{30cb}", "f\351"

You're intentionally using octal to distinguish things-as-bytes from hex for
things-as-UTF-8? Or is that just a side effect of the values chosen?

As an aside, there are a number of bits of code that use octal for
codepoints <= 255, and hex for codepoints > 255.

I personally hate it; for some reason I don't think in octal anywhere
near as well as in hex, and I find it really confusing when the same line
has both. The code for emitting a quoted escaped string in perl
supports a few modes, we could decide to use whatever we want.

Although thinking further I realise that these two *aren't* the same,
and a dump output should continue to show that:

$ ./perl -Ilib -MDevel::Peek -e '$a = "N" . chr 255; chop $a; Dump($a)'
SV = PV(0x100801070) at 0x100812ae8
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x100601b80 "N"\0
CUR = 1
LEN = 16
$ ./perl -Ilib -MDevel::Peek -e '$a = "N" . chr 256; chop $a; Dump($a)'
SV = PV(0x100801070) at 0x100812ae8
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x100601b80 "N"\0 [UTF8 "N"]
CUR = 1
LEN = 16

I should have added that the argument I have seen in at least one
place in the code (as a comment) is that octal is used for low byte
escapes because it is shorter. IOW, 100 nulls will be 200 chars long,
whereas with unbraced hex it would be 400, and with braces 500.

I personally think for stuff like this the rule should be, if there is
a named escape use it, if it is null use \0, otherwise use braced hex
if it is codepoints, and unbraced hex (2 digit) if it is bytes being
output.

Thinking about that, I like it. It also avoids any confusion between string
escapes and backslash escapes, and things like "\0123" (a.k.a "\n3", not "S")

Although "\0" will need to be special-cased in some fashion if followed by
a digit, either as "\000" or "\x00". Possibly the latter.
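The special case could be handled with a one-character lookahead. The sketch below is an assumption about how one might do it, not code from dump.c; `nul_escape` is a hypothetical name, and it picks the `\x00` spelling.

```c
#include <ctype.h>
#include <stddef.h>

/* Pick the escape for the NUL at s[i]: the short "\0" is only safe when
   the next character is not a digit, since "\0" followed by "12" would
   read back as the octal escape "\012". */
static const char *nul_escape(const unsigned char *s, size_t len, size_t i)
{
    int digit_follows = (i + 1 < len) && isdigit(s[i + 1]);
    return digit_follows ? "\\x00" : "\\0";
}
```

For a NUL followed by "123" this returns `\x00`, so the output is `\x00123` rather than the ambiguous `\0123`.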

I also think that Dump output should be in ASCII unless requested to
do otherwise.

"printable" ASCII. (As you implied above)
Agree. Because really the lowest common denominator is all that can be relied
on.

Also, I will note that the regex engine debug output does not use \ as
the escape character (anymore (for a long time)), it uses % so as to
make it absolutely clear whether we are talking about an escape from
dumping, or an escape in the pattern. So there is precedent for
having diagnostics be a little different from the normal rules of
perl.

Which might mean a B or U prefix. (As Devel::Peek effectively has a \0
suffix)

As an aside, Devel::Peek's tests are probably the right place to test this
sort of stuff.

Nicholas Clark

p5pRT · 2012-02-05T16:11:25Z

From @rjbs

* demerphq <demerphq@gmail.com> [2012-02-04T05:11:51]

I personally hate it; for some reason I don't think in octal anywhere
near as well as in hex, and I find it really confusing when the same line
has both. The code for emitting a quoted escaped string in perl
supports a few modes, we could decide to use whatever we want.
[…]
I personally think for stuff like this the rule should be, if there is
a named escape use it, if it is null use \0, otherwise use braced hex
if it is codepoints, and unbraced hex (2 digit) if it is bytes being
output.

I have the same dislike for the current behavior, and your suggestion seems
like about what I'd like, too.

--
rjbs

p5pRT · 2017-12-13T04:08:41Z

From zefram@fysh.org

This was fixed in commit 0eb335d in Perl 5.19.8.

-zefram

p5pRT · 2017-12-13T09:58:42Z

@iabyn - Status changed from 'open' to 'resolved'

p5pRT closed this as completed Dec 13, 2017

p5pRT added Severity Low distro-darwin type-core labels Oct 19, 2019
