Report information
Id: 104116
Status: resolved
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors:
Cc:
AdminCc:

Operating System: darwin
PatchStatus: (no value)
Severity: low
Type: core
Perl Version: 5.15.4
Fixed In: 5.19.8



Subject: dump.c cannot dump Unicode stash names
Date: Sun, 20 Nov 2011 16:23:19 -0800
To: perlbug [...] perl.org
From: Father Chrysostomos <sprout [...] cpan.org>
$ ./perl -Ilib -Mutf8 -MDevel::Peek -e '*fòò:: = *bǎr::; Dump \%fòò::'
SV = IV(0x8666cc) at 0x8666d0
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x867e40
  SV = PVHV(0x808330) at 0x867e40
    REFCNT = 2
    FLAGS = (OOK,SHAREKEYS)
    ARRAY = 0x28c610
    KEYS = 0
    FILL = 0
    MAX = 7
    RITER = -1
    EITER = 0x0
    NAME = "bǎr"
    NAMECOUNT = 2
    ENAME = "bǎr", "f??"

Those question marks represent Latin-1 bytes that my UTF-8 terminal could not render. bǎr is output in UTF-8.

---
Flags:
    category=core
    severity=low
---
Site configuration information for perl 5.15.4:

Configured by sprout at Wed Nov 2 09:06:14 PDT 2011.

Summary of my perl5 (revision 5 version 15 subversion 4) configuration:
  Snapshot of: f3640611309ab8d6271598d071119f09fd9e8cf0
  Platform:
    osname=darwin, osvers=10.5.0, archname=darwin-thread-multi-2level
    uname='darwin pint.local 10.5.0 darwin kernel version 10.5.0: fri nov 5 23:20:39 pdt 2010; root:xnu-1504.9.17~1release_i386 i386 '
    config_args='-de -Doptimize=-g -Dusedevel -Duseithreads -Dmad'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-common -DPERL_DARWIN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
    optimize='-g',
    cppflags='-fno-common -DPERL_DARWIN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.2.1 (Apple Inc. build 5664)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib
    libs=-ldbm -ldl -lm -lutil -lc
    perllibs=-ldl -lm -lutil -lc
    libc=, so=dylib, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup -L/usr/local/lib -fstack-protector'

Locally applied patches:

---
@INC for perl 5.15.4:
    /usr/local/lib/perl5/site_perl/5.15.4/darwin-thread-multi-2level
    /usr/local/lib/perl5/site_perl/5.15.4
    /usr/local/lib/perl5/5.15.4/darwin-thread-multi-2level
    /usr/local/lib/perl5/5.15.4
    /usr/local/lib/perl5/site_perl
    .

---
Environment for perl 5.15.4:
    DYLD_LIBRARY_PATH (unset)
    HOME=/Users/sprout
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:/usr/local/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash
RT-Send-CC: perl5-porters [...] perl.org
On Sun Nov 20 16:23:45 2011, sprout wrote:
> $ ./perl -Ilib -Mutf8 -MDevel::Peek -e '*fòò:: = *bǎr::; Dump \%fòò::'
> SV = IV(0x8666cc) at 0x8666d0
>   REFCNT = 1
>   FLAGS = (TEMP,ROK)
>   RV = 0x867e40
>   SV = PVHV(0x808330) at 0x867e40
>     REFCNT = 2
>     FLAGS = (OOK,SHAREKEYS)
>     ARRAY = 0x28c610
>     KEYS = 0
>     FILL = 0
>     MAX = 7
>     RITER = -1
>     EITER = 0x0
>     NAME = "bǎr"
>     NAMECOUNT = 2
>     ENAME = "bǎr", "f??"
>
> Those question marks represent Latin-1 bytes that my UTF-8 terminal
> could not render. bǎr is output in UTF-8.
Howdy all. I'm looking for opinions on how to go about fixing this.

Usually, when Devel::Peek finds something with the UTF-8 flag on, it'll display it like this:

$ perl -MDevel::Peek -E 'Dump "\x{30cb}"'
SV = PV(0x90d8a94) at 0x90f5b4c
  REFCNT = 1
  FLAGS = (PADTMP,POK,READONLY,pPOK,UTF8)
  PV = 0x90fdc74 "\343\203\213"\0 [UTF8 "\x{30cb}"]
  CUR = 3
  LEN = 12

That is, "escaped-bytestring"\0 [UTF8 "escaped-character-string"].

Should it also follow that convention for stash names and the like? Or should it just show the escaped character string? Neither of those, and output UTF8 when possible? Or something else entirely?

To get the point across, here's something like what the output would look like for the three options:

$ perl -MDevel::Peek -E '*{"f\xe9::"} = *{"b\x{30cb}::"}; Dump \%{"f\xe9::"}'

First option

SV = IV(0x8d62b38) at 0x8d62b3c
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x8e278bc
  SV = PVHV(0x8d1b39c) at 0x8e278bc
    REFCNT = 2
    FLAGS = (OOK,SHAREKEYS)
    ARRAY = 0x8e2399c
    KEYS = 0
    FILL = 0
    MAX = 7
    RITER = -1
    EITER = 0x0
    NAME = "b\343\203\213" [UTF8 "b\x{30cb}"]
    NAMECOUNT = 2
    ENAME = "b\343\203\213" [UTF8 "b\x{30cb}"], "f\351"

Second option

SV = IV(0x8d62b38) at 0x8d62b3c
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x8e278bc
  SV = PVHV(0x8d1b39c) at 0x8e278bc
    REFCNT = 2
    FLAGS = (OOK,SHAREKEYS)
    ARRAY = 0x8e2399c
    KEYS = 0
    FILL = 0
    MAX = 7
    RITER = -1
    EITER = 0x0
    NAME = "b\x{30cb}"
    NAMECOUNT = 2
    ENAME = "b\x{30cb}", "f\351"

Third option

SV = IV(0x8d62b38) at 0x8d62b3c
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x8e278bc
  SV = PVHV(0x8d1b39c) at 0x8e278bc
    REFCNT = 2
    FLAGS = (OOK,SHAREKEYS)
    ARRAY = 0x8e2399c
    KEYS = 0
    FILL = 0
    MAX = 7
    RITER = -1
    EITER = 0x0
    NAME = "bニ"
    NAMECOUNT = 2
    ENAME = "bニ", "fé"

Personally, I think the first option sucks -- it sort of starts alright but breaks down easily for other types, like a coderef, which gets printed like "STASH" :: "NAME", and would become "STASH" [UTF8 "STASH"] :: "NAME" [UTF8 "NAME"] if both of those were in UTF8. But admittedly this is a stylistic concern more than anything.

Meanwhile, the third makes debugging dependent on having a font that can display all symbols and not getting anything invisible in your names -- otherwise, good luck discerning \xe9 and e\N{COMBINING ACUTE ACCENT}, or why ref(bless {}, "Yep") ne ref(bless {}, "Ye\0p").

So I will probably go for the second, unless someone has objections and/or a better idea.
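
To make the second option above concrete, here is a minimal C sketch of the escaping it implies: a name flagged as UTF-8 is decoded and anything outside ASCII is written as a braced hex code point, while a name that is plain bytes keeps octal escapes. The names `decode_utf8` and `escape_name` are hypothetical and are not the functions dump.c uses; the decoder does no overlong or surrogate checks, and the handling of ASCII controls, quotes and backslashes (octal) is an assumption made only to keep the example short.

```c
#include <stddef.h>
#include <stdio.h>

/* Decode one UTF-8 sequence at s (at most len bytes available).
 * Stores the code point in *cp and returns the bytes consumed,
 * or 0 if the sequence is malformed or truncated.
 * Illustrative only: no overlong or surrogate checks. */
size_t decode_utf8(const unsigned char *s, size_t len, unsigned long *cp)
{
    size_t need, i;
    if (len == 0) return 0;
    if (s[0] < 0x80)                { *cp = s[0];        need = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; need = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; need = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; need = 4; }
    else return 0;
    if (need > len) return 0;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return need;
}

/* Hypothetical helper: write an escaped form of the name into out
 * (NUL-terminated, truncated at an escape boundary) and return out. */
char *escape_name(const unsigned char *s, size_t len, int is_utf8,
                  char *out, size_t outsz)
{
    size_t i = 0, o = 0;
    if (outsz == 0) return out;
    out[0] = '\0';
    while (i < len) {
        unsigned long cp = s[i];
        size_t used = 1;
        int as_char = 0;   /* 1 if cp is a decoded code point, 0 if a raw byte */
        int n;
        if (is_utf8) {
            size_t got = decode_utf8(s + i, len - i, &cp);
            if (got > 0) { used = got; as_char = 1; }
            else cp = s[i];               /* malformed: fall back to the raw byte */
        }
        if (cp >= 0x20 && cp < 0x7F && cp != '"' && cp != '\\')
            n = snprintf(out + o, outsz - o, "%c", (int)cp);    /* printable ASCII */
        else if (as_char && cp > 0x7F)
            n = snprintf(out + o, outsz - o, "\\x{%lx}", cp);   /* code point */
        else
            n = snprintf(out + o, outsz - o, "\\%03lo", cp);    /* byte, octal */
        if (n < 0 || (size_t)n >= outsz - o) { out[o] = '\0'; break; }
        o += (size_t)n;
        i += used;
    }
    return out;
}
```

Called on the four bytes of "b\343\203\213" with `is_utf8` set, the sketch yields `b\x{30cb}`; called on the two bytes of "f\351" with `is_utf8` clear, it yields `f\351`, matching the ENAME line shown for the second option.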
RT-Send-CC: perl5-porters [...] perl.org
On Thu Feb 02 20:40:07 2012, Hugmeir wrote:
> So I will probably go for the second, unless someone has objections and/
> or a better idea.

2 sounds good.

--
Father Chrysostomos
CC: perl5-porters [...] perl.org
Subject: Re: [perl #104116] dump.c cannot dump Unicode stash names
Date: Fri, 3 Feb 2012 13:40:48 +0000
To: Brian Fraser via RT <perlbug-followup [...] perl.org>
From: Nicholas Clark <nick [...] ccl4.org>
On Thu, Feb 02, 2012 at 08:40:08PM -0800, Brian Fraser via RT wrote:
> Second option
> SV = IV(0x8d62b38) at 0x8d62b3c
>   REFCNT = 1
>   FLAGS = (TEMP,ROK)
>   RV = 0x8e278bc
>   SV = PVHV(0x8d1b39c) at 0x8e278bc
>     REFCNT = 2
>     FLAGS = (OOK,SHAREKEYS)
>     ARRAY = 0x8e2399c
>     KEYS = 0
>     FILL = 0
>     MAX = 7
>     RITER = -1
>     EITER = 0x0
>     NAME = "b\x{30cb}"
>     NAMECOUNT = 2
>     ENAME = "b\x{30cb}", "f\351"

You're intentionally using octal to distinguish things-as-bytes from hex for things-as-UTF-8? Or is that just a side effect of the values chosen?

Because I'm thinking that some (documented, unambiguous) convention like that would make for better reading than an explicit longhand character sequence all the time.

> Personally, I think the first option sucks -- it sort of starts alright
> but breaks down easily for other types, like a coderef, which gets
> printed like "STASH" :: "NAME", and would become "STASH" [UTF8
> "STASH"] :: "NAME" [UTF8 "NAME"] if both of those were in UTF8. But
> admittedly this is a stylistic concern more than anything.
>
> Meanwhile, the third makes debugging dependent on having a font that
> can display all symbols and not getting anything invisible in your
> names -- Otherwise,good luck discerning \xe9 and e\N{COMBINING ACUTE
> ACCENT}, or why ref(bless {}, "Yep") ne ref(bless {}, "Ye\0p").

Yes, but in both cases the "style" is really about conveying information accurately without clutter, so it's important.

> So I will probably go for the second, unless someone has objections and/
> or a better idea.

Yes, the second looks the best idea (so far). We can change it if someone has a better idea. The dump format isn't sacrosanct.

Nicholas Clark
CC: Brian Fraser via RT <perlbug-followup [...] perl.org>, perl5-porters [...] perl.org
Subject: Re: [perl #104116] dump.c cannot dump Unicode stash names
Date: Fri, 3 Feb 2012 17:48:09 +0100
To: Nicholas Clark <nick [...] ccl4.org>
From: demerphq <demerphq [...] gmail.com>
On 3 February 2012 14:40, Nicholas Clark <nick@ccl4.org> wrote:
>>     ENAME = "b\x{30cb}", "f\351"
>
> You're intentionally using octal to distinguish things-as-bytes from hex for
> things-as-UTF-8? Or is that just a side effect of the values chosen?

As an aside, there are a number of bits of code that use octal for codepoints <= 255, and hex for codepoints > 255.

I personally hate it. For some reason I don't think in octal anywhere near as well as in hex, and I find it really confusing when the same line has both. The code for emitting a quoted escaped string in perl supports a few modes; we could decide to use whatever we want.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Brian Fraser via RT <perlbug-followup [...] perl.org>, perl5-porters [...] perl.org
Subject: Re: [perl #104116] dump.c cannot dump Unicode stash names
Date: Sat, 4 Feb 2012 11:11:51 +0100
To: Nicholas Clark <nick [...] ccl4.org>
From: demerphq <demerphq [...] gmail.com>
On 3 February 2012 17:48, demerphq <demerphq@gmail.com> wrote:
> On 3 February 2012 14:40, Nicholas Clark <nick@ccl4.org> wrote:
>>>     ENAME = "b\x{30cb}", "f\351"
>>
>> You're intentionally using octal to distinguish things-as-bytes from hex for
>> things-as-UTF-8? Or is that just a side effect of the values chosen?
>
> As an aside, there are number of bits of code that use octal for
> codepoints <= 255, and hex for codepoints > 255.
>
> I personally hate it, for some reason I don't think octal anywhere
> near as well as hex and i find it really confusing when the same line
> has both. The code for emitting a quoted escaped string in perl
> supports a few modes, we could decide to use whatever we want.

I should have added that the argument I have seen in at least one place in the code (as a comment) is that octal is used for low byte escapes because it is shorter. IOW, 100 nulls will be 200 chars long, whereas with unbraced hex it would be 400, and with braces 500.

I personally think that for stuff like this the rule should be: if there is a named escape, use it; if it is null, use \0; otherwise use braced hex if it is codepoints, and unbraced hex (2 digits) if it is bytes being output.

I also think that Dump output should be in ASCII unless requested to do otherwise.

Also, I will note that the regex engine debug output does not use \ as the escape character (anymore (for a long time)); it uses % so as to make it absolutely clear whether we are talking about an escape from dumping, or an escape in the pattern. So there is precedent for having diagnostics be a little different from the normal rules of perl.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
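
The length argument in the message above (200, 400 and 500 characters for 100 nulls) can be checked with a few lines of C. This is only an arithmetic sketch: `escaped_length` and the `escape_style` enum are made-up names, and the octal case assumes the shortest form (`\0` rather than `\000`), which is what the 200-character figure implies.

```c
#include <stddef.h>
#include <stdio.h>

enum escape_style { STYLE_OCTAL, STYLE_HEX, STYLE_BRACED_HEX };

/* Characters needed to write `count` copies of `byte` in the given style.
 * snprintf(NULL, 0, ...) returns the length the output would have. */
size_t escaped_length(unsigned char byte, size_t count, enum escape_style style)
{
    int per_byte;
    switch (style) {
    case STYLE_OCTAL:   /* shortest octal form, e.g. \0 */
        per_byte = snprintf(NULL, 0, "\\%o", (unsigned)byte);
        break;
    case STYLE_HEX:     /* two-digit unbraced hex, e.g. \x00 */
        per_byte = snprintf(NULL, 0, "\\x%02x", (unsigned)byte);
        break;
    default:            /* braced hex, e.g. \x{0} */
        per_byte = snprintf(NULL, 0, "\\x{%x}", (unsigned)byte);
        break;
    }
    return per_byte < 0 ? 0 : count * (size_t)per_byte;
}
```

For a null byte and a count of 100, the three styles come to 200, 400 and 500 characters respectively, which matches the figures quoted from the code comment.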
CC: Brian Fraser via RT <perlbug-followup [...] perl.org>, perl5-porters [...] perl.org
Subject: Re: [perl #104116] dump.c cannot dump Unicode stash names
Date: Sat, 4 Feb 2012 12:34:15 +0000
To: demerphq <demerphq [...] gmail.com>
From: Nicholas Clark <nick [...] ccl4.org>
On Sat, Feb 04, 2012 at 11:11:51AM +0100, demerphq wrote:
> On 3 February 2012 17:48, demerphq <demerphq@gmail.com> wrote:
> > On 3 February 2012 14:40, Nicholas Clark <nick@ccl4.org> wrote:
> >>>     ENAME = "b\x{30cb}", "f\351"
> >>
> >> You're intentionally using octal to distinguish things-as-bytes from hex for
> >> things-as-UTF-8? Or is that just a side effect of the values chosen?
> >
> > As an aside, there are number of bits of code that use octal for
> > codepoints <= 255, and hex for codepoints > 255.
> >
> > I personally hate it, for some reason I don't think octal anywhere
> > near as well as hex and i find it really confusing when the same line
> > has both. The code for emitting a quoted escaped string in perl
> > supports a few modes, we could decide to use whatever we want.

Although, thinking further, I realise that these two *aren't* the same, and a dump output should continue to show that:

$ ./perl -Ilib -MDevel::Peek -e '$a = "N" . chr 255; chop $a; Dump($a)'
SV = PV(0x100801070) at 0x100812ae8
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x100601b80 "N"\0
  CUR = 1
  LEN = 16

$ ./perl -Ilib -MDevel::Peek -e '$a = "N" . chr 256; chop $a; Dump($a)'
SV = PV(0x100801070) at 0x100812ae8
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x100601b80 "N"\0 [UTF8 "N"]
  CUR = 1
  LEN = 16

> I should have added that the argument I have seen in at least one
> place in the code (as a comment) is that octal is used for low byte
> escapes because it is shorter. IOW, 100 nulls will be 200 chars long,
> whereas with unbraced hex it would be 400, and with braces 500.
>
> I personally think for stuff like this the rule should be, if there is
> a named escape use it, if it is null use \0, otherwise use braced hex
> if it is codepoints, and unbraced hex (2 digit) if it is bytes being
> output.

Thinking about that, I like it. It also avoids any confusion between string escapes and backslash escapes, and things like "\0123" (a.k.a. "\n3", not "S").

Although "\0" will need to be special-cased in some fashion if followed by a digit, either as "\000" or "\x00". Possibly the latter.

> I also think that Dump output should be in ASCII unless requested to
> do otherwise.

"printable" ASCII. (As you implied above.) Agree. Because really the lowest common denominator is all that can be relied on.

> Also, I will note that the regex engine debug output does not use \ as
> the escape character (anymore (for a long time)), it uses % so as to
> make it absolutely clear whether we are talking about an escape from
> dumping, or an escape in the pattern. So there is precedence for
> having diagnostics be a little different from the normal rules of
> perl.

Which might mean a B or U prefix. (As Devel::Peek effectively has a \0 suffix.)

As an aside, Devel::Peek's tests are probably the right place to test this sort of stuff.

Nicholas Clark
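
The rule proposed above for byte strings, together with the "\0 followed by a digit" caveat, can be sketched in C as follows. This is an illustration of the proposal as discussed in this thread, not the code that was eventually committed; `named_escape` and `escape_bytes` are hypothetical names, and the particular set of named escapes chosen here is an assumption.

```c
#include <stddef.h>
#include <stdio.h>

/* Letter of the named escape for a byte, or 0 if it has none.
 * The set of named escapes is an assumption of this sketch. */
char named_escape(unsigned char c)
{
    switch (c) {
    case '\n': return 'n';
    case '\t': return 't';
    case '\r': return 'r';
    case '\f': return 'f';
    case '\\': return '\\';
    case '"':  return '"';
    default:   return 0;
    }
}

/* Hypothetical helper: escape a byte string using named escapes, \0 for
 * a null, and two-digit unbraced hex for other non-printable bytes.
 * A null directly before a digit is written as \x00 so that the digit
 * is not read as part of an octal escape. Returns out. */
char *escape_bytes(const unsigned char *s, size_t len, char *out, size_t outsz)
{
    size_t i, o = 0;
    if (outsz == 0) return out;
    out[0] = '\0';
    for (i = 0; i < len; i++) {
        unsigned char c = s[i];
        char name = named_escape(c);
        int next_is_digit = i + 1 < len && s[i + 1] >= '0' && s[i + 1] <= '9';
        int n;
        if (name)
            n = snprintf(out + o, outsz - o, "\\%c", name);
        else if (c == 0 && !next_is_digit)
            n = snprintf(out + o, outsz - o, "\\0");
        else if (c >= 0x20 && c < 0x7F)
            n = snprintf(out + o, outsz - o, "%c", c);           /* printable ASCII */
        else
            n = snprintf(out + o, outsz - o, "\\x%02x", (unsigned)c);
        if (n < 0 || (size_t)n >= outsz - o) { out[o] = '\0'; break; }
        o += (size_t)n;
    }
    return out;
}
```

With this sketch, a null followed by the characters 123 comes out as `\x00123` rather than the ambiguous `\0123`, a null followed by a letter stays `\0a`, and the bytes of "f\351" come out as `f\xe9`. Always emitting exactly two hex digits keeps the unbraced form unambiguous when the next character happens to be a hex digit.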
Subject: Re: [perl #104116] dump.c cannot dump Unicode stash names
Date: Sun, 5 Feb 2012 11:10:46 -0500
To: perl5-porters [...] perl.org
From: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
* demerphq <demerphq@gmail.com> [2012-02-04T05:11:51]
> > I personally hate it, for some reason I don't think octal anywhere
> > near as well as hex and i find it really confusing when the same line
> > has both. The code for emitting a quoted escaped string in perl
> > supports a few modes, we could decide to use whatever we want.
> […]
> I personally think for stuff like this the rule should be, if there is
> a named escape use it, if it is null use \0, otherwise use braced hex
> if it is codepoints, and unbraced hex (2 digit) if it is bytes being
> output.

I have the same dislike for the current behavior, and your suggestion seems like about what I'd like, too.

--
rjbs

Subject: Re: [perl #104116] dump.c cannot dump Unicode stash names
From: Zefram <zefram [...] fysh.org>
To: perl5-porters [...] perl.org
Date: Wed, 13 Dec 2017 04:08:28 +0000
This was fixed in commit 0eb335df32a07389fed6e07a4743d529fb77ac0c in Perl 5.19.8.

-zefram

