unpack(C => ...) on string with UTF8 FLAG without <use bytes> may return value more than 255 #11147

p5pRT · 2011-02-22T12:46:13Z

Migrated from rt.perl.org#84670 (status was 'open')

Searchable as RT84670$

p5pRT · 2011-02-22T12:46:14Z

From @moonlibs

Created by @moonlibs

perl5.12.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 1090
perl5.13.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 1090
perl5.10.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 1090

while.
perl5.8.9 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209
perl5.6.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209

but with use bytes

perl5.12.0 -Mbytes -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209
perl5.13.2 -Mbytes -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209

It's either worth adding sub unpack into bytes.pm and fix documentation or fix this issue.

Perl Info


Flags:
    category=core
    severity=medium

Site configuration information for perl 5.12.0:

Configured by mons at Fri May  7 14:44:18 MSD 2010.

Summary of my perl5 (revision 5 version 12 subversion 0) configuration:
   
  Platform:
    osname=freebsd, osvers=7.1-release-p2, archname=amd64-freebsd
    uname='freebsd veda.park.rambler.ru 7.1-release-p2 freebsd 7.1-release-p2 #0: thu feb 12 22:34:21 msk 2009 root@veda.park.rambler.ru:usrobjusrsrcsysdevel amd64 '
    config_args='-des -Dprefix=/home/mons -Duse64bitint -DDEBUG_LEAKING_SCALARS -DDEBUGGING -Dinc_version_list=none -Dccflags=-O2 -march=athlon64 -fomit-frame-pointer -pipe -ggdb -g3 -Doptimize=-O2 -g3'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-O2 -march=athlon64 -fomit-frame-pointer -pipe -ggdb -g3 -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -DDEBUGGING -fno-strict-aliasing -fstack-protector -I/usr/local/include',
    optimize='-O2 -g3',
    cppflags='-O2 -march=athlon64 -fomit-frame-pointer -pipe -ggdb -g3 -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -DDEBUGGING -fno-strict-aliasing -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.2.1 20070719  [FreeBSD]', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags ='-Wl,-E  -fstack-protector -L/usr/local/lib'
    libpth=/usr/lib /usr/local/lib
    libs=-lgdbm -lm -lcrypt -lutil -lc
    perllibs=-lm -lcrypt -lutil -lc
    libc=, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
    cccdlflags='-DPIC -fPIC', lddlflags='-shared  -L/usr/local/lib -fstack-protector'

Locally applied patches:
    


@INC for perl 5.12.0:
    /home/mons/perl/perl-ex
    /home/mons/lib/perl5/site_perl/5.12.0/amd64-freebsd
    /home/mons/lib/perl5/site_perl/5.12.0
    /home/mons/lib/perl5/5.12.0/amd64-freebsd
    /home/mons/lib/perl5/5.12.0
    .


Environment for perl 5.12.0:
    HOME=/home/mons
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/mons/home-bin:/home/mons/bin:/home/mons/flex/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/games:/usr/local/sbin:/usr/local/bin:/home/mons/bin
    PERL5LIB=/home/mons/perl/perl-ex
    PERL_BADLANG (unset)
    SHELL=/usr/local/bin/bash

p5pRT · 2011-02-22T14:21:25Z

From @nwc10

On Tue, Feb 22, 2011 at 04:46:15AM -0800, mons @ cpan. org wrote:

[Please describe your issue here]

perl5.12.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 1090
perl5.13.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 1090
perl5.10.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 1090

while.
perl5.8.9 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209
perl5.6.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209

but with use bytes

perl5.12.0 -Mbytes -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209
perl5.13.2 -Mbytes -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209

It's either worth adding sub unpack into bytes.pm and fix documentation or fix this issue.

It's a documented behaviour change introduced in 5.10, as described in
perl5100delta.pod:

=head1 Incompatible Changes

=head2 Packing and UTF-8 strings

The semantics of pack() and unpack() regarding UTF-8-encoded data has been
changed. Processing is now by default character per character instead of
byte per byte on the underlying encoding. Notably, code that used things
like C<pack("a*", $string)> to see through the encoding of string will now
simply get back the original $string. Packed strings can also get upgraded
during processing when you store upgraded characters. You can get the old
behaviour by using C<use bytes>.

To be consistent with pack(), the C<C0> in unpack() templates indicates
that the data is to be processed in character mode, i.e. character by
character; on the contrary, C<U0> in unpack() indicates UTF-8 mode, where
the packed string is processed in its UTF-8-encoded Unicode form on a byte
by byte basis. This is reversed with regard to perl 5.8.X, but now consistent
between pack() and unpack().

Nicholas Clark

p5pRT · 2011-02-22T14:21:26Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2011-02-22T15:19:46Z

From @moonlibs

Than maybe note this behaviuor in perldoc -f pack?
Since if I see

c A signed char (8-bit) value.
C An unsigned char (octet) value.

Then I expect an octet.
At least I think we should add reference to "Pack and unpack can operate
in two modes..."

And since it is documented, It is seems to be a good idea to add
bytes::pack() and bytes::unpack()

p5pRT · 2011-02-22T18:13:05Z

From @ikegami

On Tue, Feb 22, 2011 at 7:46 AM, mons@cpan.org <perlbug-followup@perl.org>wrote:

# New Ticket Created by mons@cpan.org
# Please include the string: [perl #84670]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=84670 >

This is a bug report for perl from mons@cpan.org,
generated with the help of perlbug 1.39 running under perl 5.12.0.

-----------------------------------------------------------------
[Please describe your issue here]

perl5.12.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 1090
perl5.13.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 1090
perl5.10.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 1090

It's doing the best it can when given garbage input. It requries a string of
bytes, but it's given a string that contains non-bytes.

You didn't say what you expect it to do. I suppose it could throw an
exception, but the current behaviour is quite reasonable to me.

while.
perl5.8.9 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209
perl5.6.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209

These versions peeked into the internals and produced unpredictable results
for bytes above 0x80. When this was fixed, the nonsense answer they gave for
inputs >0xFF was fixed as well.

- Eric

p5pRT · 2011-02-22T18:21:18Z

From @ikegami

On Tue Feb 22 10:13:05 2011, ikegami@adaelis.com wrote:

You didn't say what you expect it to do. I suppose it could throw an
exception, but the current behaviour is quite reasonable to me.

$ perl -we'printf "%02X\n", unpack "N", "\0\0\0\x{442}"'
Character(s) in 'N' format wrapped in unpack at -e line 1.
42

$ perl -wle'printf "%02X\n", unpack "C", "\x{442}"'
442

I suppose the latter could do like the former (warn and "& 0xFF" the
input), but the latter's behaviour is so much more useful.

p5pRT · 2011-02-28T23:06:10Z

From perl5-porters@ton.iguana.be

In article <rt-3.6.HEAD-24085-1298398878-801.84670-15-0@perl.org>,
"Eric Brine via RT" <perlbug-followup@perl.org> writes:

On Tue Feb 22 10:13:05 2011, ikegami@adaelis.com wrote:

You didn't say what you expect it to do. I suppose it could throw an
exception, but the current behaviour is quite reasonable to me.

$ perl -we'printf "%02X\n", unpack "N", "\0\0\0\x{442}"'
Character(s) in 'N' format wrapped in unpack at -e line 1.
42

$ perl -wle'printf "%02X\n", unpack "C", "\x{442}"'
442

I suppose the latter could do like the former (warn and "& 0xFF" the
input), but the latter's behaviour is so much more useful.

Actually when I made the unicode pack/unpack patch the "C" format was
seen as a possible backward incompatibility problem and on p5p I was
asked to add another character to mean "full single character semantics",
which became the "W" (word) character. But I only did that for pack it seems:

perl -wle 'print ord pack("C", 1000)'
Character in 'C' format wrapped in pack at -e line 1.
232

perl -wle 'print ord pack("W", 1000)'
1000

So the "C" format basically works "modulo 256"

I think its entirely reasonable to have the same behaviour for unpack so that

unpack "C", "\x{442}" would give 66 (1090 % 256) together with a format
wrap warning
(notice that it still won't give 209 which is a nonsense answer
corresponding to internal details)

The admittedly much more sane behaviour of returning 1090 would still be
available with W,

unpack "W", "\x{442}" would give 1090

This woould be completely in line with the documented (in perldoc -f pack)

C An unsigned char (octet) value.
W An unsigned char value (can be greater than 255).

"W" was always meant as the unicode sane version of "C"

I can make a patch if people agree with this...

p5pRT · 2019-04-16T04:06:55Z

From @khwilliamson

On Mon, 28 Feb 2011 15:06:10 -0800, perl5-porters@ton.iguana.be wrote:

In article <rt-3.6.HEAD-24085-1298398878-801.84670-15-0@perl.org>,
"Eric Brine via RT" <perlbug-followup@perl.org> writes:

On Tue Feb 22 10:13:05 2011, ikegami@adaelis.com wrote:

You didn't say what you expect it to do. I suppose it could throw an
exception, but the current behaviour is quite reasonable to me.

$ perl -we'printf "%02X\n", unpack "N", "\0\0\0\x{442}"'
Character(s) in 'N' format wrapped in unpack at -e line 1.
42

$ perl -wle'printf "%02X\n", unpack "C", "\x{442}"'
442

I suppose the latter could do like the former (warn and "& 0xFF" the
input), but the latter's behaviour is so much more useful.

Actually when I made the unicode pack/unpack patch the "C" format was
seen as a possible backward incompatibility problem and on p5p I was
asked to add another character to mean "full single character
semantics",
which became the "W" (word) character. But I only did that for pack it
seems:

perl -wle 'print ord pack("C", 1000)'
Character in 'C' format wrapped in pack at -e line 1.
232

perl -wle 'print ord pack("W", 1000)'
1000

So the "C" format basically works "modulo 256"

I think its entirely reasonable to have the same behaviour for unpack
so that

unpack "C", "\x{442}" would give 66 (1090 % 256) together with a
format
wrap warning
(notice that it still won't give 209 which is a nonsense answer
corresponding to internal details)

The admittedly much more sane behaviour of returning 1090 would still
be
available with W,

unpack "W", "\x{442}" would give 1090

This woould be completely in line with the documented (in perldoc -f
pack)

C An unsigned char (octet) value.
W An unsigned char value (can be greater than 255).

"W" was always meant as the unicode sane version of "C"

I can make a patch if people agree with this...

No one responded to this. It looks ok to me.
--
Karl Williamson

khwilliamson · 2020-03-09T03:56:32Z

Actually, W appears to have been added to unpack already, so the one remaining issue in this ticket is unpack C

ikegami · 2020-03-09T08:29:40Z

Isn't it kinda late to be changing unpack C now?

$ for v in 8 10 12 14 16 18 20 22 24 26 30; do printf '5.%st: ' $v; 5.${v}t/bin/perl -wle'print unpack "C", "\x{442}"'; done
5.8t: 209
5.10t: 1090
5.12t: 1090
5.14t: 1090
5.16t: 1090
5.18t: 1090
5.20t: 1090
5.22t: 1090
5.24t: 1090
5.26t: 1090
5.30t: 1090

ikegami · 2020-03-09T08:36:37Z

Who would be harmed from the change? People using "C" that should be using "W".

Who would be helped by the change? People inadvertently passing a string of Unicode Code Points to unpack C and getting garbage. They would still get garbage, but they would start receiving a warning.

I have no idea how large these groups of people are.

It seems to me that adding a warning without changing the output would be the most beneficial?

khwilliamson · 2020-03-09T13:08:04Z

On 3/9/20 2:36 AM, Eric Brine wrote: Isn't it kinda late to be changing unpack C now? |$ for v in 8 10 12 14 16 18 20 22 24 26 30; do printf '5.%st: ' $v; 5.${v}t/bin/perl -wle'print unpack "C", "\x{442}"'; done 5.8t: 209 5.10t: 1090 5.12t: 1090 5.14t: 1090 5.16t: 1090 5.18t: 1090 5.20t: 1090 5.22t: 1090 5.24t: 1090 5.26t: 1090 5.30t: 1090 | Who would be harmed from the change? People using "C" that should be using "W". Who would be helped by the change? People inadvertently passing a string of Unicode Code Points to unpack C and getting garbage. They would still get garbage, but they would start receiving a warning. I have no idea how large these groups of people are. It seems to me that adding a warning without changing the output would be the most beneficial?

That sounds reasonable to me.

p5pRT added Severity Medium distro-freebsd type-core labels Oct 19, 2019

toddr removed the khw label Oct 25, 2019

xenu removed the affects-5.12 label Nov 19, 2021

xenu removed the Severity Medium label Dec 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unpack(C => ...) on string with UTF8 FLAG without <use bytes> may return value more than 255 #11147

unpack(C => ...) on string with UTF8 FLAG without <use bytes> may return value more than 255 #11147

p5pRT commented Feb 22, 2011

p5pRT commented Feb 22, 2011

p5pRT commented Feb 22, 2011

p5pRT commented Feb 22, 2011

p5pRT commented Feb 22, 2011

p5pRT commented Feb 22, 2011

p5pRT commented Feb 22, 2011

p5pRT commented Feb 28, 2011

p5pRT commented Apr 16, 2019

khwilliamson commented Mar 9, 2020

ikegami commented Mar 9, 2020

ikegami commented Mar 9, 2020 •

edited

Loading

khwilliamson commented Mar 9, 2020 via email

unpack(C => ...) on string with UTF8 FLAG without <use bytes> may return value more than 255 #11147

unpack(C => ...) on string with UTF8 FLAG without <use bytes> may return value more than 255 #11147

Comments

p5pRT commented Feb 22, 2011

p5pRT commented Feb 22, 2011

From @moonlibs

Created by @moonlibs

p5pRT commented Feb 22, 2011

From @nwc10

p5pRT commented Feb 22, 2011

p5pRT commented Feb 22, 2011

From @moonlibs

p5pRT commented Feb 22, 2011

From @ikegami

p5pRT commented Feb 22, 2011

From @ikegami

p5pRT commented Feb 28, 2011

From perl5-porters@ton.iguana.be

p5pRT commented Apr 16, 2019

From @khwilliamson

khwilliamson commented Mar 9, 2020

ikegami commented Mar 9, 2020

ikegami commented Mar 9, 2020 • edited Loading

khwilliamson commented Mar 9, 2020 via email

ikegami commented Mar 9, 2020 •

edited

Loading