Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unpack(C => ...) on string with UTF8 FLAG without <use bytes> may return value more than 255 #11147

Open
p5pRT opened this issue Feb 22, 2011 · 12 comments

Comments

@p5pRT
Copy link

p5pRT commented Feb 22, 2011

Migrated from rt.perl.org#84670 (status was 'open')

Searchable as RT84670$

@p5pRT
Copy link
Author

p5pRT commented Feb 22, 2011

From @moonlibs

Created by @moonlibs

perl5.12.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 1090
perl5.13.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 1090
perl5.10.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 1090

while.
perl5.8.9 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209
perl5.6.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209

but with use bytes

perl5.12.0 -Mbytes -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209
perl5.13.2 -Mbytes -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209

It's either worth adding sub unpack into bytes.pm and fix documentation or fix this issue.

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl 5.12.0:

Configured by mons at Fri May  7 14:44:18 MSD 2010.

Summary of my perl5 (revision 5 version 12 subversion 0) configuration:
   
  Platform:
    osname=freebsd, osvers=7.1-release-p2, archname=amd64-freebsd
    uname='freebsd veda.park.rambler.ru 7.1-release-p2 freebsd 7.1-release-p2 #0: thu feb 12 22:34:21 msk 2009 root@veda.park.rambler.ru:usrobjusrsrcsysdevel amd64 '
    config_args='-des -Dprefix=/home/mons -Duse64bitint -DDEBUG_LEAKING_SCALARS -DDEBUGGING -Dinc_version_list=none -Dccflags=-O2 -march=athlon64 -fomit-frame-pointer -pipe -ggdb -g3 -Doptimize=-O2 -g3'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-O2 -march=athlon64 -fomit-frame-pointer -pipe -ggdb -g3 -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -DDEBUGGING -fno-strict-aliasing -fstack-protector -I/usr/local/include',
    optimize='-O2 -g3',
    cppflags='-O2 -march=athlon64 -fomit-frame-pointer -pipe -ggdb -g3 -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -DDEBUGGING -fno-strict-aliasing -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.2.1 20070719  [FreeBSD]', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags ='-Wl,-E  -fstack-protector -L/usr/local/lib'
    libpth=/usr/lib /usr/local/lib
    libs=-lgdbm -lm -lcrypt -lutil -lc
    perllibs=-lm -lcrypt -lutil -lc
    libc=, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
    cccdlflags='-DPIC -fPIC', lddlflags='-shared  -L/usr/local/lib -fstack-protector'

Locally applied patches:
    


@INC for perl 5.12.0:
    /home/mons/perl/perl-ex
    /home/mons/lib/perl5/site_perl/5.12.0/amd64-freebsd
    /home/mons/lib/perl5/site_perl/5.12.0
    /home/mons/lib/perl5/5.12.0/amd64-freebsd
    /home/mons/lib/perl5/5.12.0
    .


Environment for perl 5.12.0:
    HOME=/home/mons
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/mons/home-bin:/home/mons/bin:/home/mons/flex/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/games:/usr/local/sbin:/usr/local/bin:/home/mons/bin
    PERL5LIB=/home/mons/perl/perl-ex
    PERL_BADLANG (unset)
    SHELL=/usr/local/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Feb 22, 2011

From @nwc10

On Tue, Feb 22, 2011 at 04​:46​:15AM -0800, mons @​ cpan. org wrote​:

[Please describe your issue here]

perl5.12.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 1090
perl5.13.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 1090
perl5.10.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 1090

while.
perl5.8.9 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209
perl5.6.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209

but with use bytes

perl5.12.0 -Mbytes -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209
perl5.13.2 -Mbytes -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209

It's either worth adding sub unpack into bytes.pm and fix documentation or fix this issue.

It's a documented behaviour change introduced in 5.10, as described in
perl5100delta.pod​:

  =head1 Incompatible Changes
 
  =head2 Packing and UTF-8 strings
 
  The semantics of pack() and unpack() regarding UTF-8-encoded data has been
  changed. Processing is now by default character per character instead of
  byte per byte on the underlying encoding. Notably, code that used things
  like C<pack("a*", $string)> to see through the encoding of string will now
  simply get back the original $string. Packed strings can also get upgraded
  during processing when you store upgraded characters. You can get the old
  behaviour by using C<use bytes>.
 
  To be consistent with pack(), the C<C0> in unpack() templates indicates
  that the data is to be processed in character mode, i.e. character by
  character; on the contrary, C<U0> in unpack() indicates UTF-8 mode, where
  the packed string is processed in its UTF-8-encoded Unicode form on a byte
  by byte basis. This is reversed with regard to perl 5.8.X, but now consistent
  between pack() and unpack().

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Feb 22, 2011

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Feb 22, 2011

From @moonlibs

Than maybe note this behaviuor in perldoc -f pack?
Since if I see

c A signed char (8-bit) value.
C An unsigned char (octet) value.

Then I expect an octet.
At least I think we should add reference to "Pack and unpack can operate
in two modes..."

And since it is documented, It is seems to be a good idea to add
bytes​::pack() and bytes​::unpack()

@p5pRT
Copy link
Author

p5pRT commented Feb 22, 2011

From @ikegami

On Tue, Feb 22, 2011 at 7​:46 AM, mons@​cpan.org <perlbug-followup@​perl.org>wrote​:

# New Ticket Created by mons@​cpan.org
# Please include the string​: [perl #84670]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=84670 >

This is a bug report for perl from mons@​cpan.org,
generated with the help of perlbug 1.39 running under perl 5.12.0.

-----------------------------------------------------------------
[Please describe your issue here]

perl5.12.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 1090
perl5.13.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 1090
perl5.10.0 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 1090

It's doing the best it can when given garbage input. It requries a string of
bytes, but it's given a string that contains non-bytes.

You didn't say what you expect it to do. I suppose it could throw an
exception, but the current behaviour is quite reasonable to me.

while.
perl5.8.9 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209
perl5.6.2 -Mutf8 -e '$_ = "\x{442}"; print unpack ( C => $_ ), "\n"'
# 209

These versions peeked into the internals and produced unpredictable results
for bytes above 0x80. When this was fixed, the nonsense answer they gave for
inputs >0xFF was fixed as well.

- Eric

@p5pRT
Copy link
Author

p5pRT commented Feb 22, 2011

From @ikegami

On Tue Feb 22 10​:13​:05 2011, ikegami@​adaelis.com wrote​:

You didn't say what you expect it to do. I suppose it could throw an
exception, but the current behaviour is quite reasonable to me.

$ perl -we'printf "%02X\n", unpack "N", "\0\0\0\x{442}"'
Character(s) in 'N' format wrapped in unpack at -e line 1.
42

$ perl -wle'printf "%02X\n", unpack "C", "\x{442}"'
442

I suppose the latter could do like the former (warn and "& 0xFF" the
input), but the latter's behaviour is so much more useful.

@p5pRT
Copy link
Author

p5pRT commented Feb 28, 2011

From perl5-porters@ton.iguana.be

In article <rt-3.6.HEAD-24085-1298398878-801.84670-15-0@​perl.org>,
  "Eric Brine via RT" <perlbug-followup@​perl.org> writes​:

On Tue Feb 22 10​:13​:05 2011, ikegami@​adaelis.com wrote​:

You didn't say what you expect it to do. I suppose it could throw an
exception, but the current behaviour is quite reasonable to me.

$ perl -we'printf "%02X\n", unpack "N", "\0\0\0\x{442}"'
Character(s) in 'N' format wrapped in unpack at -e line 1.
42

$ perl -wle'printf "%02X\n", unpack "C", "\x{442}"'
442

I suppose the latter could do like the former (warn and "& 0xFF" the
input), but the latter's behaviour is so much more useful.

Actually when I made the unicode pack/unpack patch the "C" format was
seen as a possible backward incompatibility problem and on p5p I was
asked to add another character to mean "full single character semantics",
which became the "W" (word) character. But I only did that for pack it seems​:

perl -wle 'print ord pack("C", 1000)'
Character in 'C' format wrapped in pack at -e line 1.
232

perl -wle 'print ord pack("W", 1000)'
1000

So the "C" format basically works "modulo 256"

I think its entirely reasonable to have the same behaviour for unpack so that

unpack "C", "\x{442}" would give 66 (1090 % 256) together with a format
wrap warning
(notice that it still won't give 209 which is a nonsense answer
corresponding to internal details)

The admittedly much more sane behaviour of returning 1090 would still be
available with W,

unpack "W", "\x{442}" would give 1090

This woould be completely in line with the documented (in perldoc -f pack)

  C An unsigned char (octet) value.
  W An unsigned char value (can be greater than 255).

"W" was always meant as the unicode sane version of "C"

I can make a patch if people agree with this...

@p5pRT
Copy link
Author

p5pRT commented Apr 16, 2019

From @khwilliamson

On Mon, 28 Feb 2011 15​:06​:10 -0800, perl5-porters@​ton.iguana.be wrote​:

In article <rt-3.6.HEAD-24085-1298398878-801.84670-15-0@​perl.org>,
"Eric Brine via RT" <perlbug-followup@​perl.org> writes​:

On Tue Feb 22 10​:13​:05 2011, ikegami@​adaelis.com wrote​:

You didn't say what you expect it to do. I suppose it could throw an
exception, but the current behaviour is quite reasonable to me.

$ perl -we'printf "%02X\n", unpack "N", "\0\0\0\x{442}"'
Character(s) in 'N' format wrapped in unpack at -e line 1.
42

$ perl -wle'printf "%02X\n", unpack "C", "\x{442}"'
442

I suppose the latter could do like the former (warn and "& 0xFF" the
input), but the latter's behaviour is so much more useful.

Actually when I made the unicode pack/unpack patch the "C" format was
seen as a possible backward incompatibility problem and on p5p I was
asked to add another character to mean "full single character
semantics",
which became the "W" (word) character. But I only did that for pack it
seems​:

perl -wle 'print ord pack("C", 1000)'
Character in 'C' format wrapped in pack at -e line 1.
232

perl -wle 'print ord pack("W", 1000)'
1000

So the "C" format basically works "modulo 256"

I think its entirely reasonable to have the same behaviour for unpack
so that

unpack "C", "\x{442}" would give 66 (1090 % 256) together with a
format
wrap warning
(notice that it still won't give 209 which is a nonsense answer
corresponding to internal details)

The admittedly much more sane behaviour of returning 1090 would still
be
available with W,

unpack "W", "\x{442}" would give 1090

This woould be completely in line with the documented (in perldoc -f
pack)

C An unsigned char (octet) value.
W An unsigned char value (can be greater than 255).

"W" was always meant as the unicode sane version of "C"

I can make a patch if people agree with this...

No one responded to this. It looks ok to me.
--
Karl Williamson

@khwilliamson
Copy link
Contributor

Actually, W appears to have been added to unpack already, so the one remaining issue in this ticket is unpack C

@ikegami
Copy link
Contributor

ikegami commented Mar 9, 2020

Isn't it kinda late to be changing unpack C now?

$ for v in 8 10 12 14 16 18 20 22 24 26 30; do printf '5.%st: ' $v; 5.${v}t/bin/perl -wle'print unpack "C", "\x{442}"'; done
5.8t: 209
5.10t: 1090
5.12t: 1090
5.14t: 1090
5.16t: 1090
5.18t: 1090
5.20t: 1090
5.22t: 1090
5.24t: 1090
5.26t: 1090
5.30t: 1090

@ikegami
Copy link
Contributor

ikegami commented Mar 9, 2020

Who would be harmed from the change? People using "C" that should be using "W".

Who would be helped by the change? People inadvertently passing a string of Unicode Code Points to unpack C and getting garbage. They would still get garbage, but they would start receiving a warning.

I have no idea how large these groups of people are.

It seems to me that adding a warning without changing the output would be the most beneficial?

@khwilliamson
Copy link
Contributor

khwilliamson commented Mar 9, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants