Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pack propagates UTF8 strings #13989

Open
p5pRT opened this issue Jul 16, 2014 · 21 comments
Open

pack propagates UTF8 strings #13989

p5pRT opened this issue Jul 16, 2014 · 21 comments

Comments

@p5pRT
Copy link

p5pRT commented Jul 16, 2014

Migrated from rt.perl.org#122308 (status was 'open')

Searchable as RT122308$

@p5pRT
Copy link
Author

p5pRT commented Jul 16, 2014

From @mikelward

I know this is a consequence of an intended change in 5.10, but it breaks existing code in non-obvious ways - there should at least be documentation, if not an option to pack() to change this behavior.

If you pass a UTF8 encoded string to pack() the output of pack() is now a UTF8 encoded string. When the packed data is passed to another program the entire string gets encoded, including any binary data along with it.

Consider the attached program. It packs an integer and a string and sends the result over a message queue. Because the packed string has the UTF8 encoded attribute it gets encoded and the received string does not unpack to the correct values.

Having pack() handle UTF8 strings the way it does is only useful if the packed data is not sent outside the process, which is sort of the whole reason for pack in the first place...

@p5pRT
Copy link
Author

p5pRT commented Jul 16, 2014

From @mikelward

utf8pack.pl

@p5pRT
Copy link
Author

p5pRT commented Jul 17, 2014

From @demerphq

On 16 July 2014 22​:18, Mike L <perlbug-followup@​perl.org> wrote​:

# New Ticket Created by Mike L
# Please include the string​: [perl #122308]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=122308 >

I know this is a consequence of an intended change in 5.10, but it breaks
existing code in non-obvious ways - there should at least be documentation,
if not an option to pack() to change this behavior.

If you pass a UTF8 encoded string to pack() the output of pack() is now a
UTF8 encoded string. When the packed data is passed to another program the
entire string gets encoded, including any binary data along with it.

Consider the attached program. It packs an integer and a string and sends
the result over a message queue. Because the packed string has the UTF8
encoded attribute it gets encoded and the received string does not unpack
to the correct values.

Having pack() handle UTF8 strings the way it does is only useful if the
packed data is not sent outside the process, which is sort of the whole
reason for pack in the first place...

FWIW, I concur. $work has been bitten by this more than it should.

This change has caused far far more trouble than it should have, was
backwards incompatible, and really makes pack less than useful for its
intended purpose, data exchange between internal and external systems.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Jul 17, 2014

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Jul 17, 2014

From @ikegami

Sounds like you're saying there's a bug in msgsnd? If so, the solution is
to fix msgsnd.

On Thu, Jul 17, 2014 at 3​:27 AM, demerphq <demerphq@​gmail.com> wrote​:

On 16 July 2014 22​:18, Mike L <perlbug-followup@​perl.org> wrote​:

# New Ticket Created by Mike L
# Please include the string​: [perl #122308]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=122308 >

I know this is a consequence of an intended change in 5.10, but it breaks
existing code in non-obvious ways - there should at least be documentation,
if not an option to pack() to change this behavior.

If you pass a UTF8 encoded string to pack() the output of pack() is now a
UTF8 encoded string. When the packed data is passed to another program the
entire string gets encoded, including any binary data along with it.

Consider the attached program. It packs an integer and a string and sends
the result over a message queue. Because the packed string has the UTF8
encoded attribute it gets encoded and the received string does not unpack
to the correct values.

Having pack() handle UTF8 strings the way it does is only useful if the
packed data is not sent outside the process, which is sort of the whole
reason for pack in the first place...

FWIW, I concur. $work has been bitten by this more than it should.

This change has caused far far more trouble than it should have, was
backwards incompatible, and really makes pack less than useful for its
intended purpose, data exchange between internal and external systems.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Jul 17, 2014

From @ikegami

On Thu, Jul 17, 2014 at 10​:21 AM, Eric Brine <ikegami@​adaelis.com> wrote​:

Sounds like you're saying there's a bug in msgsnd? If so, the solution is
to fix msgsnd.

Confirmed.

const char * const mbuf = SvPV_const(mstr, len);

should be

const char * const mbuf = SvPVbyte_const(mstr, len);

(SvPVbyte_const doesn't currently exist.)

@p5pRT
Copy link
Author

p5pRT commented Jul 17, 2014

From @demerphq

On 17 July 2014 16​:36, Eric Brine <ikegami@​adaelis.com> wrote​:

On Thu, Jul 17, 2014 at 10​:21 AM, Eric Brine <ikegami@​adaelis.com> wrote​:

Sounds like you're saying there's a bug in msgsnd? If so, the solution is
to fix msgsnd.

Confirmed.

const char * const mbuf = SvPV_const(mstr, len);

should be

const char * const mbuf = SvPVbyte_const(mstr, len);

(SvPVbyte_const doesn't currently exist.)

That may be. But the behaviour of pack doesn't make any sense to anyone who
uses pack for what it is intended for.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Jul 17, 2014

From @mikelward

On Thu Jul 17 13​:08​:45 2014, demerphq wrote​:

That may be. But the behaviour of pack doesn't make any sense to anyone who
uses pack for what it is intended for.

My point exactly. If pack('a8') puts 8 characters (but 9 bytes because they're UTF8 characters) what is msgsend supposed to do about it? Pack is supposed to let you control the layout of data.

@p5pRT
Copy link
Author

p5pRT commented Jul 18, 2014

From @ap

* Mike L via RT <perlbug-followup@​perl.org> [2014-07-18 01​:05]​:

My point exactly. If pack('a8') puts 8 characters (but 9 bytes because
they're UTF8 characters) what is msgsend supposed to do about it? Pack
is supposed to let you control the layout of data.

Same thing everything is supposed to do about it. Decode the buffer and
send the character code points as byte values; *if* they fit into bytes,
that is, otherwise warn and send the UTF8-encoded raw data – just like
e.g. print does.

I mean, what do you expect pack to do when you give it a UTF8=on string?
If you start with data like that and you intend to end in I/O, *someone*
is going to have to ask for bytes from it eventually, and that someone
is going to do exactly what I outlined.

That’s how strings work in Perl (since 5.8 anyhow).

And if msgsend doesn’t do that then it’s clearly broken, since the data
it’s fed need not come from pack. It can always receive UTF8=on strings
from somewhere, and in that situation it needs to DTRT regardless.

There’s an argument to be had about whether pack ought to do the same
with UTF8=on strings, but the bug in msgsend is incontrovertible.

Regards,
--
Aristotle Pagaltzis // <http​://plasmasturm.org/>

@p5pRT
Copy link
Author

p5pRT commented Jul 18, 2014

From @mikelward

On Thu Jul 17 18​:30​:49 2014, aristotle wrote​:

Same thing everything is supposed to do about it. Decode the buffer and
send the character code points as byte values; *if* they fit into bytes,
that is, otherwise warn and send the UTF8-encoded raw data – just like
e.g. print does.

Yes, and that's what it does now. That's why I don't think msgsnd is the problem - pack is. pack is supposed to pack things so the user can lay out the bytes the way they want them. If pack is going to produce a UTF8 string as it's output, which will then have its bytes manipulated, then pack() is not doing what it was designed to do - layout the bytes the way the caller specified.

@p5pRT
Copy link
Author

p5pRT commented Jul 18, 2014

From @ikegami

On Thu, Jul 17, 2014 at 4​:08 PM, demerphq <demerphq@​gmail.com> wrote​:

That may be. But the behaviour of pack doesn't make any sense to anyone
who uses pack for what it is intended for.

"Doesn't exibit optimal performance" is a far cry from "doesn't make any
sense".

A non-fatal sv_downgrade on exit should be ok, but I seem to remember you
wanted to do something quite different.

@p5pRT
Copy link
Author

p5pRT commented Jul 18, 2014

From @ikegami

On Thu, Jul 17, 2014 at 4​:53 PM, Mike L via RT <perlbug-followup@​perl.org>
wrote​:

On Thu Jul 17 13​:08​:45 2014, demerphq wrote​:

That may be. But the behaviour of pack doesn't make any sense to anyone
who
uses pack for what it is intended for.

My point exactly. If pack('a8') puts 8 characters (but 9 bytes because
they're UTF8 characters) what is msgsend supposed to do about it? Pack is
supposed to let you control the layout of data.

The string only has 8 bytes. Your code even prints them out. msgsend is
suppose to send those bytes. It's currently sending some of the internal
structure of a scalar instead of the string. That's the bug.

@p5pRT
Copy link
Author

p5pRT commented Jul 18, 2014

From @ap

* Mike L via RT <perlbug-followup@​perl.org> [2014-07-18 07​:35]​:

On Thu Jul 17 18​:30​:49 2014, aristotle wrote​:

Same thing everything is supposed to do about it. Decode the buffer
and send the character code points as byte values; *if* they fit
into bytes, that is, otherwise warn and send the UTF8-encoded raw
data – just like e.g. print does.

Yes, and that's what it does now.

You are welcome to think so. The source code says otherwise.

That's why I don't think msgsnd is the problem - pack is. pack is
supposed to pack things so the user can lay out the bytes the way they
want them. If pack is going to produce a UTF8 string as it's output,
which will then have its bytes manipulated, then pack() is not doing
what it was designed to do - layout the bytes the way the caller
specified.

The only catch being that Perl doesn’t have byte strings.

(It ought to, I’m inclined to think. But status quo is it doesn’t.)

Regards,
--
Aristotle Pagaltzis // <http​://plasmasturm.org/>

@p5pRT
Copy link
Author

p5pRT commented Jul 18, 2014

From @demerphq

On 18 July 2014 03​:30, Aristotle Pagaltzis <pagaltzis@​gmx.de> wrote​:

* Mike L via RT <perlbug-followup@​perl.org> [2014-07-18 01​:05]​:

My point exactly. If pack('a8') puts 8 characters (but 9 bytes because
they're UTF8 characters) what is msgsend supposed to do about it? Pack
is supposed to let you control the layout of data.

Same thing everything is supposed to do about it. Decode the buffer and
send the character code points as byte values; *if* they fit into bytes,
that is, otherwise warn and send the UTF8-encoded raw data – just like
e.g. print does.

Look, there is a serious bug if by packing a string it *upgrades* the rest
of the packed structure. Suddenly a perfectly fine 4 byte integer turns
into some UTF8 corrupted crap.

I mean, what do you expect pack to do when you give it a UTF8=on string?

Exactly what used to happen before we introduced this backwards
incompatible behavior *which makes no sense anyway*, concat the octets in
the string into the pack buffer and assume the person calling pack knows
what they are doing.

If you start with data like that and you intend to end in I/O, *someone*
is going to have to ask for bytes from it eventually, and that someone
is going to do exactly what I outlined.

I wonder if you even understand the problem the OP is talking about, and if
you ever actually *use* pack for interoperability between discrete systems.
Because what you say makes me think you don't really understand the problem
here at all.

That’s how strings work in Perl (since 5.8 anyhow).

No. Its not how pack worked for a long time. The 5.8 behavior was sane.

And if msgsend doesn’t do that then it’s clearly broken, since the data
it’s fed need not come from pack. It can always receive UTF8=on strings
from somewhere, and in that situation it needs to DTRT regardless.

There’s an argument to be had about whether pack ought to do the same
with UTF8=on strings, but the bug in msgsend is incontrovertible.

This bug is about pack doing the wrong thing. Please dont try to turn it
into a different subject.

$ perl -MEncode=encode_utf8 -we'my $str="\x{100}" x 128; my $packed= pack
"V/a", $str; print encode_utf8($packed)' | od -x
0000000 80c2 0000 c400 c480 c480 c480 c480 c480
0000020 c480 c480 c480 c480 c480 c480 c480 c480
*
0000400 c480 c480 0080
0000405

Why did the length get corrupted?!!! Why was there no warning when it was
corrupted.

This does not work right, and has not worked right ever since we introduce
this stupid utf8 support into pack, which has I bet NEVER done anything
good for people using pack for interoperability. It might have made people
using pack for internal perlish stuff easier/safer, but it broke the
primary purpose of pack.

Yves

@p5pRT
Copy link
Author

p5pRT commented Jul 18, 2014

From @demerphq

On 18 July 2014 09​:47, Aristotle Pagaltzis <pagaltzis@​gmx.de> wrote​:

* Mike L via RT <perlbug-followup@​perl.org> [2014-07-18 07​:35]​:

On Thu Jul 17 18​:30​:49 2014, aristotle wrote​:

Same thing everything is supposed to do about it. Decode the buffer
and send the character code points as byte values; *if* they fit
into bytes, that is, otherwise warn and send the UTF8-encoded raw
data – just like e.g. print does.

Yes, and that's what it does now.

You are welcome to think so. The source code says otherwise.

That's why I don't think msgsnd is the problem - pack is. pack is
supposed to pack things so the user can lay out the bytes the way they
want them. If pack is going to produce a UTF8 string as it's output,
which will then have its bytes manipulated, then pack() is not doing
what it was designed to do - layout the bytes the way the caller
specified.

The only catch being that Perl doesn’t have byte strings.

(It ought to, I’m inclined to think. But status quo is it doesn’t.)

Nonsense. You and others have tried to force this broken view on the rest
of us, but that doesn't make it a status-quo, it means the people who know
you are wrong gave up arguing with you.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Jul 18, 2014

From @demerphq

On 18 July 2014 10​:28, demerphq <demerphq@​gmail.com> wrote​:

On 18 July 2014 09​:47, Aristotle Pagaltzis <pagaltzis@​gmx.de> wrote​:

* Mike L via RT <perlbug-followup@​perl.org> [2014-07-18 07​:35]​:

On Thu Jul 17 18​:30​:49 2014, aristotle wrote​:

Same thing everything is supposed to do about it. Decode the buffer
and send the character code points as byte values; *if* they fit
into bytes, that is, otherwise warn and send the UTF8-encoded raw
data – just like e.g. print does.

Yes, and that's what it does now.

You are welcome to think so. The source code says otherwise.

That's why I don't think msgsnd is the problem - pack is. pack is
supposed to pack things so the user can lay out the bytes the way they
want them. If pack is going to produce a UTF8 string as it's output,
which will then have its bytes manipulated, then pack() is not doing
what it was designed to do - layout the bytes the way the caller
specified.

The only catch being that Perl doesn’t have byte strings.

(It ought to, I’m inclined to think. But status quo is it doesn’t.)

Nonsense. [elided]

My apologies. This particular subject makes me testier than usual and I
spoke rudely and inappropriately.

I feel very strongly there is a bug here, which I have seen catch multiple
clever people coming from other languages using pack. Even experienced
hands get bitten by this let alone newbies from other languages.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Jul 19, 2014

From @ikegami

On Fri, Jul 18, 2014 at 4​:27 AM, demerphq <demerphq@​gmail.com> wrote​:

$ perl -MEncode=encode_utf8 -we'my $str="\x{100}" x 128; my $packed= pack
"V/a", $str; print encode_utf8($packed)' | od -x
0000000 80c2 0000 c400 c480 c480 c480 c480 c480
0000020 c480 c480 c480 c480 c480 c480 c480 c480
*
0000400 c480 c480 0080
0000405

Why did the length get corrupted?!!!

Because you encoded it using encode_utf8. Why would you do that?!

Why was there no warning when it was corrupted.

How can encode_utf8 know that you're encoding something that isn't meant to
be used as text?

@p5pRT
Copy link
Author

p5pRT commented Jul 25, 2014

From @ap

* demerphq <demerphq@​gmail.com> [2014-07-18 15​:25]​:

On 18 July 2014 10​:28, demerphq <demerphq@​gmail.com> wrote​:

Nonsense. [elided]

My apologies. This particular subject makes me testier than usual
and I spoke rudely and inappropriately.

No apologies necessary.

@p5pRT
Copy link
Author

p5pRT commented Jul 25, 2014

From @rgs

On 16 July 2014 22​:18, Mike L <perlbug-followup@​perl.org> wrote​:

If you pass a UTF8 encoded string to pack() the output of pack() is now a UTF8 encoded string.

This is a serious bug. It's even possible to generate nonsense output
due to that, which perl should never allow​:

$ bleadperl -MDevel​::Peek -wE 'Dump pack "U0v/a","foo\x{1f0}"'
Character(s) in 'a' format wrapped in pack at -e line 1.
SV = PV(0x7f82618040a0) at 0x7f8262003b30
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK,UTF8)
  PV = 0x7f8261509950 "\4\0foo\360"\0Malformed UTF-8 character (1
byte, need 4, after start byte 0xf0) in Dump at -e line 1.
[UTF8 "\x{4}\x{0}foo\x{0}"]
  CUR = 6
  LEN = 10

In my opinion the output of pack should never have the UTF8 flag on.

@p5pRT
Copy link
Author

p5pRT commented Jul 25, 2014

From @demerphq

On 25 July 2014 13​:25, Rafael Garcia-Suarez <rgarciasuarez@​gmail.com> wrote​:

On 16 July 2014 22​:18, Mike L <perlbug-followup@​perl.org> wrote​:

If you pass a UTF8 encoded string to pack() the output of pack() is now
a UTF8 encoded string.

This is a serious bug. It's even possible to generate nonsense output
due to that, which perl should never allow​:

$ bleadperl -MDevel​::Peek -wE 'Dump pack "U0v/a","foo\x{1f0}"'
Character(s) in 'a' format wrapped in pack at -e line 1.
SV = PV(0x7f82618040a0) at 0x7f8262003b30
REFCNT = 1
FLAGS = (PADTMP,POK,pPOK,UTF8)
PV = 0x7f8261509950 "\4\0foo\360"\0Malformed UTF-8 character (1
byte, need 4, after start byte 0xf0) in Dump at -e line 1.
[UTF8 "\x{4}\x{0}foo\x{0}"]
CUR = 6
LEN = 10

In my opinion the output of pack should never have the UTF8 flag on.

Exactly. Never ever ever.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Author

p5pRT commented Jul 25, 2014

From @ikegami

On Fri, Jul 25, 2014 at 7​:25 AM, Rafael Garcia-Suarez <
rgarciasuarez@​gmail.com> wrote​:

On 16 July 2014 22​:18, Mike L <perlbug-followup@​perl.org> wrote​:

If you pass a UTF8 encoded string to pack() the output of pack() is now
a UTF8 encoded string.

This is a serious bug. It's even possible to generate nonsense output
due to that, which perl should never allow​:

$ bleadperl -MDevel​::Peek -wE 'Dump pack "U0v/a","foo\x{1f0}"'
Character(s) in 'a' format wrapped in pack at -e line 1.
SV = PV(0x7f82618040a0) at 0x7f8262003b30
REFCNT = 1
FLAGS = (PADTMP,POK,pPOK,UTF8)
PV = 0x7f8261509950 "\4\0foo\360"\0Malformed UTF-8 character (1
byte, need 4, after start byte 0xf0) in Dump at -e line 1.
[UTF8 "\x{4}\x{0}foo\x{0}"]
CUR = 6
LEN = 10

What you describe is a problem with U0 specifically. U0 is broken.

In my opinion the output of pack should never have the UTF8 flag on.

You'd break C<< pack("n/a*", "\x{2660}"); >>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants