Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unpack fails on utf-8 strings #7742

Closed
p5pRT opened this issue Jan 9, 2005 · 19 comments
Closed

unpack fails on utf-8 strings #7742

p5pRT opened this issue Jan 9, 2005 · 19 comments

Comments

@p5pRT
Copy link

p5pRT commented Jan 9, 2005

Migrated from rt.perl.org#33734 (status was 'resolved')

Searchable as RT33734$

@p5pRT
Copy link
Author

p5pRT commented Jan 9, 2005

From pcg@goof.com

Created by pcg@goof.com

The following program should output "65535" twice, but doesn't​:

  use Convert​::Scalar;
  my $s = "\xff\xff";
  printf "%d\n", unpack "n", $s;
  Convert​::Scalar​::utf8_upgrade $s;
  printf "%d\n", unpack "n", $s;

The program creates the string "\xff\xff" and runs it through unpack, once
when the string internally is in latin1, once when the string is in utf-8
(Convert​::Scalar​::utf8_upgrade just runs utf8_upgrade on the string).

The result must be the same in both cases (same string content), but the
second print gives "50111".

As the internal encoding (wether latin1 or utf8) does NOT change the
string on the perl level, unpack must work consistently.

(I found this bug because for some reason perl upgraded my string to
utf-8 internally, causing very funny effects when I ran various unpacks
to decode the protocol. As perl can do that in various unexpected ways,
I chose severity "high" because there is no easy workaround on the perl
level​: feel free to correct this :)

The solution is to downgrade the string to latin1 before converting it
within unpack, or failing if the string cnanot be converted.

Perl Info

Flags:
    category=core
    severity=high

Site configuration information for perl v5.8.6:

Configured by Marc Lehmann at Tue Nov 30 00:54:44 CET 2004.

Summary of my perl5 (revision 5 version 8 subversion 6) configuration:
  Platform:
    osname=linux, osvers=2.6.10-rc1, archname=amd64-linux
    uname='linux cerebro 2.6.10-rc1 #1 smp mon nov 22 05:47:21 cet 2004 x86_64 gnulinux '
    config_args='-Duselargefiles -Dxuse64bitint -Uxuse64bitall -Dusemymalloc=y -Dcc=gcc-3.4 -Dccflags=-ggdb -Dcppflags=-D_GNU_SOURCE -I/opt/include -Doptimize=-O4 -march=opteron -mtune=opteron -funroll-loops -fno-strict-aliasing -Dcccdlflags=-fPIC -Dldflags=-L/opt/perl/lib -L/opt/lib -Dlibs=-ldl -lm -lcrypt -Darchname=amd64-linux -Dprefix=/opt/perl -Dprivlib=/opt/perl/lib/perl5 -Darchlib=/opt/perl/lib/perl5 -Dvendorprefix=/opt/perl -Dvendorlib=/opt/perl/lib/perl5 -Dvendorarch=/opt/perl/lib/perl5 -Dsiteprefix=/opt/perl -Dsitelib=/opt/perl/lib/perl5 -Dsitearch=/opt/perl/lib/perl5 -Dsitebin=/opt/perl/bin -Dman1dir=/opt/perl/man/man1 -Dman3dir=/opt/perl/man/man3 -Dsiteman1dir=/opt/perl/man/man1 -Dsiteman3dir=/opt/perl/man/man3 -Dman1ext=1 -Dman3ext=3 -Dpager=/usr/bin/less -Uafs -Uusesfio -Uusenm -Uuseshrplib -Dd_dosuid -Dusethreads=undef -Duse5005threads=undef -Duseithreads=undef -Dusemultiplicity=undef -Demail=perl-binary@plan9.de -Dcf_email=perl-binary@plan9.de -Dcf_by=Marc Lehmann -Dlocincpth=/opt/perl/include /opt/include -Dmyhostname=localhost -Dmultiarch=undef -Dbin=/opt/perl/bin -des'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=define use64bitall=define uselongdouble=undef
    usemymalloc=y, bincompat5005=undef
  Compiler:
    cc='gcc-3.4', ccflags ='-ggdb -fno-strict-aliasing -pipe -I/opt/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O4 -march=opteron -mtune=opteron -funroll-loops -fno-strict-aliasing',
    cppflags='-D_GNU_SOURCE -I/opt/include -ggdb -fno-strict-aliasing -pipe -I/opt/include'
    ccversion='', gccversion='3.4.2 (Debian 3.4.2-3)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc-3.4', ldflags ='-L/opt/perl/lib -L/opt/lib -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-ldl -lm -lcrypt
    perllibs=-ldl -lm -lcrypt
    libc=/lib/libc-2.3.2.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.3.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -L/opt/perl/lib -L/opt/lib -L/usr/local/lib'

Locally applied patches:
    


@INC for perl v5.8.6:
    /root/src/sex
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    .


Environment for perl v5.8.6:
    HOME=/root
    LANG (unset)
    LANGUAGE (unset)
    LC_CTYPE=de_DE.UTF-8
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/root/s2:/root/s:/opt/bin:/opt/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/X11/bin:/usr/games:/root/src/uunet:.
    PERL5LIB=/root/src/sex
    PERL5_CPANPLUS_CONFIG=/root/.cpanplus/config
    PERLDB_OPTS=ornaments=0
    PERL_BADLANG (unset)
    PERL_UNICODE=SAL
    SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented Jan 10, 2005

From @nwc10

On Sun, Jan 09, 2005 at 10​:19​:43PM -0000, Marc Lehmann wrote​:

As the internal encoding (wether latin1 or utf8) does NOT change the
string on the perl level, unpack must work consistently.

I agree. Well, I thought I did. Then..

(I found this bug because for some reason perl upgraded my string to
utf-8 internally, causing very funny effects when I ran various unpacks
to decode the protocol. As perl can do that in various unexpected ways,
I chose severity "high" because there is no easy workaround on the perl
level​: feel free to correct this :)

The solution is to downgrade the string to latin1 before converting it
within unpack, or failing if the string cnanot be converted.

However, I'm confused. There is this code in pp_pack.c​:

#ifdef PACKED_IS_OCTETS
  /* Packed side is assumed to be octets - so force downgrade if it
  has been UTF-8 encoded by accident
  */
  register char *s = SvPVbyte(right, rlen);
#else
  register char *s = SvPV(right, rlen);
#endif

and the default is the #else clause. If I recompile with -DPACKED_IS_OCTETS

  Failed 40 test scripts out of 903, 95.57% okay.

which doesn't look great. It looks like some cases in unpack expect to find
utf8 data in the source string. Great. :-(
I wonder if it's viable to make the integer conversion operators (and the
floating point operators) downgrade just enough characters to be useful?

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Jan 10, 2005

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Jan 10, 2005

From nick@ing-simmons.net

Nicholas Clark <nick@​ccl4.org> writes​:

On Sun, Jan 09, 2005 at 10​:19​:43PM -0000, Marc Lehmann wrote​:

As the internal encoding (wether latin1 or utf8) does NOT change the
string on the perl level, unpack must work consistently.

I agree. Well, I thought I did. Then..

I know I agree ;-)
But I was overruled.

(I found this bug because for some reason perl upgraded my string to
utf-8 internally, causing very funny effects when I ran various unpacks
to decode the protocol. As perl can do that in various unexpected ways,
I chose severity "high" because there is no easy workaround on the perl
level​: feel free to correct this :)

The solution is to downgrade the string to latin1 before converting it
within unpack, or failing if the string cnanot be converted.

However, I'm confused. There is this code in pp_pack.c​:

#ifdef PACKED_IS_OCTETS
/* Packed side is assumed to be octets - so force downgrade if it
has been UTF-8 encoded by accident
*/
register char *s = SvPVbyte(right, rlen);
#else
register char *s = SvPV(right, rlen);
#endif

and the default is the #else clause. If I recompile with -DPACKED_IS_OCTETS

I am reasonably sure that was my code, and the #ifdef backs it out
for obscure 5.6 compatibility and/or Camel-III promissed reasons.

Failed 40 test scripts out of 903, 95.57% okay.

which doesn't look great. It looks like some cases in unpack expect to find
utf8 data in the source string. Great. :-(
I wonder if it's viable to make the integer conversion operators (and the
floating point operators) downgrade just enough characters to be useful?

Then snag is that with 5.6 using pack/unpack and messing with SvUTF8 flag
by obscure means was only way to do 'encoding'.

@p5pRT
Copy link
Author

p5pRT commented Jan 11, 2005

From pcg@goof.com

On Mon, Jan 10, 2005 at 02​:35​:18PM -0000, Nick Ing-Simmons via RT <perlbug-followup@​perl.org> wrote​:

Nicholas Clark <nick@​ccl4.org> writes​:

I agree. Well, I thought I did. Then..

I know I agree ;-)
But I was overruled.

Reminds me of mozilla, "MSIE has the same bug, so we don't follow the RFC
here" :/

and the default is the #else clause. If I recompile with -DPACKED_IS_OCTETS

I am reasonably sure that was my code, and the #ifdef backs it out
for obscure 5.6 compatibility and/or Camel-III promissed reasons.

Hmm..

which doesn't look great. It looks like some cases in unpack expect to find
utf8 data in the source string. Great. :-(
I wonder if it's viable to make the integer conversion operators (and the
floating point operators) downgrade just enough characters to be useful?

Then snag is that with 5.6 using pack/unpack and messing with SvUTF8 flag
by obscure means was only way to do 'encoding'.

Which shouldn't be a problem, as perl-5.6 unicode code will not work on
perl-5.8 anyways in most cases, because 5.6 was *completely* and *utterly*
broken with respect to unicode. I don't think bugs should stay around.

The problem with this bug is that, on the perl level, there is no easy way
to get it working in 5.8. 5.8 might upgrade a string for a lot of obscure
reasons (appending a string encoded in utf-8 already for other reasons,
but not containing utf-8 characters).

Not fixing this bug means that a programmer must ALWAYS downgrade the
scalar manually, or must know enough about all perl internals (and modules
in use!) involved that he can be sure that the scalar won't be upgraded
accidentally.

Bug-compatibility with earlier versions of perl doesn't seem like a reason to
keep it that way.

(I know you agree, but if you are still overruled, it might be useful to
argue once more​: I was bitten by this bug and it took me 3 hours to find
the actual reason (and it was old code decoding a protocol, too). And I do
know _a lot_ about perl unicode internals compared to your average perl
programmer. I don't think other people would quite so quickly find out
about this).

At the very least this breakage should be documented, and a workaround
be available (I guess Encode​::encode "iso-8859-1" would be working..
horrible).

--
  The choice of a
  -----==- _GNU_
  ----==-- _ generation Marc Lehmann
  ---==---(_)__ __ ____ __ pcg@​goof.com
  --==---/ / _ \/ // /\ \/ / http​://schmorp.de/
  -=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE

@p5pRT
Copy link
Author

p5pRT commented Jan 11, 2005

From pcg@goof.com

On Mon, Jan 10, 2005 at 01​:42​:14PM -0000, Nicholas Clark via RT <perlbug-followup@​perl.org> wrote​:

On Sun, Jan 09, 2005 at 10​:19​:43PM -0000, Marc Lehmann wrote​:
#ifdef PACKED_IS_OCTETS
/* Packed side is assumed to be octets - so force downgrade if it
has been UTF-8 encoded by accident

Yupp - pack makes octets out of data, unpack makes data out of octets.

 \*/
register char \*s = SvPVbyte\(right\, rlen\);

#else
register char *s = SvPV(right, rlen);
#endif

and the default is the #else clause. If I recompile with -DPACKED_IS_OCTETS

Failed 40 test scripts out of 903, 95.57% okay.

Uh.

which doesn't look great. It looks like some cases in unpack expect to find
utf8 data in the source string. Great. :-(

Do you know which cases? I'd say (Without knowing them :) that they are
bugs...

I cannot find any conversion operator that would make sense when feed with
non-octect-data (in the perlfunc manpage, except maybe "U", but even "U"
should work on octets, not on an utf-8 string, i.e. it should generate two
characters for \x80, not one).

I wonder if it's viable to make the integer conversion operators (and the
floating point operators) downgrade just enough characters to be useful?

That would still break "b" and would have questionable semantics on "a"
for example.

I frankly cannot see any reason why >255 characters can make any sense as
argument to unpack, and if the testsuite fails, I guess that is then a bug in
the testsuite.

--
  The choice of a
  -----==- _GNU_
  ----==-- _ generation Marc Lehmann
  ---==---(_)__ __ ____ __ pcg@​goof.com
  --==---/ / _ \/ // /\ \/ / http​://schmorp.de/
  -=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE

@p5pRT
Copy link
Author

p5pRT commented Jan 13, 2005

From @nwc10

On Tue, Jan 11, 2005 at 06​:08​:26PM +0100, Marc A. Lehmann wrote​:

I cannot find any conversion operator that would make sense when feed with
non-octect-data (in the perlfunc manpage, except maybe "U", but even "U"
should work on octets, not on an utf-8 string, i.e. it should generate two
characters for \x80, not one).

I didn't know, but looking at the pack implementation, it's 'U', and only 'U'​:

$ ./perl -Ilib -we 'use Devel​::Peek; Dump pack "U", 256'
SV = PV(0x1801448) at 0x1801240
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK,UTF8)
  PV = 0x600350 "\304\200"\0 [UTF8 "\x{100}"]
  CUR = 2
  LEN = 14

It expects UTF8 on the way back in, *marked* with the UTF8 flag.

$ ./perl -Ilib -MCarp -we 'use Devel​::Peek; $a = pack "U", 256; utf8​::encode $a; Dump $a; Dump unpack "U", $a'
SV = PV(0x1801478) at 0x1808e70
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x6016a0 "\304\200"\0
  CUR = 2
  LEN = 3
SV = IV(0x1809f68) at 0x1801090
  REFCNT = 1
  FLAGS = (TEMP,IOK,pIOK)
  IV = 196

I'm about to test a hack that might make most things work.

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Jan 13, 2005

From @nwc10

On Thu, Jan 13, 2005 at 01​:56​:33PM +0000, Nicholas Clark wrote​:

I didn't know, but looking at the pack implementation, it's 'U', and only 'U'​:

Seems to be 'C' and 'U'

I'm about to test a hack that might make most things work.

After changing t/op/join.t to avoid using H* to probe the innards of UTF8
scalars, the appended diff does make all tests pass. However, I'm not
convinced that it's the way to go.

Nicholas Clark

==== //depot/perl/pp_pack.c#52 - /Users/nick/p4perl/perl/pp_pack.c ====

Inline Patch
--- /tmp/tmp.5559.0     Thu Jan 13 21:21:50 2005
+++ /Users/nick/p4perl/perl/pp_pack.c   Thu Jan 13 16:26:17 2005
@@ -1869,7 +1869,13 @@ PP(pp_unpack)
      */
     register char *s = SvPVbyte(right, rlen);
 #else
-    register char *s = SvPV(right, rlen);
+    /* This is a hack. Only the "U" pattern requires Unicode input, so
+       downgrade everything else. We're assuing that no-one is mad enough
+       to mix U patterns and regular packed data. This will, of course, be
+       wrong.
+    */
+    register char *s = (strchr (pat, 'U') || strchr (pat, 'C'))
+       ? SvPV(right, rlen) : SvPVbyte(right, rlen);
 #endif
     char *strend = s + rlen;
     register char *patend = pat + llen;

@p5pRT
Copy link
Author

p5pRT commented Jan 14, 2005

From pcg@goof.com

On Thu, Jan 13, 2005 at 09​:24​:16PM -0000, Nicholas Clark via RT <perlbug-followup@​perl.org> wrote​:

On Thu, Jan 13, 2005 at 01​:56​:33PM +0000, Nicholas Clark wrote​:

I didn't know, but looking at the pack implementation, it's 'U', and only 'U'​:

Seems to be 'C' and 'U'

But "C" is documented as​:

  An unsigned char value. Only does bytes. See U for
  Unicode.

I assume that "byte" == "octet", which is not generally true in perl, but
is common language.

While "U" is documented as​:

  A Unicode character number. Encodes to UTF-8
  internally (or UTF-EBCDIC in EBCDIC platforms).

If that indeed encodes to a unicode string, not to UTF-8 (as the confusing
"internally" seems to imply), it is badly worded. In any case, this is the
only flag, and I do remember the discussion on perl5-porters that it is
broken and not easily supportable, as Encode provides the correct way to
do that, and much more.

After changing t/op/join.t to avoid using H* to probe the innards of UTF8
scalars, the appended diff does make all tests pass. However, I'm not
convinced that it's the way to go.

Well, then how do you propose to fix the situaiton? The current behaviour
is completely erratic, as the same scalar is interpretetd differently by
unpack, depending on the perl version and it's usage history.

The question, if this is not the right fix, is what the semantics of
unicode strings as arguments to unpack are?

As of now, I don't see how I can control this on the perl level, as perl
can essentially upgrade my scalar anytime. The data in the scalar will
stay the same, by definition, except for unpack, which acts in a rather
undefined way.

In any case, the current state of undocumented random breakage is a
bug. It either needs to get defined semantics, or made comaptible with the
rest of perl.

Try this​:

  $x = "\xff\xff";
  $x =~ /\x{100}/;

  die unpack "n", $x;

Can you guess the output of this program without running it? For every
current and future perl version? What about other operations? What about
this​:

  $x = "\xff\xff";
  $y = "\x{100}";
  substr $y, 0, 1, "";

  die unpack "n", "$x$y";

So concatenation with an empty string suddenly changes how the firts two
bytes are being interpreted?

It gets worse​:

  use utf8;
  $x = "\xff\xff echt übel";
  die unpack "n", $x;

Now the output suddenly depends on thenormalization form the text editor
used to edit the text used, or wether perl does normalization or not.

Does I/O to a utf-8-encoded file change the unpack outcome? Does
interpolation into a utf-8-encoded string change unpack outcome? HAs this
been the case in earlier versions? Will this stay the same in future
versions?

Why have all of the above code snippets defined behaviour regardless of
the outcome of the questions if I match or substr or concatenate the
scalars, but not when unpakc is involved?

Upgrading behaviour can change with every perl version, and often enough
has.

If that is supposed behaviour, then I would be happy if it's semantics
were clarified somewhere. Saying "perl might change your scalar anytime
behind your back and thus change unpack results" sounds a bit like "your
memory has corrupted bits".

Forcing dveelopers to paste a call to Encode between every scalar and
unpack to avoid circumstances, module usage or future versions changing
it's unpack value is not what I would call deterministic behaviour.

In any case, if the old behaviour is to stay it simply needs to be defined
behaviour. If there is no way to make it behave deterministically on the perl
level, it isn't defined behaviour in my eyes.

This is just a time bomb (and it exploded in my code, and might do so in
other code).

--
  The choice of a
  -----==- _GNU_
  ----==-- _ generation Marc Lehmann
  ---==---(_)__ __ ____ __ pcg@​goof.com
  --==---/ / _ \/ // /\ \/ / http​://schmorp.de/
  -=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE

@p5pRT
Copy link
Author

p5pRT commented Jan 22, 2005

From perl5-porters@ton.iguana.be

Some points I forgot to make in the previous mail​:

- If the first character in the unpackstring is U it now behaves like
  there is an U0 before (this is consistent with the documentation
  and behaviour of pack, and needed for reversibility)
- Only literal C0 and U0 cause a mode switch, not ones implied by
  something like​: unpack("C/U", "\x00")
  (I consider that a bugfix that should also be applied to maint)
- C0 and U0 modes are scoped to (), so in format "s(nU0v)2S", the U0 mode
  only applies to the v, NOT to the S or the n (in the second round)
  (I also consider the old behaviour here a bug. It made multiround
  groups and C/() style groups too unpredictable)

A few bugfixes and improvements relative to the previous patch​:

Inline Patch
--- pp_pack.c.1	Sat Jan 22 14:43:22 2005
+++ pp_pack.c	Sat Jan 22 16:06:04 2005
@@ -668,21 +668,32 @@
 STATIC bool 
 need_utf8(const char *pat, const char *patend)
 {
-    if (pat >= patend) return FALSE;
-    if (*pat == 'U') return TRUE;
+    bool first = TRUE;
     while (pat < patend) {
 	if (pat[0] == '#') {
 	    pat++;
 	    pat = memchr(pat, '\n', patend-pat);
 	    if (!pat) return FALSE;
 	} else if (pat[0] == 'U') {
-	    if (pat[1] == '0') return TRUE;
-	}
+	    if (first || pat[1] == '0') return TRUE;
+	} else first = FALSE;
 	pat++;
     }
     return FALSE;
 }
 
+STATIC char
+first_symbol(const char *pat, const char *patend) {
+    while (pat < patend) {
+	if (pat[0] != '#') return pat[0];
+	pat++;
+	pat = memchr(pat, '\n', patend-pat);
+	if (!pat) return 0;
+	pat++;
+    }
+    return 0;
+}
+
 /*
 =for apidoc unpack_str
 
@@ -707,7 +718,7 @@
 	flags |= FLAG_UNPACK_DO_UTF8;
     }
 
-    if (pat < patend && *pat == 'U') {
+    if (first_symbol(pat, patend) == 'U') {
 	/*
 	  if (!(flags & FLAG_UNPACK_DO_UTF8)) 
 	      Perl_croak(aTHX_ "U0 mode on a byte string");
@@ -747,7 +758,7 @@
 	flags |= FLAG_UNPACK_DO_UTF8;
     }
 
-    if (pat < patend && *pat == 'U') {
+    if (first_symbol(pat, patend) == 'U') {
 	/*
 	  if (!(flags & FLAG_UNPACK_DO_UTF8)) 
 	      Perl_croak(aTHX_ "U0 mode on a byte string");
@@ -774,7 +785,7 @@
        warnings), but these two mean we make no progress in the string and
        might enter an infinite loop */
     if (retlen == (STRLEN) -1 || retlen == 0)
-	Perl_croak(aTHX_ "malformed UTF-8 string in unpack");
+	Perl_croak(aTHX_ "Malformed UTF-8 string in unpack");
     if (val >= 0x100) Perl_croak(aTHX_ "'%c' applied to character value %"UVf,
 				 (int) datumtype, val);
     *s += retlen;
@@ -782,37 +793,58 @@
 }
 
 STATIC bool
-next_uni_uu(pTHX_ char **s, const char *end, I32 *out)
+next_uni_bytes(pTHX_ char **s, const char *end, char *buf, int buf_len)
 {
     UV val;
     STRLEN retlen;
     char *from = *s;
-    val = UNI_TO_NATIVE(utf8n_to_uvuni(*s, end-*s, &retlen, UTF8_CHECK_ONLY));
-    if (val >= 0x100 || !ISUUCHAR(val) || 
-	retlen == (STRLEN) -1 || retlen == 0) {
-	*out = 0;
-	return FALSE;
+    int bad = 0;
+    U32 flags = ckWARN(WARN_UTF8) ? 
+	UTF8_CHECK_ONLY : (UTF8_CHECK_ONLY | UTF8_ALLOW_ANY);
+    for (;buf_len > 0; buf_len--) {
+	if (from >= end) return FALSE;
+	val = UNI_TO_NATIVE(utf8n_to_uvuni(from, end-from, &retlen, flags));
+	if (retlen == (STRLEN) -1 || retlen == 0) {
+	    from += UTF8SKIP(from);
+	    bad |= 1;
+	} else from += retlen;
+	if (val >= 0x100) {
+	    bad |= 2;
+	    val &= 0xff;
+	}
+	*(unsigned char *)buf++ = val;
+    }
+    /* We have enough characters for the buffer. Did we have problems ? */
+    if (bad) {
+	if (bad & 1) {
+	    /* Rewalk the string fragment while warning */
+	    char *ptr;
+	    flags = ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY;
+	    for (ptr = *s; ptr < from; ptr += UTF8SKIP(ptr))
+		utf8n_to_uvuni(ptr, end-ptr, &retlen, flags);
+	    if (from > end) from = end;
+	}
+	if ((bad & 2) && ckWARN(WARN_UNPACK))
+	    Perl_warner(aTHX_ packWARN(WARN_UNPACK),
+			"Character(s) wrapped in unpack");
     }
-    *out = PL_uudmap[val] & 077;
     *s = from;
     return TRUE;
 }
 
 STATIC bool
-next_uni_bytes(pTHX_ char **s, const char *end, char *buf, int buf_len)
+next_uni_uu(pTHX_ char **s, const char *end, I32 *out)
 {
     UV val;
     STRLEN retlen;
     char *from = *s;
-    while (buf_len > 0) {
-	if (from >= end) return 1;
-	val = UNI_TO_NATIVE(utf8n_to_uvuni(from, end-from, &retlen,
-					   UTF8_CHECK_ONLY));
-	if (val >= 0x100 || retlen == (STRLEN) -1 || retlen == 0) return FALSE;
-	from += retlen;
-	*(unsigned char *)buf++ = val;
-	buf_len--;
+    val = UNI_TO_NATIVE(utf8n_to_uvuni(*s, end-*s, &retlen, UTF8_CHECK_ONLY));
+    if (val >= 0x100 || !ISUUCHAR(val) || 
+	retlen == (STRLEN) -1 || retlen == 0) {
+	*out = 0;
+	return FALSE;
     }
+    *out = PL_uudmap[val] & 077;
     *s = from;
     return TRUE;
 }
@@ -884,6 +916,8 @@
 	      symptr->level++;
 	      PUTBACK;
 	      while (len--) {
+		  if (utf8) symptr->flags |=  FLAG_UNPACK_PARSE_UTF8;
+		  else      symptr->flags &= ~FLAG_UNPACK_PARSE_UTF8;
 		  symptr->patptr = savsym.grpbeg;
 		  unpack_rec(symptr, s, strbeg, strend, &s);
 		  if (s == strend && savsym.howlen == e_star)
@@ -952,7 +986,7 @@
 		I32 l = 0;
 		for (hop = strbeg; hop < s; hop += UTF8SKIP(hop)) l++;
 		if (s != hop)
-		    Perl_croak(aTHX_ "malformed UTF-8 string in unpack");
+		    Perl_croak(aTHX_ "Malformed UTF-8 string in unpack");
 		ai32 = l % len;
 	    } else ai32 = (s - strbeg) % len;
 	    if (ai32 == 0) break;
@@ -985,12 +1019,12 @@
 		for (l=len, hop=s; l>0; l--, hop += UTF8SKIP(hop)) {
 		    if (hop >= strend) {
 			if (hop > strend)
-			    Perl_croak(aTHX_ "malformed UTF-8 string in unpack");
+			    Perl_croak(aTHX_ "Malformed UTF-8 string in unpack");
 			break;
 		    }
 		}
 		if (hop > strend)
-		    Perl_croak(aTHX_ "malformed UTF-8 string in unpack");
+		    Perl_croak(aTHX_ "Malformed UTF-8 string in unpack");
 		len = hop - s;
 	    } else if (len > strend - s)
 		len = strend - s;
@@ -1168,16 +1202,9 @@
 	    break;
 	  case 'C':
 	    if (len == 0) {
-		if (literal) {
+		if (literal)
 		    /* Switch to "natural" mode */
-		    if (symptr->flags & FLAG_UNPACK_DO_UTF8) {
-			symptr->flags |= FLAG_UNPACK_PARSE_UTF8;
-			utf8 = 1;
-		    } else {
-			symptr->flags &= ~FLAG_UNPACK_PARSE_UTF8;
-			utf8 = 0;
-		    }
-		}
+		    utf8 = (symptr->flags & FLAG_UNPACK_DO_UTF8) ? 1 : 0;
 		break;
 	    }
 	  uchar_checksum:
@@ -1215,14 +1242,11 @@
 	    if (len == 0) {
 		if (literal) {
 		    /* Switch to "bytes in utf-8" mode */
-		    if (symptr->flags & FLAG_UNPACK_DO_UTF8) {
-			utf8 = 0;
-			symptr->flags &= ~FLAG_UNPACK_PARSE_UTF8;
-		    } else {
+		    if (symptr->flags & FLAG_UNPACK_DO_UTF8) utf8 = 0;
+		    else
 			/* Should be impossible due to the need_utf8() test */
 			Perl_croak(aTHX_ "U0 mode on a byte string");
 		    }
-		}
 		break;
 	    }
 	    if (len > strend - s) len = strend - s;
@@ -1235,7 +1259,7 @@
 		STRLEN retlen;
 		UV auv = utf8n_to_uvuni((U8*)s, strend - s, &retlen, ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANYUV);
 		if (retlen == (STRLEN) -1 || retlen == 0)
-		    Perl_croak(aTHX_ "malformed UTF-8 string in unpack");
+		    Perl_croak(aTHX_ "Malformed UTF-8 string in unpack");
 		s += retlen;
 		if (!checksum)
 		    PUSHs(sv_2mortal(newSVuv((UV) auv)));

@p5pRT
Copy link
Author

p5pRT commented Jan 22, 2005

@p5pRT
Copy link
Author

p5pRT commented Apr 27, 2006

From @nwc10

On Tue, Jan 11, 2005 at 06​:08​:26PM +0100, Marc A. Lehmann wrote​:

On Mon, Jan 10, 2005 at 01​:42​:14PM -0000, Nicholas Clark via RT <perlbug-followup@​perl.org> wrote​:

I wonder if it's viable to make the integer conversion operators (and the
floating point operators) downgrade just enough characters to be useful?

That would still break "b" and would have questionable semantics on "a"
for example.

I agree, but I suspect that there are also programs out there which are
buggily relying on using a, b and h to inspect the internal representation
of scalars.

I frankly cannot see any reason why >255 characters can make any sense as
argument to unpack, and if the testsuite fails, I guess that is then a bug in
the testsuite.

I'm never so confident about that. The testsuite may be buggy, but I tend to
assume (rightly or wrongly) that it's representative of Perl code out there.
So if a change to the internals can be made without the testsuite catching
fire, then I assume that it's unlikely to cause problems with code out on
CPAN.

On Fri, Jan 14, 2005 at 05​:18​:32PM +0100, Marc A. Lehmann wrote​:

Well, then how do you propose to fix the situaiton? The current behaviour
is completely erratic, as the same scalar is interpretetd differently by
unpack, depending on the perl version and it's usage history.

The question, if this is not the right fix, is what the semantics of
unicode strings as arguments to unpack are?

I think Ton's semantics as incorporated into 5.9.x are right (or at least a
heck of a lot less wrong than current) but they change far more than I'm
comfortable with for maint.

In any case, if the old behaviour is to stay it simply needs to be defined
behaviour. If there is no way to make it behave deterministically on the perl
level, it isn't defined behaviour in my eyes.

This is just a time bomb (and it exploded in my code, and might do so in
other code).

I agree.

There is already this fleeting reference to the current behaviour in
perlunicode.pod​:

  =item *
 
  Most operators that deal with positions or lengths in a string will
  automatically switch to using character positions, including
  C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
  C<sprintf()>, C<write()>, and C<length()>. Operators that
  specifically do not switch include C<vec()>, C<pack()>, and
  C<unpack()>. Operators that really don't care include
  operators that treats strings as a bucket of bits such as C<sort()>,
  and operators dealing with filenames.

I've merged Ton's code into maint, disabled all the new features, and kept
the current behaviour for all the string, character, bit and hex operators.
However, I am comfortable with changing all the numeric and pointer formatting
operators to deal with converting down from UTF-8, so have enabled that.
I doubt that any sane program is relying on using 'n', 's', 'i' etc to
generate a number based on the UTF-8 representation of some string.

I don't like this compromise, but it seems less likely to trigger new bugs
than either keeping the old behaviour, or entirely adopting Ton's changes.

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Apr 28, 2006

From perl@nevcal.com

On approximately 4/27/2006 8​:04 AM, came the following characters from
the keyboard of Nicholas Clark​:

I think Ton's semantics as incorporated into 5.9.x are right (or at least a
heck of a lot less wrong than current) but they change far more than I'm
comfortable with for maint.

I don't like this compromise, but it seems less likely to trigger new bugs
than either keeping the old behaviour, or entirely adopting Ton's changes.

You produced some amazing magic with loadable Regexp work, to get Dave's
Regexp fixes available to maint, without actually changing maint, so
that people using 5.8.x could useh the fixed Regexp if they choose. I
haven't looked at the detalis of how you did that, and I guess it might
still have some rough edges from the last I've read on the topic....

But it makes me wonder if Ton's un/pack changes could be turned into a
loadable module for maint?

use feature newpack;

(I forget the syntax you used for the new regexp stuff)

--
Glenn -- http​://nevcal.com/

A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

@p5pRT
Copy link
Author

p5pRT commented Apr 28, 2006

From schmorp@schmorp.de

On Thu, Apr 27, 2006 at 10​:05​:19PM -0700, Glenn Linderman via RT <perlbug-followup@​perl.org> wrote​:

I think Ton's semantics as incorporated into 5.9.x are right (or at least a
heck of a lot less wrong than current) but they change far more than I'm
comfortable with for maint.

I don't like this compromise, but it seems less likely to trigger new bugs
than either keeping the old behaviour, or entirely adopting Ton's changes.

You produced some amazing magic with loadable Regexp work, to get Dave's
Regexp fixes available to maint, without actually changing maint, so
that people using 5.8.x could useh the fixed Regexp if they choose. I
haven't looked at the detalis of how you did that, and I guess it might
still have some rough edges from the last I've read on the topic....

But it makes me wonder if Ton's un/pack changes could be turned into a
loadable module for maint?

use feature newpack;

(I forget the syntax you used for the new regexp stuff)

It's probably not worth it​: I think its important to fix unpack. Wetehr
documented or not, its buggy, because there is no official way to make it
work, except to rely on implementation-defined behaviour (namely that a
certain string is encoded internally as utf-8 or not, which can and does
change between different versions of perl).

However, wether it gets fixed with 5.9 or 5.8 is a matter of taste, and if
it breaks things (which I can't see why, as it would be rather broken to
rely on random internal behaviour, e.g. in modules), then it should simply
stay out of 5.8. People who want fixed behaviour can simply go to 5.10,
instead of a comparatively major effort of making this runtime-selectable
(and documenting and maintaining that stuff).

Just my thoughts...

--
  The choice of a
  -----==- _GNU_
  ----==-- _ generation Marc Lehmann
  ---==---(_)__ __ ____ __ pcg@​goof.com
  --==---/ / _ \/ // /\ \/ / http​://schmorp.de/
  -=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE

@p5pRT
Copy link
Author

p5pRT commented Mar 26, 2007

From guest@guest.guest.xxxxxxxx

Could this issue be revisited? I get tons of reports from people who
suffer from this bug (sorry, it *is* a bug, even if it is documented as
a bug).

Example​: people transfer a binary string via JSON​::XS. JSON​::XS might
happen to output that binary string UTF-X encoded. people put this into
compress​::zlib which uses unpack, and then fails because their *binary*
string starts with character #255, but unpack "C" for example returns a
totally bogus 195, due to the fatc it gives random results depending on
the internal encoding. This is horrendouesly buggy, and *needs* to be
fixed. This breaks lots of modules that would otherwise just work.

These kind of bugs actively hurt perls reputation for being
unicode-capable. Users have *no* way to make sure their strings stay
downgraded. This is an *internal* implementation detail.

If you really want to keep unpack's broken semantics, then you *need* to
docment all the rules on when strings get upgraded and downgraded, and
*all* users of unpack need to fix their otherwise good code, just
because unpack does not follow the same string model as the rest of perl.

can this bug be fixed? it is still in bleedperl, after so many years.

it is trivial to fix​: just downgrade the argument string, this is
*correct*. Anything else is sheer idiocy, as it gives basically *random*
results and breaks the perl string model completely.

(the same problem is SvPV, which should be aliases to SvPVbyte, as this
is what everybody expects as it is 5.6-compatible. As it is, all modules
that use SvPV depend on internal implementation details).

Users do rightly expect that *the same* string gives the *the same*
results in the same context, regardless of how it was used before.
Forcing users to understand implementation details that can change from
version to version for everyday work (such as decoding a file header in
Compress​::Zlib) is a bug. It needs to be addressed, not documented, the
latter is useless.

@p5pRT
Copy link
Author

p5pRT commented Mar 26, 2007

From guest@guest.guest.xxxxxxxx

The actual exmaple, btw, is​:

  unpack ('CCCCVCC', $$string);

fatc is, programs use unpack to extratc structs regularly. They do not
expect that "n" (or "V") work correctly regarding the unicdoe model
while "C" doesn't, especially as there is no way to extratc a single
byte from a string with unpack, there is no format specifier for that.

The othe rusage, using unpack to peek at internal implementation details
of string encoding, in comparison, does not exist. If it exists, it is
new usage which could just as well use a special modifier for that purpose.

Giving "C" any semantics other than "extract the next byte", compatible
to "n" (extract two bytes and combine them) and other format specifiers
mindlessly breaks lots of existing modules.

Adding another specifier that ignores bytes and peeks at the internal
in-memory format does not break existing code, as this capability simply
did not exist in older perl versions (as 5.6 did not have working
unicode support).

This needs to be fixed ASAP. The more people fall into that trap the
worse it becomes.

@p5pRT
Copy link
Author

p5pRT commented Oct 6, 2011

From @cpansprout

This was fixed in 5.8.9 and 5.10.0. Didn’t you fix this?

@p5pRT
Copy link
Author

p5pRT commented Oct 6, 2011

@cpansprout - Status changed from 'open' to 'resolved'

@p5pRT p5pRT closed this as completed Oct 6, 2011
@p5pRT
Copy link
Author

p5pRT commented Oct 7, 2011

From schmorp@schmorp.de

On Thu, Oct 06, 2011 at 02​:24​:35PM -0700, Father Chrysostomos via RT <perlbug-followup@​perl.org> wrote​:

This was fixed in 5.8.9 and 5.10.0. Didn’t you fix this?

My memory is fuzzy, and this is long ago, so, could well be...

In any case, yes, its indeed fixed - thanks!

--
  The choice of a Deliantra, the free code+content MORPG
  -----==- _GNU_ http​://www.deliantra.net
  ----==-- _ generation
  ---==---(_)__ __ ____ __ Marc Lehmann
  --==---/ / _ \/ // /\ \/ / schmorp@​schmorp.de
  -=====/_/_//_/\_,_/ /_/\_\

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant