utf8::decode doesn't work under -T #7312

Closed
p5pRT opened this issue May 24, 2004 · 15 comments

Comments


p5pRT commented May 24, 2004

Migrated from rt.perl.org#29841 (status was 'resolved')

Searchable as RT29841$


p5pRT commented May 24, 2004

From stas@stason.org

Created by stas@rabbit.stason.org

The bug seems to be in utf8::decode (and probably related functions) when
running under -T. Please observe:

% perl -MDevel::Peek -le '$_ = shift; Dump $_; utf8::decode $_; Dump $_' \
  `perl -le 'binmode STDOUT, ":utf8"; print "\x{0432}\x{0430}"'`

SV = PV(0x804df00) at 0x804dbd4
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x8064570 "\320\262\320\260"\0
  CUR = 4
  LEN = 5
SV = PV(0x804df00) at 0x804dbd4
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x8064570 "\320\262\320\260"\0 [UTF8 "\x{432}\x{430}"]
  CUR = 4
  LEN = 5

And now with -T:

% perl -T -MDevel::Peek -le '$_ = shift; Dump $_; utf8::decode $_; Dump $_' \
  `perl -le 'binmode STDOUT, ":utf8"; print "\x{0432}\x{0430}"'`

SV = PVMG(0x8087c10) at 0x804dbd4
  REFCNT = 1
  FLAGS = (GMG,SMG,pPOK)
  IV = 0
  NV = 0
  PV = 0x8064560 "\320\262\320\260"\0
  CUR = 4
  LEN = 5
  MAGIC = 0x806a618
  MG_VIRTUAL = &PL_vtbl_taint
  MG_TYPE = PERL_MAGIC_taint(t)
  MG_LEN = 1
SV = PVMG(0x8087c10) at 0x804dbd4
  REFCNT = 1
  FLAGS = (GMG,SMG,pPOK)
  IV = 0
  NV = 0
  PV = 0x8064560 "\320\262\320\260"\0
  CUR = 4
  LEN = 5
  MAGIC = 0x806a618
  MG_VIRTUAL = &PL_vtbl_taint
  MG_TYPE = PERL_MAGIC_taint(t)
  MG_LEN = 1

As you can see the first invocation w/o -T correctly decodes input,
the second does not.

Perl Info

Flags:
     category=utilities
     severity=critical

Site configuration information for perl v5.8.4:

Configured by stas at Sat May  8 23:36:56 PDT 2004.

Summary of my perl5 (revision 5 version 8 subversion 4) configuration:
   Platform:
     osname=linux, osvers=2.6.3-9mdk, archname=i686-linux-thread-multi
     uname='linux rabbit.stason.org 2.6.3-9mdk #1 fri apr 23 16:41:09 edt 2004 
i686 unknown unknown gnulinux '
     config_args='-des -Dprefix=/home/stas/perl/5.8.4-ithread -Dusethreads 
-Doptimize=-g -Duseshrplib -Dusedevel -Accflags=-DDEBUG_LEAKING_SCALARS'
     hint=recommended, useposix=true, d_sigaction=define
     usethreads=define use5005threads=undef useithreads=define 
usemultiplicity=define
     useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
     use64bitint=undef use64bitall=undef uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS 
-DDEBUG_LEAKING_SCALARS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include 
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
     optimize='-g',
     cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS 
-DDEBUG_LEAKING_SCALARS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include 
-I/usr/include/gdbm'
     ccversion='', gccversion='3.3.2 (Mandrake Linux 10.0 3.3.2-6mdk)', 
gccosandvers=''
     intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
     ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', 
lseeksize=8
     alignbytes=4, prototype=define
   Linker and Libraries:
     ld='cc', ldflags =' -L/usr/local/lib'
     libpth=/usr/local/lib /lib /usr/lib
     libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
     perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
     libc=/lib/libc-2.3.3.so, so=so, useshrplib=true, libperl=libperl.so
     gnulibc_version='2.3.3'
   Dynamic Linking:
     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E 
-Wl,-rpath,/home/stas/perl/5.8.4-ithread/lib/5.8.4/i686-linux-thread-multi/CORE'
     cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:



@INC for perl v5.8.4:
     /home/stas/perl/5.8.4-ithread/lib/5.8.4/i686-linux-thread-multi
     /home/stas/perl/5.8.4-ithread/lib/5.8.4
     /home/stas/perl/5.8.4-ithread/lib/site_perl/5.8.4/i686-linux-thread-multi
     /home/stas/perl/5.8.4-ithread/lib/site_perl/5.8.4
     /home/stas/perl/5.8.4-ithread/lib/site_perl
     .


Environment for perl v5.8.4:
     HOME=/home/stas
     LANG=en_GB
     LANGUAGE=en_GB:en
     LC_ADDRESS=en_CA
     LC_COLLATE=en_GB
     LC_CTYPE=en_GB
     LC_IDENTIFICATION=en_CA
     LC_MEASUREMENT=en_CA
     LC_MESSAGES=en_GB
     LC_MONETARY=en_CA
     LC_NAME=en_CA
     LC_NUMERIC=en_CA
     LC_PAPER=en_CA
     LC_TELEPHONE=en_CA
     LC_TIME=en_GB
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
 
PATH=/usr//bin:/bin:/usr/bin:.:/usr/local/bin:/usr/X11R6/bin/:/usr/games:/home/stas/bin:/home/stas/bin:/usr/local/bin:/usr/X11R6/bin:/usr/java/j2re1.4.0/bin/
     PERLDOC_PAGER=less -R
     PERL_BADLANG (unset)
     SHELL=/bin/tcsh


-- 
__________________________________________________________________
Stas Bekman            JAm_pH ------> Just Another mod_perl Hacker
http://stason.org/     mod_perl Guide ---> http://perl.apache.org
mailto:stas@stason.org http://use.perl.org http://apacheweek.com
http://modperlbook.org http://apache.org   http://ticketmaster.com


p5pRT commented May 30, 2004

From @iabyn

On Mon, May 24, 2004 at 11:30:20PM -0000, Stas Bekman wrote:

The bug seems to be in utf8::decode (and probably related functions) when
running under -T. Please observe:

% perl -MDevel::Peek -le '$_ = shift; Dump $_; utf8::decode $_; Dump $_' \
`perl -le 'binmode STDOUT, ":utf8"; print "\x{0432}\x{0430}"'`

SV = PV(0x804df00) at 0x804dbd4
REFCNT = 1
FLAGS = (POK,pPOK)

(snip)

And now with -T:

% perl -T -MDevel::Peek -le '$_ = shift; Dump $_; utf8::decode $_; Dump $_' \
`perl -le 'binmode STDOUT, ":utf8"; print "\x{0432}\x{0430}"'`

SV = PVMG(0x8087c10) at 0x804dbd4
REFCNT = 1
FLAGS = (GMG,SMG,pPOK)

The immediate cause is the lack of a POK flag in the tainted version
of the original string. Perl_sv_utf8_decode() just skips the decode
unless that flag is set.

Whether it's right for only the pPOK flag to be on the tainted string,
I don't know.
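
The public (POK) and private (pPOK) string flags discussed above can also be inspected from Perl code with the core B module, without Devel::Peek. The following is only a minimal sketch: string_flags is a hypothetical helper name, not something from this thread, and what it reports for a tainted or otherwise magical scalar depends on the perl version in use.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper (not part of the thread): report whether the public
# (POK) and private (pPOK) string flags are set on the referenced scalar.
sub string_flags {
    my ($ref) = @_;
    my $flags = B::svref_2object($ref)->FLAGS;
    return {
        POK  => ($flags & B::SVf_POK()) ? 1 : 0,
        pPOK => ($flags & B::SVp_POK()) ? 1 : 0,
    };
}

# An ordinary, non-magical string has both flags set.
my $plain = "abc";
my $f = string_flags(\$plain);
print "POK=$f->{POK} pPOK=$f->{pPOK}\n";
```

Run under -T with a value taken from @ARGV or %ENV, the same helper would show whether the public flag is missing on that particular perl.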

--
There's a traditional definition of a shyster: a lawyer who, when the law
is against him, pounds on the facts; when the facts are against him,
pounds on the law; and when both the facts and the law are against him,
pounds on the table.
  -- Eben Moglen referring to SCO


p5pRT commented May 30, 2004

The RT System itself - Status changed from 'new' to 'open'


p5pRT commented May 31, 2004

From stas@stason.org

Dave Mitchell wrote:

On Mon, May 24, 2004 at 11:30:20PM -0000, Stas Bekman wrote:

The bug seems to be in utf8::decode (and probably related functions) when
running under -T. Please observe:

% perl -MDevel::Peek -le '$_ = shift; Dump $_; utf8::decode $_; Dump $_' \
`perl -le 'binmode STDOUT, ":utf8"; print "\x{0432}\x{0430}"'`

SV = PV(0x804df00) at 0x804dbd4
REFCNT = 1
FLAGS = (POK,pPOK)

(snip)

And now with -T:

% perl -T -MDevel::Peek -le '$_ = shift; Dump $_; utf8::decode $_; Dump $_' \
`perl -le 'binmode STDOUT, ":utf8"; print "\x{0432}\x{0430}"'`

SV = PVMG(0x8087c10) at 0x804dbd4
REFCNT = 1
FLAGS = (GMG,SMG,pPOK)

The immediate cause is the lack of a POK flag in the tainted version
of the original string. Perl_sv_utf8_decode() just skips the decode
unless that flag is set.

Whether it's right for only the pPOK flag to be on the tainted string,
I don't know.

Thanks Dave,

So it's unclear whose fault it is, the code that populates sv with pv under -T
or Perl_sv_utf8_decode(), which may need to check pPOK instead of POK?

Also, is it possible to feed UTF-8 characters directly from the shell, i.e. to
replace `perl -le 'binmode STDOUT, ":utf8"; print "\x{0432}\x{0430}"'`
in my example with some escape characters?


p5pRT commented May 31, 2004

From BQW10602@nifty.com

On Sun, 30 May 2004 21:43:38 -0700
Stas Bekman <stas@stason.org> wrote:

So it's unclear whose fault it is, the code that populates sv with pv under -T
or Perl_sv_utf8_decode(), which may need to check pPOK instead of POK?

You can use <$_ = Encode::decode('utf8', $_)>
instead of <utf8::decode $_>.
Encode::decode('utf8') is based on Encode::utf8::decode_xs() and
it does not use the "broken" sv_utf8_decode().
I think it should work well even in taint mode.
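
As a rough illustration of this workaround, the sketch below decodes the same four octets used in the report with Encode::decode and checks the result. The octets are hard-coded here instead of read from @ARGV, so the sketch itself does not exercise taint mode; it only shows the call being suggested.

```perl
use strict;
use warnings;
use Encode ();

# UTF-8 octets for U+0432 U+0430, the two characters from the report.
my $octets = "\xD0\xB2\xD0\xB0";

# Encode::decode returns a new character string; $octets is left as it was.
my $chars = Encode::decode('utf8', $octets);

printf "octets: %d, characters: %d, UTF8 flag: %s\n",
    length($octets), length($chars),
    utf8::is_utf8($chars) ? "on" : "off";
```

Unlike utf8::decode, which changes its argument in place, this form assigns the decoded result to a separate variable.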

Regards,
SADAHIRO Tomoyuki


p5pRT commented May 31, 2004

From stas@stason.org

SADAHIRO Tomoyuki wrote:

On Sun, 30 May 2004 21:43:38 -0700
Stas Bekman <stas@stason.org> wrote:

So it's unclear whose fault it is, the code that populates sv with pv under -T
or Perl_sv_utf8_decode(), which may need to check pPOK instead of POK?

You can use <$_ = Encode::decode('utf8', $_)>
instead of <utf8::decode $_>.
Encode::decode('utf8') is based on Encode::utf8::decode_xs() and
it does not use "broken" sv_utf8_decode().
I think it should work well even in taint mode.

Thank you, Sadahiro. It indeed fixes my simple test case (I haven't tested
that solution in the application yet).

perl-5.8.0 -T -MEncode -MDevel::Peek -le '$_ = shift; Dump $_; $_ = Encode::decode('utf8', $_); Dump $_' \
  `perl -le 'binmode STDOUT, ":utf8"; print "\x{0432}\x{0430}"'`

Shouldn't sv_utf8_decode() (and potentially other utf8 functions) then be
replaced with Encode::utf8::decode_xs() (and equivalents from Encode)?


p5pRT commented Jun 1, 2004

From nick@ing-simmons.net

Stas Bekman <stas@stason.org> writes:

SADAHIRO Tomoyuki wrote:

On Sun, 30 May 2004 21:43:38 -0700
Stas Bekman <stas@stason.org> wrote:

So it's unclear whose fault it is, the code that populates sv with pv under -T
or Perl_sv_utf8_decode(), which may need to check pPOK instead of POK?

You can use <$_ = Encode::decode('utf8', $_)>
instead of <utf8::decode $_>.
Encode::decode('utf8') is based on Encode::utf8::decode_xs() and
it does not use "broken" sv_utf8_decode().
I think it should work well even in taint mode.

Thank you, Sadahiro. It indeed fixes my simple test case (I haven't tested
that solution in the application yet).

perl-5.8.0 -T -MEncode -MDevel::Peek -le '$_ = shift; Dump $_; $_ = Encode::decode('utf8', $_); Dump $_' \
  `perl -le 'binmode STDOUT, ":utf8"; print "\x{0432}\x{0430}"'`

Shouldn't sv_utf8_decode() (and potentially other utf8 functions) then be
replaced with Encode::utf8::decode_xs() (and equivalents from Encode)?

sv_utf8_decode is an internals thing that is exported to help out
XS code at perl5.6 or so. We have discussed the pPOK vs POK thing before
and I forget the final outcome.

Normally you have to do SvGETMAGIC() to get POK flag on, and I think
internals routine assumes caller does that.

We could change that, but that won't help older perls.


p5pRT commented Jun 1, 2004

From BQW10602@nifty.com

On Tue, 01 Jun 2004 08:50:33 +0100
Nick Ing-Simmons <nick@ing-simmons.net> wrote:

Shouldn't sv_utf8_decode() (and potentially other utf8 functions) then be
replaced with Encode::utf8::decode_xs() (and equivalents from Encode)?

sv_utf8_decode is an internals thing that is exported to help out
XS code at perl5.6 or so. We have discussed the pPOK vs POK thing before
and I forget the final outcome.

I think Change 22652 fixes sv_pvutf8n_force (and SvPVutf8_force)
  and Change 22842 fixes sv_utf8_upgrade (and SvPVutf8).

Normally you have to do SvGETMAGIC() to get POK flag on, and I think
internals routine assumes caller does that.

We could change that, but that won't help older perls.

If I understand correctly, all that the current sv_utf8_decode does is
check whether the PV is valid UTF-8 and, if it is, turn the UTF8 flag on.

My concern is the SvPOK(sv) test in the first conditional
of Perl_sv_utf8_decode. It may have been intentional
that Perl_sv_utf8_decode does nothing for a scalar that is !SvPOK.

If that SvPOK(sv) is replaced by SvPOKp(sv),
an SV with only pPOK set will be decoded as UTF-8, not just an SV with POK.
However, I'm afraid that simply turning the UTF8 flag on may cause
some harm that I'm not aware of.
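
The behaviour described here (validate, then flip the flag, leaving the stored octets alone) can be observed from Perl. The sketch below is an illustration on an untainted string only. It assumes the documented behaviour that the UTF8 flag is turned on only when the string contains a multi-byte sequence.

```perl
use strict;
use warnings;

my $original = "\xD0\xB2\xD0\xB0";   # UTF-8 octets for U+0432 U+0430
my $s = $original;

utf8::decode($s);                    # in place: 4 octets become 2 characters

# Re-encoding a copy gives back the same octets, which suggests that only
# the interpretation of the buffer (the UTF8 flag) changed.
my $copy = $s;
utf8::encode($copy);

printf "characters: %d, flag: %s, round-trips: %s\n",
    length($s),
    utf8::is_utf8($s)  ? "on"  : "off",
    $copy eq $original ? "yes" : "no";

# An ASCII-only string is valid UTF-8 too, but its flag stays off.
my $ascii = "perl";
utf8::decode($ascii);
printf "ascii flag: %s\n", utf8::is_utf8($ascii) ? "on" : "off";
```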

Regards,
SADAHIRO Tomoyuki


p5pRT commented Jun 1, 2004

From stas@stason.org

Nick Ing-Simmons wrote:

Stas Bekman <stas@stason.org> writes:

SADAHIRO Tomoyuki wrote:

On Sun, 30 May 2004 21:43:38 -0700
Stas Bekman <stas@stason.org> wrote:

So it's unclear whose fault it is, the code that populates sv with pv under -T
or Perl_sv_utf8_decode(), which may need to check pPOK instead of POK?

You can use <$_ = Encode::decode('utf8', $_)>
instead of <utf8::decode $_>.
Encode::decode('utf8') is based on Encode::utf8::decode_xs() and
it does not use "broken" sv_utf8_decode().
I think it should work well even in taint mode.

Thank you, Sadahiro. It indeed fixes my simple test case (I haven't tested
that solution in the application yet).

perl-5.8.0 -T -MEncode -MDevel::Peek -le '$_ = shift; Dump $_; $_ = Encode::decode('utf8', $_); Dump $_' \
  `perl -le 'binmode STDOUT, ":utf8"; print "\x{0432}\x{0430}"'`

Now tested the application and Encode works a treat.

Shouldn't sv_utf8_decode() (and potentially other utf8 functions) then be
replaced with Encode::utf8::decode_xs() (and equivalents from Encode)?

sv_utf8_decode is an internals thing that is exported to help out
XS code at perl5.6 or so. We have discussed the pPOK vs POK thing before
and I forget the final outcome.

If it is exported for XS code, it shouldn't end up in the perl space, no? XS
code doesn't need an XSUB to use it.

If it remains in the perl space, please document that utf8::xxx is not to be
used in perl code, and that one should use Encode::xxx instead.

Normally you have to do SvGETMAGIC() to get POK flag on, and I think
internals routine assumes caller does that.

In this case I was trying to use it in perl code, so obviously I couldn't do
anything like that.

We could change that, but that won't help older perls.


p5pRT commented Jun 3, 2004

@rgs - Status changed from 'open' to 'resolved'

@p5pRT p5pRT closed this as completed Jun 3, 2004

p5pRT commented Jun 3, 2004

From stas@stason.org

The issue hasn't been resolved. utf8::decode still doesn't work under -T.


p5pRT commented Jun 5, 2004

From stas@stason.org

The issue hasn't been resolved. utf8::decode still doesn't work under -T.


p5pRT commented Jun 5, 2004

From BQW10602@nifty.com

Hello. Here is a patch:
(1) code fix: s/SvPOK(sv)/SvPOKp(sv)/ for sv_utf8_(decode|downgrade).
(2) doc patch for perlapi.pod (through sv.c) and utf8.pm
(3) adding tests in t/op/utftaint.t.
  (note: if -T is not used, all the tests in utftaint.t should succeed
  even with perl 5.8.4.)

Since sv_utf8_decode and sv_utf8_downgrade are marked with "may change"
in embed.fnc and are described so in perlapi.pod,
I think that, at present, utf8::decode and utf8::downgrade should
be regarded as experimental, as well as the API functions wrapped by them.

Please note that sv_utf8_downgrade and sv_utf8_decode may fail
while sv_utf8_upgrade and sv_utf8_encode never fail.
sv_utf8_decode will fail if the string is not valid UTF-8,
and sv_utf8_downgrade will fail if the string contains any character
outside the repertoire of Latin-1.
In my opinion, using utf8::decode or utf8::downgrade without checking
the return value is not recommended.
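
A short sketch of what checking those return values might look like in Perl code. The sample strings are made up for illustration. Passing a true FAIL_OK argument makes utf8::downgrade return false instead of dying.

```perl
use strict;
use warnings;

my $good = "\xD0\xB2";   # valid UTF-8 for U+0432
my $bad  = "\xD0";       # truncated sequence, not valid UTF-8

# utf8::decode returns true on success and false if the octets are invalid.
my $ok_good = utf8::decode($good) ? 1 : 0;
my $ok_bad  = utf8::decode($bad)  ? 1 : 0;

# utf8::downgrade fails for characters above 0xFF; the second argument
# (FAIL_OK) requests a false return value instead of an exception.
my $wide    = "\x{100}";
my $ok_down = utf8::downgrade($wide, 1) ? 1 : 0;

print "decode(good)=$ok_good decode(bad)=$ok_bad downgrade(wide)=$ok_down\n";
warn "input was not valid UTF-8\n" unless $ok_bad;
```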

Inline Patch
diff -urN perl~/lib/utf8.pm perl/lib/utf8.pm
--- perl~/lib/utf8.pm	Sun Apr 04 22:32:36 2004
+++ perl/lib/utf8.pm	Sat Jun 05 23:33:34 2004
@@ -113,45 +113,59 @@
 
 =item * $num_octets = utf8::upgrade($string)
 
-Converts (in-place) internal representation of string to Perl's
-internal I<UTF-X> form.  Returns the number of octets necessary to
-represent the string as I<UTF-X>.  Can be used to make sure that the
-UTF-8 flag is on, so that C<\w> or C<lc()> work as expected on strings
-containing characters in the range 0x80-0xFF (oon ASCII and
-derivatives).  Note that this should not be used to convert a legacy
-byte encoding to Unicode: use Encode for that.  Affected by the
-encoding pragma.
+Converts in-place the octet sequence in the native encoding
+(Latin-1 or EBCDIC) to the equivalent character sequence in I<UTF-X>.
+I<$string> already encoded as characters does no harm.
+Returns the number of octets necessary to represent the string as I<UTF-X>.
+Can be used to make sure that the UTF-8 flag is on,
+so that C<\w> or C<lc()> work as Unicode on strings
+containing characters in the range 0x80-0xFF (on ASCII and
+derivatives).
 
+B<Note that this function does not handle arbitrary encodings.>
+Therefore I<Encode.pm> is recommended for the general purposes.
+
+Affected by the encoding pragma.
+
 =item * $success = utf8::downgrade($string[, FAIL_OK])
 
-Converts (in-place) internal representation of string to be un-encoded
-bytes.  Returns true on success. On failure dies or, if the value of
-FAIL_OK is true, returns false.  Can be used to make sure that the
-UTF-8 flag is off, e.g. when you want to make sure that the substr()
-or length() function works with the usually faster byte algorithm.
-Note that this should not be used to convert Unicode back to a legacy
-byte encoding: use Encode for that.  B<Not> affected by the encoding
-pragma.
+Converts in-place the character sequence in I<UTF-X>
+to the equivalent octet sequence in the native encoding (Latin-1 or EBCDIC).
+I<$string> already encoded as octets does no harm.
+Returns true on success. On failure dies or, if the value of
+C<FAIL_OK> is true, returns false.
+Can be used to make sure that the UTF-8 flag is off,
+e.g. when you want to make sure that the substr() or length() function
+works with the usually faster byte algorithm.
+
+B<Note that this function does not handle arbitrary encodings.>
+Therefore I<Encode.pm> is recommended for the general purposes.
+
+B<Not> affected by the encoding pragma.
 
+B<NOTE:> this function is experimental and may change
+or be removed without notice.
+
 =item * utf8::encode($string)
 
-Converts in-place the octets of the I<$string> to the octet sequence
-in Perl's I<UTF-X> encoding.  Returns nothing.  B<Note that this does
-not change the "type" of I<$string> to UTF-8>, and that this handles
-only ISO 8859-1 (or EBCDIC) as the source character set. Therefore
-this should not be used to convert a legacy 8-bit encoding to Unicode:
-use Encode::decode() for that.  In the very limited case of wanting to
-handle just ISO 8859-1 (or EBCDIC), you could use utf8::upgrade().
+Converts in-place the character sequence to the corresponding octet sequence
+in I<UTF-X>.  The UTF-8 flag is turned off.  Returns nothing.
+
+B<Note that this function does not handle arbitrary encodings.>
+Therefore I<Encode.pm> is recommended for the general purposes.
 
 =item * utf8::decode($string)
 
-Attempts to convert I<$string> in-place from Perl's I<UTF-X> encoding
-into octets.  Returns nothing.  B<Note that this does not change the
-"type" of <$string> from UTF-8>, and that this handles only ISO 8859-1
-(or EBCDIC) as the destination character set.  Therefore this should
-not be used to convert Unicode back to a legacy 8-bit encoding:
-use Encode::encode() for that.  In the very limited case of wanting
-to handle just ISO 8859-1 (or EBCDIC), you could use utf8::downgrade().
+Attempts to convert in-place the octet sequence in I<UTF-X>
+to the corresponding character sequence.  The UTF-8 flag is turned on
+only if the source string contains multiple-byte I<UTF-X> characters.
+If I<$string> is invalid as I<UTF-X>, returns false; otherwise returns true.
+
+B<Note that this function does not handle arbitrary encodings.>
+Therefore I<Encode.pm> is recommended for the general purposes.
+
+B<NOTE:> this function is experimental and may change
+or be removed without notice.
 
 =item * $flag = utf8::is_utf8(STRING)
 
diff -urN perl~/pod/perlapi.pod perl/pod/perlapi.pod
--- perl~/pod/perlapi.pod	Thu Jun 03 02:08:58 2004
+++ perl/pod/perlapi.pod	Sat Jun 05 22:44:34 2004
@@ -4834,9 +4834,11 @@
 
 =item sv_utf8_decode
 
-Convert the octets in the PV from UTF-8 to chars. Scan for validity and then
-turn off SvUTF8 if needed so that we see characters. Used as a building block
-for decode_utf8 in Encode.xs
+If the PV of the SV is an octet sequence in UTF-8
+and contains a multiple-byte character, the C<SvUTF8> flag is turned on
+so that it looks like a character. If the PV contains only single-byte
+characters, the C<SvUTF8> flag stays being off.
+Scans PV for validity and returns false if the PV is invalid UTF-8.
 
 NOTE: this function is experimental and may change or be
 removed without notice.
@@ -4848,9 +4850,9 @@
 
 =item sv_utf8_downgrade
 
-Attempt to convert the PV of an SV from UTF-8-encoded to byte encoding.
-This may not be possible if the PV contains non-byte encoding characters;
-if this is the case, either returns false or, if C<fail_ok> is not
+Attempts to convert the PV of an SV from characters to bytes.
+If the PV contains a character beyond byte, this conversion will fail;
+in this case, either returns false or, if C<fail_ok> is not
 true, croaks.
 
 This is not as a general purpose Unicode to byte encoding interface:
@@ -4866,9 +4868,8 @@
 
 =item sv_utf8_encode
 
-Convert the PV of an SV to UTF-8-encoded, but then turn off the C<SvUTF8>
-flag so that it looks like octets again. Used as a building block
-for encode_utf8 in Encode.xs
+Converts the PV of an SV to UTF-8, but then turns the C<SvUTF8>
+flag off so that it looks like octets again.
 
 	void	sv_utf8_encode(SV *sv)
 
@@ -4877,7 +4878,7 @@
 
 =item sv_utf8_upgrade
 
-Convert the PV of an SV to its UTF-8-encoded form.
+Converts the PV of an SV to its UTF-8-encoded form.
 Forces the SV to string form if it is not already.
 Always sets the SvUTF8 flag to avoid future validity checks even
 if all the bytes have hibit clear.
@@ -4892,7 +4893,7 @@
 
 =item sv_utf8_upgrade_flags
 
-Convert the PV of an SV to its UTF-8-encoded form.
+Converts the PV of an SV to its UTF-8-encoded form.
 Forces the SV to string form if it is not already.
 Always sets the SvUTF8 flag to avoid future validity checks even
 if all the bytes have hibit clear. If C<flags> has C<SV_GMAGIC> bit set,
diff -urN perl~/sv.c perl/sv.c
--- perl~/sv.c	Tue May 25 01:18:32 2004
+++ perl/sv.c	Sat Jun 05 22:43:56 2004
@@ -3907,7 +3907,7 @@
 /*
 =for apidoc sv_utf8_upgrade
 
-Convert the PV of an SV to its UTF-8-encoded form.
+Converts the PV of an SV to its UTF-8-encoded form.
 Forces the SV to string form if it is not already.
 Always sets the SvUTF8 flag to avoid future validity checks even
 if all the bytes have hibit clear.
@@ -3917,7 +3917,7 @@
 
 =for apidoc sv_utf8_upgrade_flags
 
-Convert the PV of an SV to its UTF-8-encoded form.
+Converts the PV of an SV to its UTF-8-encoded form.
 Forces the SV to string form if it is not already.
 Always sets the SvUTF8 flag to avoid future validity checks even
 if all the bytes have hibit clear. If C<flags> has C<SV_GMAGIC> bit set,
@@ -3986,9 +3986,9 @@
 /*
 =for apidoc sv_utf8_downgrade
 
-Attempt to convert the PV of an SV from UTF-8-encoded to byte encoding.
-This may not be possible if the PV contains non-byte encoding characters;
-if this is the case, either returns false or, if C<fail_ok> is not
+Attempts to convert the PV of an SV from characters to bytes.
+If the PV contains a character beyond byte, this conversion will fail;
+in this case, either returns false or, if C<fail_ok> is not
 true, croaks.
 
 This is not as a general purpose Unicode to byte encoding interface:
@@ -4000,7 +4000,7 @@
 bool
 Perl_sv_utf8_downgrade(pTHX_ register SV* sv, bool fail_ok)
 {
-    if (SvPOK(sv) && SvUTF8(sv)) {
+    if (SvPOKp(sv) && SvUTF8(sv)) {
         if (SvCUR(sv)) {
 	    U8 *s;
 	    STRLEN len;
@@ -4030,9 +4030,8 @@
 /*
 =for apidoc sv_utf8_encode
 
-Convert the PV of an SV to UTF-8-encoded, but then turn off the C<SvUTF8>
-flag so that it looks like octets again. Used as a building block
-for encode_utf8 in Encode.xs
+Converts the PV of an SV to UTF-8, but then turns the C<SvUTF8>
+flag off so that it looks like octets again.
 
 =cut
 */
@@ -4053,9 +4052,11 @@
 /*
 =for apidoc sv_utf8_decode
 
-Convert the octets in the PV from UTF-8 to chars. Scan for validity and then
-turn off SvUTF8 if needed so that we see characters. Used as a building block
-for decode_utf8 in Encode.xs
+If the PV of the SV is an octet sequence in UTF-8
+and contains a multiple-byte character, the C<SvUTF8> flag is turned on
+so that it looks like a character. If the PV contains only single-byte
+characters, the C<SvUTF8> flag stays being off.
+Scans PV for validity and returns false if the PV is invalid UTF-8.
 
 =cut
 */
@@ -4063,7 +4064,7 @@
 bool
 Perl_sv_utf8_decode(pTHX_ register SV *sv)
 {
-    if (SvPOK(sv)) {
+    if (SvPOKp(sv)) {
         U8 *c;
         U8 *e;
 
diff -urN perl~/t/op/utftaint.t perl/t/op/utftaint.t
--- perl~/t/op/utftaint.t	Tue May 25 01:33:40 2004
+++ perl/t/op/utftaint.t	Sat Jun 05 22:55:58 2004
@@ -23,12 +23,17 @@
 use Scalar::Util qw(tainted);
 
 use Test;
-plan tests => 3*10;
+plan tests => 3*10 + 3*8 + 2*16;
 my $cnt = 0;
 
 my $arg = $ENV{PATH}; # a tainted value
 use constant UTF8 => "\x{1234}";
 
+sub is_utf8 {
+    my $s = shift;
+    return 0xB6 != ord pack('a*', chr(0xB6).$s);
+}
+
 for my $ary ([ascii => 'perl'], [latin1 => "\xB6"], [utf8 => "\x{100}"]) {
     my $encode = $ary->[0];
     my $string = $ary->[1];
@@ -40,7 +45,7 @@
 
     my $lconcat = $taint;
        $lconcat .= UTF8;
-    print $lconcat eq $string."\x{1234}"
+    print $lconcat eq $string.UTF8
 	? "ok " : "not ok ", ++$cnt, " # compare: $encode, concat left\n";
 
     print tainted($lconcat) == tainted($arg)
@@ -48,7 +53,7 @@
 
     my $rconcat = UTF8;
        $rconcat .= $taint;
-    print $rconcat eq "\x{1234}".$string
+    print $rconcat eq UTF8.$string
 	? "ok " : "not ok ", ++$cnt, " # compare: $encode, concat right\n";
 
     print tainted($rconcat) == tainted($arg)
@@ -71,3 +76,111 @@
     print tainted($taint) == tainted($arg)
 	? "ok " : "not ok ", ++$cnt, " # tainted: $encode, after test\n";
 }
+
+
+for my $ary ([ascii => 'perl'], [latin1 => "\xB6"], [utf8 => "\x{100}"]) {
+    my $encode = $ary->[0];
+
+    my $utf8 = pack('U*') . $ary->[1];
+    my $byte = pack('C0a*', $utf8);
+
+    my $taint = $arg; substr($taint, 0) = $utf8;
+    utf8::encode($taint);
+
+    print $taint eq $byte
+	? "ok " : "not ok ", ++$cnt, " # compare: $encode, encode utf8\n";
+
+    print pack('a*',$taint) eq pack('a*',$byte)
+	? "ok " : "not ok ", ++$cnt, " # bytecmp: $encode, encode utf8\n";
+
+    print !is_utf8($taint)
+	? "ok " : "not ok ", ++$cnt, " # is_utf8: $encode, encode utf8\n";
+
+    print tainted($taint) == tainted($arg)
+	? "ok " : "not ok ", ++$cnt, " # tainted: $encode, encode utf8\n";
+
+    my $taint = $arg; substr($taint, 0) = $byte;
+    utf8::decode($taint);
+
+    print $taint eq $utf8
+	? "ok " : "not ok ", ++$cnt, " # compare: $encode, decode byte\n";
+
+    print pack('a*',$taint) eq pack('a*',$utf8)
+	? "ok " : "not ok ", ++$cnt, " # bytecmp: $encode, decode byte\n";
+
+    print is_utf8($taint) eq ($encode ne 'ascii')
+	? "ok " : "not ok ", ++$cnt, " # is_utf8: $encode, decode byte\n";
+
+    print tainted($taint) == tainted($arg)
+	? "ok " : "not ok ", ++$cnt, " # tainted: $encode, decode byte\n";
+}
+
+
+for my $ary ([ascii => 'perl'], [latin1 => "\xB6"]) {
+    my $encode = $ary->[0];
+
+    my $up   = pack('U*') . $ary->[1];
+    my $down = pack('C0a*', $ary->[1]);
+
+    my $taint = $arg; substr($taint, 0) = $up;
+    utf8::upgrade($taint);
+
+    print $taint eq $up
+	? "ok " : "not ok ", ++$cnt, " # compare: $encode, upgrade up\n";
+
+    print pack('a*',$taint) eq pack('a*',$up)
+	? "ok " : "not ok ", ++$cnt, " # bytecmp: $encode, upgrade up\n";
+
+    print is_utf8($taint)
+	? "ok " : "not ok ", ++$cnt, " # is_utf8: $encode, upgrade up\n";
+
+    print tainted($taint) == tainted($arg)
+	? "ok " : "not ok ", ++$cnt, " # tainted: $encode, upgrade up\n";
+
+    my $taint = $arg; substr($taint, 0) = $down;
+    utf8::upgrade($taint);
+
+    print $taint eq $up
+	? "ok " : "not ok ", ++$cnt, " # compare: $encode, upgrade down\n";
+
+    print pack('a*',$taint) eq pack('a*',$up)
+	? "ok " : "not ok ", ++$cnt, " # bytecmp: $encode, upgrade down\n";
+
+    print is_utf8($taint)
+	? "ok " : "not ok ", ++$cnt, " # is_utf8: $encode, upgrade down\n";
+
+    print tainted($taint) == tainted($arg)
+	? "ok " : "not ok ", ++$cnt, " # tainted: $encode, upgrade down\n";
+
+    my $taint = $arg; substr($taint, 0) = $up;
+    utf8::downgrade($taint);
+
+    print $taint eq $down
+	? "ok " : "not ok ", ++$cnt, " # compare: $encode, downgrade up\n";
+
+    print pack('a*',$taint) eq pack('a*',$down)
+	? "ok " : "not ok ", ++$cnt, " # bytecmp: $encode, downgrade up\n";
+
+    print !is_utf8($taint)
+	? "ok " : "not ok ", ++$cnt, " # is_utf8: $encode, downgrade up\n";
+
+    print tainted($taint) == tainted($arg)
+	? "ok " : "not ok ", ++$cnt, " # tainted: $encode, downgrade up\n";
+
+    my $taint = $arg; substr($taint, 0) = $down;
+    utf8::downgrade($taint);
+
+    print $taint eq $down
+	? "ok " : "not ok ", ++$cnt, " # compare: $encode, downgrade down\n";
+
+    print pack('a*',$taint) eq pack('a*',$down)
+	? "ok " : "not ok ", ++$cnt, " # bytecmp: $encode, downgrade down\n";
+
+    print !is_utf8($taint)
+	? "ok " : "not ok ", ++$cnt, " # is_utf8: $encode, downgrade down\n";
+
+    print tainted($taint) == tainted($arg)
+	? "ok " : "not ok ", ++$cnt, " # tainted: $encode, downgrade down\n";
+}
+
+
###End of patch

Regards,
SADAHIRO Tomoyuki


p5pRT commented Jun 6, 2004

From @rgs

SADAHIRO Tomoyuki wrote:

Hello. Here is a patch:
(1) code fix: s/SvPOK(sv)/SvPOKp(sv)/ for sv_utf8_(decode|downgrade).
(2) doc patch for perlapi.pod (through sv.c) and utf8.pm
(3) adding tests in t/op/utftaint.t.
(note: if -T is not used, all the tests in utftaint.t should succeed
even with perl 5.8.4.)

Thanks, applied as change #22902.

Since sv_utf8_decode and sv_utf8_downgrade are marked with "may change"
in embed.fnc and they are mentioned so in perlapi.pod,
I think that, at present, utf8::decode and utf8::downgrade should
be regarded as experimental, as well as the API functions wrapped by them.

Please note that sv_utf8_downgrade and sv_utf8_decode may fail
while sv_utf8_upgrade and sv_utf8_encode never fail.
sv_utf8_decode will fail if string is illegal as UTF-8
and sv_utf8_downgrade will fail if string contains any character
outside the repertoire of Latin-1.
In my opinion, using utf8::decode and utf8::downgrade without checking
the return value is not recommended.


p5pRT commented Jun 8, 2004

From stas@stason.org

Rafael Garcia-Suarez wrote:

SADAHIRO Tomoyuki wrote:

Hello. Here is a patch:
(1) code fix: s/SvPOK(sv)/SvPOKp(sv)/ for sv_utf8_(decode|downgrade).
(2) doc patch for perlapi.pod (through sv.c) and utf8.pm
(3) adding tests in t/op/utftaint.t.
(note: if -T is not used, all the tests in utftaint.t should succeed
even with perl 5.8.4.)

Thanks, applied as change #22902.

It works for me. Thanks a lot, SADAHIRO!
